Cheapest LLM API in 2026: Live Picks by Workload
The cheapest LLM API depends on your input/output ratio, quality bar, context window, and whether you can use caching or batch jobs. This guide ranks low-cost options from the live AI Pricing Hub database and explains how to choose without underestimating real usage cost.
Quick answer · Pricing data refreshed 2026-03-13 12:45:29
The cheapest LLM API is not one universal model.
For balanced text workloads, start with the lowest combined input plus output price. For code generation, output price matters more because generated code is usually longer than the prompt. For document analysis, input price and context length are the main constraints. For reasoning, a slightly higher per-token price can still be cheaper if it solves the task in fewer retries.
Balanced cost
GTE-Base: Best first check for general chat and support bots.
Chat workloads
GTE-Base: Use when requests are short and response length is predictable.
Code output
Qwen2.5 Coder 7B Instruct: Output-heavy tasks should prioritize low output token cost.
Reasoning
Llama 3 8B Lunaris: Filter for reasoning capability before comparing price.
Live cheapest LLM APIs by blended price
Sorted by input + output price per 1M tokens. Use this list for balanced chat, support, and lightweight content workflows.
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| GTE-Base (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| E5-Base-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| paraphrase-MiniLM-L6-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-MiniLM-L12-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| bge-base-en-v1.5 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| multi-qa-mpnet-base-dot-v1 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-mpnet-base-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-MiniLM-L6-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| Qwen3 Embedding 8B (chat, tool_use, code) | Alibaba Qwen / DeepInfra | $0.01 | Free | $0.01 | 32k | Code, agents, developer tools |
| Qwen3 Embedding 8B (chat, tool_use, code) | Alibaba Qwen / Nebius | $0.01 | Free | $0.01 | 32k | Code, agents, developer tools |
Cheapest LLM API for chat and support
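Choose low total price, predictable latency, and enough context for conversation history. Every entry currently in this table is an embedding model, which returns vectors rather than generated text (hence the Free output price), so confirm that a model can actually generate replies before adopting it for chat.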
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| GTE-Base (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| E5-Base-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| paraphrase-MiniLM-L6-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-MiniLM-L12-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| bge-base-en-v1.5 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| multi-qa-mpnet-base-dot-v1 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-mpnet-base-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| all-MiniLM-L6-v2 (chat, tool_use) | Other / DeepInfra | $0.0050 | Free | $0.0050 | 512 | Chat, classification, content ops |
| Qwen3 Embedding 8B (chat, tool_use, code) | Alibaba Qwen / DeepInfra | $0.01 | Free | $0.01 | 32k | Code, agents, developer tools |
| Qwen3 Embedding 8B (chat, tool_use, code) | Alibaba Qwen / Nebius | $0.01 | Free | $0.01 | 32k | Code, agents, developer tools |
Cheapest LLM API for code generation
Code tasks are output-heavy, so this table prioritizes models with low output price and code capability.
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| Qwen2.5 Coder 7B Instruct (chat, tool_use, code) | Alibaba Qwen / Nebius | $0.03 | $0.09 | $0.12 | 33k | Code, agents, developer tools |
| Qwen2.5 7B Instruct (chat, tool_use, code) | Alibaba Qwen / Phala | $0.04 | $0.10 | $0.14 | 33k | Code, agents, developer tools |
| Qwen3 235B A22B Instruct 2507 (chat, tool_use, code) | Alibaba Qwen / DeepInfra | $0.07 | $0.10 | $0.17 | 262k | Code, agents, developer tools |
| Qwen2.5 Coder 32B Instruct (chat, tool_use, code) | Alibaba Qwen / Chutes | $0.03 | $0.11 | $0.14 | 33k | Code, agents, developer tools |
| Qwen-Turbo (chat, tool_use, code) | Alibaba Qwen / Alibaba | $0.03 | $0.13 | $0.16 | 131k | Code, agents, developer tools |
| Qwen3 8B (chat, reasoning, tool_use, code) | Alibaba Qwen / Novita | $0.04 | $0.14 | $0.17 | 41k | Code, agents, developer tools |
| Qwen2.5 Coder 32B Instruct (chat, tool_use, code) | Alibaba Qwen / Hyperbolic | $0.20 | $0.20 | $0.40 | 33k | Code, agents, developer tools |
| Qwen2.5-VL 7B Instruct (chat, vision, tool_use, code) | Alibaba Qwen / Hyperbolic | $0.20 | $0.20 | $0.40 | 33k | Code, agents, developer tools |
| Qwen2.5 VL 32B Instruct (chat, vision, tool_use, code) | Alibaba Qwen / Chutes | $0.05 | $0.22 | $0.27 | 128k | Code, agents, developer tools |
| Qwen3 14B (chat, reasoning, tool_use, code) | Alibaba Qwen / Chutes | $0.05 | $0.22 | $0.27 | 41k | Code, agents, developer tools |
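Sorting by output price alone is a reasonable default for code, but the order can shift once a real token mix is applied. The sketch below re-ranks three rows from the table above for two workloads. The token counts are illustrative assumptions, and `Price` and `rank_for_workload` are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Price(NamedTuple):
    model: str
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens


# Three rows copied from the code-generation table above.
ROWS = [
    Price("Qwen2.5 Coder 7B Instruct (Nebius)", 0.03, 0.09),
    Price("Qwen3 235B A22B Instruct 2507 (DeepInfra)", 0.07, 0.10),
    Price("Qwen2.5 Coder 32B Instruct (Chutes)", 0.03, 0.11),
]


def rank_for_workload(rows, input_tokens, output_tokens):
    """Sort rows by the cost of one request with the given token mix."""
    def cost(p):
        return (input_tokens * p.input_price
                + output_tokens * p.output_price) / 1_000_000
    return sorted(rows, key=cost)


# Assumed token mixes: a short prompt with long generated code,
# and a large code-review prompt with a short answer.
output_heavy = rank_for_workload(ROWS, input_tokens=200, output_tokens=2_000)
input_heavy = rank_for_workload(ROWS, input_tokens=80_000, output_tokens=500)

print([p.model for p in output_heavy])
print([p.model for p in input_heavy])
```

With these assumed mixes, the second and third places swap between the two workloads, which is why it helps to rank against your own token counts rather than a single price column.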
Cheapest reasoning-capable LLM APIs
Reasoning models can be worth a premium when they reduce retries, tool calls, or manual review.
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| Llama 3 8B Lunaris (chat, tool_use, reasoning) | Other / DeepInfra | $0.04 | $0.05 | $0.09 | 8k | Reasoning and multi-step tasks |
| gpt-oss-20b (chat, reasoning, tool_use) | OpenAI / Chutes | $0.02 | $0.10 | $0.12 | 131k | Reasoning and multi-step tasks |
| R1 Distill Llama 70B (chat, reasoning, tool_use) | DeepSeek / Chutes | $0.03 | $0.11 | $0.14 | 131k | Reasoning and multi-step tasks |
| gpt-oss-20b (chat, reasoning, tool_use) | OpenAI / DeepInfra | $0.03 | $0.14 | $0.17 | 131k | Reasoning and multi-step tasks |
| Qwen3 8B (chat, reasoning, tool_use, code) | Alibaba Qwen / Novita | $0.04 | $0.14 | $0.17 | 41k | Code, agents, developer tools |
| Trinity Mini (chat, reasoning, tool_use) | Other / Clarifai | $0.04 | $0.15 | $0.20 | 131k | Reasoning and multi-step tasks |
| Nemotron Nano 9B V2 (chat, reasoning, tool_use) | NVIDIA / DeepInfra | $0.04 | $0.16 | $0.20 | 131k | Reasoning and multi-step tasks |
| gpt-oss-120b (exacto) (chat, reasoning, tool_use) | OpenAI / DeepInfra | $0.04 | $0.19 | $0.23 | 131k | Reasoning and multi-step tasks |
| Nemotron 3 Nano 30B A3B (chat, reasoning, tool_use) | NVIDIA / DeepInfra | $0.05 | $0.20 | $0.25 | 262k | Reasoning and multi-step tasks |
Budget long-context LLM APIs
For document analysis, compare input price and context length before looking at output price.
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| Granite 4.0 Micro (chat, tool_use) | IBM Granite / Cloudflare | $0.02 | $0.11 | $0.13 | 131k | Long context analysis |
| Gemma 3 4B (chat, vision, tool_use) | Google / Chutes | $0.02 | $0.07 | $0.09 | 131k | Text plus image analysis |
| Mistral Nemo (chat, tool_use) | Mistral / DeepInfra | $0.02 | $0.04 | $0.06 | 131k | Long context analysis |
| Llama Guard 3 8B (chat, tool_use) | Meta / Nebius | $0.02 | $0.06 | $0.08 | 131k | Long context analysis |
| gpt-oss-20b (chat, reasoning, tool_use) | OpenAI / Chutes | $0.02 | $0.10 | $0.12 | 131k | Reasoning and multi-step tasks |
| Gemma 3 12B (chat, vision, tool_use) | Google / Chutes | $0.03 | $0.10 | $0.13 | 131k | Text plus image analysis |
| R1 Distill Llama 70B (chat, reasoning, tool_use) | DeepSeek / Chutes | $0.03 | $0.11 | $0.14 | 131k | Reasoning and multi-step tasks |
| Mistral Small 3.1 24B (chat, vision, tool_use) | Mistral / Chutes | $0.03 | $0.11 | $0.14 | 128k | Text plus image analysis |
| Gemma 3 27B (chat, vision, tool_use) | Google / Chutes | $0.03 | $0.11 | $0.14 | 128k | Text plus image analysis |
| gpt-oss-20b (chat, reasoning, tool_use) | OpenAI / DeepInfra | $0.03 | $0.14 | $0.17 | 131k | Reasoning and multi-step tasks |
Free or free-tier LLM API candidates
Free models are useful for prototypes, tests, and low-risk workflows. Check rate limits before production use.
| Model | Provider | Input | Output | Total | Context | Best fit |
|---|---|---|---|---|---|---|
| Sonoma Sky Alpha (chat, vision, reasoning) | OpenRouter / OpenRouter | Free | Free | Free | 2.0M | Reasoning and multi-step tasks |
| Sonoma Dusk Alpha (chat, vision) | OpenRouter / OpenRouter | Free | Free | Free | 2.0M | Text plus image analysis |
| Gemini 1.5 Pro (chat, vision) | Google / OpenRouter | Free | Free | Free | 2.0M | Text plus image analysis |
| Auto Router (chat, vision, video, audio, image_gen) | OpenRouter / OpenRouter | Free | Free | Free | 2.0M | Text plus image analysis |
How to calculate the real cheapest LLM API
The lowest sticker price can still be expensive if your workload has long outputs, repeated context, retries, or non-token fees. Normalize every provider to the same unit before you decide:
Request cost = input tokens / 1,000,000 × input price + output tokens / 1,000,000 × output price
Then add request fees, image fees, cache discounts, batch discounts, and retry rates when they apply. A support bot with 300 input tokens and 120 output tokens is usually driven by combined price. A coding assistant that emits 2,000 output tokens is driven by output price. A document pipeline that reads 80,000 tokens per request is driven by input price, context length, and caching support.
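The formula above translates directly into a few lines of Python. This is a minimal sketch: the prices are illustrative placeholders rather than a quote from any provider, and the coding-assistant prompt size and document-pipeline summary size are assumptions added to complete the three scenarios.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one request; prices are USD per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)


# Illustrative prices only (USD per 1M tokens).
INPUT_PRICE = 0.03
OUTPUT_PRICE = 0.09

# (input tokens, output tokens) per request.
WORKLOADS = {
    "support bot": (300, 120),
    "coding assistant": (500, 2_000),     # 500-token prompt is an assumption
    "document pipeline": (80_000, 500),   # 500-token summary is an assumption
}

for name, (tokens_in, tokens_out) in WORKLOADS.items():
    total = request_cost(tokens_in, tokens_out, INPUT_PRICE, OUTPUT_PRICE)
    from_output = request_cost(0, tokens_out, INPUT_PRICE, OUTPUT_PRICE)
    share = from_output / total
    print(f"{name}: ${total:.6f} per request, {share:.0%} of it from output")
```

Multiply the per-request figure by monthly request volume, then apply any request fees, cache discounts, batch discounts, and retry rates on top.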
Input-heavy
Summarization, RAG, legal review, and long document analysis. Prioritize low input price, context length, and prompt caching.
Output-heavy
Code generation, report writing, synthetic data, and multi-step agents. Output price usually dominates monthly spend.
Quality-sensitive
Reasoning, production support, and compliance reviews. Count retries and human escalation, not just token price.
Selection rules before you switch providers
Use these checks to avoid choosing a cheap model that increases your total cost later.
- Match capability first. Do not compare a tiny chat model to a reasoning or vision workload if it cannot complete the task.
- Estimate input/output ratio. Chat is often balanced, code is output-heavy, and retrieval workflows are input-heavy.
- Check context length. A cheap 8K model is not cheap if you must chunk every request into ten calls.
- Use caching for repeated prompts. System prompts, policy text, and shared documents can make cached input pricing more important than base input pricing.
- Use batch pricing for offline jobs. Classification, enrichment, extraction, and evaluation pipelines often tolerate delayed execution.
- Measure retries. If a model needs two retries to match another model's first answer, its effective price can triple.
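Three of these checks reduce to simple arithmetic. The sketch below is one way to express them. The function names, the reserved-token allowance, and the cached-input price are hypothetical values chosen for illustration, not figures from any provider.

```python
def calls_needed(document_tokens, context_window, reserved_tokens=1_000):
    """Calls required when a document must be chunked to fit the context.

    reserved_tokens is an assumed allowance for the system prompt and the
    model's reply inside each call.
    """
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("context window is smaller than the reserved allowance")
    return (document_tokens + usable - 1) // usable  # integer ceiling


def effective_cost(cost_per_attempt, avg_retries):
    """Cost per answer once retries are counted as extra attempts."""
    return cost_per_attempt * (1 + avg_retries)


def cached_input_cost(prefix_tokens, fresh_tokens, input_price, cached_price):
    """Input cost in USD when a repeated prefix is billed at a cached rate.

    Prices are USD per 1M tokens.
    """
    return (prefix_tokens * cached_price + fresh_tokens * input_price) / 1_000_000


# A 70,000-token document on an 8K model with 1,000 tokens reserved per call.
print(calls_needed(70_000, 8_000))
# Two retries on top of the first attempt.
print(effective_cost(0.001, avg_retries=2))
# A 2,000-token shared system prompt at an assumed cached price.
print(cached_input_cost(2_000, 300, input_price=0.03, cached_price=0.0075))
```

Under these assumptions the 8K model needs ten calls for the document, and two retries put the per-answer cost at three times the per-attempt cost.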
Practical cheapest LLM API examples
These scenarios show why the same price table leads to different choices.
| Workload | Cost driver | What to optimize | Next step |
|---|---|---|---|
| Customer support chatbot | Many short requests | Total input + output price, latency, safe fallback behavior | Estimate monthly support cost |
| Code assistant | Longer output tokens | Output price, code capability, retry rate | Compare code-capable models |
| Document summarization | Large input context | Input price, context length, prompt caching | Browse text models |
| Offline enrichment pipeline | High volume, flexible timing | Batch discounts and deterministic output limits | Check recent pricing updates |
When the cheapest LLM API is the wrong choice
Do not optimize only for cents per million tokens when the model is part of a paid product workflow. A very cheap model can be the wrong pick when it has weak instruction following, insufficient context, missing tool use, poor structured output, unstable latency, or limited availability in your region. For production, run a small evaluation set before switching traffic.
A practical evaluation should include successful completion rate, average retries, output length, refusal rate, latency, and human review cost. If the cheapest candidate has a 15% lower completion rate, a slightly more expensive model may still reduce the total cost of the workflow.
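One way to fold completion rate and human review into a single number is an expected cost per task. The sketch below assumes every failed task is escalated to a human at a fixed cost. The token costs, completion rates, and escalation cost are hypothetical, and `expected_cost_per_task` is a name introduced for this example.

```python
def expected_cost_per_task(token_cost, completion_rate, escalation_cost):
    """Expected USD per task when each failed task goes to human review."""
    if not 0.0 <= completion_rate <= 1.0:
        raise ValueError("completion_rate must be between 0 and 1")
    return token_cost + (1.0 - completion_rate) * escalation_cost


# Hypothetical evaluation results for two candidates.
cheap = expected_cost_per_task(
    token_cost=0.0002, completion_rate=0.80, escalation_cost=0.05
)
stronger = expected_cost_per_task(
    token_cost=0.0006, completion_rate=0.95, escalation_cost=0.05
)

print(f"cheap model:    ${cheap:.4f} per task")
print(f"stronger model: ${stronger:.4f} per task")
```

In this example the stronger model costs three times as much per token yet comes out cheaper per task, because the escalation cost dominates. Replace the placeholders with numbers from your own evaluation set.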
Authoritative pricing references
Verify mission-critical decisions against official provider pricing pages: OpenAI, Anthropic, Google Gemini, DeepSeek, and OpenRouter reasoning token notes.
Cheapest LLM API FAQ
For a balanced workload, start with the lowest combined input plus output price in the live table above. For real workloads, verify the winner against your input/output ratio, context length, retries, and quality threshold.
Free tiers are useful for prototypes and tests, but they often have rate limits, model restrictions, or availability constraints. Production systems should compare paid fallback options before relying on a free tier.
Optimize for the larger side of your workload. RAG and summarization are input-heavy. Code generation, agent steps, and long reports are output-heavy. Chat support often needs a balanced total price.
If a stronger model reduces retries, tool calls, hallucination review, or prompt length, its total workflow cost can be lower even when token price is higher.
Re-check before major usage increases, new model launches, and provider migrations. Prices, free tiers, context windows, and cache discounts change often.
The tables normalize prices to USD per 1M tokens and link to provider/model pages, so you can compare across major API providers.