GPT-OSS Pricing Calculator for API and Self-Hosted Costs
Estimate gpt-oss token spend across hosted providers, free routes, and self-hosted deployment planning. Compare gpt-oss-120b and gpt-oss-20b rows from the AI Pricing Hub database, then model monthly input, output, cache, and batch usage.
Quick answer · Pricing data refreshed 2026-03-13 12:45:29
GPT-OSS weights are free, but API and infrastructure costs depend on where you run the model.
OpenAI describes gpt-oss as open-weight reasoning models available in 120B and 20B sizes. The weights can be downloaded and used under the Apache 2.0 license, but hosted API usage, GPU time, storage, monitoring, and third-party routing still create real operating costs. This calculator focuses on hosted token prices and treats self-hosted costs as a separate planning note.
GPT-OSS API cost estimator
Model token spend for a monthly workload. Free provider routes are shown as $0 token cost, but they can still have rate limits, queueing, usage caps, or data-handling tradeoffs.
When this calculator helps
- Compare gpt-oss-120b and gpt-oss-20b token prices before choosing a hosted provider.
- Estimate whether a free route is enough for prototypes or whether paid capacity is required.
- Plan monthly spend for coding assistants, agent workflows, document processing, or reasoning-heavy chat.
- Separate token billing from self-hosted costs such as GPUs, storage, autoscaling, observability, and engineering time.
How to calculate GPT-OSS API cost
- Choose the gpt-oss row that matches your provider, model size, and pricing source.
- Enter average input and output tokens per request. Count system prompts, tool schemas, and retrieved context as input. Include hidden reasoning tokens only when your provider bills them.
- Enter the expected monthly request count. For agent workloads, count every model call in a multi-step run, not just each user-facing task.
- Add cache or batch percentages only when your selected provider exposes those discounts for the specific gpt-oss route.
- Use the operational buffer to cover retries, evaluation runs, prompt experiments, and failed calls that may still consume quota or engineering time.
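The steps above can be sketched as a small function. This is a minimal illustration, not the calculator's actual implementation: the `Route` container, the `cache_share`, `batch_share`, and `buffer` parameters, and the way the batch discount is applied are all assumptions made for the example. The prices are the DeepInfra gpt-oss-120b row from the pricing table on this page, and the agent task counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    """One pricing-table row, in USD per 1M tokens (hypothetical container)."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: Optional[float] = None  # None: route exposes no cache price
    batch_discount: float = 0.0                 # assumed fraction off batch traffic


def monthly_token_cost(route, input_tokens, output_tokens, requests,
                       cache_share=0.0, batch_share=0.0, buffer=0.0):
    """Estimate monthly token spend in USD.

    cache_share: fraction of input tokens billed at the cached price.
    batch_share: fraction of traffic sent through a discounted batch route.
    buffer: operational buffer, e.g. 0.10 for 10% extra.
    """
    total_in = input_tokens * requests
    total_out = output_tokens * requests

    # Ignore the cache share when the route has no cache price.
    if route.cached_input_per_m is None:
        cache_share = 0.0
        cached_price = route.input_per_m
    else:
        cached_price = route.cached_input_per_m

    input_cost = total_in * ((1 - cache_share) * route.input_per_m
                             + cache_share * cached_price) / 1_000_000
    output_cost = total_out * route.output_per_m / 1_000_000

    subtotal = input_cost + output_cost
    subtotal *= 1 - batch_share * route.batch_discount
    return subtotal * (1 + buffer)


# DeepInfra gpt-oss-120b row: $0.04 input, $0.19 output per 1M tokens.
deepinfra_120b = Route(input_per_m=0.04, output_per_m=0.19)

# Agent workloads: count every model call, not just user-facing tasks.
tasks_per_month = 10_000  # hypothetical
calls_per_task = 5        # hypothetical
estimate = monthly_token_cost(deepinfra_120b, input_tokens=1_800,
                              output_tokens=700,
                              requests=tasks_per_month * calls_per_task,
                              buffer=0.10)
print(f"${estimate:.2f} per month")
```

Leaving `cached_input_per_m` as `None` and `batch_discount` at 0 mirrors the rule that cache and batch percentages apply only when the selected route exposes those prices.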
GPT-OSS pricing table
Rows come from the AI Pricing Hub database. Always verify final production pricing with the provider before committing budget.
| Model | Provider | Input / 1M | Output / 1M | Cache | Batch | Context |
|---|---|---|---|---|---|---|
| gpt-oss-120b (exacto) | OpenInference | Free | Free | - | - | 131k |
| gpt-oss-120b (exacto) | DeepInfra | $0.04 | $0.19 | - | - | 131k |
| gpt-oss-20b | OpenInference | Free | Free | - | - | 131k |
| gpt-oss-20b | Chutes | $0.02 | $0.10 | - | - | 131k |
| gpt-oss-20b | DeepInfra | $0.03 | $0.14 | - | - | 131k |
| gpt-oss-safeguard-20b | Groq | $0.07 | $0.30 | - | - | 131k |
Hosted API vs self-hosted GPT-OSS cost
Hosted API pricing
Hosted gpt-oss routes usually bill per million input and output tokens. This is easiest to budget when traffic is spiky, when you need rapid experimentation, or when your team does not want to operate inference infrastructure. The tradeoff is exposure to provider-specific rate limits, queueing, routing behavior, data-retention policies, and price changes you do not control.
For a hosted API, the main cost levers are model size, output length, prompt reuse, batch eligibility, and failure/retry rate. Output tokens often dominate the invoice, so reducing verbose responses can matter more than shaving a few tokens from the prompt.
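To make the output-token point concrete, the sketch below compares a 20% shorter prompt with a 20% shorter response. It uses the coding-assistant workload from this page (50,000 requests, 1,800 input and 700 output tokens, assumed to be per request) and the DeepInfra gpt-oss-120b prices. The `token_cost` helper is a name made up for this example.

```python
def token_cost(input_tokens, output_tokens, requests, input_per_m, output_per_m):
    """Monthly token cost in USD from per-request token counts and per-1M prices."""
    per_request = input_tokens * input_per_m + output_tokens * output_per_m
    return requests * per_request / 1_000_000


# $0.04 input / $0.19 output per 1M tokens.
PRICES = dict(input_per_m=0.04, output_per_m=0.19)
base = token_cost(1_800, 700, 50_000, **PRICES)

# Trim 20% from the prompt versus 20% from the response.
prompt_saving = base - token_cost(1_800 * 0.8, 700, 50_000, **PRICES)
output_saving = base - token_cost(1_800, 700 * 0.8, 50_000, **PRICES)

# Share of the bill that comes from output tokens alone.
output_share = token_cost(0, 700, 50_000, **PRICES) / base

print(f"base ${base:.2f}, output share {output_share:.0%}")
print(f"20% shorter prompt saves ${prompt_saving:.2f}")
print(f"20% shorter output saves ${output_saving:.2f}")
```

At these prices, output tokens account for roughly 65% of the bill even though the prompt is more than twice as long as the response. The same percentage cut therefore saves about $1.33 on the output side and about $0.72 on the input side.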
Self-hosted pricing
Open-weight does not mean free production inference. Self-hosting shifts cost from token billing to GPUs, memory, disk, deployment engineering, monitoring, scaling, security review, and utilization risk. It can be attractive for steady high-volume workloads, data-control requirements, edge deployments, or customized fine-tuning workflows.
A practical decision rule is to compare your predicted hosted monthly spend with the fully loaded monthly cost of the servers and engineering hours needed to meet the same latency and reliability target.
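One way to apply this rule is to compute a fully loaded self-hosted figure and then ask how many hosted requests it would pay for. The sketch below is illustrative only. Every dollar amount, the GPU count, and the engineering hours are placeholders rather than vendor quotes, and both function names are hypothetical.

```python
def self_hosted_monthly(gpu_count, gpu_hourly_usd, engineer_hours,
                        engineer_hourly_usd, other_monthly_usd=0.0,
                        hours_per_month=730):
    """Fully loaded monthly cost of a self-hosted deployment, in USD."""
    gpu_cost = gpu_count * gpu_hourly_usd * hours_per_month
    people_cost = engineer_hours * engineer_hourly_usd
    return gpu_cost + people_cost + other_monthly_usd


def breakeven_requests(self_hosted_usd, input_tokens, output_tokens,
                       input_per_m, output_per_m):
    """Monthly request count at which hosted token spend equals self-hosted cost."""
    per_request = (input_tokens * input_per_m
                   + output_tokens * output_per_m) / 1_000_000
    return self_hosted_usd / per_request


# Placeholder assumptions: one GPU at $2.00/hour, 20 engineering hours at
# $100/hour, and $150/month for storage and monitoring.
fixed = self_hosted_monthly(gpu_count=1, gpu_hourly_usd=2.00,
                            engineer_hours=20, engineer_hourly_usd=100.0,
                            other_monthly_usd=150.0)

# Hosted side: 1,800 input and 700 output tokens per request at $0.04 / $0.19 per 1M.
needed = breakeven_requests(fixed, 1_800, 700,
                            input_per_m=0.04, output_per_m=0.19)
print(f"self-hosted ${fixed:,.0f}/month, break-even near {needed:,.0f} requests/month")
```

This comparison assumes the self-hosted hardware can actually serve the break-even volume at the required latency. If it cannot, the GPU count and the fixed cost both rise, and the break-even point moves with them.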
Example GPT-OSS monthly workloads
Developer coding assistant
50,000 requests per month · 1,800 input and 700 output tokens per request
Document reasoning workflow
12,000 requests per month · 9,000 input and 1,400 output tokens per request
High-volume support triage
350,000 requests per month · 900 input and 350 output tokens per request
Use these as patterns, not benchmarks. Real cost changes with provider price, tool calls, retries, and context length.
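As a rough illustration, the three workloads can be priced against a single row. The sketch below assumes the token counts are per-request averages and uses the DeepInfra gpt-oss-120b prices from the pricing table. It ignores retries, caching, and any operational buffer, so treat the results as a floor rather than a forecast.

```python
# (monthly requests, input tokens per request, output tokens per request)
WORKLOADS = {
    "Developer coding assistant": (50_000, 1_800, 700),
    "Document reasoning workflow": (12_000, 9_000, 1_400),
    "High-volume support triage": (350_000, 900, 350),
}

# DeepInfra gpt-oss-120b row, USD per 1M tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.04, 0.19


def workload_cost(requests, input_tokens, output_tokens):
    """Monthly token cost in USD for one workload at the prices above."""
    input_cost = requests * input_tokens * INPUT_PER_M / 1_000_000
    output_cost = requests * output_tokens * OUTPUT_PER_M / 1_000_000
    return input_cost + output_cost


costs = {name: workload_cost(*shape) for name, shape in WORKLOADS.items()}
for name, usd in costs.items():
    print(f"{name}: ${usd:.2f}/month")
```

Swapping in another row only requires changing `INPUT_PER_M` and `OUTPUT_PER_M`. On a free route both values are 0, which is why rate limits and caps, not token price, become the constraint there.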
Alternatives to compare before choosing GPT-OSS
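GPT-OSS is a strong fit when open weights, local deployment options, or provider flexibility matter. If your main priority is the lowest hosted token price, compare it against other budget chat, coding, and reasoning rows before finalizing a provider. The rows listed below are text-embedding models with a 512-token context, so treat them as a price reference and not as drop-in replacements for gpt-oss generation workloads.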
| Model | Provider | Input | Output | Context |
|---|---|---|---|---|
| GTE-Base | Other | $0.0050 | Free | 512 |
| E5-Base-v2 | Other | $0.0050 | Free | 512 |
| paraphrase-MiniLM-L6-v2 | Other | $0.0050 | Free | 512 |
| all-MiniLM-L12-v2 | Other | $0.0050 | Free | 512 |
| bge-base-en-v1.5 | Other | $0.0050 | Free | 512 |
| multi-qa-mpnet-base-dot-v1 | Other | $0.0050 | Free | 512 |
| all-mpnet-base-v2 | Other | $0.0050 | Free | 512 |
| all-MiniLM-L6-v2 | Other | $0.0050 | Free | 512 |
For broader model selection, open the AI API price comparison or use the cheapest LLM API guide.
Limitations and edge cases
- Prices for the same gpt-oss model can differ between provider rows because routing, free tiers, latency targets, and enterprise plans vary by provider.
- Free routes can be useful for tests, but production systems should plan for rate limits, availability, and fallback behavior.
- Self-hosted estimates require hardware-specific throughput, utilization, and operations assumptions that are outside token-price billing.
- Some providers bill cache reads, batch jobs, failed attempts, or long-context requests differently. Use the provider's official pricing page for final procurement.
- Prompt caching only reduces cost when repeated prompt segments are actually reused and the selected provider exposes a cache price for the gpt-oss route.
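For the free-route point above, one simple planning approach is to estimate what the overflow would cost if traffic beyond a cap fell back to a paid route. The sketch below is a hypothetical example: the 20,000-request cap is invented for illustration and is not a published limit of any provider. The fallback prices are the DeepInfra gpt-oss-20b row from the pricing table.

```python
def spillover_cost(requests, free_cap, input_tokens, output_tokens,
                   input_per_m, output_per_m):
    """Token cost in USD for requests that exceed a free route's monthly cap."""
    paid_requests = max(0, requests - free_cap)
    per_request = (input_tokens * input_per_m
                   + output_tokens * output_per_m) / 1_000_000
    return paid_requests * per_request


# Hypothetical cap of 20,000 free requests per month; overflow is billed at
# $0.03 input / $0.14 output per 1M tokens.
cost = spillover_cost(requests=50_000, free_cap=20_000,
                      input_tokens=1_800, output_tokens=700,
                      input_per_m=0.03, output_per_m=0.14)
print(f"fallback spend: ${cost:.2f}/month")
```

Real free tiers are often limited per minute or per day, not per month, so a production estimate would need the provider's actual limits and the workload's traffic shape.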