ai · April 23, 2026 · 3 min read

Why output tokens cost more than input tokens

Output tokens cost 5-6x more than input on every major provider. That ratio isn't a margin decision — it's the compute economics of how inference actually runs on a GPU.


Most people answer this with pricing logic. The real answer is in how inference works.

An LLM runs in two phases. They look similar from the outside but behave very differently on the GPU.

Prefill: processing your input

When you send a prompt — say 1,000 tokens — the model reads every token at once.

All 1,000 tokens go through the network in a single forward pass. The GPU does matrix multiplications on the full batch in parallel. Every core is busy. Memory bandwidth is saturated. Utilization is high.

This is what GPUs are built for. Big matrix, one shot.

The cost per input token ends up low because the hardware is being used the way it was designed to be used.

Decode: generating your output

Now the model has to produce a reply. It cannot generate the full response in one pass. It generates one token. Then it looks at that token plus everything before it. Then it generates the next one.

For every output token, the model runs a full forward pass.

100 output tokens = 100 forward passes.

Each pass has to read the entire model weights from memory. That is tens of gigabytes moved from HBM to the compute units, just to produce one token. The arithmetic is tiny compared to the memory traffic. Most of the GPU sits idle waiting for data.

This is the memory-bound regime. The GPU is not the bottleneck. Memory bandwidth is.
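You can put a rough ceiling on this with back-of-envelope arithmetic: at batch size 1, every decode step must stream all the weights from HBM, so tokens per second cannot exceed bandwidth divided by model size. The numbers below are illustrative assumptions (roughly an H100-class GPU and a 70B fp16 model), not a benchmark.

```python
# Back-of-envelope decode ceiling for a single sequence (batch size 1):
# max tokens/sec ≈ memory bandwidth / bytes of weights streamed per pass.
# Illustrative numbers, not measurements.

hbm_bandwidth_gb_s = 3350  # ~H100-class HBM3 bandwidth, GB/s
model_params_b = 70        # a 70B-parameter model
bytes_per_param = 2        # fp16/bf16 weights

weight_gb = model_params_b * bytes_per_param       # 140 GB per forward pass
max_tokens_per_s = hbm_bandwidth_gb_s / weight_gb  # bandwidth-bound ceiling

print(f"{max_tokens_per_s:.0f} tokens/sec ceiling")  # ~24 tokens/sec
```

Batching many user requests together amortizes that weight traffic across sequences, which is how providers claw back efficiency — but per-token, decode still pays for memory movement in a way prefill does not.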

What this looks like on the price sheet

Here are the current rates for three frontier models, per million tokens:

Model            Input   Output   Ratio
Claude Opus 4.7  $5.00   $25.00   5x
GPT-5.4          $2.50   $15.00   6x
Gemini 3.1 Pro   $2.00   $12.00   6x

Notice the pattern. Different companies. Different hardware. Different model sizes. Same 5-6x output premium.

That is not a coincidence or a margin decision. It is the compute economics showing up on the invoice.

A request with 1,000 input tokens and 1,000 output tokens on Opus 4.7 costs $0.005 for the input and $0.025 for the output. You sent the same number of tokens both ways. You paid 5x more for the second half.

Why the ratio matters for your bill

If you are building anything with LLMs, this one fact changes how you design the system:

Long input, short output is cheap. Short input, long output is expensive.

A RAG system that feeds 10,000 tokens of context and gets back a 200 token answer is very different on the bill from an agent that takes a 200 token instruction and writes 10,000 tokens of code. Same total tokens. The second one costs roughly 5x more.
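The comparison is easy to check directly, using the Opus 4.7 rates from the table ($5 in, $25 out, per million tokens):

```python
# Cost of the two workloads above at Opus 4.7 rates ($/M tokens).
def cost(input_tokens, output_tokens, in_rate=5.00, out_rate=25.00):
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

rag = cost(10_000, 200)      # long input, short output
agent = cost(200, 10_000)    # short input, long output

print(f"RAG: ${rag:.3f}  Agent: ${agent:.3f}")  # $0.055 vs $0.251
```

Same 10,200 total tokens, roughly 4.6x the price — purely because of which direction the bulk of the tokens flow.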

This is also why prompt caching exists. Cached input tokens drop to 10% of the normal input rate on most providers. You are paying for what the GPU actually has to do, and reading cached KV values from memory is much cheaper than recomputing the prefill.
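Extending the same arithmetic to caching shows why it matters most for long-context workloads. Assuming the 10% cached-read rate mentioned above and the Opus 4.7 numbers from the table:

```python
# Effect of prompt caching on the RAG-style workload: cached input
# tokens are billed at 10% of the normal input rate (assumed here).
IN_RATE, OUT_RATE = 5.00, 25.00  # $/M tokens

def cost(input_tokens, output_tokens, cached=0):
    uncached = input_tokens - cached
    return (uncached * IN_RATE
            + cached * IN_RATE * 0.10
            + output_tokens * OUT_RATE) / 1_000_000

cold = cost(10_000, 200)                 # no cache hit
warm = cost(10_000, 200, cached=10_000)  # full prefix cached

print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")  # $0.0550 vs $0.0100
```

A fully cached prefix cuts this request's cost by more than 80% — and notice that the output side of the bill is untouched, because decode still has to run token by token.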

The short version

Prefill is bulk. Decode is drip.

Same GPU, very different efficiency.

The 5-6x you see on every provider's pricing page is not marketing. It is physics.


We build LLM-powered systems — conversational analytics, agents, AI overlays on data warehouses — with the token economics accounted for up front. If you're scoping an AI feature and want a sanity check on the cost model before you ship, book a discovery call.

Got a similar problem?

30 minutes. We'll tell you honestly what's broken.