LLM Fundamentals: Tokens, Context, and Cost

Tokens are not characters or words. Context is not free. Cost is per-token in both directions. These three fundamentals determine roughly 80% of how an LLM-backed feature performs and bills.

By @valentinamwangi

December 28, 2025 · Updated May 20, 2026

Three numbers determine roughly 80% of how a feature backed by a large language model performs in production: the number of tokens you send, the size of the context window the model can see, and the per-token price in each direction. Get those three intuitions right and most other questions about latency, cost, and feature design have obvious answers. Get them wrong and you ship features that work in the demo and quietly cost ten times what you budgeted.

This article is the on-ramp I wish I had been handed before I shipped my first LLM-powered feature. It covers what a token actually is, why context windows are not the same as memory, and a cost-modeling exercise so the bill never surprises you. I will hedge specific numbers, because model providers change pricing and capabilities frequently; the shapes are stable.

What a token actually is

A token is the smallest unit the model reads or writes. It is not a character. It is not a word. It is a sub-word unit produced by a tokenizer (BPE, SentencePiece, or a similar algorithm), and the same English text becomes a different number of tokens depending on which model's tokenizer you are using.

A few rough rules of thumb that have held across every English-language model I have used:

  • 1 token is roughly 4 characters, or roughly 3/4 of a word.
  • 1000 tokens is roughly 750 English words.
  • A short tweet (~280 characters) is roughly 70 tokens.
  • A typical email (~500 words) is roughly 650 to 700 tokens.
  • A long article (~3000 words) is roughly 4000 tokens.

Those are rough. Languages other than English (especially languages with rich morphology like Finnish, or character-based languages like Chinese) tokenize differently and the ratios shift. Code usually costs more tokens per character than prose, because identifiers, punctuation, and indentation tend to split into many short tokens.

Why this matters in practice: when you call an API, you are billed per token in both directions. The user types 200 characters of question (roughly 50 tokens). You wrap it in a system prompt of 600 characters (roughly 150 tokens). You include three retrieved documents at 800 tokens each (2400 tokens). The total input is 50 + 150 + 2400 = 2600 tokens. The model's response is, say, 500 tokens. You are billed for 3100 tokens for one round trip, not the 50 tokens of the user's question.

A simple Python example using a generic tokenizer:

# pip install tiktoken (OpenAI's tokenizer; close enough for back-of-envelope)
import tiktoken

enc = tiktoken.get_encoding('cl100k_base')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

prompt = '''You are a helpful assistant. Answer the user's question
based only on the provided context. If the context does not contain
the answer, say so explicitly.'''

user = 'What is the maximum number of items I can list per page?'

docs = [
    'API: GET /items?limit=N supports up to N=100. The default is 25.',
    'Performance: large limits trigger rate-limit checks at 1000 req/min.',
    'Pagination: cursor-based pagination is preferred over offset.',
]

total = count_tokens(prompt) + count_tokens(user) + sum(count_tokens(d) for d in docs)
print(f'input tokens: {total}')

Running that on a typical system prompt + question + three short docs lands somewhere around 150 to 300 tokens. The point is not the exact number; the point is that counting tokens is the first habit to develop, because every product question downstream (cost, latency, context budget) flows from it.

Context windows are not memory

The context window is the maximum number of tokens the model can see in a single call: the system prompt, the chat history, the retrieved documents, the user's current message, all of it together. Modern frontier models advertise context windows from 128K tokens to over 1 million; this is a tempting number, and the temptation is the trap.

Three things to know about context windows:

The window is not memory across calls. Each API call is stateless. "Memory" between turns of a chat is just you including the previous turns in the next call's input. Every previous turn you keep is paid for again, in tokens, on every subsequent call. A 20-turn conversation where each turn adds 500 tokens sends 500 input tokens on turn 1, 1000 on turn 2, ... 10,000 on turn 20, which adds up to 105,000 input tokens over the whole conversation. The per-call cost grows linearly with conversation length; the cumulative cost grows quadratically.
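To make that growth concrete, here is a minimal sketch of the arithmetic. It assumes every turn adds a fixed number of tokens and that the full history is resent on each call; the function name is mine, purely for illustration.

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int) -> tuple[list[int], int]:
    """Return (input tokens sent on each call, cumulative input tokens)."""
    # Call k resends all k turns so far, so its input is k * tokens_per_turn.
    per_call = [k * tokens_per_turn for k in range(1, turns + 1)]
    return per_call, sum(per_call)


per_call, total = conversation_input_tokens(turns=20, tokens_per_turn=500)
print(per_call[0], per_call[1], per_call[-1])  # 500 1000 10000
print(total)                                   # 105000
```

Doubling the conversation to 40 turns roughly quadruples the cumulative total, which is why long chats need a history strategy rather than a bigger window.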

Long context degrades retrieval, sometimes badly. A model with a 1M-token context can technically read a small library, but recent evaluations show that information in the middle of a long context window is often retrieved less reliably than information at the start or end. The shape is sometimes called the "lost in the middle" effect; specifics vary by model, but the practical takeaway has held: shorter, focused contexts almost always retrieve better than longer, padded ones.

Long context is expensive linearly. Most providers bill input tokens linearly, with no discount for longer contexts. Sending 100K tokens of input is 100x the cost of sending 1K. There are exceptions (some providers offer prompt caching that effectively gives big discounts on repeated context), but the linear-cost shape is the safe assumption.

The consequence for product design: shorter, more relevant context beats longer, exhaustive context, on every dimension that matters. Quality, cost, and latency all improve. The skill is selecting which 5K tokens of context the model actually needs to answer well, not stuffing 200K tokens in and hoping.
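One way to sketch that selection step is a greedy pack under a token budget. Everything here is an assumption for illustration: the relevance scores are taken as given from whatever retrieval step you already have, `approx_tokens` uses the 4-characters-per-token rule of thumb instead of a real tokenizer, and `select_context` is a hypothetical helper, not a library function.

```python
import math


def approx_tokens(text: str) -> int:
    # Rule-of-thumb estimate (about 4 characters per token); swap in a real tokenizer.
    return math.ceil(len(text) / 4)


def select_context(scored_docs: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring documents that still fit in the budget."""
    chosen, used = [], 0
    for _score, doc in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(doc)
        if used + cost <= budget:  # skip any document that would overflow
            chosen.append(doc)
            used += cost
    return chosen


docs = [
    (0.91, "API: GET /items?limit=N supports up to N=100. The default is 25."),
    (0.40, "Performance: large limits trigger rate-limit checks at 1000 req/min."),
    (0.15, "Pagination: cursor-based pagination is preferred over offset."),
]
print(select_context(docs, budget=20))
```

Greedy packing is not optimal, but it is predictable: the budget is a hard ceiling, and the most relevant document always gets first claim on it.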

A cost model that fits on the back of an envelope

The cost calculation that determines whether a feature is sustainable is straightforward; teams skip it because the numbers feel small per request and intimidating in aggregate. Here is the back-of-envelope I run before approving any LLM feature.

LLM cost per request (back of envelope)
  input_tokens  =  system_prompt + retrieved_docs + chat_history + user_message
  output_tokens =  estimated response length
  cost_per_request = (input_tokens * price_in)  +  (output_tokens * price_out)
  monthly_cost = cost_per_request * requests_per_user * monthly_active_users

A worked example. Assume hypothetical (and rounded) prices: $5 per million input tokens, $15 per million output tokens. Assume the feature averages 3000 input tokens and 500 output tokens per call. A single call costs:

cost_per_request = (3000 / 1_000_000) * $5  +  (500 / 1_000_000) * $15
                 = $0.015 + $0.0075
                 = $0.0225

Two and a quarter cents per call. Now layer the user model. Suppose 10,000 monthly active users each make 30 calls per month:

monthly_cost = $0.0225 * 30 * 10_000 = $6,750/month

That is the realistic number to budget for. The features that fail in production are usually the ones where this calculation was not done and the team discovered the cost on the first month's invoice. The features that succeed almost always have one of three properties: input tokens are small (under 1K), the user model is bounded (calls per user are controlled), or the per-call cost is offset by clear monetizable value.
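The same envelope as a small script, so the assumptions can be changed and rerun. The prices are the hypothetical ones from the worked example, not any provider's actual rates, and the function names are mine.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return ((input_tokens / 1_000_000) * price_in_per_m
            + (output_tokens / 1_000_000) * price_out_per_m)


def monthly_cost(per_request: float, calls_per_user: int, active_users: int) -> float:
    """Dollar cost per month for a given usage profile."""
    return per_request * calls_per_user * active_users


per_call = cost_per_request(3000, 500, price_in_per_m=5.0, price_out_per_m=15.0)
print(f"per call:  ${per_call:.4f}")                              # per call:  $0.0225
print(f"per month: ${monthly_cost(per_call, 30, 10_000):,.2f}")   # per month: $6,750.00
```

Rerunning it with input tokens cut from 3000 to 1000 is a quick way to see how much of the bill the input side carries.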

The optimization knobs to remember, in rough order of payoff:

  1. Reduce input tokens. Better retrieval, shorter system prompts, summarize chat history past N turns. This is almost always the biggest lever.
  2. Cap output tokens. Set max_tokens so the model cannot ramble. A response capped at 300 tokens costs less than one capped at 1500.
  3. Cache the static parts. If your system prompt is 1500 tokens and never changes, providers offering prompt caching can charge a fraction of the full input rate for the cached portion.
  4. Pick the right model size. A smaller model is often 5x to 10x cheaper than the flagship. For many tasks (classification, simple extraction, short rewrites) the smaller model is fine. Default to smaller, escalate when measurement shows you need it.
  5. Batch when latency allows. Some providers offer a batch API at half price for non-interactive workloads.
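For the first knob, the simplest version of history control is to drop older turns once a token budget is exceeded. Note that this sketch truncates rather than summarizes (summarizing would need an extra model call); `trim_history` is a hypothetical helper, and `approx_tokens` is the rough 4-characters-per-token estimate, not a real tokenizer.

```python
import math


def approx_tokens(text: str) -> int:
    # Rule-of-thumb estimate (about 4 characters per token); swap in a real tokenizer.
    return math.ceil(len(text) / 4)


def trim_history(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimate fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = approx_tokens(turn)
        if used + cost > budget:
            break                     # stop at the first turn that overflows
        kept.append(turn)
        used += cost
    return list(reversed(kept))       # restore chronological order


history = ["a" * 40, "b" * 40, "c" * 40]   # three turns of about 10 tokens each
print(len(trim_history(history, budget=20)))  # 2
```

With a fixed budget, the input cost per call stops growing with conversation length, at the price of the model forgetting the earliest turns.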

Latency: the other budget

Cost is one budget. Latency is the other, and the two trade off.

A typical request to a hosted frontier model has two phases: time-to-first-token (queueing plus the model processing your input before it emits anything) and tokens-per-second after that. Rough numbers I have measured at various points:

Latency rough order of magnitude (varies by provider and model)
  Time-to-first-token       300 to 1500 ms
  Tokens-per-second          30 to 150 tok/s after the first token
  500-token response total   ~4 to 18 seconds end-to-end (from the two rows above)

For an interactive feature (chat, autocomplete), this is the budget. Streaming the response (showing tokens as they arrive) is the pattern that hides the duration: the user sees the first tokens as soon as the time-to-first-token has elapsed, often within about a second, and reads while the model keeps generating. Without streaming, an 8-second wait is unacceptable; with streaming, it feels fluid.
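A rough way to reason about the two cases is to separate total time from perceived wait. This sketch assumes a constant generation rate and ignores network overhead; the mid-range inputs are illustrative values, not measurements, and the function name is mine.

```python
def response_latency(output_tokens: int, ttft_ms: float, tokens_per_s: float) -> tuple[float, float]:
    """Return (total seconds until the last token, seconds the user waits when streaming)."""
    ttft_s = ttft_ms / 1000
    total_s = ttft_s + output_tokens / tokens_per_s
    # With streaming, the user starts reading once the first token arrives.
    return total_s, ttft_s


total_s, streaming_wait_s = response_latency(500, ttft_ms=800, tokens_per_s=100)
print(f"without streaming: {total_s:.1f}s before anything appears")   # 5.8s
print(f"with streaming:    {streaming_wait_s:.1f}s before text appears")  # 0.8s
```

The total duration is identical in both cases; streaming only changes how much of it the user spends looking at a blank screen.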

For a non-interactive feature (a background job, a daily report), latency mostly does not matter and you can pick the cheaper batch options.

Failure modes I keep hitting

Three categories of failure I see across LLM-backed features. None of them are fundamental to the model; all of them are fundamental to using the model in production.

Token-count mismatches between dev and prod. The system prompt grows from 500 tokens in dev to 1800 tokens in prod because someone added a few examples. The cost triples. Nobody noticed until the invoice arrived. Fix: budget tokens explicitly at the prompt-construction layer, log the input-token count on every call, alert when it drifts.
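A minimal sketch of that fix, using only the standard library. The budget, the 20% tolerance, and the `check_token_budget` name are assumptions for illustration; in a real service the warning would feed whatever alerting you already run.

```python
import logging

logger = logging.getLogger("llm.tokens")


def check_token_budget(input_tokens: int, budget: int, tolerance: float = 0.2) -> bool:
    """Log the count on every call; warn and return False when it exceeds
    the budget by more than the tolerance."""
    logger.info("input_tokens=%d budget=%d", input_tokens, budget)
    limit = budget * (1 + tolerance)
    if input_tokens > limit:
        logger.warning("input tokens drifted: %d exceeds limit %.0f", input_tokens, limit)
        return False
    return True


print(check_token_budget(520, budget=500))    # True
print(check_token_budget(1800, budget=500))   # False
```

Calling this at the prompt-construction layer, before the API request goes out, means the 500-to-1800 drift shows up in logs on the first call rather than on the invoice.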

Output that is technically correct but in the wrong format. The model returned valid prose where the schema expected JSON. Or it added a polite preamble before the JSON. Production code crashes while parsing. Fix: use structured-output features (JSON mode, schema-constrained generation) on providers that support them; fall back to a retry with a stricter prompt if the parse fails; do not assume a format will hold.
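The parse-and-retry fallback can be sketched without tying it to any provider. Here `call_model` is a hypothetical callable standing in for your API client, `fake_model` is a stub so the example runs offline, and the stricter instruction text is just one plausible wording.

```python
import json
from typing import Callable


def extract_json(text: str) -> dict:
    """Parse the outermost {...} span, tolerating a preamble or trailing prose."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError


def call_with_json_retry(call_model: Callable[[str], str], prompt: str, retries: int = 1) -> dict:
    """Call the model, retrying with a stricter prompt when the reply does not parse."""
    attempt_prompt = prompt
    for _ in range(retries + 1):
        raw = call_model(attempt_prompt)
        try:
            return extract_json(raw)
        except ValueError:
            # Tighten the instruction for the next attempt.
            attempt_prompt = prompt + "\n\nReturn only a single JSON object, with no other text."
    raise ValueError(f"no valid JSON after {retries + 1} attempts")


def fake_model(prompt: str) -> str:
    # Stub for a real API call: it only behaves once the stricter instruction is present.
    if "Return only" in prompt:
        return '{"limit": 100}'
    return "Sure! The limit is 100."


print(call_with_json_retry(fake_model, "What is the max page size?"))  # {'limit': 100}
```

Every retry is a second paid call, so it is worth logging how often the fallback fires; a high rate is a sign the first prompt needs fixing.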

Slow drift in quality without anyone noticing. The provider updates the model. The previous prompt that worked now misbehaves on edge cases nobody tests. Fix: a small evaluation harness (say, 50 to 200 representative cases with expected outputs) that runs nightly. The harness does not catch everything, but it catches the obvious regressions before they become support tickets.
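A harness like that can start very small. The sketch below assumes a case-insensitive substring match is a good enough check, which it often is not for free-form output; `run_eval` and `stub_model` are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable


def run_eval(cases: list[tuple[str, str]], model_fn: Callable[[str], str]) -> tuple[float, list[str]]:
    """Return (pass rate, prompts that failed) using a case-insensitive substring check."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    failures = [
        prompt for prompt, expected in cases
        if expected.lower() not in model_fn(prompt).lower()
    ]
    return 1 - len(failures) / len(cases), failures


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return "The maximum is 100 items per page."


cases = [
    ("What is the max page size?", "100"),
    ("What is the default page size?", "25"),
]
pass_rate, failed_prompts = run_eval(cases, stub_model)
print(pass_rate, failed_prompts)  # 0.5 ['What is the default page size?']
```

Run nightly, the useful signal is the change in pass rate and the list of newly failing prompts, not the absolute number.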

The intuition I keep coming back to

LLMs are powerful, but they are not free, and they have no memory in the operational sense. Every call is a paid stateless function from a pile of tokens to another pile of tokens, and the engineering work is mostly about keeping the input pile small and informative, the output pile bounded, and the cost-per-call low enough to scale with usage. Once that intuition is settled, every other question (which model, which retrieval strategy, how to handle long conversations, when to cache) has a clearer answer. The features that ship and stay shipped are the ones where someone counted tokens before they wrote the prompt; the features that get rolled back are the ones where the team discovered the bill on the first of the month.
