The buffet is unlimited in theory. You can make as many trips as you want. But the plate you carry is finite. Stack it wrong and you have room for eight crab legs but no space for the mashed potatoes you actually wanted. The token budget is that plate. The model has a context window, the total space for everything you send it, and the plate fills up fast when you are not paying attention. The limit is structural and it operates whether you are aware of it or not, which means ignoring it produces degraded outputs rather than errors. You do not get an error message when the plate is full; you get a model that quietly ignores what did not fit.
The context window is the maximum number of tokens the model processes in a single call. Tokens are not exactly words, but close enough for intuition: a common word is one token, a longer word might be two or three, and punctuation adds tokens too. Your prompt, the model’s previous responses, the examples you include, the text you want generated: all of it sits in that window, all of it competes for the same space. A 500-word document is roughly 600-750 tokens. A 10-page contract is roughly 3,000-4,000 tokens. These add up faster than you expect when you start combining multiple documents with conversation history and system prompts.
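The word-to-token ratio above (500 words to roughly 600-750 tokens, or about 1.2-1.5 tokens per word) can be turned into a quick estimator. This is a back-of-envelope sketch only; for precise counts you would run your model's actual tokenizer.

```python
# Rough token estimator using the ~1.3 tokens-per-word heuristic from the
# ranges above. Real tokenizers will differ, especially for code or jargon.

def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Estimate token count from a word count heuristic."""
    return int(len(text.split()) * tokens_per_word)

doc_500_words = "word " * 500
print(estimate_tokens(doc_500_words))  # 650, inside the 600-750 range above
```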
Think of the budget as three layers competing for the same plate. The instruction layer tells the model what to do: your system prompt, your task description, your output format requirements, any constraints on the answer. The context layer is the information it reasons over: retrieved documents, conversation history, grounding data, few-shot examples. The generation layer is the space reserved for the answer. Move the sliders and you change the shape of the problem. If your instructions are verbose, you have less room for context. If your context documents are long, you have less room for the answer. The plate is shared, and every layer competes for the same finite space.
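The three-layer split can be made explicit as arithmetic. A minimal sketch, with an illustrative window size and reservations not tied to any particular model:

```python
# Split one shared window into the three layers described above. Whatever the
# instruction and generation layers reserve, the context layer gets the rest.

def context_budget(window: int, instructions: int, generation: int) -> int:
    """Tokens left for the context layer after reserving the other two."""
    remaining = window - instructions - generation
    if remaining < 0:
        raise ValueError("instructions + generation exceed the window")
    return remaining

# 8k window, 1k of instructions, 2k reserved for the answer:
print(context_budget(8_000, 1_000, 2_000))  # 5000 tokens left for context
```

Making the reservation explicit like this is what turns "the plate is shared" from a metaphor into a checkable invariant.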
Long documents consume the context layer fast in ways that surprise teams. A twenty-page contract at roughly 1.5 tokens per word is about 7,500 tokens. A model with an 8,000-token window has 500 tokens left for instructions and generation, which is enough for a short answer but not much else. Even models offering 128k tokens sound generous until you start adding multiple documents, conversation history, detailed system prompts, and few-shot examples. The window fills before you expect it, and it fills faster when you are not actively managing it. Teams often discover their context is full only when the model starts producing truncated or degraded responses.
The cost dimension is concrete and often surprises teams in production. Longer context means more tokens processed, which means higher per-query costs and slower responses. A query against 50k tokens costs more than a query against 5k tokens, sometimes dramatically so depending on the model pricing structure. Some vendors charge linearly with context length. Others have tiered pricing where longer contexts cost proportionally more at each tier. Either way, context is not free real estate. Every token you send has a price tag, and the cumulative cost of oversized contexts across thousands of queries per day becomes a meaningful budget line that finance will notice even if engineering has not modeled it.
Some models compress context aggressively before reasoning over it. Others truncate the beginning when the window fills. A few use sliding windows that keep recent content visible while pushing older content further back, harder to access. The compression approaches vary in how much they preserve. A model that summarizes context to fit may lose important nuance. A model that truncates may lose the beginning of a document that establishes key framing. Know what your model does when the window fills, because you cannot assume it handles the overflow gracefully. The behavior under load is part of your model evaluation, not an afterthought.
The RAG Retrieval Crunch
RAG systems have a direct conflict with token budgets that teams often discover only in production. You retrieve relevant documents to ground a model’s answer, but those documents consume your context budget. Retrieve too much and you have no room for the answer. Retrieve too little and the model lacks the information to answer correctly. The plate is also a forcing function: it makes you decide what actually matters, which is often healthier than the alternative of retrieving everything and hoping the model figures it out.
A team building a contract review system learned this the hard way. They tried to feed entire contracts into the model for clause-by-clause analysis. With large contracts, there was no room left for the analysis output. The model kept truncating the contract mid-clause and producing incomplete reviews that missed key provisions in the truncated sections. The fix was not a bigger model or a longer context window. The fix was to chunk the contract intelligently: retrieve only the sections relevant to the specific question, not the whole document. The plate constraint forced a better design, and the resulting system was both faster and more accurate than the one that tried to put everything on the plate.
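The shape of that fix can be sketched in a few lines. Here a simple keyword-overlap score stands in for a real embedding retriever, and the sample clauses are invented for illustration:

```python
import re

# Toy version of the contract-review fix: chunk by clause, then keep only the
# sections relevant to the question instead of sending the whole document.

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def chunk_by_clause(contract: str) -> list[str]:
    """Treat each blank-line-separated block as one clause."""
    return [c.strip() for c in contract.split("\n\n") if c.strip()]

def relevant_chunks(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Rank clauses by keyword overlap with the question; keep the top_k."""
    q = tokenize(question)
    return sorted(chunks, key=lambda c: -len(q & tokenize(c)))[:top_k]

contract = (
    "1. Termination. Either party may terminate with 30 days notice.\n\n"
    "2. Payment. Invoices are due within 45 days of receipt.\n\n"
    "3. Liability. Total liability is capped at fees paid."
)
picked = relevant_chunks(chunk_by_clause(contract), "What are the payment terms?", top_k=1)
print(picked[0])  # the Payment clause
```

A production system would rank chunks with embeddings rather than keyword overlap, but the structural point is the same: only the relevant sections go on the plate.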
The retrieval budget is worth designing explicitly rather than discovering it in production. If you know your model has a 32k context and your system prompt consumes 2k tokens, you have 30k for context and generation. If you want to leave 5k for generation, your retrieval budget is 25k tokens. That number should drive your chunking strategy, your retrieval ranking cutoff, and your decision about how many documents to retrieve. Without an explicit budget, you will discover the constraint only when the model truncates your context mid-query and produces answers that reference content you never saw.
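The arithmetic above is worth encoding directly, because it is the number that should drive your chunking strategy. A sketch using the same figures as the example (32k window, 2k system prompt, 5k reserved for generation) and an assumed chunk size:

```python
# Derive the retrieval budget explicitly, then see how many chunks fit.

def retrieval_budget(window: int, system_prompt: int, generation: int) -> int:
    """Tokens available for retrieved context."""
    return window - system_prompt - generation

def max_chunks(budget: int, chunk_tokens: int) -> int:
    """How many retrieval chunks of a given size fit in the budget."""
    return budget // chunk_tokens

budget = retrieval_budget(32_000, 2_000, 5_000)
print(budget)                   # 25000 tokens for retrieved context
print(max_chunks(budget, 800))  # 31 chunks of ~800 tokens each
```

The chunk size of 800 tokens is an assumption for illustration; the point is that the retrieval ranking cutoff falls out of the budget rather than being guessed.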
Query complexity affects how much context you need. A simple factual query might need only the document that contains the answer. A complex analytical query that requires reasoning across multiple documents needs more context. Designing your retrieval budget requires knowing the typical complexity of your queries, which requires analyzing your query distribution. A system that handles “what is X” questions needs a different budget than one that handles “compare X to Y across these dimensions.”
What Gets Left Behind
Truncation is the most common failure mode, and it is invisible when it happens. Either the beginning or the end of your context gets cut when the budget runs out. Models vary in how gracefully they handle this. Some retain the beginning better. Some lose the middle entirely in a pattern sometimes called the “lost in the middle” problem, where content in the middle of a long context is reasoned over less effectively than content at the boundaries. A few models trained specifically on very long contexts handle middle content better, but this is not universal.
The practical implication is counterintuitive if you think of context as a uniform resource. Put your most important information at the boundaries. Your system instructions belong at the beginning or end of the context, not buried in the middle. The retrieved documents most central to the answer should be loaded last, sitting in the most recent and most attended positions. If you have a 100-page document and you only have room for 30 pages, the 30 pages at the boundaries will get more effective attention than the same 30 pages from the middle. When you are designing your retrieval pipeline, retrieval order matters as much as retrieval relevance.
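One way to act on the boundary advice in a retrieval pipeline: take documents already sorted most-relevant-first and alternate them between the front and the back of the context, so the least relevant material ends up in the middle. A minimal sketch:

```python
# Reorder retrieved documents so the most relevant sit at the boundaries.
# Input is assumed to be sorted most-relevant-first.

def boundary_order(docs_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 else back).append(doc)
    # back holds ranks 1, 3, 5...; reverse it so rank 1 sits at the very end,
    # the most recent and most attended position
    return front + back[::-1]

docs = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(boundary_order(docs))  # ['rank2', 'rank4', 'rank5', 'rank3', 'rank1']
```

This is one simple policy, not the only one; the invariant worth keeping is that relevance rank maps to boundary proximity rather than to a single top-to-bottom ordering.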
Truncation also silently corrupts outputs in ways that are hard to debug. Because models attend more heavily to the beginning and end of context, they rely more on boundary content. If your critical instructions get truncated, the model behaves as if they were never there. You will not get an error message. You will get a confident wrong answer that passes casual review. The silent failure is the dangerous kind because it looks correct until you examine the wrong parts.
The lost-in-the-middle problem is especially acute for RAG systems. If you retrieve ten documents and only six fit in context, the model may underweight the ones that get truncated from the middle. A document relevant to the answer but positioned in the truncated middle section may as well not have been retrieved at all. Order your retrieved documents by relevance, then put the most relevant at the boundaries.
The Multi-Model Cost Picture
If you are running multiple model calls in a pipeline, the token cost compounds in ways that are easy to underestimate. A system that retrieves documents, re-ranks them, synthesizes a response, and then formats the output might run three or four model calls, each with its own context budget. The total token consumption of such a system is the sum of all those calls, not just the final one. Budget at the system level, not just the call level.
A pipeline that retrieves with a large context, reranks with another large context, and then synthesizes with yet another large context is expensive even if each step seems reasonable in isolation. The retrieval call might consume 30k tokens. The rerank call another 15k. The synthesis call another 20k. Three calls, 65k tokens total, three times the cost and latency of a single call. Pipeline design requires understanding the token cost of each stage and optimizing the expensive ones, not just the final one that users see. The retrieval call is often the largest context consumer and the most optimizable.
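Summing the stages makes the compounding visible. This sketch uses the stage sizes from the example above; the per-token price is an invented placeholder, so substitute your vendor's actual rate:

```python
# System-level token accounting: the pipeline's cost is the sum of every
# call, not just the final synthesis step the user sees.

PRICE_PER_1K_TOKENS = 0.003  # hypothetical rate, dollars per 1k tokens

stages = {"retrieve": 30_000, "rerank": 15_000, "synthesize": 20_000}
total_tokens = sum(stages.values())
cost_per_query = total_tokens / 1_000 * PRICE_PER_1K_TOKENS

print(total_tokens)            # 65000
print(round(cost_per_query, 3))
```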
Token budgeting should account for the full conversation lifecycle. If a user has a long conversation with twenty exchanges, each exchange includes the full conversation history. After ten exchanges, you may be sending half the context window as conversation history before the new query. Explicit conversation window management (summarizing old turns, evicting early context) becomes necessary as conversations grow. The plate fills up differently as the meal progresses.
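A minimal eviction policy captures the idea: drop the oldest turns until the history fits a budget. The per-turn token counts here are illustrative, and a real system would typically summarize evicted turns rather than discard them outright:

```python
# Evict conversation history from the front until it fits the budget.

def trim_history(turns: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """turns: (text, token_count) pairs, oldest first. Evict oldest first."""
    kept = list(turns)
    while kept and sum(t for _, t in kept) > budget:
        kept.pop(0)
    return kept

history = [("turn1", 900), ("turn2", 700), ("turn3", 400), ("turn4", 500)]
print(trim_history(history, 1_200))  # keeps the two most recent turns
```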
Build a token cost model before going to production. Start with your average query complexity: how many tokens does a typical query, context, and response consume? Multiply by your expected queries per day. Multiply by your cost per token. The result is a daily context cost that engineering can present to finance before finance starts asking questions. The model is not free at scale, and the cost curves are not linear.
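The cost model is three multiplications. Every number below is a placeholder; plug in your own traffic figures and your vendor's published rates:

```python
# Back-of-envelope daily cost model: per-query tokens x queries x price.

tokens_per_query = 3_000 + 12_000 + 800   # prompt + context + response (assumed)
queries_per_day = 50_000                   # assumed traffic
price_per_token = 0.000002                 # hypothetical: $2 per million tokens

daily_cost = tokens_per_query * queries_per_day * price_per_token
print(tokens_per_query)      # 15800 tokens per query
print(round(daily_cost, 2))  # dollars per day
```

At these placeholder numbers the daily figure lands in the four-digit range, which is exactly the kind of line item worth showing finance before they find it themselves.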
Reserve budget for response generation, not just input context. A model that runs out of generation budget midway through an answer produces truncated outputs. If your answers typically run 500-1000 tokens, reserve at least 1500 tokens for generation to handle variance. Truncated answers require regeneration, which costs more than reserving adequate generation budget upfront. The answer is the product; protect the space for it.
Budget your token plate deliberately and explicitly. Keep critical instructions at the start and end of context where attention is highest. Put supporting documents in the middle only if they genuinely matter for reasoning. Verify what happens when you exceed the limit, whether it truncates or errors, before production use. Consider whether a longer-context model is worth the cost trade-off for your use case. Design your retrieval budget explicitly, not as an afterthought, and measure actual token usage in production to verify your estimates. Model the total token cost of your pipeline, not just individual calls, because the sum is what you pay.
Use shorter context for simple short tasks where the model does not need much grounding, for high-volume cost-sensitive applications where every token matters, for tasks where latency matters more than depth, and for cases where the answer fits in a few hundred tokens. The smaller context is faster and cheaper, and for these tasks it is equally effective. Reserve the longer context for tasks where the extra information genuinely changes the answer.
The plate is finite. Choose what you put on it, and pay attention to which end of the plate the model actually eats from. Attention is not uniform across the context window, and designing your context with that in mind is the difference between a system that works and a system that mostly works until it does not.