Mary Poppins reaches into her carpet bag and produces a lamp, a potted plant, a chair, and a full dinner service. The bag is impossibly large on the inside. But Mary does not reach past the top layer. She extracts what she needs in the order she needs it. The bag holds everything, but access is sequential from the top. The bottom layer exists but takes longer to reach. Some things are easier to pull out than others depending on what is on top.
The context window is that bag. It has a capacity measured in tokens. Everything you send to the model fits inside. But the model pays attention more readily to tokens at the beginning and end of the context than to tokens buried in the middle. The attention mechanism that lets models reason over context is not uniform across the window. For many models, content at the extremes gets more weight than content in the center. The bag is large; the reach is selective.
If you load a long document into the middle of your context, the model may not reason over it as effectively as if that same content appeared at the start or end. This is not a universal property of all models, but it is common enough to be a design consideration. Position matters. The document in the bottom of the bag exists; it just takes longer to get to and might not be reached at all if the bag is too full.
This is not merely a theoretical concern. A team building a contract analysis system found that clause-level provisions buried in the middle of long contracts were regularly missed or misinterpreted. Moving the most important clauses to the beginning or end of the context (after moving them in the document structure itself) substantially improved analysis quality. The model was not equally attending to all parts of the document. The lawyer highlighted the key clause in bold; the judge read it first.
Some models have been specifically trained to handle long contexts more uniformly. Relative position encoding and other architectural improvements can reduce the positional bias. But the bias has not been eliminated across all available models. Know your model’s behavior at the context lengths you actually use, not just at the context lengths that are benchmarked. The bag has better internal organization in some models than others.
The lost-in-the-middle problem varies in severity across models. Some models show severe attention dropout in the middle. Others distribute attention more evenly. For long documents, this is part of your model evaluation criteria, not an afterthought. Test with your actual document lengths and measure whether critical information in the middle is faithfully represented in outputs. Measure whether your key clauses get analyzed correctly when they are in the middle of the contract.
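One way to run this kind of test is a positional-recall probe: plant a known fact (a "needle") at varying depths in filler text and check whether the model surfaces it. The sketch below assumes a `call_model` function standing in for your actual model client; everything else is illustrative.

```python
# Positional-recall probe: insert a needle at different relative depths
# and record whether the model's answer contains the expected fact.
# `call_model` is a placeholder for your real model client.

def build_probe(needle: str, filler_paragraph: str, depth: float,
                total_paragraphs: int = 100) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    paragraphs = [filler_paragraph] * total_paragraphs
    position = int(depth * (total_paragraphs - 1))
    paragraphs.insert(position, needle)
    return "\n\n".join(paragraphs)

def positional_recall(call_model, needle: str, answer: str, filler: str,
                      depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Return per-depth hit/miss for a question about the needle."""
    results = {}
    for depth in depths:
        context = build_probe(needle, filler, depth)
        prompt = f"{context}\n\nQuestion: what does the special clause say?"
        results[depth] = answer.lower() in call_model(prompt).lower()
    return results
```

Run the probe at the document lengths you actually serve; a dip in recall near depth 0.5 is the lost-in-the-middle signature.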
This creates a practical heuristic: put your most important information at the boundaries of your context. System instructions at the beginning, critical data at the end, with supporting context in between only when it genuinely matters for reasoning. If you have two large documents and limited context space, decide which one goes at the boundary and which one risks reduced attention. Mary puts the lamp on top because it is what you need first.
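The heuristic can be made mechanical in how you assemble the prompt. This is a minimal sketch with illustrative names, not a specific API: instructions first, lowest-priority material in the middle, the critical document and the task last.

```python
# Boundary-first context assembly: the beginning and end of the context
# get the content the model must not miss.

def assemble_context(system_prompt: str, critical_doc: str,
                     supporting_docs: list[str], question: str) -> str:
    parts = [system_prompt]        # beginning: instructions
    parts.extend(supporting_docs)  # middle: lowest-priority support
    parts.append(critical_doc)     # end: the document that must be read
    parts.append(question)         # end: the task itself
    return "\n\n".join(parts)
```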
What Consumes the Space
The context window is shared. Your system prompt consumes some. The conversation history consumes some. Your retrieved documents for a RAG system consume some. The examples you provide for few-shot prompting consume some. The space left for the actual response (the generation budget) is whatever remains. The bag is full of things; Mary decides what goes on top.
This means adding retrieved documents to a RAG system is not free. Every document you add to the context displaces something else. If your system prompt is 2k tokens and your retrieved documents are 20k tokens, you have 10k tokens left for conversation history and generation in a 32k context model. If you increase retrieval to 25k tokens to improve recall, you have 5k left, and your generation may be truncated or your conversation history may be evicted. You took out the lamp to make room for the plant; now you cannot find the lamp.
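The arithmetic is simple enough to enforce in code. This sketch mirrors the numbers above; the token counts are illustrative, and in practice you would measure them with your tokenizer.

```python
# Token accounting for a 32k-context model. Whatever the fixed consumers
# do not use is what remains for conversation history and generation.

CONTEXT_LIMIT = 32_000

def remaining_budget(system_tokens: int, retrieval_tokens: int,
                     history_tokens: int = 0) -> int:
    """Tokens left after the fixed consumers are subtracted."""
    used = system_tokens + retrieval_tokens + history_tokens
    if used > CONTEXT_LIMIT:
        raise ValueError(f"context overflow by {used - CONTEXT_LIMIT} tokens")
    return CONTEXT_LIMIT - used

remaining_budget(2_000, 20_000)  # 10_000 left for history and generation
remaining_budget(2_000, 25_000)  # 5_000 left: generation risks truncation
```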
The trade-off between retrieval breadth and generation space is one you manage explicitly in well-designed systems. A common mistake is retrieving as many documents as possible because “more context is better.” More context is not always better if the extra context displaces the most relevant content or forces truncation of the generation output. Mary fills the bag carefully; the chaos approach means things get lost.
Context management is an active discipline, not a one-time configuration. As your application evolves, your context consumption evolves. A system prompt that starts at 500 tokens may grow to 2000 tokens as you add instructions. If you do not monitor context usage, you will gradually fill the context with overhead and have less room for the actual problem. The bag contents change over time; Mary reorganizes.
Conversation history management becomes critical as sessions grow. After twenty exchanges in a 32k context window, history alone may consume 15k tokens. Explicit truncation or summarization of old turns keeps space available for new work. Waiting for the context to fill before addressing this leads to degraded outputs once the limit is hit. The bag fills up during the conversation; Mary takes things out to make room for new things.
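The simplest explicit policy is drop-oldest truncation: keep the most recent turns that fit the history budget. The sketch below assumes a crude whitespace token count as a stand-in for a real tokenizer; summarization of the dropped turns would be a refinement on top of this.

```python
# Drop-oldest history truncation: walk the turns newest-first and keep
# those that fit, then restore chronological order.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; replace with a real tokenizer

def truncate_history(turns: list[str], budget: int) -> list[str]:
    """Keep the newest turns whose total token count fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):   # newest first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))    # restore chronological order
```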
Model Differences in Context Handling
Different models handle context differently. Some have been trained specifically to handle long contexts more uniformly. Some use relative position encoding that handles middle content better. Some have known degradation in the middle of very long contexts. Some truncate gracefully; others lose the beginning entirely when the window fills. The internal organization of the bag varies by model.
A model that handles 128k tokens sounds like it can reason over very long documents. That is sometimes true and sometimes a benchmark artifact. The model may perform well on tests that use synthetic documents specifically constructed for context length benchmarks but degrade on real documents with different structural properties. Real documents have hierarchical structure, digressions, and redundancy that synthetic benchmarks do not capture. The bag works well in the demo; real trips pack differently.
The practical implication: test with your actual documents at your actual context lengths. If you are building a system that will process 50-page documents, test with 50-page documents, not with documents engineered to hit a certain token count. The behavior on real content may differ substantially from the behavior on benchmark content. Mary packed for a demo; real travel is different.
Some models use context extension techniques that affect how position bias manifests. Rotary position embeddings (RoPE) generalize to longer contexts better than absolute position encodings. Models with RoPE may handle extended contexts better, but the improvement is in the model's ability to use longer contexts, not necessarily in uniform attention distribution across them. The bag is bigger; the reach is still selective.
The Attention Budget
The model does not attend equally to everything in context. It has a limited attention budget that gets distributed across the context. When the context is small, the budget covers everything. When the context is large, the model must make trade-offs about what to attend to. Content at the extremes and content that is more salient tend to get more of that budget; content in the middle and content that is less salient get less. Mary can only pull out so many things; the top layers get attention.
This means there is a practical limit on how much useful information you can pack into a context. Beyond a certain point, adding more context dilutes the attention that each piece of content receives. A retrieval system that returns 50 relevant documents might provide less useful grounding than one that returns 5 highly curated documents, because the model cannot attend to 50 documents with the depth that each one needs. Mary stops pulling things out when the bag is too heavy.
Curating what goes into context is a design skill. It requires knowing what the model actually needs to answer the question, not just what information is tangentially related. The retrieval step should be selective, not exhaustive. Retrieve the 10 most relevant documents rather than the 50 documents that are all somewhat relevant. Mary chooses the lamp, not the entire dining service.
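Selectivity can be enforced at retrieval time. This sketch assumes each candidate document comes with a relevance score (the scoring mechanism itself is outside the sketch) and keeps the highest-scoring documents that fit a fixed token budget, skipping any that would overflow it.

```python
# Selective retrieval: highest-scoring documents first, skipping any
# document that would push total tokens past the budget.

def curate(scored_docs: list[tuple[float, str]], token_budget: int) -> list[str]:
    """Return the best-scoring documents that fit the token budget."""
    selected, used = [], 0
    for score, doc in sorted(scored_docs, reverse=True):
        cost = len(doc.split())  # crude token proxy
        if used + cost > token_budget:
            continue             # too big; a smaller doc may still fit
        selected.append(doc)
        used += cost
    return selected
```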
The generation budget is often overlooked in context design. Teams optimize retrieval for recall but forget to reserve space for the generated answer. If your generated answer is 2k tokens and you have not reserved that space, the answer gets truncated. Budget the generation space explicitly before designing your retrieval strategy. Mary knows the dining service must fit back in the bag.
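Reserving generation space first inverts the usual calculation: size retrieval to what remains after the answer's space is set aside. The limit and reserve values below are illustrative.

```python
# Reserve the generation budget first, then size retrieval to the rest.

CONTEXT_LIMIT = 32_000

def retrieval_budget(system_tokens: int, history_tokens: int,
                     generation_reserve: int = 2_000) -> int:
    """Tokens available for retrieved documents after reserving output space."""
    budget = CONTEXT_LIMIT - system_tokens - history_tokens - generation_reserve
    return max(budget, 0)
```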
Design your context usage around model attention patterns:

- Put critical instructions and key data at context boundaries (beginning and end).
- Put supporting information in the middle only when it is genuinely necessary for reasoning.
- Treat additional context as a budget: retrieval breadth trades off against generation space.
- Verify your model's middle-context behavior for your target context lengths before production deployment.
- Test with real documents, not synthetic benchmarks.
Use shorter context when:

- your task does not need extensive grounding
- cost and latency are primary constraints
- your most important content fits at the boundaries naturally
- you are serving high-volume queries where context savings multiply across calls

Use longer context when:

- you need to reason over large documents in full
- the document structure means important content is distributed throughout, not just at the ends
- you have verified your model handles the target length without positional degradation
- your retrieval curation is good enough that more context means better answers, not diluted attention
The bag holds everything. What Mary extracts depends on what is on top. Manage the top of your context, because that is what the model will prioritize.