You have a 600-page book on regulatory compliance. You do not read it front to back. You scan the table of contents, identify the chapters relevant to your current question, read those chapters closely, and note the page numbers where the details live. When a new regulation applies next quarter, you know which chapter to revisit. The chapter structure is not arbitrary decoration; it is an organizing system that makes the book navigable. Without chapters, you would have to read the entire book to answer any question. With chapters, you can find the relevant section in seconds.
Chunking is the same practice applied to storing and retrieving documents for AI systems. Long documents get broken into segments, each chunk stored alongside a reference to its source. When a query comes in, the system retrieves the most relevant chunks and reads them to generate an answer. The goal is the same as book chapters: enable fast navigation to the content that answers the question, without requiring the system to process the entire document for every query. The chapter method works because you know roughly where to look. Chunking succeeds when your queries are specific enough to narrow the haystack.
But here is what many teams miss: the chapter method only works when the chapters are organized around the questions readers actually ask. A cookbook organized by type of dish (appetizers, entrees, desserts) works well for “what is a good dessert for a dinner party?” It works poorly for “what recipes can I make with leftover chicken?” The organization serves some queries and fails others. Chunking inherits this limitation. Your chunk boundaries should reflect your query patterns, not just your document structure.
The Size Problem
Chunk size determines what you can find and what you can understand. Too small and you lose context. A sentence about “the variance calculated under section 4.2” makes no sense if the preceding paragraph explaining what variance means got stored in a different chunk. The model reading this fragment does not know what variance is or why section 4.2 matters. It has the answer without the explanation. The recipe is in the book, but you cannot use it because the ingredient list was split across pages in different chapters.
Too large and you dilute relevant content with noise. A chunk that contains three pages of background plus one relevant paragraph will deliver the noise along with the signal. When the retrieved chunk is mostly irrelevant, the model wastes context window capacity on content that does not help answer the query. The retrieval system returns something, but the something is not useful. You asked for the recipe for bread; the system returned the entire chapter on grain agriculture.
The right chunk size depends on the content structure and the retrieval use case. Narrative prose can often be chunked by paragraph or fixed token length. Structured documents with clear sections benefit from respecting those boundaries. A legal contract with clearly numbered sections should be chunked at the section level, not at arbitrary token boundaries that split definitions from references. Code retrieval may work better with entire function or class boundaries preserved, because code depends on context that crosses line boundaries. A function that calls another function expects that function to be present; splitting them across chunks breaks the code’s semantic continuity.
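For the code case, one way to keep function and class boundaries intact is to let a parser find them. The sketch below assumes the code being indexed is Python and uses the standard library `ast` module; `function_chunks` and the sample source are hypothetical names made up for illustration, and nested definitions and module-level statements are ignored.

```python
import ast


def function_chunks(source: str) -> dict[str, str]:
    """Return one chunk per top-level function or class, keyed by name."""
    tree = ast.parse(source)
    chunks = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            chunks[node.name] = ast.get_source_segment(source, node)
    return chunks


source = (
    "def area(r):\n"
    "    return 3.14 * r * r\n"
    "\n"
    "def volume(r, h):\n"
    "    return area(r) * h\n"
)
chunks = function_chunks(source)
for name, text in chunks.items():
    print(f"--- {name} ---\n{text}")
```

Each chunk is a whole definition, so `volume` is never cut in half. It still calls `area`, which lives in another chunk, so the cross-chunk dependency described above remains.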
Fixed token chunking is simple to implement but naive. It ignores content structure entirely. A 500-token chunk that happens to split across a critical definition and its first use delivers half the relevant content and half noise. Semantic chunking that identifies natural break points produces better retrieval at the cost of more complex preprocessing. The preprocessing investment pays off in retrieval quality, especially for documents where structure carries meaning.
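A minimal sketch of the fixed-size approach shows how a definition gets cut. It assumes whitespace-separated words as a stand-in for real model tokens; `fixed_size_chunks` and the sample text are illustrative, not taken from any particular library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words.

    Words stand in for tokens here (an assumption); a real pipeline
    would count tokens with the model's own tokenizer.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


doc = (
    "Variance means the gap between budgeted and actual cost. "
    "The variance must be calculated under section 4.2."
)
chunks = fixed_size_chunks(doc, size=6)
for chunk in chunks:
    print(repr(chunk))
```

With a size of 6, the first chunk ends at “budgeted”, so the definition of variance is cut mid-sentence and its remainder lands in the next chunk alongside the start of the rule that uses it.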
What Gets Lost
Chunking inherently fragments context. Relationships between chunks, the structure of arguments across sections, and the hierarchy of general principle to specific exception all become harder for a retrieval system to use. The retrieval system can only return individual chunks. It cannot express the relationship between a principle in one chunk and an exception in another. The book has an index that tells you where to look; the chunked document has no index that tells you how the pieces relate.
Some systems add cross-chunk metadata or summary fields to preserve some of this structure. A chunk might include a summary field that captures what precedes and follows it. This adds overhead but helps the model understand the chunk’s place in the larger document. The simpler approach is to accept the fragmentation and design queries to be specific enough that individual chunks provide sufficient context. Neither solution fully restores the lost structure, but both reduce the damage.
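One possible shape for such metadata is sketched below. The field names (`before`, `after`), the `Chunk` class, and the first-sentence "summary" are all assumptions made for illustration; a production system would more likely generate the summaries with a model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str        # document the chunk came from
    position: int      # index of the chunk within that document
    before: str = ""   # short note on what precedes this chunk
    after: str = ""    # short note on what follows this chunk


def first_sentence(text: str) -> str:
    """Crude stand-in for a summarizer: keep text up to the first period."""
    head, sep, _ = text.partition(".")
    return (head + sep).strip()


def with_neighbor_context(texts: list[str], source: str) -> list[Chunk]:
    """Attach a one-sentence note about each neighbor to every chunk."""
    chunks = []
    for i, text in enumerate(texts):
        before = first_sentence(texts[i - 1]) if i > 0 else ""
        after = first_sentence(texts[i + 1]) if i + 1 < len(texts) else ""
        chunks.append(Chunk(text, source, i, before, after))
    return chunks


texts = [
    "Administer aspirin for chest pain.",
    "However, do not administer aspirin to patients with bleeding disorders. "
    "Consider alternatives.",
]
chunks = with_neighbor_context(texts, source="guideline.txt")  # hypothetical file name
print(chunks[0].after)
```

If the first chunk is retrieved alone, its `after` field still signals that an exception follows, which gives the model a reason to be cautious.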
When you retrieve a chunk, you retrieve a fragment, not the full argument. The model reading the fragment must reconstruct enough context to understand what the fragment means. If the fragment was extracted from a longer argument, the model has incomplete information. It may misinterpret the fragment’s intent, especially if the fragment appears to support a conclusion that the full argument qualified or rejected. A retrieved sentence that says “this approach is generally preferred” may have been extracted from a paragraph that said “this approach is generally preferred except in cases involving nuclear materials.” The fragment preserves the preference; it loses the exception. The reader draws the wrong conclusion from correct-sounding text.
Consider a medical guideline document. The full text says: “Administer aspirin for chest pain. However, if the patient has a history of bleeding disorders, do not administer aspirin and consider alternative treatments.” The retrieval system returns the first sentence because “chest pain” matched the query. The model recommends aspirin without the critical exception. The patient with the bleeding disorder receives inappropriate treatment based on a fragment that preserved the rule but lost the caveat.
This is the fundamental trade-off. Chunking enables retrieval by making documents searchable. It degrades understanding by breaking documents apart. The more aggressively you chunk for searchability, the more you sacrifice coherence. Finding the right chunk size means balancing these competing pressures. The question is not “how do we chunk everything perfectly” but “how do we chunk for the queries we actually receive.”
The Boundary Problem
Deciding where to break chunks is not neutral. A naive chunker that slices every 500 tokens without regard for content structure will produce chunks that end mid-sentence, mid-paragraph, or mid-argument. The retrieval system then delivers fragments that are syntactically incomplete or logically partial. The sentence stops mid-word because the chunker hit its token limit. The model receives a broken sentence and tries to make sense of it.
Consider a chunk that begins mid-sentence: “the variance must be calculated under section 4.2 of the applicable standard.” Without the preceding context that defined what “variance” means in this document, the chunk is ambiguous. Does this refer to statistical variance, budget variance, or some domain-specific variance? The document probably defined it earlier, but that definition is in a different chunk. The model may settle on the wrong meaning of variance because the definition that would disambiguate it was split off.
Semantic chunking attempts to identify natural break points: where a topic shifts, where a new argument begins, where a section concludes. This produces more coherent chunks but requires more processing to identify boundaries. The investment is usually worth it for documents with clear structure like legal contracts, regulatory filings, or academic papers. It is less worth it for simple collections like FAQs or product listings where items are independent. A FAQ where each question is independent does not need sophisticated chunking; a legal contract where sections reference each other does.
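A simple approximation of this idea treats blank lines as the natural break points and never cuts inside a paragraph. This is a sketch under that assumption, not topic-shift detection; `paragraph_chunks` is a hypothetical helper and words again stand in for tokens.

```python
def paragraph_chunks(text: str, max_words: int) -> list[str]:
    """Group whole paragraphs into chunks of at most `max_words` words.

    A paragraph longer than the budget becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "Variance is defined here.\n\nIt applies under section four.\n\nAppeals follow."
for chunk in paragraph_chunks(text, max_words=9):
    print(repr(chunk))
```

The first two paragraphs fit the budget together and stay in one chunk, so the definition and its first use travel together; the third paragraph starts a new chunk.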
For nested documents, hierarchical chunking preserves structure by chunking at multiple levels: section chunks, subsection chunks, paragraph chunks. The retrieval system can then return the appropriate level based on query specificity. This adds complexity but preserves more of the original structure. A terms of service document might be chunked at the section level for queries like “what is the liability cap?” but at the paragraph level for queries like “what happens if I dispute a charge?” The hierarchical approach supports both, returning larger chunks for broad queries and smaller chunks for specific ones.
The hierarchy itself encodes meaning. Section headings tell the model what topics are distinct. Nested headings tell the model what is a subsection of what. This structural metadata is lost when you flatten everything to a single chunking level. Hierarchical chunking preserves the tree; flat chunking preserves only the leaves.
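The tree can be kept explicitly. The sketch below assumes headings are marked with leading `#` characters, which is an assumption about the input format rather than something the approach requires; `Node`, `build_tree`, and the sample terms-of-service lines are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    body: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)

    def full_text(self) -> str:
        """Text of this node plus all descendants: the large-chunk view."""
        parts = [self.title, *self.body]
        parts += [child.full_text() for child in self.children]
        return "\n".join(p for p in parts if p)


def build_tree(lines: list[str]) -> Node:
    """Parse '#'-prefixed headings into a tree; other lines become body text."""
    root = Node(title="", level=0)
    stack = [root]
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            node = Node(title=stripped.lstrip("#").strip(), level=level)
            # Pop back up to the nearest ancestor with a shallower level
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif stripped:
            stack[-1].body.append(stripped)
    return root


lines = [
    "# Liability",
    "General terms.",
    "## Cap",
    "Liability is capped at fees paid.",
    "## Disputes",
    "Dispute a charge within 30 days.",
]
section = build_tree(lines).children[0]
print(section.full_text())              # broad query: the whole section
print(section.children[1].full_text())  # specific query: one subsection
```

A retriever can index the leaves for matching and walk up the tree when a query is broad enough to need the enclosing section.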
The Query-Chunk Match Problem
Retrieval quality depends on whether your chunks align with what queries are asking. A chunking strategy that works well for “what is the penalty for late filing” may work poorly for “how do I appeal a penalty.” The first query is specific and factual; the retrieved chunk likely contains the penalty amount. The second query is procedural; answering it may require synthesizing steps from multiple sections. No single chunk contains the full procedure. The retrieval system returns fragments of the procedure; the model must assemble them.
This is only partly a recall problem. Even when every chunk containing an individual step is retrieved, the system must still synthesize across them. This works when the synthesis is straightforward. It breaks down when the chunks do not contain enough context for the model to connect them correctly. If step three assumes knowledge from step one, but steps one and three are in different chunks, the model may miss the connection.
Testing chunking strategy requires representative queries, not just representative documents. If your queries ask about procedures, ensure your chunks preserve procedural continuity. If your queries ask about definitions, ensure definitions are not split across chunks. The retrieval system cannot reconstruct coherent answers from incoherent fragments. You must design chunks for the queries you will receive, not for the documents in the abstract. The documents do not know what you will ask; you must anticipate.
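A small harness makes this kind of test concrete. The sketch below scores a chunking by how often the top retrieved chunk contains a phrase the answer needs. The word-overlap scorer is a toy stand-in for an embedding search, and the queries, chunks, penalty amount, and deadline are all invented sample data.

```python
def score(query: str, chunk: str) -> int:
    """Count distinct query words that appear in the chunk (toy relevance)."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in chunk_words)


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k highest-scoring chunks, ties broken by position."""
    ranked = sorted(range(len(chunks)), key=lambda i: (-score(query, chunks[i]), i))
    return [chunks[i] for i in ranked[:k]]


def hit_rate(cases: list[tuple[str, str]], chunks: list[str], k: int = 1) -> float:
    """Fraction of (query, expected_phrase) cases where some retrieved
    chunk contains the expected phrase."""
    hits = 0
    for query, expected in cases:
        if any(expected in c for c in retrieve(query, chunks, k)):
            hits += 1
    return hits / len(cases)


cases = [
    ("penalty for late filing", "$500"),
    ("how to appeal a penalty", "within 30 days"),
]
coarse = [
    "The penalty for late filing is $500. "
    "To appeal a penalty, file a notice within 30 days."
]
fine = [
    "The penalty for late filing is $500.",
    "To appeal a penalty, file a notice",
    "within 30 days.",
]
print(hit_rate(cases, coarse), hit_rate(cases, fine))
```

On this sample the coarse chunking scores 1.0 and the fine chunking 0.5: the factual lookup survives the split, while the procedural query retrieves a chunk that lost its deadline. With real data, the same comparison can be run for each candidate chunking strategy.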
Query analysis helps. If you know that 80% of queries are specific factual lookups, you can optimize for small, precise chunks. If 80% are broad topic explorations, you need larger chunks with more surrounding context. A retrieval system optimized for one query type may perform poorly on another. The mismatch between chunk design and query type is a common failure mode. Teams design chunks based on document structure, then discover their chunks do not serve their queries.
Overlapping Chunks
A technique that helps: overlapping chunk boundaries. If each chunk includes the last 100 tokens of the previous chunk and the first 100 tokens of the next chunk, you reduce the risk of losing critical context at boundaries. The overlap provides continuity. When a relevant passage appears at a chunk boundary, the overlapping tokens ensure that context is preserved across retrievals. The passage at the boundary appears in both chunks, complete with surrounding context.
This is especially useful for legal documents, technical specifications, and other content where a term is defined shortly before it is used. A chunk that contains a referenced term plus its definition will retrieve better than a chunk that contains only the reference. When a definition and a nearby reference would otherwise be split across chunks, the overlap can capture both, so the term and its meaning travel together. Overlap does not help when the definition sits many pages away from the reference.
The overlap size is a tuning parameter. Too much overlap duplicates too much content across chunks, diluting retrieval precision. If 80% of every chunk is overlap, the retrieval system returns mostly redundant content with little additional signal. Too little overlap fails to capture the boundary cases where context matters most. For most documents, 10-20% overlap strikes a reasonable balance. The exact percentage depends on how much context the domain requires and how frequently boundary cases arise.
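Overlap is commonly implemented as a sliding window whose step is smaller than its size, so adjacent chunks share tokens. The sketch below assumes the input is already tokenized; `overlapping_chunks` is a hypothetical helper, and the 5-token window with 1 token of overlap (20%) is chosen only to keep the example small.

```python
def overlapping_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in overlapping_chunks(tokens, size=5, overlap=1):
    print(chunk)
```

Token `t4` appears at the end of the first chunk and the start of the second, so a passage sitting on that boundary is present in both.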
Parent Chunk Retrieval
An alternative approach: embed and match on small chunks, but store with each chunk a reference to its parent section or document. When a query matches a chunk, retrieve the chunk. But also follow the reference to the parent document or section, and give the model both. This provides the precision of small chunks for retrieval with the context of larger units for generation.
This hybrid approach adds storage overhead but improves generation quality. The retrieval step finds the relevant passage. The generation step has access to the broader context to ensure the answer is complete and coherent. The trade-off is storage cost and retrieval complexity; the benefit is better answers for queries that require context. The model sees both the tree and the forest; it can place the leaf in context.
Consider a regulatory document with many sections. A query about reporting requirements retrieves a specific paragraph about quarterly reports. The parent chunk provides the section context that explains what department is responsible, what happens if reports are late, and how reports relate to annual disclosures. The model generates an answer that includes this broader context, not just the bare paragraph about timing.
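The lookup can be sketched as a small chunk that carries the key of its parent. The word-overlap scorer is again a toy stand-in for embedding similarity, and `ChildChunk`, `retrieve_with_parent`, the section keys, and the sample regulation text are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    text: str
    parent_id: str  # key of the section this chunk was cut from


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: number of distinct query words found in the text."""
    words = set(text.lower().split())
    return len(set(query.lower().split()) & words)


def retrieve_with_parent(
    query: str, children: list[ChildChunk], parents: dict[str, str]
) -> tuple[str, str]:
    """Match on small chunks, then return (matched chunk, parent section)."""
    best = max(children, key=lambda c: overlap_score(query, c.text))
    return best.text, parents[best.parent_id]


parents = {
    "sec-3": "Section 3 Reporting. Quarterly reports are due 30 days after "
             "quarter end. Late reports incur a fee.",
    "sec-4": "Section 4 Audits. Audits occur annually.",
}
children = [
    ChildChunk("Quarterly reports are due 30 days after quarter end.", "sec-3"),
    ChildChunk("Late reports incur a fee.", "sec-3"),
    ChildChunk("Audits occur annually.", "sec-4"),
]
chunk, section = retrieve_with_parent("when are quarterly reports due", children, parents)
print(chunk)
print(section)
```

The match is made on the short timing sentence, but the model also receives the full section, including the late-fee rule that the matched chunk does not mention.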
Decision Rules
Use chunking when:
- Your documents exceed what can fit in a context window
- Queries tend to ask about specific topics or sections
- The retrieval task is find-and-synthesize rather than full-document reasoning
- Document structure allows meaningful boundaries to be identified
Do not over-chunk when:
- Your documents are already short (one page or less)
- Queries tend to ask about the document as a whole
- Document structure matters for the answer (preserve section boundaries)
Design chunk boundaries by:
- Respecting semantic units (paragraphs, sections, code functions)
- Testing with representative queries, not just representative documents
- Considering overlapping chunks for documents with heavy cross-references
- Using hierarchical chunking when documents have clear nested structure
- Storing parent document context for hybrid retrieval when context matters
The chapter method works because you know roughly where to look. Chunking succeeds when your queries are specific enough to narrow the haystack. If your users ask broad questions about undifferentiated topics, chunking will not solve your retrieval problems.