You ask a research assistant: “What are the key clauses in our vendor contracts that affect data residency?” The assistant does not know off the top of their head. They go to the document store, find relevant contracts, read the relevant sections, and come back with a synthesized answer. They did not generate the information from memory. They retrieved it and synthesized it. The assistant’s value is in finding the right documents and summarizing accurately.
RAG (Retrieval-Augmented Generation) is the same pattern applied to language models. The model does not rely solely on what it learned during training. It retrieves relevant documents, reads them, and generates an answer grounded in what it found. The retrieval step is explicit; the synthesis is the model’s job. The model is a research assistant that can read at scale.
The two-step structure matters. Retrieval is a search problem. Generation is a language problem. Separating them lets each component be evaluated and improved independently. You can swap your embedding model without changing your generation model. You can improve your retrieval relevance without retraining anything. The assistant’s ability to find documents is separate from their ability to summarize.
This separation also means you can debug each step separately. When the system produces a wrong answer, you can ask whether the failure was in retrieval (wrong documents found) or in synthesis (right documents, wrong interpretation). A monolithic system where the model retrieves and synthesizes in one step does not give you that diagnostic clarity. The assistant returned bad results; you need to know whether the assistant looked in the wrong place or read the wrong things.
The quality of the answer is bounded by the quality of the retrieval. Retrieve irrelevant documents and the model synthesizes from noise. This is the most common RAG failure mode. Teams spend months tuning the generation model and then discover that the retrieval was returning the wrong documents in the wrong order. Improving the retrieval produces immediate answer quality improvements that months of prompt tuning could not achieve. The assistant cannot summarize what they could not find.
Evaluation should happen at both stages independently. Measure retrieval quality with standard information retrieval metrics: recall (did you find the relevant documents?), precision (are the found documents actually relevant?), and MRR (mean reciprocal rank: how highly does the first relevant document rank?). Measure synthesis quality separately on a grounded evaluation set where retrieved documents are held constant. The assistant is evaluated on finding and on summarizing; different skills.
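The three retrieval metrics are a few lines each. A minimal sketch, with toy document IDs and relevance judgments standing in for a real evaluation set:

```python
# Toy retrieval metrics. `retrieved` is the ranked list of document IDs a
# query returned; `relevant` is the set of IDs a human judged relevant.
# The example IDs and judgments are illustrative, not from a real system.

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are actually relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document (0 if none retrieved).
    MRR is this value averaged over a query set."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]   # ranked system output
relevant = {"d2", "d4", "d5"}          # human judgments

print(recall_at_k(retrieved, relevant, 4))     # 2 of 3 relevant found
print(precision_at_k(retrieved, relevant, 4))  # 2 of 4 retrieved relevant
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2 -> 0.5
```

In practice these run over a held-out query set, and the per-query values are averaged; tracking them per query also surfaces the specific queries where retrieval fails.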
Retrieval quality depends on chunking strategy, embedding model choice, and the match between your query and your stored content. Chunk too small and you lose context; chunk too large and you dilute relevant content with irrelevant surrounding text. The embedding model determines what “relevant” means; a model trained on general text may not capture your domain’s key distinctions. The assistant’s search strategy matters as much as their reading ability.
Chunking is underappreciated. The right chunk size depends on the nature of your documents and the nature of your queries. Legal contracts have section-level structure that should inform chunk boundaries. Technical documentation often has hierarchical headings. A chunking strategy that ignores document structure produces chunks that are coherent for neither the embedding model nor the human reader. The assistant needs enough context to understand each chunk; a sentence torn from a paragraph is harder to summarize than the paragraph itself.
A good default is semantic chunking: split on meaningful boundaries (paragraphs, sections) rather than fixed token counts. Then evaluate whether the resulting chunks are coherent enough to be retrieved independently. A chunk that cannot stand alone as a meaningful unit of information will not produce good retrieval results. The chunk must make sense on its own; the assistant must be able to understand it without the surrounding text.
Overlap between chunks preserves context at boundaries. If a section header is critical for interpreting the following paragraph, and you chunk at the boundary, the header and paragraph may be separated. Small overlap (10-15% token overlap) between adjacent chunks preserves boundary context without significant redundancy. The assistant gets enough context to understand each chunk without reading the entire document twice.
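A minimal sketch of the paragraph-boundary-plus-overlap idea: pack paragraphs into chunks of roughly a target size and carry a tail of the previous chunk forward. Character counts stand in for token counts, and the sizes are illustrative defaults, not tuned recommendations.

```python
# Split on paragraph boundaries, pack paragraphs into chunks of roughly
# `max_chars`, and seed each new chunk with the tail of the previous one
# so boundary context (like a section header) is not lost.
# Note: a single paragraph longer than `max_chars` is not split further
# here; a real chunker would need a fallback for that case.

def chunk_paragraphs(text, max_chars=500, overlap_chars=60):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # Carry boundary context into the next chunk.
            current = current[-overlap_chars:] + "\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("Section 4.2: Data residency.\n\n"
       + "EU customer data stays in EU regions. " * 10)
for c in chunk_paragraphs(doc, max_chars=200, overlap_chars=40):
    print(len(c), repr(c[:50]))
```

Because the header paragraph is carried forward as overlap, the second chunk still opens with "Section 4.2: Data residency." and can be interpreted on its own.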
Hybrid search often outperforms either semantic or keyword search alone. BM25 keyword retrieval catches exact matches that embedding similarity misses. The vector component catches semantic similarity that keyword matching misses. The combination handles both semantic similarity and exact terminology. Queries about specific identifiers, product names, or code terms benefit from keyword matching. Queries about concepts and meaning benefit from embedding similarity. The assistant uses both the map and the index.
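One common way to combine the two rankings is reciprocal rank fusion (RRF), which avoids having to calibrate BM25 scores against cosine similarities. The sketch below hard-codes the two input rankings; in practice they would come from a BM25 index and an embedding index.

```python
# Hybrid retrieval via reciprocal rank fusion (RRF): each document's fused
# score is the sum of 1/(k + rank) over every ranking it appears in, so a
# document near the top of both lists beats one near the top of only one.
# k=60 is the conventional default smoothing constant.

def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d8"]   # exact-term matches (BM25-style)
vector_hits = ["d5", "d3", "d2"]    # semantic neighbours (embedding-style)

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)  # d3 ranks first: it appears high in both lists
```

Rank-based fusion is deliberately score-agnostic: it only needs the two orderings, which is why it survives swapping either retriever without retuning weights.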
Reranking and Refinement
Raw retrieval results are usually not the final input to synthesis. A reranking step takes the initially retrieved documents and reorders them based on additional relevance signals. The initial retrieval might return 20 documents. The reranker narrows to the 5 most relevant, possibly considering query-document term overlap, proximity of query terms, and document quality signals. The assistant does not read everything returned by the search; the assistant reads the most promising.
Cross-encoder rerankers evaluate query-document pairs jointly rather than independently. A bi-encoder embeds the query and document separately and computes similarity between embeddings. A cross-encoder takes the query and document together and produces a relevance score. Cross-encoders are slower but more accurate because they can attend to the full query-document interaction. The assistant reads each document carefully before deciding it is relevant.
Reranking adds latency but meaningfully improves synthesis quality. The trade-off is usually worth it for systems where answer quality matters more than immediate response time. A two-stage retrieval-then-rerank pipeline is a common production architecture. The assistant takes an extra minute to read carefully rather than skimming everything.
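The two-stage shape can be sketched end to end. Both stages here are toy word-overlap functions so the pipeline runs on its own; a real system would use an embedding index for stage one and a cross-encoder model for stage two, and the corpus below is invented for illustration.

```python
# Retrieve-then-rerank: a cheap recall-oriented first pass over the whole
# corpus, then a more careful precision-oriented pass over the survivors.

def first_stage(query, corpus, top_n=4):
    """Stage 1: rank by count of shared words (stand-in for vector search)."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_n]

def rerank(query, candidates, top_k=2):
    """Stage 2: shared-word fraction of the document (stand-in for a
    cross-encoder, which would score the (query, document) pair jointly)."""
    q = set(query.lower().split())
    def score(d):
        words = set(d.lower().split())
        return len(q & words) / len(words)
    return sorted(candidates, key=score, reverse=True)[:top_k]

corpus = [
    "data residency clauses for EU vendor contracts",
    "vendor onboarding checklist and contacts",
    "EU data residency policy summary",
    "office seating chart",
    "US data transfer requirements",
]
query = "EU data residency"
top2 = rerank(query, first_stage(query, corpus))
print(top2)
```

The division of labor is the point: stage one must be fast enough to scan everything, stage two only has to be fast enough for the handful of candidates stage one passes along.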
Query expansion is another refinement. A user query like “data residency” might be expanded to include related terms (“data localization,” “cross-border data transfer,” “data sovereignty”) before retrieval. This increases recall for queries where the exact term is not what the document uses. Query expansion requires domain knowledge to do well; expanding with irrelevant terms degrades retrieval quality. The assistant searches under related headings, not just the exact phrase they were given.
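In its simplest form, expansion is a lookup against a curated synonym table. The table below is an illustrative stub; real expansion terms come from domain glossaries, query logs, or an LLM, and as noted above, bad expansions hurt more than they help.

```python
# Query expansion from a hand-maintained synonym table. Each expanded
# term is issued as an additional retrieval query (or OR-ed into one).

EXPANSIONS = {
    "data residency": ["data localization",
                       "cross-border data transfer",
                       "data sovereignty"],
}

def expand_query(query):
    terms = [query]  # always keep the original query first
    for phrase, synonyms in EXPANSIONS.items():
        if phrase in query.lower():
            terms.extend(synonyms)
    return terms

print(expand_query("data residency for EU customers"))
```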
Query decomposition breaks complex queries into sub-queries. “What is our policy on data residency for EU customers and how does it compare to US requirements?” might decompose into “EU data residency policy” and “US data residency requirements.” Retrieve for each sub-query, then synthesize the combined results. This handles multi-part questions that single queries retrieve poorly. The assistant breaks the question into parts before searching.
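The decompose-retrieve-pool control flow looks like this. Decomposition itself is usually an LLM call; `decompose` here is a hard-coded stub, and the tiny index is invented, so only the surrounding pipeline shape is meant literally.

```python
# Decompose a multi-part question into sub-queries, retrieve for each,
# and pool the results (deduplicated, order-preserving) for synthesis.

QUERY = ("What is our EU data residency policy and how does it compare "
         "to US requirements?")

def decompose(query):
    # Stub standing in for an LLM call that splits multi-part questions.
    if query == QUERY:
        return ["EU data residency policy", "US data residency requirements"]
    return [query]

def retrieve(sub_query, index):
    """Toy retriever: any word overlap counts as a hit."""
    q = set(sub_query.lower().split())
    return [doc for doc in index if q & set(doc.lower().split())]

index = ["EU data residency policy summary",
         "US data residency requirements overview",
         "cafeteria menu"]

pooled = []
for sub in decompose(QUERY):
    for doc in retrieve(sub, index):
        if doc not in pooled:
            pooled.append(doc)
print(pooled)
```

The pooled documents then go to a single synthesis call, which is what lets the model compare the EU and US material side by side.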
Grounding and Citation
RAG answers can cite their sources because the retrieved documents are explicit. This is valuable for auditability and for user trust. “The answer came from Section 4.2 of Contract X” is better than “the model said so.” Citation enables users to verify the answer, which is especially important when the answer has consequences. The assistant shows their work.
Citation requires that the synthesis stay close to the retrieved documents. If the model generates an answer that goes beyond the documents and the citation still points to those documents, the citation is misleading. Good RAG systems include a citation verification step that checks whether the generated answer is actually supported by the cited passages. The assistant does not add conclusions that the sources do not support.
Attribution is different from citation. Citation means “this passage informed the answer.” Attribution means “this passage is the source of this specific claim.” The distinction matters for downstream use. If a claim in the answer needs to be verified for legal purposes, attribution to a specific passage is more useful than general citation to a document. The assistant distinguishes between general guidance and specific evidence.
Grounding quality degrades when the retrieved documents do not actually support the generated answer. This can happen when the retrieval is tangentially related but not directly supportive. A claim verification step that checks generated claims against source passages catches this. Without it, you have confident synthesis that is not actually grounded. The assistant verifies each claim before stating it.
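A crude version of that verification step can be sketched with word overlap: flag any answer sentence whose content words are not substantially covered by some retrieved passage. Real systems use an NLI model or an LLM judge for this; overlap is only a cheap first filter, and the stopword list and threshold below are arbitrary illustrative choices.

```python
# Flag answer sentences not supported by any retrieved passage.

STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "for"}

def content_words(text):
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def unsupported_sentences(answer, passages, threshold=0.6):
    flagged = []
    for sentence in answer.split(". "):
        words = content_words(sentence)
        if not words:
            continue
        # Best coverage of this sentence's words by any single passage.
        best = max(len(words & content_words(p)) / len(words)
                   for p in passages)
        if best < threshold:
            flagged.append(sentence)
    return flagged

passages = ["EU customer data must remain in EU regions under Section 4.2"]
answer = ("EU customer data must remain in EU regions. "
          "Violations incur a fine of 2 million euros")
print(unsupported_sentences(answer, passages))
```

The second sentence is flagged: it is exactly the kind of confident-but-ungrounded addition the check exists to catch, and in a production system it would be dropped, rewritten, or routed back for re-retrieval.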
When RAG Is the Wrong Architecture
RAG adds retrieval complexity. For tasks where the model already knows what it needs, RAG is overhead. If you are asking a model to write a creative story, retrieval grounding is irrelevant. If you are asking a model to explain a well-established concept that is in its training data, RAG adds cost and latency for no benefit. The assistant does not need to look things up for general knowledge.
RAG is also wrong when your retrieval quality is low. If your corpus is poorly organized, your chunking is wrong, or your embedding model is a poor fit, RAG will return bad documents and the model will generate from bad inputs. Improving retrieval before adding RAG is the right sequence. A RAG system is only as good as its retrieval. The assistant cannot summarize documents they could not find.
For closed-domain questions where the answer is a specific fact in a specific document, a good keyword search may outperform embedding retrieval. RAG shines when the questions are semantic and the answers require reasoning across multiple documents. It adds less value when the questions are lookup questions with precise answers. The assistant is better at summarizing than at finding specific numbers.
Long-context models are an alternative to explicit RAG for some use cases: a model with a 128k-token context window can ingest many documents directly instead of retrieving a subset. The trade-off is cost. Long-context inference is more expensive than retrieval-augmented short-context inference, so at scale, explicit RAG is usually cheaper. The assistant reads the whole library; or the assistant finds the right book.
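A back-of-envelope comparison makes the scale argument concrete. Every number below is an assumption for illustration (per-token price, document sizes, chunk counts), not real pricing.

```python
# Assumed figures, for illustration only.
price_per_1k_input_tokens = 0.003   # USD, hypothetical model pricing

docs_in_corpus = 100
tokens_per_doc = 1_000
chunks_retrieved = 5
tokens_per_chunk = 400

long_context_tokens = docs_in_corpus * tokens_per_doc   # stuff everything in
rag_tokens = chunks_retrieved * tokens_per_chunk        # retrieve, then read

def cost(tokens):
    return tokens / 1_000 * price_per_1k_input_tokens

print(f"long-context per query: ${cost(long_context_tokens):.3f}")
print(f"RAG per query:          ${cost(rag_tokens):.4f}")
```

Under these assumptions the per-query input cost differs by 50x; retrieval infrastructure has a fixed cost, but it amortizes quickly once query volume grows.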
Use RAG when:

- your application needs grounded, factual answers
- the model needs access to current or proprietary information beyond its training cutoff
- source citation and audit trails matter
- the task requires specific document content rather than general knowledge
- your retrieval quality is demonstrably good

Do not use RAG when:

- the task is purely generative or creative (no grounding needed)
- the model already knows what it needs (general knowledge tasks)
- retrieval latency and cost exceed the value of grounding
- your retrieval quality is low (improve retrieval before adding RAG)
- you need lookup-style exact-match answers (keyword search may be better)
The research assistant’s value depends on finding the right documents. Build your retrieval before you rely on the synthesis. A synthesizer working from bad documents produces bad answers, and the synthesis looks confident because it does not know it is working from noise. The assistant must search well before they can summarize well.