A legal technology company had invested six months building a retrieval-augmented generation system to help contract attorneys find relevant precedent clauses across a corpus of 180,000 executed agreements. The system ingested contracts, chunked them into passages, embedded the passages into a vector store, and used semantic search to retrieve relevant clauses for a given query. The retrieval results were passed to a large language model that synthesized an answer with citations.
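In outline, that original pipeline looked something like the sketch below. The model name, function boundaries, and parameters are illustrative assumptions rather than the company's actual stack; the point is that retrieval was driven entirely by embedding similarity over plain text.

```python
# Minimal sketch of the original retrieve-then-synthesize flow (illustrative only).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_index(passages: list[str]) -> np.ndarray:
    """Embed every chunked passage; rows are unit-normalized vectors."""
    return model.encode(passages, normalize_embeddings=True)

def retrieve(query: str, passages: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    """Pure semantic search: rank passages by cosine similarity to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

# The top-k passages were then handed to an LLM prompt that synthesized an
# answer with citations -- omitted here, since the failure was upstream.
```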
The system worked well on test queries that resembled the examples the team had used during development. It failed in production in ways that were expensive to detect and dangerous to miss. Attorneys using the system reported that it returned plausible-sounding answers that were subtly wrong — not fabricated facts, but clauses retrieved from the wrong contract type, from governing law jurisdictions that did not apply, or from agreement versions that had been superseded by amendments. The system was confident. The system was wrong. And because the answers included citations, attorneys initially trusted them.
The failure mode
The root cause was not the language model. The language model faithfully summarized the passages it was given. The root cause was the retrieval layer. Semantic similarity, as measured by embedding distance, does not capture the structural and legal distinctions that matter in contract analysis.
A clause about limitation of liability in a software licensing agreement and a clause about limitation of liability in a professional services agreement may be semantically similar — they discuss the same legal concept, use overlapping terminology, and occupy similar positions in a contract. But they are not interchangeable. The software license clause may cap liability at the license fee. The services agreement may cap liability at the total contract value. An attorney who retrieves the wrong one and relies on it is making a drafting error that could cost their client millions.
The vector store had no notion of contract type, governing law, effective date, or amendment status. All passages were embedded as plain text. The semantic search returned passages that were linguistically similar to the query but legally irrelevant. The system could not distinguish between “relevant in meaning” and “relevant in context.”
Why the team missed this
The development team tested retrieval quality using standard information retrieval metrics: precision at k, recall at k, and mean reciprocal rank. These metrics measure whether the system returns documents that are topically related to the query. They do not measure whether the system returns documents that are legally applicable to the query’s context.
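For concreteness, the standard definitions are sketched below (an illustration of the kind of evaluation harness the team used, not their actual code). Nothing in these formulas asks about contract type, jurisdiction, or amendment status.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved passages that are judged relevant."""
    return sum(1 for p in retrieved[:k] if p in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant passages that appear in the top k."""
    return sum(1 for p in retrieved[:k] if p in relevant) / max(len(relevant), 1)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage; 0 if none is retrieved."""
    for i, p in enumerate(retrieved, start=1):
        if p in relevant:
            return 1.0 / i
    return 0.0
```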
The test set was built by the engineering team, not by attorneys. The engineers created queries like “show me limitation of liability clauses” and evaluated whether the system returned clauses that contained limitation of liability language. By that metric, the system performed well. But the attorneys were not asking for clauses that merely contained the right language. They were asking for clauses that applied to their specific deal — the right contract type, the right jurisdiction, the right time period. The test set never measured this.
This is the failure pattern that repeats in RAG systems built for domain-specific knowledge. The engineering team optimizes for semantic relevance. The domain experts need contextual relevance. These are different objectives, and semantic search alone cannot bridge the gap.
The rebuild: structured retrieval with metadata filtering
We redesigned the retrieval system to treat metadata as a first-class citizen alongside the text embeddings. Every passage was indexed not only by its semantic embedding but also by a structured metadata record that captured the distinctions the attorneys actually cared about.
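A minimal version of such a record might look like the following. The field names and value conventions are illustrative assumptions, not the firm's actual schema; what matters is that each field encodes a distinction attorneys use to judge applicability.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PassageMetadata:
    contract_id: str
    contract_type: str            # e.g. "software_license", "professional_services"
    governing_law: str            # e.g. "NY", "DE"
    effective_date: Optional[date]
    superseded: bool              # True if a later amendment replaces this version
    extraction_confidence: float  # confidence reported by the extraction pipeline, 0..1
```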
[Diagram of the rebuilt retrieval pipeline: query parser, semantic search, metadata filter, relevance ranker.]
The query parser extracted structured constraints from the attorney’s natural language query. When an attorney asked “show me limitation of liability clauses in software license agreements governed by New York law executed after 2022,” the parser identified three metadata filters: contract type equals software license, governing law equals New York, effective date after January 1, 2022. The semantic search retrieved candidate passages. The metadata filter eliminated passages that did not match the constraints. The relevance ranker re-scored the remaining candidates.
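Assuming the PassageMetadata record sketched above and a semantic search step that returns scored candidates, the filter-then-rerank stage can be as simple as the following. This is a hedged sketch, not the production implementation; the constraint names mirror the example query in the text.

```python
from datetime import date
from typing import Optional

def apply_filters(candidates, *, contract_type: Optional[str] = None,
                  governing_law: Optional[str] = None,
                  executed_after: Optional[date] = None):
    """Drop candidates that fail any structured constraint parsed from the query,
    then re-rank the survivors (here, simply by the original semantic score)."""
    kept = []
    for passage, meta, score in candidates:
        if contract_type and meta.contract_type != contract_type:
            continue
        if governing_law and meta.governing_law != governing_law:
            continue
        if executed_after and (meta.effective_date is None
                               or meta.effective_date <= executed_after):
            continue
        if meta.superseded:  # versions replaced by amendments never reach the ranker
            continue
        kept.append((passage, meta, score))
    return sorted(kept, key=lambda c: c[2], reverse=True)

# Constraints the parser would extract from the example query:
# apply_filters(candidates, contract_type="software_license",
#               governing_law="NY", executed_after=date(2022, 1, 1))
```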
This sounds like a simple filter-and-search pattern, and in some sense it is. The hard part was not the architecture. The hard part was the metadata extraction. Contracts are unstructured documents. Extracting contract type, governing law, effective date, and amendment status from 180,000 agreements required a dedicated extraction pipeline with domain-specific validation rules. The extraction pipeline itself became a substantial engineering project — one that the original RAG system had skipped entirely.
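The validation layer is where much of that effort went. A toy version with a couple of representative rules is shown below; the controlled vocabularies and specific checks are assumptions for illustration, not the firm's actual rule set.

```python
from datetime import date

# Controlled vocabularies the extractor validates against (illustrative values).
KNOWN_CONTRACT_TYPES = {"software_license", "professional_services", "nda", "msa"}
KNOWN_JURISDICTIONS = {"NY", "DE", "CA", "TX"}

def validate(meta: "PassageMetadata") -> list[str]:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    if meta.contract_type not in KNOWN_CONTRACT_TYPES:
        problems.append(f"unknown contract type: {meta.contract_type!r}")
    if meta.governing_law not in KNOWN_JURISDICTIONS:
        problems.append(f"unrecognized governing law: {meta.governing_law!r}")
    if meta.effective_date is not None and meta.effective_date > date.today():
        problems.append("effective date is in the future")
    return problems
```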
What we gave up
The first trade-off was coverage. The original system could answer any question about any contract with no setup required; the rebuilt system required well-structured metadata to function correctly. For contracts where the extraction pipeline could not confidently determine the metadata fields, the system either flagged low confidence or excluded the contract from retrieval results. Coverage dropped from one hundred percent of the corpus to roughly eighty-seven percent. The remaining thirteen percent required manual metadata annotation.
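In code terms, the coverage rule amounted to something like the routing sketch below; the threshold value and labels are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for trusting extracted metadata

def route_contract(extraction_confidence: float) -> str:
    """Decide whether a contract enters the retrieval index or a review queue."""
    if extraction_confidence >= CONFIDENCE_THRESHOLD:
        return "index"              # eligible for metadata-filtered retrieval
    return "manual_annotation"      # held back until an attorney confirms the fields
```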
The second trade-off was query flexibility. The original system accepted any natural language query. The rebuilt system worked best when queries included structural constraints — contract type, jurisdiction, date range. Queries that were purely semantic, like “show me unusual termination clauses,” still worked but without the metadata filtering benefit. The team addressed this by building query suggestion tools that helped attorneys add structural constraints to their searches.
The third trade-off was maintenance cost. The metadata extraction pipeline required ongoing attention as the contract corpus grew and as the firm’s contract templates evolved. This was an operational cost that did not exist in the original vector-only system.
Results
Attorney trust in the system increased measurably after the rebuild. The firm tracked “citation verification rate” — how often attorneys checked the system’s citations against the source contract. Before the rebuild, attorneys verified seventy-eight percent of citations, indicating high distrust. After the rebuild, verification rates dropped to thirty-one percent, indicating that attorneys trusted the system’s citations enough to stop checking most of them.
More importantly, the rebuild eliminated the specific failure mode that had caused the most harm: retrieval of legally inapplicable clauses. Post-rebuild error analysis showed that ninety-four percent of retrieved passages were from the correct contract type and jurisdiction, compared to forty-one percent before the rebuild.
The decision heuristic
If your RAG system retrieves passages that are topically relevant but contextually wrong, the problem is not the embedding model or the chunking strategy. The problem is that your retrieval layer has no representation of the structural distinctions your domain experts actually use to judge relevance. Before tuning embeddings or re-chunking documents, ask your domain experts what makes two passages non-interchangeable. The answer will tell you what metadata your retrieval system is missing. Build the metadata layer first. The retrieval quality will follow.