A legal technology company had invested six months building a retrieval-augmented generation system to help contract attorneys find relevant precedent clauses across a corpus of 180,000 executed agreements. The system ingested contracts, chunked them into passages, embedded the passages into a vector store, and used semantic search to retrieve relevant clauses for a given query. The retrieval results were passed to a large language model that synthesized an answer with citations.
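In outline, that original pipeline looked something like the sketch below. The model name, function boundaries, and parameters are illustrative assumptions rather than the company's actual stack; the point is that retrieval was driven entirely by embedding similarity over plain text.

```python
# Minimal sketch of the original retrieve-then-synthesize flow (illustrative only).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_index(passages: list[str]) -> np.ndarray:
    """Embed every chunked passage; rows are unit-normalized vectors."""
    return model.encode(passages, normalize_embeddings=True)

def retrieve(query: str, passages: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    """Pure semantic search: rank passages by cosine similarity to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

# The top-k passages were then handed to an LLM prompt that synthesized an
# answer with citations -- omitted here, since the failure was upstream.
```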
The system worked well on test queries that resembled the examples the team had used during development. It failed in production in ways that were expensive to detect and dangerous to miss. Attorneys using the system reported that it returned plausible-sounding answers that were subtly wrong — not fabricated facts, but clauses retrieved from the wrong contract type, from governing law jurisdictions that did not apply, or from agreement versions that had been superseded by amendments. The system was confident. The system was wrong. And because the answers included citations, attorneys initially trusted them.
The failure mode
The root cause was not the language model. The language model faithfully summarized the passages it was given. The root cause was the retrieval layer. Semantic similarity, as measured by embedding distance, does not capture the structural and legal distinctions that matter in contract analysis.
A clause about limitation of liability in a software licensing agreement and a clause about limitation of liability in a professional services agreement may be semantically similar — they discuss the same legal concept, use overlapping terminology, and occupy similar positions in a contract. But they are not interchangeable. The software license clause may cap liability at the license fee. The services agreement may cap liability at the total contract value. An attorney who retrieves the wrong one and relies on it is making a drafting error that could cost their client millions.
The vector store had no notion of contract type, governing law, effective date, or amendment status. All passages were embedded as plain text. The semantic search returned passages that were linguistically similar to the query but legally irrelevant. The system could not distinguish between “relevant in meaning” and “relevant in context.”
Why the team missed this
The development team tested retrieval quality using standard information retrieval metrics: precision at k, recall at k, and mean reciprocal rank. These metrics measure whether the system returns documents that are topically related to the query. They do not measure whether the system returns documents that are legally applicable to the query’s context.
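For concreteness, the standard definitions are sketched below (an illustration of the kind of evaluation harness the team used, not their actual code). Nothing in these formulas asks about contract type, jurisdiction, or amendment status.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved passages that are judged relevant."""
    return sum(1 for p in retrieved[:k] if p in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant passages that appear in the top k."""
    return sum(1 for p in retrieved[:k] if p in relevant) / max(len(relevant), 1)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage; 0 if none is retrieved."""
    for i, p in enumerate(retrieved, start=1):
        if p in relevant:
            return 1.0 / i
    return 0.0
```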
The test set was built by the engineering team, not by attorneys. The engineers created queries like “show me limitation of liability clauses” and evaluated whether the system returned clauses that contained limitation of liability language. By that metric, the system performed well. But the attorneys were not asking for clauses that merely contained the right language. They were asking for clauses that applied to their specific deal — the right contract type, the right jurisdiction, the right time period. The test set never measured this.
This is the failure pattern that repeats in RAG systems built for domain-specific knowledge. The engineering team optimizes for semantic relevance. The domain experts need contextual relevance. These are different objectives, and semantic search alone cannot bridge the gap.
The rebuild: structured retrieval with metadata filtering
We redesigned the retrieval system to treat metadata as a first-class citizen alongside the text embeddings. Every passage was indexed not only by its semantic embedding but also by a structured metadata record that captured the distinctions the attorneys actually cared about.
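A minimal version of such a record might look like the following. The field names and value conventions are illustrative assumptions, not the firm's actual schema; what matters is that each field encodes a distinction attorneys use to judge applicability.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PassageMetadata:
    contract_id: str
    contract_type: str            # e.g. "software_license", "professional_services"
    governing_law: str            # e.g. "NY", "DE"
    effective_date: Optional[date]
    superseded: bool              # True if a later amendment replaces this version
    extraction_confidence: float  # confidence reported by the extraction pipeline, 0..1
```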
[Diagram of the rebuilt retrieval pipeline: query parser, semantic search, metadata filter, relevance ranker.]
The query parser extracted structured constraints from the attorney’s natural language query. When an attorney asked “show me limitation of liability clauses in software license agreements governed by New York law executed after 2022,” the parser identified three metadata filters: contract type equals software license, governing law equals New York, effective date after January 1, 2022. The semantic search retrieved candidate passages. The metadata filter eliminated passages that did not match the constraints. The relevance ranker re-scored the remaining candidates.
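Assuming the PassageMetadata record sketched above and a semantic search step that returns scored candidates, the filter-then-rerank stage can be as simple as the following. This is a hedged sketch, not the production implementation; the constraint names mirror the example query in the text.

```python
from datetime import date
from typing import Optional

def apply_filters(candidates, *, contract_type: Optional[str] = None,
                  governing_law: Optional[str] = None,
                  executed_after: Optional[date] = None):
    """Drop candidates that fail any structured constraint parsed from the query,
    then re-rank the survivors (here, simply by the original semantic score)."""
    kept = []
    for passage, meta, score in candidates:
        if contract_type and meta.contract_type != contract_type:
            continue
        if governing_law and meta.governing_law != governing_law:
            continue
        if executed_after and (meta.effective_date is None
                               or meta.effective_date <= executed_after):
            continue
        if meta.superseded:  # versions replaced by amendments never reach the ranker
            continue
        kept.append((passage, meta, score))
    return sorted(kept, key=lambda c: c[2], reverse=True)

# Constraints the parser would extract from the example query:
# apply_filters(candidates, contract_type="software_license",
#               governing_law="NY", executed_after=date(2022, 1, 1))
```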
This sounds like a simple filter-and-search pattern, and in some sense it is. The hard part was not the architecture. The hard part was the metadata extraction. Contracts are unstructured documents. Extracting contract type, governing law, effective date, and amendment status from 180,000 agreements required a dedicated extraction pipeline with domain-specific validation rules. The extraction pipeline itself became a substantial engineering project — one that the original RAG system had skipped entirely.
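The validation layer is where much of that effort went. A toy version with a couple of representative rules is shown below; the controlled vocabularies and specific checks are assumptions for illustration, not the firm's actual rule set.

```python
from datetime import date

# Controlled vocabularies the extractor validates against (illustrative values).
KNOWN_CONTRACT_TYPES = {"software_license", "professional_services", "nda", "msa"}
KNOWN_JURISDICTIONS = {"NY", "DE", "CA", "TX"}

def validate(meta: "PassageMetadata") -> list[str]:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    if meta.contract_type not in KNOWN_CONTRACT_TYPES:
        problems.append(f"unknown contract type: {meta.contract_type!r}")
    if meta.governing_law not in KNOWN_JURISDICTIONS:
        problems.append(f"unrecognized governing law: {meta.governing_law!r}")
    if meta.effective_date is not None and meta.effective_date > date.today():
        problems.append("effective date is in the future")
    return problems
```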
What we gave up
The first trade-off was coverage. The original system could answer any question about any contract with no setup required; the rebuilt system required well-structured metadata to function correctly. For contracts where the extraction pipeline could not confidently determine the metadata fields, the system either flagged low confidence or excluded the contract from retrieval results. Coverage dropped from one hundred percent of the corpus to roughly eighty-seven percent. The remaining thirteen percent required manual metadata annotation.
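In code terms, the coverage rule amounted to something like the routing sketch below; the threshold value and labels are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for trusting extracted metadata

def route_contract(extraction_confidence: float) -> str:
    """Decide whether a contract enters the retrieval index or a review queue."""
    if extraction_confidence >= CONFIDENCE_THRESHOLD:
        return "index"              # eligible for metadata-filtered retrieval
    return "manual_annotation"      # held back until an attorney confirms the fields
```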
The second trade-off was query flexibility. The original system accepted any natural language query. The rebuilt system worked best when queries included structural constraints — contract type, jurisdiction, date range. Queries that were purely semantic, like “show me unusual termination clauses,” still worked but without the metadata filtering benefit. The team addressed this by building query suggestion tools that helped attorneys add structural constraints to their searches.
The third trade-off was maintenance cost. The metadata extraction pipeline required ongoing attention as the contract corpus grew and as the firm’s contract templates evolved. This was an operational cost that did not exist in the original vector-only system.
Results
Attorney trust in the system increased measurably after the rebuild. The firm tracked “citation verification rate” — how often attorneys checked the system’s citations against the source contract. Before the rebuild, attorneys verified seventy-eight percent of citations, indicating high distrust. After the rebuild, verification rates dropped to thirty-one percent, indicating that attorneys trusted the system’s citations enough to stop checking most of them.
More importantly, the rebuild eliminated the specific failure mode that had caused the most harm: retrieval of legally inapplicable clauses. Post-rebuild error analysis showed that ninety-four percent of retrieved passages were from the correct contract type and jurisdiction, compared to forty-one percent before the rebuild.
The decision heuristic
If your RAG system retrieves passages that are topically relevant but contextually wrong, the problem is not the embedding model or the chunking strategy. The problem is that your retrieval layer has no representation of the structural distinctions your domain experts actually use to judge relevance. Before tuning embeddings or re-chunking documents, ask your domain experts what makes two passages non-interchangeable. The answer will tell you what metadata your retrieval system is missing. Build the metadata layer first. The retrieval quality will follow.