A media company with a library of twelve million articles, transcripts, and research documents had built a semantic search system on a managed vector database. The system was designed to let journalists and editors find related content across the entire archive using natural language queries. At five million documents, the system performed well. Query latency was under 200ms, relevance was high, and the editorial team adopted it enthusiastically.
By the time the index reached nine million documents, the system was in trouble. Query latency had crept above two seconds. Relevance had degraded — the system was returning semantically similar but editorially irrelevant results. And the monthly infrastructure bill for the vector database had grown to $38,000, roughly triple what the team had projected.
The problem was not the vector database product. The problem was the assumption that a single embedding space could represent twelve million documents across dozens of editorial domains with equal fidelity.
Why the vector space collapsed
Embedding models map text into a high-dimensional space where proximity represents semantic similarity. This works well when the corpus is homogeneous — medical papers, legal contracts, product reviews. The media company’s corpus was not homogeneous. It spanned political analysis, sports reporting, financial markets, cultural criticism, technology reviews, and investigative journalism. A query about “market volatility” could refer to financial markets, real estate, or the market for political advertising.
In a single embedding space, documents from all domains compete for position. As the corpus grew, the embedding space became crowded. Documents from different domains with superficially similar language began clustering together. The vector database’s approximate nearest neighbor search returned results that were close in the embedding space but distant in editorial relevance. The system could not distinguish between “semantically similar” and “editorially related” because the embedding space had no notion of editorial domain.
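To make the crowding concrete, here is a minimal sketch of the ambiguity using the open-source sentence-transformers library. The model and documents are illustrative, not the company's production setup; the point is that one query lands near plausible neighbors from three different editorial domains.

```python
# Illustrative only: one query, three documents from different editorial
# domains, all scored in the same embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query = "market volatility"
docs = [
    "Equity markets swung sharply after the central bank's rate decision.",   # finance
    "Home prices fluctuated as mortgage demand shifted across the region.",   # real estate
    "Prices for political ad slots spiked in the weeks before the primary.",  # political ads
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# All three score as "similar": the space encodes no domain boundary,
# so nothing separates financially relevant hits from the rest.
for doc, score in zip(docs, util.cos_sim(q_emb, d_emb)[0]):
    print(f"{float(score):.2f}  {doc}")
```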
This is the scaling wall that vector-only search systems hit when the corpus is large and diverse. The embedding model was not the bottleneck. The bottleneck was the single-vector-space assumption.
What they tried
The team first tried re-embedding with a larger model. A model with more dimensions should, in theory, create more separation between unrelated documents. The larger model improved relevance slightly but doubled indexing costs and increased query latency by forty percent. The improvement did not justify the cost.
The second attempt was to add metadata filtering: restricting results by section, date range, or author. This helped for queries where the user already knew which section to search. It did not help for cross-domain queries, which made up the most common and most valuable use case. A journalist investigating a story that spanned politics, finance, and technology needed results from all three domains, filtered by relevance rather than section.
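In rough code, the attempt amounted to over-fetching from the single index and post-filtering on metadata. A sketch, with the `index.search` call and the hit schema assumed rather than taken from the team's actual stack:

```python
# Sketch of metadata post-filtering on a single vector index.
# `index.search` and the metadata fields are assumptions for illustration.
from datetime import datetime, timedelta

def filtered_search(query: str, section: str | None = None, max_age_days: int | None = None):
    hits = index.search(query, limit=200)  # over-fetch, then filter down
    if section is not None:
        hits = [h for h in hits if h["section"] == section]
    if max_age_days is not None:
        cutoff = datetime.utcnow() - timedelta(days=max_age_days)
        hits = [h for h in hits if h["published_at"] >= cutoff]
    return hits[:20]  # useful only if the user already knows the section
```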
The third attempt was to shard the vector database by editorial domain — one index for politics, one for finance, one for sports, and so on. Queries were routed to the relevant shards. This improved precision within each domain but broke cross-domain search. The routing logic also introduced a classification step that was itself error-prone: a story about political fundraising was classified as politics sixty percent of the time and finance forty percent of the time, depending on which paragraphs the classifier weighted most heavily.
The approach: hierarchical retrieval with domain-specific indexes
We replaced the single-index architecture with a hierarchical retrieval system that combined coarse-grained domain routing with fine-grained vector search within each domain.
[Architecture diagram: query → domain router → parallel domain-specific indexes → cross-domain re-ranker → merged results.]
The domain router used a lightweight classifier to determine which domains were relevant to the query. Unlike the previous shard routing attempt, the router was not forced to pick a single domain. It returned a ranked list of relevant domains, and the search fan-out covered the top three. A query about political fundraising would hit both the political analysis and financial markets indexes. A query about a specific technology company would hit the technology index and possibly the financial markets index.
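In outline, the router looks something like the sketch below. `domain_clf` stands for any lightweight probabilistic classifier, `embed` for the shared embedding model, and `domain_indexes` for the per-domain indexes sketched after the next paragraph; all three names are assumptions for illustration.

```python
# Sketch of domain routing with top-3 fan-out. The router ranks all
# domains instead of committing to one, so ambiguous queries (political
# fundraising, say) reach every plausible index.
import numpy as np

DOMAINS = ["politics", "finance", "sports", "culture", "technology", "investigative"]

def route(query: str, top_k: int = 3) -> list[str]:
    probs = domain_clf.predict_proba(embed(query).reshape(1, -1))[0]
    return [DOMAINS[i] for i in np.argsort(probs)[::-1][:top_k]]

def search(query: str, per_domain: int = 50) -> list:
    q_emb = embed(query)
    candidates = []
    for domain in route(query):  # parallel in practice; sequential for clarity
        candidates.extend(domain_indexes[domain].search(q_emb, limit=per_domain))
    return candidates
```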
Each domain index contained only documents from its domain. The embedding model was the same across all indexes, but the reduced corpus size per index meant that the vector space was less crowded and the nearest neighbor search was more precise. The same embedding model that produced noisy results on a twelve-million-document corpus produced sharp results on a two-million-document corpus.
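The `domain_indexes` mapping in the routing sketch could be built along these lines, with FAISS standing in for the managed vector database. The FAISS calls are standard; the wrapper class and the `corpus_by_domain` layout are assumptions.

```python
# One small ANN index per domain, all using the same embedding model.
import faiss
import numpy as np

class DomainIndex:
    def __init__(self, doc_ids: list[str], embeddings: np.ndarray):
        self.doc_ids = doc_ids
        self.index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)  # HNSW graph ANN
        self.index.add(embeddings.astype(np.float32))

    def search(self, query_emb: np.ndarray, limit: int = 50) -> list[str]:
        _, ids = self.index.search(query_emb.reshape(1, -1).astype(np.float32), limit)
        return [self.doc_ids[i] for i in ids[0] if i != -1]

# ~2M vectors per index instead of 12M in one: the same model, a far
# less crowded neighborhood per search.
domain_indexes = {d: DomainIndex(ids, embs) for d, (ids, embs) in corpus_by_domain.items()}
```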
The cross-domain re-ranker took the top candidates from each domain index and re-scored them using a lightweight model that considered both semantic similarity and editorial relevance signals. The re-ranker had access to metadata that the embedding model did not: publication date, author reputation, section placement, and citation count within the archive. These signals were not available during embedding but were highly predictive of editorial relevance.
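A sketch of the re-ranking idea, with features mirroring the signals named above; the weights, normalizations, and the linear blend itself are illustrative stand-ins for the production model.

```python
# Re-score merged candidates with metadata the embedding never saw.
from dataclasses import dataclass
import math
import time

@dataclass
class Candidate:
    doc_id: str
    similarity: float    # cosine score from its domain index
    published_at: float  # unix timestamp
    author_score: float  # 0..1 author-reputation signal
    citations: int       # links from elsewhere in the archive
    front_section: bool  # crude stand-in for section placement

def rerank(candidates: list[Candidate]) -> list[Candidate]:
    def score(c: Candidate) -> float:
        age_days = (time.time() - c.published_at) / 86400
        recency = math.exp(-age_days / 365)            # ~1-year decay
        cited = min(1.0, math.log1p(c.citations) / 5)  # diminishing returns
        return (0.55 * c.similarity + 0.20 * recency
                + 0.10 * c.author_score + 0.10 * cited
                + 0.05 * c.front_section)
    return sorted(candidates, key=score, reverse=True)
```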
What we gave up
The hierarchical system added latency. The domain routing step, the parallel index queries, and the cross-domain re-ranking each added time. Total query latency was 400ms, compared to 180ms for the original system when it was running on the five-million-document corpus. However, compared to the degraded state at nine million documents, the hierarchical system was five times faster.
The second trade-off was indexing complexity. Instead of writing all documents to a single index, the ingestion pipeline had to classify each document into one or more domain indexes. The domain classifier was accurate about ninety-three percent of the time. The remaining seven percent were indexed in multiple domains, which increased storage costs by roughly fifteen percent.
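The ingestion-side step might look like the sketch below, reusing the assumed `domain_clf`, `embed`, and `DOMAINS` from the routing sketch; the probability threshold is invented for illustration.

```python
# Ambiguous documents are written to every plausible domain index,
# trading extra storage for recall on cross-domain material.
def assign_domains(doc_text: str, threshold: float = 0.35) -> list[str]:
    probs = domain_clf.predict_proba(embed(doc_text).reshape(1, -1))[0]
    chosen = [DOMAINS[i] for i, p in enumerate(probs) if p >= threshold]
    return chosen or [DOMAINS[int(probs.argmax())]]  # always index somewhere
```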
The third trade-off was maintenance. Six domain indexes meant six index management operations instead of one. Schema changes, re-indexing jobs, and capacity planning all multiplied by the number of domains. The team automated most of this, but the operational surface area was larger.
Results
At twelve million documents, the hierarchical system returned results with median latency of 380ms and p99 latency of 900ms. Editorial relevance, as measured by click-through rate on search results, improved by thirty-four percent compared to the degraded single-index system. The infrastructure cost dropped from $38,000 per month to $14,000 per month, because the domain indexes were smaller and required less powerful instances than the monolithic index.
The most significant improvement was in cross-domain search. Journalists reported that the system surfaced connections between stories in different editorial domains that they would not have found through section-specific search. A query about a specific CEO returned results from financial markets, technology, and investigative journalism — results that had previously been hidden in separate, disconnected parts of the archive.
The decision heuristic
If your vector search relevance degrades as your corpus grows, the problem is not the embedding model or the vector database. The problem is that a single embedding space cannot represent a diverse corpus with equal fidelity at every scale. Split the corpus by domain, build separate indexes, and route queries to the relevant indexes. The embedding model stays the same. The vector space gets room to breathe. And the relevance improves because the nearest neighbor search is no longer competing with unrelated documents for position.