LLM applications face four recurring challenges: hallucination, context window limits, knowledge freshness, and cost. Vector databases enable retrieval-augmented generation (RAG), a pattern that addresses all four by combining LLMs with information retrieval. This article covers how vector databases work and how to implement them effectively.
LLM Implementation Challenges
1. Hallucination Risk
LLMs generate incorrect information with high confidence. This creates business risks when applications provide inaccurate technical information, make false product claims, or generate misleading advice.
2. Context Window Limitations
Despite recent improvements, LLMs have finite context windows:
- GPT-4 Turbo: 128,000 tokens (~100 pages)
- Claude 3 Opus: 200,000 tokens (~150 pages)
- Llama 3: 8,000 tokens (~6 pages)
These limits make it impossible to include all potentially relevant information for complex queries.
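To see how quickly these windows fill, you can count a document's tokens before sending it. A minimal sketch using the `tiktoken` library; the sample text is a stand-in for your own documents:

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

document = "Vector databases enable retrieval-augmented generation. " * 2000
n_tokens = len(enc.encode(document))
print(f"{n_tokens:,} tokens")  # compare against the model's context limit
```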
3. Knowledge Freshness
Pre-trained models have knowledge cutoffs:
- GPT-4 Turbo: April 2023 cutoff
- Claude 3: August 2023 cutoff
- Llama 3: September 2023 cutoff
Models cannot access up-to-date information without external supplementation.
4. Cost Efficiency
Token usage directly drives operational costs:
- GPT-4 Turbo: $0.01/1K input tokens, $0.03/1K output tokens
- Claude 3 Opus: $0.015/1K input tokens, $0.075/1K output tokens
Without optimization, costs escalate with usage volume. At GPT-4 Turbo rates, for example, a request with 10,000 input tokens and 500 output tokens costs about $0.115; at one million requests per month, that is roughly $115,000.
Vector Databases and RAG
Vector databases enable RAG, which combines retrieval with generation in six stages (a minimal end-to-end sketch follows the list):
- Embedding generation: Convert documents into vector representations
- Vector storage: Index vectors for efficient similarity search
- Query processing: Convert user queries to the same vector space
- Retrieval: Find relevant documents based on vector similarity
- Augmentation: Include retrieved information in the LLM prompt
- Generation: Produce response based on augmented context
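The pipeline can be sketched end to end in a few dozen lines. The sketch below assumes the OpenAI Python SDK and uses a plain in-memory list in place of a real vector database; the sample documents, model names, and `retrieve` helper are illustrative, not any product's API:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Embedding generation: convert text into vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

# Vector storage: pair each chunk with its vector (toy in-memory index).
documents = ["Our SLA guarantees 99.9% uptime.", "Support hours are 9-5 UTC."]
index = list(zip(documents, embed(documents)))

def retrieve(query, k=1):
    """Query processing + retrieval: rank stored chunks by cosine similarity."""
    q = embed([query])[0]
    def cosine(v):
        return np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
    ranked = sorted(index, key=lambda item: cosine(item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Augmentation: put retrieved chunks into the prompt; generation: ask the LLM.
query = "What uptime do we promise?"
context = "\n".join(retrieve(query))
answer = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```

In production, the in-memory list is replaced by one of the vector databases discussed below; the six stages stay the same.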
Key Vector Database Capabilities
Efficient Vector Search
- Indexing algorithms: HNSW (hierarchical navigable small world graphs), IVF (inverted file indexes), PQ (product quantization)
- Distance metrics: Cosine, Euclidean, Dot Product (compared in the sketch after this list)
- Hybrid search: Vector similarity with metadata filtering
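To make the distance metrics concrete, here is a small sketch computing all three for one pair of vectors (plain NumPy; the example vectors are arbitrary):

```python
import numpy as np

a = np.array([0.2, 0.8, 0.1])
b = np.array([0.3, 0.7, 0.2])

# Cosine similarity: angle between vectors; ignores magnitude (common default).
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance; sensitive to magnitude.
euclidean = np.linalg.norm(a - b)

# Dot product: same ranking as cosine when vectors are unit-normalized.
dot = float(np.dot(a, b))

print(cosine, euclidean, dot)
```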
Document Management
- Document storage alongside or linked to vectors
- Chunking strategies for dividing documents
- Metadata management for filtering
Integration
- LLM platform connectors: OpenAI, Anthropic, etc.
- Embedding model support: Multiple embedding types
- API accessibility: REST, gRPC, client libraries
Vector Database Options
Dedicated Vector Databases
Pinecone: Fully managed and serverless, strong performance at scale, hybrid search; metadata filtering is less expressive than in some open-source alternatives.
Weaviate: Open-source, strong multimedia support, GraphQL API, module-based architecture.
Qdrant: Open-source, strong filtering capabilities, extensive distance function support, payload storage with vectors.
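As an illustration of the filtering Qdrant is known for, a minimal sketch with the official `qdrant-client` Python package; the collection name, payload field, vector size, and query vector are invented for the example:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Vector similarity constrained by a payload (metadata) filter.
hits = client.search(
    collection_name="product_docs",   # illustrative collection
    query_vector=[0.1] * 768,         # placeholder query embedding
    query_filter=Filter(
        must=[FieldCondition(key="department", match=MatchValue(value="legal"))]
    ),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload)
```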
Extended Databases with Vector Capabilities
Postgres with pgvector: Extension to PostgreSQL, familiar SQL interface, strong ACID compliance, limited optimization for very large collections.
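A sketch of pgvector from Python, assuming psycopg 3 with the `pgvector` adapter package; the table, column names, and connection string are illustrative:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # adapts numpy arrays to the vector type

# 1536 dimensions matches OpenAI text-embedding-3-small.
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs "
    "(id bigserial PRIMARY KEY, body text, embedding vector(1536))"
)

# Nearest neighbors by cosine distance (the <=> operator).
query_vec = np.zeros(1536)  # placeholder query embedding
rows = conn.execute(
    "SELECT body FROM docs ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()
```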
Redis with RediSearch: In-memory, extremely low latency, ephemeral by default, limited advanced indexing.
MongoDB Atlas Vector Search: Vector search within MongoDB, unified database for operational and vector data; its vector capabilities are newer than those of dedicated options.
Implementation Best Practices
Chunking Strategy
How you divide documents directly impacts retrieval quality: overly large chunks dilute the embedding's focus, while overly small chunks lose surrounding context. A common LangChain setup:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try separators in order (paragraphs, lines, sentences, words) and keep
# 100 characters of overlap so context carries across chunk boundaries.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_text(document)
```
Common approaches:
- Fixed size: simple, but may break semantic units
- Semantic boundaries: split at paragraph or section breaks
- Sliding window: overlapping chunks preserve context across boundaries
Embedding Model Selection
| Model | Dimensions | Performance | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Strong | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Excellent | $0.13/1M tokens |
| Cohere embed-english-v3.0 | 1024 | Strong | $0.10/1M tokens |
| jina-embeddings-v2-base-en | 768 | Good | Self-hosted |
Consider: Performance requirements, operational costs at scale, privacy/compliance requirements, latency constraints.
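Whichever model you pick, generating embeddings is a single batched API call; a sketch with the OpenAI SDK and the smaller model from the table:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One call embeds a whole batch of texts.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is our refund policy?", "Refunds are issued within 30 days."],
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each
```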
Decision Rules
- If your LLM application generates factual errors about your organization, RAG with a vector database reduces hallucination.
- If your context window is full but responses lack specific details, you need retrieval rather than more context.
- If your LLM costs exceed $10K/month and you have large internal knowledge bases, vector search typically reduces costs 50-80% versus full context.
- If your knowledge base changes frequently, you need a vector database with efficient update mechanisms, not periodic full reindexing.