You have a wall covered in photos. You are looking at one from a beach trip. Nearby are other beach photos, vacation snapshots, summer memories. Not identical shots, but related moments. The clustering is semantic, not temporal. The wall organizes by meaning. You glance at the wall and recognize the context. The wall is not searching for an exact match to the photo in your hand. It is finding the neighborhood of similar memories.
A semantic cache works the same way. It stores query-result pairs indexed by the semantic meaning of the query, not by exact string match. When a new query arrives, the cache finds the most similar stored query and, if similarity is high enough, returns the cached result. The hit is approximate rather than exact. The cache is organized by meaning, not by text.
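The shape of such a cache can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function here is a toy character-bigram counter standing in for a real embedding model, and the linear scan over entries would be replaced by a vector index at scale.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: a character-bigram
    # count vector. A production cache would call an embedding model here.
    vec = {}
    t = text.lower()
    for i in range(len(t) - 1):
        bigram = t[i:i + 2]
        vec[bigram] = vec.get(bigram, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, query, response)

    def get(self, query):
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, _, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        if best_sim >= self.threshold:
            return best   # semantic hit
        return None       # miss: caller generates a response and calls put()

    def put(self, query, response):
        self.entries.append((embed(query), query, response))

cache = SemanticCache(threshold=0.6)
cache.put("how do I reset my password", "Visit Settings > Security > Reset.")
hit = cache.get("how do i reset my password?")   # near-duplicate phrasing
miss = cache.get("what is your refund policy")   # unrelated query
```

The caller treats `None` as a cache miss, generates a fresh response, and stores it with `put` so the next similar query can hit.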
LLM responses are expensive. The cost of generating a response from a large model is measured in tokens, latency, and money. If your application sees repeated or similar queries, recomputing the response every time wastes all three. A semantic cache lets you serve cached responses for queries that are semantically similar enough to previous ones. The savings compound with query repetition. A system that sees the same questions phrased differently twenty times a day saves nineteen model calls.
The classic use case is a customer support bot. Users ask the same questions in different phrasings. “How do I reset my password,” “I forgot my password,” “cannot log in need to reset,” and “password reset not working” are semantically equivalent in intent but lexically different. A keyword cache misses all of these. A semantic cache that knows they are the same question serves the cached answer to all four. The wall recognizes that all four photos are from the same vacation.
The key question is “similar enough.” Set the threshold too high and you get almost no cache hits; the system is effectively uncached because no query is similar enough to trigger a hit. Set it too low and you serve incorrect answers because the similar query had a different correct answer. Tuning this threshold is empirical: it depends on your query distribution and your tolerance for approximation. The threshold is the most important calibration you will make.
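The empirical tuning can be framed as a sweep over candidate thresholds against a labeled set. The numbers below are invented for illustration; in practice the `(similarity, same_intent)` pairs would come from production query logs with human intent labels.

```python
# Hypothetical labeled set: similarity of a query to its nearest cached
# neighbor, and whether the two actually share intent.
labeled = [
    (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.82, True), (0.79, False), (0.75, False), (0.70, False),
]

def evaluate(threshold):
    # false_hits: served a cached answer for a different-intent query
    # missed_hits: declined to serve a valid cached answer
    false_hits = sum(1 for sim, same in labeled if sim >= threshold and not same)
    missed_hits = sum(1 for sim, same in labeled if sim < threshold and same)
    return false_hits, missed_hits

for t in (0.70, 0.80, 0.90):
    fh, mh = evaluate(t)
    print(f"threshold={t:.2f} false_hits={fh} missed_hits={mh}")
```

Raising the threshold trades false hits for missed hits; the right balance depends on how costly a wrong cached answer is relative to a redundant model call.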
The tolerance question is not uniform. A cached answer to “what is your refund policy” can be served with high confidence even at moderate similarity. A cached answer to “what is my account balance” should never be served based on semantic similarity; it needs exact match. The threshold should reflect what the query is asking for and what the cost of an incorrect answer is. Some questions have one right answer; some have many acceptable answers.
The Similarity Problem
Similar queries do not always want similar answers. “What is the capital of Australia?” and “What is the largest city in Australia?” are semantically similar but have different answers. Canberra is the capital. Sydney is the largest city. A cache that returns the capital answer for the largest-city question has failed silently. The user gets a wrong answer and may not realize it. The photo wall showed an Australian vacation and the user asked about Australia, so the cache returned an Australian photo. But the user wanted a specific moment, not a general category.
This failure mode is worst for factual queries with precise answers. It is less severe for explanatory queries where multiple valid framings exist. “Explain how photosynthesis works” and “what is photosynthesis” are asking for the same thing. “What is the population of France” and “What is the GDP of France” are asking for different numbers. A semantic cache needs to understand which category its queries fall into. Without that understanding, the cache serves answers that are wrong in ways the user may not catch.
A practical mitigation is to separate cache entries by query type. Factual queries with specific answer types get their own cache partition. Explanatory queries with flexible answers get a more permissive threshold. This requires knowing something about your query distribution, which requires instrumenting your cache and analyzing what types of queries you receive. The photo wall has separate sections: one for vacations, one for work events, one for family gatherings. A photo from the wrong section looks wrong even if the colors match.
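Partitioning can be as simple as a per-type threshold table. The type labels and threshold values below are assumptions for illustration; in a real system a classifier upstream of the cache would assign the type.

```python
# Hypothetical per-partition thresholds, assuming an upstream
# classifier labels each query as "factual" or "explanatory".
THRESHOLDS = {
    "factual": 0.97,      # near-exact matches only
    "explanatory": 0.80,  # paraphrases acceptable
}

def should_serve(similarity, query_type):
    return similarity >= THRESHOLDS[query_type]

# The same 0.9 similarity hits one partition but not the other:
# should_serve(0.9, "explanatory") -> True
# should_serve(0.9, "factual") -> False
```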
Another mitigation is to cache at a higher level of abstraction. Cache an answer template rather than a specific answer, and have the model instantiate the template with query-specific details. This is more complex to build but handles a wider range of semantically similar queries. The cache stores the structure of the answer, not the specific words.
A third approach is to use a two-tier cache: exact-match cache for queries requiring precision, semantic cache for queries where approximation is acceptable. This hybrid approach separates the problem space and applies appropriate handling to each tier. The photo wall has a filing cabinet for exact negatives (the photo of that specific beach) and a wall for semantic clusters.
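A sketch of the two-tier idea, under stated assumptions: the semantic tier here is a deliberately crude word-overlap (Jaccard) matcher standing in for an embedding-based one, and the `precise` flag is assumed to come from whatever query classification the system already does.

```python
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

class KeywordSemanticTier:
    """Crude stand-in for an embedding-based semantic tier."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []  # list of (query, response)

    def put(self, query, response):
        self.entries.append((query, response))

    def get(self, query):
        for stored, response in self.entries:
            if jaccard(stored, query) >= self.threshold:
                return response
        return None

class TwoTierCache:
    """Exact-match tier for precision queries; semantic tier for
    approximation-tolerant ones."""
    def __init__(self, semantic_tier):
        self.exact = {}
        self.semantic = semantic_tier

    def get(self, query, precise):
        if precise:
            return self.exact.get(query)   # exact string match only
        return self.exact.get(query) or self.semantic.get(query)

    def put(self, query, response, precise):
        self.exact[query] = response
        if not precise:
            self.semantic.put(query, response)

cache = TwoTierCache(KeywordSemanticTier())
cache.put("what is the refund policy", "30 days, full refund.", precise=False)
cache.put("what is my account balance", "$42.17", precise=True)
```

A paraphrase like “what is your refund policy” falls through to the semantic tier and hits, while the balance query is only ever served on an exact string match.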
The Staleness Problem
A cached response reflects the model’s knowledge at the time it was generated. If the model updates, if the underlying data changes, if the world moves on, the cached answer may be stale. The cache does not know it is stale unless you give it staleness signals. Time-based expiry is simple: every cache entry has a TTL. Version tracking is more precise: invalidate the cache when the model version or data version changes.
Time-based expiry is easy to implement but has a sharp edge. A cache entry with a 24-hour TTL might contain a factually correct answer at hour 23 but an incorrect answer at hour 25 if the world changed in between. The TTL assumes uniform staleness, which is often wrong. Product pricing changes; policy documents get updated; numbers get revised. A cache that assumes answers age uniformly will be wrong on the answers that aged non-uniformly. Some photos age well; others look dated within days.
Version-based invalidation is more precise but requires versioning discipline. If you deploy a new model version and the cache was built on the old model, the cache entries may reflect old model behavior, not new model behavior. If your knowledge base updates, cache entries grounded in the old knowledge base are stale. The cache needs to know what it depends on, and it needs invalidation signals when those dependencies change. The photo wall needs metadata: when was this taken, what model was current, what data was current.
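TTL and dependency versions can be checked together in one validity test. This is a sketch of the bookkeeping only; the version identifiers are hypothetical and would come from your deployment and knowledge-base pipelines.

```python
import time

class CacheEntry:
    def __init__(self, response, ttl_seconds, model_version, kb_version):
        self.response = response
        self.expires_at = time.time() + ttl_seconds
        self.model_version = model_version  # dependencies recorded
        self.kb_version = kb_version        # at write time

    def is_valid(self, now, current_model, current_kb):
        # Stale if the TTL elapsed OR any recorded dependency changed.
        return (now < self.expires_at
                and self.model_version == current_model
                and self.kb_version == current_kb)

entry = CacheEntry("cached answer", ttl_seconds=3600,
                   model_version="v2", kb_version=7)
```

A model deploy (`v2` to `v3`) or a knowledge-base update (version 7 to 8) invalidates the entry immediately, without waiting for the TTL.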
Event-driven invalidation is a middle path. Rather than TTL or version tracking, invalidate specific cache entries when external events occur: a price change invalidates pricing-related entries, a policy update invalidates policy-related entries. This requires the cache to understand the semantic content of entries and map events to affected queries. It is more complex but handles non-uniform staleness better than uniform TTL. The wall updates the beach photos when winter arrives, not on a fixed schedule.
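One way to wire events to entries is to tag each entry with topics and index entries by tag, so an event can invalidate exactly the affected entries. The tags here are assumed to come from a hypothetical upstream classifier.

```python
from collections import defaultdict

class TaggedCache:
    def __init__(self):
        self.entries = {}               # query -> (response, tags)
        self.by_tag = defaultdict(set)  # tag -> set of queries

    def put(self, query, response, tags):
        self.entries[query] = (response, tags)
        for tag in tags:
            self.by_tag[tag].add(query)

    def get(self, query):
        hit = self.entries.get(query)
        return hit[0] if hit else None

    def invalidate_tag(self, tag):
        # An external event (price change, policy update) maps to a tag
        # and removes every entry carrying it.
        for query in self.by_tag.pop(tag, set()):
            self.entries.pop(query, None)

cache = TaggedCache()
cache.put("how much does the pro plan cost", "$20/month", tags={"pricing"})
cache.put("what is the refund window", "30 days", tags={"policy"})
cache.invalidate_tag("pricing")   # a price-change event fires
```

After the pricing event, only pricing entries are gone; policy entries keep serving until their own event arrives.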
Cache Hit Rate Measurement
A semantic cache that is never checked is useless infrastructure. A cache that is checked but rarely hits is expensive overhead. Measuring hit rate and the quality of hits matters. The cache must justify its existence with actual savings.
The naive hit rate is queries that find a similar enough match. The meaningful hit rate is hits that produce correct answers. A cache might have a 40% hit rate but serve wrong answers for half of those hits, meaning only 20% of queries get correct cached answers. Distinguishing between hit rate and correct-hit rate is important for understanding whether the cache is actually helping. A cache that serves wrong answers is worse than no cache.
Build your evaluation to measure correct-hit rate, not just hit rate. This requires either human evaluation of cached answers or a ground truth dataset with known correct answers. Without this measurement, you are flying blind on whether your cache threshold is well-calibrated. You may be serving wrong answers and calling it success.
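The two metrics can be computed side by side over an evaluation set. The records below are invented for illustration; in practice the `correct_answer` column is the ground truth you assembled or had humans label.

```python
# Each record: (cache_hit, served_answer, correct_answer)
results = [
    (True,  "30-day refund window", "30-day refund window"),
    (True,  "30-day refund window", "14-day refund window"),  # hit, wrong answer
    (True,  "$20/month",            "$20/month"),
    (False, None,                   "$99/month"),
    (False, None,                   "contact support"),
]

total = len(results)
hits = sum(1 for hit, _, _ in results if hit)
correct_hits = sum(1 for hit, served, truth in results
                   if hit and served == truth)

hit_rate = hits / total                  # looks healthy: 0.6
correct_hit_rate = correct_hits / total  # the real number: 0.4
```

The gap between the two numbers is the rate at which the cache is confidently serving wrong answers.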
Segment your hit analysis by query type. The correct-hit rate for explanatory queries may be high even with moderate similarity thresholds. The correct-hit rate for factual queries may be low even with high similarity thresholds. Aggregated metrics mask these differences. Measuring per-segment shows where the cache is working and where it is failing. The photo wall works well for vacation photos but poorly for work photos; aggregate metrics would hide that.
Building for Production
Cache invalidation in production is harder than in theory. The cache stores responses that were correct at generation time. When the source of truth changes, the cache does not know. Consider a FAQ cache: the FAQ gets updated, but cached answers still reflect the old FAQ. Without invalidation signals tied to the FAQ update event, the cache serves stale answers until the TTL expires. The wall still has the old photo even after the event it captured changed.
The operational question of what to cache is also non-trivial. A cache entry is a query-response pair. The response is a specific model output, which may be nondeterministic across calls even for the same query. If your model produces slightly different outputs on each call, caching stores one specific output. Subsequent cache hits serve that specific output, which may differ from what a fresh call would produce. This variance may be acceptable for your use case, but it should be explicit in your cache design. The photo is one moment; the event continues.
Memory usage grows with cache size. Each cache entry stores a query embedding, a response, and metadata. At scale, this memory can become significant. Cache eviction policies matter: LRU (least recently used) is simple, but LFU (least frequently used) may better serve workloads where some queries repeat often and others rarely. Analyze your access patterns before choosing an eviction policy. The wall has limited space; some photos need to come down.
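An LRU policy for the entry store is a few lines with an ordered map. This sketch bounds only the number of entries; the similarity search itself is out of scope here.

```python
from collections import OrderedDict

class LRUSemanticStore:
    """Bounded entry store with least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # query -> response

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)   # mark as recently used
        return self.entries[query]

    def put(self, query, response):
        self.entries[query] = response
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry

store = LRUSemanticStore(capacity=2)
store.put("q1", "a1")
store.put("q2", "a2")
store.get("q1")          # q1 is now the most recently used
store.put("q3", "a3")    # capacity exceeded: evicts q2, not q1
```

An LFU variant would track access counts instead of recency; which one wins depends on whether your hot queries stay hot.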
Embedding dimension affects cache memory usage. A 1536-dimensional embedding uses more memory than a 384-dimensional embedding. Smaller embedding models may have lower retrieval quality but enable larger caches in the same memory footprint. The trade-off depends on your similarity requirements and memory constraints.
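The arithmetic is worth making concrete. Assuming float32 storage for the embeddings alone (responses and metadata add more on top):

```python
def embedding_memory_gb(num_entries, dims, bytes_per_float=4):
    # float32 embeddings only; responses and metadata are extra.
    return num_entries * dims * bytes_per_float / 1e9

large = embedding_memory_gb(1_000_000, 1536)  # ~6.1 GB
small = embedding_memory_gb(1_000_000, 384)   # ~1.5 GB
```

A million entries at 1536 dimensions costs roughly four times the embedding memory of the same cache at 384 dimensions.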
Cache warming strategies matter for cold-start performance. A cache that starts empty has no hits until it fills. Proactively caching high-frequency queries on startup reduces cold-start latency. Query frequency analysis from production logs identifies which queries to cache first. The wall starts empty on opening day; pre-fill it with the most common photos.
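The frequency analysis is a straightforward count over historical logs. The log lines below are invented; in practice they would be the normalized queries from your production traffic.

```python
from collections import Counter

def warmup_queries(query_log, top_n):
    """Pick the most frequent historical queries to pre-generate
    and cache before the service takes traffic."""
    return [q for q, _ in Counter(query_log).most_common(top_n)]

log = ["reset password", "refund policy", "reset password",
       "pricing", "reset password", "refund policy"]
warm = warmup_queries(log, top_n=2)
# warm == ["reset password", "refund policy"]
```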
Use semantic cache when your query distribution has meaningful repetition or similarity, when responses are expensive relative to cache retrieval cost, when your data does not change frequently (staleness risk is low), when approximate answers are acceptable for cache hits, when you can tune and measure your similarity threshold, and when the query types are separable into factual (high-precision needed) vs explanatory (approximation acceptable).
Do not use semantic cache when every query is unique (cache never hits), when accuracy requirements are strict and approximate answers are unacceptable, when underlying data changes frequently (staleness dominates), when latency requirements are tighter than cache retrieval can provide, and when you have not measured whether your threshold is actually calibrated. Design for your threshold: a cache that is never consulted is overhead, not infrastructure. A cache that serves wrong answers is worse than no cache at all. The photo wall earns its space when it shows you the right neighborhood.