Lego blocks come in standard sizes. A 2x4 stud configuration connects with other 2x4 configurations. A 1x2 connects with other 1x2s. The shape determines which pieces fit together. You do not need to guess whether a piece will connect; the geometry tells you. The encoding is in the shape itself, not in a separate lookup table.
Embedding vectors work the same way. Each concept gets encoded as a vector of numbers. Similar concepts end up with similar vectors. The distance between vectors tells you how related the concepts are. When you retrieve by embedding similarity, you are finding pieces whose shapes match the shape of your query. The geometry encodes the semantics.
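A minimal sketch of "distance tells you how related the concepts are", using cosine similarity. The 4-dimensional vectors below are hypothetical and hand-picked for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (assumption: not produced by any real model).
toy_vectors = {
    "lawyer":   [0.9, 0.8, 0.1, 0.0],
    "attorney": [0.8, 0.9, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9, 0.8],
}

query = toy_vectors["lawyer"]
for word, vec in toy_vectors.items():
    print(f"{word:10s} {cosine_similarity(query, vec):.3f}")
```

With these toy values, "attorney" scores far closer to "lawyer" than "banana" does, which is the behavior retrieval by embedding similarity relies on.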
What Dimensions Represent
A vector with 300 dimensions does not have 300 independent features in any human-readable sense. The dimensions are learned from data, and each dimension participates in representing multiple semantic features simultaneously. This distributed representation is part of why embeddings capture nuance that simple keyword matching misses. The concept of “professionalism” might be spread across dozens of dimensions, none of which corresponds directly to professionalism.
The trade-off is interpretability. You can measure distance between vectors easily. You cannot easily explain what each dimension means or why two vectors are close. The geometry works, but the map is not labeled in any language we can read. You know that “lawyer” and “attorney” are close in the embedding space; you do not know which dimensions capture their similarity.
This creates a debugging challenge. When retrieval produces unexpected results, you cannot simply inspect which dimensions caused the mismatch. You can measure the distance and see that two vectors are closer than expected, but you cannot tell whether this is because they genuinely share meaning or because of some artifact in the training data. The embedding space is a black box.
The Size Question
More dimensions can capture more nuanced relationships, up to a point. Too few dimensions and distinct concepts collapse into the same location. “Contract termination” and “employee termination” become indistinguishable because the embedding space does not have enough capacity to distinguish the different types of termination. The model cannot express the distinction because the dimensions are not there.
Too many dimensions brings a different problem: the space is far larger than the number of distinct concepts you are actually representing, and the extra capacity tends to encode noise rather than meaning. In high-dimensional spaces, random directions become nearly orthogonal, and the distinction between meaningful and meaningless similarity blurs. The curse of dimensionality affects embedding spaces as it affects other high-dimensional data.
The right dimensionality depends on the vocabulary size, the expressiveness needed, and the downstream retrieval task. There is no universal answer, but many production embedding models use between 384 and 1536 dimensions. The exact number is usually determined empirically: test different dimensionalities and see which works best for your retrieval task.
What Dimensionality Tells You
The dimensionality of your embedding space should match the complexity of your domain. A legal corpus with thousands of document types and nuanced distinctions requires more dimensions than a product catalog with simple category structure. If you are retrieving across highly similar documents where subtle distinctions matter, you need more dimensions to express those distinctions.
When dimensionality is too low, unrelated documents cluster together. A query about contract termination might retrieve documents about employee termination because the embedding space did not have enough capacity to distinguish the different types of termination. This is a precision failure: the retrieval returns relevant-looking results that are actually irrelevant.
When dimensionality is too high, related documents can fail to cluster because the space is too sparse to capture their similarity reliably. A query about contract termination might retrieve only documents that use the exact phrase “contract termination,” missing documents that discuss the concept using different language like “agreement cancellation” or “contract rescission.” This is a recall failure: the system misses relevant documents it should have found.
Both failure modes degrade retrieval. The goal is dimensionality that balances recall and precision for your specific retrieval task. Too little dimensionality costs precision. Too much dimensionality can cost recall. The right balance is empirical.
Similarity in High Dimensions
Distance measures behave strangely in high-dimensional spaces. In low dimensions, the nearest neighbors of a point are clearly closer than distant points. In very high dimensions, the contrast between nearest and farthest neighbors diminishes. All points become roughly equidistant. The retrieval system can still return results, but the confidence in the ranking is low.
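The loss of contrast can be observed with a small simulation. This is a sketch under a strong assumption: the points are uniformly random, which is a worst case. Real embeddings are structured and usually suffer less. The function name `relative_contrast` and its parameters are illustrative.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for dim in (2, 16, 128, 1024):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.2f}")
```

In two dimensions the farthest point is many times farther away than the nearest one. At 1024 dimensions the ratio falls well below 1, meaning the nearest and farthest points sit at almost the same distance.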
This is not just a theoretical concern. Embedding models with thousands of dimensions can produce retrieval results where the similarity scores between top results and lower-ranked results are very close. The retrieval system returns results, but the ranking is unreliable. A document with similarity 0.85 might be only marginally more relevant than a document with similarity 0.80.
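One way to notice tightly clustered scores is to measure the gap between the best and the k-th best result. The sketch below assumes you already have similarity scores for a query; the `min_margin` value of 0.05 is an arbitrary illustration, not a recommended setting.

```python
def score_margin(scores, k=10):
    """Gap between the best score and the k-th best score."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)[:k]
    return ranked[0] - ranked[-1]

def ranking_is_ambiguous(scores, k=10, min_margin=0.05):
    """Flag result lists whose top-k scores sit within min_margin of each other."""
    return score_margin(scores, k) < min_margin

# Hypothetical similarity scores for two queries.
clear_scores = [0.91, 0.74, 0.70, 0.66, 0.52]
flat_scores = [0.85, 0.84, 0.84, 0.83, 0.82]

print(ranking_is_ambiguous(clear_scores, k=5))  # False
print(ranking_is_ambiguous(flat_scores, k=5))   # True
```

A flagged query is not necessarily wrong, but its ordering carries little information, so it may deserve a reranking step or a human look.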
Large-scale vector search typically relies on approximate nearest neighbor (ANN) algorithms, which sacrifice some accuracy for speed. The system returns something like the right results, but not necessarily the exact nearest neighbors in the exact order. The approximation error is usually small but not zero: some true nearest neighbors are occasionally missed. For most retrieval tasks, this is acceptable. For retrieval tasks where precision is critical, it may not be.
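The approximation error can be measured by comparing an approximate result list against a brute-force exact search on a sample of queries. In this sketch the corpus, the document ids, and the `approx` list are hypothetical; `approx` stands in for whatever an ANN index returned.

```python
def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector by dot product, keep the best k ids."""
    scored = sorted(
        vectors.items(),
        key=lambda item: sum(q * v for q, v in zip(query, item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

# Hypothetical corpus of 2-dimensional vectors.
vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.7, 0.3],
    "doc_d": [0.0, 1.0],
}
query = [1.0, 0.0]

exact = exact_top_k(query, vectors, k=3)
approx = ["doc_a", "doc_c", "doc_d"]  # assumption: output of some ANN index
print(exact, round(recall_at_k(exact, approx), 3))
```

Here the approximate search found two of the three true neighbors. Tracking this number on sampled queries shows whether the speed trade-off is costing more than you expect.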
The Embedding Drift Problem
Embedding models are trained at a point in time. The model captures semantic relationships as they appear in its training data. The world changes. New concepts emerge. Existing concepts shift meaning. The embedding space, once trained, does not update.
This creates embedding drift. Over time, the embedding space becomes less aligned with current usage. A query that meant one thing two years ago might mean something different now. The embedding model still captures the old meaning. Retrieval returns results based on outdated semantic relationships.
Monitoring for drift is important for long-running retrieval systems. Track retrieval quality over time. If quality degrades, embedding drift might be the cause. Re-embedding documents with an updated model can address this, but a new model defines a new embedding space: every document has to be re-embedded, and vectors from the old and new models cannot be compared with each other, which invalidates historical similarity comparisons. The decision to re-embed is not trivial.
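A simple form of tracking quality over time is to compare recent scores on a fixed query set against an early baseline. This is a sketch: the metric (weekly precision@10), the window sizes, and the `max_drop` threshold are all assumptions chosen for illustration.

```python
from statistics import mean

def detect_quality_drop(history, baseline_size=4, recent_size=4, max_drop=0.05):
    """Return True when the recent average falls more than max_drop below the baseline."""
    if len(history) < baseline_size + recent_size:
        return False  # not enough data to compare yet
    baseline = mean(history[:baseline_size])
    recent = mean(history[-recent_size:])
    return (baseline - recent) > max_drop

# Hypothetical weekly precision@10 measured on the same query set.
weekly_precision = [0.82, 0.81, 0.83, 0.82, 0.80, 0.78, 0.75, 0.73, 0.72, 0.70]
print(detect_quality_drop(weekly_precision))  # True
```

A drop flagged this way does not prove drift. Changes in the corpus or in user behavior can produce the same signal, so treat it as a prompt to investigate.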
Monitoring Retrieval Quality
You cannot directly observe whether your embedding space is well-structured. You can observe retrieval outcomes: whether users find what they are looking for, whether related documents cluster together, whether false positives are rare or common. The embedding space is not directly observable; its effects are.
Precision-focused applications (legal research, medical literature search) need low false positive rates and may accept lower recall. Users in these domains are willing to do additional filtering if the retrieved set is mostly relevant. They would rather miss some relevant documents than waste time reviewing irrelevant ones. Recall-focused applications (discovery search, content recommendations) need to find everything and may accept some false positives. Missing a relevant document is worse than reviewing an irrelevant one.
Different applications need different embedding models or different similarity thresholds. A model optimized for semantic similarity in one domain may not work in another domain with different semantic structures. A model trained on general web text may not capture the specialized terminology of medical or legal domains. Domain-specific embedding models often outperform general models in their target domains, but verify this on your own queries.
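The effect of a similarity threshold on the precision/recall balance can be sketched by sweeping the threshold over scored results for a labeled query. The scores and relevance labels below are hypothetical. Reporting precision as 1.0 when nothing is retrieved is a convention chosen here, not a standard.

```python
def precision_recall_at_threshold(scored_results, relevant_ids, threshold):
    """Treat every result scoring at or above the threshold as retrieved."""
    retrieved = {doc_id for doc_id, score in scored_results if score >= threshold}
    hits = retrieved & relevant_ids
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 1.0
    return precision, recall

# Hypothetical scores and relevance judgments for one query.
scored_results = [("d1", 0.92), ("d2", 0.88), ("d3", 0.81), ("d4", 0.77), ("d5", 0.70)]
relevant_ids = {"d1", "d2", "d4"}

for threshold in (0.90, 0.80, 0.70):
    p, r = precision_recall_at_threshold(scored_results, relevant_ids, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A high threshold suits precision-focused applications; lowering it trades precision for recall.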
Decision Rules
Choose embedding dimensionality based on:
- The size and diversity of your content domain
- The granularity of distinctions your retrieval needs to make
- The balance between recall (finding related content) and precision (avoiding false positives)
- The computational cost of higher dimensions in storage and similarity search
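The storage side of that cost is simple arithmetic. The sketch below assumes 4-byte float32 values and a hypothetical corpus of one million vectors, and it ignores index overhead and metadata.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage only, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value

num_vectors = 1_000_000  # hypothetical corpus size
for dims in (384, 768, 1536):
    gib = index_size_bytes(num_vectors, dims) / 2**30
    print(f"{dims:5d} dims -> {gib:.2f} GiB")
```

Raw storage grows linearly with dimensionality, and brute-force similarity computation grows the same way, so quadrupling the dimensions roughly quadruples both.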
Monitor for:
- Concept collision (unrelated content retrieved together)
- False negatives (related content not retrieved)
- Retrieval degradation over time as content evolves beyond the original embedding space
- Low confidence in ranking (similarity scores clustered together)
Evaluate retrieval quality with:
- Precision and recall on representative query sets
- User feedback on retrieval relevance
- Analysis of false positive and false negative patterns
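The evaluation steps above can be sketched as a per-query report that also lists the false positives and false negatives, so their patterns can be inspected. The query strings, document ids, and relevance labels are hypothetical.

```python
def evaluate_query_set(results_by_query, relevant_by_query):
    """Per-query precision, recall, false positives, and false negatives."""
    report = {}
    for query, retrieved in results_by_query.items():
        retrieved_set = set(retrieved)
        relevant = relevant_by_query[query]
        hits = retrieved_set & relevant
        report[query] = {
            "precision": len(hits) / len(retrieved_set) if retrieved_set else 0.0,
            "recall": len(hits) / len(relevant) if relevant else 0.0,
            "false_positives": sorted(retrieved_set - relevant),
            "false_negatives": sorted(relevant - retrieved_set),
        }
    return report

# Hypothetical labeled queries, echoing the termination example.
results_by_query = {
    "contract termination": ["c1", "c2", "e1"],
    "agreement cancellation": ["c1"],
}
relevant_by_query = {
    "contract termination": {"c1", "c2", "c3"},
    "agreement cancellation": {"c1", "c2", "c3"},
}

for query, stats in evaluate_query_set(results_by_query, relevant_by_query).items():
    print(query, stats)
```

In this toy data the first query shows a concept collision (the employee document "e1" is retrieved), while the second shows false negatives for documents that use different wording.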
Lego blocks work because the shape tells you what connects. Embeddings work because the geometry tells you what is related. Know whether your geometry is precise enough for your task.