Embeddings assign numerical coordinates to words and concepts. “Cat” sits near “kitten” and “feline” but far from “airplane.” “Paris” neighbors “France” and “Eiffel Tower” but distances itself from “Tokyo.” Move from “king” toward “woman” and away from “man,” and you arrive at “queen.” This transformation makes language navigable for computers, the same way GPS makes physical space navigable.
Words Without Coordinates
Before embeddings, computers processed words like phone books process names.
Dictionary Order
The obvious approach assigns each word a number:
- Apple = Entry #1,234
- Apples = Entry #1,235
- Application = Entry #1,236
The numbering encodes alphabetical order, not meaning: Apple sits nearly as close to Application as to Apples, even though only one of those pairs is related. This is like organizing a library by book size instead of topic.
One-Hot Encoding
Another approach represents each word as a vector with a single 1:
- Cat = [0,0,0,1,0,0,0,0,0,0…]
- Dog = [0,0,0,0,1,0,0,0,0,0…]
- Kitten = [0,0,0,0,0,0,0,0,1,0…]
Every word sits at the same distance from every other word. This is like placing every location on Earth at equal distances from each other. Navigation becomes meaningless.
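A quick sketch (Python with numpy assumed) shows why one-hot vectors defeat navigation: every pair of distinct words is exactly the same distance apart.

```python
import numpy as np

# Hypothetical 10-word vocabulary; each word is a one-hot vector.
vocab = ["the", "a", "on", "cat", "dog", "sat", "mat", "ran", "kitten", "yarn"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# The Euclidean distance between any two distinct one-hot vectors is sqrt(2).
d_cat_kitten = np.linalg.norm(one_hot["cat"] - one_hot["kitten"])
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
print(d_cat_kitten == d_cat_dog)  # True: "kitten" is no closer to "cat" than "dog" is
```

No matter which two words you pick, the distance never changes, so distance carries no information.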
Embedding Space
Embeddings assign words coordinates in high-dimensional space:
Cat and kitten have similar coordinates. Dog and puppy share a region. Pets cluster together.
Navigating Meaning
Semantic Directions
In embedding space, moving in specific directions carries meaning:
Starting at “king”:
- Move toward “female”: Approach “queen”
- Move toward “young”: Approach “prince”
- Move toward “evil”: Approach “tyrant”
A dimension might capture formality (formal to casual), another sentiment (positive to negative), another concreteness (abstract to concrete).
Distance as Difference
The distance between two words reflects their semantic relationship:
- “Car” to “automobile”: 0.2 units (near-synonyms)
- “Car” to “truck”: 2 units (related vehicles)
- “Car” to “bicycle”: 5 units (different transport modes)
- “Car” to “banana”: 500 units (unrelated)
Distances encode relationships automatically.
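The unit figures above are illustrative; in practice, similarity is usually measured with cosine similarity rather than raw distance. A toy sketch with invented 3-D coordinates (real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy coordinates, invented for illustration only.
words = {
    "car":        np.array([4.0, 1.0, 0.2]),
    "automobile": np.array([4.1, 1.1, 0.3]),
    "truck":      np.array([3.5, 2.0, 0.4]),
    "banana":     np.array([0.1, 0.2, 5.0]),
}

sims = {w: cosine_similarity(words["car"], v) for w, v in words.items() if w != "car"}
# Near-synonyms score highest; unrelated words lowest.
print(sorted(sims, key=sims.get, reverse=True))  # automobile first, banana last
```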
Vector Arithmetic
Classic example:
- Start at “king” coordinates
- Subtract the “man” vector (removes maleness)
- Add the “woman” vector (adds femaleness)
- Result: coordinates near “queen”
This works: king - man + woman ≈ queen. The arithmetic captures analogical relationships.
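The steps above can be sketched with invented toy vectors, chosen so the analogy holds exactly (real embeddings only approximate it):

```python
import numpy as np

# Toy vectors invented for illustration; dimension 2 loosely encodes "maleness".
emb = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the vocabulary word nearest to the arithmetic result.
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # "queen"
```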
Use Cases
Finding Similar Items
Query: “affordable laptop for students”
Embedding search understands:
- “budget” is semantically near “affordable”
- “notebook computer” is near “laptop”
- “college” is near “students”
Returns results like “Budget notebook for college” even without exact word matches.
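A minimal sketch of such a search, with invented document and query vectors standing in for a real embedding model:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document embeddings (invented); a real system would get these from a model.
docs = {
    "Budget notebook for college":  np.array([0.9, 0.8, 0.1]),
    "Gaming desktop, high-end GPU": np.array([0.2, 0.7, 0.9]),
    "Cheap textbooks for students": np.array([0.8, 0.1, 0.2]),
}
query = np.array([0.85, 0.75, 0.15])  # stands in for "affordable laptop for students"

# Rank documents by similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the budget-notebook listing, despite sharing no exact words
```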
Translation
- English “dog” at [2.3, 0.9, -1.1, …]
- Spanish “perro” at [2.2, 0.8, -1.0, …]
- French “chien” at [2.4, 0.9, -1.2, …]
Once the embedding spaces are aligned, translation equivalents occupy nearly identical positions. Translation becomes a coordinate transformation.
Training Embeddings
Learning from Context
Words acquire coordinates by observing their neighbors in sentences:
“The cat sat on the mat”
“The dog sat on the mat”
“The kitten played with yarn”
The system observes cat, dog, and kitten in similar contexts, so they receive nearby coordinates. Across a large corpus, cat and kitten share more of their contexts with each other than with dog, so they end up closer still.
This is like mapping a city by watching traffic patterns.
Training Process
- Start with random coordinates for all words
- Observe co-occurrences: “banking” appears near “financial” frequently
- Adjust: move “banking” coordinates closer to “financial”
- Repeat millions of times
- A coherent semantic map emerges
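The learning-from-context idea can be sketched with a crude count-based embedding (real models like Word2Vec learn dense vectors instead): a word's coordinates are simply counts of the words it co-occurs with, so words that appear in the same sentences get similar vectors.

```python
from collections import Counter
import numpy as np

sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the kitten played with yarn",
]
tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})

def embed(word):
    """Count-based 'embedding': how often each vocabulary word co-occurs with `word`."""
    counts = Counter()
    for sent in tokens:
        if word in sent:
            counts.update(w for w in sent if w != word)
    return np.array([counts[w] for w in vocab], dtype=float)

def dist(a, b):
    return np.linalg.norm(embed(a) - embed(b))

# "cat" and "dog" share identical contexts here, so their vectors coincide;
# "kitten" appears in a different context, so it lands farther away.
print(dist("cat", "dog"), dist("cat", "kitten"))
```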
Embedding Models
Word2Vec: Fast, simple, general purpose. 100-300 dimensions typical.
GloVe: Incorporates global word statistics. Better at capturing analogies.
BERT: Context-sensitive. “Bank” near river differs from “Bank” near financial institution. 768 dimensions typical.
GPT: Predictive. Trained to predict next words. 1000+ dimensions.
Dimensionality
Why 300 Dimensions?
Physical space needs 3 coordinates. Language needs hundreds because meaning has many aspects:
- Dimension 1: concrete vs. abstract
- Dimension 2: positive vs. negative
- Dimension 3: formal vs. casual
- Dimensions 4-300: increasingly subtle distinctions
More dimensions allow finer distinctions without crowding.
The Dimensionality Tradeoff
High dimensions seem problematic but help:
In 3D: limited positions, crowding likely.
In 300D: vast space, nuanced positions possible.
Every word can find its location without crowding neighbors.
Dimension Reduction
For visualization, compress to 2D or 3D using t-SNE or UMAP. Some distortion occurs, but clusters become visible.
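As a dependency-free sketch, PCA (via numpy's SVD) can stand in for t-SNE or UMAP; it projects onto the two directions of greatest variance rather than preserving local clusters, but the mechanics of compressing 300 dimensions to 2 are the same:

```python
import numpy as np

rng = np.random.default_rng(1)
# Pretend 300-dimensional embeddings for five words (random, for illustration).
high_dim = rng.normal(size=(5, 300))

# PCA via SVD: center the data, then project onto the top two principal directions.
centered = high_dim - high_dim.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T

print(coords_2d.shape)  # (5, 2): each word now has plottable 2-D coordinates
```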
Common Problems
Polysemy
“Bank” (financial institution) and “bank” (river edge) share coordinates. Without context, the embedding cannot distinguish them. Contextual embeddings like BERT address this by adjusting positions based on surrounding words.
Rare Words
Uncommon words receive poor coordinates. Few training examples mean unreliable positions. This is the GPS coverage problem in remote areas.
Bias
Training data encodes historical biases:
- “Doctor” closer to “male” than “female”
- “Nurse” shows the reverse
Embeddings capture whatever patterns exist in training data, including unwanted ones.
Out-of-Vocabulary
New words like “COVID-19” have no coordinates. Subword embeddings (breaking words into pieces) partially address this.
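The subword idea can be sketched with fastText-style character trigrams (details simplified for illustration): an unseen word still shares pieces with known words, so it can borrow their coordinates instead of having none at all.

```python
def char_ngrams(word, n=3):
    """Break a word into character trigrams, fastText-style (a sketch)."""
    padded = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# The new word overlaps heavily with a known word's subword pieces.
overlap = char_ngrams("COVID-19") & char_ngrams("COVID")
print(sorted(overlap))
```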
Decision Rules
Use embeddings when:
- You need semantic similarity (not just exact matches)
- You’re building recommendation or search systems
- Cross-lingual understanding matters
Consider alternatives when:
- Exact matching is required (traditional indexes work better)
- Interpretability is critical (embeddings are opaque)
- Training data is scarce (embeddings may not generalize)
Library Book Whisperer (Distributed Caching)
A library maintains an unofficial whisper network. A patron asks about a book, and a librarian remembers: “Sarah at the reference desk has it.” This network bypasses the official catalog, turning hours of searching into seconds of knowing.
Distributed caching works the same way. Between users and the authoritative source, a faster informal network remembers answers to common questions.
The Catalog Approach
Central Library has five floors. To find a book:
- Search the catalog
- Note the call number
- Navigate to the correct floor and section
- Scan shelves
- Discover it’s checked out
- Return to catalog, repeat
Twenty minutes, no book.
With the whisper network:
Patron: “Anyone seen ‘Distributed Systems Design’?”
Librarian: “Tom at the computer section has it on his cart.”
Patron walks directly to Tom, gets the book.
Two minutes.
The whisper network cached Tom’s knowledge.
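In caching terms, this is the cache-aside pattern: ask the fast informal store first, fall back to the slow authoritative source on a miss, and remember the answer. A minimal Python sketch (names and the simulated delay are invented):

```python
import time

def fetch_from_catalog(title):
    """Authoritative but slow lookup (simulated)."""
    time.sleep(0.01)  # stand-in for the twenty-minute catalog search
    return f"{title} is in the stacks, floor 3"

whisper_cache = {}  # the informal network's shared memory

def find_book(title):
    # Check the whisper network first; fall back to the catalog on a miss.
    if title not in whisper_cache:
        whisper_cache[title] = fetch_from_catalog(title)  # remember the answer
    return whisper_cache[title]

find_book("Distributed Systems Design")           # slow: catalog lookup
answer = find_book("Distributed Systems Design")  # fast: cache hit
print(answer)
```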
Cache Levels
Reading Room Caches
Each reading room maintains relevant knowledge:
Science Room tracks which professors have which journals, current physics research requests.
Literature Room knows this week’s book club selections, poetry anthology locations.
Each room caches information relevant to its visitors.
Librarian Network
Librarians form a higher-level cache:
Reference Desk Rachel knows all active research projects, tracks rare book movements.
Circulation Desk Carlos knows reserved books, overdue returns.
Stack Supervisor Susan knows misplaced books, restoration queue.
They share knowledge through informal channels.
Patron Network
Regular patrons become caches too:
Professor Patricia knows all Victorian literature locations, shares with students.
Researcher Robert knows archive organization, helps newcomers.
The network becomes self-organizing.
Cache Validity
Stale Information
Monday: “The new AI textbook is on the New Arrivals shelf”
Tuesday: Book moves to regular stacks
Wednesday: Patron still checking New Arrivals (stale data)
Solutions:
Time-Based Expiry: “As of this morning, it was there”
Event-Based Updates: “I saw them move it an hour ago”
Verification: “Let me double-check… yes, still there”
Update Propagation
When books move:
- Stack Supervisor notices
- Informs Reference Desk
- Reference updates their knowledge
- Tells Reading Room volunteers
- Network knowledge refreshes
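The propagation chain can be sketched with hypothetical cache nodes standing in for librarians, where the node that notices a change pushes the update to its peers:

```python
class CacheNode:
    """One librarian's memory; peers are notified when facts change (a sketch)."""
    def __init__(self, name):
        self.name = name
        self.known = {}   # book -> location
        self.peers = []   # other nodes to notify

    def learn(self, book, location):
        self.known[book] = location

    def book_moved(self, book, new_location):
        # The node that notices the move updates itself, then informs its peers.
        self.learn(book, new_location)
        for peer in self.peers:
            peer.learn(book, new_location)

stacks = CacheNode("Stack Supervisor")
reference = CacheNode("Reference Desk")
stacks.peers = [reference]

reference.learn("AI Textbook", "New Arrivals")    # stale knowledge
stacks.book_moved("AI Textbook", "Regular Stacks")  # propagated update
print(reference.known["AI Textbook"])  # "Regular Stacks"
```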
Replacement Strategies
The whisper network has limited memory.
LRU (Least Recently Used)
Forget whispers about books nobody asks for. “Location of 1952 Telephone Directory” forgotten. “Where’s Harry Potter?” remembered.
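A minimal LRU sketch using Python's OrderedDict (the library names are invented for the setting):

```python
from collections import OrderedDict

class WhisperLRU:
    """Keep only the most recently asked-about books (a sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def ask(self, book):
        if book in self.entries:
            self.entries.move_to_end(book)  # recently used: keep it fresh
            return self.entries[book]
        return None

    def remember(self, book, location):
        self.entries[book] = location
        self.entries.move_to_end(book)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # forget the least recently used

cache = WhisperLRU(capacity=2)
cache.remember("1952 Telephone Directory", "Basement")
cache.remember("Harry Potter", "Children's Room")
cache.ask("Harry Potter")                     # refresh its recency
cache.remember("Quantum Physics", "Floor 2")  # evicts the directory
print(cache.ask("1952 Telephone Directory"))  # None: forgotten
```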
LFU (Least Frequently Used)
Quantum Physics locations: Asked 50 times/day (keep). Medieval Farming Techniques: Asked once/month (discard).
TTL (Time To Live)
“New arrivals shelf” expires after 1 week. “In restoration” expires after 1 month. “Permanent collection” never expires.
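A sketch of time-based expiry, with TTL values shortened so the example runs quickly:

```python
import time

class TTLCache:
    """Whispers expire after a while; stale answers must be re-verified (a sketch)."""
    def __init__(self):
        self.entries = {}  # book -> (location, expiry timestamp)

    def remember(self, book, location, ttl_seconds):
        self.entries[book] = (location, time.monotonic() + ttl_seconds)

    def ask(self, book):
        entry = self.entries.get(book)
        if entry is None:
            return None
        location, expires_at = entry
        if time.monotonic() > expires_at:
            del self.entries[book]  # expired: go verify at the source
            return None
        return location

cache = TTLCache()
cache.remember("New AI Textbook", "New Arrivals shelf", ttl_seconds=0.05)
fresh = cache.ask("New AI Textbook")  # within TTL: trusted
time.sleep(0.1)
stale = cache.ask("New AI Textbook")  # TTL passed: must re-verify
print(fresh, stale)
```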
Multi-Factor
Consider:
- How hard to rediscover?
- How often needed?
- How likely to change?
- Storage cost?
Failure Modes
Broken Telephone
Whisper chain: A → B → C → D
Original: “Blue book on third shelf”
Final: “New book on bird self”
Shorter chains and written notes for complex information help.
Cache Poisoning
Malicious or mistaken whispers spread: “Rare books are in the basement” (they’re not). Defense requires verification and trusted sources.
When to Cache
Cache close to where knowledge is needed. Share information across the network. Plan for staleness and failure. Trust but verify.
The next time you find what you’re looking for instantly, remember the whisper network that made it possible.