Embeddings assign numerical coordinates to words and concepts. “Cat” sits near “kitten” and “feline” but far from “airplane.” “Paris” neighbors “France” and “Eiffel Tower” but distances itself from “Tokyo.” Move from “king” toward “woman” and away from “man,” and you arrive at “queen.” This transformation makes language navigable for computers, the same way GPS makes physical space navigable.
Words Without Coordinates
Before embeddings, computers processed words like phone books process names.
Dictionary Order
The obvious approach assigns each word a number:
- Apple = Entry #1,234
- Apples = Entry #1,235
- Application = Entry #1,236
The numbering encodes alphabetical order, not meaning: Apple sits nearly as close to Application as to Apples, even though only one of those pairs is related. This is like organizing a library by book size instead of topic.
One-Hot Encoding
Another approach represents each word as a vector with a single 1:
- Cat = [0,0,0,1,0,0,0,0,0,0…]
- Dog = [0,0,0,0,1,0,0,0,0,0…]
- Kitten = [0,0,0,0,0,0,0,0,1,0…]
Every word sits at the same distance from every other word. This is like placing every location on Earth at equal distances from each other. Navigation becomes meaningless.
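A quick sketch (Python with numpy assumed) shows why one-hot vectors defeat navigation: every pair of distinct words is exactly the same distance apart.

```python
import numpy as np

# Hypothetical 10-word vocabulary; each word is a one-hot vector.
vocab = ["the", "a", "on", "cat", "dog", "sat", "mat", "ran", "kitten", "yarn"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# The Euclidean distance between any two distinct one-hot vectors is sqrt(2).
d_cat_kitten = np.linalg.norm(one_hot["cat"] - one_hot["kitten"])
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
print(d_cat_kitten == d_cat_dog)  # True: "kitten" is no closer to "cat" than "dog" is
```

No matter which two words you pick, the distance never changes, so distance carries no information.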
Embedding Space
Embeddings assign words coordinates in high-dimensional space:
Cat and kitten have similar coordinates. Dog and puppy share a region. Pets cluster together.
Navigating Meaning
Semantic Directions
In embedding space, moving in specific directions carries meaning:
Starting at “king”:
- Move toward “female”: Approach “queen”
- Move toward “young”: Approach “prince”
- Move toward “evil”: Approach “tyrant”
A dimension might capture formality (formal to casual), another sentiment (positive to negative), another concreteness (abstract to concrete).
Distance as Difference
The distance between two words reflects their semantic relationship:
- “Car” to “automobile”: 0.2 units (near-synonyms)
- “Car” to “truck”: 2 units (related vehicles)
- “Car” to “bicycle”: 5 units (different transport modes)
- “Car” to “banana”: 500 units (unrelated)
Distances encode relationships automatically.
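The unit figures above are illustrative; in practice, similarity is usually measured with cosine similarity rather than raw distance. A toy sketch with invented 3-D coordinates (real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy coordinates, invented for illustration only.
words = {
    "car":        np.array([4.0, 1.0, 0.2]),
    "automobile": np.array([4.1, 1.1, 0.3]),
    "truck":      np.array([3.5, 2.0, 0.4]),
    "banana":     np.array([0.1, 0.2, 5.0]),
}

sims = {w: cosine_similarity(words["car"], v) for w, v in words.items() if w != "car"}
# Near-synonyms score highest; unrelated words lowest.
print(sorted(sims, key=sims.get, reverse=True))  # automobile first, banana last
```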
Vector Arithmetic
Classic example:
- Start at “king” coordinates
- Subtract the “man” vector (removes maleness)
- Add the “woman” vector (adds femaleness)
- Result: coordinates near “queen”
This works: king - man + woman ≈ queen. The arithmetic captures analogical relationships.
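The steps above can be sketched with invented toy vectors, chosen so the analogy holds exactly (real embeddings only approximate it):

```python
import numpy as np

# Toy vectors invented for illustration; dimension 2 loosely encodes "maleness".
emb = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the vocabulary word nearest to the arithmetic result.
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # "queen"
```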
Use Cases
Finding Similar Items
Query: “affordable laptop for students”
Embedding search understands:
- “budget” is semantically near “affordable”
- “notebook computer” is near “laptop”
- “college” is near “students”
Returns results like “Budget notebook for college” even without exact word matches.
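A minimal sketch of such a search, with invented document and query vectors standing in for a real embedding model:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document embeddings (invented); a real system would get these from a model.
docs = {
    "Budget notebook for college":  np.array([0.9, 0.8, 0.1]),
    "Gaming desktop, high-end GPU": np.array([0.2, 0.7, 0.9]),
    "Cheap textbooks for students": np.array([0.8, 0.1, 0.2]),
}
query = np.array([0.85, 0.75, 0.15])  # stands in for "affordable laptop for students"

# Rank documents by similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the budget-notebook listing, despite sharing no exact words
```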
Translation
- English “dog” at [2.3, 0.9, -1.1, …]
- Spanish “perro” at [2.2, 0.8, -1.0, …]
- French “chien” at [2.4, 0.9, -1.2, …]
Once the embedding spaces are aligned, translation equivalents occupy nearly identical positions. Translation becomes a coordinate transformation.
Training Embeddings
Learning from Context
Words acquire coordinates by observing their neighbors in sentences:
“The cat sat on the mat”
“The dog sat on the mat”
“The kitten played with yarn”
The system observes cat, dog, and kitten in similar contexts, so they receive nearby coordinates. Across a large corpus, cat and kitten share more of their contexts with each other than with dog, so they end up closer still.
This is like mapping a city by watching traffic patterns.
Training Process
- Start with random coordinates for all words
- Observe co-occurrences: “banking” appears near “financial” frequently
- Adjust: move “banking” coordinates closer to “financial”
- Repeat millions of times
- A coherent semantic map emerges
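The learning-from-context idea can be sketched with a crude count-based embedding (real models like Word2Vec learn dense vectors instead): a word's coordinates are simply counts of the words it co-occurs with, so words that appear in the same sentences get similar vectors.

```python
from collections import Counter
import numpy as np

sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the kitten played with yarn",
]
tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})

def embed(word):
    """Count-based 'embedding': how often each vocabulary word co-occurs with `word`."""
    counts = Counter()
    for sent in tokens:
        if word in sent:
            counts.update(w for w in sent if w != word)
    return np.array([counts[w] for w in vocab], dtype=float)

def dist(a, b):
    return np.linalg.norm(embed(a) - embed(b))

# "cat" and "dog" share identical contexts here, so their vectors coincide;
# "kitten" appears in a different context, so it lands farther away.
print(dist("cat", "dog"), dist("cat", "kitten"))
```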
Embedding Models
Word2Vec: Fast, simple, general purpose. 100-300 dimensions typical.
GloVe: Incorporates global word statistics. Better at capturing analogies.
BERT: Context-sensitive. “Bank” near river differs from “Bank” near financial institution. 768 dimensions typical.
GPT: Predictive. Trained to predict next words. 1000+ dimensions.
Dimensionality
Why 300 Dimensions?
Physical space needs 3 coordinates. Language needs hundreds because meaning has many aspects:
- Dimension 1: concrete vs. abstract
- Dimension 2: positive vs. negative
- Dimension 3: formal vs. casual
- Dimensions 4-300: increasingly subtle distinctions
More dimensions allow finer distinctions without crowding.
The Dimensionality Tradeoff
High dimensions seem problematic but help:
In 3D: limited positions, crowding likely.
In 300D: vast space, nuanced positions possible.
Every word can find its location without crowding neighbors.
Dimension Reduction
For visualization, compress to 2D or 3D using t-SNE or UMAP. Some distortion occurs, but clusters become visible.
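As a dependency-free sketch, PCA (via numpy's SVD) can stand in for t-SNE or UMAP; it projects onto the two directions of greatest variance rather than preserving local clusters, but the mechanics of compressing 300 dimensions to 2 are the same:

```python
import numpy as np

rng = np.random.default_rng(1)
# Pretend 300-dimensional embeddings for five words (random, for illustration).
high_dim = rng.normal(size=(5, 300))

# PCA via SVD: center the data, then project onto the top two principal directions.
centered = high_dim - high_dim.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T

print(coords_2d.shape)  # (5, 2): each word now has plottable 2-D coordinates
```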
Common Problems
Polysemy
“Bank” (financial institution) and “bank” (river edge) share coordinates. Without context, the embedding cannot distinguish them. Contextual embeddings like BERT address this by adjusting positions based on surrounding words.
Rare Words
Uncommon words receive poor coordinates. Few training examples mean unreliable positions. This is the GPS coverage problem in remote areas.
Bias
Training data encodes historical biases:
- “Doctor” closer to “male” than “female”
- “Nurse” shows the reverse
Embeddings capture whatever patterns exist in training data, including unwanted ones.
Out-of-Vocabulary
New words like “COVID-19” have no coordinates. Subword embeddings (breaking words into pieces) partially address this.
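The subword idea can be sketched with fastText-style character trigrams (details simplified for illustration): an unseen word still shares pieces with known words, so it can borrow their coordinates instead of having none at all.

```python
def char_ngrams(word, n=3):
    """Break a word into character trigrams, fastText-style (a sketch)."""
    padded = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# The new word overlaps heavily with a known word's subword pieces.
overlap = char_ngrams("COVID-19") & char_ngrams("COVID")
print(sorted(overlap))
```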
Decision Rules
Use embeddings when:
- You need semantic similarity (not just exact matches)
- You’re building recommendation or search systems
- Cross-lingual understanding matters
Consider alternatives when:
- Exact matching is required (traditional indexes work better)
- Interpretability is critical (embeddings are opaque)
- Training data is scarce (embeddings may not generalize)
Library Book Whisperer (Distributed Caching)
A library maintains an unofficial whisper network. A patron asks about a book, and a librarian remembers: “Sarah at the reference desk has it.” This network bypasses the official catalog, turning hours of searching into seconds of knowing.
Distributed caching works the same way. Between users and the authoritative source, a faster informal network remembers answers to common questions.
The Catalog Approach
Central Library has five floors. To find a book:
- Search the catalog
- Note the call number
- Navigate to the correct floor and section
- Scan shelves
- Discover it’s checked out
- Return to catalog, repeat
Twenty minutes, no book.
With the whisper network:
Patron: “Anyone seen ‘Distributed Systems Design’?”
Librarian: “Tom at the computer section has it on his cart.”
Patron walks directly to Tom, gets the book.
Two minutes.
The whisper network cached Tom’s knowledge.
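In caching terms, this is the cache-aside pattern: ask the fast informal store first, fall back to the slow authoritative source on a miss, and remember the answer. A minimal Python sketch (names and the simulated delay are invented):

```python
import time

def fetch_from_catalog(title):
    """Authoritative but slow lookup (simulated)."""
    time.sleep(0.01)  # stand-in for the twenty-minute catalog search
    return f"{title} is in the stacks, floor 3"

whisper_cache = {}  # the informal network's shared memory

def find_book(title):
    # Check the whisper network first; fall back to the catalog on a miss.
    if title not in whisper_cache:
        whisper_cache[title] = fetch_from_catalog(title)  # remember the answer
    return whisper_cache[title]

find_book("Distributed Systems Design")           # slow: catalog lookup
answer = find_book("Distributed Systems Design")  # fast: cache hit
print(answer)
```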
Cache Levels
Reading Room Caches
Each reading room maintains relevant knowledge:
Science Room tracks which professors have which journals, current physics research requests.
Literature Room knows this week’s book club selections, poetry anthology locations.
Each room caches information relevant to its visitors.
Librarian Network
Librarians form a higher-level cache:
Reference Desk Rachel knows all active research projects, tracks rare book movements.
Circulation Desk Carlos knows reserved books, overdue returns.
Stack Supervisor Susan knows misplaced books, restoration queue.
They share knowledge through informal channels.
Patron Network
Regular patrons become caches too:
Professor Patricia knows all Victorian literature locations, shares with students.
Researcher Robert knows archive organization, helps newcomers.
The network becomes self-organizing.
Cache Validity
Stale Information
Monday: “The new AI textbook is on the New Arrivals shelf”
Tuesday: Book moves to regular stacks
Wednesday: Patron still checking New Arrivals (stale data)
Solutions:
Time-Based Expiry: “As of this morning, it was there”
Event-Based Updates: “I saw them move it an hour ago”
Verification: “Let me double-check… yes, still there”
Update Propagation
When books move:
- Stack Supervisor notices
- Informs Reference Desk
- Reference updates their knowledge
- Tells Reading Room volunteers
- Network knowledge refreshes
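The propagation chain can be sketched with hypothetical cache nodes standing in for librarians, where the node that notices a change pushes the update to its peers:

```python
class CacheNode:
    """One librarian's memory; peers are notified when facts change (a sketch)."""
    def __init__(self, name):
        self.name = name
        self.known = {}   # book -> location
        self.peers = []   # other nodes to notify

    def learn(self, book, location):
        self.known[book] = location

    def book_moved(self, book, new_location):
        # The node that notices the move updates itself, then informs its peers.
        self.learn(book, new_location)
        for peer in self.peers:
            peer.learn(book, new_location)

stacks = CacheNode("Stack Supervisor")
reference = CacheNode("Reference Desk")
stacks.peers = [reference]

reference.learn("AI Textbook", "New Arrivals")    # stale knowledge
stacks.book_moved("AI Textbook", "Regular Stacks")  # propagated update
print(reference.known["AI Textbook"])  # "Regular Stacks"
```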
Replacement Strategies
The whisper network has limited memory.
LRU (Least Recently Used)
Forget whispers about books nobody asks for. “Location of 1952 Telephone Directory” forgotten. “Where’s Harry Potter?” remembered.
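A minimal LRU sketch using Python's OrderedDict (the library names are invented for the setting):

```python
from collections import OrderedDict

class WhisperLRU:
    """Keep only the most recently asked-about books (a sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def ask(self, book):
        if book in self.entries:
            self.entries.move_to_end(book)  # recently used: keep it fresh
            return self.entries[book]
        return None

    def remember(self, book, location):
        self.entries[book] = location
        self.entries.move_to_end(book)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # forget the least recently used

cache = WhisperLRU(capacity=2)
cache.remember("1952 Telephone Directory", "Basement")
cache.remember("Harry Potter", "Children's Room")
cache.ask("Harry Potter")                     # refresh its recency
cache.remember("Quantum Physics", "Floor 2")  # evicts the directory
print(cache.ask("1952 Telephone Directory"))  # None: forgotten
```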
LFU (Least Frequently Used)
Quantum Physics locations: Asked 50 times/day (keep). Medieval Farming Techniques: Asked once/month (discard).
TTL (Time To Live)
“New arrivals shelf” expires after 1 week. “In restoration” expires after 1 month. “Permanent collection” never expires.
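A sketch of time-based expiry, with TTL values shortened so the example runs quickly:

```python
import time

class TTLCache:
    """Whispers expire after a while; stale answers must be re-verified (a sketch)."""
    def __init__(self):
        self.entries = {}  # book -> (location, expiry timestamp)

    def remember(self, book, location, ttl_seconds):
        self.entries[book] = (location, time.monotonic() + ttl_seconds)

    def ask(self, book):
        entry = self.entries.get(book)
        if entry is None:
            return None
        location, expires_at = entry
        if time.monotonic() > expires_at:
            del self.entries[book]  # expired: go verify at the source
            return None
        return location

cache = TTLCache()
cache.remember("New AI Textbook", "New Arrivals shelf", ttl_seconds=0.05)
fresh = cache.ask("New AI Textbook")  # within TTL: trusted
time.sleep(0.1)
stale = cache.ask("New AI Textbook")  # TTL passed: must re-verify
print(fresh, stale)
```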
Multi-Factor
Consider:
- How hard to rediscover?
- How often needed?
- How likely to change?
- Storage cost?
Failure Modes
Broken Telephone
Whisper chain: A → B → C → D
Original: “Blue book on third shelf”
Final: “New book on bird self”
Shorter chains and written notes for complex information help.
Cache Poisoning
Malicious or mistaken whispers spread: “Rare books are in the basement” (they’re not). Defense requires verification and trusted sources.
When to Cache
Cache close to where knowledge is needed. Share information across the network. Plan for staleness and failure. Trust but verify.
The next time you find what you’re looking for instantly, remember the whisper network that made it possible.