Counting attendees at a massive festival poses a dilemma: counting individuals requires massive infrastructure for millions of attendees, while sampling small areas and extrapolating fails when the crowd is unevenly distributed. The drone approach: fly overhead, observe crowd patterns, and estimate the total from a few clever observations. That’s HyperLogLog: a probabilistic algorithm that estimates the cardinality of huge sets in a few kilobytes, using registers whose size grows with the log of the log of the cardinality (hence the name).
The Counting Challenge
Individual counting at scale requires infrastructure: thousands of counters, database storing millions of attendee IDs, prevention of double counting. For 10 million attendees, you need gigabytes of memory and hours of processing time.
Statistical sampling seems smarter: count 1% of the area, find 1,000 people, estimate 100,000 total. But crowds cluster near stages and thin out in the back areas. Sampling fails when the distribution is non-uniform.
The Drone Observer Method
HyperLogLog exploits a mathematical insight about random data.
The Leading Zeros Insight
Hash each person to a random binary ID. Track the maximum number of leading zeros observed.
- Person A: 11010110… (0 leading zeros)
- Person B: 01101011… (1 leading zero)
- Person C: 00110101… (2 leading zeros)
- Person D: 00001101… (4 leading zeros)
A hash with k leading zeros occurs in roughly one of every 2^(k+1) random values, so the maximum k observed across all items indicates the scale of the set.
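A minimal sketch of the insight, assuming SHA-256 as a stand-in for a fast non-cryptographic hash like MurmurHash3 (the hypothetical `person-{i}` keys are just illustrative inputs):

```python
import hashlib

def leading_zeros(h: int, bits: int = 32) -> int:
    """Count leading zeros in a fixed-width binary representation."""
    if h == 0:
        return bits
    return bits - h.bit_length()

def hash32(item: str) -> int:
    # Top 32 bits of a SHA-256 digest, standing in for a fast
    # non-cryptographic hash such as MurmurHash3.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")

# The larger the set, the longer the maximum run of leading zeros
# we are likely to observe across its hashes.
small_max = max(leading_zeros(hash32(f"person-{i}")) for i in range(1_000))
big_max = max(leading_zeros(hash32(f"person-{i}")) for i in range(100_000))
print(small_max, big_max)  # the larger set produces at least as long a run
```

Since the smaller set here is a subset of the larger one, `big_max >= small_max` is guaranteed; across independent sets it only holds in expectation.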
Multi-Drone Approach
One observation is noisy: a single lucky hash can throw the estimate off by orders of magnitude. Instead, split the hash space into many buckets (16,384 by default), each tracking the maximum leading-zero count among the items routed to it. Combine the per-bucket observations with a harmonic mean, which dampens the impact of outlier buckets.
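Putting the bucketing and harmonic mean together, a minimal (illustrative, not production-grade) sketch might look like this; SHA-256 again stands in for a fast hash, and the small-range linear-counting fallback follows the original paper:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for illustration."""

    def __init__(self, p: int = 14):
        self.p = p                      # bucket-index bits
        self.m = 1 << p                 # number of registers (16,384 for p=14)
        self.registers = [0] * self.m
        # Bias-correction constant alpha_m (approximation valid for large m).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash; SHA-256 stands in for a fast hash like MurmurHash3.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        j = h >> (64 - self.p)          # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rho = position of the leftmost 1-bit in the remaining bits.
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self) -> float:
        # Harmonic mean of 2^register across all buckets.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # within a few percent of 100,000
```

The 16,384-register array is all the state there is, regardless of how many items are added.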
Real-World Applications
Website Unique Visitors
Traditional approach: store every IP seen, 100 million unique visitors needs 400MB+ memory, exact count.
HyperLogLog approach: hash each IP, update registers, 16KB memory total, ~2% error. Memory reduced 25,000x.
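The memory figures can be checked with back-of-envelope arithmetic, assuming IPv4 addresses at 4 bytes each and ignoring hash-table overhead (which only widens the gap):

```python
unique_visitors = 100_000_000
exact_bytes = unique_visitors * 4     # store every distinct IPv4 address
hll_bytes = 16 * 1024                 # 16,384 one-byte registers

print(exact_bytes // (1024 * 1024))   # 381 (MB), before any overhead
print(exact_bytes // hll_bytes)       # 24414 -- roughly the 25,000x figure
```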
Database Cardinality
COUNT(DISTINCT user_id) requires full table scan and hash table proportional to cardinality. HyperLogLog sketch streams once, fixed memory usage, good enough for query planning.
Real-time Analytics
Traditional approach: unbounded memory growth as unique items accumulate. Eventually crashes or drops old data.
HyperLogLog: fixed 16KB per metric, handles infinite streams, merges time windows, scales indefinitely.
Mathematics
Probability Distribution
- Probability of 0 leading zeros: 50%
- Probability of 1 leading zero: 25%
- Probability of k leading zeros: 1/2^(k+1)
If you observe k leading zeros maximum, expected cardinality is approximately 2^(k+1).
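The distribution can be verified exactly by enumerating every pattern of a small fixed width; a uniform hash makes all patterns equally likely, so counting suffices:

```python
# Verify P(exactly k leading zeros) = 1/2^(k+1) over all 8-bit patterns.
bits = 8
counts = {}
for x in range(2 ** bits):
    lz = bits - x.bit_length() if x else bits
    counts[lz] = counts.get(lz, 0) + 1

for k in range(4):
    print(k, counts[k] / 2 ** bits)  # 0.5, 0.25, 0.125, 0.0625
```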
Register Array
m registers (typically 2^14 = 16,384), each stores maximum leading zeros seen for its bucket. Total memory ~16KB regardless of cardinality.
Error Bounds
Standard error: 1.04/√m. With 16,384 registers: ~0.81% error. Error is proportional to 1/√m, so quadrupling registers halves the error.
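The error formula is simple enough to evaluate directly, confirming both the 0.81% figure and the quadruple-to-halve relationship:

```python
import math

def standard_error(m: int) -> float:
    """Relative standard error of an HLL estimate with m registers."""
    return 1.04 / math.sqrt(m)

print(round(standard_error(16_384) * 100, 2))      # 0.81 (%)
print(round(standard_error(4 * 16_384) * 100, 2))  # 0.41 (%) -- halved
```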
Advanced Techniques
Bias Correction
The raw estimator overestimates small cardinalities. Use empirical bias-correction tables for the low range (roughly 0-10k with default settings), with a smooth transition to the normal formula above that.
Sparse Representation
Early in a sketch, few registers are non-zero. Store only non-zero registers until dense representation becomes more efficient.
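One way to sketch the idea, with a hypothetical quarter-full switchover threshold (real implementations such as HyperLogLog++ use a byte-budget comparison instead):

```python
class SparseRegisters:
    """Sparse register store: keep only non-zero registers in a dict
    until a dense array becomes cheaper (illustrative threshold)."""

    def __init__(self, m: int = 16_384):
        self.m = m
        self.sparse = {}      # register index -> max leading-zero count
        self.dense = None

    def update(self, index: int, rho: int) -> None:
        if self.dense is not None:
            self.dense[index] = max(self.dense[index], rho)
            return
        if rho > self.sparse.get(index, 0):
            self.sparse[index] = rho
        # A dict entry costs far more than one dense byte; switch over
        # once roughly a quarter of the registers are occupied.
        if len(self.sparse) > self.m // 4:
            self.dense = [0] * self.m
            for i, r in self.sparse.items():
                self.dense[i] = r
            self.sparse = None

regs = SparseRegisters()
regs.update(7, 3)
regs.update(7, 5)
print(regs.sparse)  # {7: 5} -- two updates, one stored register
```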
Merging Sketches
Combine sketches via element-wise maximum of registers. Union is lossless: the merged sketch is identical to one built from the combined streams. Intersection can only be approximated (e.g. via inclusion-exclusion) and is noisier.
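The merge itself is a one-liner; this tiny 4-register example (real sketches use 16,384) shows the element-wise maximum:

```python
def merge(a: list[int], b: list[int]) -> list[int]:
    """Union of two HLL sketches: element-wise maximum of registers."""
    assert len(a) == len(b), "sketches must use the same register count"
    return [max(x, y) for x, y in zip(a, b)]

day1 = [3, 0, 5, 1]
day2 = [2, 4, 1, 1]
print(merge(day1, day2))  # [3, 4, 5, 1]
```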
HyperLogLog++
Google’s improvements: better hash function (64-bit), improved bias correction, sparse representation. Same memory, higher accuracy.
Common Challenges
Hash Function Quality
Bad hash = bad estimates. Sequential IDs poorly hashed produce massive errors. Use MurmurHash3 or similar quality hash functions.
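A quick illustration of the failure mode, comparing a deliberately bad identity "hash" against a well-mixed one (SHA-256 standing in for MurmurHash3):

```python
import hashlib

def leading_zeros32(h: int) -> int:
    return 32 - h.bit_length() if h else 32

# Identity "hash" on sequential IDs: values stay small, so every item
# has a long run of leading zeros and the estimate explodes.
bad = max(leading_zeros32(i) for i in range(1, 10_000))
# A well-mixed hash spreads sequential IDs uniformly across 32 bits.
good = max(
    leading_zeros32(
        int.from_bytes(hashlib.sha256(str(i).encode()).digest()[:4], "big")
    )
    for i in range(1, 10_000)
)
print(bad)   # 31 -- implies ~2^32 distinct items, wildly wrong
print(good)  # a plausible value near log2(10,000) ~ 13
```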
Merge Precision Loss
Merging increases error slightly. Individual exact counts cannot be recovered from union. Design queries for approximate answers.
Cardinality Only
HyperLogLog counts unique items. It cannot tell you which items, or answer “how many red items”. Only the count of distinct values.
Update-Only Nature
Items cannot be removed: a register only records a maximum, so there is no way to know whether other items still justify that value after a deletion. Use time-bucketed sketches for sliding windows instead.
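A hypothetical time-bucketed scheme might keep one small sketch per hour and answer "uniques in the last N hours" by merging register arrays; expired hours are simply dropped, so no per-item deletion is ever needed (4 registers per bucket here for brevity):

```python
from collections import OrderedDict

def merge(a: list[int], b: list[int]) -> list[int]:
    return [max(x, y) for x, y in zip(a, b)]

buckets = OrderedDict()   # hour -> register array
buckets[10] = [3, 0, 5, 1]
buckets[11] = [2, 4, 1, 1]
buckets[12] = [0, 1, 6, 0]

def window(last_n: int) -> list[int]:
    """Union sketch over the most recent last_n hourly buckets."""
    merged = [0, 0, 0, 0]
    for hour in list(buckets)[-last_n:]:
        merged = merge(merged, buckets[hour])
    return merged

print(window(2))  # [2, 4, 6, 1] -- hours 11 and 12 combined
```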
When HyperLogLog Works
HyperLogLog fits:
- Unique counting where exactness isn’t required
- Memory-constrained environments
- Streaming data with infinite cardinality
- Scenarios where 2% error is acceptable
HyperLogLog fails for:
- Applications requiring exact counts
- Very small cardinalities, where bias correction dominates and exact counting is cheap anyway
- When you need the actual items, not just counts
Decision Rules
Use HyperLogLog when:
- You need to count uniques at scale
- 2% error is acceptable
- Memory is constrained
- You’re counting streaming data
Use exact counting when:
- Accuracy must be 100%
- Dataset is small enough for exact counting
- You need to know which items exist
- Regulatory requirements mandate exactness
The drones observe patterns. Maximum leading zeros indicates crowd size. Probabilistic counting becomes practical.