Counting attendees at a massive festival poses a dilemma: counting individuals requires massive infrastructure for millions of attendees, while sampling small areas and extrapolating fails when the crowd is unevenly distributed. The drone approach: fly overhead, observe crowd patterns, and estimate the total from a few clever observations. That’s HyperLogLog: a probabilistic algorithm that estimates the cardinality of huge sets in a few kilobytes, using registers whose size grows with the log of the log of the cardinality (hence the name).
The Counting Challenge
Individual counting at scale requires infrastructure: thousands of counters, database storing millions of attendee IDs, prevention of double counting. For 10 million attendees, you need gigabytes of memory and hours of processing time.
Statistical sampling seems smarter: count 1% of the area, find 1,000 people, estimate 100,000 total. But crowds cluster near stages and thin out in the back areas. Sampling fails when the distribution is non-uniform.
The Drone Observer Method
HyperLogLog exploits a mathematical insight about random data.
The Leading Zeros Insight
Hash each person to a random binary ID. Track the maximum number of leading zeros observed.
- Person A: 11010110… (0 leading zeros)
- Person B: 01101011… (1 leading zero)
- Person C: 00110101… (2 leading zeros)
- Person D: 00001101… (4 leading zeros)
A hash with k leading zeros occurs in roughly one of every 2^(k+1) random values, so the maximum k observed across all items indicates the scale of the set.
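A minimal sketch of the insight, assuming SHA-256 as a stand-in for a fast non-cryptographic hash like MurmurHash3 (the hypothetical `person-{i}` keys are just illustrative inputs):

```python
import hashlib

def leading_zeros(h: int, bits: int = 32) -> int:
    """Count leading zeros in a fixed-width binary representation."""
    if h == 0:
        return bits
    return bits - h.bit_length()

def hash32(item: str) -> int:
    # Top 32 bits of a SHA-256 digest, standing in for a fast
    # non-cryptographic hash such as MurmurHash3.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")

# The larger the set, the longer the maximum run of leading zeros
# we are likely to observe across its hashes.
small_max = max(leading_zeros(hash32(f"person-{i}")) for i in range(1_000))
big_max = max(leading_zeros(hash32(f"person-{i}")) for i in range(100_000))
print(small_max, big_max)  # the larger set produces at least as long a run
```

Since the smaller set here is a subset of the larger one, `big_max >= small_max` is guaranteed; across independent sets it only holds in expectation.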
Multi-Drone Approach
One observation is noisy: a single lucky hash can throw the estimate off by orders of magnitude. Instead, split the hash space into many buckets (16,384 by default), each tracking the maximum leading-zero count among the items routed to it. Combine the per-bucket observations with a harmonic mean, which dampens the impact of outlier buckets.
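Putting the bucketing and harmonic mean together, a minimal (illustrative, not production-grade) sketch might look like this; SHA-256 again stands in for a fast hash, and the small-range linear-counting fallback follows the original paper:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for illustration."""

    def __init__(self, p: int = 14):
        self.p = p                      # bucket-index bits
        self.m = 1 << p                 # number of registers (16,384 for p=14)
        self.registers = [0] * self.m
        # Bias-correction constant alpha_m (approximation valid for large m).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash; SHA-256 stands in for a fast hash like MurmurHash3.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        j = h >> (64 - self.p)          # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rho = position of the leftmost 1-bit in the remaining bits.
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self) -> float:
        # Harmonic mean of 2^register across all buckets.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # within a few percent of 100,000
```

The 16,384-register array is all the state there is, regardless of how many items are added.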
Real-World Applications
Website Unique Visitors
Traditional approach: store every IP seen, 100 million unique visitors needs 400MB+ memory, exact count.
HyperLogLog approach: hash each IP, update registers, 16KB memory total, ~2% error. Memory reduced 25,000x.
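The memory figures can be checked with back-of-envelope arithmetic, assuming IPv4 addresses at 4 bytes each and ignoring hash-table overhead (which only widens the gap):

```python
unique_visitors = 100_000_000
exact_bytes = unique_visitors * 4     # store every distinct IPv4 address
hll_bytes = 16 * 1024                 # 16,384 one-byte registers

print(exact_bytes // (1024 * 1024))   # 381 (MB), before any overhead
print(exact_bytes // hll_bytes)       # 24414 -- roughly the 25,000x figure
```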
Database Cardinality
COUNT(DISTINCT user_id) requires full table scan and hash table proportional to cardinality. HyperLogLog sketch streams once, fixed memory usage, good enough for query planning.
Real-time Analytics
Traditional approach: unbounded memory growth as unique items accumulate. Eventually crashes or drops old data.
HyperLogLog: fixed 16KB per metric, handles infinite streams, merges time windows, scales indefinitely.
Mathematics
Probability Distribution
- Probability of 0 leading zeros: 50%
- Probability of 1 leading zero: 25%
- Probability of k leading zeros: 1/2^(k+1)
If you observe k leading zeros maximum, expected cardinality is approximately 2^(k+1).
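The distribution can be verified exactly by enumerating every pattern of a small fixed width; a uniform hash makes all patterns equally likely, so counting suffices:

```python
# Verify P(exactly k leading zeros) = 1/2^(k+1) over all 8-bit patterns.
bits = 8
counts = {}
for x in range(2 ** bits):
    lz = bits - x.bit_length() if x else bits
    counts[lz] = counts.get(lz, 0) + 1

for k in range(4):
    print(k, counts[k] / 2 ** bits)  # 0.5, 0.25, 0.125, 0.0625
```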
Register Array
m registers (typically 2^14 = 16,384), each stores maximum leading zeros seen for its bucket. Total memory ~16KB regardless of cardinality.
Error Bounds
Standard error: 1.04/√m. With 16,384 registers: ~0.81% error. Error is proportional to 1/√m, so quadrupling registers halves the error.
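The error formula is simple enough to evaluate directly, confirming both the 0.81% figure and the quadruple-to-halve relationship:

```python
import math

def standard_error(m: int) -> float:
    """Relative standard error of an HLL estimate with m registers."""
    return 1.04 / math.sqrt(m)

print(round(standard_error(16_384) * 100, 2))      # 0.81 (%)
print(round(standard_error(4 * 16_384) * 100, 2))  # 0.41 (%) -- halved
```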
Advanced Techniques
Bias Correction
The raw estimator overestimates small cardinalities. Use empirical bias-correction tables for the low range (roughly 0-10k with default settings), with a smooth transition to the normal formula above that.
Sparse Representation
Early in a sketch, few registers are non-zero. Store only non-zero registers until dense representation becomes more efficient.
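One way to sketch the idea, with a hypothetical quarter-full switchover threshold (real implementations such as HyperLogLog++ use a byte-budget comparison instead):

```python
class SparseRegisters:
    """Sparse register store: keep only non-zero registers in a dict
    until a dense array becomes cheaper (illustrative threshold)."""

    def __init__(self, m: int = 16_384):
        self.m = m
        self.sparse = {}      # register index -> max leading-zero count
        self.dense = None

    def update(self, index: int, rho: int) -> None:
        if self.dense is not None:
            self.dense[index] = max(self.dense[index], rho)
            return
        if rho > self.sparse.get(index, 0):
            self.sparse[index] = rho
        # A dict entry costs far more than one dense byte; switch over
        # once roughly a quarter of the registers are occupied.
        if len(self.sparse) > self.m // 4:
            self.dense = [0] * self.m
            for i, r in self.sparse.items():
                self.dense[i] = r
            self.sparse = None

regs = SparseRegisters()
regs.update(7, 3)
regs.update(7, 5)
print(regs.sparse)  # {7: 5} -- two updates, one stored register
```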
Merging Sketches
Combine sketches via element-wise maximum of registers. Union is lossless: the merged sketch is identical to one built from the combined streams. Intersection can only be approximated (e.g. via inclusion-exclusion) and is noisier.
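The merge itself is a one-liner; this tiny 4-register example (real sketches use 16,384) shows the element-wise maximum:

```python
def merge(a: list[int], b: list[int]) -> list[int]:
    """Union of two HLL sketches: element-wise maximum of registers."""
    assert len(a) == len(b), "sketches must use the same register count"
    return [max(x, y) for x, y in zip(a, b)]

day1 = [3, 0, 5, 1]
day2 = [2, 4, 1, 1]
print(merge(day1, day2))  # [3, 4, 5, 1]
```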
HyperLogLog++
Google’s improvements: better hash function (64-bit), improved bias correction, sparse representation. Same memory, higher accuracy.
Common Challenges
Hash Function Quality
Bad hash = bad estimates. Sequential IDs poorly hashed produce massive errors. Use MurmurHash3 or similar quality hash functions.
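A quick illustration of the failure mode, comparing a deliberately bad identity "hash" against a well-mixed one (SHA-256 standing in for MurmurHash3):

```python
import hashlib

def leading_zeros32(h: int) -> int:
    return 32 - h.bit_length() if h else 32

# Identity "hash" on sequential IDs: values stay small, so every item
# has a long run of leading zeros and the estimate explodes.
bad = max(leading_zeros32(i) for i in range(1, 10_000))
# A well-mixed hash spreads sequential IDs uniformly across 32 bits.
good = max(
    leading_zeros32(
        int.from_bytes(hashlib.sha256(str(i).encode()).digest()[:4], "big")
    )
    for i in range(1, 10_000)
)
print(bad)   # 31 -- implies ~2^32 distinct items, wildly wrong
print(good)  # a plausible value near log2(10,000) ~ 13
```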
Merge Precision Loss
Merging increases error slightly. Individual exact counts cannot be recovered from union. Design queries for approximate answers.
Cardinality Only
HyperLogLog counts unique items. It cannot tell you which items, or answer “how many red items”. Only the count of distinct values.
Update-Only Nature
Items cannot be removed: a register only records a maximum, so there is no way to know whether other items still justify that value after a deletion. Use time-bucketed sketches for sliding windows instead.
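A hypothetical time-bucketed scheme might keep one small sketch per hour and answer "uniques in the last N hours" by merging register arrays; expired hours are simply dropped, so no per-item deletion is ever needed (4 registers per bucket here for brevity):

```python
from collections import OrderedDict

def merge(a: list[int], b: list[int]) -> list[int]:
    return [max(x, y) for x, y in zip(a, b)]

buckets = OrderedDict()   # hour -> register array
buckets[10] = [3, 0, 5, 1]
buckets[11] = [2, 4, 1, 1]
buckets[12] = [0, 1, 6, 0]

def window(last_n: int) -> list[int]:
    """Union sketch over the most recent last_n hourly buckets."""
    merged = [0, 0, 0, 0]
    for hour in list(buckets)[-last_n:]:
        merged = merge(merged, buckets[hour])
    return merged

print(window(2))  # [2, 4, 6, 1] -- hours 11 and 12 combined
```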
When HyperLogLog Works
HyperLogLog fits:
- Unique counting where exactness isn’t required
- Memory-constrained environments
- Streaming data with infinite cardinality
- Scenarios where 2% error is acceptable
HyperLogLog fails for:
- Applications requiring exact counts
- Very small cardinalities, where bias correction dominates and exact counting is cheap anyway
- When you need the actual items, not just counts
Decision Rules
Use HyperLogLog when:
- You need to count uniques at scale
- 2% error is acceptable
- Memory is constrained
- You’re counting streaming data
Use exact counting when:
- Accuracy must be 100%
- Dataset is small enough for exact counting
- You need to know which items exist
- Regulatory requirements mandate exactness
The drones observe patterns. Maximum leading zeros indicates crowd size. Probabilistic counting becomes practical.