Column Stores: The Vertical Filing Cabinet

Simor Consulting | 30 May, 2025 | 04 Mins read

Imagine reorganizing an enormous filing cabinet. Instead of keeping complete employee records in manila folders (one folder per person with all their information), you create specialized drawers: one for all salaries, one for all birthdates, one for all phone numbers. Need to calculate average salary? Open just the salary drawer. Want to find everyone born in May? Check only the birthdate drawer. That’s columnar storage - organizing data by attribute rather than by record - and it has changed how analytical systems handle massive datasets.

The Traditional Filing Cabinet

Let’s start with how most of us organize information - the row-based approach:

Employee Folders (Row Storage)

In the traditional office, each employee has their own folder:

Folder: “Alice Anderson”

Name: Alice Anderson
Employee ID: 10001
Department: Engineering
Salary: $95,000
Start Date: 2019-03-15
Phone: 555-0101

Folder: “Bob Baker”

Name: Bob Baker
Employee ID: 10002
Department: Marketing
Salary: $78,000
Start Date: 2020-07-22
Phone: 555-0102

To find Alice’s salary, you pull Alice’s folder and flip to the salary page. Perfect! But what if your boss asks: “What’s our total salary expense?”

Now you must:

Open every single folder
Flip to the salary page
Write down each number
Add them all up

With 10,000 employees, you’re opening 10,000 folders just to read one piece of information from each.

The Columnar Revolution

Now imagine reorganizing everything:

The Vertical Drawers

Instead of employee folders, you have attribute drawers:

Drawer: “Salaries”

10001: $95,000
10002: $78,000
10003: $82,000
10004: $91,000
… (all 10,000 salaries in one place)

Drawer: “Departments”

10001: Engineering
10002: Marketing
10003: Engineering
10004: Sales
… (all departments listed)

Now to calculate total salaries? Open one drawer, sum the numbers. Done in minutes instead of hours.
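Here is a minimal Python sketch of the two layouts, using the sample employees above. The names `rows` and `columns` are illustrative only and do not come from any particular database:

```python
# Row layout: one dict per employee (one "folder" each).
rows = [
    {"id": 10001, "department": "Engineering", "salary": 95_000},
    {"id": 10002, "department": "Marketing", "salary": 78_000},
    {"id": 10003, "department": "Engineering", "salary": 82_000},
    {"id": 10004, "department": "Sales", "salary": 91_000},
]

# Column layout: one list per attribute (one "drawer" each).
# Position i in every list belongs to the same employee.
columns = {
    "id": [r["id"] for r in rows],
    "department": [r["department"] for r in rows],
    "salary": [r["salary"] for r in rows],
}

# Row layout: every record is visited to pull out a single field.
total_from_rows = sum(r["salary"] for r in rows)

# Column layout: only the salary list is read.
total_from_columns = sum(columns["salary"])

print(total_from_rows, total_from_columns)
```

Both totals agree. The difference is how much data has to be touched to get there.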


The Magic of Compression

Here’s where column storage gets interesting. Look at the Department drawer:

Row Storage Nightmare

Record 1: “Engineering” (11 bytes)
Record 2: “Marketing” (9 bytes)
Record 3: “Engineering” (11 bytes)
Record 4: “Engineering” (11 bytes)
Record 5: “Sales” (5 bytes)

Total: 47 bytes, lots of repetition

Column Storage Brilliance

The Department drawer looks like:

Engineering: Records 1,3,4,7,9,12,15,18… (5,000 employees)
Marketing: Records 2,6,8,11,14,17… (2,000 employees)
Sales: Records 5,10,13,16,19… (3,000 employees)

Instead of storing “Engineering” 5,000 times, we store it once with a list of the records that carry it. The fewer distinct values a column has, the bigger the savings.
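The drawer above can be sketched as a mapping from each distinct value to the record numbers where it appears. This is an illustration of the idea, not how any specific engine lays out bytes. Real column stores more commonly reach the same goal with dictionary or run-length encoding. The function name `group_positions` is hypothetical:

```python
from collections import defaultdict

# Department column for the five sample records above (record numbers start at 1).
departments = ["Engineering", "Marketing", "Engineering", "Engineering", "Sales"]

def group_positions(values):
    """Map each distinct value to the list of record numbers where it occurs."""
    positions = defaultdict(list)
    for record_number, value in enumerate(values, start=1):
        positions[value].append(record_number)
    return dict(positions)

grouped = group_positions(departments)
print(grouped)
```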

Real-World Scenarios

Scenario 1: The Annual Salary Review

The CEO wants to know salary distribution:

Row Storage Approach:

Open folder 1, find salary, close folder
Open folder 2, find salary, close folder
Repeat 10,000 times
Finally start calculating statistics

Time: Hours of folder shuffling

Column Storage Approach:

Open salary drawer
All 10,000 salaries right there
Calculate min, max, average, percentiles

Time: Minutes
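As a rough sketch of the column approach, once the salaries sit in one list, the standard library can summarize them directly. The four values are the sample salaries from earlier. The helper `summarize` is a made-up name:

```python
import statistics

# The salary "drawer": one contiguous list of values.
salaries = [95_000, 78_000, 82_000, 91_000]

def summarize(values):
    """Compute basic distribution statistics from a single column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Three cut points that split the data into four equal-sized groups.
        "quartiles": statistics.quantiles(values, n=4),
    }

summary = summarize(salaries)
print(summary)
```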

Scenario 2: The Birthday List

HR needs everyone with May birthdays:

Row Storage:

Check every employee folder
Look at birthdate field
If May, add to list
10,000 folders examined for maybe 800 matches

Column Storage:

Open birthdate drawer
Scan for May dates
Dates sit side by side, with no other fields to flip past
Collect the ~800 matching IDs in a single pass
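A small sketch of that scan follows. The birthdates are invented for illustration, since the sample folders do not list any, and `ids_born_in_month` is a hypothetical helper:

```python
from datetime import date

# Two aligned columns: position i in each list is the same employee.
ids = [10001, 10002, 10003, 10004]
# Hypothetical birthdates, made up for this example.
birthdates = [date(1988, 5, 14), date(1991, 11, 2), date(1985, 5, 30), date(1993, 1, 9)]

def ids_born_in_month(ids, birthdates, month):
    """Scan only the birthdate column and return the IDs that match the month."""
    return [emp_id for emp_id, born in zip(ids, birthdates) if born.month == month]

may_ids = ids_born_in_month(ids, birthdates, 5)
print(may_ids)
```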

Scenario 3: The Emergency Contact Update

Need to find all employees in the 415 area code:

Row Storage:

Open each folder
Check phone number
Pattern matching on 10,000 records
Most folders opened unnecessarily

Column Storage:

Phone number drawer
All phones in one place
Quick pattern match
Only the phone data is read, nothing else
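One way to picture this: the pattern match runs over the phone column alone and yields positions, and those positions are then used to fetch values from any other column. The names and phone numbers below are invented (the sample folders show no area codes), and `matching_positions` is a hypothetical helper:

```python
# Hypothetical sample data, invented for illustration.
names = ["Alice Anderson", "Bob Baker", "Carol Chen", "Dan Diaz"]
phones = ["415-555-0101", "212-555-0102", "415-555-0103", "650-555-0104"]

def matching_positions(column, prefix):
    """Scan one column and return the positions whose value starts with prefix."""
    return [i for i, value in enumerate(column) if value.startswith(prefix)]

positions = matching_positions(phones, "415-")

# Fetch names only for the matching positions; no other column is read.
matched_names = [names[i] for i in positions]
print(matched_names)
```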

Advanced Columnar Techniques

The Smart Secretary (Encoding)

Your secretary notices patterns and creates shortcuts:

Dictionary Encoding: Instead of:

Engineering, Engineering, Engineering, Engineering…

Store:

Dictionary: {1: Engineering, 2: Marketing, 3: Sales}
Data: 1,1,1,2,3,1,1,2…

Each value shrinks to a small integer, and comparing integers is faster than comparing strings.
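A minimal sketch of dictionary encoding, assuming codes are handed out in order of first appearance starting at 1, as in the example above. The function names are illustrative:

```python
def dictionary_encode(values):
    """Return (dictionary, codes); dictionary maps code -> original value."""
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of) + 1  # first new value gets code 1
        codes.append(code_of[value])
    dictionary = {code: value for value, code in code_of.items()}
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

departments = ["Engineering", "Engineering", "Engineering", "Marketing",
               "Sales", "Engineering", "Engineering", "Marketing"]
dictionary, codes = dictionary_encode(departments)
print(dictionary)
print(codes)
```

Decoding with `dictionary_decode(dictionary, codes)` returns the original list, so nothing is lost.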

Run-Length Encoding: For sorted data:

Instead of: 1,1,1,1,1,2,2,2,3,3,3,3,3,3
Store: (1,5), (2,3), (3,6)
Meaning: Five 1s, three 2s, six 3s
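The same example as a short sketch, using `itertools.groupby` from the standard library to find consecutive runs. The function names are illustrative. On unsorted data the runs would be short and the encoding would save little:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the full sequence."""
    return [value for value, count in pairs for _ in range(count)]

codes = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3]
encoded = rle_encode(codes)
print(encoded)
```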

The Index Cards (Metadata)

Each drawer has an index card summarizing contents:

Salary Drawer Index Card:

Count: 10,000 entries
Min: $35,000
Max: $250,000
Average: $87,500
Null values: 0

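Before opening the drawer, you know whether it can contain what you need. A search for salaries above $300,000 can skip this drawer entirely, because the max is $250,000.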
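A sketch of how such an index card might be built and consulted. `ColumnStats`, `build_stats`, and `might_contain` are hypothetical names, and the sketch assumes the column holds at least one non-null value:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    count: int
    minimum: int
    maximum: int
    null_count: int

def build_stats(values):
    """Summarize a column; assumes at least one value is not None."""
    present = [v for v in values if v is not None]
    return ColumnStats(
        count=len(values),
        minimum=min(present),
        maximum=max(present),
        null_count=len(values) - len(present),
    )

def might_contain(stats, low, high):
    """True if a value in [low, high] could exist, judging from min/max alone."""
    return not (high < stats.minimum or low > stats.maximum)

stats = build_stats([35_000, 87_500, 250_000, 95_000])
print(might_contain(stats, 300_000, 400_000))  # range lies above the max
print(might_contain(stats, 90_000, 100_000))   # range overlaps [min, max]
```

A `True` result only means the drawer is worth opening. It does not guarantee a match inside.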

When Columns Shine vs. Struggle

Perfect for Column Storage

Analytics Queries:

“Average salary by department”
“Count employees hired each year”
“Find all phone numbers in area code 212”

Touch only relevant columns, process in bulk.
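For instance, “Average salary by department” needs just two columns. A minimal sketch with the sample employees from earlier, where `average_by` is a made-up helper name:

```python
from collections import defaultdict

# Only the two columns the query mentions are loaded.
departments = ["Engineering", "Marketing", "Engineering", "Sales"]
salaries = [95_000, 78_000, 82_000, 91_000]

def average_by(keys, values):
    """Group values by key and return the mean of each group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, count]
    for key, value in zip(keys, values):
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

averages = average_by(departments, salaries)
print(averages)
```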

Data Warehousing:

Historical analysis
Trend detection
Aggregate calculations

Read-heavy, few updates, predictable patterns.

Time Series Data:

Sensor readings
Stock prices
Website metrics

Append-only, highly compressible, query by time range.

Better with Row Storage

Transaction Processing:

“Update Alice’s complete record”
“Insert new employee Bob”
“Delete employee #10001”

Need all fields together, frequent updates.

Point Lookups:

“Get everything about employee #10001”
“Show John’s complete profile”

Want entire record, not pieces.

Small Datasets:

Under 100,000 records
Frequently accessed in full

Overhead not worth it.

The Hybrid Office

Modern systems often combine approaches:

Hot and Cold Storage

Recent employee records (hot): Row storage for quick access
Historical records (cold): Column storage for analytics

Row Groups

Instead of pure columns, group rows into chunks:

100,000 employees split into 100-row groups
Within each group: columnar storage
Benefits of both approaches
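A toy sketch of the row-group idea: rows are cut into fixed-size chunks, and each chunk is stored column by column. The group size of 2 is only for demonstration, and `to_row_groups` is a hypothetical helper:

```python
rows = [
    {"id": 10001, "department": "Engineering", "salary": 95_000},
    {"id": 10002, "department": "Marketing", "salary": 78_000},
    {"id": 10003, "department": "Engineering", "salary": 82_000},
    {"id": 10004, "department": "Sales", "salary": 91_000},
]

def to_row_groups(rows, group_size):
    """Split rows into chunks, then pivot each chunk into per-column lists."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({name: [row[name] for row in chunk] for name in chunk[0]})
    return groups

groups = to_row_groups(rows, group_size=2)

# An aggregate still reads only the salary lists, one group at a time.
total_salary = sum(sum(group["salary"]) for group in groups)
print(len(groups), total_salary)
```

Because each group holds complete rows, rebuilding one employee's record only needs the columns of a single group.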

Decision Rules

Use columnar storage when:

You’re primarily reading and aggregating data
You frequently need all values from specific columns
Compression ratio matters (highly repetitive data)
You’re building analytics or data warehouse workloads

Use row storage when:

You frequently need complete records
You’re doing point lookups by ID
Writes happen as often as reads
Your data has many unique values per column

Columnar storage transforms how we think about data organization. It trades the convenience of having everything in one place for the efficiency of having everything of one type in one place.

There’s no universal “best” way to organize data. The key is matching your storage strategy to your access patterns. Row storage excels at transactions and complete record retrieval. Column storage dominates analytics and pattern detection.

