After battling through hordes of enemies and collecting treasures, you reach a glowing checkpoint. If you fail now, you restart from the save, not the beginning. That’s checkpointing: periodically saving progress so failures don’t mean starting over.
Without Checkpoints
The Marathon Level
Playing “Distributed Dungeon Crawler”:
- Level spans 3 hours
- Collect 1,000 gold pieces
- Defeat 50 mini-bosses
- One death = restart entire level
Two hours in, you’ve collected 750 gold and defeated 38 bosses. A surprise trap kills you. Everything is lost.
Constant Auto-Save
Every action saved immediately:
- Kill enemy: Save
- Pick up coin: Save
- Take step: Save
Problems: the game stutters constantly, and a crash in the middle of a write can corrupt the save file so the game won’t load at all.
Neither extreme works.
Checkpoint System
Golden Save Points
Glowing pedestals throughout the level:
- After major battles
- Before difficult sections
- At natural break points
- Stand on pedestal → Progress saved → Continue
What Gets Saved
Each checkpoint captures:
- Character position
- Inventory contents
- Health/mana/stamina
- Enemies defeated
- Puzzles solved
- Doors unlocked
Everything needed to restore exactly where you were.
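As a rough sketch of the list above, a checkpoint can be modelled as one record that is serialized in a single step. The `GameState` class, its field names, and the `save_checkpoint` / `load_checkpoint` helpers are all hypothetical, and JSON is just one convenient format, not something the game is known to use.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Tuple


@dataclass
class GameState:
    """Hypothetical record holding everything a checkpoint must capture."""
    position: Tuple[int, int] = (0, 0)
    inventory: Dict[str, int] = field(default_factory=dict)
    health: int = 100
    mana: int = 100
    stamina: int = 100
    enemies_defeated: List[str] = field(default_factory=list)
    puzzles_solved: List[str] = field(default_factory=list)
    doors_unlocked: List[str] = field(default_factory=list)


def save_checkpoint(state: GameState) -> str:
    # Serialize the whole state at once so the checkpoint is self-contained.
    return json.dumps(asdict(state), sort_keys=True)


def load_checkpoint(blob: str) -> GameState:
    data = json.loads(blob)
    # JSON has no tuple type, so the position comes back as a list.
    data["position"] = tuple(data["position"])
    return GameState(**data)
```

Restoring "exactly where you were" then means that `load_checkpoint(save_checkpoint(state)) == state` holds for every state the game can reach.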
Strategies
Periodic Checkpoints
“Save every 10 minutes”
Timer-based, predictable overhead. May save at awkward moments. Simple to implement.
Event-Based
“Save after significant events”
Checkpoint triggers: boss defeated, puzzle solved, area completed. Natural save points aligned with progress.
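The periodic and event-based strategies can be combined in one small trigger. This is a minimal sketch: the `CheckpointTrigger` class and the event names are assumptions made for illustration, and the clock value is passed in so the behaviour is easy to test.

```python
from typing import Optional

# Hypothetical event names that count as natural save points.
SIGNIFICANT_EVENTS = {"boss_defeated", "puzzle_solved", "area_completed"}


class CheckpointTrigger:
    """Decides when to save: on a timer, on a significant event, or both."""

    def __init__(self, interval_seconds: float = 600.0):  # "every 10 minutes"
        self.interval_seconds = interval_seconds
        self.last_saved_at = 0.0

    def should_save(self, now: float, event: Optional[str] = None) -> bool:
        periodic_due = now - self.last_saved_at >= self.interval_seconds
        event_due = event in SIGNIFICANT_EVENTS
        if periodic_due or event_due:
            # Any save resets the timer, so the two strategies do not double up.
            self.last_saved_at = now
            return True
        return False
```

Resetting the timer on event saves is a design choice: it avoids a timer save a few seconds after a boss fight has already produced one.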
Incremental
“Save only what changed”
First checkpoint: Full save. Subsequent: Only differences. Reduces checkpoint size and time.
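One way to picture an incremental checkpoint is as a diff between two flat dictionaries. The helpers `make_delta` and `apply_delta` below are hypothetical names, and real state is usually nested, so treat this as a sketch of the idea only.

```python
_MISSING = object()  # sentinel so a stored None still counts as a real value


def make_delta(previous: dict, current: dict) -> dict:
    """Record only the keys that changed or disappeared since the last save."""
    changed = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    removed = [key for key in previous if key not in current]
    return {"changed": changed, "removed": removed}


def apply_delta(base: dict, delta: dict) -> dict:
    """Rebuild the newer state from an older state plus one delta."""
    restored = dict(base)
    restored.update(delta["changed"])
    for key in delta["removed"]:
        restored.pop(key, None)
    return restored


full_save = {"gold": 100, "health": 80, "rusty_key": 1}
later_state = {"gold": 150, "health": 80}
delta = make_delta(full_save, later_state)
# delta holds only the gold change and the removed key, not the unchanged health.
```

The trade-off is on the restore side: recovery needs the full save plus every delta written since, applied in order.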
Asynchronous
“Save without pausing”
Snapshot state, continue playing, save snapshot in background. No gameplay interruption.
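A minimal sketch of the snapshot-then-write pattern, assuming the state is a plain dictionary that fits in memory and that a short in-memory copy is acceptable. The function name `checkpoint_async` is made up for this example.

```python
import copy
import json
import threading


def checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Copy the state now, then write the copy to disk in the background."""
    # The only pause the player sees is this in-memory copy.
    snapshot = copy.deepcopy(state)

    def write_snapshot() -> None:
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(snapshot, handle)

    worker = threading.Thread(target=write_snapshot)
    worker.start()
    return worker  # callers can join() before exiting or starting another save
```

Because the writer only touches the copy, changes made to `state` after the call do not leak into the file being written.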
Distributed Checkpointing
Stream Processing
Processing millions of events:
Without checkpoints:
- Process 10 million events
- Crash at 9.5 million
- Restart from event 0
With checkpoints:
- Checkpoint every million events
- Crash at 9.5 million
- Restart from 9 million
- Reprocess only 0.5 million events instead of 9.5 million
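The comparison above can be reproduced at a smaller scale. In this sketch the stream is a list of numbers, "processing" means adding them up, and the checkpoint is a dictionary holding the next offset and the running total. The names `run`, `Crash`, and `crash_at` are illustrative assumptions.

```python
class Crash(Exception):
    """Simulated failure partway through the stream."""


def run(events, checkpoint, every, crash_at=None):
    """Process events from the last checkpoint; return how many were handled."""
    position = checkpoint["offset"]
    running_total = checkpoint["total"]
    handled = 0
    while position < len(events):
        if position == crash_at:
            raise Crash(f"failed at event {position}")
        running_total += events[position]
        position += 1
        handled += 1
        if position % every == 0:
            # Offset and result are saved together so they cannot disagree.
            checkpoint.update(offset=position, total=running_total)
    checkpoint.update(offset=position, total=running_total)
    return handled


events = list(range(10_000))
checkpoint = {"offset": 0, "total": 0}
try:
    run(events, checkpoint, every=1_000, crash_at=9_500)
except Crash:
    pass  # the checkpoint still points at event 9,000
reprocessed = run(events, checkpoint, every=1_000)  # resumes from 9,000
```

After the restart only the last 1,000 events are handled, of which 500 are repeats, and the final total matches a run that never crashed.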
Multi-Node Coordination
Barrier checkpoint: all nodes reach the save point, the game pauses briefly, everyone saves at the same time, and play resumes.
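Chandy-Lamport algorithm: one node initiates by saving its own state and sending a marker to every other node. Each node saves its state when it receives its first marker and sends out markers of its own. Messages that arrive on a channel after the node has saved, but before that channel's marker, were in flight and are recorded too. Together, the saved states and recorded messages form a globally consistent snapshot.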
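The marker-based snapshot is easier to follow in a toy simulation. The sketch below runs in a single process, assumes reliable FIFO channels (which the algorithm requires), and uses gold transfers as the only kind of message. The `Node` class and every method name are hypothetical.

```python
from collections import deque

MARKER = "MARKER"  # control message, distinct from the integer gold transfers


class Node:
    def __init__(self, name, gold):
        self.name = name
        self.gold = gold
        self.peers = {}         # peer name -> Node
        self.inbox = {}         # sender name -> FIFO channel into this node
        self.saved_gold = None  # local state captured by the snapshot
        self.recording = {}     # sender name -> messages seen since saving
        self.in_flight = {}     # sender name -> final recorded channel contents

    def send(self, to, amount):
        self.gold -= amount
        self.peers[to].inbox[self.name].append(amount)

    def save_and_send_markers(self):
        self.saved_gold = self.gold
        self.recording = {sender: [] for sender in self.inbox}
        for peer in self.peers.values():
            peer.inbox[self.name].append(MARKER)

    def receive(self, sender):
        message = self.inbox[sender].popleft()
        if message == MARKER:
            if self.saved_gold is None:  # first marker seen: save now
                self.save_and_send_markers()
            # The marker closes this channel; what was recorded was in flight.
            self.in_flight[sender] = self.recording.pop(sender)
        else:
            self.gold += message
            if sender in self.recording:  # saved already, marker not yet seen
                self.recording[sender].append(message)


def connect(nodes):
    for a in nodes:
        for b in nodes:
            if a is not b:
                a.peers[b.name] = b
                a.inbox[b.name] = deque()


def snapshot_total(nodes):
    saved = sum(node.saved_gold for node in nodes)
    moving = sum(sum(msgs) for node in nodes for msgs in node.in_flight.values())
    return saved + moving


a, b = Node("A", 100), Node("B", 50)
connect([a, b])
a.send("B", 10)            # in flight from A to B
b.send("A", 5)             # in flight from B to A
a.save_and_send_markers()  # A initiates and saves 90 gold
a.receive("B")             # the 5 arrives after A saved, so it is recorded
b.receive("A")             # the 10 arrives before B has saved
b.receive("A")             # marker: B saves 55 and sends its own marker
a.receive("B")             # marker: channel B -> A closes with [5] recorded
print(snapshot_total([a, b]))  # 150
```

The snapshot accounts for all 150 gold even though the two nodes saved at different moments: 90 at A, 55 at B, and 5 recorded as in flight.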
Common Problems
Checkpoint Storm
Too frequent:
- Save every 10 seconds
- 90% of the time spent saving, 10% playing
- Progress grinds to a halt
Checkpoint Gap
Too infrequent:
- Save every 2 hours
- Failure loses massive progress
- Players frustrated
Inconsistent Checkpoint
Partial save during crash:
- Position saved
- Inventory not saved
- State corrupted
- Cannot restore
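A common guard against partial saves is to write the whole checkpoint to a temporary file and then rename it over the old one, so a reader sees either the previous save or the new one, never a mix. This sketch assumes a local filesystem where `os.replace` is atomic; `save_atomically` is a name invented here.

```python
import json
import os
import tempfile


def save_atomically(state: dict, path: str) -> None:
    """Write the full checkpoint to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # old save stays intact until this succeeds
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Position and inventory go into the same file in the same write, so a crash mid-save leaves the previous checkpoint untouched instead of a half-updated one.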
Decision Rules
Test your checkpoints: simulate failures, restore from checkpoints, verify completeness, measure recovery time.
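A recovery drill can be as small as a round trip with a stopwatch. The sketch below assumes JSON as the checkpoint format and an in-memory dictionary as the state; `recovery_drill` is a hypothetical helper, and a real drill would restore from the same storage used in production.

```python
import json
import time


def recovery_drill(state: dict) -> float:
    """Save, 'crash', restore, verify, and return the restore time in seconds."""
    blob = json.dumps(state)       # the checkpoint taken before the failure
    started = time.perf_counter()
    restored = json.loads(blob)    # what a restart would actually see
    elapsed = time.perf_counter() - started
    if restored != state:
        raise AssertionError("restored state differs from the saved state")
    return elapsed
```

Drills like this tend to catch quiet losses. For example, a position stored as a tuple comes back from JSON as a list, so the comparison fails until the restore path converts it.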
Balance frequency with cost: too many checkpoints waste resources, too few risk massive reprocessing on failure.
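The trade-off can be put into numbers with a simple cost model that is not from this article: assume each save costs a fixed number of seconds, failures arrive on average once per `mtbf` seconds, and a failure loses about half an interval of work. Minimizing the wasted time under those assumptions gives the well-known first-order approximation of the square root of 2 × save cost × mean time between failures. All function and parameter names are illustrative.

```python
import math


def wasted_fraction(interval: float, save_cost: float, mtbf: float) -> float:
    """Approximate share of time lost to saving plus redoing work after failures."""
    saving_overhead = save_cost / interval      # paid once per interval
    expected_rework = interval / (2 * mtbf)     # about half an interval per failure
    return saving_overhead + expected_rework


def best_interval(save_cost: float, mtbf: float) -> float:
    """Interval that minimizes wasted_fraction under the model above."""
    return math.sqrt(2 * save_cost * mtbf)


# Example numbers (assumed): a 2-second save and one failure per hour on average.
interval = best_interval(save_cost=2.0, mtbf=3600.0)  # about 120 seconds
```

With these example numbers, saving every 10 seconds wastes roughly 20% of the time on saves alone, while saving every 2 hours makes the expected rework dominate. Real systems should measure both costs rather than rely on the formula.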
Clean up old checkpoints: keep recent ones, archive older ones, delete ancient ones, monitor storage.
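A retention policy like this can be expressed as a small planning function. The thresholds (7 and 30 days) and the name `plan_cleanup` are assumptions chosen for the example; the function only decides what should happen and leaves the actual archiving and deleting to the caller.

```python
def plan_cleanup(ages_in_days: dict, keep_days: int = 7, archive_days: int = 30) -> dict:
    """Sort checkpoints into keep / archive / delete buckets by age."""
    plan = {"keep": [], "archive": [], "delete": []}
    for name, age in ages_in_days.items():
        if age <= keep_days:
            plan["keep"].append(name)       # recent: stays on fast storage
        elif age <= archive_days:
            plan["archive"].append(name)    # older: move to cheaper storage
        else:
            plan["delete"].append(name)     # ancient: safe to remove
    return plan


plan = plan_cleanup({"ckpt-001": 45, "ckpt-014": 12, "ckpt-020": 1})
# ckpt-020 is kept, ckpt-014 is archived, ckpt-001 is deleted.
```

With incremental checkpoints, a delta cannot be deleted while a newer checkpoint still depends on it, so cleanup has to respect that chain.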
The art is finding the balance: not so often that you spend all your time saving, not so rarely that a failure hurts badly.