The harbormaster watches a gauge showing tide level. Ships can only depart when the tide rises above their draft mark. Some arrive on time, others are delayed by storms, a few drift in days late.
When has the tide “risen enough” to release the waiting fleet? Wait too long, ships sit idle. Release too early, late arrivals miss their departure window.
That’s watermarks in stream processing: tracking the “tide of time” in your data stream, deciding when it’s safe to process what you’ve collected.
The Problem
Anarchist Approach
Ships process immediately upon arrival:
- 9:00 AM ship: Departs immediately
- 8:45 AM ship (arrives at 9:15): Departs immediately
- 9:30 AM ship: Departs immediately
Results: Morning summary at 9:00 misses the 8:30 and 8:45 ships. Cargo counts wrong. Schedule meaningless.
Eternal Wait
Wait forever for stragglers:
- “We can’t close the 8:00 hour—what if an 8:15 ship is still coming?”
- “Last month’s report? Can’t finish it—ships might still arrive.”
Nothing ever completes. The harbor clogs.
Watermark System
The Time Tide
The watermark represents: “We believe we’ve seen all events up to this time.”
This diagram requires JavaScript.
Enable JavaScript in your browser to use this feature.
How Watermarks Rise
Ships report their event time:
- 9:00 AM: Ship with 8:45 timestamp arrives
- Watermark considers rising to 8:45
- But experience shows ships can be 15 minutes late
- Watermark rises to 8:30 (allowing grace period)
Controlled flooding: the tide rises steadily but cautiously.
Strategies
Fixed Delay
“Watermark always trails by 10 minutes”
- Current max event time: 9:00 AM
- Watermark: 8:50 AM
- Fixed 10-minute buffer
Simple, predictable. But inflexible, may be too much or too little.
Percentile-Based
“Watermark at 99th percentile of lateness”
- Track arrival delays
- 99% of ships: Less than 15 minutes late
- 1% outliers: Up to 2 hours late
- Set watermark: 15 minutes behind
Data-driven, adaptive. Always loses 1% of data.
Dynamic Adjustment
“Watermark adapts to conditions”
- Rush hour: Increase delay
- Night time: Reduce delay
- Storms: Increase significantly
Optimal for conditions. Complex, needs tuning.
Advanced Concepts
Allowed Lateness
Grace period after watermark passes:
- Watermark passes 9:00 AM
- Window “closes”
- But accept late data for 30 minutes
- Update results if needed
- After grace period: Drop data
Side Outputs
Ship arrives after window closed:
- Don’t drop silently
- Route to “late arrivals” dock
- Process separately
- Generate “late data” reports
Never lose data, segregate it.
Watermark Alignment
Multiple streams converging:
- Stream A watermark: 9:00 AM
- Stream B watermark: 8:45 AM
- Stream C watermark: 9:15 AM
Join watermark = minimum = 8:45 AM. Wait for slowest stream.
Common Problems
Straggler Ship
One ship always hours late:
- Holds back watermark
- Delays all processing
- Other ships wait unnecessarily
Solutions: Timeout for sources, ignore after threshold.
Watermark Plateau
Watermark stops advancing:
- One source stops sending
- All sources idle
- Processing halts
Solutions: Idle timeout, heartbeat events, watermark forcing.
Decision Rules
Start conservative: generous buffers, monitor actual lateness, tighten gradually.
Plan for late data: design for updates, track late arrivals, adjust strategy.
The key insight: We can’t wait forever, but we can wait intelligently. Watermarks codify our patience.