The case for streaming is straightforward: data that arrives in minutes instead of hours enables decisions that were previously impossible. Fraud detection catches transactions before they clear. Personalization engines react to user behavior in the same session. Operational dashboards reflect current state, not yesterday’s state. The engineering value is real.
The migration from batch to streaming is where teams get hurt. The typical approach is a big-bang cutover: build the streaming pipeline in parallel, test it against the batch pipeline, and flip the switch on a scheduled weekend. The weekend comes. The streaming pipeline handles 80% of the use cases correctly. The other 20% surface edge cases that only appear at production volume. The team spends the next three weeks fixing streaming bugs while the batch pipeline runs as a fallback, and the promised benefits of streaming recede further into the future.
A phased migration avoids this. Instead of cutting over all at once, you migrate one property at a time — starting with the properties that have the least operational risk and the most learning value. Each phase builds confidence and surfaces problems while the batch pipeline continues to handle production traffic.
Prerequisites
You need a streaming platform selected and provisioned. Kafka, Pulsar, Kinesis, or Pub/Sub — the choice depends on your existing infrastructure and team expertise. Do not start the migration until the platform is running and your team has basic operational familiarity with it: producing messages, consuming messages, monitoring lag, and handling consumer group rebalancing.
You need a clear map of your batch pipeline: what data sources feed it, what transformations it applies, what outputs it produces, and what downstream systems depend on those outputs. If you do not have this map, document it before starting. A migration without a dependency map is a migration that breaks things you did not know existed.
You need a dual-write or dual-read capability in your source systems. The migration strategy requires running batch and streaming in parallel for an extended period. Your source systems must be able to emit data to both pipelines simultaneously, or your streaming pipeline must be able to read from the same sources the batch pipeline uses.
Phase 1: Shadow mode (weeks 1-4)
Build the streaming pipeline to process the same inputs as the batch pipeline, but do not let it produce outputs that any downstream system consumes. It writes to its own output tables or topics. Nobody depends on it.
The purpose of shadow mode is to validate three things: does the streaming pipeline produce the same results as the batch pipeline, can it keep up with the data volume, and where does it diverge.
Run both pipelines on the same data. Compare outputs. The comparison is not a simple equality check — batch and streaming may produce the same results in different orders, at different times, with different metadata. Build a reconciliation process that normalizes both outputs and compares them on the business-meaningful dimensions: row counts, aggregate values, and per-record correctness.
Log every divergence. Categorize them: timing differences (expected), aggregation differences (investigate), and missing or duplicated records (fix before proceeding). Do not leave Phase 1 until the divergence rate is below your tolerance threshold. For most systems, that threshold is less than 0.1% of records affected by any divergence category.
This diagram requires JavaScript.
Enable JavaScript in your browser to use this feature.
Phase 2: Read migration (weeks 5-8)
Select one downstream consumer that reads from the batch output. Modify it to read from the streaming output instead. The batch pipeline continues to run and produce its output. Other consumers continue reading from the batch output. Only this one consumer has moved.
This phase tests whether your streaming output meets the interface requirements of downstream systems. Batch outputs often have properties that downstream systems implicitly depend on: complete data for a time window, sorted ordering, deduplication guarantees. Streaming outputs may not have these properties by default. Phase 2 surfaces these implicit dependencies.
Monitor the migrated consumer closely. Track error rates, latency, and data freshness compared to when it read from the batch output. Run the reconciliation process from Phase 1 continuously — it now serves as a safety net for the live consumer.
Keep the batch output available as a fallback. If the streaming consumer encounters an error it cannot handle, it should be able to fall back to the batch output with a configurable switch. This fallback capability is critical in Phase 2 and Phase 3.
Duration: run this phase for at least two weeks with the migrated consumer handling real traffic. One week is not enough to encounter the edge cases that appear at different times of day, different days of the week, and different data volumes.
Phase 3: Incremental consumer migration (weeks 9-16)
Migrate downstream consumers one at a time, starting with the consumers that have the lowest risk and the least complex data requirements.
Order your consumers by migration risk:
- Low risk: Consumers that use data for reporting, analytics, or dashboards. They can tolerate minor discrepancies and are not customer-facing.
- Medium risk: Consumers that feed operational systems but have human review in the loop. A discrepancy is caught and corrected manually.
- High risk: Consumers that feed automated decisions. A discrepancy causes incorrect actions. Migrate these last.
For each consumer, follow the same process as Phase 2: modify the consumer to read from streaming, monitor closely, keep the batch fallback available. Budget one to two weeks per consumer depending on complexity.
After each migration, evaluate whether the streaming pipeline needs adjustment. Consumer-specific requirements often reveal streaming pipeline gaps that the shadow-mode reconciliation did not catch. Fix these gaps before migrating the next consumer.
Phase 4: Batch pipeline decommission (weeks 17-20)
All consumers have migrated to streaming. The batch pipeline is still running but no downstream system reads from it. This phase decommissions the batch pipeline in three steps.
Step one: stop the batch pipeline from producing outputs. Keep the pipeline code and configuration, but disable the scheduled runs. Monitor for one week. If any consumer breaks, you missed a dependency. Re-enable the batch pipeline, identify the missed consumer, migrate it, and try again.
Step two: after one week with no batch runs and no consumer failures, remove the batch pipeline’s data outputs. Archive them according to your data retention policy. Verify that no monitoring, alerting, or reporting depends on the batch output paths.
Step three: after another week with no issues, remove the batch pipeline code from your active repository. Archive it in a separate repository or a historical branch. The code should be recoverable but not part of your active codebase.
The total decommission process takes four weeks: one week per step plus a buffer. Do not rush it. The most common batch-to-streaming rollback scenario is a consumer dependency that surfaces two weeks after the batch pipeline was disabled.
Phase 5: Streaming optimization (weeks 21+)
With the batch pipeline gone, the streaming pipeline is now the sole data path. This phase focuses on optimizing the streaming pipeline for properties that were not critical during migration but are critical for long-term operations.
Backpressure handling. The migration phases tested the streaming pipeline under normal load. Optimize for abnormal load: what happens when a source system produces a burst of events? Does the pipeline slow down gracefully, or does it drop events? Implement backpressure mechanisms — consumer throttling, buffer sizing, and overflow routing.
Exactly-once semantics. During migration, at-least-once delivery with deduplication was sufficient. For long-term operations, evaluate whether your pipeline needs exactly-once semantics. This depends on whether your downstream consumers are idempotent. If they are, at-least-once is fine. If they are not, invest in exactly-once delivery or make consumers idempotent.
Schema evolution. Streaming pipelines handle schema changes differently than batch pipelines. A batch pipeline can fail and be re-run after a schema fix. A streaming pipeline that encounters an unexpected schema change may drop events or poison its processing state. Implement a schema registry with compatibility checks and dead-letter queues for schema violations.
Monitoring maturity. Upgrade your monitoring from migration-level (is the pipeline running?) to operational-level (is the pipeline healthy?). Key metrics: consumer lag, processing latency percentiles, event drop rate, and end-to-end freshness. Set alerts on each metric at production-appropriate thresholds.
Common failure modes
Migrating high-risk consumers first. The temptation is to migrate the consumer that would benefit most from streaming freshness. Resist this. Migrate low-risk consumers first to build operational confidence. High-risk consumers migrate last, when the team has experience with streaming failure modes and the pipeline has been hardened by earlier migrations.
No reconciliation after Phase 1. Teams often build reconciliation for shadow mode and then abandon it when they start migrating consumers. Keep reconciliation running until Phase 4 is complete. It is your safety net, and removing it early is how you discover data discrepancies weeks after they started.
Skipping Phase 5. The migration is “done” when all consumers are on streaming. But the streaming pipeline was built for migration, not for long-term operations. Without Phase 5 optimization, the pipeline accumulates technical debt that causes operational incidents within six months.
Ignoring backpressure. Batch pipelines have natural backpressure — they process a fixed window and stop. Streaming pipelines process continuously. Without explicit backpressure handling, a source system burst can cascade through the pipeline and overwhelm downstream consumers. Test backpressure behavior explicitly.
Next step
If you are running batch pipelines today, start by building the dependency map. List every batch job, every output, every downstream consumer, and the business impact of each output being delayed or incorrect. This map determines your migration order. Do not start building the streaming pipeline until the map is complete and reviewed by the teams that own the downstream consumers.