Migration playbook: batch to streaming in 5 phases

Simor Consulting | 31 May, 2026 | 06 Mins read

The case for streaming is straightforward: data that arrives in minutes instead of hours enables decisions that were previously impossible. Fraud detection catches transactions before they clear. Personalization engines react to user behavior in the same session. Operational dashboards reflect current state, not yesterday’s state. The engineering value is real.

The migration from batch to streaming is where teams get hurt. The typical approach is a big-bang cutover: build the streaming pipeline in parallel, test it against the batch pipeline, and flip the switch on a scheduled weekend. The weekend comes. The streaming pipeline handles 80% of the use cases correctly. The other 20% surface edge cases that only appear at production volume. The team spends the next three weeks fixing streaming bugs while the batch pipeline runs as a fallback, and the promised benefits of streaming recede further into the future.

A phased migration avoids this. Instead of cutting over all at once, you migrate one property at a time — starting with the properties that have the least operational risk and the most learning value. Each phase builds confidence and surfaces problems while the batch pipeline continues to handle production traffic.

Prerequisites

You need a streaming platform selected and provisioned. Kafka, Pulsar, Kinesis, or Pub/Sub — the choice depends on your existing infrastructure and team expertise. Do not start the migration until the platform is running and your team has basic operational familiarity with it: producing messages, consuming messages, monitoring lag, and handling consumer group rebalancing.

You need a clear map of your batch pipeline: what data sources feed it, what transformations it applies, what outputs it produces, and what downstream systems depend on those outputs. If you do not have this map, document it before starting. A migration without a dependency map is a migration that breaks things you did not know existed.

You need a dual-write or dual-read capability in your source systems. The migration strategy requires running batch and streaming in parallel for an extended period. Your source systems must be able to emit data to both pipelines simultaneously, or your streaming pipeline must be able to read from the same sources the batch pipeline uses.

Phase 1: Shadow mode (weeks 1-4)

Build the streaming pipeline to process the same inputs as the batch pipeline, but do not let it produce outputs that any downstream system consumes. It writes to its own output tables or topics. Nobody depends on it.

The purpose of shadow mode is to validate three things: does the streaming pipeline produce the same results as the batch pipeline, can it keep up with the data volume, and where does it diverge.

Run both pipelines on the same data. Compare outputs. The comparison is not a simple equality check — batch and streaming may produce the same results in different orders, at different times, with different metadata. Build a reconciliation process that normalizes both outputs and compares them on the business-meaningful dimensions: row counts, aggregate values, and per-record correctness.

Log every divergence. Categorize them: timing differences (expected), aggregation differences (investigate), and missing or duplicated records (fix before proceeding). Do not leave Phase 1 until the divergence rate is below your tolerance threshold. For most systems, that threshold is less than 0.1% of records affected by any divergence category.

This diagram requires JavaScript.

Enable JavaScript in your browser to use this feature.

Phase 2: Read migration (weeks 5-8)

Select one downstream consumer that reads from the batch output. Modify it to read from the streaming output instead. The batch pipeline continues to run and produce its output. Other consumers continue reading from the batch output. Only this one consumer has moved.

This phase tests whether your streaming output meets the interface requirements of downstream systems. Batch outputs often have properties that downstream systems implicitly depend on: complete data for a time window, sorted ordering, deduplication guarantees. Streaming outputs may not have these properties by default. Phase 2 surfaces these implicit dependencies.

Monitor the migrated consumer closely. Track error rates, latency, and data freshness compared to when it read from the batch output. Run the reconciliation process from Phase 1 continuously — it now serves as a safety net for the live consumer.

Keep the batch output available as a fallback. If the streaming consumer encounters an error it cannot handle, it should be able to fall back to the batch output with a configurable switch. This fallback capability is critical in Phase 2 and Phase 3.

Duration: run this phase for at least two weeks with the migrated consumer handling real traffic. One week is not enough to encounter the edge cases that appear at different times of day, different days of the week, and different data volumes.

Phase 3: Incremental consumer migration (weeks 9-16)

Migrate downstream consumers one at a time, starting with the consumers that have the lowest risk and the least complex data requirements.

Order your consumers by migration risk:

Low risk: Consumers that use data for reporting, analytics, or dashboards. They can tolerate minor discrepancies and are not customer-facing.
Medium risk: Consumers that feed operational systems but have human review in the loop. A discrepancy is caught and corrected manually.
High risk: Consumers that feed automated decisions. A discrepancy causes incorrect actions. Migrate these last.

For each consumer, follow the same process as Phase 2: modify the consumer to read from streaming, monitor closely, keep the batch fallback available. Budget one to two weeks per consumer depending on complexity.

After each migration, evaluate whether the streaming pipeline needs adjustment. Consumer-specific requirements often reveal streaming pipeline gaps that the shadow-mode reconciliation did not catch. Fix these gaps before migrating the next consumer.

Phase 4: Batch pipeline decommission (weeks 17-20)

All consumers have migrated to streaming. The batch pipeline is still running but no downstream system reads from it. This phase decommissions the batch pipeline in three steps.

Step one: stop the batch pipeline from producing outputs. Keep the pipeline code and configuration, but disable the scheduled runs. Monitor for one week. If any consumer breaks, you missed a dependency. Re-enable the batch pipeline, identify the missed consumer, migrate it, and try again.

Step two: after one week with no batch runs and no consumer failures, remove the batch pipeline’s data outputs. Archive them according to your data retention policy. Verify that no monitoring, alerting, or reporting depends on the batch output paths.

Step three: after another week with no issues, remove the batch pipeline code from your active repository. Archive it in a separate repository or a historical branch. The code should be recoverable but not part of your active codebase.

The total decommission process takes four weeks: one week per step plus a buffer. Do not rush it. The most common batch-to-streaming rollback scenario is a consumer dependency that surfaces two weeks after the batch pipeline was disabled.

Phase 5: Streaming optimization (weeks 21+)

With the batch pipeline gone, the streaming pipeline is now the sole data path. This phase focuses on optimizing the streaming pipeline for properties that were not critical during migration but are critical for long-term operations.

Backpressure handling. The migration phases tested the streaming pipeline under normal load. Optimize for abnormal load: what happens when a source system produces a burst of events? Does the pipeline slow down gracefully, or does it drop events? Implement backpressure mechanisms — consumer throttling, buffer sizing, and overflow routing.

Exactly-once semantics. During migration, at-least-once delivery with deduplication was sufficient. For long-term operations, evaluate whether your pipeline needs exactly-once semantics. This depends on whether your downstream consumers are idempotent. If they are, at-least-once is fine. If they are not, invest in exactly-once delivery or make consumers idempotent.

Schema evolution. Streaming pipelines handle schema changes differently than batch pipelines. A batch pipeline can fail and be re-run after a schema fix. A streaming pipeline that encounters an unexpected schema change may drop events or poison its processing state. Implement a schema registry with compatibility checks and dead-letter queues for schema violations.

Monitoring maturity. Upgrade your monitoring from migration-level (is the pipeline running?) to operational-level (is the pipeline healthy?). Key metrics: consumer lag, processing latency percentiles, event drop rate, and end-to-end freshness. Set alerts on each metric at production-appropriate thresholds.

Common failure modes

Migrating high-risk consumers first. The temptation is to migrate the consumer that would benefit most from streaming freshness. Resist this. Migrate low-risk consumers first to build operational confidence. High-risk consumers migrate last, when the team has experience with streaming failure modes and the pipeline has been hardened by earlier migrations.

No reconciliation after Phase 1. Teams often build reconciliation for shadow mode and then abandon it when they start migrating consumers. Keep reconciliation running until Phase 4 is complete. It is your safety net, and removing it early is how you discover data discrepancies weeks after they started.

Skipping Phase 5. The migration is “done” when all consumers are on streaming. But the streaming pipeline was built for migration, not for long-term operations. Without Phase 5 optimization, the pipeline accumulates technical debt that causes operational incidents within six months.

Ignoring backpressure. Batch pipelines have natural backpressure — they process a fixed window and stop. Streaming pipelines process continuously. Without explicit backpressure handling, a source system burst can cascade through the pipeline and overwhelm downstream consumers. Test backpressure behavior explicitly.

Next step

If you are running batch pipelines today, start by building the dependency map. List every batch job, every output, every downstream consumer, and the business impact of each output being delayed or incorrect. This map determines your migration order. Do not start building the streaming pipeline until the map is complete and reviewed by the teams that own the downstream consumers.

Shipping a production AI system?

Find the control gaps before they turn into incidents. Take the AI Production Scorecard for a fast baseline across the seven layers, or book an architecture review and we will turn it into a hardening plan.

Take the AI Production Scorecard Book an Architecture Review

This comment section requires JavaScript.

Enable JavaScript in your browser to use this feature.

Similar Articles

Data Engineering Operations

Legacy Data Pipeline Modernization Without Rewriting Everything

10 Jul, 2026 | 07 Mins read

The pipeline runs every night at 2 a.m. Nobody fully understands it. The original author left in 2019. It is part SAS, part shell, part stored procedures, and part a spreadsheet someone emails in. It

AI Enablement Operations

5 AI Workflows Professional Services Firms Can Deploy This Quarter

10 Jul, 2026 | 09 Mins read

Professional services firms sell judgment, billed by the hour or by the matter. That makes them both the biggest winners and the most cautious adopters of AI. The upside is real: every firm carries ho

Data Engineering AI Infrastructure

Building AI-Ready Data Pipelines: Key Architecture Considerations

04 Mar, 2025 | 02 Mins read

Data pipelines built for business intelligence often fail when supporting AI workloads. The root cause is usually architectural: BI pipelines assume bounded, relatively static datasets, while AI syste

AI Infrastructure Operations

Lightweight MLOps for Mid-Market Teams: Ship Models Without a Platform Engineering Org

10 Jul, 2026 | 11 Mins read

A head of ML at a 120-person company told us recently that his team had spent nine months trying to stand up a "proper MLOps platform." They had evaluated three orchestration tools, designed a feature

AI Governance Operations

Anatomy of an AI Incident: Post-Mortem of a Model Provider Outage

19 Jun, 2026 | 09 Mins read

On a Tuesday at 2:14 PM, a major model provider began returning elevated error rates for a specific model endpoint. By 2:31 PM, a customer support platform that depended on that endpoint was producing

AI Infrastructure Operations

AI Rollback Patterns: When to Roll Back a Prompt, a Model, or the Whole Release

27 Jun, 2026 | 11 Mins read

Software rollbacks are well-understood. You deploy a new version, detect an issue, and roll back to the previous version. The rollback is atomic: the entire application reverts to the previous state.

AI Infrastructure Operations

The 7-step vector database selection checklist

26 Apr, 2026 | 06 Mins read

Most vector database selection failures come down to one mistake: picking the technology before mapping the workload. Teams benchmark embedding search speed on a curated dataset, pick the fastest opti

AI Infrastructure Operations

Build vs buy: a decision tree for AI infrastructure

03 May, 2026 | 06 Mins read

Every AI infrastructure team eventually faces the same argument. One faction wants to build a custom solution because the commercial options do not handle their specific requirements. The other factio

AI Enablement Operations

How to design a prompt ops pipeline from scratch

10 May, 2026 | 06 Mins read

Prompt management in most AI teams starts the same way. One engineer writes a prompt, it works well enough, and the prompt gets committed to a config file. Three months later, there are forty prompts

Data Engineering Operations

The data quality scorecard: metrics that actually matter

17 May, 2026 | 06 Mins read

Most data quality initiatives fail not because teams lack tools, but because they measure the wrong things. Teams track hundreds of data quality metrics, generate dashboards full of green indicators,

Trends Data Engineering

Conference report: key takeaways from Data Council 2026

23 May, 2026 | 04 Mins read

Data Council 2026 wrapped in Austin last week, and the signal-to-noise ratio was higher than in recent years. The conference has historically been the venue where data infrastructure practitioners — n

AI Infrastructure Operations

A cost optimization framework for LLM inference

24 May, 2026 | 06 Mins read

LLM inference costs follow a pattern that catches teams off guard. The first prototype costs almost nothing -- a few hundred dollars a month during development. The pilot scales to a few thousand. Pro

AI Governance Operations

How to audit your AI pipeline for bias -- step by step

07 Jun, 2026 | 06 Mins read

Bias in AI systems is not a theoretical risk. It is a measurable property that can be detected, quantified, and mitigated at every stage of the pipeline. The teams that treat bias as an audit problem

AI Enablement Operations

The 30-day AI readiness assessment

14 Jun, 2026 | 07 Mins read

Organizations that skip readiness assessment before investing in AI tend to discover their gaps expensively. A financial services firm spent four months building a customer churn prediction model only

Data Engineering Forecasting

Data Pipelines for Time Series Forecasting

21 Mar, 2024 | 02 Mins read

Time series forecasting requires specialized pipeline architecture. Unlike standard batch processing, time series work demands strict chronological ordering, historical context, time-based feature eng

Trends Data Engineering

The death of the dashboard: what replaces BI?

20 Jun, 2026 | 03 Mins read

The traditional BI dashboard — a grid of charts that a business user opens every morning to check KPIs — is losing its grip on how organizations consume data. The decline is not dramatic. No one decla

AI Enablement Operations

Your first 90 days as a Head of AI Engineering

28 Jun, 2026 | 07 Mins read

The first Head of AI Engineering at a company inherits one of three situations. Situation one: there is no AI team, no AI infrastructure, and the mandate is to build from scratch. Situation two: there

AI Enablement Operations

The RAG evaluation framework you'll actually use

08 Jul, 2026 | 06 Mins read

Most RAG systems are evaluated with vibes. An engineer runs ten queries, eyeballs the results, and declares the system "working." Three months later, a customer reports that the system confidently ret

Trends Data Engineering

Why your AI strategy needs a data strategy (not the other way around)

11 Jul, 2026 | 03 Mins read

The majority of enterprise AI strategies are built on an implicit assumption: that the organization's data is ready to support AI workloads. The assumption is almost always wrong. Data that is adequat

AI Governance Operations

How to write an AI incident response plan

12 Jul, 2026 | 07 Mins read

AI systems fail differently than traditional software. A traditional software bug produces incorrect output deterministically -- the same input always produces the same wrong output, and a fix elimina

Data Governance Data Engineering

Data Contracts: Building Trust Between Teams

29 Jan, 2024 | 03 Mins read

Data contracts are formal agreements that define the structure, semantics, quality standards, and delivery expectations for data exchanged between teams. They specify schema definitions, SLAs, ownersh

Data Engineering Synthetic Data

Building Synthetic Data Pipelines for ML Testing

24 May, 2024 | 04 Mins read

# Building Synthetic Data Pipelines for ML Testing Synthetic data addresses real ML development problems: privacy restrictions on real data, class imbalance, and edge case coverage. It does not repla

Machine Learning Data Engineering Feature Engineering

Feature Store Architectures: Building the Foundation for Enterprise ML

18 Jan, 2024 | 03 Mins read

Organizations scaling ML efforts encounter a predictable problem: feature engineering work duplicates across teams, training-serving skew causes model failures in production, and point-in-time correct

Data Engineering Temporal Data

Time-Travel Queries: Implementing Temporal Data Access

02 Oct, 2024 | 03 Mins read

Time-travel queries—the ability to access data as it existed at any point in the past—have become essential in modern data platforms. This capability transforms how organizations approach data governa

AI Infrastructure Data Engineering

Choosing a Vector Database for Production AI Applications

10 Jul, 2026 | 12 Mins read

You have a retrieval-augmented generation proof of concept that works on a laptop. The embeddings are in a CSV file, the search is brute force, and the demo impresses the steering committee. Now someo