Migrating from batch to streaming: a 6-month journey

Simor Consulting | 28 Apr, 2026 | 05 Mins read

A logistics company processing two million shipments per day ran their entire operational reporting stack on nightly batch ETL. Every morning at 6 AM, operations managers reviewed dashboards built on data that was between eight and twenty hours old. For most of the company’s history, this was fine. Routing decisions were made on historical patterns, not real-time signals.

Then the company launched a same-day delivery service. Routing decisions now needed to happen in minutes, not hours. The batch pipeline could not serve this need. The operations team started pulling manual extracts, running ad-hoc queries against the production database, and building shadow spreadsheets that combined real-time GPS data with the batch-processed shipment data. The data team had lost control of the narrative.

The CDO asked us to design a migration from batch to streaming. The constraint was that the existing batch system had to keep running — and stay correct — throughout the migration. There would be no big cutover weekend. The migration had to be gradual, reversible, and invisible to the business users who depended on the batch outputs.

Why previous attempts stalled

The data team had tried streaming twice before. The first attempt used Kafka Streams to replicate the batch transformations in real time. It worked for simple aggregations but broke on joins that required historical context — a shipment’s current GPS position is only meaningful when joined against its planned route, its delivery window, and the driver’s current capacity. The streaming version of these joins was three times more complex than the batch SQL equivalent, and the team could not prove it was producing correct results.
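
To make that concrete, here is a minimal sketch of the kind of stateful enrichment join the streaming path requires. All names and event shapes here are illustrative (the client's pipeline used Kafka Streams, not plain Python), but the structure is faithful: every lookup the batch SQL gets for free becomes a state store the streaming job must populate, keep fresh, and consult on every event.

```python
# Illustrative sketch of a stateful streaming enrichment join.
# Names and event shapes are hypothetical; the real pipeline used
# Kafka Streams. The point is structural: each side of what would be
# a single batch SQL join becomes a state store plus buffering logic.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EnrichedPosition:
    shipment_id: str
    lat: float
    lon: float
    planned_route: dict
    delivery_window: dict
    driver_capacity: int

# State stores the streaming job must maintain itself.
routes: dict = {}      # shipment_id -> planned route
windows: dict = {}     # shipment_id -> delivery window
capacities: dict = {}  # driver_id -> remaining capacity
pending: dict = {}     # shipment_id -> GPS events waiting for context

def on_gps_event(event: dict) -> Optional[EnrichedPosition]:
    sid, did = event["shipment_id"], event["driver_id"]
    route = routes.get(sid)
    window = windows.get(sid)
    capacity = capacities.get(did)
    if route is None or window is None or capacity is None:
        # Out-of-order arrival: the GPS fix beat its context. Buffer it.
        # Batch never faces this; all inputs are complete before the join.
        pending.setdefault(sid, []).append(event)
        return None
    return EnrichedPosition(sid, event["lat"], event["lon"],
                            route, window, capacity)

def on_route_event(event: dict) -> list:
    # New context may unblock buffered GPS events, so replay them.
    routes[event["shipment_id"]] = event["route"]
    buffered = pending.pop(event["shipment_id"], [])
    results = [on_gps_event(e) for e in buffered]
    return [r for r in results if r is not None]
```

Multiply this by every join in the batch SQL, add timers to expire stale state, and the three-times complexity figure stops looking surprising.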

The second attempt used a change data capture tool to stream database changes into a real-time layer, then ran the same batch SQL against a continuously updated table. This was architecturally simpler but introduced a new problem: the continuously updated table was never quite consistent with the batch version. Small timing differences in CDC events produced aggregation results that diverged from batch by fractions of a percent. The business team did not trust fractional divergence, and the data team spent more time explaining the divergence than building features.

Both failures had the same root cause: the team tried to replicate the batch logic in a streaming runtime. Batch and streaming are not the same computation running on different engines. They are different computation models with different consistency guarantees, different state management requirements, and different failure modes. Translating batch SQL line-for-line into streaming operators does not produce a streaming system. It produces a batch system that runs continuously and fails in ways that are harder to debug.

The approach: dual-path architecture with semantic contracts

We designed a dual-path architecture. Both the batch and streaming pipelines ran simultaneously, producing outputs that served the same semantic purpose but operated on different time horizons. The batch pipeline continued to produce authoritative daily aggregates. The streaming pipeline produced near-real-time approximations that converged with the batch output over time.

[Diagram: the dual-path architecture, with the batch and streaming pipelines running in parallel and a harmonization layer reconciling their outputs.]

The critical piece was the harmonization layer. This layer was not a merge or a union. It was a reconciliation system that continuously compared streaming outputs against batch outputs for the same time window. When the batch pipeline completed its nightly run, the harmonization layer compared its results against the streaming aggregates that had accumulated over the same twenty-four-hour period. If the divergence exceeded a configurable threshold, the system alerted the data team and served the batch output as the authoritative view.
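
A minimal sketch of that reconciliation check, assuming a single numeric aggregate per metric. The real system compared many metrics per domain, and the 0.5% default threshold, names, and alert stub below are purely illustrative:

```python
# Sketch of the harmonization layer's core check. Threshold, names, and
# the alert stub are illustrative assumptions, not the production system.

from dataclasses import dataclass

@dataclass
class Reconciliation:
    metric: str
    batch_value: float
    streaming_value: float
    divergence: float     # relative divergence; 0.0 means exact agreement
    authoritative: str    # the view dashboards should serve

def alert_data_team(metric: str, divergence: float) -> None:
    # Placeholder: the real system paged the data team.
    print(f"divergence alert: {metric} off by {divergence:.2%}")

def reconcile(metric: str, batch_value: float, streaming_value: float,
              threshold: float = 0.005) -> Reconciliation:
    """Compare the streaming aggregate against the completed nightly batch
    run for the same 24-hour window. Batch wins whenever they disagree
    beyond the threshold."""
    if batch_value == 0.0:
        divergence = abs(streaming_value)  # degenerate case: absolute diff
    else:
        divergence = abs(streaming_value - batch_value) / abs(batch_value)
    if divergence > threshold:
        alert_data_team(metric, divergence)
        authoritative = "batch"     # serve the batch output until fixed
    else:
        authoritative = "streaming"
    return Reconciliation(metric, batch_value, streaming_value,
                          divergence, authoritative)
```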

This gave the business two things simultaneously: real-time visibility for operational decisions, and authoritative accuracy for reporting and compliance. The operations manager could see shipment positions update every thirty seconds. The finance team still reconciled against the nightly batch. Both teams were reading from the same system, but the system served different consistency guarantees depending on the use case.

The migration sequence

We migrated one domain at a time, starting with the simplest and working toward the most complex. Shipment tracking was first — it was essentially a pass-through from GPS events to a position table, with minimal transformation. Delivery status aggregation came next. Route optimization metrics were last, because they required the deepest historical joins and had the tightest accuracy requirements.

Each domain migration followed the same protocol: build the streaming path, run it in shadow mode for two weeks while the batch path remained authoritative, validate that the harmonization layer showed acceptable divergence, then switch the dashboard to the streaming view with batch fallback. If any domain showed unacceptable divergence during shadow mode, it stayed in shadow until the streaming logic was fixed.
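
The protocol is mechanical enough to express as a small state machine. A sketch, with hypothetical identifiers and the two-week shadow period from the engagement:

```python
# Sketch of the per-domain migration protocol as a state machine.
# Identifiers are hypothetical; the durations and rules mirror the
# protocol described above.

from enum import Enum, auto

class DomainState(Enum):
    BATCH_ONLY = auto()         # streaming path not yet built
    SHADOW = auto()             # streaming runs; batch stays authoritative
    STREAMING_PRIMARY = auto()  # dashboards read streaming, batch fallback

SHADOW_PERIOD_DAYS = 14  # two weeks of shadow mode per domain

def advance(state: DomainState, clean_days_in_shadow: int,
            divergence_acceptable: bool) -> DomainState:
    """One evaluation step, run after each nightly reconciliation.
    A domain leaves shadow mode only after a full shadow period of
    acceptable divergence; otherwise it stays put until the streaming
    logic is fixed. (BATCH_ONLY -> SHADOW happens when the streaming
    path is first deployed, outside this function.)"""
    if (state is DomainState.SHADOW and divergence_acceptable
            and clean_days_in_shadow >= SHADOW_PERIOD_DAYS):
        return DomainState.STREAMING_PRIMARY
    return state
```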

Total migration time was twenty-four weeks across five domains. The team spent roughly sixty percent of that time on reconciliation logic and divergence analysis, and forty percent on the streaming pipelines themselves. This ratio surprised the engineering team. They expected the streaming pipelines to be the hard part. The hard part was proving they were right.

What we gave up

Streaming pipelines cannot do everything batch pipelines can. Certain aggregations require access to the complete history of a domain — lifetime shipment counts, rolling twelve-month delivery performance, year-over-year trend analysis. These computations are natural batch operations. Forcing them into a streaming model added complexity without adding value.

The team kept these computations in the batch pipeline and accepted that their outputs would remain daily. The streaming pipeline handled the subset of computations where timeliness mattered more than completeness. The decision framework was simple: if the business action that depends on the output has a time horizon of less than four hours, it belongs in the streaming path. If the time horizon is longer, batch is fine.
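
That framework fits in a few lines. The four-hour cutoff is the one this team settled on, and the function name is ours:

```python
# The routing heuristic from the decision framework above. The four-hour
# cutoff is the value this team chose; it is a business decision, not a law.

def pipeline_for(action_time_horizon_hours: float) -> str:
    """Route a computation by how quickly the business acts on its output."""
    if action_time_horizon_hours < 4:
        return "streaming"  # timeliness matters more than completeness
    return "batch"          # completeness matters more than timeliness

# Examples from this engagement:
assert pipeline_for(0.1) == "streaming"   # live shipment positions
assert pipeline_for(24 * 365) == "batch"  # year-over-year trend analysis
```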

The second trade-off was operational complexity. Running two pipelines that produce overlapping outputs requires discipline. Schema changes had to be applied to both paths. Business logic changes had to be validated against both outputs. The team built automated reconciliation tests that ran nightly, but the test suite added roughly fifteen percent to the ongoing maintenance cost of the data platform.
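
The nightly suite looked roughly like the pytest sketch below. The loader is a stub and the per-domain thresholds are illustrative; the shape of the test, one parametrized case per domain, is the point:

```python
# Sketch of the nightly reconciliation test suite, one case per domain.
# Thresholds and the loader are illustrative stand-ins for the real suite.

import pytest

DOMAINS = [
    # (domain, relative divergence threshold): tightest where the
    # accuracy requirements were tightest
    ("shipment_tracking", 0.01),
    ("delivery_status", 0.005),
    ("route_optimization", 0.001),
]

def load_daily_aggregate(path: str, domain: str) -> float:
    """Stub: the real suite read the previous 24-hour aggregate from the
    batch output ('batch') or the streaming store ('streaming')."""
    raise NotImplementedError("wire this to your warehouse and stream store")

@pytest.mark.parametrize("domain,threshold", DOMAINS)
def test_streaming_matches_batch(domain: str, threshold: float) -> None:
    batch = load_daily_aggregate("batch", domain)
    streaming = load_daily_aggregate("streaming", domain)
    divergence = abs(streaming - batch) / abs(batch)
    assert divergence <= threshold, (
        f"{domain} diverged by {divergence:.3%} (threshold {threshold:.3%})"
    )
```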

Results

Operations managers gained the ability to make same-day routing decisions based on data that was never more than five minutes old. Delivery exception rates dropped by twelve percent in the first quarter after full migration, primarily because drivers could be rerouted in response to real-time delays rather than discovering delays at the next morning’s review.

The batch pipeline continued to serve finance and compliance without any changes. The two pipelines coexisted for over a year before the team began selectively retiring batch steps where the streaming output had proven reliable over time.

The decision heuristic

Do not try to replace batch with streaming. Extend batch with streaming, and let the two coexist with a reconciliation layer that proves the streaming output is correct. The migration is not complete when the streaming pipeline works. It is complete when you can prove the streaming pipeline produces results that match the batch pipeline within acceptable bounds. If you cannot prove that, the streaming pipeline is not ready for production regardless of how fast it is.
