A logistics company processing two million shipments per day ran its entire operational reporting stack on nightly batch ETL. Every morning at 6 AM, operations managers reviewed dashboards built on data that was between eight and twenty hours old. For most of the company’s history, this was fine. Routing decisions were made on historical patterns, not real-time signals.
Then the company launched a same-day delivery service. Routing decisions now needed to happen in minutes, not hours. The batch pipeline could not serve this need. The operations team started pulling manual extracts, running ad-hoc queries against the production database, and building shadow spreadsheets that combined real-time GPS data with the batch-processed shipment data. The data team had lost control of the narrative.
The CDO asked us to design a migration from batch to streaming. The constraint was that the existing batch system had to keep running — and stay correct — throughout the migration. There would be no big cutover weekend. The migration had to be gradual, reversible, and invisible to the business users who depended on the batch outputs.
Why previous attempts stalled
The data team had tried streaming twice before. The first attempt used Kafka Streams to replicate the batch transformations in real time. It worked for simple aggregations but broke on joins that required historical context — a shipment’s current GPS position is only meaningful when joined against its planned route, its delivery window, and the driver’s current capacity. The streaming version of these joins was three times more complex than the batch SQL equivalent, and the team could not prove it was producing correct results.
The second attempt used a change data capture tool to stream database changes into a real-time layer, then ran the same batch SQL against a continuously updated table. This was architecturally simpler but introduced a new problem: the continuously updated table was never quite consistent with the batch version. Small timing differences in CDC events produced aggregation results that diverged from batch by fractions of a percent. The business team did not trust fractional divergence, and the data team spent more time explaining the divergence than building features.
Both failures had the same root cause: the team tried to replicate the batch logic in a streaming runtime. Batch and streaming are not the same computation running on different engines. They are different computation models with different consistency guarantees, different state management requirements, and different failure modes. Translating batch SQL line-for-line into streaming operators does not produce a streaming system. It produces a batch system that runs continuously and fails in ways that are harder to debug.
The approach: dual-path architecture with semantic contracts
We designed a dual-path architecture. Both the batch and streaming pipelines ran simultaneously, producing outputs that served the same semantic purpose but operated on different time horizons. The batch pipeline continued to produce authoritative daily aggregates. The streaming pipeline produced near-real-time approximations that converged with the batch output over time.
[Diagram: the dual-path architecture, showing the batch and streaming pipelines running in parallel and both feeding the harmonization layer.]
The critical piece was the harmonization layer. This layer was not a merge or a union. It was a reconciliation system that continuously compared streaming outputs against batch outputs for the same time window. When the batch pipeline completed its nightly run, the harmonization layer compared its results against the streaming aggregates that had accumulated over the same twenty-four-hour period. If the divergence exceeded a configurable threshold, the system alerted the data team and served the batch output as the authoritative view.
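A minimal sketch of that reconciliation check, assuming per-metric aggregates and a relative-divergence measure; the names, the structure, and the threshold value are illustrative, not the team's actual implementation:

```python
from dataclasses import dataclass

# Illustrative default; in the real system the threshold was configurable.
DIVERGENCE_THRESHOLD = 0.005  # 0.5% relative divergence

@dataclass
class ReconciliationResult:
    metric: str
    batch_value: float
    streaming_value: float
    relative_divergence: float
    authoritative: str  # which path to serve: "batch" or "streaming"

def reconcile(batch_aggs: dict[str, float],
              streaming_aggs: dict[str, float],
              threshold: float = DIVERGENCE_THRESHOLD) -> list[ReconciliationResult]:
    """Compare the nightly batch aggregates against the streaming
    aggregates accumulated over the same twenty-four-hour window."""
    results = []
    for metric, batch_value in batch_aggs.items():
        streaming_value = streaming_aggs.get(metric, 0.0)
        denom = abs(batch_value) or 1.0  # avoid division by zero
        divergence = abs(batch_value - streaming_value) / denom
        # If divergence exceeds the threshold, batch is served as authoritative.
        authoritative = "streaming" if divergence <= threshold else "batch"
        results.append(ReconciliationResult(metric, batch_value,
                                            streaming_value, divergence,
                                            authoritative))
    return results

def alert_on_divergence(results: list[ReconciliationResult]) -> None:
    """Notify the data team about any metric that fell back to batch."""
    for r in results:
        if r.authoritative == "batch":
            # Placeholder for the team's actual alerting hook.
            print(f"ALERT: {r.metric} diverged by {r.relative_divergence:.2%}")
```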
This gave the business two things simultaneously: real-time visibility for operational decisions, and authoritative accuracy for reporting and compliance. The operations manager could see shipment positions update every thirty seconds. The finance team still reconciled against the nightly batch. Both teams were reading from the same system, but the system served different consistency guarantees depending on the use case.
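One way to picture the read path: queries declare the consistency they need, and the serving layer resolves them to a path, falling back to batch when the harmonization layer has flagged divergence. The enum and view names below are invented for illustration:

```python
from enum import Enum

class Consistency(Enum):
    REAL_TIME = "real_time"          # operations: positions fresh within ~30s
    AUTHORITATIVE = "authoritative"  # finance and compliance: nightly batch truth

# Hypothetical view names for the two paths of the shipment domain.
VIEWS = {
    Consistency.REAL_TIME: "shipment_positions_streaming",
    Consistency.AUTHORITATIVE: "shipment_positions_batch",
}

def resolve_view(required: Consistency, streaming_healthy: bool) -> str:
    """Both teams read the same system; the declared consistency
    requirement picks the path. If the harmonization layer has flagged
    the streaming output as divergent, real-time reads fall back to batch."""
    if required is Consistency.REAL_TIME and streaming_healthy:
        return VIEWS[Consistency.REAL_TIME]
    return VIEWS[Consistency.AUTHORITATIVE]
```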
The migration sequence
We migrated one domain at a time, starting with the simplest and working toward the most complex. Shipment tracking was first — it was essentially a pass-through from GPS events to a position table, with minimal transformation. Delivery status aggregation came next. Route optimization metrics were last, because they required the deepest historical joins and had the tightest accuracy requirements.
Each domain migration followed the same protocol: build the streaming path, run it in shadow mode for two weeks while the batch path remained authoritative, validate that the harmonization layer showed acceptable divergence, then switch the dashboard to the streaming view with batch fallback. If any domain showed unacceptable divergence during shadow mode, it stayed in shadow until the streaming logic was fixed.
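In sketch form, the protocol behaves like a small state machine in which a domain is promoted only after its shadow period passes the divergence check. The states, the promotion rule, and the two-week constant below are our rendering of the protocol, not the team's code:

```python
from enum import Enum, auto

class DomainState(Enum):
    BATCH_ONLY = auto()         # streaming path not yet built
    SHADOW = auto()             # streaming runs, batch stays authoritative
    STREAMING_PRIMARY = auto()  # dashboards read streaming, batch is fallback

SHADOW_PERIOD_DAYS = 14  # two weeks of shadow mode, per the protocol above

def next_state(state: DomainState,
               days_in_shadow: int,
               divergence_acceptable: bool) -> DomainState:
    """Advance a domain through the migration protocol.

    A domain is promoted only after the full shadow period has elapsed
    AND the harmonization layer reports acceptable divergence; otherwise
    it stays in shadow until the streaming logic is fixed.
    """
    if state is DomainState.SHADOW:
        if days_in_shadow >= SHADOW_PERIOD_DAYS and divergence_acceptable:
            return DomainState.STREAMING_PRIMARY
        return DomainState.SHADOW
    return state
```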
Total migration time was twenty-four weeks across five domains. The team spent roughly sixty percent of that time on reconciliation logic and divergence analysis, and forty percent on the streaming pipelines themselves. This ratio surprised the engineering team. They expected the streaming pipelines to be the hard part. The hard part was proving they were right.
What we gave up
Streaming pipelines cannot do everything batch pipelines can. Certain aggregations require access to the complete history of a domain — lifetime shipment counts, rolling twelve-month delivery performance, year-over-year trend analysis. These computations are natural batch operations. Forcing them into a streaming model added complexity without adding value.
The team kept these computations in the batch pipeline and accepted that their outputs would remain daily. The streaming pipeline handled the subset of computations where timeliness mattered more than completeness. The decision framework was simple: if the business action that depends on the output has a time horizon of less than four hours, it belongs in the streaming path. If the time horizon is longer, batch is fine.
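Expressed as code, the framework reduces to a single comparison. The four-hour cutoff comes from the team's rule; the function name is hypothetical:

```python
from datetime import timedelta

STREAMING_CUTOFF = timedelta(hours=4)

def choose_path(action_time_horizon: timedelta) -> str:
    """Place a computation based on how quickly the business must act
    on its output: under four hours means streaming, otherwise batch."""
    return "streaming" if action_time_horizon < STREAMING_CUTOFF else "batch"

# Examples from the text:
# choose_path(timedelta(minutes=30))  -> "streaming"  (same-day rerouting)
# choose_path(timedelta(days=365))    -> "batch"      (lifetime counts, YoY trends)
```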
The second trade-off was operational complexity. Running two pipelines that produce overlapping outputs requires discipline. Schema changes had to be applied to both paths. Business logic changes had to be validated against both outputs. The team built automated reconciliation tests that ran nightly, but the test suite added roughly fifteen percent to the ongoing maintenance cost of the data platform.
Results
Operations managers gained the ability to make same-day routing decisions based on data that was never more than five minutes old. Delivery exception rates dropped by twelve percent in the first quarter after full migration, primarily because drivers could be rerouted in response to real-time delays rather than discovering delays at the next morning’s review.
The batch pipeline continued to serve finance and compliance without any changes. The two pipelines coexisted for over a year before the team began selectively retiring batch steps where the streaming output had proven reliable over time.
The decision heuristic
Do not try to replace batch with streaming. Extend batch with streaming, and let the two coexist with a reconciliation layer that proves the streaming output is correct. The migration is not complete when the streaming pipeline works. It is complete when you can prove the streaming pipeline produces results that match the batch pipeline within acceptable bounds. If you cannot prove that, the streaming pipeline is not ready for production regardless of how fast it is.