The data pipeline that cost $50K/month — and the audit that found why

Simor Consulting | 22 Apr, 2026 | 04 Mins read

A financial services firm running analytics on trade settlement data came to us with a specific complaint: their cloud data platform cost had tripled in eighteen months, and nobody could explain why. The data team pointed at new features. The platform team pointed at upstream data volume growth. Finance wanted an answer that was more precise than finger-pointing.

The monthly bill was $142,000. The firm’s internal benchmark for a platform of this size and complexity was roughly $90,000. They were overspending by about $50,000 per month, or $600,000 per year. That number had been growing for three quarters, and it was accelerating.

We were asked to do a pipeline cost audit. Not a performance review. Not a migration proposal. An audit of what was actually running, what it cost, and whether anyone needed it.

The audit methodology

Most data platform audits focus on architecture: are you using the right tools, the right schema patterns, the right orchestration? That analysis has value, but it misses the most common source of cost waste in data platforms: executing pipelines that serve no active consumer.

We started by building a dependency graph of every pipeline in the platform. For each pipeline, we tracked three things: what data it read, what data it wrote, and which downstream consumers read the output. We then traced the consumption chain from every dashboard, report, API, and ML feature back to the pipelines that produced its source data.


The graph revealed what the raw compute logs could not: which execution paths had living consumers at the end, and which terminated in nothing.
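The core of the audit can be sketched as a reverse reachability pass over the dependency graph: start from the assets that living consumers actually read, walk upstream through producing pipelines, and anything never reached is a dead end. This is a minimal illustration, not the audit tooling itself; the pipeline/consumer dictionaries and their field names are assumptions for the example.

```python
from collections import defaultdict

def find_dead_pipelines(pipelines, consumers):
    """Return pipelines whose output never reaches an active consumer.

    pipelines: name -> {"reads": [assets], "writes": [assets]}
    consumers: consumer name -> list of assets it reads
    """
    # Index which pipelines produce each asset.
    writers = defaultdict(set)
    for name, p in pipelines.items():
        for asset in p["writes"]:
            writers[asset].add(name)

    # Walk upstream from every asset a living consumer reads. Each
    # pipeline that writes a needed asset is live, and the assets it
    # reads become needed in turn.
    live = set()
    frontier = [a for reads in consumers.values() for a in reads]
    while frontier:
        asset = frontier.pop()
        for name in writers.get(asset, ()):
            if name not in live:
                live.add(name)
                frontier.extend(pipelines[name]["reads"])

    # Everything unreachable from a consumer is a dead end.
    return set(pipelines) - live
```

Because a pipeline is only expanded the first time it is marked live, the walk terminates even if the graph contains cycles.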

What the audit found

Thirty-one percent of total compute was spent on pipelines whose output had zero active consumers. These were pipelines that had been built for dashboards that were decommissioned, ML models that were replaced, and reports that had been superseded by newer versions. Nobody deleted the pipeline when the consumer went away. The orchestration system kept running them on schedule because nothing told it to stop.

Twenty-two percent of compute was spent on redundant transformations. Two separate teams had built near-identical aggregation pipelines on the same source data, using slightly different business logic. Both pipelines ran daily. Both consumed the same raw data. Neither team knew the other’s pipeline existed, because they worked in different organizational units that shared the platform but not the catalog.

Fourteen percent of compute was spent on over-refreshed materialized views. Several views were configured to refresh every fifteen minutes, but their downstream dashboards only refreshed once per day. The views were being computed ninety-six times for every one time their output was consumed.
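The arithmetic behind the over-refresh waste is worth making explicit. A sketch, assuming refresh and read counts are available per day:

```python
def refresh_waste(view_refreshes_per_day, consumer_reads_per_day):
    """Fraction of refresh compute that is never read downstream."""
    wasted = max(view_refreshes_per_day - consumer_reads_per_day, 0)
    return wasted / view_refreshes_per_day

# A view refreshed every 15 minutes (96x/day) feeding a dashboard that
# reloads once per day wastes 95 of 96 refreshes, roughly 99% of its
# refresh compute.
refresh_waste(96, 1)
```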

The remaining thirty-three percent was legitimate spend on pipelines with active consumers and appropriate refresh schedules.

Why the pipeline sprawl happened

This was not a governance failure in the traditional sense. The firm had a data catalog. They had a platform team. They had onboarding processes for new pipelines. What they lacked was a consumption-linked lifecycle model. Pipelines were created with an approval process, but they were never re-evaluated once deployed. There was no mechanism to detect when a pipeline’s consumer disappeared.

The root cause was a missing dependency. The orchestration system knew about task dependencies — which pipeline ran after which. But it did not know about consumption dependencies — which pipeline’s output was actually read by a consumer that someone cared about. Task dependencies are technical. Consumption dependencies are organizational. The platform tracked the first but not the second.

This is common. Most data platforms can tell you what runs and when. Very few can tell you what runs and why.

The fix: consumption-linked lifecycle management

We implemented a three-part system:

First, every pipeline was required to register at least one active consumer in a consumption registry. The registry was a lightweight metadata store that mapped data assets to their consumers — dashboards, APIs, ML features, scheduled reports. Pipelines without a registered consumer were flagged after thirty days and suspended after sixty days.
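The 30/60-day policy reduces to a small state check against the registry. A minimal sketch, with hypothetical field names (`name`, `deployed_on`) standing in for whatever the metadata store actually records:

```python
from datetime import date

def lifecycle_status(pipeline, registry, today):
    """Return 'ok', 'flagged', or 'suspended' under the 30/60-day policy.

    registry: pipeline name -> list of registered consumers
    pipeline: {"name": str, "deployed_on": date}
    """
    # A pipeline with at least one registered consumer is always ok.
    if registry.get(pipeline["name"]):
        return "ok"
    age = (today - pipeline["deployed_on"]).days
    if age >= 60:
        return "suspended"
    if age >= 30:
        return "flagged"
    return "ok"
```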

Second, a monitoring layer tracked actual consumption. For each output table or view, the system logged whether any query, API call, or dashboard refresh had read from it in the past billing cycle. If an output went unconsumed for a full cycle, the producing pipeline was flagged for review.
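Detecting unconsumed outputs is a set difference over the warehouse's query history. A sketch under the assumption that the history can be reduced to (table, read timestamp) pairs:

```python
from datetime import datetime

def unconsumed_outputs(query_log, outputs, cycle_start):
    """Return outputs with no recorded read since the billing cycle began.

    query_log: iterable of (table_name, read_at) pairs from query history
    outputs: tables/views written by pipelines
    """
    read_recently = {t for t, read_at in query_log if read_at >= cycle_start}
    return sorted(o for o in outputs if o not in read_recently)
```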

Third, cost attribution was tied to consumers, not producers. Instead of reporting “pipeline X costs $Y per month,” the reporting system showed “dashboard Z costs $Y per month, inclusive of all pipelines required to produce its data.” This made the cost of unused outputs visible to the teams that owned the dashboards, not just the platform team.
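Consumer-side attribution reuses the same upstream walk: roll up the cost of every pipeline required to produce the assets a dashboard reads. A sketch, with the same assumed pipeline dictionary shape as above:

```python
from collections import defaultdict

def consumer_cost(consumer_assets, pipelines, monthly_cost):
    """Monthly cost of a consumer, inclusive of all upstream pipelines.

    pipelines: name -> {"reads": [assets], "writes": [assets]}
    monthly_cost: pipeline name -> dollars per month
    """
    writers = defaultdict(set)
    for name, p in pipelines.items():
        for asset in p["writes"]:
            writers[asset].add(name)

    # Collect every pipeline transitively upstream of the consumer.
    required, frontier = set(), list(consumer_assets)
    while frontier:
        asset = frontier.pop()
        for name in writers.get(asset, ()):
            if name not in required:
                required.add(name)
                frontier.extend(pipelines[name]["reads"])
    return sum(monthly_cost[n] for n in required)
```

Note that a pipeline shared by several dashboards is counted in full for each of them; inclusive reporting deliberately accepts that overlap so each owning team sees the whole chain their asset depends on.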


What it cost to fix

The audit itself took three weeks. The implementation of consumption tracking and automated suspension took another four weeks. Total cost was roughly $85,000 in consulting and engineering time. The savings were immediate: $50,000 per month in the first full billing cycle after implementation, growing to $62,000 per month as additional dormant pipelines were identified and suspended.

Payback period was under two months.

What we gave up

The automated suspension system occasionally flagged pipelines that had legitimate but infrequent consumers — quarterly regulatory reports, annual compliance reviews, seasonal analytics. The team addressed this by allowing pipeline owners to declare a minimum consumption frequency. A pipeline serving a quarterly report could be marked as “consumed quarterly” and would only be flagged if it went unconsumed for two consecutive quarters.
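The declared-frequency override amounts to widening the flagging window to two declared periods. A sketch, with illustrative period lengths that pad each frequency slightly to avoid flagging on calendar jitter:

```python
from datetime import date

# Assumed period lengths in days, padded so a report that lands a day
# or two late does not trip the check.
FREQ_DAYS = {"daily": 1, "weekly": 7, "monthly": 31, "quarterly": 92}

def should_flag(declared_freq, last_consumed, today):
    """Flag only after two consecutive declared periods with no
    consumption, e.g. two full quarters for a quarterly report."""
    return (today - last_consumed).days > 2 * FREQ_DAYS[declared_freq]
```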

This added a small maintenance burden. Pipeline owners had to think about consumption patterns at creation time rather than discovering them after the fact. Most teams considered this a benefit rather than a cost, because it forced a conversation about whether a pipeline was actually needed before it was built.

The decision heuristic

If your data platform cost is growing faster than your data volume, the problem is almost certainly not compute pricing or storage inefficiency. The problem is execution of unneeded work. Before optimizing individual pipelines for performance, build the consumption dependency graph. Find the dead ends. Suspend them. The savings from eliminating work nobody needs will dwarf any optimization you can apply to work that someone does need.

