The data pipeline that cost $50K/month — and the audit that found why

Simor Consulting | 22 Apr, 2026 | 04 Mins read

A financial services firm running analytics on trade settlement data came to us with a specific complaint: their cloud data platform cost had tripled in eighteen months, and nobody could explain why. The data team pointed at new features. The platform team pointed at upstream data volume growth. Finance wanted an answer that was more precise than finger-pointing.

The monthly bill was $142,000. The firm’s internal benchmark for a platform of this size and complexity was roughly $90,000. They were overspending by about $50,000 per month, or $600,000 per year. That number had been growing for three quarters, and it was accelerating.

We were asked to do a pipeline cost audit. Not a performance review. Not a migration proposal. An audit of what was actually running, what it cost, and whether anyone needed it.

The audit methodology

Most data platform audits focus on architecture: are you using the right tools, the right schema patterns, the right orchestration? That analysis has value, but it misses the most common source of cost waste in data platforms: executing pipelines that serve no active consumer.

We started by building a dependency graph of every pipeline in the platform. For each pipeline, we tracked three things: what data it read, what data it wrote, and which downstream consumers read the output. We then traced the consumption chain from every dashboard, report, API, and ML feature back to the pipelines that produced its source data.


The graph revealed what the raw compute logs could not: which execution paths had living consumers at the end, and which terminated in nothing.
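The core of the audit can be sketched as a reverse reachability pass over the dependency graph: start from the assets that living consumers actually read, walk upstream through producing pipelines, and anything never reached is a dead end. This is a minimal illustration, not the audit tooling itself; the pipeline/consumer dictionaries and their field names are assumptions for the example.

```python
from collections import defaultdict

def find_dead_pipelines(pipelines, consumers):
    """Return pipelines whose output never reaches an active consumer.

    pipelines: name -> {"reads": [assets], "writes": [assets]}
    consumers: consumer name -> list of assets it reads
    """
    # Index which pipelines produce each asset.
    writers = defaultdict(set)
    for name, p in pipelines.items():
        for asset in p["writes"]:
            writers[asset].add(name)

    # Walk upstream from every asset a living consumer reads. Each
    # pipeline that writes a needed asset is live, and the assets it
    # reads become needed in turn.
    live = set()
    frontier = [a for reads in consumers.values() for a in reads]
    while frontier:
        asset = frontier.pop()
        for name in writers.get(asset, ()):
            if name not in live:
                live.add(name)
                frontier.extend(pipelines[name]["reads"])

    # Everything unreachable from a consumer is a dead end.
    return set(pipelines) - live
```

Because a pipeline is only expanded the first time it is marked live, the walk terminates even if the graph contains cycles.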

What the audit found

Thirty-one percent of total compute was spent on pipelines whose output had zero active consumers. These were pipelines that had been built for dashboards that were decommissioned, ML models that were replaced, and reports that had been superseded by newer versions. Nobody deleted the pipeline when the consumer went away. The orchestration system kept running them on schedule because nothing told it to stop.

Twenty-two percent of compute was spent on redundant transformations. Two separate teams had built near-identical aggregation pipelines on the same source data, using slightly different business logic. Both pipelines ran daily. Both consumed the same raw data. Neither team knew the other’s pipeline existed, because they worked in different organizational units that shared the platform but not the catalog.

Fourteen percent of compute was spent on over-refreshed materialized views. Several views were configured to refresh every fifteen minutes, but their downstream dashboards only refreshed once per day. The views were being computed ninety-six times for every one time their output was consumed.
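The arithmetic behind the over-refresh waste is worth making explicit. A sketch, assuming refresh and read counts are available per day:

```python
def refresh_waste(view_refreshes_per_day, consumer_reads_per_day):
    """Fraction of refresh compute that is never read downstream."""
    wasted = max(view_refreshes_per_day - consumer_reads_per_day, 0)
    return wasted / view_refreshes_per_day

# A view refreshed every 15 minutes (96x/day) feeding a dashboard that
# reloads once per day wastes 95 of 96 refreshes, roughly 99% of its
# refresh compute.
refresh_waste(96, 1)
```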

The remaining thirty-three percent was legitimate spend on pipelines with active consumers and appropriate refresh schedules.

Why the pipeline sprawl happened

This was not a governance failure in the traditional sense. The firm had a data catalog. They had a platform team. They had onboarding processes for new pipelines. What they lacked was a consumption-linked lifecycle model. Pipelines were created with an approval process, but they were never re-evaluated once deployed. There was no mechanism to detect when a pipeline’s consumer disappeared.

The root cause was a missing dependency. The orchestration system knew about task dependencies — which pipeline ran after which. But it did not know about consumption dependencies — which pipeline’s output was actually read by a consumer that someone cared about. Task dependencies are technical. Consumption dependencies are organizational. The platform tracked the first but not the second.

This is common. Most data platforms can tell you what runs and when. Very few can tell you what runs and why.

The fix: consumption-linked lifecycle management

We implemented a three-part system:

First, every pipeline was required to register at least one active consumer in a consumption registry. The registry was a lightweight metadata store that mapped data assets to their consumers — dashboards, APIs, ML features, scheduled reports. Pipelines without a registered consumer were flagged after thirty days and suspended after sixty days.
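The 30/60-day policy reduces to a small state check against the registry. A minimal sketch, with hypothetical field names (`name`, `deployed_on`) standing in for whatever the metadata store actually records:

```python
from datetime import date

def lifecycle_status(pipeline, registry, today):
    """Return 'ok', 'flagged', or 'suspended' under the 30/60-day policy.

    registry: pipeline name -> list of registered consumers
    pipeline: {"name": str, "deployed_on": date}
    """
    # A pipeline with at least one registered consumer is always ok.
    if registry.get(pipeline["name"]):
        return "ok"
    age = (today - pipeline["deployed_on"]).days
    if age >= 60:
        return "suspended"
    if age >= 30:
        return "flagged"
    return "ok"
```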

Second, a monitoring layer tracked actual consumption. For each output table or view, the system logged whether any query, API call, or dashboard refresh had read from it in the past billing cycle. If an output went unconsumed for a full cycle, the producing pipeline was flagged for review.
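Detecting unconsumed outputs is a set difference over the warehouse's query history. A sketch under the assumption that the history can be reduced to (table, read timestamp) pairs:

```python
from datetime import datetime

def unconsumed_outputs(query_log, outputs, cycle_start):
    """Return outputs with no recorded read since the billing cycle began.

    query_log: iterable of (table_name, read_at) pairs from query history
    outputs: tables/views written by pipelines
    """
    read_recently = {t for t, read_at in query_log if read_at >= cycle_start}
    return sorted(o for o in outputs if o not in read_recently)
```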

Third, cost attribution was tied to consumers, not producers. Instead of reporting “pipeline X costs $Y per month,” the reporting system showed “dashboard Z costs $Y per month, inclusive of all pipelines required to produce its data.” This made the cost of unused outputs visible to the teams that owned the dashboards, not just the platform team.
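Consumer-side attribution reuses the same upstream walk: roll up the cost of every pipeline required to produce the assets a dashboard reads. A sketch, with the same assumed pipeline dictionary shape as above:

```python
from collections import defaultdict

def consumer_cost(consumer_assets, pipelines, monthly_cost):
    """Monthly cost of a consumer, inclusive of all upstream pipelines.

    pipelines: name -> {"reads": [assets], "writes": [assets]}
    monthly_cost: pipeline name -> dollars per month
    """
    writers = defaultdict(set)
    for name, p in pipelines.items():
        for asset in p["writes"]:
            writers[asset].add(name)

    # Collect every pipeline transitively upstream of the consumer.
    required, frontier = set(), list(consumer_assets)
    while frontier:
        asset = frontier.pop()
        for name in writers.get(asset, ()):
            if name not in required:
                required.add(name)
                frontier.extend(pipelines[name]["reads"])
    return sum(monthly_cost[n] for n in required)
```

Note that a pipeline shared by several dashboards is counted in full for each of them; inclusive reporting deliberately accepts that overlap so each owning team sees the whole chain their asset depends on.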


What it cost to fix

The audit itself took three weeks. The implementation of consumption tracking and automated suspension took another four weeks. Total cost was roughly $85,000 in consulting and engineering time. The savings were immediate: $50,000 per month in the first full billing cycle after implementation, growing to $62,000 per month as additional dormant pipelines were identified and suspended.

Payback period was under two months.

What we gave up

The automated suspension system occasionally flagged pipelines that had legitimate but infrequent consumers — quarterly regulatory reports, annual compliance reviews, seasonal analytics. The team addressed this by allowing pipeline owners to declare a minimum consumption frequency. A pipeline serving a quarterly report could be marked as “consumed quarterly” and would only be flagged if it went unconsumed for two consecutive quarters.
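The declared-frequency override amounts to widening the flagging window to two declared periods. A sketch, with illustrative period lengths that pad each frequency slightly to avoid flagging on calendar jitter:

```python
from datetime import date

# Assumed period lengths in days, padded so a report that lands a day
# or two late does not trip the check.
FREQ_DAYS = {"daily": 1, "weekly": 7, "monthly": 31, "quarterly": 92}

def should_flag(declared_freq, last_consumed, today):
    """Flag only after two consecutive declared periods with no
    consumption, e.g. two full quarters for a quarterly report."""
    return (today - last_consumed).days > 2 * FREQ_DAYS[declared_freq]
```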

This added a small maintenance burden. Pipeline owners had to think about consumption patterns at creation time rather than discovering them after the fact. Most teams considered this a benefit rather than a cost, because it forced a conversation about whether a pipeline was actually needed before it was built.

The decision heuristic

If your data platform cost is growing faster than your data volume, the problem is almost certainly not compute pricing or storage inefficiency. The problem is execution of unneeded work. Before optimizing individual pipelines for performance, build the consumption dependency graph. Find the dead ends. Suspend them. The savings from eliminating work nobody needs will dwarf any optimization you can apply to work that someone does need.

