A financial services firm running analytics on trade settlement data came to us with a specific complaint: their cloud data platform cost had tripled in eighteen months, and nobody could explain why. The data team pointed at new features. The platform team pointed at upstream data volume growth. Finance wanted an answer that was more precise than finger-pointing.
The monthly bill was $142,000. The firm’s internal benchmark for a platform of this size and complexity was roughly $90,000. They were overspending by about $50,000 per month, or $600,000 per year. That number had been growing for three quarters, and it was accelerating.
We were asked to do a pipeline cost audit. Not a performance review. Not a migration proposal. An audit of what was actually running, what it cost, and whether anyone needed it.
The audit methodology
Most data platform audits focus on architecture: are you using the right tools, the right schema patterns, the right orchestration? That analysis has value, but it misses the most common source of cost waste in data platforms: the execution of pipelines that serve no active consumer.
We started by building a dependency graph of every pipeline in the platform. For each pipeline, we tracked three things: what data it read, what data it wrote, and what downstream consumers read the output. We then traced the consumption chain from every dashboard, report, API, and ML feature back to the pipelines that produced its source data.
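A minimal sketch of that backward trace, with hypothetical pipeline and asset names standing in for the client’s real catalog: each pipeline declares what it reads and writes, each consumer declares what it reads, and any pipeline unreachable from a live consumer is a dead end.

```python
# Hypothetical inventory: pipeline -> (inputs it reads, outputs it writes).
pipelines = {
    "raw_trades_ingest":  ({"s3://raw/trades"}, {"trades_bronze"}),
    "settlement_enrich":  ({"trades_bronze"}, {"settlements_silver"}),
    "daily_agg":          ({"settlements_silver"}, {"settlement_daily"}),
    "legacy_risk_rollup": ({"trades_bronze"}, {"risk_rollup_v1"}),
}

# Hypothetical consumers: dashboard/report/API -> the assets it reads.
consumers = {
    "ops_dashboard": {"settlement_daily"},
}

# Map each output asset to the pipeline that produces it.
producer_of = {out: name for name, (_, outs) in pipelines.items() for out in outs}

# Walk backwards from every consumed asset to every pipeline it depends on.
live, stack = set(), [a for reads in consumers.values() for a in reads]
while stack:
    asset = stack.pop()
    producer = producer_of.get(asset)
    if producer and producer not in live:
        live.add(producer)
        stack.extend(pipelines[producer][0])  # recurse into that pipeline's inputs

dead = set(pipelines) - live
print("live:", sorted(live))  # pipelines some consumer actually needs
print("dead:", sorted(dead))  # ['legacy_risk_rollup'] -> suspension candidates
```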
The graph revealed what the raw compute logs could not: which execution paths had living consumers at the end, and which terminated in nothing.
What the audit found
Thirty-one percent of total compute was spent on pipelines whose output had zero active consumers. These were pipelines that had been built for dashboards that were decommissioned, ML models that were replaced, and reports that had been superseded by newer versions. Nobody deleted the pipeline when the consumer went away. The orchestration system kept running them on schedule because nothing told it to stop.
Twenty-two percent of compute was spent on redundant transformations. Two separate teams had built near-identical aggregation pipelines on the same source data, using slightly different business logic. Both pipelines ran daily. Both consumed the same raw data. Neither team knew the other’s pipeline existed, because they worked in different organizational units that shared the platform but not the catalog.
Fourteen percent of compute was spent on over-refreshed materialized views. Several views were configured to refresh every fifteen minutes, but their downstream dashboards only refreshed once per day. The views were being computed ninety-six times for every one time their output was consumed.
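The arithmetic behind that ratio fits in a few lines; a sketch using the numbers from the text:

```python
# Refresh-vs-consumption check for one materialized view (numbers from the text).
refreshes_per_day = 24 * 60 // 15  # refresh every 15 minutes -> 96 per day
reads_per_day = 1                  # downstream dashboard refreshes once per day

waste_ratio = refreshes_per_day / reads_per_day
print(f"{waste_ratio:.0f} computations per consumption")  # 96
```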
The remaining thirty-three percent was legitimate spend on pipelines with active consumers and appropriate refresh schedules.
Why the pipeline sprawl happened
This was not a governance failure in the traditional sense. The firm had a data catalog. They had a platform team. They had onboarding processes for new pipelines. What they lacked was a consumption-linked lifecycle model. Pipelines were created with an approval process, but they were never re-evaluated once deployed. There was no mechanism to detect when a pipeline’s consumer disappeared.
The root cause was a missing kind of dependency. The orchestration system knew about task dependencies: which pipeline ran after which. But it did not know about consumption dependencies: which pipeline’s output was actually read by a consumer that someone cared about. Task dependencies are technical. Consumption dependencies are organizational. The platform tracked the first but not the second.
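To make the distinction concrete, a toy contrast with hypothetical names; the platform’s metadata captured only the first mapping:

```python
# Technical: which pipeline runs after which. The orchestrator had this.
task_deps = {
    "settlement_enrich": ["raw_trades_ingest"],
    "daily_agg":         ["settlement_enrich"],
}

# Organizational: which consumer actually reads which output. Nothing had this.
consumption_deps = {
    "ops_dashboard": ["settlement_daily"],
}
```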
This is common. Most data platforms can tell you what runs and when. Very few can tell you what runs and why.
The fix: consumption-linked lifecycle management
We implemented a three-part system:
First, every pipeline was required to register at least one active consumer in a consumption registry. The registry was a lightweight metadata store that mapped data assets to their consumers — dashboards, APIs, ML features, scheduled reports. Pipelines without a registered consumer were flagged after thirty days and suspended after sixty days.
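A minimal sketch of that lifecycle rule, assuming the thresholds above and an illustrative in-memory registry:

```python
from datetime import date, timedelta

FLAG_AFTER = timedelta(days=30)
SUSPEND_AFTER = timedelta(days=60)

# Hypothetical registry: pipeline -> (registered consumers, deployment date).
registry = {
    "daily_agg":          ({"ops_dashboard"}, date(2023, 1, 10)),
    "legacy_risk_rollup": (set(), date(2023, 2, 1)),  # nobody registered
}

def lifecycle_state(pipeline: str, today: date) -> str:
    consumers, deployed = registry[pipeline]
    if consumers:
        return "active"
    age = today - deployed
    if age >= SUSPEND_AFTER:
        return "suspended"
    return "flagged" if age >= FLAG_AFTER else "grace period"

for name in registry:
    print(name, "->", lifecycle_state(name, date(2023, 4, 15)))
```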
Second, a monitoring layer tracked actual consumption. For each output table or view, the system logged whether any query, API call, or dashboard refresh had read from it in the past billing cycle. If an output went unconsumed for a full cycle, the producing pipeline was flagged for review.
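A sketch of that check, assuming access logs expose the most recent read per output asset (asset names hypothetical):

```python
from datetime import date, timedelta

BILLING_CYCLE = timedelta(days=30)

# Hypothetical log summary: output asset -> most recent read, None if never.
last_read = {
    "settlement_daily": date(2023, 4, 12),
    "risk_rollup_v1":   date(2023, 1, 3),
}

def flag_unconsumed(today: date) -> list[str]:
    """Outputs with no read this billing cycle; their producers go to review."""
    return [
        asset for asset, read in last_read.items()
        if read is None or today - read > BILLING_CYCLE
    ]

print(flag_unconsumed(date(2023, 4, 15)))  # ['risk_rollup_v1']
```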
Third, cost attribution was tied to consumers, not producers. Instead of reporting “pipeline X costs $Y per month,” the reporting system showed “dashboard Z costs $Y per month, inclusive of all pipelines required to produce its data.” This made the cost of unused outputs visible to the teams that owned the dashboards, not just the platform team.
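A sketch of that rollup: a consumer’s monthly cost is the sum over every pipeline in its dependency closure. The costs and edges are hypothetical, and shared pipelines are counted in full for each consumer here; splitting shared costs is a further design choice.

```python
# Hypothetical monthly compute cost per pipeline.
pipeline_cost = {
    "raw_trades_ingest": 4000,
    "settlement_enrich": 2500,
    "daily_agg":         1200,
}

# pipeline -> upstream pipelines whose output it reads.
upstream = {
    "daily_agg":         {"settlement_enrich"},
    "settlement_enrich": {"raw_trades_ingest"},
    "raw_trades_ingest": set(),
}

# consumer -> the pipelines that directly feed it.
consumer_roots = {"ops_dashboard": {"daily_agg"}}

def consumer_cost(consumer: str) -> int:
    closure, stack = set(), list(consumer_roots[consumer])
    while stack:
        p = stack.pop()
        if p not in closure:
            closure.add(p)
            stack.extend(upstream[p])
    return sum(pipeline_cost[p] for p in closure)

print(consumer_cost("ops_dashboard"))  # 7700: the dashboard's all-in monthly cost
```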
What it cost to fix
The audit itself took three weeks. The implementation of consumption tracking and automated suspension took another four weeks. Total cost was roughly $85,000 in consulting and engineering time. The savings were immediate: $50,000 per month in the first full billing cycle after implementation, growing to $62,000 per month as additional dormant pipelines were identified and suspended.
Payback period was under two months.
What we gave up
The automated suspension system occasionally flagged pipelines that had legitimate but infrequent consumers — quarterly regulatory reports, annual compliance reviews, seasonal analytics. The team addressed this by allowing pipeline owners to declare a minimum consumption frequency. A pipeline serving a quarterly report could be marked as “consumed quarterly” and would only be flagged if it went unconsumed for two consecutive quarters.
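A sketch of that rule, with hypothetical declared periods: a pipeline is flagged only after going unconsumed for two consecutive periods at its declared frequency.

```python
from datetime import date, timedelta

# Hypothetical declarations: pipeline -> minimum consumption frequency.
declared_period = {
    "daily_agg":        timedelta(days=1),
    "quarterly_report": timedelta(days=91),
}

last_consumed = {
    "daily_agg":        date(2023, 4, 10),   # five days stale -> review
    "quarterly_report": date(2022, 12, 20),  # one quarter ago -> still fine
}

def needs_review(pipeline: str, today: date) -> bool:
    # Flag only after two consecutive declared periods with no consumption.
    return today - last_consumed[pipeline] > 2 * declared_period[pipeline]

for p in declared_period:
    print(p, "->", "review" if needs_review(p, date(2023, 4, 15)) else "ok")
```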
This added a small maintenance burden. Pipeline owners had to think about consumption patterns at creation time rather than discovering them after the fact. Most teams considered this a benefit rather than a cost, because it forced a conversation about whether a pipeline was actually needed before it was built.
The decision heuristic
If your data platform cost is growing faster than your data volume, the problem is almost certainly not compute pricing or storage inefficiency. The problem is execution of unneeded work. Before optimizing individual pipelines for performance, build the consumption dependency graph. Find the dead ends. Suspend them. The savings from eliminating work nobody needs will dwarf any optimization you can apply to work that someone does need.