The invisible infrastructure: why data plumbing matters more than models

Simor Consulting | 15 Jun, 2026 | 05 Mins read

A Fortune 500 company hired a team of twelve machine learning engineers and tasked them with building a predictive maintenance system for its manufacturing floor. The ML team spent four months evaluating model architectures: gradient boosted trees, transformers, and temporal convolutional networks. They benchmarked each architecture against a curated dataset and selected the best performer. When they integrated the model with the production data pipeline, accuracy dropped from 92% to 61%.

The problem was not the model. The problem was that the production data pipeline delivered sensor readings with a twelve-minute delay, occasionally duplicated timestamps during batch processing, and applied a smoothing function that the training pipeline did not. The model was trained on clean, synchronous data and deployed against noisy, asynchronous data. No amount of architectural sophistication could compensate for the data pipeline mismatch.
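The smoothing mismatch is the easiest of the three to show in miniature. The sketch below is illustrative only: the readings, the window size, and the threshold "model" are all hypothetical stand-ins, not details from the engagement. It shows how a fault spike that is obvious in raw data can disappear once a serving pipeline applies a trailing moving average.

```python
def trailing_mean(values, window):
    """Smooth a series with a trailing moving average of the given window."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def spike_detected(values, threshold):
    """Stand-in for a model: flags a fault if any reading exceeds the threshold."""
    return any(v > threshold for v in values)


# Hypothetical vibration readings containing one fault spike.
raw = [1.0, 1.0, 10.0, 1.0, 1.0]

# What a pipeline with a 3-point smoothing step would deliver instead.
served = trailing_mean(raw, window=3)

# Assumed decision boundary, learned from unsmoothed training data.
THRESHOLD = 8.0

print(served)                             # [1.0, 1.0, 4.0, 4.0, 4.0]
print(spike_detected(raw, THRESHOLD))     # True
print(spike_detected(served, THRESHOLD))  # False
```

The model logic is identical in both calls. Only the data path differs, and that is enough to lose the signal.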

This story is representative of a pattern I see so consistently that I treat it as a rule: the data plumbing determines the outcome more than the model does. And yet the industry’s attention, hiring, and prestige flow overwhelmingly toward model development, not data infrastructure.

What plumbing actually involves

Data plumbing is the work of getting data from where it originates to where the model needs it, in the condition the model requires, at the time the model needs it, with the governance controls the organization requires. This description sounds straightforward. In practice, it involves solving problems that are less intellectually glamorous than model development but more operationally demanding.

Schema evolution is one. Data sources change their structure over time. A sensor manufacturer updates firmware and the output format shifts. A business system adds a field. A third-party API changes its response structure. Each of these changes can silently break a data pipeline, and the breakage may not be detected until a model trained on the old schema starts producing degraded outputs.
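A minimal sketch of catching this kind of drift at the pipeline boundary, assuming records arrive as dictionaries. The field names and the `EXPECTED_SCHEMA` contract are hypothetical, and the type check is deliberately strict (an integer where a float is expected is reported).

```python
# Hypothetical contract for one sensor feed: field name -> expected Python type.
EXPECTED_SCHEMA = {"sensor_id": str, "timestamp": str, "temperature_c": float}


def schema_problems(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable differences between record and contract."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    for field in record:
        if field not in expected:
            problems.append(f"unexpected field: {field}")
    return problems


# A record in the old format passes cleanly.
old = {"sensor_id": "s1", "timestamp": "2026-03-01T10:00:00+00:00", "temperature_c": 21.5}
print(schema_problems(old))  # []

# After a hypothetical firmware update the field is renamed and the unit changes.
new = {"sensor_id": "s1", "timestamp": "2026-03-01T10:05:00+00:00", "temp_f": 70.7}
print(schema_problems(new))  # ['missing field: temperature_c', 'unexpected field: temp_f']
```

Running a check like this on ingest turns a silent break into a loud one, well before a model sees the data.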

Temporal alignment is another. When a model uses data from multiple sources — say, sensor readings from equipment, maintenance logs from a ticketing system, and production schedules from an ERP — the data must be temporally aligned. Each source may have a different timestamp format, a different time zone, a different latency, and a different definition of “event time” versus “processing time.” Getting this alignment right is tedious, error-prone, and absolutely critical to model accuracy.
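As a rough illustration, assume every source emits ISO 8601 timestamps with an explicit UTC offset. The maintenance log and sensor reading below are made up, and `to_utc` and `latest_event_before` are hypothetical helper names. The sketch normalizes everything to UTC and then does an "as-of" lookup: for a given reading, find the most recent maintenance event at or before it.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def to_utc(iso_string):
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_string)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {iso_string}")
    return parsed.astimezone(timezone.utc)


def latest_event_before(event_times, events, moment):
    """As-of lookup: the most recent event at or before `moment`, else None.

    `event_times` must be sorted ascending and parallel to `events`.
    """
    index = bisect_right(event_times, moment)
    return events[index - 1] if index > 0 else None


# Hypothetical maintenance log, recorded in plant-local time (UTC+02:00).
maintenance = [
    ("2026-03-01T08:00:00+02:00", "bearing replaced"),
    ("2026-03-01T12:30:00+02:00", "belt tightened"),
]
maintenance_times = [to_utc(ts) for ts, _ in maintenance]
maintenance_notes = [note for _, note in maintenance]

# Hypothetical sensor reading, recorded in UTC.
reading_time = to_utc("2026-03-01T11:00:00+00:00")

print(latest_event_before(maintenance_times, maintenance_notes, reading_time))
# belt tightened
```

Comparing the wall-clock values without their offsets (11:00 against 12:30) would have attached the earlier "bearing replaced" event to this reading. That is the kind of quiet misalignment that ends up in training labels.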

Data quality monitoring is the ongoing work. Data quality degrades in ways that are invisible without active monitoring. A sensor starts returning zeros instead of nulls when it malfunctions. A database migration truncates decimal precision. A business rule change alters the meaning of a categorical field. Each of these issues shifts the data distribution in a way that a model will not detect on its own, because the model still interprets every input as if it carried the meaning it had at training time.
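A minimal sketch of one such monitor, assuming a non-empty baseline window of known-good readings to compare against. The thresholds and sample values are hypothetical and would need tuning per signal. Real deployments usually add proper statistical tests on top of this.

```python
from statistics import mean


def zero_fraction(values):
    """Share of readings that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def quality_alerts(baseline, current, max_zero_increase=0.05, max_mean_shift=0.10):
    """Compare a current window to a baseline window and list what looks wrong."""
    alerts = []

    # Zeros where nulls used to be show up as a jump in the zero rate.
    if zero_fraction(current) - zero_fraction(baseline) > max_zero_increase:
        alerts.append("zero rate increased")

    # A crude distribution check: relative shift of the mean.
    baseline_mean = mean(baseline)
    if baseline_mean != 0:
        relative_shift = abs(mean(current) - baseline_mean) / abs(baseline_mean)
        if relative_shift > max_mean_shift:
            alerts.append("mean shifted")

    return alerts


baseline = [20.0, 21.0, 19.0, 20.0]   # hypothetical healthy temperature readings
healthy = [20.5, 19.5, 20.0, 21.0]    # normal variation
failing = [20.0, 0.0, 0.0, 21.0]      # malfunctioning sensor reporting zeros

print(quality_alerts(baseline, healthy))  # []
print(quality_alerts(baseline, failing))  # ['zero rate increased', 'mean shifted']
```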

Access governance is the work that nobody wants to do but that regulatory environments increasingly require. Who can access what data, under what conditions, with what audit trail? AI systems that consume personal data, financial data, or health data require access controls that are integrated into the data pipeline, not bolted on as an afterthought. Getting this wrong has consequences that range from regulatory fines to reputational damage.
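What "integrated into the data pipeline" can mean in practice: every read goes through one function that checks a policy and writes an audit entry, whether or not access is granted. This is a sketch under assumed names. `POLICY`, the roles, and the record fields are all hypothetical, and a real system would back this with an identity provider and durable audit storage.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may read which fields.
POLICY = {
    "machine_id": {"analyst", "ml_pipeline"},
    "operator_name": {"hr_auditor"},
}


def read_field(record, field, role, audit_log, policy=POLICY):
    """Return record[field] if the role is allowed; always append an audit entry."""
    allowed = role in policy.get(field, set())
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "role": role,
        "field": field,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not read {field!r}")
    return record[field]


audit_log = []
record = {"machine_id": "M-7", "operator_name": "J. Doe"}

print(read_field(record, "machine_id", "ml_pipeline", audit_log))  # M-7

try:
    read_field(record, "operator_name", "ml_pipeline", audit_log)
except PermissionError as error:
    print(error)  # role 'ml_pipeline' may not read 'operator_name'

print([entry["allowed"] for entry in audit_log])  # [True, False]
```

Fields missing from the policy are denied by default, so a newly added column stays inaccessible until someone grants it explicitly.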

Why plumbing gets deprioritized

The deprioritization of data plumbing has three causes, and they reinforce each other.

Prestige asymmetry. Building a novel model architecture is publishable, promotable, and demo-able. Building a robust data pipeline is none of these things. The career incentives in data engineering and data science favor model work over pipeline work, so the most talented engineers gravitate toward model development, leaving pipeline work to less experienced engineers or to no one.

Measurement asymmetry. Model accuracy is easy to measure and easy to communicate. A model that achieves 95% accuracy on a benchmark reads as obviously better than one that achieves 91%. Data pipeline quality is harder to measure and harder to communicate. What is the accuracy of a pipeline? What is the quality of a schema migration? These questions have answers, but the answers require more effort to produce and more context to interpret.

Vendor incentive asymmetry. Companies that sell AI tools sell model development tools. AutoML platforms, model registries, experiment trackers, and model serving infrastructure are products with clear pricing and clear marketing. Data plumbing tools are less glamorous, harder to market, and often built in-house rather than purchased. The vendor ecosystem reinforces the perception that model development is where the value is, because that is where the vendor revenue is.

The compounding effect

The consequences of poor plumbing compound over time in a way that poor model architecture does not. A bad model architecture can be replaced. The replacement is a bounded project with a clear deliverable. Bad data plumbing cannot be replaced without addressing the organizational habits, technical debt, and governance gaps that produced it. The replacement is an unbounded project with unclear deliverables, which is exactly the kind of project that organizations deprioritize.

I have seen organizations where the data plumbing debt was so severe that the data team spent sixty percent of its time on pipeline maintenance — debugging data quality issues, fixing broken integrations, manually correcting data errors — and forty percent on everything else, including model development. In these organizations, the data team’s effective capacity for building AI systems was less than half of what the headcount would suggest, because the plumbing consumed the majority of their attention.

What good plumbing looks like

Organizations that take data plumbing seriously share three characteristics.

They measure pipeline quality explicitly. Not just uptime, but data freshness, schema conformance, record completeness, and distribution stability. These metrics are tracked with the same rigor as model accuracy, because they are prerequisites for model accuracy.
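Two of these metrics, freshness and record completeness, take only a few lines to compute. The sketch below assumes records are dictionaries carrying a timezone-aware `event_time`. The field names and sample batch are hypothetical.

```python
from datetime import datetime, timezone


def freshness_seconds(records, now):
    """Age of the newest record in the batch, in seconds."""
    newest = max(r["event_time"] for r in records)
    return (now - newest).total_seconds()


def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-null."""
    complete = sum(
        1 for r in records
        if all(r.get(field) is not None for field in required_fields)
    )
    return complete / len(records)


def utc(hour, minute):
    """Helper for building timestamps on one hypothetical day."""
    return datetime(2026, 3, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical batch as it arrives at the model's feature store.
batch = [
    {"event_time": utc(11, 30), "sensor_id": "s1", "temperature_c": 20.1},
    {"event_time": utc(11, 36), "sensor_id": "s1", "temperature_c": None},
    {"event_time": utc(11, 42), "sensor_id": "s2", "temperature_c": 19.8},
    {"event_time": utc(11, 48), "sensor_id": "s2", "temperature_c": 20.4},
]

now = utc(12, 0)  # fixed "current time" so the example is reproducible

print(freshness_seconds(batch, now))                         # 720.0
print(completeness(batch, ["sensor_id", "temperature_c"]))   # 0.75
```

A freshness of 720 seconds is a twelve-minute lag. Tracked on a dashboard next to model accuracy, numbers like these make pipeline regressions visible before they show up as model regressions.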

They treat pipeline engineering as a specialization with its own career path. Pipeline engineers are not junior data scientists who have not been promoted yet. They are specialists with deep expertise in data movement, transformation, and governance. They have their own leveling criteria, their own technical leadership track, and their own recognition within the organization.

They invest in plumbing before models. When a new AI initiative is proposed, the first question is not “what model should we build” but “is the data ready, and if not, what work is required to make it ready.” This question is unglamorous, and it is the question that separates organizations that ship AI systems from organizations that build demos.

The uncomfortable truth is that the most important work in AI data engineering is the work that no one talks about at conferences. It is the work of getting data from source to model reliably, consistently, and with appropriate governance. Until the industry’s prestige, measurement, and incentive structures reflect this reality, AI systems will continue to underperform their potential — not because the models are bad, but because the plumbing is neglected.
