A diversified industrial company with 10,000 employees across manufacturing, logistics, and field services had accumulated forty-seven separate AI projects over three years. Each business unit had built its own models, its own training pipelines, and its own serving infrastructure. The projects ranged from predictive maintenance on factory equipment to demand forecasting for warehouse staffing. Most showed promising results in pilot. Fewer than ten were in production. None shared infrastructure.
The CTO’s office commissioned a review that revealed the scale of the duplication. Four different teams had built demand forecasting models using three different frameworks and two different feature pipelines, all trained on overlapping data from the same ERP system. Three teams had built natural language processing pipelines for document classification, each using a different embedding model and a different vector store. The total annual spend on AI infrastructure was $4.2 million. The review estimated that the same capability set could be delivered for $1.8 million if the teams shared a common platform.
The political obstacle was not cost. It was autonomy. Each business unit had invested in its own AI team and was reluctant to surrender control to a central platform. The manufacturing division did not trust the logistics division’s feature pipeline. The field services division did not want to wait for a central team to provision infrastructure. Previous attempts at shared platforms had failed because they were experienced as bottlenecks, not enablers.
The design constraint: federated ownership, shared infrastructure
The platform design had to solve two problems simultaneously. First, it had to reduce duplication and cost by providing shared infrastructure for common AI tasks. Second, it had to preserve each business unit’s autonomy to develop, deploy, and iterate on their own models without depending on a central team.
These requirements are in tension. Centralization reduces cost but creates bottlenecks. Decentralization preserves autonomy but duplicates effort. The platform had to find the middle ground: centralize what is expensive to duplicate and cheap to standardize, and decentralize what is specific to each business unit and requires local knowledge.
We identified five capabilities that met the centralization criteria: feature storage, model registry, training infrastructure, serving infrastructure, and monitoring. These capabilities are expensive to build and maintain individually, are generic enough to serve all business units, and do not require domain-specific knowledge to operate.
Feature storage: a shared feature store that held reusable feature sets computed from common data sources. The ERP data, the IoT sensor data, and the customer data were all centralized sources. Computing features from these sources once and sharing them across teams eliminated the most common source of duplication.
Model registry: a shared catalog of trained models with versioning, metadata, and deployment status. This gave each team visibility into what other teams had built. A logistics engineer looking for a demand forecasting model could discover that the manufacturing team had already built one on the same ERP data, and could evaluate whether to reuse it rather than build from scratch.
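The discovery scenario above can be sketched as a small in-memory catalog. This is a minimal illustration, not the platform's actual registry: the class names, field names, status values, and example records are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRecord:
    """One versioned entry in the catalog (all field names are assumptions)."""
    name: str
    version: int
    owner_team: str
    task: str                      # e.g. "demand_forecasting"
    data_sources: frozenset[str]   # e.g. frozenset({"erp"})
    status: str = "registered"     # assumed values: registered, staging, production


@dataclass
class ModelRegistry:
    records: list[ModelRecord] = field(default_factory=list)

    def register(self, record: ModelRecord) -> None:
        self.records.append(record)

    def find(self, task: str, data_source: str) -> list[ModelRecord]:
        """Return models for a task that were trained on a given data source."""
        return [
            r for r in self.records
            if r.task == task and data_source in r.data_sources
        ]


registry = ModelRegistry()
registry.register(ModelRecord(
    "mfg-demand", 3, "manufacturing", "demand_forecasting",
    frozenset({"erp"}), "production",
))
registry.register(ModelRecord(
    "doc-classifier", 1, "field_services", "document_classification",
    frozenset({"contracts"}),
))

# A logistics engineer searching before building from scratch.
candidates = registry.find(task="demand_forecasting", data_source="erp")
```

Indexing on task and data source, rather than on team, is what makes another unit's work discoverable without knowing who built it.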
Training infrastructure: a shared GPU pool with queue-based allocation. Instead of each team provisioning and paying for their own GPU instances, the platform provided a pool that scaled with aggregate demand. A team running a large training job used more GPUs. A team between experiments used none. The pool was sized for peak aggregate demand, which was forty percent lower than the sum of individual peak demands.
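The sizing argument rests on the fact that teams rarely peak at the same time, so the peak of the summed demand is lower than the sum of each team's peak. The sketch below shows the arithmetic on made-up hourly demand figures chosen to reproduce the forty percent gap; they are not the company's data.

```python
# Hypothetical GPU demand per team over the same six time slots.
demand = {
    "manufacturing":  [8, 2, 0, 0, 1, 0],
    "logistics":      [2, 6, 1, 0, 0, 2],
    "field_services": [2, 0, 6, 2, 0, 0],
}


def sum_of_individual_peaks(demand):
    """GPUs needed if every team provisions for its own peak."""
    return sum(max(series) for series in demand.values())


def peak_of_aggregate(demand):
    """GPUs needed if one pool serves the combined demand in each slot."""
    return max(sum(slot) for slot in zip(*demand.values()))


dedicated = sum_of_individual_peaks(demand)   # 8 + 6 + 6
pooled = peak_of_aggregate(demand)            # largest per-slot total
reduction = (dedicated - pooled) / dedicated  # fraction of capacity saved
```

The gap shrinks as team workloads become more correlated. If every team trained at the same hour, the two numbers would be equal and pooling would save nothing.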
Serving infrastructure: a shared inference layer that provided standardized endpoints for model deployment. Each team deployed their models to the serving infrastructure using a configuration file. The platform handled load balancing, autoscaling, and canary deployments. Teams did not need to manage their own serving stacks.
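A deployment configuration of this kind might look like the following once parsed. The field names, defaults, and validation rules are assumptions made for illustration; the source does not describe the actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    """Hypothetical schema for a team's deployment file."""
    model_name: str
    model_version: int
    min_replicas: int = 1
    max_replicas: int = 4
    canary_fraction: float = 0.05  # share of traffic sent to the new version

    def validate(self) -> list[str]:
        errors = []
        if self.min_replicas < 1:
            errors.append("min_replicas must be at least 1")
        if self.max_replicas < self.min_replicas:
            errors.append("max_replicas must be >= min_replicas")
        if not 0.0 <= self.canary_fraction <= 1.0:
            errors.append("canary_fraction must be between 0 and 1")
        return errors


def parse_config(raw: dict) -> DeploymentConfig:
    """Build a config from parsed file contents, rejecting invalid values."""
    config = DeploymentConfig(**raw)
    errors = config.validate()
    if errors:
        raise ValueError("; ".join(errors))
    return config


config = parse_config({
    "model_name": "mfg-demand",
    "model_version": 3,
    "canary_fraction": 0.1,
})
```

Validating at submission time keeps the contract declarative: the team states what it wants, and the platform decides how to balance, scale, and shift traffic.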
Monitoring: a shared observability layer that tracked data drift, model performance degradation, and cost per inference across all deployed models. This gave the central platform team visibility into platform health without requiring access to individual model logic.
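Data drift can be measured from input distributions alone, which is why the platform team needs no access to model logic. The source does not say which statistic was used; the sketch below uses the Population Stability Index as one common choice, with hypothetical bin counts and sample data.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a live sample.

    Bins are equal-width over the baseline's range. Live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))            # hypothetical training-time feature values
shifted = [v + 30 for v in baseline]   # the same feature after a shift in production

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A score of zero means the two samples fall into the bins identically, and larger scores mean a larger shift. The alerting threshold is a judgment call that each owning team would set for its own models.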
What was kept federated
Model development was entirely federated. Each team chose their own framework, their own training methodology, and their own evaluation criteria. The platform provided training infrastructure, not training guidance. A team that wanted to use PyTorch submitted jobs to the same training queue as a team that wanted to use scikit-learn. The platform did not care.
Business logic was federated. The feature store provided raw and lightly transformed features from common data sources. Each team composed these features into domain-specific feature sets using their own transformation logic. The manufacturing team’s definition of “machine utilization” was different from the logistics team’s definition, and both could coexist in the feature store as separate feature sets derived from the same raw data.
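Coexisting definitions can be handled by namespacing each derived feature by the team that owns it. In the sketch below, both formulas for "machine utilization" are invented for illustration, as are the raw field names; the source does not say how either team defined the metric.

```python
# One shared raw record, as the feature store might serve it (hypothetical fields).
raw = {
    "machine_id": "press-7",
    "run_hours": 120.0,
    "scheduled_hours": 160.0,
    "calendar_hours": 168.0,
}


def manufacturing_utilization(row):
    # Assumed definition: running time as a share of scheduled time.
    return row["run_hours"] / row["scheduled_hours"]


def logistics_utilization(row):
    # Assumed definition: running time as a share of all calendar time.
    return row["run_hours"] / row["calendar_hours"]


# Keys are (team, feature name), so identical names never collide.
FEATURE_SETS = {
    ("manufacturing", "machine_utilization"): manufacturing_utilization,
    ("logistics", "machine_utilization"): logistics_utilization,
}


def compute(team, feature, row):
    """Apply a team's own transformation to the shared raw data."""
    return FEATURE_SETS[(team, feature)](row)


mfg_value = compute("manufacturing", "machine_utilization", raw)
log_value = compute("logistics", "machine_utilization", raw)
```

The raw record is computed once and shared. Only the transformations, which encode domain judgment, are owned per team.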
Deployment decisions were federated. Each team decided when to deploy, when to roll back, and what performance thresholds to enforce. The serving infrastructure provided the mechanism. The team provided the judgment.
What we gave up
The shared platform introduced a dependency that did not exist before. If the feature store went down, all teams that depended on it lost access to shared features. The platform team had to maintain higher availability than any individual team had previously maintained on their own. The SLA was set at 99.9 percent uptime, which required active monitoring and a dedicated on-call rotation.
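To make the 99.9 percent target concrete, it helps to convert it into a downtime budget. The helper below is a straightforward calculation; the function name and the choice of a 30-day month are illustrative.

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Minutes of allowed downtime in a period at a given availability."""
    return (1.0 - availability) * period_hours * 60.0


per_30_day_month = downtime_budget_minutes(0.999, 30 * 24)  # about 43.2 minutes
per_year = downtime_budget_minutes(0.999, 365 * 24)         # about 525.6 minutes
```

A budget of roughly 43 minutes a month leaves little room for slow manual detection and recovery, which is consistent with the need for active monitoring and a dedicated on-call rotation.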
The second trade-off was velocity during the transition period. Teams that had been developing on their own infrastructure experienced a slowdown during migration to the shared platform. The migration required adapting training pipelines to use the shared feature store, re-deploying models to the shared serving infrastructure, and integrating with the shared monitoring layer. This migration effort took between four and eight weeks per team, depending on the complexity of their existing stack.
The third trade-off was governance overhead. The platform required a lightweight governance model: who could publish features to the shared store, who could allocate GPU time, who could deploy to the serving infrastructure. The governance model was intentionally minimal, relying on self-service with guardrails rather than approval gates, but it was a new layer of process that teams had to learn.
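The difference between a guardrail and an approval gate is that a guardrail is evaluated automatically at request time, with no human in the loop. A minimal sketch, assuming hypothetical limits and team names:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical policy limits; the real rules are not given in the source."""
    max_gpus_per_job: int
    feature_publishers: frozenset[str]  # teams allowed to publish shared features


def check_gpu_request(gpus: int, rails: Guardrails) -> tuple[bool, str]:
    """Allow any request within the limit immediately, with no approval step."""
    if gpus <= rails.max_gpus_per_job:
        return True, "ok"
    return False, f"requested {gpus} GPUs, limit is {rails.max_gpus_per_job}"


def check_feature_publish(team: str, rails: Guardrails) -> tuple[bool, str]:
    if team in rails.feature_publishers:
        return True, "ok"
    return False, f"team '{team}' is not a registered feature publisher"


rails = Guardrails(
    max_gpus_per_job=16,
    feature_publishers=frozenset({"manufacturing", "logistics"}),
)

within_limit = check_gpu_request(8, rails)
over_limit = check_gpu_request(32, rails)
unregistered = check_feature_publish("field_services", rails)
```

Requests inside the limits succeed at once, and requests outside them fail with a reason the team can act on. Nobody waits in a queue for a central reviewer.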
Results
After eighteen months, thirty-one of the forty-seven AI projects had migrated to the shared platform. The remaining sixteen were either decommissioned or in maintenance mode with no plans for active development. Annual AI infrastructure spend dropped from $4.2 million to $2.1 million, a 50 percent reduction. That fell short of the $1.8 million the review had estimated, but it was close enough to justify the investment.
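The gap between the realized and estimated savings is easy to quantify from the figures above. The calculation below uses only those three numbers, expressed in thousands of dollars to keep the arithmetic exact; the variable names are illustrative.

```python
# Annual infrastructure spend in thousands of dollars, taken from the text.
baseline = 4200    # before the platform
estimated = 1800   # the review's estimate for a shared platform
actual = 2100      # after eighteen months

realized_saving = baseline - actual          # 2100
estimated_saving = baseline - estimated      # 2400

reduction = realized_saving / baseline                 # share of spend removed
share_captured = realized_saving / estimated_saving    # share of the estimate achieved
```

The platform removed half of the original spend and captured 87.5 percent of the saving the review had projected.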
The more significant outcome was cross-pollination. Three business units adopted demand forecasting models that had been built by other teams, adapting them to their own data rather than building from scratch. The field services division used the manufacturing division’s predictive maintenance features to build a vehicle maintenance model. These reuse patterns had been impossible when each team operated in isolation.
Model deployment frequency increased from an average of once per quarter per team to twice per month, roughly a sixfold increase. The shared serving infrastructure removed the deployment bottleneck that had previously required each team to manage their own production environment.
The decision heuristic
Build a shared platform for AI infrastructure when you have more than three teams independently building models on overlapping data. The signal is not the number of projects. The signal is the number of projects that duplicate data pipelines, feature computation, or serving infrastructure. If two teams are independently computing features from the same ERP system, the platform is already overdue. Centralize infrastructure. Federate intelligence. Let each team own their models and their domain logic, but share the expensive, generic plumbing.
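The duplication signal can be checked mechanically from an inventory of who computes features from which source. The sketch below assumes such an inventory exists as a list of (team, data source) pairs; the entries shown are hypothetical.

```python
from collections import defaultdict


def duplicated_sources(pipelines, min_teams=2):
    """Return data sources that at least `min_teams` teams process independently.

    `pipelines` is an iterable of (team, data_source) pairs.
    """
    teams_by_source = defaultdict(set)
    for team, source in pipelines:
        teams_by_source[source].add(team)
    return {
        source: sorted(teams)
        for source, teams in teams_by_source.items()
        if len(teams) >= min_teams
    }


# Hypothetical inventory of feature pipelines across business units.
inventory = [
    ("manufacturing", "erp"),
    ("logistics", "erp"),
    ("field_services", "erp"),
    ("manufacturing", "iot_sensors"),
    ("logistics", "customer"),
]

overlap = duplicated_sources(inventory)
```

Any source that appears in the result is being processed more than once, which is the condition the heuristic treats as evidence that a shared platform is overdue.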