Building an AI operating system for a 10,000-person company

Simor Consulting | 19 May, 2026 | 05 Mins read

A diversified industrial company with 10,000 employees across manufacturing, logistics, and field services had accumulated forty-seven separate AI projects over three years. Each business unit had built its own models, its own training pipelines, and its own serving infrastructure. The projects ranged from predictive maintenance on factory equipment to demand forecasting for warehouse staffing. Most showed promising results in pilot. Fewer than ten were in production. None shared infrastructure.

The CTO’s office commissioned a review that revealed the scale of the duplication. Four different teams had built demand forecasting models using three different frameworks and two different feature pipelines, all trained on overlapping data from the same ERP system. Three teams had built natural language processing pipelines for document classification, each using a different embedding model and a different vector store. The total annual spend on AI infrastructure was $4.2 million. The review estimated that the same capability set could be delivered for $1.8 million if the teams shared a common platform.

The political obstacle was not cost. It was autonomy. Each business unit had invested in its own AI team and was reluctant to surrender control to a central platform. The manufacturing division did not trust the logistics division’s feature pipeline. The field services division did not want to wait for a central team to provision infrastructure. Previous attempts at shared platforms had failed because they were experienced as bottlenecks, not enablers.

The design constraint: federated ownership, shared infrastructure

The platform design had to solve two problems simultaneously. First, it had to reduce duplication and cost by providing shared infrastructure for common AI tasks. Second, it had to preserve each business unit’s autonomy to develop, deploy, and iterate on their own models without depending on a central team.

These requirements are in tension. Centralization reduces cost but creates bottlenecks. Decentralization preserves autonomy but duplicates effort. The platform had to find the middle ground: centralize what is expensive to duplicate and cheap to standardize, and decentralize what is specific to each business unit and requires local knowledge.

We identified five capabilities that met the centralization criteria: feature storage, model registry, training infrastructure, serving infrastructure, and monitoring. These capabilities are expensive to build and maintain individually, are generic enough to serve all business units, and do not require domain-specific knowledge to operate.

Feature storage: a shared feature store that held reusable feature sets computed from common data sources. The ERP data, the IoT sensor data, and the customer data were all centralized sources. Computing features from these sources once and sharing them across teams eliminated the most common source of duplication.
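The compute-once, share-many idea can be sketched in a few lines. This is a toy in-memory illustration, not the platform's actual feature store: the class `SharedFeatureStore`, the feature name `units_per_sku`, and the sample ERP rows are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = dict
FeatureFn = Callable[[List[Row]], Dict[str, float]]


class SharedFeatureStore:
    """Toy store: each registered feature is computed once, then reused."""

    def __init__(self, sources: Dict[str, List[Row]]):
        self._sources = sources
        self._definitions: Dict[str, Tuple[str, FeatureFn]] = {}
        self._materialized: Dict[str, Dict[str, float]] = {}
        self.compute_count = 0  # how many times any feature was actually computed

    def register(self, feature: str, source: str, fn: FeatureFn) -> None:
        self._definitions[feature] = (source, fn)

    def get(self, feature: str) -> Dict[str, float]:
        # Compute on first request only; later callers get the stored result.
        if feature not in self._materialized:
            source, fn = self._definitions[feature]
            self._materialized[feature] = fn(self._sources[source])
            self.compute_count += 1
        return self._materialized[feature]


def units_per_sku(rows: List[Row]) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["sku"]] += row["units"]
    return dict(totals)


# Hypothetical ERP extract, for illustration only.
erp_orders = [
    {"sku": "A", "units": 10},
    {"sku": "B", "units": 4},
    {"sku": "A", "units": 6},
]
store = SharedFeatureStore({"erp_orders": erp_orders})
store.register("units_per_sku", "erp_orders", units_per_sku)

manufacturing_view = store.get("units_per_sku")  # first request computes
logistics_view = store.get("units_per_sku")      # second request reuses
```

Two teams request the same feature, but the computation runs once. A production store would add freshness, versioning, and point-in-time correctness, none of which this sketch attempts.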

Model registry: a shared catalog of trained models with versioning, metadata, and deployment status. This gave each team visibility into what other teams had built. A logistics engineer looking for a demand forecasting model could discover that the manufacturing team had already built one on the same ERP data, and could evaluate whether to reuse it rather than build from scratch.
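A minimal sketch of that discovery step, assuming a registry that records owner, task, data sources, and deployment status for each model version. The field names and the `ModelRegistry` class are hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ModelRecord:
    name: str
    version: int
    owner: str
    task: str
    data_sources: Tuple[str, ...]
    status: str  # e.g. "staging" or "production"


class ModelRegistry:
    """Toy catalog: append-only records with per-name version numbers."""

    def __init__(self) -> None:
        self._records: List[ModelRecord] = []

    def publish(self, name: str, owner: str, task: str,
                data_sources: Iterable[str], status: str = "staging") -> ModelRecord:
        # Next version is one more than the number of existing records for this name.
        version = 1 + sum(1 for r in self._records if r.name == name)
        record = ModelRecord(name, version, owner, task, tuple(data_sources), status)
        self._records.append(record)
        return record

    def search(self, task: str, data_source: str) -> List[ModelRecord]:
        # Discovery: what already exists for this task on this data source?
        return [r for r in self._records
                if r.task == task and data_source in r.data_sources]


registry = ModelRegistry()
registry.publish("plant-demand-forecast", owner="manufacturing",
                 task="demand_forecasting", data_sources=["erp"], status="production")
v2 = registry.publish("plant-demand-forecast", owner="manufacturing",
                      task="demand_forecasting", data_sources=["erp", "iot"])

# A logistics engineer checks for existing work before building from scratch.
matches = registry.search(task="demand_forecasting", data_source="erp")
```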

Training infrastructure: a shared GPU pool with queue-based allocation. Instead of each team provisioning and paying for their own GPU instances, the platform provided a pool that scaled with aggregate demand. A team running a large training job used more GPUs. A team between experiments used none. The pool was sized for peak aggregate demand, which was forty percent lower than the sum of individual peak demands.
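The saving comes from the fact that teams rarely peak at the same time. The sketch below compares the sum of individual peaks with the peak of the summed demand. The demand profiles are invented for illustration and were chosen so the result matches the forty percent figure; they are not the company's actual usage data.

```python
from typing import Dict, List

# Hypothetical GPU demand per team across four time slots (illustrative numbers only).
demand: Dict[str, List[int]] = {
    "manufacturing": [8, 2, 0, 0],
    "logistics": [4, 6, 0, 0],
    "field_services": [0, 2, 6, 2],
}


def sum_of_individual_peaks(profiles: Dict[str, List[int]]) -> int:
    # What the teams provision in total when each sizes for its own peak.
    return sum(max(slots) for slots in profiles.values())


def peak_of_aggregate(profiles: Dict[str, List[int]]) -> int:
    # What a shared pool needs: the busiest slot across all teams combined.
    return max(sum(slot) for slot in zip(*profiles.values()))


siloed = sum_of_individual_peaks(demand)  # 8 + 6 + 6 = 20
pooled = peak_of_aggregate(demand)        # max(12, 10, 6, 2) = 12
saving = (siloed - pooled) / siloed

print(f"siloed={siloed} pooled={pooled} saving={saving:.0%}")
# prints: siloed=20 pooled=12 saving=40%
```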

Serving infrastructure: a shared inference layer that provided standardized endpoints for model deployment. Each team deployed their models to the serving infrastructure using a configuration file. The platform handled load balancing, autoscaling, and canary deployments. Teams did not need to manage their own serving stacks.
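The source does not show the configuration format, so the following is a sketch under assumptions: a JSON file with hypothetical fields for model identity, autoscaling bounds, and canary share, plus a simple deterministic traffic split to show what a canary percentage means in practice.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    # All field names are hypothetical; the real platform's schema is not documented here.
    model_name: str
    model_version: int
    min_replicas: int
    max_replicas: int
    canary_percent: int

    def __post_init__(self) -> None:
        # Guardrails applied at load time, before anything is deployed.
        if not 1 <= self.min_replicas <= self.max_replicas:
            raise ValueError("replica bounds must satisfy 1 <= min <= max")
        if not 0 <= self.canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")


def load_config(text: str) -> DeploymentConfig:
    return DeploymentConfig(**json.loads(text))


def route(request_id: int, config: DeploymentConfig) -> str:
    """Send a fixed share of requests to the new version, deterministically."""
    return "canary" if request_id % 100 < config.canary_percent else "stable"


config_text = """
{
  "model_name": "vehicle-maintenance",
  "model_version": 3,
  "min_replicas": 2,
  "max_replicas": 10,
  "canary_percent": 10
}
"""
config = load_config(config_text)
canary_hits = sum(route(i, config) == "canary" for i in range(1000))
```

The point of the design is that a team edits a file and the platform does the rest; validation at load time is where the guardrails live.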

Monitoring: a shared observability layer that tracked data drift, model performance degradation, and cost per inference across all deployed models. This gave the central platform team visibility into platform health without requiring access to individual model logic.
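The source does not say which drift metric the monitoring layer used. One common choice is the population stability index (PSI), which compares binned distributions and therefore needs only feature values, not model internals. The sketch below assumes PSI; the bin edges and sample values are invented.

```python
import bisect
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin; out-of-range values go to the edge bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        i = min(max(i, 0), len(counts) - 1)
        counts[i] += 1
    return [c / len(values) for c in counts]


def psi(baseline: Sequence[float], live: Sequence[float],
        edges: Sequence[float], floor: float = 1e-6) -> float:
    """Population stability index between a baseline and a live sample."""
    total = 0.0
    for e, a in zip(bin_shares(baseline, edges), bin_shares(live, edges)):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
shifted = [0.3, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95, 0.95]

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, shifted, edges)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, though any alert threshold should be tuned per feature.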

What was kept federated

Model development was entirely federated. Each team chose its own framework, training methodology, and evaluation criteria. The platform provided training infrastructure, not training guidance. A team that wanted to use PyTorch queued for the same shared training pool as a team that wanted to use scikit-learn. The platform did not care.

Business logic was federated. The feature store provided raw and lightly transformed features from common data sources. Each team composed these features into domain-specific feature sets using their own transformation logic. The manufacturing team’s definition of “machine utilization” was different from the logistics team’s definition, and both could coexist in the feature store as separate feature sets derived from the same raw data.
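To make the coexistence concrete, here is a sketch with two hypothetical definitions of machine utilization computed from the same raw features. The formulas, field names, and numbers are assumptions for illustration; the source does not give either team's actual definition.

```python
from typing import Dict

# Shared raw features for one machine over a 30-day window (invented values).
raw: Dict[str, Dict[str, float]] = {
    "press_07": {"run_hours": 120.0, "scheduled_hours": 160.0, "calendar_hours": 720.0},
}


def manufacturing_utilization(f: Dict[str, float]) -> float:
    # Hypothetical definition: running time as a share of scheduled time.
    return f["run_hours"] / f["scheduled_hours"]


def logistics_utilization(f: Dict[str, float]) -> float:
    # Hypothetical definition: running time as a share of all calendar time.
    return f["run_hours"] / f["calendar_hours"]


# Namespaced feature sets let both definitions live side by side.
feature_sets = {
    "manufacturing.machine_utilization": {
        machine: manufacturing_utilization(f) for machine, f in raw.items()
    },
    "logistics.machine_utilization": {
        machine: logistics_utilization(f) for machine, f in raw.items()
    },
}
```

Namespacing by owning team is one way to keep the two from colliding; the raw inputs stay shared while the interpretation stays local.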

Deployment decisions were federated. Each team decided when to deploy, when to roll back, and what performance thresholds to enforce. The serving infrastructure provided the mechanism. The team provided the judgment.

What we gave up

The shared platform introduced a dependency that did not exist before. If the feature store went down, all teams that depended on it lost access to shared features. The platform team had to maintain higher availability than any individual team had previously maintained on their own. The SLA was set at 99.9 percent uptime, which required active monitoring and a dedicated on-call rotation.
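It helps to translate the SLA into a downtime budget. The arithmetic below is a simple sketch; the helper name is ours, and the 30-day month is a simplifying assumption.

```python
def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Hours of allowed downtime in a period at a given availability target."""
    return (1.0 - availability) * period_hours


per_year_hours = downtime_budget_hours(0.999, 365 * 24)        # about 8.76 hours
per_month_minutes = downtime_budget_hours(0.999, 30 * 24) * 60  # about 43.2 minutes

print(f"{per_year_hours:.2f} hours per year, {per_month_minutes:.1f} minutes per month")
```

At 99.9 percent, a single incident lasting under an hour can use up a whole month's budget, which is why the target implies an on-call rotation rather than best-effort support.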

The second trade-off was velocity during the transition period. Teams that had been developing on their own infrastructure experienced a slowdown during migration to the shared platform. The migration required adapting training pipelines to use the shared feature store, re-deploying models to the shared serving infrastructure, and integrating with the shared monitoring layer. This migration effort took between four and eight weeks per team, depending on the complexity of their existing stack.

The third trade-off was governance overhead. The platform required a lightweight governance model: who could publish features to the shared store, who could allocate GPU time, who could deploy to the serving infrastructure. The governance model was intentionally minimal — self-service with guardrails rather than approval gates — but it was a new layer of process that teams had to learn.
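Self-service with guardrails can be read as: requests inside the limits are approved automatically, and only requests outside them are refused. The sketch below assumes three hypothetical guardrails (a publisher allow-list, a per-job GPU cap, and a monthly GPU-hour quota); the actual rules are not described in the source.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Guardrails:
    # Hypothetical limits, for illustration only.
    feature_publishers: FrozenSet[str]
    max_gpus_per_job: int
    monthly_gpu_hour_quota: float


def can_publish_feature(user: str, rails: Guardrails) -> bool:
    return user in rails.feature_publishers


def check_gpu_request(gpus: int, hours: float, used_hours: float,
                      rails: Guardrails) -> Tuple[bool, str]:
    """Approve automatically unless the request breaks a guardrail."""
    if gpus > rails.max_gpus_per_job:
        return False, "exceeds per-job GPU limit"
    if used_hours + gpus * hours > rails.monthly_gpu_hour_quota:
        return False, "exceeds monthly GPU-hour quota"
    return True, "auto-approved"


rails = Guardrails(
    feature_publishers=frozenset({"ana", "raj"}),
    max_gpus_per_job=8,
    monthly_gpu_hour_quota=500.0,
)

within_limits = check_gpu_request(gpus=4, hours=10, used_hours=100, rails=rails)
too_many_gpus = check_gpu_request(gpus=16, hours=1, used_hours=0, rails=rails)
over_quota = check_gpu_request(gpus=8, hours=60, used_hours=100, rails=rails)
```

No human sits in the approval path for the first request, which is the difference between a guardrail and a gate.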

Results

After eighteen months, thirty-one of the forty-seven AI projects had migrated to the shared platform. The remaining sixteen were either decommissioned or in maintenance mode with no plans for active development. Annual AI infrastructure spend dropped from $4.2 million to $2.1 million, a 50 percent reduction. That fell short of the $1.8 million the review had estimated, but it was close enough to justify the investment.

The more significant outcome was cross-pollination. Three business units adopted demand forecasting models that had been built by other teams, adapting them to their own data rather than building from scratch. The field services division used the manufacturing division’s predictive maintenance features to build a vehicle maintenance model. These reuse patterns had been impossible when each team operated in isolation.

Model deployment frequency increased from an average of once per quarter per team to twice per month, roughly a sixfold increase. The shared serving infrastructure removed the bottleneck of each team having to manage its own production environment.

The decision heuristic

Build a shared platform for AI infrastructure when you have more than three teams independently building models on overlapping data. The signal is not the number of projects. The signal is the number of projects that duplicate data pipelines, feature computation, or serving infrastructure. Treat the team count as a rule of thumb rather than a hard threshold: if even two teams are independently computing features from the same ERP system, the platform is already overdue. Centralize infrastructure. Federate intelligence. Let each team own their models and their domain logic, but share the expensive, generic plumbing.
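The duplication signal is easy to check once you have an inventory of which teams compute features from which sources. A minimal sketch, assuming such an inventory exists as a simple mapping; the team and source names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_sources(team_sources: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Sources from which two or more teams independently compute features."""
    users: Dict[str, List[str]] = defaultdict(list)
    for team, sources in team_sources.items():
        for source in sources:
            users[source].append(team)
    return {s: sorted(teams) for s, teams in users.items() if len(teams) >= 2}


def platform_overdue(team_sources: Dict[str, Set[str]]) -> bool:
    # Any duplicated source is treated as the signal, regardless of project count.
    return bool(shared_sources(team_sources))


# Hypothetical inventory of feature pipelines per team.
inventory = {
    "manufacturing": {"erp", "iot_sensors"},
    "logistics": {"erp"},
    "field_services": {"telematics"},
}

duplicated = shared_sources(inventory)
overdue = platform_overdue(inventory)
```

Here only the ERP system is duplicated, and that alone is enough to trip the check.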
