Scaling Machine Learning Infrastructure: From POC to Production
Moving a machine learning model from notebook to production exposes gaps that notebooks hide. Data scientists produce working models in controlled environments; in production, those same models often fail for infrastructure reasons rather than modeling ones.
This article covers the key challenges and patterns for scaling ML from experimentation to production.
The ML Scaling Challenge
Teams encounter predictable problems when scaling ML:
- POC-Production Gap: Models trained in notebooks fail with real traffic
- Infrastructure Requirements: ML workloads need GPU scheduling, feature stores, and model serving infrastructure
- Operational Complexity: Models drift and need retraining pipelines
- Cross-Functional Dependencies: ML systems span data engineering, software engineering, and domain expertise
Architectural Components for Production ML
A robust ML infrastructure includes several key components:
1. Data Management Layer
Reliable data infrastructure is the foundation of any ML system:
```python
# Example: Feature store integration in a training workflow
from feast import FeatureStore
import pandas as pd

# Initialize feature store client
fs = FeatureStore(repo_path="./feature_repo")

# Build the entity dataframe for point-in-time-correct feature retrieval
entity_df = pd.DataFrame({
    "customer_id": ["1001", "1002", "1003"],
    "event_timestamp": pd.to_datetime([
        "2023-01-01 10:00:00",
        "2023-01-01 11:00:00",
        "2023-01-01 12:00:00",
    ]),
})

# Retrieve historical features for training
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:recency",
        "customer_features:frequency",
        "customer_features:monetary_value",
        "transaction_features:avg_transaction_value",
    ],
).to_df()

# Train model with the same feature definitions used at serving time
model = train_model(training_df)  # train_model defined elsewhere
```
Key components:
- Feature Stores: Centralized repositories for feature definitions, computation, and serving
- Data Versioning: Systems to track and reproduce datasets used for specific model versions
- Data Quality Monitoring: Automated checks to ensure data quality and detect drift
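To make the data quality monitoring idea concrete, here is a minimal sketch of a batch-level quality gate that could run before data reaches training. The column names and thresholds are illustrative, not taken from any particular system:

```python
# Minimal data-quality gate: validate a batch before it enters training.
# Column names and thresholds here are hypothetical illustrations.
import pandas as pd

def check_batch(df: pd.DataFrame, max_null_frac: float = 0.01) -> list[str]:
    """Return a list of human-readable data-quality violations."""
    issues = []
    # Schema check: required columns must be present
    for col in ("customer_id", "recency", "frequency", "monetary_value"):
        if col not in df.columns:
            issues.append(f"missing column: {col}")
    # Completeness check: null fraction per column
    for col in df.columns:
        null_frac = df[col].isna().mean()
        if null_frac > max_null_frac:
            issues.append(f"{col}: {null_frac:.1%} nulls exceeds {max_null_frac:.1%}")
    # Range check: monetary values should be non-negative
    if "monetary_value" in df.columns and (df["monetary_value"] < 0).any():
        issues.append("monetary_value contains negative values")
    return issues

batch = pd.DataFrame({
    "customer_id": ["1001", "1002"],
    "recency": [10, None],
    "frequency": [3, 5],
    "monetary_value": [120.0, -5.0],
})
problems = check_batch(batch)
print(problems)
```

In practice these checks would run inside the ingestion pipeline and either quarantine the batch or page an owner, rather than just returning a list.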
2. Model Development Environment
Standardized, reproducible environments for model development:
```yaml
# Example: Containerized development environment
# docker-compose.yml for local ML development
version: "3"
services:
  jupyter:
    build:
      context: ./docker
      dockerfile: Dockerfile.jupyter
    volumes:
      - ./notebooks:/home/jovyan/notebooks
      - ./data:/home/jovyan/data
      - ./src:/home/jovyan/src
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.3.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlflow:/mlflow
    # The server is configured via CLI flags; clients point their
    # MLFLOW_TRACKING_URI at http://localhost:5000
    command: >
      mlflow server
      --backend-store-uri sqlite:///mlflow/mlflow.db
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
```
Key components:
- Development Containers: Standardized environments with consistent dependencies
- Experiment Tracking: Tools to log hyperparameters, metrics, and artifacts
- Model Registry: Central repository for model versions and metadata
- Collaborative Notebooks: Environments that support collaboration while encouraging production practices
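To make the experiment-tracking bullet concrete, here is a toy sketch of the record a tracker keeps per run. Real tools such as MLflow manage this for you and add a UI, artifact storage, and a registry on top; this only illustrates the underlying data model:

```python
# Toy experiment tracker: illustrates the per-run record that tools like
# MLflow capture (hyperparameters, metrics, artifact location).
import time
import uuid

class RunTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict, artifact_path: str) -> str:
        """Record one training run and return its generated run id."""
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
            "artifact_path": artifact_path,
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric: str) -> dict:
        """Return the run with the highest value for the given metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"max_depth": 4}, {"auc": 0.81}, "models/run_a.pkl")
tracker.log_run({"max_depth": 8}, {"auc": 0.86}, "models/run_b.pkl")
best = tracker.best_run("auc")
print(best["params"])  # the deeper model wins on AUC in this toy comparison
```

The point is that every run is queryable later: "which hyperparameters produced the model currently in production?" should never require archaeology.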
3. CI/CD Pipeline for ML
Automating the ML lifecycle:
```yaml
# Example: GitHub Actions workflow for ML pipeline
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - "src/**"
      - "config/**"
      - "tests/**"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt
      - name: Run tests
        run: pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Train model
        run: python src/train.py --config config/training_config.yaml
      - name: Register model
        run: python src/register_model.py --model-path ./models/latest.pkl

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Deploy model
        run: python src/deploy.py --environment staging
```
Key components:
- Automated Testing: Validating data pipelines, model quality, and system behavior
- Model Training Pipelines: Reproducible workflows for model training and evaluation
- Deployment Automation: Consistent processes for deploying models to various environments
- Infrastructure-as-Code: Defining ML infrastructure components as code
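The "Automated Testing" bullet can include model-quality gates that fail the pipeline when a candidate model underperforms. A minimal sketch, assuming AUC is the metric of record; the thresholds are placeholders, and a real gate would read metrics from the training step's output rather than take them as arguments:

```python
# Model quality gate: fail CI when a candidate model falls below an absolute
# floor or regresses too far from the production model. Thresholds are
# illustrative placeholders.
def quality_gate(candidate_auc: float, production_auc: float,
                 min_auc: float = 0.75, max_regression: float = 0.02) -> tuple[bool, str]:
    """Return (passed, reason) for a candidate model's evaluation metrics."""
    if candidate_auc < min_auc:
        return False, f"AUC {candidate_auc:.3f} below floor {min_auc:.3f}"
    if production_auc - candidate_auc > max_regression:
        return False, (f"AUC regressed {production_auc - candidate_auc:.3f} "
                       f"vs production (max {max_regression:.3f})")
    return True, "ok"

# In CI this would run as a pytest test so a failure blocks the deploy job.
passed, reason = quality_gate(candidate_auc=0.84, production_auc=0.85)
print(passed, reason)
```

Encoding the gate as a test means "is this model good enough to ship?" is answered by the pipeline, not by whoever happens to be watching the dashboard.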
4. Scalable Model Serving
Deploying models to handle production workloads:
```python
# Example: Model serving with FastAPI
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np
import time

app = FastAPI(title="Customer Churn Prediction API")

# Load model once at startup, not per request
model = joblib.load("models/churn_model_v1.pkl")

class PredictionRequest(BaseModel):
    customer_id: str
    features: dict

class PredictionResponse(BaseModel):
    customer_id: str
    churn_probability: float
    prediction_time: str
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Extract features in the order the model was trained on
        # (scikit-learn exposes this as feature_names_in_ when fit on a DataFrame)
        feature_array = np.array([
            request.features.get(f, 0) for f in model.feature_names_in_
        ]).reshape(1, -1)

        # Record timestamp for latency monitoring
        start_time = time.time()

        # Make prediction
        probability = model.predict_proba(feature_array)[0, 1]

        # Calculate prediction latency
        prediction_time = f"{(time.time() - start_time) * 1000:.2f}ms"

        return PredictionResponse(
            customer_id=request.customer_id,
            churn_probability=float(probability),
            prediction_time=prediction_time,
            model_version="v1",
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction error: {e}")
```
Key considerations:
- Serving Patterns: Batch, real-time, or hybrid prediction serving
- Scaling Strategies: Horizontal scaling for handling variable workloads
- Resource Optimization: Balancing performance and cost
- Latency Requirements: Meeting application-specific response time needs
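For the batch side of the serving-pattern decision, here is a sketch of chunked batch scoring: processing rows in fixed-size chunks keeps memory bounded regardless of dataset size. The chunk size and the stand-in model are assumptions for illustration:

```python
# Chunked batch scoring: call the model once per fixed-size chunk so memory
# stays bounded. The "model" here is a deliberately trivial stand-in.
from typing import Callable, Iterable, Iterator

def score_in_chunks(rows: Iterable[dict],
                    predict: Callable[[list], list],
                    chunk_size: int = 2) -> Iterator[tuple]:
    """Yield (row, score) pairs, invoking predict once per chunk."""
    chunk: list = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield from zip(chunk, predict(chunk))
            chunk = []
    if chunk:  # flush the final partial chunk
        yield from zip(chunk, predict(chunk))

# Stand-in model: score by a single feature, capped at 1.0
fake_predict = lambda batch: [min(1.0, r["recency"] / 100) for r in batch]
rows = [{"recency": 10}, {"recency": 50}, {"recency": 200}]
scores = [s for _, s in score_in_chunks(rows, fake_predict)]
print(scores)  # [0.1, 0.5, 1.0]
```

The same chunking shape works whether the sink is a database table, a feature store write, or a downstream message queue; real-time serving (as in the FastAPI example above) is the complementary pattern for per-request latency requirements.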
5. Monitoring and Observability
Systems to monitor model performance and health:
```python
# Example: Model monitoring pipeline (Evidently 0.1.x "Profile" API)
import json

from evidently.model_profile import Profile
from evidently.model_profile.sections import (
    ClassificationPerformanceProfileSection,
    DataDriftProfileSection,
)
from evidently.pipeline.column_mapping import ColumnMapping

def monitor_model_drift(reference_data, current_data, categorical_features, numerical_features):
    """Analyze data drift between reference and current datasets."""
    column_mapping = ColumnMapping(
        prediction="prediction",
        target="target",
        categorical_features=categorical_features,
        numerical_features=numerical_features,
    )

    # Calculate data drift metrics
    data_drift_profile = Profile(sections=[DataDriftProfileSection()])
    data_drift_profile.calculate(reference_data, current_data, column_mapping=column_mapping)

    # Profile.json() returns a JSON string, so parse it before indexing.
    # Note: the exact report key paths vary across Evidently versions;
    # adjust them to the schema your version emits.
    drift_report = json.loads(data_drift_profile.json())
    drift_score = drift_report["metrics"]["data_drift"]["data_drift_score"]

    if drift_score > 0.2:  # Threshold for significant drift
        alert_data_drift(drift_score, drift_report)  # alerting helper defined elsewhere

    # Calculate performance metrics if ground-truth labels are available
    if "target" in current_data.columns:
        perf_profile = Profile(sections=[ClassificationPerformanceProfileSection()])
        perf_profile.calculate(reference_data, current_data, column_mapping=column_mapping)
        perf_report = json.loads(perf_profile.json())

        current_auc = perf_report["metrics"]["performance"]["auc"]
        reference_auc = perf_report["metrics"]["performance"]["reference_auc"]
        performance_drop = reference_auc - current_auc

        if performance_drop > 0.05:  # Threshold for performance degradation
            alert_performance_drop(current_auc, reference_auc, perf_report)

    return drift_report
```
Key components:
- Data Drift Detection: Monitoring for shifts in input data distributions
- Model Performance Tracking: Measuring model quality metrics over time
- Resource Utilization: Monitoring compute, memory, and storage usage
- Alerting and Dashboards: Making ML system health visible to stakeholders
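A common drift statistic behind tools like Evidently is the Population Stability Index (PSI). A minimal NumPy sketch; the 10-bin histogram and the 0.2 alert threshold are widely used conventions, not universal constants:

```python
# Population Stability Index (PSI): compares a feature's distribution in
# current traffic against a training-time reference. Rule of thumb:
# PSI > 0.2 suggests significant shift. Bin count is a convention.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between two samples of one feature, using reference-based bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor bin fractions at a small epsilon to avoid log(0)
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)     # same distribution -> PSI near 0
shifted = rng.normal(1, 1, 10_000)  # one-sigma mean shift -> large PSI
print(round(psi(reference, same), 3), round(psi(reference, shifted), 3))
```

Note the simplification: values falling outside the reference bin edges are dropped by `np.histogram`, which production implementations usually handle with open-ended edge bins.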
Scaling Strategies for ML Infrastructure
Approaches to scaling vary by organization size:
For Startups and Small Teams
Focus on agility while building production-ready foundations:
- Managed Services: Use cloud-based ML platforms to reduce operational overhead
- Minimal Workflows: Create simple but consistent processes for deploying models
- Core Metrics: Monitor essential model performance and data quality metrics
- Flexible Infrastructure: Build for current needs while planning for future growth
```shell
# Example: Simple model deployment with BentoML
bentoml build
bentoml containerize churn_predictor:latest
docker run -p 3000:3000 churn_predictor:latest
```
For Mid-Sized Organizations
Balance customization and standardization:
- Hybrid Infrastructure: Combine managed services with custom components
- Team-Specific Workflows: Create specialized workflows for different ML use cases
- Centralized Governance: Implement shared standards for model management
- Scalable Serving: Introduce more sophisticated serving architectures
For Enterprise Organizations
Implement comprehensive ML platforms:
- Custom ML Platforms: Build internal platforms tailored to organization needs
- Multi-Team Governance: Create governance structures across multiple ML teams
- Self-Service: Enable teams to deploy models through standardized interfaces
- Global-Regional Infrastructure: Distribute ML infrastructure to support global operations
ML Infrastructure Maturity Model
A framework for assessing and improving ML infrastructure:
Level 1: Initial
- Manual model training and deployment
- Limited documentation and reproducibility
- Ad-hoc monitoring and maintenance
Level 2: Repeatable
- Basic automation for training pipelines
- Version control for code and models
- Simple monitoring for model performance
Level 3: Defined
- Standardized development environments
- Consistent CI/CD pipelines for models
- Regular retraining processes
- Comprehensive monitoring
Level 4: Managed
- Feature store implementation
- Automated quality gates throughout the pipeline
- Self-healing infrastructure
- Performance optimization
Level 5: Optimized
- Multi-environment deployment strategies
- Automated experimentation
- Advanced observability and explainability
- Continuous infrastructure optimization
Decision Rules
Use this checklist to decide where to focus:
- If models work in notebooks but fail in production, fix the data pipeline first
- If training takes too long, add GPU scheduling and distributed training
- If latency spikes in production, profile your model serving layer
- If models drift quickly, implement automated retraining
- If teams step on each other, introduce a model registry and feature store
Start with the minimum viable infrastructure. Add complexity only when you hit specific problems.