Deploying ML Models on Kubernetes: Best Practices

Simor Consulting | 06 May, 2024 | 03 Mins read

ML models in production need orchestration, scaling, and monitoring infrastructure. Kubernetes provides these capabilities, though the learning curve is steep and operational complexity is significant.

This article covers practical approaches to ML model deployment on Kubernetes, from containerization to advanced orchestration patterns.

Why Kubernetes for ML Model Deployment?

Kubernetes offers specific advantages for ML deployments:

  1. Scalability: Automatically scale model serving based on traffic patterns
  2. Resource Efficiency: Optimize GPU/CPU allocation across multiple models
  3. Reproducibility: Consistent environments from development to production
  4. High Availability: Robust failover and self-healing capabilities
  5. Workflow Integration: Integration with CI/CD and MLOps pipelines

Core Components for ML on Kubernetes

1. Containerized Model Serving

Packaging models for deployment:

# Example: Dockerfile for model serving with TensorFlow Serving
FROM tensorflow/serving:2.11.0

# Copy model artifacts
COPY ./saved_model /models/my_model/1

# Model name configuration
ENV MODEL_NAME=my_model

# Port configuration
EXPOSE 8500 8501

# Start serving (shell form so ${MODEL_NAME} is expanded at runtime;
# the JSON exec form does not perform environment substitution)
CMD tensorflow_model_server \
    --model_name=${MODEL_NAME} \
    --model_base_path=/models/${MODEL_NAME} \
    --rest_api_port=8501 \
    --port=8500

For PyTorch models, a custom serving solution might use FastAPI:

# model_server.py
from fastapi import FastAPI, HTTPException
import torch
from pydantic import BaseModel
import os
import time

app = FastAPI()

# Load the model once at startup. torch.load here expects a fully
# serialized model (saved with torch.save(model), not a state_dict).
model_path = os.environ.get("MODEL_PATH", "/models/model.pt")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device)
model.eval()

@app.get("/health")
def health():
    # Target for the Kubernetes readiness/liveness probes
    return {"status": "ok"}

class PredictionRequest(BaseModel):
    inputs: list

class PredictionResponse(BaseModel):
    predictions: list
    model_version: str
    prediction_time: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        start_time = time.time()
        input_tensor = torch.tensor(request.inputs, dtype=torch.float32).to(device)

        with torch.no_grad():
            outputs = model(input_tensor)

        predictions = outputs.cpu().numpy().tolist()

        return PredictionResponse(
            predictions=predictions,
            model_version=os.environ.get("MODEL_VERSION", "unknown"),
            prediction_time=time.time() - start_time
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction error: {str(e)}")

2. Kubernetes Deployment Manifests

Basic deployment configuration:

# Example: Kubernetes deployment for model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-model
  labels:
    app: fraud-detection
    component: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detection
      component: model-serving
  template:
    metadata:
      labels:
        app: fraud-detection
        component: model-serving
    spec:
      containers:
        - name: model-server
          image: acr.io/company/fraud-detection:v1.2.3
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
            requests:
              cpu: "1"
              memory: "2Gi"
          ports:
            - containerPort: 8501
              name: http
          readinessProbe:
            httpGet:
              path: /health
              port: 8501
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8501
            initialDelaySeconds: 60
            periodSeconds: 15

3. Model Storage and Versioning

Options for managing model artifacts:

  1. Container-Based: Package models within containers

    • Pros: Simplicity, versioning with container tags
    • Cons: Large container sizes, tight coupling of model and code
  2. Volume-Based: Store models on persistent volumes

    • Pros: Separation of models from code, easier updates
    • Cons: Additional complexity for volume management
  3. Cloud Storage Integration: Pull models from S3, GCS, etc.

    • Pros: Clean separation, flexible versioning
    • Cons: Potential latency, additional authentication requirements

Advanced Deployment Patterns

1. Auto-scaling for Variable Workloads

Horizontal Pod Autoscaler configuration:

# Example: HPA based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detection-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
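CPU utilization is a weak scaling signal for GPU-bound inference, where the CPU may sit idle while the GPU saturates. With a custom-metrics adapter such as prometheus-adapter installed, the `metrics` section above can instead target request throughput; a sketch assuming the pods export an `inference_requests_per_second` metric (a hypothetical name):

```yaml
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```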

2. Canary Deployments and A/B Testing

Implementing progressive rollouts:

# Example: Model canary deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-v2
  labels:
    app: fraud-detection
    version: v2
spec:
  replicas: 1
  # selector and pod template omitted for brevity
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model-route
spec:
  hosts:
    - fraud-model-service
  http:
    - route:
        - destination:
            host: fraud-model-service
            subset: v1
          weight: 90
        - destination:
            host: fraud-model-service
            subset: v2
          weight: 10
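The v1 and v2 subsets referenced by the VirtualService must be declared in an Istio DestinationRule that maps subset names to pod labels; without it, the routes above have nothing to resolve to. A minimal sketch, assuming the deployments carry `version: v1` / `version: v2` labels:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: fraud-model-destination
spec:
  host: fraud-model-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```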

3. Multi-Model Serving

Efficiently hosting multiple models:

# Example: Multi-model server configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-server
spec:
  template:
    spec:
      containers:
        - name: model-server
          # Official Triton image from NVIDIA's NGC registry
          image: nvcr.io/nvidia/tritonserver:22.01-py3
          command: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000  # HTTP/REST
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Prometheus metrics
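Triton discovers models from a repository directory with a fixed layout: one subdirectory per model, containing a config file and numbered version directories. A sketch of what /models might hold for the fraud model (ONNX is just one example backend; the names here are illustrative):

```
# Repository layout expected by Triton:
#   /models
#   └── fraud_detection
#       ├── config.pbtxt
#       └── 1
#           └── model.onnx

# config.pbtxt (protobuf text format)
name: "fraud_detection"
platform: "onnxruntime_onnx"
max_batch_size: 32
```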

Resource Optimization

GPU Sharing and Allocation

# Example: sharing a GPU between pods with CUDA MPS (Multi-Process Service)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-shared-model
spec:
  template:
    spec:
      containers:
        - name: model-server
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "0"
            - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
              value: "30"
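An alternative to MPS is time-slicing via the NVIDIA k8s-device-plugin, which makes one physical GPU appear as several schedulable nvidia.com/gpu resources. A hedged sketch of the plugin's sharing config (the exact schema depends on the plugin version):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```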

Monitoring and Observability

Model-Specific Metrics

# Example: FastAPI metrics endpoint (extends the model_server.py app above)
from prometheus_fastapi_instrumentator import Instrumentator
import prometheus_client

prediction_latency = prometheus_client.Histogram(
    "prediction_latency_seconds",
    "Time spent processing prediction",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    labelnames=["model_version", "model_name"]
)

prediction_counter = prometheus_client.Counter(
    "prediction_requests_total",
    "Total number of prediction requests",
    labelnames=["model_version", "model_name", "status"]
)

# Expose default HTTP metrics plus the custom ones on /metrics
Instrumentator().instrument(app).expose(app)

# Inside the predict handler, record per-request values, e.g.:
#   prediction_latency.labels(model_version=version, model_name=name).observe(elapsed)
#   prediction_counter.labels(model_version=version, model_name=name, status="ok").inc()

Decision Rules

Use this checklist for Kubernetes ML deployment decisions:

  1. If latency is critical, profile the model serving layer before scaling
  2. If GPU utilization is low, consider batching requests or model multiplexing
  3. If deployment times are slow, use pre-built container images with models baked in
  4. If you need zero-downtime updates, implement readiness probes and rolling deployments
  5. If costs are high, use spot instances with proper shutdown handling
  6. If starting out, use a managed Kubernetes service (EKS, GKE, AKS) instead of self-managed
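Rule 2's request batching can be sketched with a tiny micro-batcher that groups concurrent requests into a single model call. Illustrative only: `run_model` stands in for a real batched forward pass (one inference over stacked inputs), and a production server would use an async variant.

```python
import queue
import threading
import time

class MicroBatcher:
    """Group individual requests into batches before calling the model."""

    def __init__(self, run_model, max_batch=8, max_wait_s=0.01):
        self._run_model = run_model      # callable: list of inputs -> list of outputs
        self._max_batch = max_batch      # flush when this many requests queue up
        self._max_wait_s = max_wait_s    # or when the oldest request has waited this long
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, x):
        done = threading.Event()
        slot = {"input": x, "done": done}
        self._queue.put(slot)
        done.wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._max_wait_s
            while len(batch) < self._max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=timeout))
                except queue.Empty:
                    break
            # One model call for the whole batch, then fan results back out
            outputs = self._run_model([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()
```

The trade-off is a small bound on added latency (`max_wait_s`) in exchange for much better GPU utilization when requests arrive concurrently.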

Kubernetes adds operational complexity. Only use it if your deployment needs exceed what simpler solutions provide.

