Deploying ML Models on Kubernetes: Best Practices
ML models in production need orchestration, scaling, and monitoring infrastructure. Kubernetes provides these capabilities, though the learning curve is steep and operational complexity is significant.
This article covers practical approaches to ML model deployment on Kubernetes, from containerization to advanced orchestration patterns.
Why Kubernetes for ML Model Deployment?
Kubernetes offers specific advantages for ML deployments:
- Scalability: Automatically scale model serving based on traffic patterns
- Resource Efficiency: Optimize GPU/CPU allocation across multiple models
- Reproducibility: Consistent environments from development to production
- High Availability: Robust failover and self-healing capabilities
- Workflow Integration: Integration with CI/CD and MLOps pipelines
Core Components for ML on Kubernetes
1. Containerized Model Serving
Packaging models for deployment:
```dockerfile
# Example: Dockerfile for model serving with TensorFlow Serving
FROM tensorflow/serving:2.11.0

# Copy model artifacts
COPY ./saved_model /models/my_model/1

# Model name configuration
ENV MODEL_NAME=my_model

# Port configuration
EXPOSE 8500 8501

# Start serving (shell form so ${MODEL_NAME} is expanded;
# exec-form CMD does not perform environment substitution)
CMD tensorflow_model_server \
    --model_name=${MODEL_NAME} \
    --model_base_path=/models/${MODEL_NAME} \
    --rest_api_port=8501 \
    --port=8500
```
For PyTorch models, a custom serving solution might use FastAPI:
```python
# model_server.py
import os
import time

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

model_path = os.environ.get("MODEL_PATH", "/models/model.pt")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device)
model.eval()

class PredictionRequest(BaseModel):
    inputs: list

class PredictionResponse(BaseModel):
    predictions: list
    model_version: str
    prediction_time: float

@app.get("/health")
async def health():
    # Lightweight endpoint for Kubernetes readiness/liveness probes
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        start_time = time.time()
        input_tensor = torch.tensor(request.inputs, dtype=torch.float32).to(device)
        with torch.no_grad():
            outputs = model(input_tensor)
        predictions = outputs.cpu().numpy().tolist()
        return PredictionResponse(
            predictions=predictions,
            model_version=os.environ.get("MODEL_VERSION", "unknown"),
            prediction_time=time.time() - start_time,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction error: {e}")
```
2. Kubernetes Deployment Manifests
Basic deployment configuration:
```yaml
# Example: Kubernetes deployment for model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-model
  labels:
    app: fraud-detection
    component: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detection
      component: model-serving
  template:
    metadata:
      labels:
        app: fraud-detection
        component: model-serving
    spec:
      containers:
      - name: model-server
        image: acr.io/company/fraud-detection:v1.2.3
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "1"
            memory: "2Gi"
        ports:
        - containerPort: 8501
          name: http
        readinessProbe:
          httpGet:
            path: /health
            port: 8501
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8501
          initialDelaySeconds: 60
          periodSeconds: 15
```
3. Model Storage and Versioning
Options for managing model artifacts:
- Container-Based: package models inside the container image
  - Pros: simplicity; versioning with container tags
  - Cons: large container sizes; tight coupling of model and code
- Volume-Based: store models on persistent volumes
  - Pros: separation of models from code; easier updates
  - Cons: additional complexity for volume management
- Cloud Storage Integration: pull models from S3, GCS, etc.
  - Pros: clean separation; flexible versioning
  - Cons: potential startup latency; additional authentication requirements
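For the cloud-storage option, a common pattern is to resolve the artifact path from the model name and version at pod startup and download it before the server loads the model. A minimal sketch, assuming a `models/<name>/<version>/model.pt` key convention in S3 (the key layout and the `download_model` helper are illustrative, not a standard API):

```python
def model_artifact_key(model_name: str, version: str) -> str:
    # Assumed key convention for this sketch: models/<name>/<version>/model.pt
    return f"models/{model_name}/{version}/model.pt"

def download_model(bucket: str, model_name: str, version: str,
                   dest: str = "/models/model.pt") -> str:
    # Requires boto3 and valid AWS credentials; imported lazily so the
    # pure key logic above can be exercised without AWS access.
    import boto3
    s3 = boto3.client("s3")
    s3.download_file(bucket, model_artifact_key(model_name, version), dest)
    return dest
```

Running this in an initContainer keeps the serving container image model-agnostic: only the environment variables change between model versions.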
Advanced Deployment Patterns
1. Auto-scaling for Variable Workloads
Horizontal Pod Autoscaler configuration:
```yaml
# Example: HPA based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detection-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
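The HPA sizes the deployment with a documented rule: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A quick sketch of that arithmetic for the configuration above:

```python
import math

def desired_replicas(current: int, current_util: float,
                     target_util: float, min_r: int = 2, max_r: int = 10) -> int:
    # HPA scaling rule: desired = ceil(current * currentMetric / targetMetric),
    # clamped to [minReplicas, maxReplicas]
    raw = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, raw))

print(desired_replicas(3, 90, 70))  # 3 pods at 90% CPU vs a 70% target -> 4
print(desired_replicas(2, 20, 70))  # scale-down is clamped at minReplicas -> 2
```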
2. Canary Deployments and A/B Testing
Implementing progressive rollouts:
```yaml
# Example: Model canary deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-v2
  labels:
    app: fraud-detection
    version: v2
spec:
  replicas: 1
  # selector and pod template omitted for brevity
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model-route
spec:
  hosts:
  - fraud-model-service
  http:
  - route:
    - destination:
        host: fraud-model-service
        subset: v1
      weight: 90
    - destination:
        host: fraud-model-service
        subset: v2
      weight: 10
```
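The `v1` and `v2` subsets referenced in the VirtualService must be defined separately; in Istio that is done with a DestinationRule mapping each subset to pod labels. A sketch assuming the two deployments carry `version: v1` and `version: v2` labels (the resource name is illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: fraud-model-destination
spec:
  host: fraud-model-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

Shifting traffic then means editing only the VirtualService weights, which can be automated by tools like Flagger or Argo Rollouts.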
3. Multi-Model Serving
Efficiently hosting multiple models:
```yaml
# Example: Multi-model server configuration (NVIDIA Triton)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-server
spec:
  template:
    spec:
      containers:
      - name: model-server
        image: nvcr.io/nvidia/tritonserver:22.01-py3
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # metrics
```
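Triton discovers models from a repository directory laid out as `<repo>/<model_name>/<version>/<artifact>`, with a `config.pbtxt` per model. A sketch that stages a placeholder layout (the model name and empty artifact are illustrative, not real model files):

```python
import tempfile
from pathlib import Path

def make_model_repository(root: str, models: dict) -> None:
    """Create the <root>/<name>/<version>/<artifact> layout Triton scans,
    plus a minimal config.pbtxt stub per model (placeholders only)."""
    for name, (version, artifact) in models.items():
        version_dir = Path(root) / name / str(version)
        version_dir.mkdir(parents=True, exist_ok=True)
        (version_dir / artifact).write_bytes(b"")  # placeholder artifact
        (Path(root) / name / "config.pbtxt").write_text(f'name: "{name}"\n')

repo = tempfile.mkdtemp()
make_model_repository(repo, {"fraud_detection": (1, "model.onnx")})
```

In a cluster, this repository would typically live on a persistent volume or object store mounted at `/models`.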
Resource Optimization
GPU Sharing and Allocation
```yaml
# Example: GPU sharing via CUDA MPS for multiple pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-shared-model
spec:
  template:
    spec:
      containers:
      - name: model-server
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0"
        - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
          value: "30"
```
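MPS is one sharing mechanism; the NVIDIA device plugin also supports time-slicing, configured through a ConfigMap the plugin reads at startup. A sketch of that config, assuming the plugin is deployed to consume it (the ConfigMap name and replica count are illustrative):

```yaml
# Example: advertise each physical GPU as 4 schedulable replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Time-slicing provides no memory isolation between pods sharing a GPU, so it suits trusted, small models rather than arbitrary workloads.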
Monitoring and Observability
Model-Specific Metrics
```python
# Example: FastAPI metrics endpoint (extends the app from model_server.py)
import prometheus_client
from prometheus_fastapi_instrumentator import Instrumentator

prediction_latency = prometheus_client.Histogram(
    "prediction_latency_seconds",
    "Time spent processing prediction",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    labelnames=["model_version", "model_name"],
)

prediction_counter = prometheus_client.Counter(
    "prediction_requests_total",
    "Total number of prediction requests",
    labelnames=["model_version", "model_name", "status"],
)

# Auto-instrument request metrics and expose them at /metrics
Instrumentator().instrument(app).expose(app)
```
Decision Rules
Use this checklist for Kubernetes ML deployment decisions:
- If latency is critical, profile the model serving layer before scaling
- If GPU utilization is low, consider batching requests or model multiplexing
- If deployment times are slow, use pre-built container images with models baked in
- If you need zero-downtime updates, implement readiness probes and rolling deployments
- If costs are high, use spot instances with proper shutdown handling
- If starting out, use a managed Kubernetes service (EKS, GKE, AKS) instead of self-managed
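For the low-GPU-utilization case, request batching can be sketched as a small queue that flushes either when the batch fills or when a deadline passes (the `MicroBatcher` class below is illustrative, not part of any serving framework):

```python
import queue
import time

class MicroBatcher:
    """Collects individual requests and flushes them as one batch,
    either when the batch is full or when max_wait_s elapses."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn      # runs the model once on a whole batch
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._q = queue.Queue()

    def submit(self, item):
        # In a real server each caller would block on a per-item future;
        # here we just enqueue and let a worker drain the queue.
        self._q.put(item)

    def drain_once(self):
        # Pull up to max_batch items, waiting at most max_wait_s total,
        # then invoke the model a single time on the collected batch.
        deadline = time.monotonic() + self.max_wait_s
        batch = []
        while len(batch) < self.max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(self._q.get(timeout=timeout))
            except queue.Empty:
                break
        return self.batch_fn(batch) if batch else []
```

One GPU forward pass over eight requests is usually far cheaper than eight separate passes, at the cost of up to `max_wait_s` of added latency per request.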
Kubernetes adds operational complexity. Only use it if your deployment needs exceed what simpler solutions provide.