Deploying ML Models on Kubernetes: Best Practices
ML models in production need orchestration, scaling, and monitoring infrastructure. Kubernetes provides these capabilities, though the learning curve is steep and operational complexity is significant.
This article covers practical approaches to ML model deployment on Kubernetes, from containerization to advanced orchestration patterns.
Why Kubernetes for ML Model Deployment?
Kubernetes offers specific advantages for ML deployments:
- Scalability: Automatically scale model serving based on traffic patterns
- Resource Efficiency: Optimize GPU/CPU allocation across multiple models
- Reproducibility: Consistent environments from development to production
- High Availability: Robust failover and self-healing capabilities
- Workflow Integration: Integration with CI/CD and MLOps pipelines
Core Components for ML on Kubernetes
1. Containerized Model Serving
Packaging models for deployment:
```dockerfile
# Example: Dockerfile for model serving with TensorFlow Serving
FROM tensorflow/serving:2.11.0

# Copy model artifacts
COPY ./saved_model /models/my_model/1

# Model name configuration
ENV MODEL_NAME=my_model

# Port configuration
EXPOSE 8500 8501

# Start serving (shell form so ${MODEL_NAME} is expanded;
# exec-form CMD does not perform environment substitution)
CMD tensorflow_model_server \
    --model_name=${MODEL_NAME} \
    --model_base_path=/models/${MODEL_NAME} \
    --rest_api_port=8501 \
    --port=8500
```
For PyTorch models, a custom serving solution might use FastAPI:
```python
# model_server.py
import os
import time

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

model_path = os.environ.get("MODEL_PATH", "/models/model.pt")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device)
model.eval()

class PredictionRequest(BaseModel):
    inputs: list

class PredictionResponse(BaseModel):
    predictions: list
    model_version: str
    prediction_time: float

@app.get("/health")
async def health():
    # Lightweight endpoint for Kubernetes readiness/liveness probes
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        start_time = time.time()
        input_tensor = torch.tensor(request.inputs, dtype=torch.float32).to(device)
        with torch.no_grad():
            outputs = model(input_tensor)
        predictions = outputs.cpu().numpy().tolist()
        return PredictionResponse(
            predictions=predictions,
            model_version=os.environ.get("MODEL_VERSION", "unknown"),
            prediction_time=time.time() - start_time,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction error: {e}")
```
2. Kubernetes Deployment Manifests
Basic deployment configuration:
```yaml
# Example: Kubernetes deployment for model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-model
  labels:
    app: fraud-detection
    component: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detection
      component: model-serving
  template:
    metadata:
      labels:
        app: fraud-detection
        component: model-serving
    spec:
      containers:
      - name: model-server
        image: acr.io/company/fraud-detection:v1.2.3
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "1"
            memory: "2Gi"
        ports:
        - containerPort: 8501
          name: http
        readinessProbe:
          httpGet:
            path: /health
            port: 8501
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8501
          initialDelaySeconds: 60
          periodSeconds: 15
```
3. Model Storage and Versioning
Options for managing model artifacts:
- Container-Based: package models inside the container image
  - Pros: simplicity; versioning with container tags
  - Cons: large container sizes; tight coupling of model and code
- Volume-Based: store models on persistent volumes
  - Pros: separation of models from code; easier updates
  - Cons: additional complexity for volume management
- Cloud Storage Integration: pull models from S3, GCS, etc.
  - Pros: clean separation; flexible versioning
  - Cons: potential startup latency; additional authentication requirements
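For the cloud-storage option, a common pattern is to resolve the artifact path from the model name and version at pod startup and download it before the server loads the model. A minimal sketch, assuming a `models/<name>/<version>/model.pt` key convention in S3 (the key layout and the `download_model` helper are illustrative, not a standard API):

```python
def model_artifact_key(model_name: str, version: str) -> str:
    # Assumed key convention for this sketch: models/<name>/<version>/model.pt
    return f"models/{model_name}/{version}/model.pt"

def download_model(bucket: str, model_name: str, version: str,
                   dest: str = "/models/model.pt") -> str:
    # Requires boto3 and valid AWS credentials; imported lazily so the
    # pure key logic above can be exercised without AWS access.
    import boto3
    s3 = boto3.client("s3")
    s3.download_file(bucket, model_artifact_key(model_name, version), dest)
    return dest
```

Running this in an initContainer keeps the serving container image model-agnostic: only the environment variables change between model versions.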
Advanced Deployment Patterns
1. Auto-scaling for Variable Workloads
Horizontal Pod Autoscaler configuration:
```yaml
# Example: HPA based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detection-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
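The HPA sizes the deployment with a documented rule: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A quick sketch of that arithmetic for the configuration above:

```python
import math

def desired_replicas(current: int, current_util: float,
                     target_util: float, min_r: int = 2, max_r: int = 10) -> int:
    # HPA scaling rule: desired = ceil(current * currentMetric / targetMetric),
    # clamped to [minReplicas, maxReplicas]
    raw = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, raw))

print(desired_replicas(3, 90, 70))  # 3 pods at 90% CPU vs a 70% target -> 4
print(desired_replicas(2, 20, 70))  # scale-down is clamped at minReplicas -> 2
```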
2. Canary Deployments and A/B Testing
Implementing progressive rollouts:
```yaml
# Example: Model canary deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-v2
  labels:
    app: fraud-detection
    version: v2
spec:
  replicas: 1
  # selector and pod template omitted for brevity
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model-route
spec:
  hosts:
  - fraud-model-service
  http:
  - route:
    - destination:
        host: fraud-model-service
        subset: v1
      weight: 90
    - destination:
        host: fraud-model-service
        subset: v2
      weight: 10
```
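The `v1` and `v2` subsets referenced in the VirtualService must be defined separately; in Istio that is done with a DestinationRule mapping each subset to pod labels. A sketch assuming the two deployments carry `version: v1` and `version: v2` labels (the resource name is illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: fraud-model-destination
spec:
  host: fraud-model-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

Shifting traffic then means editing only the VirtualService weights, which can be automated by tools like Flagger or Argo Rollouts.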
3. Multi-Model Serving
Efficiently hosting multiple models:
```yaml
# Example: Multi-model server configuration (NVIDIA Triton)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-server
spec:
  template:
    spec:
      containers:
      - name: model-server
        image: nvcr.io/nvidia/tritonserver:22.01-py3
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # metrics
```
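Triton discovers models from a repository directory laid out as `<repo>/<model_name>/<version>/<artifact>`, with a `config.pbtxt` per model. A sketch that stages a placeholder layout (the model name and empty artifact are illustrative, not real model files):

```python
import tempfile
from pathlib import Path

def make_model_repository(root: str, models: dict) -> None:
    """Create the <root>/<name>/<version>/<artifact> layout Triton scans,
    plus a minimal config.pbtxt stub per model (placeholders only)."""
    for name, (version, artifact) in models.items():
        version_dir = Path(root) / name / str(version)
        version_dir.mkdir(parents=True, exist_ok=True)
        (version_dir / artifact).write_bytes(b"")  # placeholder artifact
        (Path(root) / name / "config.pbtxt").write_text(f'name: "{name}"\n')

repo = tempfile.mkdtemp()
make_model_repository(repo, {"fraud_detection": (1, "model.onnx")})
```

In a cluster, this repository would typically live on a persistent volume or object store mounted at `/models`.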
Resource Optimization
GPU Sharing and Allocation
```yaml
# Example: GPU sharing via CUDA MPS for multiple pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-shared-model
spec:
  template:
    spec:
      containers:
      - name: model-server
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0"
        - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
          value: "30"
```
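MPS is one sharing mechanism; the NVIDIA device plugin also supports time-slicing, configured through a ConfigMap the plugin reads at startup. A sketch of that config, assuming the plugin is deployed to consume it (the ConfigMap name and replica count are illustrative):

```yaml
# Example: advertise each physical GPU as 4 schedulable replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Time-slicing provides no memory isolation between pods sharing a GPU, so it suits trusted, small models rather than arbitrary workloads.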
Monitoring and Observability
Model-Specific Metrics
```python
# Example: FastAPI metrics endpoint (extends the app from model_server.py)
import prometheus_client
from prometheus_fastapi_instrumentator import Instrumentator

prediction_latency = prometheus_client.Histogram(
    "prediction_latency_seconds",
    "Time spent processing prediction",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    labelnames=["model_version", "model_name"],
)

prediction_counter = prometheus_client.Counter(
    "prediction_requests_total",
    "Total number of prediction requests",
    labelnames=["model_version", "model_name", "status"],
)

# Auto-instrument request metrics and expose them at /metrics
Instrumentator().instrument(app).expose(app)
```
Decision Rules
Use this checklist for Kubernetes ML deployment decisions:
- If latency is critical, profile the model serving layer before scaling
- If GPU utilization is low, consider batching requests or model multiplexing
- If deployment times are slow, use pre-built container images with models baked in
- If you need zero-downtime updates, implement readiness probes and rolling deployments
- If costs are high, use spot instances with proper shutdown handling
- If starting out, use a managed Kubernetes service (EKS, GKE, AKS) instead of self-managed
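For the low-GPU-utilization case, request batching can be sketched as a small queue that flushes either when the batch fills or when a deadline passes (the `MicroBatcher` class below is illustrative, not part of any serving framework):

```python
import queue
import time

class MicroBatcher:
    """Collects individual requests and flushes them as one batch,
    either when the batch is full or when max_wait_s elapses."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn      # runs the model once on a whole batch
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._q = queue.Queue()

    def submit(self, item):
        # In a real server each caller would block on a per-item future;
        # here we just enqueue and let a worker drain the queue.
        self._q.put(item)

    def drain_once(self):
        # Pull up to max_batch items, waiting at most max_wait_s total,
        # then invoke the model a single time on the collected batch.
        deadline = time.monotonic() + self.max_wait_s
        batch = []
        while len(batch) < self.max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(self._q.get(timeout=timeout))
            except queue.Empty:
                break
        return self.batch_fn(batch) if batch else []
```

One GPU forward pass over eight requests is usually far cheaper than eight separate passes, at the cost of up to `max_wait_s` of added latency per request.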
Kubernetes adds operational complexity. Only use it if your deployment needs exceed what simpler solutions provide.