Time series forecasting requires specialized pipeline architecture. Unlike standard batch processing, time series work demands strict chronological ordering, historical context, time-based feature engineering, and walk-forward validation. This article covers architecture and best practices.
Unique Requirements
Time series pipelines differ from standard data pipelines:
- Temporal ordering: Data must be processed in strict chronological order
- Historical context: Models require extensive historical data for pattern recognition
- Feature engineering complexity: Time-based features like lags, windows, and seasonality
- Retraining cadence: Regular model updates as new data arrives
- Time-based validation: Walk-forward validation instead of random splitting
- Regular reforecasting: Predictions updated as the time horizon shifts
Pipeline Architecture
[Data Sources] -> [Ingestion] -> [Storage] -> [Feature Engineering] -> [Training] -> [Forecasting] -> [Serving]
                       ↓                               ↑                    ↓              ↓               ↓
                  [Cleaning]                   [Feature Store]          [Registry]    [Monitoring]  [Visualization]
Data Ingestion and Collection
Time series data arrives via:
- Continuous streaming: Real-time data points
- Periodic batches: Scheduled updates
- Event-triggered: Updates based on specific events
- Hybrid: Combining streaming and batch
Data Storage
Time series storage requirements:
- Time-based partitioning: Organizing data by time intervals
- Compression: Efficient storage for high-volume series
- Retention policies: Automated archiving of older data
- Backfilling capabilities: Handling late-arriving data
Technologies: InfluxDB, TimescaleDB, Prometheus for time-series databases; Parquet with time partitioning for data lakes.
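Time-based partitioning is easiest to see as a path convention. Below is a minimal sketch of a Hive-style partition layout, as commonly used with Parquet data lakes; the `lake` root and `series=` key are illustrative choices, not a required convention:

```python
from datetime import datetime, timezone

def partition_path(series_id: str, ts: datetime, root: str = "lake") -> str:
    """Map a data point to a time-partitioned storage path (year/month/day).

    Hypothetical layout; real deployments would use Parquet partition
    columns or a time-series database's native partitioning instead.
    """
    return (f"{root}/series={series_id}/year={ts.year}"
            f"/month={ts.month:02d}/day={ts.day:02d}/data.parquet")

ts = datetime(2024, 3, 7, 15, 30, tzinfo=timezone.utc)
path = partition_path("energy_load", ts)
# lake/series=energy_load/year=2024/month=03/day=07/data.parquet
```

Partitioning by day (or hour, for high-frequency data) lets queries prune irrelevant intervals and makes retention policies a matter of dropping whole partitions.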
Feature Engineering
Time series-specific features:
- Temporal features: Hour, day, month, day-of-week, cyclical encoding
- Lag features: Previous values (t-1, t-2, ..., t-n), moving averages
- Seasonal features: Seasonal indicators, Fourier terms, holiday flags
- External variables: Weather, economic indicators
# Create lag features (assumes a pandas DataFrame `data` with 'value' and 'hour' columns)
import numpy as np

for lag in [1, 7, 14, 28]:
    data[f'lag_{lag}'] = data['value'].shift(lag)

# Cyclical encoding of hour-of-day
data['hour_sin'] = np.sin(2 * np.pi * data['hour'] / 24)
data['hour_cos'] = np.cos(2 * np.pi * data['hour'] / 24)
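The snippet above covers lags and a single cyclical encoding; Fourier terms generalize the latter to multiple harmonics of a seasonal cycle. A minimal sketch, with an illustrative function name and column labels:

```python
import numpy as np
import pandas as pd

def fourier_terms(t: np.ndarray, period: float, order: int) -> pd.DataFrame:
    """Sine/cosine pairs for a seasonal cycle of length `period` (same units as t).

    Each harmonic k adds one sin/cos pair, capturing progressively
    sharper seasonal shapes than a single sin/cos encoding.
    """
    cols = {}
    for k in range(1, order + 1):
        cols[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        cols[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols)

# Weekly seasonality on a daily integer time index, two harmonics
terms = fourier_terms(np.arange(28), period=7, order=2)
```

Two or three harmonics are usually enough for smooth weekly or yearly cycles; higher orders risk overfitting.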
Model Training Patterns
Sliding Window Training
Train on fixed windows, slide as new data arrives:
- Fixed window of historical data
- Window slides forward maintaining consistent size
Expanding Window Training
Start with initial window, add new data while keeping all history:
- Training set grows over time
- More data available for recent patterns
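Both windowing schemes can be expressed as one small split generator. This is an illustrative sketch, not a library API:

```python
def window_splits(n, train_size, step, expanding=False):
    """Yield (train_indices, test_indices) over n chronologically ordered points.

    Sliding (default): a fixed-size train window moves forward by `step`.
    Expanding: the train window start stays pinned at 0 while its end advances.
    """
    start, end = 0, train_size
    while end + step <= n:
        yield list(range(start, end)), list(range(end, end + step))
        end += step
        if not expanding:
            start += step  # keep the train window a constant size

splits = list(window_splits(10, train_size=4, step=2))
# sliding: ([0..3], [4,5]), ([2..5], [6,7]), ([4..7], [8,9])
```

With `expanding=True` the training set grows each step, trading higher training cost for more history per fit.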
Walk-Forward Validation
Train on t0 to t1 and validate on t1 to t2; then retrain on t0 to t2 and validate on t2 to t3; repeat as the forecast horizon advances.
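The scheme above can be sketched with a naive last-value forecaster standing in for any retrainable model (the forecaster choice is purely illustrative):

```python
import numpy as np

def walk_forward_mae(y, initial_train, horizon):
    """Walk-forward validation: train on y[:t], score the next `horizon` points,
    advance t by `horizon`, and repeat until the series is exhausted.

    "Retraining" here is just taking the last observed value; a real
    pipeline would refit its model on y[:t] at each step.
    """
    errors = []
    t = initial_train
    while t + horizon <= len(y):
        forecast = np.full(horizon, y[t - 1])           # naive last-value forecast
        actual = y[t:t + horizon]
        errors.append(float(np.mean(np.abs(actual - forecast))))
        t += horizon                                     # expand the training window
    return errors

y = np.arange(12, dtype=float)                           # perfectly trending series
fold_mae = walk_forward_mae(y, initial_train=6, horizon=2)
```

Each fold retrains on all data up to the fold boundary, mirroring the expanding t0-to-tN pattern; per-fold errors make accuracy degradation over time visible rather than averaged away.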
Orchestration and Scheduling
Time series pipelines require:
- Time-based scheduling: Regular retraining and forecasting
- Dependency management: External features available before training
- Backfilling capabilities: Recreating forecasts for historical periods
- Retraining triggers: Data-driven or time-based triggers
Tools: Apache Airflow, Prefect, Dagster.
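The backfilling requirement reduces to computing which scheduled runs were missed and re-executing them in order; orchestrators such as Airflow implement this as catch-up scheduling. A minimal stdlib sketch of that logic:

```python
from datetime import datetime, timedelta

def due_runs(last_run, now, cadence=timedelta(days=1)):
    """Return the scheduled run times strictly after last_run and up to now.

    More than one entry means the pipeline fell behind and must backfill
    the missed retraining/forecast intervals in chronological order.
    """
    runs = []
    t = last_run + cadence
    while t <= now:
        runs.append(t)
        t += cadence
    return runs

missed = due_runs(datetime(2024, 3, 1), datetime(2024, 3, 4, 12))
# three daily runs to backfill: Mar 2, Mar 3, Mar 4
```

Running backfills in chronological order matters for time series: each retraining run must see only the data that existed at its scheduled time, or historical forecasts become irreproducible.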
Monitoring
Key Metrics
- Forecast accuracy: MAPE, RMSE, MAE with time decay
- Data health: Freshness, missingness patterns, drift detection
- Operational: Pipeline latency, retraining frequency, serving performance
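Accuracy with time decay can be computed by down-weighting older errors. A sketch assuming exponential decay with a configurable half-life (the function name and default are illustrative):

```python
import numpy as np

def time_decayed_mae(actual, forecast, half_life=7):
    """MAE where recent errors count more; weights halve every `half_life` steps.

    The newest observation (last index) gets weight 1, so a fresh accuracy
    drop moves this metric faster than a plain MAE would.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    age = np.arange(len(actual))[::-1]           # 0 for the newest point
    weights = 0.5 ** (age / half_life)
    errors = np.abs(actual - forecast)
    return float(np.sum(weights * errors) / np.sum(weights))
```

The same weighting applies to MAPE or RMSE; tracking the decayed and undecayed values side by side separates recent degradation from long-run drift.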
Decision Rules
- If your forecast accuracy degrades over time without detection, your monitoring lacks forecast-specific metrics.
- If retraining takes more than 1 hour, your feature computation pipeline needs optimization.
- If you cannot reproduce historical forecasts for the same timestamps, your pipeline lacks reproducibility.
- If you forecast more than 10,000 time series, distributed training infrastructure becomes necessary.