Traditional ETL processes operate on batch schedules, identifying changes by comparing snapshots or timestamp columns. Change Data Capture (CDC) identifies and captures changes as they occur, enabling immediate propagation without batch delays. This article covers CDC architectures and implementation.
Understanding CDC
CDC identifies and captures changes made to database records, delivering them to downstream consumers. Changes include:
- Inserts: New records added to a table
- Updates: Modifications to existing records
- Deletes: Removal of records
- Schema changes: Alterations to table structure
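A captured change is typically delivered as an event envelope carrying the row state before and after the change. The sketch below shows a Debezium-style envelope as a Python dict; other CDC tools use similar shapes, though field names differ.

```python
# A Debezium-style change event envelope, shown as a Python dict.
# Field names follow Debezium's convention; other tools vary.
event = {
    "op": "u",                    # "c" = insert, "u" = update, "d" = delete
    "ts_ms": 1700000000000,       # when the change was captured (epoch millis)
    "before": {"id": 42, "email": "old@example.com"},  # row state before the change
    "after":  {"id": 42, "email": "new@example.com"},  # row state after the change
}

def describe(ev):
    """Map the op code onto the change types listed above."""
    return {"c": "insert", "u": "update", "d": "delete"}[ev["op"]]
```

Carrying both `before` and `after` images lets consumers compute diffs and handle deletes (where `after` is null) without querying the source.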
Why Traditional ETL Falls Short
Batch ETL has inherent limitations:
- Data arrives with delay equal to batch frequency
- Comparison mechanisms require full table scans or timestamp columns
- Processing entire datasets for incremental changes wastes resources
- Downstream systems operate on stale data between batches
CDC removes these delays by capturing each change as it is committed, rather than waiting for the next batch window.
CDC Implementation Approaches
Log-Based CDC
Reading database transaction logs provides complete change capture:
- MySQL binlog, PostgreSQL WAL, Oracle redo logs
- No modifications to source tables
- Captures every change, including updates and deletes that never touch a timestamp column
Technologies: Debezium, Maxwell, Oracle GoldenGate.
Trigger-Based CDC
Database triggers write each change into shadow (history) tables:
- Guaranteed capture of all changes
- Additional load on source database
- Requires schema modifications
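The trigger-based approach can be sketched end to end with SQLite, whose trigger syntax is close to the larger engines'. The table and column names here are illustrative; a production setup would also capture the transaction id and user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

-- Shadow table: one row per captured change.
CREATE TABLE customers_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT,
    id         INTEGER,
    email      TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_changes (op, id, email) VALUES ('I', NEW.id, NEW.email);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, email) VALUES ('U', NEW.id, NEW.email);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, email) VALUES ('D', OLD.id, OLD.email);
END;
""")

# Every write now leaves a trail in the shadow table.
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
ops = [row[0] for row in conn.execute(
    "SELECT op FROM customers_changes ORDER BY change_id")]
```

The guaranteed capture comes from the triggers firing inside the same transaction as the write, which is also exactly why they add load to the source database.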
Timestamp-Based CDC
Polling for changes using timestamp columns:
- Simple implementation
- Requires timestamp columns on all tables
- Misses deletes entirely, and updates that do not modify the timestamp column
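The polling loop reduces to a watermark comparison. A minimal sketch, assuming every source row carries an `updated_at` column (the hypothetical rows and field names below are for illustration):

```python
from datetime import datetime, timedelta

# Source rows, each with the updated_at column this approach requires.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"id": 1, "email": "a@example.com", "updated_at": now - timedelta(minutes=30)},
    {"id": 2, "email": "b@example.com", "updated_at": now - timedelta(minutes=3)},
]

def poll_changes(rows, watermark):
    """Return rows modified since the last poll, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

# Poll with a watermark from 5 minutes ago: only row 2 qualifies.
# A row deleted from the source would simply never appear again.
changed, watermark = poll_changes(rows, now - timedelta(minutes=5))
```

In SQL terms this is `SELECT ... WHERE updated_at > :watermark`, run on a schedule; the watermark must be persisted between polls.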
Comparison-Based CDC
Comparing snapshots to identify changes:
- Works without log access
- Resource-intensive for large tables
- Risk of missing intermediate changes
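Snapshot comparison amounts to a keyed set difference. A sketch over two in-memory snapshots keyed by primary key (the sample data is illustrative):

```python
def diff_snapshots(old, new):
    """Compare two keyed snapshots and classify each change."""
    changes = []
    for key in new.keys() - old.keys():
        changes.append(("insert", key, new[key]))
    for key in old.keys() - new.keys():
        changes.append(("delete", key, old[key]))
    for key in old.keys() & new.keys():
        if old[key] != new[key]:
            changes.append(("update", key, new[key]))
    return sorted(changes)

old = {1: "a@example.com", 2: "b@example.com"}
new = {2: "b2@example.com", 3: "c@example.com"}
# If row 2 changed twice between snapshots, only the net result is visible:
# that is the "missing intermediate changes" risk noted above.
changes = diff_snapshots(old, new)
```

On real tables the same comparison is usually done with checksums per row or per block to avoid shipping full snapshots.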
Architecture Patterns
Event Streaming Architecture
CDC events flow through streaming infrastructure:
[Source Database] -> [CDC Connector] -> [Kafka Topic] -> [Stream Processor] -> [Target Systems]
Benefits:
- Decoupling producers from consumers
- Multiple consumers from single source
- Replay capability for reprocessing
- Durability and fault tolerance
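The benefits above follow from the log abstraction itself. The toy in-memory "topic" below shows why: consumers hold independent offsets into an append-only log, so adding a consumer or replaying history never disturbs the producer. Real systems such as Kafka add partitioning, durability, and retention on top of this same model.

```python
class Topic:
    """Toy append-only log with offset-based reads."""
    def __init__(self):
        self.log = []

    def produce(self, event):
        self.log.append(event)

    def read(self, offset):
        """Return events from offset onward; re-reading replays them."""
        return self.log[offset:]

topic = Topic()
topic.produce({"op": "c", "id": 1})
topic.produce({"op": "u", "id": 1})

warehouse_view = topic.read(0)   # one consumer reads from the beginning
cache_view = topic.read(1)       # another starts later, independently
replayed = topic.read(0)         # replay: same events, same order
```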
Real-Time ETL
CDC feeding directly into transformation:
[Source] -> [CDC] -> [Kafka] -> [Flink/Spark] -> [Data Warehouse]
Transformations applied to streaming data before landing.
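The in-stream transformation step can be sketched as a generator over change events: a stand-in for the Flink or Spark job in the diagram, not their actual APIs. The filtering and normalization rules here are hypothetical examples.

```python
def transform(events):
    """Lightly transform each change event before landing:
    drop deletes, normalize email case."""
    for ev in events:
        if ev["op"] == "d":
            continue                      # deletes handled elsewhere
        row = dict(ev["after"])           # copy the post-change row image
        row["email"] = row["email"].lower()
        yield row

stream = [
    {"op": "c", "after": {"id": 1, "email": "A@Example.com"}},
    {"op": "d", "after": None, "before": {"id": 2, "email": "b@example.com"}},
]
landed = list(transform(stream))
```

Because the transform sees one event at a time, it applies continuously as changes arrive rather than once per batch.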
Audit and Replication
CDC for data replication and audit trails:
- Real-time replica to analytics database
- Complete audit trail of all changes
- Support for compliance requirements
Key Technologies
Debezium
Open-source CDC platform:
- MySQL, PostgreSQL, MongoDB, SQL Server support
- Kafka Connect integration
- Schema history management
- Filtering and transformation
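A Debezium connector is registered with Kafka Connect as a JSON configuration. The sketch below uses placeholder hostnames, credentials, and table names; property names follow the Debezium 2.x convention and vary between versions, so verify against the connector documentation for your release.

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.customers,inventory.orders"
  }
}
```

`table.include.list` provides the filtering mentioned above; transformations are attached separately as Kafka Connect single message transforms.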
AWS DMS
Managed CDC service:
- Supports various database sources
- Managed infrastructure
- Integration with AWS services
Oracle GoldenGate
Enterprise CDC solution:
- Comprehensive database support
- High performance
- Advanced conflict resolution
Implementation Considerations
Schema Evolution
CDC must handle schema changes:
- Schema registry for evolution management
- Backward compatibility strategies
- Migration procedures for breaking changes
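One common backward-compatibility strategy is a tolerant consumer: read known fields with defaults for columns added later, and ignore columns the consumer does not know about yet. A minimal sketch with hypothetical field names:

```python
# Fields this consumer knows, with defaults for late schema additions.
# "plan" was added to the source table after this consumer first shipped.
EXPECTED = {"id": None, "email": None, "plan": "free"}

def project(row):
    """Tolerant read: known fields only, defaults for missing ones."""
    return {field: row.get(field, default) for field, default in EXPECTED.items()}

old_row = {"id": 1, "email": "a@example.com"}                   # pre-migration event
new_row = {"id": 2, "email": "b@example.com",
           "plan": "pro", "region": "eu"}                       # post-migration event
```

This handles additive changes; renames and type changes are the breaking cases that need the migration procedures above.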
Data Consistency
Ensuring consistency across distributed systems:
- Transaction boundaries across databases
- Idempotent consumer processing
- Exactly-once semantics where required
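Idempotent processing is the usual practical route: under at-least-once delivery the broker may redeliver events, so the consumer tracks which it has already applied. A minimal sketch (a real implementation would persist the seen-set and state transactionally):

```python
class IdempotentConsumer:
    """Apply each change at most once by tracking processed event ids."""
    def __init__(self):
        self.seen = set()   # event ids already applied
        self.state = {}     # materialized key -> value

    def apply(self, event):
        if event["event_id"] in self.seen:
            return False                      # duplicate: skip, no side effects
        self.seen.add(event["event_id"])
        self.state[event["key"]] = event["value"]
        return True

c = IdempotentConsumer()
e = {"event_id": "tx-17", "key": 1, "value": "a@example.com"}
first = c.apply(e)
second = c.apply(e)   # same event redelivered by the broker
```

With this in place, at-least-once delivery plus idempotent apply gives effectively-once results, which is what "exactly-once semantics" usually reduces to in practice.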
Performance Impact
Log-based CDC typically has minimal impact on the source database:
- Reads the transaction log rather than querying tables
- Processes committed transactions in batches
- Parallel extraction for large volumes
Decision Rules
- If your data warehouse is more than 1 hour behind source systems, CDC reduces the lag.
- If you need to capture deletes (not just updates) from source tables, CDC with log-based capture is necessary.
- If your source database cannot handle trigger-based overhead, log-based CDC is the alternative.
- If you need to reconstruct historical states from change events, you need a change history store, not just CDC.