Organizations often treat data quality as secondary—something to address after building pipelines and training models. This perspective misunderstands modern data systems. In a world where ML models make millions of automated decisions, where real-time analytics drive operations, and where data products impact customer experiences, data quality isn’t just important—it’s existential.
An insurance company’s fraud detection model began flagging every third claim as fraudulent when a corrupted data feed silently injected nulls where policy amounts should have been. A retail giant discovered their inventory optimization system had been making decisions based on duplicate transaction records for six months. A healthcare provider found patient risk scores skewed because lab results were recorded with inconsistent units. A financial services firm’s trading algorithms went haywire when market data feeds started dropping decimal places intermittently.
These aren’t edge cases. They represent the daily reality of working with data at scale.
The Hidden Cost of Bad Data
Quality is a continuous spectrum, not binary. Data isn’t “good” or “bad”—it exists on a continuum of fitness for specific purposes. A fraud detection model, for example, can tolerate some missing demographic data but requires complete and accurate transaction amounts.
Quality degrades over time. Even perfect data at ingestion deteriorates through transformations, joins, and aggregations. Each processing step introduces opportunities for quality degradation.
Quality has feedback loops. Poor data leads to poor model performance, which generates poor predictions, which when fed back create even worse data quality.
The Promise and Challenge of Automation
Manual review proved unsustainable: the volume of data made inspection impossible, and the velocity of changes meant damage was already done by the time humans detected issues.
Automation introduced new complexities: how do you automatically check quality for datasets you’ve never seen before? How do you define rules specific enough to catch real issues but general enough to avoid false positives?
The goal wasn’t to create a system that automatically determined what counted as “good” data, but to build infrastructure that continuously applied human-defined quality standards at scale.
Great Expectations
Great Expectations brought software engineering practices to data quality. What attracted teams wasn’t just capabilities but philosophy: data quality should be defined as code, versioned, tested, and deployed like any software artifact.
Expectations as First-Class Citizens: Quality rules became explicit, documented expectations. “The claim_amount column should never be null” became a versioned, reusable expectation.
Data Documentation as a Side Effect: Well-defined expectations naturally documented data contracts.
Probabilistic Thinking: Great Expectations allowed nuanced rules like “99.5% of claim amounts should fall between $100 and $100,000.”
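A probabilistic rule of this kind can be sketched in plain Python. The function below is illustrative, not actual Great Expectations API, but it captures the "mostly" semantics of that style of expectation:

```python
def expect_values_between(values, low, high, mostly=0.995):
    """Check that at least `mostly` of non-null values fall in [low, high]."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {"success": False, "observed_fraction": 0.0}
    in_range = sum(low <= v <= high for v in non_null)
    fraction = in_range / len(non_null)
    return {"success": fraction >= mostly, "observed_fraction": fraction}

# 1,000 claim amounts, five of them outliers: exactly 99.5% in range
claims = [500.0] * 995 + [5.0] * 5
result = expect_values_between(claims, 100, 100_000)
```

The `mostly` parameter is what distinguishes this from a hard rule: a handful of outliers degrade the observed fraction without immediately failing the check.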
The insurance company’s first implementation focused on the fraud detection pipeline:
- Claim amounts must be positive numbers
- Policy numbers must match active policies
- Claim dates must be within policy coverage periods
- Customer IDs must exist in the customer database
- No duplicate claim IDs within a 30-day window
Defining these rules revealed complex edge cases. Weekend claims sometimes arrived Monday with backdated timestamps. Certain policy types allowed $0 claims. Legitimate customers had multiple IDs due to system migrations.
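The rule set above can be expressed as a small suite of callable checks. This is a pure-Python sketch of the expectations-as-code idea rather than Great Expectations syntax; the field names and lookup structures are hypothetical:

```python
from datetime import date

def validate_claim(claim, policies, customers, recent_claim_ids):
    """Apply the fraud-pipeline rules to one claim; return the failed rules.

    `policies` maps policy number -> (coverage_start, coverage_end);
    `recent_claim_ids` holds claim IDs seen in the last 30 days.
    """
    failures = []
    # Edge case from production: some policy types allow $0 claims, so a
    # real rule would condition this check on policy type.
    if not isinstance(claim["amount"], (int, float)) or claim["amount"] <= 0:
        failures.append("amount_must_be_positive")
    coverage = policies.get(claim["policy_number"])
    if coverage is None:
        failures.append("policy_not_active")
    elif not (coverage[0] <= claim["claim_date"] <= coverage[1]):
        failures.append("claim_outside_coverage")
    if claim["customer_id"] not in customers:
        failures.append("unknown_customer")
    if claim["claim_id"] in recent_claim_ids:
        failures.append("duplicate_claim_id")
    return failures

policies = {"P-100": (date(2024, 1, 1), date(2024, 12, 31))}
good = {"claim_id": "C-1", "policy_number": "P-100", "customer_id": "CU-7",
        "amount": 1250.0, "claim_date": date(2024, 6, 15)}
```

Returning a list of named failures rather than a single boolean matters downstream: it lets alerting distinguish a structural problem (unknown customer) from a business-logic one (claim outside coverage).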
The Soda Alternative
While Great Expectations served batch processing needs well, Soda offered different advantages for streaming data and SQL-heavy transformations.
Where Great Expectations felt like a programming framework, Soda felt like a query engine. Quality checks expressed in SQL-like syntax were more approachable for data analysts. Soda integrated naturally with dbt transformations, allowing quality checks to be embedded in transformation pipelines.
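For flavor, a SodaCL check file for a hypothetical `claims` table might look like the following; the table and column names are assumptions, and thresholds are illustrative:

```yaml
checks for claims:
  - row_count > 0
  - missing_count(claim_amount) = 0
  - duplicate_count(claim_id) = 0
  - invalid_percent(claim_amount) < 0.5%:
      valid min: 100
      valid max: 100000
```

The declarative style is the point: an analyst who can read SQL can read, review, and extend these checks without touching a Python framework.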
Teams often used both tools: Great Expectations for validating raw data ingestion and complex Python transformations, Soda for SQL-based quality checks and warehouse-resident validation.
Building Quality Gates That Scale
First attempts often fail: running all quality checks on all data all the time hits scalability walls. Quality checks took longer than the actual data processing, and a flood of alerts meant real issues got lost in the noise.
Risk-Based Validation: Not all data deserved equal scrutiny. Payment data feeding fraud models received comprehensive validation. Reference data updated quarterly got lighter checks.
Progressive Validation: Quality checks staged throughout the pipeline rather than front-loaded. Basic structural validation at ingestion. Business logic validation after transformations. Statistical validation before model training.
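Progressive validation can be sketched as an ordered list of stages, each pairing a pipeline point with the checks run there. Stage names and checks below are illustrative:

```python
# Later stages assume earlier ones passed, so expensive business-logic and
# statistical checks never see structurally malformed records.
STAGES = [
    ("ingestion", [lambda rec: isinstance(rec.get("amount"), (int, float))]),
    ("post_transform", [lambda rec: rec["amount"] > 0]),
    ("pre_training", [lambda rec: rec["amount"] < 1_000_000]),
]

def first_failing_stage(record):
    """Return the first stage whose checks fail, or None if all pass."""
    for stage_name, checks in STAGES:
        if not all(check(record) for check in checks):
            return stage_name
    return None
```

Reporting *which* stage failed is useful on its own: a spike of `ingestion` failures points at a broken feed, while `pre_training` failures suggest drift or transformation bugs.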
Smart Sampling: For high-volume streams, intelligent sampling validated statistical properties of batches. Anomaly detection identified batches requiring deeper inspection.
Circuit Breaker Patterns: When quality checks consistently failed, circuit breakers prevented system overload. Data sources were temporarily quarantined while alerts triggered human investigation.
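A minimal circuit breaker for a data source tracks consecutive quality failures and quarantines the source past a threshold. The threshold value here is illustrative:

```python
class QualityCircuitBreaker:
    """Quarantine a data source after too many consecutive quality failures."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.quarantined = False

    def record_result(self, passed):
        """Record one batch's quality result; return True if quarantined."""
        if passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Deliberately sticky: once open, the breaker stays open
                # until a human investigates and resets it.
                self.quarantined = True
        return self.quarantined
```

Making the quarantine sticky rather than self-resetting is the key design choice: a source that repeatedly trips the breaker needs investigation, not silent retries.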
Schema Evolution
Initial quality gates assumed stable schemas—a naive assumption in dynamic business environments. When claims systems added new fields for telemedicine visits, quality gates rejected all new data as invalid.
They developed patterns for managing schema evolution:
Versioned Expectations: Quality rules versioned alongside schema versions. New expectation suites created while maintaining old ones for historical data.
Backward Compatibility Windows: New fields optional during transition periods. Warnings about missing new fields but no failures. After migration windows closed, warnings became errors.
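The warning-then-error transition can be sketched as a severity that depends on whether the migration window is still open. The field name and dates are hypothetical:

```python
from datetime import date

def check_new_field(record, field, migration_deadline, today):
    """Missing new field: a warning during the window, an error afterwards."""
    if field in record:
        return "ok"
    return "warning" if today <= migration_deadline else "error"

deadline = date(2024, 9, 30)
claim = {"claim_id": "C-1"}  # telemedicine field not yet populated
```

Encoding the deadline in the rule itself, rather than in an engineer's calendar, is what makes the hardening from warning to error actually happen.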
Schema Registries: Central registry tracked all versions of all datasets. Quality gates consulted the registry to apply appropriate validation rules.
Gradual Rollouts: Schema changes rolled out progressively. New versions validated in parallel with old versions before switching over.
Handling Data Drift
Static quality rules couldn’t handle dynamic business data. Customer behavior changed seasonally. Product mixes evolved. Quality gates needed to distinguish between legitimate drift and quality issues.
Baseline Learning: Quality gates learned normal patterns from historical data. Statistical models captured typical distributions, correlations, and temporal patterns.
Adaptive Thresholds: Instead of hard-coded limits, thresholds adapted based on recent history. If average claim amounts gradually increased due to inflation, quality gates adjusted rather than triggering false alarms.
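An adaptive threshold can be derived from a rolling window of recent values rather than a fixed constant, for example flagging values more than k standard deviations from the recent mean. Window size and k below are illustrative:

```python
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    """Flag values far from the rolling mean of the last `window` observations."""

    def __init__(self, window=100, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, value):
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        else:
            anomalous = False  # not enough history to judge yet
        self.history.append(value)
        return anomalous

gate = AdaptiveThreshold(window=50, k=3.0)
for amount in range(1000, 1100):  # gradual inflation-like drift
    assert not gate.is_anomalous(float(amount))
```

Because the window slides, a gradual increase in average claim amounts shifts the threshold along with it, while a sudden spike still lands far outside the recent distribution.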
Seasonal Awareness: Quality gates incorporated temporal context to avoid false positives during expected variation periods.
Change Point Detection: Algorithms identified when data patterns shifted significantly, triggering alerts for human review.
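One simple change-point signal is a one-sided CUSUM statistic over deviations from a reference mean: when the cumulative drift exceeds a threshold, the gate flags a shift for human review. The reference mean, slack, and threshold here are illustrative:

```python
def cusum_change_points(values, target_mean, slack=0.5, threshold=5.0):
    """Return indices where a one-sided CUSUM statistic crosses `threshold`."""
    s_hi = s_lo = 0.0
    change_points = []
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target_mean - slack))  # upward drift
        s_lo = max(0.0, s_lo + (target_mean - x - slack))  # downward drift
        if s_hi > threshold or s_lo > threshold:
            change_points.append(i)
            s_hi = s_lo = 0.0  # reset after signalling
    return change_points

# Stable around 10.0 for 20 points, then the mean jumps to 13.0
data = [10.0] * 20 + [13.0] * 10
points = cusum_change_points(data, target_mean=10.0)
```

The `slack` term is what separates this from a naive running sum: small fluctuations around the reference mean decay back to zero instead of accumulating into false alarms.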
The Human Element
Sophisticated automation couldn’t replace human judgment. The challenge was creating systems that combined automated detection with human expertise.
Building Quality Culture
Quality Champions: Each team designated champions who understood both technical tools and business context.
Quality Reviews: Regular reviews brought together data producers, consumers, and platform teams to review metrics and evolve standards.
Incident Post-Mortems: Quality failures treated as learning opportunities. Blameless post-mortems identified root causes and systemic improvements.
Quality Metrics in Performance Reviews: Individual and team performance metrics included data quality components.
The Feedback Loop Challenge
Detection alone wasn’t enough; lasting improvement required closing the loop between the teams producing data and the teams consuming it.
Producer Scorecards: Data producers received regular reports on quality metrics, highlighting trends and improvement opportunities.
Consumer Feedback Channels: Data consumers could easily report quality issues discovered during analysis.
Automated Root Cause Analysis: When quality gates failed, automated systems traced issues back through the pipeline.
Real-World Patterns
Success Patterns
Start with High-Value, High-Risk Data: Focus quality efforts on data feeding critical decisions. The fraud detection pipeline’s improvement immediately demonstrated value.
Incremental Automation: Progressive automation—manual processes documented, then scripted, then automated.
Context-Aware Validation: Quality rules that understood business context caught real issues while minimizing false positives. A claim amount of $1 million was suspicious for auto insurance but normal for commercial property.
Proactive Monitoring: The best quality gates prevented issues rather than just detecting them.
Anti-Patterns to Avoid
One-Size-Fits-All Quality: Applying same standards to all data regardless of use case created unnecessary overhead.
Alert Fatigue: Too many low-value alerts trained teams to ignore quality warnings.
Technology-First Thinking: Starting with tools rather than understanding requirements led to implementations that were sophisticated but irrelevant.
Perfection Paralysis: Imperfect automation, improved iteratively, beat perfect plans that were never executed.
Measuring Success
Technical Metrics
Detection Rate: Percentage of known quality issues automated gates caught. Synthetic bad data tested detection capabilities.
False Positive Rate: How often quality gates flagged good data as bad. High false positive rates eroded trust.
Processing Overhead: Performance impact of quality checks. Continuous optimization minimized latency and resource consumption.
Coverage: Percentage of data flows with quality gates. Tracked growth over time and identified protection gaps.
Business Metrics
Incident Reduction: Dramatic reduction in data-related incidents. Mean time between failures tracked.
Decision Accuracy: Better data quality improved model performance and business decisions.
Operational Efficiency: Automated quality gates reduced manual investigation time.
Scaling Across the Organization
Platform Thinking
Self-Service Quality: Teams could define and deploy quality gates without platform team involvement.
Reusable Components: Common quality checks packaged as reusable components. Checking referential integrity, validating date formats, detecting outliers became drop-in modules.
Central Monitoring: Unified dashboard showed quality metrics across all pipelines.
Shared Learning: Quality rules discovered by one team shared across organization.
Federation Model
Central Standards, Local Implementation: Platform team defined standards and provided tools. Individual teams implemented within those standards.
Community of Practice: Regular meetings shared learnings and evolved standards.
Center of Excellence: Small expert team provided consultation and training.
Future Directions
ML-Powered Quality Detection
Anomaly Detection Models: Unsupervised learning identified unusual patterns rule-based systems missed.
Automated Rule Generation: ML systems suggested new quality rules based on observed data patterns.
Real-Time Quality Loops
Streaming Quality Gates: Quality checks ran continuously on streaming data. Issues detected within seconds rather than hours.
Dynamic Remediation: Some quality issues automatically corrected in real-time.
Quality as Code Evolution
Quality Testing: Quality rules themselves tested using synthetic data and mutation testing.
Quality Contracts: Formal contracts between data producers and consumers specified quality expectations.
Decision Framework
Implement automated quality gates when:
- Data feeds ML models that make automated decisions
- Data volume makes manual inspection impossible
- Data velocity means issues cause damage before humans can react
- Multiple data sources create compounding quality risks
Choose Great Expectations when:
- Validating raw data ingestion from diverse sources
- Complex Python-based transformations need validation
- Team has engineering capacity for framework adoption
- Expectations need to be versioned and tested like code
Choose Soda when:
- Data resides in data warehouse and SQL-based checks are sufficient
- Team is more comfortable with SQL than Python APIs
- Integration with dbt transformations is important
- Self-service by analysts is priority over engineering flexibility
Implement risk-based validation when:
- Not all data has equal downstream impact
- Resource constraints prevent comprehensive checking of everything
- Some data sources have proven reliability while others are risky
- Different teams own different data with different quality standards
Use adaptive thresholds when:
- Data patterns legitimately change over time
- Seasonal variations create expected fluctuation
- Business changes (new products, markets) alter normal ranges
- Static thresholds create unacceptable false positive rates
Build feedback loops when:
- Data producers need visibility into how their data is used
- Root cause analysis takes too long without automated tracing
- Quality issues repeat because underlying causes aren’t addressed
- SLA compliance needs to be measured and enforced