Organizations often treat data quality as secondary—something to address after building pipelines and training models. This perspective misunderstands modern data systems. In a world where ML models make millions of automated decisions, where real-time analytics drive operations, and where data products impact customer experiences, data quality isn’t just important—it’s existential.
An insurance company’s fraud detection model began flagging every third claim as fraudulent when a corrupted data feed silently injected nulls where policy amounts should have been. A retail giant discovered their inventory optimization system had been making decisions based on duplicate transaction records for six months. A healthcare provider found patient risk scores skewed because lab results were recorded with inconsistent units. A financial services firm’s trading algorithms went haywire when market data feeds started dropping decimal places intermittently.
These aren’t edge cases. They represent the daily reality of working with data at scale.
The Hidden Cost of Bad Data
Quality is a continuous spectrum, not binary. Data isn’t “good” or “bad”—it exists on a continuum of fitness for specific purposes. A fraud detection model, for example, can tolerate some missing demographic data but requires complete and accurate transaction amounts.
Quality degrades over time. Even perfect data at ingestion deteriorates through transformations, joins, and aggregations. Each processing step introduces opportunities for quality degradation.
Quality has feedback loops. Poor data leads to poor model performance, which generates poor predictions, which when fed back create even worse data quality.
The Promise and Challenge of Automation
Manual review proved unsustainable: the volume of data made inspection impossible, and the velocity of changes meant damage was already done by the time humans detected issues.
Automation introduced new complexities: how do you automatically check quality for datasets you’ve never seen before? How do you define rules specific enough to catch real issues but general enough to avoid false positives?
The goal wasn’t to create a system that automatically determined what counted as “good” data, but to build infrastructure that continuously applied human-defined quality standards at scale.
Great Expectations
Great Expectations brought software engineering practices to data quality. What attracted teams wasn’t just capabilities but philosophy: data quality should be defined as code, versioned, tested, and deployed like any software artifact.
Expectations as First-Class Citizens: Quality rules became explicit, documented expectations. “The claim_amount column should never be null” became a versioned, reusable expectation.
Data Documentation as a Side Effect: Well-defined expectations naturally documented data contracts.
Probabilistic Thinking: Great Expectations allowed nuanced rules like “99.5% of claim amounts should fall between $100 and $100,000.”
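A probabilistic rule of this kind can be sketched in plain Python. The function below is illustrative, not actual Great Expectations API, but it captures the "mostly" semantics of that style of expectation:

```python
def expect_values_between(values, low, high, mostly=0.995):
    """Check that at least `mostly` of non-null values fall in [low, high]."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {"success": False, "observed_fraction": 0.0}
    in_range = sum(low <= v <= high for v in non_null)
    fraction = in_range / len(non_null)
    return {"success": fraction >= mostly, "observed_fraction": fraction}

# 1,000 claim amounts, five of them outliers: exactly 99.5% in range
claims = [500.0] * 995 + [5.0] * 5
result = expect_values_between(claims, 100, 100_000)
```

The `mostly` parameter is what distinguishes this from a hard rule: a handful of outliers degrade the observed fraction without immediately failing the check.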
The insurance company’s first implementation focused on the fraud detection pipeline:
- Claim amounts must be positive numbers
- Policy numbers must match active policies
- Claim dates must be within policy coverage periods
- Customer IDs must exist in the customer database
- No duplicate claim IDs within a 30-day window
Defining these rules revealed complex edge cases. Weekend claims sometimes arrived Monday with backdated timestamps. Certain policy types allowed $0 claims. Legitimate customers had multiple IDs due to system migrations.
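The rule set above can be expressed as a small suite of callable checks. This is a pure-Python sketch of the expectations-as-code idea rather than Great Expectations syntax; the field names and lookup structures are hypothetical:

```python
from datetime import date

def validate_claim(claim, policies, customers, recent_claim_ids):
    """Apply the fraud-pipeline rules to one claim; return the failed rules.

    `policies` maps policy number -> (coverage_start, coverage_end);
    `recent_claim_ids` holds claim IDs seen in the last 30 days.
    """
    failures = []
    # Edge case from production: some policy types allow $0 claims, so a
    # real rule would condition this check on policy type.
    if not isinstance(claim["amount"], (int, float)) or claim["amount"] <= 0:
        failures.append("amount_must_be_positive")
    coverage = policies.get(claim["policy_number"])
    if coverage is None:
        failures.append("policy_not_active")
    elif not (coverage[0] <= claim["claim_date"] <= coverage[1]):
        failures.append("claim_outside_coverage")
    if claim["customer_id"] not in customers:
        failures.append("unknown_customer")
    if claim["claim_id"] in recent_claim_ids:
        failures.append("duplicate_claim_id")
    return failures

policies = {"P-100": (date(2024, 1, 1), date(2024, 12, 31))}
good = {"claim_id": "C-1", "policy_number": "P-100", "customer_id": "CU-7",
        "amount": 1250.0, "claim_date": date(2024, 6, 15)}
```

Returning a list of named failures rather than a single boolean matters downstream: it lets alerting distinguish a structural problem (unknown customer) from a business-logic one (claim outside coverage).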
The Soda Alternative
While Great Expectations served batch processing needs well, Soda offered different advantages for streaming data and SQL-heavy transformations.
Where Great Expectations felt like a programming framework, Soda felt like a query engine. Quality checks expressed in SQL-like syntax were more approachable for data analysts. Soda integrated naturally with dbt transformations, allowing quality checks to be embedded in transformation pipelines.
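For flavor, a SodaCL check file for a hypothetical `claims` table might look like the following; the table and column names are assumptions, and thresholds are illustrative:

```yaml
checks for claims:
  - row_count > 0
  - missing_count(claim_amount) = 0
  - duplicate_count(claim_id) = 0
  - invalid_percent(claim_amount) < 0.5%:
      valid min: 100
      valid max: 100000
```

The declarative style is the point: an analyst who can read SQL can read, review, and extend these checks without touching a Python framework.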
Teams often used both tools: Great Expectations for validating raw data ingestion and complex Python transformations, Soda for SQL-based quality checks and warehouse-resident validation.
Building Quality Gates That Scale
First attempts often fail: running all quality checks on all data all the time hits scalability walls. Quality checks took longer than the actual data processing, and a flood of alerts meant real issues got lost in the noise.
Risk-Based Validation: Not all data deserved equal scrutiny. Payment data feeding fraud models received comprehensive validation. Reference data updated quarterly got lighter checks.
Progressive Validation: Quality checks staged throughout the pipeline rather than front-loaded. Basic structural validation at ingestion. Business logic validation after transformations. Statistical validation before model training.
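Progressive validation can be sketched as an ordered list of stages, each pairing a pipeline point with the checks run there. Stage names and checks below are illustrative:

```python
# Later stages assume earlier ones passed, so expensive business-logic and
# statistical checks never see structurally malformed records.
STAGES = [
    ("ingestion", [lambda rec: isinstance(rec.get("amount"), (int, float))]),
    ("post_transform", [lambda rec: rec["amount"] > 0]),
    ("pre_training", [lambda rec: rec["amount"] < 1_000_000]),
]

def first_failing_stage(record):
    """Return the first stage whose checks fail, or None if all pass."""
    for stage_name, checks in STAGES:
        if not all(check(record) for check in checks):
            return stage_name
    return None
```

Reporting *which* stage failed is useful on its own: a spike of `ingestion` failures points at a broken feed, while `pre_training` failures suggest drift or transformation bugs.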
Smart Sampling: For high-volume streams, intelligent sampling validated statistical properties of batches. Anomaly detection identified batches requiring deeper inspection.
Circuit Breaker Patterns: When quality checks consistently failed, circuit breakers prevented system overload. Data sources were temporarily quarantined while alerts triggered human investigation.
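A minimal circuit breaker for a data source tracks consecutive quality failures and quarantines the source past a threshold. The threshold value here is illustrative:

```python
class QualityCircuitBreaker:
    """Quarantine a data source after too many consecutive quality failures."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.quarantined = False

    def record_result(self, passed):
        """Record one batch's quality result; return True if quarantined."""
        if passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Deliberately sticky: once open, the breaker stays open
                # until a human investigates and resets it.
                self.quarantined = True
        return self.quarantined
```

Making the quarantine sticky rather than self-resetting is the key design choice: a source that repeatedly trips the breaker needs investigation, not silent retries.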
Schema Evolution
Initial quality gates assumed stable schemas—a naive assumption in dynamic business environments. When claims systems added new fields for telemedicine visits, quality gates rejected all new data as invalid.
They developed patterns for managing schema evolution:
Versioned Expectations: Quality rules versioned alongside schema versions. New expectation suites created while maintaining old ones for historical data.
Backward Compatibility Windows: New fields optional during transition periods. Warnings about missing new fields but no failures. After migration windows closed, warnings became errors.
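The warning-then-error transition can be sketched as a severity that depends on whether the migration window is still open. The field name and dates are hypothetical:

```python
from datetime import date

def check_new_field(record, field, migration_deadline, today):
    """Missing new field: a warning during the window, an error afterwards."""
    if field in record:
        return "ok"
    return "warning" if today <= migration_deadline else "error"

deadline = date(2024, 9, 30)
claim = {"claim_id": "C-1"}  # telemedicine field not yet populated
```

Encoding the deadline in the rule itself, rather than in an engineer's calendar, is what makes the hardening from warning to error actually happen.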
Schema Registries: Central registry tracked all versions of all datasets. Quality gates consulted the registry to apply appropriate validation rules.
Gradual Rollouts: Schema changes rolled out progressively. New versions validated in parallel with old versions before switching over.
Handling Data Drift
Static quality rules couldn’t handle dynamic business data. Customer behavior changed seasonally. Product mixes evolved. Quality gates needed to distinguish between legitimate drift and quality issues.
Baseline Learning: Quality gates learned normal patterns from historical data. Statistical models captured typical distributions, correlations, and temporal patterns.
Adaptive Thresholds: Instead of hard-coded limits, thresholds adapted based on recent history. If average claim amounts gradually increased due to inflation, quality gates adjusted rather than triggering false alarms.
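An adaptive threshold can be derived from a rolling window of recent values rather than a fixed constant, for example flagging values more than k standard deviations from the recent mean. Window size and k below are illustrative:

```python
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    """Flag values far from the rolling mean of the last `window` observations."""

    def __init__(self, window=100, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, value):
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        else:
            anomalous = False  # not enough history to judge yet
        self.history.append(value)
        return anomalous

gate = AdaptiveThreshold(window=50, k=3.0)
for amount in range(1000, 1100):  # gradual inflation-like drift
    assert not gate.is_anomalous(float(amount))
```

Because the window slides, a gradual increase in average claim amounts shifts the threshold along with it, while a sudden spike still lands far outside the recent distribution.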
Seasonal Awareness: Quality gates incorporated temporal context to avoid false positives during expected variation periods.
Change Point Detection: Algorithms identified when data patterns shifted significantly, triggering alerts for human review.
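One simple change-point signal is a one-sided CUSUM statistic over deviations from a reference mean: when the cumulative drift exceeds a threshold, the gate flags a shift for human review. The reference mean, slack, and threshold here are illustrative:

```python
def cusum_change_points(values, target_mean, slack=0.5, threshold=5.0):
    """Return indices where a one-sided CUSUM statistic crosses `threshold`."""
    s_hi = s_lo = 0.0
    change_points = []
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target_mean - slack))  # upward drift
        s_lo = max(0.0, s_lo + (target_mean - x - slack))  # downward drift
        if s_hi > threshold or s_lo > threshold:
            change_points.append(i)
            s_hi = s_lo = 0.0  # reset after signalling
    return change_points

# Stable around 10.0 for 20 points, then the mean jumps to 13.0
data = [10.0] * 20 + [13.0] * 10
points = cusum_change_points(data, target_mean=10.0)
```

The `slack` term is what separates this from a naive running sum: small fluctuations around the reference mean decay back to zero instead of accumulating into false alarms.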
The Human Element
Sophisticated automation couldn’t replace human judgment. The challenge was creating systems that combined automated detection with human expertise.
Building Quality Culture
Quality Champions: Each team designated champions who understood both technical tools and business context.
Quality Reviews: Regular reviews brought together data producers, consumers, and platform teams to review metrics and evolve standards.
Incident Post-Mortems: Quality failures treated as learning opportunities. Blameless post-mortems identified root causes and systemic improvements.
Quality Metrics in Performance Reviews: Individual and team performance metrics included data quality components.
The Feedback Loop Challenge
Detection alone wasn’t enough; lasting improvement required closing the loop between the teams producing data and the teams consuming it.
Producer Scorecards: Data producers received regular reports on quality metrics, highlighting trends and improvement opportunities.
Consumer Feedback Channels: Data consumers could easily report quality issues discovered during analysis.
Automated Root Cause Analysis: When quality gates failed, automated systems traced issues back through the pipeline.
Real-World Patterns
Success Patterns
Start with High-Value, High-Risk Data: Focus quality efforts on data feeding critical decisions. The fraud detection pipeline’s improvement immediately demonstrated value.
Incremental Automation: Progressive automation—manual processes documented, then scripted, then automated.
Context-Aware Validation: Quality rules that understood business context caught real issues while minimizing false positives. A claim amount of $1 million was suspicious for auto insurance but normal for commercial property.
Proactive Monitoring: The best quality gates prevented issues rather than just detecting them.
Anti-Patterns to Avoid
One-Size-Fits-All Quality: Applying same standards to all data regardless of use case created unnecessary overhead.
Alert Fatigue: Too many low-value alerts trained teams to ignore quality warnings.
Technology-First Thinking: Starting with tools rather than understanding requirements led to implementations that were sophisticated but irrelevant.
Perfection Paralysis: Imperfect automation, improved iteratively, beat perfect plans that were never executed.
Measuring Success
Technical Metrics
Detection Rate: Percentage of known quality issues automated gates caught. Synthetic bad data tested detection capabilities.
False Positive Rate: How often quality gates flagged good data as bad. High false positive rates eroded trust.
Processing Overhead: Performance impact of quality checks. Continuous optimization minimized latency and resource consumption.
Coverage: Percentage of data flows with quality gates. Tracked growth over time and identified protection gaps.
Business Metrics
Incident Reduction: Dramatic reduction in data-related incidents. Mean time between failures tracked.
Decision Accuracy: Better data quality improved model performance and business decisions.
Operational Efficiency: Automated quality gates reduced manual investigation time.
Scaling Across the Organization
Platform Thinking
Self-Service Quality: Teams could define and deploy quality gates without platform team involvement.
Reusable Components: Common quality checks packaged as reusable components. Checking referential integrity, validating date formats, detecting outliers became drop-in modules.
Central Monitoring: Unified dashboard showed quality metrics across all pipelines.
Shared Learning: Quality rules discovered by one team shared across organization.
Federation Model
Central Standards, Local Implementation: Platform team defined standards and provided tools. Individual teams implemented within those standards.
Community of Practice: Regular meetings shared learnings and evolved standards.
Center of Excellence: Small expert team provided consultation and training.
Future Directions
ML-Powered Quality Detection
Anomaly Detection Models: Unsupervised learning identified unusual patterns rule-based systems missed.
Automated Rule Generation: ML systems suggested new quality rules based on observed data patterns.
Real-Time Quality Loops
Streaming Quality Gates: Quality checks ran continuously on streaming data. Issues detected within seconds rather than hours.
Dynamic Remediation: Some quality issues automatically corrected in real-time.
Quality as Code Evolution
Quality Testing: Quality rules themselves tested using synthetic data and mutation testing.
Quality Contracts: Formal contracts between data producers and consumers specified quality expectations.
Decision Framework
Implement automated quality gates when:
- Data feeds ML models that make automated decisions
- Data volume makes manual inspection impossible
- Data velocity means issues cause damage before humans can react
- Multiple data sources create compounding quality risks
Choose Great Expectations when:
- Validating raw data ingestion from diverse sources
- Complex Python-based transformations need validation
- Team has engineering capacity for framework adoption
- Expectations need to be versioned and tested like code
Choose Soda when:
- Data resides in data warehouse and SQL-based checks are sufficient
- Team is more comfortable with SQL than Python APIs
- Integration with dbt transformations is important
- Self-service by analysts is priority over engineering flexibility
Implement risk-based validation when:
- Not all data has equal downstream impact
- Resource constraints prevent comprehensive checking of everything
- Some data sources have proven reliability while others are risky
- Different teams own different data with different quality standards
Use adaptive thresholds when:
- Data patterns legitimately change over time
- Seasonal variations create expected fluctuation
- Business changes (new products, markets) alter normal ranges
- Static thresholds create unacceptable false positive rates
Build feedback loops when:
- Data producers need visibility into how their data is used
- Root cause analysis takes too long without automated tracing
- Quality issues repeat because underlying causes aren’t addressed
- SLA compliance needs to be measured and enforced