A data engineer at an e-commerce company stared at a mess of SQL scripts, Python notebooks, and configuration files. What started as a simple ETL job had mutated into a hydra of interdependent transformations that only she understood. The company’s acquisition meant integrating four new data sources by month’s end—a task requiring manually coding hundreds of new transformations.
She had become a single point of failure.
Why Pipeline Code Becomes Unmaintainable
The copy-paste catastrophe: Every new pipeline began by copying an existing one and modifying it. By the hundredth pipeline, the team had a maintenance nightmare: a bug fix in the original template had to be applied by hand to every copy.
The tribal knowledge trap: Pipeline logic lived in code, but reasoning lived in developers’ heads. Why did the customer transformation aggregate by email AND phone number? Because two years ago they discovered duplicate accounts. This knowledge was not documented.
The configuration explosion: Source tables, target schemas, transformation rules, quality checks—scattered across property files, environment variables, hard-coded constants, and inline comments.
The testing vacuum: How do you test a pipeline that’s mostly configuration? Most bugs were discovered in production when business users noticed incorrect numbers.
Declarative Design
Traditional imperative pipelines tell the system HOW to transform data step by step. Declarative pipelines describe WHAT the data should look like, letting the system figure out how to get there.
This mirrors other domain revolutions:
- Infrastructure as Code replaced manual server configuration
- SQL replaced procedural data manipulation
- React replaced jQuery DOM manipulation
- Kubernetes replaced deployment scripts
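To make the contrast concrete, here is a minimal sketch (with hypothetical field names): the imperative version spells out each step, while the declarative version is just a description that an engine can interpret, optimize, and document.

```python
# Imperative: the pipeline spells out HOW, step by step.
def imperative_ltv(orders):
    totals = {}
    for order in orders:
        if order["status"] != "cancelled":
            totals[order["customer_id"]] = (
                totals.get(order["customer_id"], 0) + order["total"]
            )
    return totals

# Declarative: the pipeline states WHAT; an engine decides how.
ltv_spec = {
    "inputs": ["orders"],
    "filter": "status != 'cancelled'",
    "group_by": ["customer_id"],
    "aggregations": {"lifetime_value": "SUM(total)"},
}

orders = [
    {"customer_id": 1, "total": 50.0, "status": "complete"},
    {"customer_id": 1, "total": 30.0, "status": "cancelled"},
    {"customer_id": 2, "total": 20.0, "status": "complete"},
]
print(imperative_ltv(orders))  # {1: 50.0, 2: 20.0}
```

The spec carries the same information as the function, but as data: it can be validated, diffed, and compiled to SQL for whatever engine runs underneath.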
Metadata-Driven Architecture
The redesign began with a powerful idea: if data has schemas, why don't transformations? Build a system where transformations are data, not code.
Source Metadata
Every data source is fully described:
```yaml
source:
  name: shopify_orders
  type: postgres
  connection: ${SHOPIFY_DB_URL}
  schema:
    orders:
      columns:
        - name: order_id
          type: bigint
          nullable: false
          description: "Unique order identifier"
        - name: customer_email
          type: varchar(255)
          description: "Customer email address"
          pii: true
        - name: order_total
          type: decimal(10,2)
        - name: created_at
          type: timestamp
      primary_key: [order_id]
  statistics:
    row_count: 45000000
    update_frequency: "streaming"
```
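A generation engine can only trust metadata it has validated. This sketch checks a parsed source description (shown here as the Python dict a YAML loader would produce; `REQUIRED_SOURCE_KEYS` and the field names are assumptions that mirror the example above):

```python
# Minimal validator for source metadata (a sketch; a real system would
# load the YAML with a library such as PyYAML and validate against a schema).
REQUIRED_SOURCE_KEYS = {"name", "type", "connection", "schema"}
REQUIRED_COLUMN_KEYS = {"name", "type"}

def validate_source(source: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata is usable."""
    errors = [f"missing source key: {k}"
              for k in sorted(REQUIRED_SOURCE_KEYS - source.keys())]
    for table, table_meta in source.get("schema", {}).items():
        for col in table_meta.get("columns", []):
            missing = REQUIRED_COLUMN_KEYS - col.keys()
            if missing:
                errors.append(f"{table}: column missing {sorted(missing)}")
    return errors

shopify_orders = {
    "name": "shopify_orders",
    "type": "postgres",
    "connection": "${SHOPIFY_DB_URL}",
    "schema": {
        "orders": {
            "columns": [
                {"name": "order_id", "type": "bigint", "nullable": False},
                {"name": "customer_email", "type": "varchar(255)", "pii": True},
            ],
            "primary_key": ["order_id"],
        }
    },
}
print(validate_source(shopify_orders))  # []
```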
Transformation Metadata
Transformations become declarative specifications:
```yaml
transformation:
  name: customer_lifetime_value
  description: "Calculate customer lifetime value metrics"
  inputs:
    - source: shopify_orders
      alias: orders
    - source: customer_profiles
      alias: customers
  output:
    schema: analytics
    table: customer_ltv
  logic:
    type: sql_select
    select:
      - expression: customers.customer_id
        alias: customer_id
      - expression: SUM(orders.order_total)
        alias: lifetime_value
      - expression: AVG(orders.order_total)
        alias: average_order_value
    from: customers
    joins:
      - type: left
        table: orders
        on: customers.email = orders.customer_email
    where:
      - orders.status != 'cancelled'
    group_by:
      - customers.customer_id
  quality_checks:
    - type: not_null
      columns: [customer_id]
    - type: unique
      columns: [customer_id]
    - type: range
      column: lifetime_value
      min: 0
      max: 1000000
```
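The `quality_checks` block is itself executable: each entry can be compiled into a probe query that counts violating rows, with zero meaning the check passes. A sketch of that compiler (the SQL shapes are assumptions, not a specific warehouse's dialect):

```python
# Turn declarative quality checks into SQL probe queries.
# Each query counts violating rows; a count of zero means the check passes.
def check_to_sql(table: str, check: dict) -> str:
    kind = check["type"]
    if kind == "not_null":
        cond = " OR ".join(f"{c} IS NULL" for c in check["columns"])
        return f"SELECT COUNT(*) FROM {table} WHERE {cond}"
    if kind == "unique":
        cols = ", ".join(check["columns"])
        return (f"SELECT COUNT(*) FROM (SELECT {cols} FROM {table} "
                f"GROUP BY {cols} HAVING COUNT(*) > 1) d")
    if kind == "range":
        col = check["column"]
        return (f"SELECT COUNT(*) FROM {table} "
                f"WHERE {col} < {check['min']} OR {col} > {check['max']}")
    raise ValueError(f"unknown check type: {kind}")

checks = [
    {"type": "not_null", "columns": ["customer_id"]},
    {"type": "range", "column": "lifetime_value", "min": 0, "max": 1000000},
]
for c in checks:
    print(check_to_sql("analytics.customer_ltv", c))
```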
Generation Engine
With metadata in place, the generation engine produces pipelines:
```python
class MetadataDrivenPipeline:
    def __init__(self, metadata_store):
        self.metadata_store = metadata_store
        self.sql_generator = SQLGenerator()
        self.dag_generator = DAGGenerator()
        self.test_generator = TestGenerator()

    def generate_pipeline(self, transformation_name):
        transform_meta = self.metadata_store.get_transformation(transformation_name)
        source_meta = [
            self.metadata_store.get_source(inp['source'])
            for inp in transform_meta['inputs']
        ]
        target_meta = self.metadata_store.get_target(transform_meta['output']['table'])

        sql = self.sql_generator.generate(transform_meta, source_meta, target_meta)
        dag = self.dag_generator.generate(
            transformation_name,
            dependencies=self.resolve_dependencies(transform_meta),
            schedule=transform_meta.get('schedule', '@daily'),
            sql=sql,
        )
        tests = self.test_generator.generate(transform_meta, target_meta)
        docs = self.generate_documentation(transform_meta, source_meta, target_meta)

        return {'sql': sql, 'dag': dag, 'tests': tests, 'docs': docs}
```
SQL Generation
```python
class SQLGenerator:
    def generate(self, transform_meta, source_meta, target_meta):
        select_clause = self.build_select_clause(transform_meta['logic']['select'])
        from_clause = self.build_from_clause(
            transform_meta['logic']['from'],
            transform_meta['logic'].get('joins', [])
        )
        where_clause = self.build_where_clause(
            transform_meta['logic'].get('where', [])
        )
        group_by_clause = self.build_group_by_clause(
            transform_meta['logic'].get('group_by', [])
        )
        hints = self.generate_query_hints(transform_meta, source_meta)

        output = transform_meta['output']
        sql = f"""
        {hints}
        CREATE OR REPLACE TABLE {output['schema']}.{output['table']} AS
        SELECT {select_clause}
        FROM {from_clause}
        {where_clause}
        {group_by_clause}
        """
        return sql
```
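The builder methods above are elided. To show the whole path from spec to SQL, here is a self-contained sketch that collapses them into one function, under simplifying assumptions (no query hints, no dialect handling, spec fields as in the transformation example above):

```python
def generate_sql(transform_meta: dict) -> str:
    """Render CREATE OR REPLACE TABLE ... AS SELECT from a transformation spec.
    Simplified sketch: no query hints, dialect quirks, or input validation."""
    logic = transform_meta["logic"]
    select = ",\n  ".join(
        f"{item['expression']} AS {item['alias']}" for item in logic["select"]
    )
    from_clause = logic["from"]
    for join in logic.get("joins", []):
        from_clause += f"\n{join['type'].upper()} JOIN {join['table']} ON {join['on']}"
    where = ""
    if logic.get("where"):
        where = "\nWHERE " + " AND ".join(logic["where"])
    group_by = ""
    if logic.get("group_by"):
        group_by = "\nGROUP BY " + ", ".join(logic["group_by"])
    out = transform_meta["output"]
    return (
        f"CREATE OR REPLACE TABLE {out['schema']}.{out['table']} AS\n"
        f"SELECT\n  {select}\nFROM {from_clause}{where}{group_by}"
    )

spec = {
    "output": {"schema": "analytics", "table": "customer_ltv"},
    "logic": {
        "from": "customers",
        "select": [
            {"expression": "customers.customer_id", "alias": "customer_id"},
            {"expression": "SUM(orders.order_total)", "alias": "lifetime_value"},
        ],
        "joins": [{"type": "left", "table": "orders",
                   "on": "customers.email = orders.customer_email"}],
        "where": ["orders.status != 'cancelled'"],
        "group_by": ["customers.customer_id"],
    },
}
print(generate_sql(spec))
```

Because the generator is a pure function of the metadata, the same spec can be re-rendered whenever a source schema or business rule changes.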
Schema Evolution
Schema changes that once required manual updates now happen automatically.
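The mechanism is a diff over metadata: compare the old and new source schemas and emit DDL for the safe, additive cases. A sketch (table and column names are illustrative; dropped or retyped columns would need a human review step rather than automatic handling):

```python
# Diff two column maps ({name: type}) and emit ALTER TABLE statements
# for additive changes only, the common safe case for automation.
def schema_diff(old_cols: dict, new_cols: dict, table: str) -> list[str]:
    statements = []
    for name, col_type in new_cols.items():
        if name not in old_cols:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {col_type}")
    return statements

old = {"order_id": "bigint", "order_total": "decimal(10,2)"}
new = {"order_id": "bigint", "order_total": "decimal(10,2)", "currency": "varchar(3)"}
print(schema_diff(old, new, "analytics.customer_ltv"))
# ['ALTER TABLE analytics.customer_ltv ADD COLUMN currency varchar(3)']
```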
Business Rules as Metadata
Business logic becomes reusable metadata:
```yaml
business_rules:
  - name: customer_segmentation
    parameters:
      - name: ltv_threshold_high
        type: decimal
        default: 1000
      - name: ltv_threshold_low
        type: decimal
        default: 100
    logic: |
      CASE
        WHEN lifetime_value >= ${ltv_threshold_high} THEN 'high_value'
        WHEN lifetime_value >= ${ltv_threshold_low} THEN 'medium_value'
        ELSE 'low_value'
      END

  - name: fraud_risk_score
    inputs:
      - order_count
      - days_since_first_order
    logic: |
      LEAST(1.0,
        (CASE WHEN order_count > 10 AND days_since_first_order < 7 THEN 0.5 ELSE 0 END)
      )
```
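Rendering such a rule is plain template substitution: the `${...}` placeholders in the rule's logic happen to match Python's `string.Template` syntax, so a sketch needs no custom parser (the dict layout here is an assumption mirroring the YAML above):

```python
from string import Template

# Expand a parameterized business rule into concrete SQL by substituting
# its parameter defaults into the ${...} placeholders.
rule = {
    "name": "customer_segmentation",
    "parameters": {"ltv_threshold_high": 1000, "ltv_threshold_low": 100},
    "logic": (
        "CASE WHEN lifetime_value >= ${ltv_threshold_high} THEN 'high_value' "
        "WHEN lifetime_value >= ${ltv_threshold_low} THEN 'medium_value' "
        "ELSE 'low_value' END"
    ),
}

def render_rule(rule: dict) -> str:
    return Template(rule["logic"]).substitute(rule["parameters"])

print(render_rule(rule))
```

Because the rule is data, the same logic can be stamped into every pipeline that needs customer segmentation, with per-pipeline parameter overrides.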
Quality Rules as Metadata
```yaml
quality_rules:
  - name: email_format
    type: regex
    pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

  - name: address_validation
    type: complex
    checks:
      - street_address IS NOT NULL
      - city IS NOT NULL
      - state_code IN (SELECT code FROM reference.state_codes)
```
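A regex rule can also be enforced in-process before rows reach the warehouse. A sketch of that path, reusing the `email_format` pattern above (the function and row shapes are illustrative):

```python
import re

# Enforce a regex quality rule in-process, before loading.
# The pattern mirrors the email_format rule above.
EMAIL_RULE = {
    "name": "email_format",
    "type": "regex",
    "pattern": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
}

def violations(rows: list[dict], column: str, rule: dict) -> list[dict]:
    """Return the rows whose column value fails the rule's regex."""
    pattern = re.compile(rule["pattern"])
    return [r for r in rows if not pattern.match(str(r.get(column, "")))]

rows = [
    {"customer_email": "a@example.com"},
    {"customer_email": "not-an-email"},
]
print(violations(rows, "customer_email", EMAIL_RULE))
# [{'customer_email': 'not-an-email'}]
```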
Decision Rules
Adopt metadata-driven ELT when:
- Pipeline count exceeds 50
- Multiple developers maintain overlapping pipelines
- Schema changes are frequent
- Documentation lags behind implementation
- Testing is manual or absent
Stick with imperative pipelines when:
- Pipeline count is under 20
- Business logic is stable
- Team is small and cohesive
- Documentation is not required
- Development speed is critical
The underlying principle: transformations should be data, not code. When transformation logic lives in metadata, you get automatic documentation, lineage, testing, and generation.
Start with one critical pipeline. Prove value before scaling.