Data governance fails for predictable reasons. Organizations run quarterly committee meetings while their data infrastructure changes daily. They document schemas manually while automated systems generate new data assets faster than humans can track them. They build centralized taxonomies while distributed teams create undocumented shadow schemas in production.
The mismatch between governance cadence and data velocity isn’t a people problem. It’s an architecture problem. Modern metadata platforms solve it by automating discovery, maintaining live lineage, and treating governance as infrastructure rather than documentation.
The Problem with Traditional Governance
Traditional governance assumes humans can track what exists. They cannot. A financial services firm had 47 definitions of “customer lifetime value” across departments. A healthcare network could not trace which patient records fed their quality metrics. Data scientists at a retail chain spent 60% of their time finding data before analysis began.
Four mismatches cause these failures:
Cadence mismatch: Governance processes operate on quarterly cycles. Data systems change hourly. Policies documented last month describe systems that no longer exist.
Scale mismatch: Manual documentation cannot track exponential data growth. When every microservice and pipeline creates new assets, human-scale governance collapses.
Complexity mismatch: Point-to-point integrations create n-squared relationship complexity. Traditional lineage tracking becomes computationally intractable as systems interconnect.
Incentive mismatch: Data teams treat governance as overhead that slows delivery. They document because auditors require it, not because the documentation has value.
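The complexity mismatch above is easy to quantify. A quick sketch, with illustrative system counts, of why pairwise integrations explode while a hub-and-spoke metadata platform grows linearly:

```python
# Point-to-point integrations between n systems grow quadratically,
# while a central metadata hub needs only one connector per system.

def point_to_point(n: int) -> int:
    """Distinct pairwise links needed to connect every pair of systems."""
    return n * (n - 1) // 2

def hub_and_spoke(n: int) -> int:
    """Connectors needed when every system reports to one metadata hub."""
    return n

for n in (10, 50, 200):
    print(n, point_to_point(n), hub_and_spoke(n))
# At 200 systems: 19,900 pairwise links vs. 200 connectors.
```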
Metadata Platforms as Infrastructure
Modern metadata platforms treat governance as infrastructure rather than documentation. They crawl data systems automatically, extract metadata continuously, and maintain living records that evolve with the systems they describe.
OpenMetadata and DataHub are the two most mature open-source options. Both solve the automation problem. They differ in philosophy and architecture.
OpenMetadata
OpenMetadata was built by engineers behind Uber’s internal metadata platform. Its design philosophy emphasizes open standards and accessibility.
Architecture
OpenMetadata uses a connector-based architecture.
Connectors ingest from common sources: Snowflake, PostgreSQL, Kafka, Tableau. A typical installation discovers thousands of tables and columns within days.
Discovery Capabilities
OpenMetadata’s automated discovery surfaces assets humans miss. After installation, one manufacturing company found:
- 3,847 tables across 89 schemas
- 45,000+ columns with inferred data types
- 234 views with complex dependencies
- 1,200+ stored procedures
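What automated discovery does under the hood is crawl each system’s own catalog. A minimal sketch of the idea using stdlib `sqlite3` as a stand-in source (a real connector would query Snowflake’s or PostgreSQL’s information schema the same way):

```python
import sqlite3

def discover_schema(conn: sqlite3.Connection) -> dict[str, list[tuple[str, str]]]:
    """Crawl the database's own catalog: table -> [(column, declared type)]."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [(c[1], c[2]) for c in cols]
    return catalog

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
print(discover_schema(conn))
```

Because the source catalog is queried rather than documented by hand, the inventory stays current no matter how fast schemas change.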
ML-powered classification identifies sensitive data automatically. It recognizes PII patterns, financial data classifications, and intellectual property markers. That same company discovered customer data in 147 tables across systems they did not know contained PII.
The catalog captures:
Technical metadata: Data types, constraints, relationships, distributions, null patterns, cardinality.
Business metadata: Descriptions, tags, ownership, retention policies, regulatory classifications.
Usage metadata: Query frequency, access patterns, most-used columns, unused assets.
Quality metadata: Automated profiling flags high-null columns, suspicious distributions, inconsistent schemas.
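The quality profiling described above reduces to simple statistics computed per column. A minimal sketch of the kind of profile an automated profiler produces:

```python
from collections import Counter

def profile_column(values: list) -> dict:
    """Basic quality profile: null rate, cardinality, most common value."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    return {
        "null_rate": nulls / total if total else 0.0,
        "cardinality": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }

print(profile_column(["US", "US", None, "DE", "US", None]))
# null_rate 2/6, cardinality 2, most_common ("US", 3)
```

A profiler runs this over sampled rows on a schedule; a jump in `null_rate` or a collapse in `cardinality` is what triggers the quality flags.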
Self-Service Discovery
OpenMetadata’s search interface handles natural language. A query for “customer order history” returns not just exact matches but related datasets identified through usage patterns and foreign key relationships.
Discovery time drops from days to minutes. Analysts find valuable datasets they did not know existed.
Collaborative Features
The platform shifts governance from central committees to distributed teams:
Crowd-sourced documentation: Users annotate datasets they actually use. The best documentation comes from practitioners, not from stewards who have never queried the data.
In-context discussion: Questions attach to datasets. Answers come from owners who understand the data, not from documentation written for auditors.
Emergent folksonomies: Teams create tags like “golden-source,” “derived-metric,” “deprecated-use-v2” organically. These prove more useful than rigid taxonomies designed by committee.
Change subscriptions: Stakeholders subscribe to datasets and receive alerts on schema changes, quality issues, documentation updates.
The governance team’s role shifts from documentation police to enablement partner.
DataHub
DataHub came from LinkedIn, where it manages metadata at petabyte scale. Its architecture reflects lessons from operating at LinkedIn’s data volumes.
Stream-Oriented Architecture
DataHub treats metadata as a stream of events. Schema updates, ownership transfers, and lineage changes become events in the metadata stream.
This approach provides:
Temporal metadata: Complete history of all changes. You see not just current schemas but how they evolved over time.
Real-time updates: Changes propagate immediately. When an engineer updates a pipeline, downstream consumers see lineage changes instantly.
Event-driven integration: Other systems subscribe to metadata events. Schema changes trigger documentation updates. New datasets trigger access control policies.
Scalable ingestion: The stream architecture handles millions of metadata events daily.
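The event-stream model can be sketched in a few lines. This is a toy in-process stand-in for a Kafka-backed metadata stream (the class and event names are illustrative, not DataHub’s API), showing the two properties that matter: an append-only history and real-time fan-out to subscribers:

```python
from dataclasses import dataclass, field
from typing import Callable
import datetime

@dataclass
class MetadataEvent:
    entity: str    # e.g. "warehouse.orders"
    kind: str      # "schema_change", "ownership_transfer", "lineage_update"
    payload: dict
    at: datetime.datetime = field(default_factory=datetime.datetime.now)

class MetadataBus:
    """Toy in-process stand-in for a Kafka-backed metadata stream."""

    def __init__(self):
        self.log: list[MetadataEvent] = []   # append-only temporal history
        self.subscribers: list[Callable[[MetadataEvent], None]] = []

    def subscribe(self, handler: Callable[[MetadataEvent], None]) -> None:
        self.subscribers.append(handler)

    def emit(self, event: MetadataEvent) -> None:
        self.log.append(event)               # full change history is retained
        for handler in self.subscribers:     # real-time fan-out to consumers
            handler(event)

bus = MetadataBus()
alerts: list[str] = []
bus.subscribe(lambda e: alerts.append(e.entity) if e.kind == "schema_change" else None)
bus.emit(MetadataEvent("warehouse.orders", "schema_change", {"added": ["discount"]}))
```

Replaying `bus.log` gives you temporal metadata; the subscriber list gives you event-driven integration.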
Lineage Tracking
DataHub tracks lineage at multiple levels:
Field-level lineage: Tracks column flows, not just table dependencies. You can trace how customer_id propagates, transforms, and joins across systems.
Code-level lineage: Parses SQL, Spark jobs, and pipeline code to extract transformation logic automatically.
Cross-platform lineage: Tracks data across Kafka, Spark, Snowflake, and Tableau by understanding each platform’s semantics.
Impact analysis: Before modifying a schema, engineers see all downstream dependencies—reports, models, applications.
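At its core, impact analysis is a graph traversal over lineage edges. A sketch with a hypothetical field-level lineage map (the dataset names are invented for illustration):

```python
from collections import deque

# Hypothetical field-level lineage: source field -> its direct consumers.
LINEAGE = {
    "raw.orders.customer_id": ["staging.orders.customer_id"],
    "staging.orders.customer_id": ["mart.ltv.customer_id", "mart.churn.customer_id"],
    "mart.ltv.customer_id": ["dashboard.exec_ltv"],
}

def downstream(field: str) -> set[str]:
    """Everything that could break if `field` changes: BFS over lineage edges."""
    seen: set[str] = set()
    queue = deque([field])
    while queue:
        for consumer in LINEAGE.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

print(sorted(downstream("raw.orders.customer_id")))
```

An engineer about to rename `customer_id` in the raw table sees every report and model in the blast radius before touching the schema.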
Extensibility
DataHub’s plugin architecture allows deep customization:
Custom metadata models: Extend base entities with domain-specific properties. Equipment schemas include calibration schedules. Sensor data includes precision specifications.
Custom connectors: Write ingestion logic for proprietary systems. Legacy MES and custom IoT platforms integrate without vendor support.
Domain-specific processors: Implement custom handling for manufacturing patterns. Time-series data receives specialized treatment. Batch process data links to production runs.
Bi-directional integration: Push results from data quality platforms to DataHub. Read ownership information for access control. Create unified governance ecosystems.
Federation
DataHub supports federated deployments for organizations with autonomous business units:
Each unit runs its own DataHub instance with local autonomy. A central instance aggregates metadata enterprise-wide while respecting access boundaries. Units choose which metadata to share globally. Sensitive IP remains local. Cross-unit lineage tracks data flows across boundaries.
Composable Governance Architecture
The choice between OpenMetadata and DataHub is often false. Organizations use both for different use cases:
OpenMetadata for self-service: Analytics teams, data scientists, and business users get OpenMetadata’s intuitive interface and collaborative features.
DataHub for core platform: The central data platform team uses DataHub for scalability and extensibility. Critical lineage tracking runs on DataHub.
Specialized solutions: IoT platforms use time-series-specific metadata stores. ML platforms use MLflow for model metadata. These integrate through common APIs.
Metadata exchange layer: OpenLineage and OpenMetadata APIs synchronize metadata across platforms. Updates propagate while maintaining semantic consistency.
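The exchange layer works because platforms agree on a common event shape. A sketch of a run event loosely following the OpenLineage model (field names simplified; the producer URI is a placeholder):

```python
import json
import uuid
import datetime

def run_event(job: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style run event (shape simplified)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "producer": "https://example.com/etl",  # placeholder producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "analytics", "name": job},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

event = run_event("build_ltv", ["raw.orders"], ["mart.ltv"])
print(json.dumps(event, indent=2))
```

Because the event names the job and its input and output datasets, any platform that consumes it can stitch the run into its own lineage graph.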
Governance as a Service
This composable architecture enables self-service governance:
Automated onboarding: New data sources are discovered and cataloged automatically.
Policy enforcement: PII detection triggers access controls. Quality thresholds generate alerts. Retention policies initiate archival workflows.
Continuous compliance: GDPR reports generate automatically. SOX controls validate continuously. Regulatory requirements translate to automated checks.
Proactive issue detection: Orphaned datasets, quality degradation, and policy violations are caught before they cause incidents.
Network Effects
More adoption amplifies value:
Collective intelligence: Every team that documents data makes discovery easier for others. Usage patterns reveal hidden connections.
Emergent standards: Naming conventions, tagging strategies, and documentation templates spread through observation rather than mandate.
Cross-functional insights: Manufacturing data combined with sales data reveals optimization opportunities.
Trust through transparency: Teams verify sources, understand transformations, trace quality metrics. This trust enables ambitious cross-functional work.
Implementation Lessons
Organizations that succeed with metadata platforms share patterns:
Lead with Value, Not Compliance
Successful adoptions start with teams seeking value. When analytics teams adopt OpenMetadata to solve discovery problems, usage grows organically. When governance enables self-service rather than enforcing controls, adoption follows.
Lead with use cases that provide immediate time savings. Let compliance benefits emerge as side effects.
Incremental Rollout
Big-bang rollouts fail. Incremental adoption works:
Pilot with willing teams first. Early adopters provide feedback and become champions.
Start read-only. Write capabilities and enforcement come after trust is established.
Expand by use case. Each new capability—lineage tracking, quality monitoring, access control—rolls out separately.
Adapt based on feedback. Each phase teaches lessons that improve the next.
Invest in Integration
The highest-value investments:
Automate ingestion: Manual metadata entry never scales. Every automation improvement pays dividends in accuracy and coverage.
Bi-directional sync: Metadata flows both ways. Operational system changes update metadata stores. Metadata updates trigger operational changes.
API-first design: Everything accessible via APIs enables custom integrations, automated workflows, and future flexibility.
Culture Beats Technology
The most advanced metadata platform fails without cultural change:
Ownership culture: Clear data ownership drives metadata quality. When teams own their data assets, they maintain accurate documentation.
Transparency default: Shifting from “need-to-know” to “open-by-default” accelerates discovery and innovation.
Continuous improvement: Metadata quality improves through iteration, not perfection-seeking. Start with basic documentation and enrich over time.
Advanced Patterns
Active Metadata
Move from passive metadata (documentation about data) to active metadata (metadata that drives behavior):
Schema evolution: Metadata changes trigger downstream adaptations. Source schema changes cause transformation logic to adjust automatically.
Dynamic access control: Metadata classifications determine access rights in real-time. Reclassification triggers permission updates across all systems.
Automated remediation: Quality issues trigger automated fixes. Data drift detection initiates re-sync jobs.
Intelligent routing: High-value data receives premium processing. Sensitive data follows secure paths. Low-quality data is quarantined for cleanup.
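The active-metadata pattern above is the observer pattern applied to the catalog: classification changes push side effects instead of waiting to be read. A sketch with invented class and hook names:

```python
from typing import Callable

class ActiveCatalog:
    """Sketch: classification changes actively push access-control updates."""

    def __init__(self):
        self.classifications: dict[str, str] = {}
        self.hooks: list[Callable[[str, str], None]] = []  # reactive systems

    def on_reclassify(self, hook: Callable[[str, str], None]) -> None:
        self.hooks.append(hook)

    def classify(self, dataset: str, label: str) -> None:
        self.classifications[dataset] = label
        for hook in self.hooks:         # metadata drives behavior
            hook(dataset, label)

revoked: list[str] = []

def acl_sync(dataset: str, label: str) -> None:
    if label == "pii":
        revoked.append(dataset)         # e.g. drop broad read grants

catalog = ActiveCatalog()
catalog.on_reclassify(acl_sync)
catalog.classify("warehouse.customers", "pii")
```

Reclassifying a dataset as PII immediately propagates to every subscribed enforcement system, with no quarterly review in the loop.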
Metadata-Driven Development
Code generation: Transformation logic generates from metadata mappings.
Test generation: Data validation tests generate from metadata constraints.
Documentation generation: Technical docs stay synchronized with implementations.
Impact simulation: Proposed changes are simulated against metadata before engineers write code.
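Test generation from metadata constraints can be sketched directly: declarative column rules become executable checks. The constraint schema here is hypothetical:

```python
# Hypothetical column constraints captured as metadata.
CONSTRAINTS = {
    "orders.total": {"not_null": True, "min": 0},
    "orders.status": {"allowed": {"new", "paid", "shipped"}},
}

def make_check(rules: dict):
    """Generate a validation function from declarative constraints."""
    def check(value) -> bool:
        if rules.get("not_null") and value is None:
            return False
        if value is None:               # nullable column, nothing to validate
            return True
        if "min" in rules and value < rules["min"]:
            return False
        if "allowed" in rules and value not in rules["allowed"]:
            return False
        return True
    return check

checks = {col: make_check(rules) for col, rules in CONSTRAINTS.items()}
print(checks["orders.total"](-1))  # violates min -> False
```

Because the checks derive from the catalog, updating a constraint in metadata regenerates the tests; the two cannot drift apart.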
Governance Analytics
Coverage analytics: Which datasets lack documentation? Which systems are not integrated? These gaps direct governance effort.
Quality trending: Documentation completeness, lineage accuracy, and classification coverage become measurable objectives.
ROI measurement: Time saved in discovery, incidents prevented through lineage analysis, compliance costs avoided through automation.
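Coverage analytics are straightforward to compute once the catalog is machine-readable. A minimal sketch over an invented catalog structure:

```python
def documentation_coverage(catalog: list[dict]) -> float:
    """Share of datasets with both an owner and a non-empty description."""
    if not catalog:
        return 0.0
    documented = sum(1 for d in catalog if d.get("owner") and d.get("description"))
    return documented / len(catalog)

catalog = [
    {"name": "orders", "owner": "sales-eng", "description": "One row per order."},
    {"name": "tmp_export_7", "owner": None, "description": ""},
    {"name": "customers", "owner": "crm-team", "description": "Golden source."},
]
print(f"{documentation_coverage(catalog):.0%}")  # 2 of 3 documented -> 67%
```

The undocumented remainder (`tmp_export_7` here) is exactly the gap list that directs governance effort.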
Decision Rules
Choose OpenMetadata when:
- Your teams need intuitive self-service interfaces
- You prioritize collaborative documentation
- You want rapid out-of-the-box deployment
Choose DataHub when:
- You operate at petabyte scale
- You need advanced lineage tracking
- You require deep customization through plugins
Use both when:
- Different teams have different needs
- Core platform differs from self-service requirements
- Specialized domains need tailored solutions
The underlying principle: governance infrastructure must match data infrastructure cadence. Manual governance fails because it cannot scale. Automated, composable governance succeeds because it operates at data velocity.
Modern metadata platforms are mature. The patterns are proven. The choice is not whether to automate governance but how quickly to begin.