Data Governance Overview

Assess and improve the quality and completeness of your enterprise data assets with Data Governance. Profile data, enrich metadata using LLMs, generate data quality (DQ) rules, and produce scorecards, all orchestrated through automated workflows. The solution implements two CRAFT modules. Data profiling, coverage, and policy-compliance checks implement CRAFT Assess; metadata enrichment, DQ rule generation, and classification implement CRAFT Enrich.

Data Governance uses Prefect for workflow orchestration, with ConcurrentTaskRunner for flow-level parallelism and asyncio.Semaphore for task-level concurrency control. Workflows run on a Kubernetes work pool for production-grade scalability.

Who Is It For

Data Stewards

Profile data assets, review LLM-generated metadata enrichment, and track data quality scores across your organization’s data landscape.

Compliance Officers

Generate data quality rules, review scorecards, and ensure metadata standards are met across the organization for regulatory compliance.

Data Engineers

Configure data connections, manage profiling workflows, and integrate data quality checks into your data pipelines.

Data Analysts

Discover and understand data assets through enriched metadata, quality scores, and automated profiling summaries.

Key Capabilities

Data Profiling

Automatically analyze data assets to understand structure, distributions, patterns, and anomalies. Profiling produces statistical summaries that feed into enrichment and DQ rule generation.
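
To make "statistical summaries" concrete, here is a minimal sketch of a per-column profile in plain Python. The ColumnProfile fields and the profile_column helper are hypothetical names chosen for illustration; the actual profiler's output and metrics are not described on this page.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class ColumnProfile:
    """Hypothetical summary record; field names are illustrative."""
    row_count: int
    null_count: int
    distinct_count: int
    min_value: Optional[Any]
    max_value: Optional[Any]
    most_common: Optional[Any]

    @property
    def null_ratio(self) -> float:
        # Share of rows that are NULL; 0.0 for an empty column.
        return self.null_count / self.row_count if self.row_count else 0.0


def profile_column(values: Sequence[Any]) -> ColumnProfile:
    """Compute basic statistics for one column's values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return ColumnProfile(
        row_count=len(values),
        null_count=len(values) - len(non_null),
        distinct_count=len(counts),
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
        most_common=counts.most_common(1)[0][0] if counts else None,
    )


profile = profile_column([3, None, 1, 3, None, 7])
print(profile)
```

A summary like this (null ratio, distinct count, value range) is the kind of input that could plausibly feed a completeness or range rule downstream.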

Metadata Enrichment

Use LLMs to automatically enrich metadata with business descriptions, data classifications, semantic types, and governance annotations. Prompts are schema-aware, so the LLM sees the surrounding table and column context when generating results.

DQ Rule Generation

Automatically generate data quality rules based on profiling results and schema analysis. LLMs produce rules that are contextually relevant to the data’s business meaning and relationships.

Scorecard Reporting

Generate data quality scorecards that track metrics, violations, and trends across data assets. Scorecards provide executive-level visibility into data health.

Service Architecture

| API Module | Purpose |
|---|---|
| Workflows | Trigger and monitor workflow runs |
| Metrics | Track data quality metrics |
| Data Assets | Manage registered data assets |
| Violations | Track DQ rule violations |
| Talk to Metadata | LLM-powered metadata interaction |

Each domain module follows a consistent structure:
  • models/requests/ — Pydantic request models
  • models/responses/ — Pydantic response models
  • services/ — Business logic
  • routes.py — FastAPI router
  • dependencies.py — Dependency injection

Workflow Orchestration

Prefect orchestrates all data processing workflows with a hybrid concurrency model:
Flows use ConcurrentTaskRunner to run multiple table-level tasks in parallel. The concurrency limit is configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_TABLES). This provides:
  • Parallel processing across tables
  • Visibility in the Prefect UI
  • Configurable concurrency limits
Within each Prefect task, asyncio.Semaphore controls fine-grained concurrency for operations like LLM API calls. Limits are configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_RULES). This prevents:
  • API rate limit exhaustion
  • Memory pressure from too many concurrent operations
  • Unpredictable costs from uncontrolled LLM calls
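
The task-level limit can be sketched with the standard library alone. This is an illustrative example, not the service's code: generate_rule is a hypothetical stand-in for an LLM call, MAX_CONCURRENT_RULES is an assumed value for the DQ_RULE_GEN_MAX_CONCURRENT_RULES setting, and the stats dictionary exists only to show that the limit holds.

```python
import asyncio

# Assumed stand-in for the DQ_RULE_GEN_MAX_CONCURRENT_RULES value from config.yaml.
MAX_CONCURRENT_RULES = 3


async def generate_rule(column: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical LLM call; the sleep stands in for network latency."""
    async with semaphore:  # at most MAX_CONCURRENT_RULES bodies run at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)
            return f"rule for {column}"
        finally:
            stats["active"] -= 1


async def generate_rules_for_table(columns: list[str]) -> tuple[list[str], int]:
    """Start one coroutine per column, bounded by the semaphore."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_RULES)
    stats = {"active": 0, "peak": 0}
    rules = await asyncio.gather(
        *(generate_rule(c, semaphore, stats) for c in columns)
    )
    return list(rules), stats["peak"]


rules, peak = asyncio.run(generate_rules_for_table([f"col_{i}" for i in range(10)]))
print(len(rules), "rules generated; peak concurrency:", peak)
```

All ten coroutines are scheduled immediately, but only three hold the semaphore at any moment, which is what keeps request bursts and spend bounded.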
Production workflows run on a Kubernetes work pool named kubernetes-pool. Flow definitions are managed as code in scripts/prefect/deploy.py (single source of truth). Key details:
  • Work pool is auto-created during setup
  • Docker images are tagged with git commit hashes for traceability
  • Secrets are injected via Helm secretKeyRef into pod environment variables

DQ Rule Generation

DQ rule generation combines profiling results with LLM intelligence:
1. Schema Analysis

The system loads the complete schema context (prepare_tables_and_schema()), including all tables, columns, foreign keys, and relationships.

2. Table Selection

Users can select specific tables for rule generation or process the entire database.

3. LLM-Powered Generation

Schema context and profiling statistics are sent to the LLM, which generates data quality rules that reflect business semantics and cross-table relationships.

4. Rule Validation

Generated rules are validated against the schema and persisted to the database. Each rule carries a dimension (completeness, accuracy, consistency, etc.) and a severity level.
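
The validation step might look roughly like the sketch below. Everything here is an assumption for illustration: the DQRule fields, the validate_rule helper, the schema shape, and the dimension and severity value sets (the page lists three dimensions followed by "etc." and does not enumerate severities).

```python
from dataclasses import dataclass

# Assumed value sets; the real lists are not given on this page.
KNOWN_DIMENSIONS = {"completeness", "accuracy", "consistency"}
KNOWN_SEVERITIES = {"low", "medium", "high"}


@dataclass(frozen=True)
class DQRule:
    """Hypothetical shape of an LLM-generated rule."""
    table: str
    column: str
    dimension: str
    severity: str
    expression: str


def validate_rule(rule: DQRule, schema: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the rule is valid."""
    problems = []
    if rule.table not in schema:
        problems.append(f"unknown table: {rule.table}")
    elif rule.column not in schema[rule.table]:
        problems.append(f"unknown column: {rule.table}.{rule.column}")
    if rule.dimension not in KNOWN_DIMENSIONS:
        problems.append(f"unknown dimension: {rule.dimension}")
    if rule.severity not in KNOWN_SEVERITIES:
        problems.append(f"unknown severity: {rule.severity}")
    return problems


schema = {"orders": {"id", "customer_id", "total"}}
good = DQRule("orders", "total", "accuracy", "high", "total >= 0")
bad = DQRule("orders", "discount", "timeliness", "high", "discount <= total")

print(validate_rule(good, schema))
print(validate_rule(bad, schema))
```

Checking LLM output against the loaded schema before persisting it is a cheap guard against rules that reference columns the model invented.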

Database Architecture

| Aspect | Details |
|---|---|
| ORM | SQLAlchemy 2.0+ with async support |
| Database | PostgreSQL 18+ (datareadiness database) |
| Migrations | Alembic with clean two-migration approach: initial schema + seed data |
| Primary keys | UUID v4 using PostgreSQL gen_random_uuid() |
| Timestamps | TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) |
| Seed data | 112 records: 11 workflows + 11 configs + 12 dimensions + 16 metrics + 37 violations + 25 DQ tags |

The project uses a clean two-migration approach: one migration for the complete schema (tables, indexes, constraints) and a second for seed data. Seed data uses idempotent INSERT ... ON CONFLICT DO NOTHING for safe re-runs.
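
The idempotent seeding pattern can be demonstrated without a PostgreSQL server. The sketch below swaps in Python's built-in sqlite3 module (SQLite 3.24+ accepts the same ON CONFLICT DO NOTHING clause) in place of the real Alembic migration; the dimensions table, its columns, and the seed rows are made up for the example.

```python
import sqlite3

# Illustrative seed rows; the IDs and names are made up for this sketch.
SEED_DIMENSIONS = [
    ("d-001", "completeness"),
    ("d-002", "accuracy"),
    ("d-003", "consistency"),
]


def seed_dimensions(conn: sqlite3.Connection) -> int:
    """Insert seed rows, skipping any that already exist. Returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dimensions (id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO dimensions (id, name) VALUES (?, ?) ON CONFLICT DO NOTHING",
        SEED_DIMENSIONS,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM dimensions").fetchone()[0]


conn = sqlite3.connect(":memory:")
first = seed_dimensions(conn)
second = seed_dimensions(conn)  # re-run: conflicting rows are skipped, no error
print(first, second)
```

Because conflicting rows are skipped rather than rejected, the second run neither fails on the primary key nor duplicates data, which is what makes re-running the seed migration safe.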

Configuration

Data Governance uses OmegaConf for hierarchical configuration:

| Source | Priority | Example |
|---|---|---|
| Environment variables | Highest | OPENAI_API_KEY, REDIS_PASSWORD |
| config.yaml | Default | Database URLs, concurrency limits, workflow parameters |
| Internal references | | ${.field_name} for self-referencing config values |

Configuration is loaded once at module import via common/config.py. Environment variables can be referenced in YAML using ${oc.env:VAR_NAME, default} syntax.
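
To illustrate the precedence rule (an environment variable wins, otherwise the default written in the YAML applies), here is a small plain-Python imitation of the ${oc.env:VAR_NAME, default} lookup. It is not OmegaConf's implementation and does not call OmegaConf; resolve_env_refs and the DG_DEMO_REDIS_HOST variable are hypothetical names used only for this sketch.

```python
import os
import re

# Matches ${oc.env:VAR_NAME} and ${oc.env:VAR_NAME, default}.
_ENV_REF = re.compile(
    r"\$\{oc\.env:\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:,\s*([^}]*))?\}"
)


def resolve_env_refs(value: str) -> str:
    """Replace each reference with the env var's value, else its default."""

    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in os.environ:
            return os.environ[name]  # environment variable has highest priority
        if default is not None:
            return default.strip()  # fall back to the default in the config
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_REF.sub(_sub, value)


template = "redis://${oc.env:DG_DEMO_REDIS_HOST, localhost}:6379"

os.environ.pop("DG_DEMO_REDIS_HOST", None)
print(resolve_env_refs(template))  # default applies when the variable is unset

os.environ["DG_DEMO_REDIS_HOST"] = "cache.internal"
print(resolve_env_refs(template))  # the environment value overrides the default
```

In the real service the resolution is done by OmegaConf at load time; the sketch only mirrors the observable behavior described above.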

Supported Data Sources

Data Governance connects to external data sources for profiling, enrichment, and DQ rule generation. Supported connector types:

| Connector | Auth | Notes |
|---|---|---|
| PostgreSQL | Username + password | Native async (asyncpg) |
| Redshift | Username + password | Via redshift_connector |
| Snowflake | RSA key-pair only | asyncio.to_thread() wrapping (no native async driver); account identifier format (e.g., ORGNAME-ACCOUNTNAME); service users must be TYPE=SERVICE with a mounted .p8 private key |

Connector selection is determined by the data connection type registered in the Assets service. Credentials are injected via Kubernetes secrets at runtime and are never stored in the database.
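
The asyncio.to_thread() wrapping mentioned for Snowflake follows a general pattern for calling a synchronous driver from async code. In the sketch below, run_query_blocking is a hypothetical stand-in for a blocking driver call (no Snowflake library is used), and the sleep simulates network I/O.

```python
import asyncio
import time


def run_query_blocking(sql: str) -> list[tuple]:
    """Hypothetical stand-in for a synchronous driver call."""
    time.sleep(0.05)  # simulates blocking network I/O
    return [(sql, "ok")]


async def run_query(sql: str) -> list[tuple]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_query_blocking, sql)


async def main() -> list[list[tuple]]:
    # The two queries overlap because each one blocks its own worker thread.
    return await asyncio.gather(run_query("SELECT 1"), run_query("SELECT 2"))


results = asyncio.run(main())
print(results)
```

The trade-off is that each in-flight query occupies a thread from the default executor, so an outer concurrency limit is still useful.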

Platform Integration

Data Governance integrates with the em-runtime platform for shared services:

| Capability | Platform Service | Integration |
|---|---|---|
| Authentication | Governance (8001) | JWT validation via Keycloak |
| Authorization | Governance (8001) | Permission checks via OpenFGA SDK |
| Data connections | Assets (8002) | Registered data source configurations (PostgreSQL, Redshift, Snowflake) |
| Secrets | Infisical or ESO + GCP Secret Manager (see Secrets Management) | Credentials for data connections and API keys |

Tech Stack

| Technology | Purpose |
|---|---|
| Python 3.12+ | Service and workflow implementation |
| FastAPI | Async REST API framework |
| Prefect | Workflow orchestration with Kubernetes work pool |
| PostgreSQL 18+ | Primary data store |
| Redis 7+ | Caching and messaging |
| SQLAlchemy 2.0+ | Async ORM with asyncpg |
| OmegaConf | Hierarchical YAML configuration |
| OpenAI API | LLM-powered enrichment and rule generation |
| Docker Compose | Local development (PostgreSQL, Redis, Prefect) |
| uv | Python workspace monorepo management |
| pytest + memray | Testing with memory profiling |

Next Steps

Data Profiling

Learn how automated profiling analyzes your data assets.

Data Enrichment

Understand LLM-powered metadata enrichment workflows.

Workflows

Configure and monitor Prefect-orchestrated data governance workflows.

Data Source Setup

Connect your data sources to the platform.