Data Governance Overview
Assess and improve the quality and completeness of your enterprise data assets with Data Governance. Profile data, enrich metadata using LLMs, generate data quality (DQ) rules, and produce scorecards, all orchestrated through automated workflows. This solution implements two of the CRAFT modules: data profiling, coverage, and policy-compliance checks implement CRAFT Assess; metadata enrichment, DQ rule generation, and classification implement CRAFT Enrich.
Data Governance uses Prefect for workflow orchestration, with ConcurrentTaskRunner for flow-level parallelism and asyncio.Semaphore for task-level concurrency control. Workflows run on a Kubernetes work pool for production-grade scalability.
Who Is It For
Data Stewards
Profile data assets, review LLM-generated metadata enrichment, and track data quality scores across your organization’s data landscape.
Compliance Officers
Generate data quality rules, review scorecards, and ensure metadata standards are met across the organization for regulatory compliance.
Data Engineers
Configure data connections, manage profiling workflows, and integrate data quality checks into your data pipelines.
Data Analysts
Discover and understand data assets through enriched metadata, quality scores, and automated profiling summaries.
Key Capabilities
Data Profiling
Automatically analyze data assets to understand structure, distributions, patterns, and anomalies. Profiling produces statistical summaries that feed into enrichment and DQ rule generation.
Metadata Enrichment
Use LLMs to automatically enrich metadata with business descriptions, data classifications, semantic types, and governance annotations. Schema-aware prompts ensure high-quality results.
DQ Rule Generation
Automatically generate data quality rules based on profiling results and schema analysis. LLMs produce rules that are contextually relevant to the data’s business meaning and relationships.
Scorecard Reporting
Generate data quality scorecards that track metrics, violations, and trends across data assets. Scorecards provide executive-level visibility into data health.
Service Architecture
| API Module | Purpose |
|---|---|
| Workflows | Trigger and monitor workflow runs |
| Metrics | Track data quality metrics |
| Data Assets | Manage registered data assets |
| Violations | Track DQ rule violations |
| Talk to Metadata | LLM-powered metadata interaction |
- models/requests/: Pydantic request models
- models/responses/: Pydantic response models
- services/: Business logic
- routes.py: FastAPI router
- dependencies.py: Dependency injection
Workflow Orchestration
Prefect orchestrates all data processing workflows with a hybrid concurrency model:
Flow-Level Concurrency
Flows use ConcurrentTaskRunner to run multiple table-level tasks in parallel. The concurrency limit is configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_TABLES).
This provides:
- Parallel processing across tables
- Visibility in the Prefect UI
- Configurable concurrency limits
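As a rough illustration of table-level fan-out with a configurable cap, the sketch below uses the standard library's ThreadPoolExecutor as a stand-in for Prefect's ConcurrentTaskRunner; it does not use Prefect itself. The names process_table and process_tables and the value 4 are hypothetical. In the real service the limit would come from config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_TABLES).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical value; stands in for DQ_RULE_GEN_MAX_CONCURRENT_TABLES.
MAX_CONCURRENT_TABLES = 4


def process_table(table: str) -> str:
    """Hypothetical per-table unit of work (profiling, rule generation, ...)."""
    return f"{table}: done"


def process_tables(tables: list[str], max_workers: int = MAX_CONCURRENT_TABLES) -> list[str]:
    """Run one unit of work per table, with at most max_workers in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input tables.
        return list(pool.map(process_table, tables))
```

Keeping the cap as a parameter mirrors the idea of a configurable limit: raising it increases throughput across tables, while lowering it reduces load on the source database.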
Task-Level Concurrency
Within each Prefect task, asyncio.Semaphore controls fine-grained concurrency for operations like LLM API calls. Limits are configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_RULES).
This prevents:
- API rate limit exhaustion
- Memory pressure from too many concurrent operations
- Unpredictable costs from uncontrolled LLM calls
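The semaphore pattern described above can be sketched as follows. This is a minimal illustration, not the service's actual code: fake_llm_call, generate_rules, and the limit of 3 are hypothetical, and the sleep stands in for a real LLM API request.

```python
import asyncio

# Hypothetical value; stands in for DQ_RULE_GEN_MAX_CONCURRENT_RULES.
MAX_CONCURRENT_RULES = 3


async def fake_llm_call(column: str) -> str:
    """Hypothetical placeholder for an LLM API request."""
    await asyncio.sleep(0.01)
    return f"rule for {column}"


async def generate_rules(columns: list[str], limit: int = MAX_CONCURRENT_RULES) -> tuple[list[str], int]:
    """Generate one rule per column with at most `limit` calls in flight.

    Returns the rules plus the peak number of concurrent calls observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded(column: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `limit` calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_llm_call(column)
            finally:
                in_flight -= 1

    rules = await asyncio.gather(*(bounded(c) for c in columns))
    return list(rules), peak
```

All coroutines are scheduled at once, but only `limit` of them pass the semaphore at any moment, which is what keeps request rates and memory use bounded.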
Kubernetes Work Pool
Production workflows run on a Kubernetes work pool named kubernetes-pool. Flow definitions are managed as code in scripts/prefect/deploy.py (single source of truth).
Key details:
- Work pool is auto-created during setup
- Docker images are tagged with git commit hashes for traceability
- Secrets are injected via Helm secretKeyRef into pod environment variables
DQ Rule Generation
DQ rule generation combines profiling results with LLM intelligence:
Schema Analysis
The system loads the complete schema context (prepare_tables_and_schema()), including all tables, columns, foreign keys, and relationships.
Table Selection
Users can select specific tables for rule generation, or process the entire database.
LLM-Powered Generation
Schema context and profiling statistics are sent to the LLM, which generates data quality rules that understand business semantics and cross-table relationships.
Database Architecture
| Aspect | Details |
|---|---|
| ORM | SQLAlchemy 2.0+ with async support |
| Database | PostgreSQL 18+ (datareadiness database) |
| Migrations | Alembic with clean two-migration approach: initial schema + seed data |
| Primary keys | UUID v4 using PostgreSQL gen_random_uuid() |
| Timestamps | TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) |
| Seed data | 112 records: 11 workflows + 11 configs + 12 dimensions + 16 metrics + 37 violations + 25 DQ tags |
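
The idempotent seed pattern this section describes (INSERT ... ON CONFLICT DO NOTHING) can be sketched as follows. This is a minimal illustration under stated assumptions: the dimensions table and its columns are hypothetical, and the standard library's sqlite3 is used in place of PostgreSQL so the example runs without a database server. The actual seed data lives in the Alembic migration.

```python
import sqlite3


def seed_dimensions(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> int:
    """Insert seed rows, skipping any that already exist; return the row count."""
    # Hypothetical table; the primary key is what ON CONFLICT checks against.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dimensions ("
        "name TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO dimensions (name, description) VALUES (?, ?) "
        "ON CONFLICT (name) DO NOTHING",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM dimensions").fetchone()[0]
```

Running the function twice with the same rows leaves the table unchanged on the second call, which is the property that makes re-running a seed migration safe.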
The project uses a clean two-migration approach: one migration for the complete schema (tables, indexes, constraints) and a second for seed data. Seed data uses idempotent INSERT ... ON CONFLICT DO NOTHING for safe re-runs.
Configuration
Data Governance uses OmegaConf for hierarchical configuration:

| Source | Priority | Example |
|---|---|---|
| Environment variables | Highest | OPENAI_API_KEY, REDIS_PASSWORD |
| config.yaml | Default | Database URLs, concurrency limits, workflow parameters |
| Internal references | — | ${.field_name} for self-referencing config values |

Configuration is managed in common/config.py. Environment variables can be referenced in YAML using ${oc.env:VAR_NAME, default} syntax.
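
To make the precedence concrete (environment variable first, then the default written in YAML), here is a standard-library stand-in that mimics what a ${oc.env:VAR_NAME, default} reference resolves to. It is not OmegaConf and not the code in common/config.py; resolve_env and the example values are hypothetical, and real resolution is performed by OmegaConf itself.

```python
import os
import re
from collections.abc import Mapping
from typing import Optional

# Matches ${oc.env:NAME} and ${oc.env:NAME, default}.
_ENV_REF = re.compile(r"\$\{oc\.env:([^,}]+)(?:,\s*([^}]*))?\}")


def resolve_env(value: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Replace each env reference with the variable's value or its default."""
    env = os.environ if environ is None else environ

    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        default = match.group(2)
        if name in env:
            return env[name]  # the environment takes priority
        if default is None:
            raise KeyError(f"environment variable {name!r} is not set")
        return default.strip()  # fall back to the default from the YAML value

    return _ENV_REF.sub(substitute, value)
```

For example, a value such as `${oc.env:REDIS_PASSWORD, changeme}` yields the environment's REDIS_PASSWORD when it is set and `changeme` otherwise.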
Supported Data Sources
Data Governance connects to external data sources for profiling, enrichment, and DQ rule generation. Supported connector types:

| Connector | Auth | Notes |
|---|---|---|
| PostgreSQL | Username + password | Native async (asyncpg) |
| Redshift | Username + password | Via redshift_connector |
| Snowflake | RSA key-pair only | asyncio.to_thread() wrapping (no native async driver); account identifier format (e.g., ORGNAME-ACCOUNTNAME); service users must be TYPE=SERVICE with a mounted .p8 private key |
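
The asyncio.to_thread() wrapping noted for Snowflake can be sketched like this. The example is an assumption-laden illustration: run_query_blocking is a hypothetical stand-in for a synchronous driver call (it only sleeps and echoes the SQL), and no Snowflake library is used.

```python
import asyncio
import time


def run_query_blocking(sql: str) -> list[tuple]:
    """Hypothetical stand-in for a blocking, synchronous driver call."""
    time.sleep(0.05)  # simulates network and query latency
    return [(sql.upper(),)]


async def run_query(sql: str) -> list[tuple]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_query_blocking, sql)


async def run_queries(statements: list[str]) -> list[list[tuple]]:
    """Issue several queries concurrently; results keep the input order."""
    return await asyncio.gather(*(run_query(s) for s in statements))
```

Because each blocking call runs in its own thread, other coroutines (for example API request handlers) continue to make progress while a query is in flight.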
Platform Integration
Data Governance integrates with the em-runtime platform for shared services:

| Capability | Platform Service | Integration |
|---|---|---|
| Authentication | Governance (8001) | JWT validation via Keycloak |
| Authorization | Governance (8001) | Permission checks via OpenFGA SDK |
| Data connections | Assets (8002) | Registered data source configurations (PostgreSQL, Redshift, Snowflake) |
| Secrets | Infisical or ESO + GCP Secret Manager, see Secrets Management | Credentials for data connections and API keys |
Tech Stack
| Technology | Purpose |
|---|---|
| Python 3.12+ | Service and workflow implementation |
| FastAPI | Async REST API framework |
| Prefect | Workflow orchestration with Kubernetes work pool |
| PostgreSQL 18+ | Primary data store |
| Redis 7+ | Caching and messaging |
| SQLAlchemy 2.0+ | Async ORM with asyncpg |
| OmegaConf | Hierarchical YAML configuration |
| OpenAI API | LLM-powered enrichment and rule generation |
| Docker Compose | Local development (PostgreSQL, Redis, Prefect) |
| uv | Python workspace monorepo management |
| pytest + memray | Testing with memory profiling |
Next Steps
Data Profiling
Learn how automated profiling analyzes your data assets.
Data Enrichment
Understand LLM-powered metadata enrichment workflows.
Workflows
Configure and monitor Prefect-orchestrated data governance workflows.
Data Source Setup
Connect your data sources to the platform.

