Data Governance Overview

Assess and improve the quality and completeness of your enterprise data assets with Data Governance. Profile data, enrich metadata using LLMs, generate data quality (DQ) rules, and produce scorecards, all orchestrated through automated workflows. This solution implements two CRAFT modules. CRAFT Assess covers data profiling, coverage, and policy-compliance checks. CRAFT Enrich covers metadata enrichment, DQ rule generation, and classification.

Data Governance uses Prefect for workflow orchestration with ConcurrentTaskRunner for flow-level parallelism and asyncio.Semaphore for task-level concurrency control. Workflows run on a Kubernetes work pool for production-grade scalability.

Who Is It For

Data Stewards

Profile data assets, review LLM-generated metadata enrichment, and track data quality scores across your organization’s data landscape.

Compliance Officers

Generate data quality rules, review scorecards, and ensure metadata standards are met across the organization for regulatory compliance.

Data Engineers

Configure data connections, manage profiling workflows, and integrate data quality checks into your data pipelines.

Data Analysts

Discover and understand data assets through enriched metadata, quality scores, and automated profiling summaries.

Key Capabilities

Data Profiling

Automatically analyze data assets to understand structure, distributions, patterns, and anomalies. Profiling produces statistical summaries that feed into enrichment and DQ rule generation.

Metadata Enrichment

Use LLMs to automatically enrich metadata with business descriptions, data classifications, semantic types, and governance annotations. Prompts are schema-aware, so the LLM receives the table and column context for the asset it is describing.

DQ Rule Generation

Automatically generate data quality rules based on profiling results and schema analysis. LLMs produce rules that are contextually relevant to the data’s business meaning and relationships.

Scorecard Reporting

Generate data quality scorecards that track metrics, violations, and trends across data assets. Scorecards provide executive-level visibility into data health.

Service Architecture

API Module	Purpose
Workflows	Trigger and monitor workflow runs
Metrics	Track data quality metrics
Data Assets	Manage registered data assets; batch profile lookups via `POST /profiles/columns/batch` and `POST /profiles/tables/batch`
Violations	Track DQ rule violations
Talk to Metadata	LLM-powered metadata interaction

Each domain module follows a consistent structure:

models/requests/ — Pydantic request models
models/responses/ — Pydantic response models
services/ — Business logic
routes.py — FastAPI router
dependencies.py — Dependency injection

Workflow Orchestration

Prefect orchestrates all data processing workflows with a hybrid concurrency model:

Flow-Level Concurrency

Flows use ConcurrentTaskRunner to run multiple table-level tasks in parallel. The concurrency limit is configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_TABLES). This provides:

Parallel processing across tables
Visibility in the Prefect UI
Configurable concurrency limits

Task-Level Concurrency

Within each Prefect task, asyncio.Semaphore controls fine-grained concurrency for operations like LLM API calls. Limits are configurable via config.yaml (e.g., DQ_RULE_GEN_MAX_CONCURRENT_RULES). This prevents:

API rate limit exhaustion
Memory pressure from too many concurrent operations
Unpredictable costs from uncontrolled LLM calls
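
The task-level pattern above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual task code: `generate_rule`, `generate_rules`, and the `MAX_CONCURRENT_RULES` constant are hypothetical names, and the `asyncio.sleep` call stands in for a real LLM request.

```python
import asyncio

# Hypothetical stand-in for the DQ_RULE_GEN_MAX_CONCURRENT_RULES value in config.yaml.
MAX_CONCURRENT_RULES = 3


async def generate_rule(column: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Simulate one LLM call; the semaphore caps how many run at once."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # placeholder for the real LLM request
        stats["in_flight"] -= 1
        return f"rule_for_{column}"


async def generate_rules(columns: list, limit: int = MAX_CONCURRENT_RULES):
    """Schedule every call up front, but let at most `limit` run at the same time."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    rules = await asyncio.gather(
        *(generate_rule(column, semaphore, stats) for column in columns)
    )
    return rules, stats["peak"]


rules, peak = asyncio.run(generate_rules([f"col_{i}" for i in range(10)]))
```

All ten calls are submitted at once, yet `peak` never exceeds the limit, which is what keeps rate limits, memory, and cost bounded.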

Kubernetes Work Pool

Production workflows run on a Kubernetes work pool named kubernetes-pool. Flow definitions are managed as code in scripts/prefect/deploy.py, which is the single source of truth. Key details:

Work pool is auto-created during setup
Docker images are tagged with git commit hashes for traceability
Secrets are injected via Helm secretKeyRef into pod environment variables

Error Handling

Workflow and infrastructure errors are wrapped in a WorkflowRuntimeError before being returned to API consumers. This normalizes raw Prefect and infrastructure stack traces into structured error responses, making errors actionable without exposing internal implementation details.
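
One way to picture this wrapping is the sketch below. Only the `WorkflowRuntimeError` name comes from the description above. Its fields (`code`, `message`, `workflow_run_id`), the `to_response` method, and the `run_workflow_safely` helper are assumptions made for illustration, so the real class may look different.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("workflow")


class WorkflowRuntimeError(Exception):
    """Structured workflow error (field names are assumed for illustration)."""

    def __init__(self, code: str, message: str, workflow_run_id: Optional[str] = None):
        super().__init__(message)
        self.code = code
        self.message = message
        self.workflow_run_id = workflow_run_id

    def to_response(self) -> dict:
        # Expose only structured, safe fields; never the stack trace.
        return {
            "error": {
                "code": self.code,
                "message": self.message,
                "workflow_run_id": self.workflow_run_id,
            }
        }


def run_workflow_safely(run: Callable[[], object], workflow_run_id: str) -> dict:
    """Call `run()` and normalize any failure into a structured error payload."""
    try:
        return {"result": run()}
    except WorkflowRuntimeError as exc:
        return exc.to_response()
    except Exception as exc:
        # The raw traceback stays in server logs only.
        logger.exception("workflow run %s failed", workflow_run_id)
        wrapped = WorkflowRuntimeError(
            "WORKFLOW_FAILED", "The workflow run failed.", workflow_run_id
        )
        wrapped.__cause__ = exc
        return wrapped.to_response()
```

The consumer receives a stable error code and the run identifier, while infrastructure details such as node or pod names remain in the logs.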

DQ Rule Generation

DQ rule generation combines profiling results with LLM intelligence:

Schema Analysis

The system calls prepare_tables_and_schema() to load the complete schema context, including all tables, columns, foreign keys, and relationships.

Table Selection

Users can select specific tables for rule generation, or process the entire database. When selected_tables is specified in event-triggered ingestion, only those tables are processed — the filter is enforced server-side.
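
A server-side filter of this kind could look like the sketch below. The `filter_tables` function is hypothetical. Treating a missing or empty selection as "entire database" and ignoring names that are not in the schema are also assumptions, not documented behavior.

```python
from typing import List, Optional


def filter_tables(all_tables: List[str], selected_tables: Optional[List[str]]) -> List[str]:
    """Return the tables to process, preserving schema order.

    Assumption: None or an empty list means "process the entire database".
    Assumption: requested names that are not in the schema are ignored.
    """
    if not selected_tables:
        return list(all_tables)
    wanted = set(selected_tables)
    return [table for table in all_tables if table in wanted]


tables_to_process = filter_tables(
    ["customers", "orders", "payments"], ["payments", "customers", "unknown_table"]
)
```

Because the filter runs against the schema the server loaded, a client cannot widen the scope by sending table names the connection does not expose.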

LLM-Powered Generation

Schema context and profiling statistics are sent to the LLM, which generates data quality rules that reflect business semantics and cross-table relationships.

Rule Validation

Generated rules are validated against the schema and persisted to the database. Rules include dimension (completeness, accuracy, consistency, etc.) and severity levels.
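
As a rough sketch of what validating a generated rule against the schema might involve, consider the following. The `DQRule` fields, the severity values, and the `validate_rule` helper are all hypothetical. The dimension set lists only the three dimensions named above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Assumed value sets: the text names three dimensions and no specific severities.
DIMENSIONS = {"completeness", "accuracy", "consistency"}
SEVERITIES = {"low", "medium", "high"}


@dataclass(frozen=True)
class DQRule:
    table: str
    column: str
    dimension: str
    severity: str
    expression: str


def validate_rule(rule: DQRule, schema: Dict[str, Set[str]]) -> List[str]:
    """Return a list of problems; an empty list means the rule can be persisted."""
    problems = []
    if rule.table not in schema:
        problems.append(f"unknown table: {rule.table}")
    elif rule.column not in schema[rule.table]:
        problems.append(f"unknown column: {rule.table}.{rule.column}")
    if rule.dimension not in DIMENSIONS:
        problems.append(f"unknown dimension: {rule.dimension}")
    if rule.severity not in SEVERITIES:
        problems.append(f"unknown severity: {rule.severity}")
    return problems


schema = {"orders": {"id", "customer_id"}}
good = DQRule("orders", "id", "completeness", "high", "id IS NOT NULL")
bad = DQRule("orders", "total", "completeness", "high", "total >= 0")
```

Here `validate_rule(good, schema)` returns an empty list, while the rule on the non-existent `total` column is rejected before it reaches the database.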

Database Architecture

Aspect	Details
ORM	SQLAlchemy 2.0+ with async support
Database	PostgreSQL 18+ (`datareadiness` database)
Migrations	Alembic with clean two-migration approach: initial schema + seed data
Primary keys	UUID v4 using PostgreSQL `gen_random_uuid()`
Timestamps	`TIMESTAMPTZ` (TIMESTAMP WITH TIME ZONE)
Seed data	112 records: 11 workflows + 11 configs + 12 dimensions + 16 metrics + 37 violations + 25 DQ tags

The project uses a clean two-migration approach: one migration for the complete schema (tables, indexes, constraints) and a second for seed data. Seed data uses idempotent INSERT ... ON CONFLICT DO NOTHING for safe re-runs.
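
The idempotent seeding pattern can be demonstrated without a PostgreSQL instance. The sketch below uses Python's built-in `sqlite3` as a stand-in, since SQLite 3.24 and later accepts the same `ON CONFLICT ... DO NOTHING` clause. The `dimensions` table and its three rows are illustrative and do not reproduce the actual migration.

```python
import sqlite3

# Illustrative subset; the real seed migration inserts far more records.
SEED_DIMENSIONS = ["completeness", "accuracy", "consistency"]


def seed(conn: sqlite3.Connection) -> None:
    """Insert seed rows; rows that already exist are skipped instead of failing."""
    conn.execute("CREATE TABLE IF NOT EXISTS dimensions (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO dimensions (name) VALUES (?) ON CONFLICT (name) DO NOTHING",
        [(name,) for name in SEED_DIMENSIONS],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed(conn)
seed(conn)  # the second run hits the primary-key conflict and changes nothing
row_count = conn.execute("SELECT COUNT(*) FROM dimensions").fetchone()[0]
```

Running the seed twice leaves the row count unchanged, which is the property that makes re-running the migration safe.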

Configuration

Data Governance uses OmegaConf for hierarchical configuration:

Source	Priority	Example
Environment variables	Highest	`VERTEXAI_PROJECT` / `VERTEXAI_LOCATION` (LLM, CRAFT default), `REDIS_PASSWORD`; `OPENAI_API_KEY` only if a non-Vertex provider is configured
`config.yaml`	Default	Database URLs, concurrency limits, workflow parameters
Internal references	—	`${.field_name}` for self-referencing config values

Configuration is loaded once at module import via common/config.py. Environment variables can be referenced in YAML using ${oc.env:VAR_NAME, default} syntax.
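
To make the `${oc.env:VAR_NAME, default}` semantics concrete, the sketch below emulates them in plain Python. It is not OmegaConf itself and covers only this one reference form. The `resolve_env_refs` helper and the `us-central1` default are assumptions for illustration.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${oc.env:VAR_NAME} and ${oc.env:VAR_NAME, default}.
_ENV_REF = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)(?:\s*,\s*([^}]*))?\}")


def resolve_env_refs(value: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Replace each reference with the environment value, else its default."""
    env = os.environ if environ is None else environ

    def _substitute(match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]  # the environment variable takes priority
        if default is None:
            raise KeyError(f"{name} is not set and no default was given")
        return default.strip()

    return _ENV_REF.sub(_substitute, value)


template = "${oc.env:VERTEXAI_LOCATION, us-central1}"
from_default = resolve_env_refs(template, {})
from_env = resolve_env_refs(template, {"VERTEXAI_LOCATION": "europe-west4"})
```

With an empty environment the YAML default is used, and once the variable is set its value wins, matching the priority order in the table above.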

Supported Data Sources

Data Governance connects to external data sources for profiling, enrichment, and DQ rule generation. Supported connector types:

Connector	Auth	Notes
PostgreSQL	Username + password	Native async (`asyncpg`)
Redshift	Username + password	Via `redshift_connector`
Snowflake	RSA key-pair only	`asyncio.to_thread()` wrapping (no native async driver); account identifier format (e.g., `ORGNAME-ACCOUNTNAME`); service users must be `TYPE=SERVICE` with a mounted `.p8` private key

Connector selection is determined by the data connection type registered in the Assets service. Credentials are injected via Kubernetes secrets at runtime — never stored in the database.
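
The `asyncio.to_thread()` wrapping noted for Snowflake follows a general pattern for synchronous drivers, sketched here with the standard library only. `run_query_blocking` is a hypothetical stand-in for a blocking driver call and does not use the Snowflake connector API.

```python
import asyncio
import time
from typing import List, Tuple


def run_query_blocking(sql: str) -> List[Tuple[str, int]]:
    """Stand-in for a synchronous driver call that blocks until rows arrive."""
    time.sleep(0.05)  # simulates network and query latency
    return [(sql, 1)]


async def run_query(sql: str) -> List[Tuple[str, int]]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_query_blocking, sql)


async def profile_tables(tables: List[str]) -> List[List[Tuple[str, int]]]:
    # The blocking calls overlap because each one runs in its own thread.
    return await asyncio.gather(
        *(run_query(f"SELECT COUNT(*) FROM {table}") for table in tables)
    )


results = asyncio.run(profile_tables(["orders", "customers"]))
```

The trade-off is that concurrency is bounded by the default thread pool rather than by the event loop, which is usually acceptable for I/O-bound queries.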

Platform Integration

Data Governance integrates with the em-runtime platform for shared services:

Capability	Platform Service	Integration
Authentication	Governance (8000)	JWT validation via Keycloak
Authorization	Governance (8000)	Permission checks via OpenFGA SDK
Data connections	Assets (8000)	Registered data source configurations (PostgreSQL, Redshift, Snowflake)
Secrets	Infisical, or ESO + GCP Secret Manager (see Secrets Management)	Credentials for data connections and API keys

Tech Stack

Technology	Purpose
Python 3.12+	Service and workflow implementation
FastAPI	Async REST API framework
Prefect	Workflow orchestration with Kubernetes work pool
PostgreSQL 18+	Primary data store
Redis 7+	Caching and messaging
SQLAlchemy 2.0+	Async ORM with `asyncpg`
OmegaConf	Hierarchical YAML configuration
LLM provider API (Vertex AI by default; OpenAI API if configured)	LLM-powered enrichment and rule generation
Docker Compose	Local development (PostgreSQL, Redis, Prefect)
uv	Python workspace monorepo management
pytest + memray	Testing with memory profiling

Next Steps

Data Profiling

Learn how automated profiling analyzes your data assets.

Data Enrichment

Understand LLM-powered metadata enrichment workflows.

Workflows

Configure and monitor Prefect-orchestrated data governance workflows.

Data Source Setup

Connect your data sources to the platform.