Metadata Enrichment

The metadata enrichment pipeline uses LLMs to automatically generate descriptions, classifications, and data quality rule suggestions for profiled data assets. This transforms raw schema information into rich, governed metadata that helps data teams understand and trust their data.

What Gets Enriched

Table Metadata
Column Metadata

Enrichment	Description
Business description	Human-readable description of the table’s purpose and contents
Domain classification	Business domain the table belongs to (e.g., Sales, Finance, HR)
Sensitivity level	Data classification level (Public, Internal, Confidential, Restricted)
Data steward suggestion	Recommended owner based on domain and content
Relationship summary	Description of how the table relates to other tables

Enrichment	Description
Business description	What the column represents in business terms
Data type assessment	Whether the declared type matches actual content
PII classification	Whether the column contains personally identifiable information
Format pattern	Expected format (email, phone, date, currency, etc.)
Valid value range	Expected min/max or enumerated values
DQ rule suggestions	Recommended data quality rules for this column

Enrichment Pipeline

The enrichment workflow is orchestrated by Prefect and uses the OpenAI API for LLM-powered analysis:

Input preparation

The workflow gathers schema information and profiling results for the target tables. Schema context (all tables and relationships) is assembled using prepare_tables_and_schema() to give the LLM full database context.

LLM enrichment

Schema and profiling data are sent to the LLM with structured prompts. The LLM generates metadata enrichments based on:

Column names and types
Profiling statistics (completeness, uniqueness, patterns)
Table relationships and foreign keys
Cross-table context (schema_info for understanding relationships)

Validation

Generated metadata is validated against the schema. PII classifications are cross-checked with column patterns and profiling data.

Storage

Enriched metadata is stored in the Data Governance database, linked to the corresponding data asset records.

Schema Context for Better Enrichment

The enrichment pipeline provides full schema context to the LLM for higher-quality results:

from dq.utils.schema_helpers import prepare_tables_and_schema

# Gather schema context
schema_info, tables_info = await prepare_tables_and_schema(connection)

# schema_info: all tables in the database (for relationship understanding)
# tables_info: selected tables for enrichment

# Pass both to the LLM prompt
prompt = build_enrichment_prompt(
    tables=tables_info,
    schema_context=schema_info  # LLM understands cross-table relationships
)

Providing full schema context allows the LLM to understand foreign key relationships and generate more accurate business descriptions. For example, it can identify that customer_id in the orders table references the customers table and generate appropriate descriptions.

Concurrency Control

Enrichment workflows use the same hybrid concurrency model as profiling:

Level	Mechanism	Control
Flow-level	`ConcurrentTaskRunner`	Max concurrent table enrichment tasks
Task-level	`asyncio.Semaphore`	Max concurrent LLM API calls within each task

The task-level semaphore is particularly important for enrichment to avoid hitting LLM API rate limits:

@task
async def enrich_table(table_info, schema_context):
    semaphore = asyncio.Semaphore(config.max_concurrent_llm_calls)
    columns = table_info["columns"]

    async def enrich_column(column):
        async with semaphore:
            return await llm_enrich(column, schema_context)

    results = await asyncio.gather(*[enrich_column(c) for c in columns])
    return results

LLM Configuration

The enrichment pipeline uses LiteLLM and is provider-agnostic — defaults to Vertex AI but works with any LiteLLM-supported provider:

Setting	Source	Description
`VERTEXAI_PROJECT` / `VERTEXAI_LOCATION`	Environment variables	Vertex AI project and region (default `global`). LiteLLM also accepts the Google Cloud standards `GOOGLE_CLOUD_PROJECT` / `GOOGLE_CLOUD_LOCATION`.
`GOOGLE_APPLICATION_CREDENTIALS`	Environment variable	Path to ADC JSON (local dev). Use Workload Identity in production.
`OPENAI_API_KEY` / `ANTHROPIC_API_KEY`	Environment variables	Only required if you swap the model to a non-Vertex provider
Model	`config.yaml`	LiteLLM model identifier (e.g., `vertex_ai/gemini-3.5-flash`, `vertex_ai/claude-opus-4-8`, `openai/gpt-5.5`)
Max tokens	`config.yaml`	Response token limit
Temperature	`config.yaml`	Creativity level (low for factual metadata)

Enrichment prompts include database schema information and profiling statistics. Ensure your LLM provider’s data handling policies are compatible with the sensitivity level of your data. For Restricted data, consider using self-hosted LLMs.

DQ Rule Generation

As part of enrichment, the LLM suggests data quality (DQ) rules for each column:

Rule Type	Example
Completeness	”email column should be 100% non-null”
Format	”phone column should match pattern XXX-XXX-XXXX”
Range	”age column should be between 0 and 150”
Uniqueness	”customer_id should be unique”
Referential	”product_id should exist in products table”
Consistency	”end_date should be after start_date”

Generated rules are stored as structured metadata and can be used by the DQ monitoring system to track quality over time.

Enrichment Results

Access enrichment results via the Data Governance API:

{
  "table": "customers",
  "enriched_at": "2026-04-03T12:00:00Z",
  "business_description": "Customer master data containing contact information and account details for all registered customers.",
  "domain": "Sales",
  "sensitivity": "Confidential",
  "columns": [
    {
      "name": "email",
      "business_description": "Primary email address for customer communication",
      "pii_classification": "PII",
      "suggested_rules": [
        {"type": "completeness", "threshold": 1.0},
        {"type": "format", "pattern": "email"}
      ]
    }
  ]
}

Next Steps

Data Profiling

Profile your data before enrichment to provide statistical context.

Workflows

Learn about the Prefect orchestration system that runs enrichment.

GDPR Compliance

Understand PII classification in the context of GDPR.

Data Classification

Review how enrichment outputs map to data classification levels.

​Metadata Enrichment

​What Gets Enriched

​Enrichment Pipeline

​Schema Context for Better Enrichment

​Concurrency Control

​LLM Configuration

​DQ Rule Generation

​Enrichment Results

​Next Steps

Data Profiling

Workflows

GDPR Compliance

Data Classification

Metadata Enrichment

What Gets Enriched

Enrichment Pipeline

Schema Context for Better Enrichment

Concurrency Control

LLM Configuration

DQ Rule Generation

Enrichment Results

Next Steps