Metadata Enrichment

The metadata enrichment pipeline uses LLMs to automatically generate descriptions, classifications, and data quality rule suggestions for profiled data assets. This transforms raw schema information into rich, governed metadata that helps data teams understand and trust their data.

What Gets Enriched

| Enrichment | Description |
| --- | --- |
| Business description | Human-readable description of the table’s purpose and contents |
| Domain classification | Business domain the table belongs to (e.g., Sales, Finance, HR) |
| Sensitivity level | Data classification level (Public, Internal, Confidential, Restricted) |
| Data steward suggestion | Recommended owner based on domain and content |
| Relationship summary | Description of how the table relates to other tables |

Enrichment Pipeline

The enrichment workflow is orchestrated by Prefect and uses the OpenAI API for LLM-powered analysis:
1. Input preparation

The workflow gathers schema information and profiling results for the target tables. Schema context (all tables and relationships) is assembled using prepare_tables_and_schema() to give the LLM full database context.

2. LLM enrichment

Schema and profiling data are sent to the LLM with structured prompts. The LLM generates metadata enrichments based on:
  • Column names and types
  • Profiling statistics (completeness, uniqueness, patterns)
  • Table relationships and foreign keys
  • Cross-table context (schema_info for understanding relationships)

3. Validation

Generated metadata is validated against the schema. PII classifications are cross-checked with column patterns and profiling data.

4. Storage

Enriched metadata is stored in the Data Governance database, linked to the corresponding data asset records.
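
The validation step above can be illustrated with a minimal sketch. It assumes an enrichment payload shaped like the example under Enrichment Results; the function name `validate_enrichment` and the pattern list `PII_NAME_PATTERNS` are hypothetical and are not part of the pipeline's documented API.

```python
import re

# Hypothetical name patterns that often indicate PII; not the pipeline's actual list.
PII_NAME_PATTERNS = [r"email", r"phone", r"ssn", r"birth", r"address"]


def validate_enrichment(enrichment: dict, schema_columns: set) -> list:
    """Return a list of human-readable problems found in one table's enrichment."""
    problems = []
    for col in enrichment.get("columns", []):
        name = col["name"]
        # Check 1: the LLM must not describe columns that are absent from the schema.
        if name not in schema_columns:
            problems.append(f"unknown column: {name}")
            continue
        # Check 2: flag columns whose name looks like PII but were not labelled as PII.
        looks_like_pii = any(
            re.search(p, name, re.IGNORECASE) for p in PII_NAME_PATTERNS
        )
        if looks_like_pii and col.get("pii_classification") != "PII":
            problems.append(f"possible unlabelled PII: {name}")
    return problems


sample = {
    "table": "customers",
    "columns": [
        {"name": "email", "pii_classification": "PII"},
        {"name": "phone", "pii_classification": "None"},
        {"name": "loyalty_tier", "pii_classification": "None"},
        {"name": "nickname"},  # not present in the schema below
    ],
}
issues = validate_enrichment(sample, {"email", "phone", "loyalty_tier"})
print(issues)
```

A real cross-check would presumably also use profiling data (for example, detected value patterns), which this sketch leaves out.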

Schema Context for Better Enrichment

The enrichment pipeline provides full schema context to the LLM for higher-quality results:
from dq.utils.schema_helpers import prepare_tables_and_schema

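# Gather schema context. This call must run inside an async function;
# `connection` is assumed to be an already-open database connection.
schema_info, tables_info = await prepare_tables_and_schema(connection)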

# schema_info: all tables in the database (for relationship understanding)
# tables_info: selected tables for enrichment

# Pass both to the LLM prompt
prompt = build_enrichment_prompt(
    tables=tables_info,
    schema_context=schema_info  # LLM understands cross-table relationships
)
Providing full schema context allows the LLM to understand foreign key relationships and generate more accurate business descriptions. For example, it can identify that customer_id in the orders table references the customers table and generate appropriate descriptions.
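
To make the idea concrete, here is one way a prompt builder could combine the two inputs. This is a sketch only: the real `build_enrichment_prompt` is not shown in this page, so the body, the prompt wording, and the dictionary shapes used for `schema_info` and `tables_info` are all assumptions.

```python
import json


def build_enrichment_prompt(tables: list, schema_context: list) -> str:
    """Hypothetical prompt builder; the real function's output may differ."""
    # Summarise every table in the database so the model can see relationships.
    context_lines = [
        f"- {t['name']}({', '.join(c['name'] for c in t['columns'])})"
        for t in schema_context
    ]
    # Give full detail only for the tables selected for enrichment.
    target_json = json.dumps(tables, indent=2)
    return (
        "You are a data governance assistant.\n"
        "Database schema (all tables):\n"
        + "\n".join(context_lines)
        + "\n\nTables to enrich (with details):\n"
        + target_json
        + "\n\nReturn a business description for each table and column."
    )


# Assumed data shapes, for illustration only.
schema_info = [
    {"name": "customers", "columns": [{"name": "id"}, {"name": "email"}]},
    {"name": "orders", "columns": [{"name": "id"}, {"name": "customer_id"}]},
]
tables_info = [schema_info[1]]  # enrich only the orders table
prompt = build_enrichment_prompt(tables=tables_info, schema_context=schema_info)
print(prompt)
```

Keeping the full schema as a compact one-line-per-table summary, while sending full detail only for the selected tables, is one way to limit prompt size on large databases.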

Concurrency Control

Enrichment workflows use the same hybrid concurrency model as profiling:
| Level | Mechanism | Control |
| --- | --- | --- |
| Flow-level | ConcurrentTaskRunner | Max concurrent table enrichment tasks |
| Task-level | asyncio.Semaphore | Max concurrent LLM API calls within each task |
The task-level semaphore is particularly important for enrichment to avoid hitting LLM API rate limits:
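import asyncio

from prefect import task

# `config` and `llm_enrich` are defined elsewhere in the pipeline.

@task
async def enrich_table(table_info, schema_context):
    # One semaphore per task run caps concurrent LLM calls for this table.
    semaphore = asyncio.Semaphore(config.max_concurrent_llm_calls)
    columns = table_info["columns"]

    async def enrich_column(column):
        # Wait for a free slot before calling the LLM.
        async with semaphore:
            return await llm_enrich(column, schema_context)

    # Results come back in the same order as `columns`.
    results = await asyncio.gather(*[enrich_column(c) for c in columns])
    return results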
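
The effect of the task-level semaphore can be checked without Prefect or a real LLM. The sketch below replaces the LLM call with a stand-in coroutine (`fake_llm_enrich`, a hypothetical name) that records how many calls are in flight, and uses a constant in place of `config.max_concurrent_llm_calls`.

```python
import asyncio

MAX_CONCURRENT_LLM_CALLS = 2  # stand-in for config.max_concurrent_llm_calls


async def fake_llm_enrich(column: str, tracker: dict) -> str:
    """Stand-in for an LLM call; records how many calls are in flight."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    tracker["active"] -= 1
    return f"description for {column}"


async def enrich_columns(columns: list) -> tuple:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)
    tracker = {"active": 0, "peak": 0}

    async def enrich_one(column: str) -> str:
        # At most MAX_CONCURRENT_LLM_CALLS coroutines are inside this block at once.
        async with semaphore:
            return await fake_llm_enrich(column, tracker)

    # gather preserves input order even though calls finish at different times.
    results = await asyncio.gather(*(enrich_one(c) for c in columns))
    return results, tracker["peak"]


results, peak = asyncio.run(
    enrich_columns(["id", "email", "phone", "created_at", "status"])
)
print(f"peak concurrent calls: {peak}")
```

With five columns and a limit of two, the peak number of in-flight calls never exceeds two, while all five results are still returned in column order.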

LLM Configuration

The enrichment pipeline uses the OpenAI API (configurable):
| Setting | Source | Description |
| --- | --- | --- |
| OPENAI_API_KEY | Environment variable | API key for LLM access |
| Model | config.yaml | Model identifier (e.g., gpt-4o) |
| Max tokens | config.yaml | Response token limit |
| Temperature | config.yaml | Creativity level (low for factual metadata) |
Enrichment prompts include database schema information and profiling statistics. Ensure your LLM provider’s data handling policies are compatible with the sensitivity level of your data. For Restricted data, consider using self-hosted LLMs.
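
As an illustration of how these settings might be combined at startup, the sketch below merges an environment variable with values from an already-parsed config.yaml. The `llm` section, its key names (`model`, `max_tokens`, `temperature`), and the default values are assumptions; check your own config.yaml for the actual layout.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    model: str
    max_tokens: int
    temperature: float


def load_llm_settings(config: Mapping, env: Optional[Mapping] = None) -> LLMSettings:
    """Build LLM settings from a parsed config mapping and the environment."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early instead of on the first LLM call.
        raise RuntimeError("OPENAI_API_KEY is not set")
    llm = config.get("llm", {})  # hypothetical section name
    temperature = float(llm.get("temperature", 0.0))
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return LLMSettings(
        api_key=api_key,
        model=llm.get("model", "gpt-4o"),
        max_tokens=int(llm.get("max_tokens", 1024)),
        temperature=temperature,
    )


# `parsed_config` stands in for the result of parsing config.yaml.
parsed_config = {"llm": {"model": "gpt-4o", "max_tokens": 512, "temperature": 0.1}}
settings = load_llm_settings(parsed_config, env={"OPENAI_API_KEY": "sk-example"})
print(settings.model, settings.max_tokens, settings.temperature)
```

Passing `env` explicitly keeps the function easy to test; in normal use it falls back to `os.environ`.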

DQ Rule Generation

As part of enrichment, the LLM suggests data quality (DQ) rules for each column:
| Rule Type | Example |
| --- | --- |
| Completeness | "email column should be 100% non-null" |
| Format | "phone column should match pattern XXX-XXX-XXXX" |
| Range | "age column should be between 0 and 150" |
| Uniqueness | "customer_id should be unique" |
| Referential | "product_id should exist in products table" |
| Consistency | "end_date should be after start_date" |
Generated rules are stored as structured metadata and can be used by the DQ monitoring system to track quality over time.
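
To show how structured rules like these could be evaluated, here is a minimal checker for four single-column rule types. It is not the DQ monitoring system's implementation: the function `check_rule` and the rule keys `regex`, `min`, and `max` are hypothetical, and referential and consistency rules are omitted because they need more than one column.

```python
import re


def check_rule(rule: dict, values: list) -> bool:
    """Return True if the column values satisfy the rule."""
    kind = rule["type"]
    non_null = [v for v in values if v is not None]
    if kind == "completeness":
        # Share of non-null values must reach the threshold.
        ratio = len(non_null) / len(values) if values else 1.0
        return ratio >= rule["threshold"]
    if kind == "format":
        # Every non-null value must match the whole pattern.
        pattern = re.compile(rule["regex"])
        return all(pattern.fullmatch(str(v)) is not None for v in non_null)
    if kind == "range":
        return all(rule["min"] <= v <= rule["max"] for v in non_null)
    if kind == "uniqueness":
        return len(set(non_null)) == len(non_null)
    raise ValueError(f"unsupported rule type: {kind}")


phones = ["555-123-4567", "555-987-6543", None]
ages = [34, 0, 150]
customer_ids = [1, 2, 2]

print(check_rule({"type": "completeness", "threshold": 1.0}, phones))  # one null value
print(check_rule({"type": "format", "regex": r"\d{3}-\d{3}-\d{4}"}, phones))
print(check_rule({"type": "range", "min": 0, "max": 150}, ages))
print(check_rule({"type": "uniqueness"}, customer_ids))  # 2 appears twice
```

In this sketch null values count against completeness only; format, range, and uniqueness checks skip them.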

Enrichment Results

Access enrichment results via the Data Governance API:
{
  "table": "customers",
  "enriched_at": "2026-04-03T12:00:00Z",
  "business_description": "Customer master data containing contact information and account details for all registered customers.",
  "domain": "Sales",
  "sensitivity": "Confidential",
  "columns": [
    {
      "name": "email",
      "business_description": "Primary email address for customer communication",
      "pii_classification": "PII",
      "suggested_rules": [
        {"type": "completeness", "threshold": 1.0},
        {"type": "format", "pattern": "email"}
      ]
    }
  ]
}
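
Once a payload like the one above has been retrieved, it can be consumed with the standard library alone. The sketch below parses a trimmed copy of that example and extracts the PII columns and suggested rule types; how the payload is fetched (endpoint, authentication) is not covered here.

```python
import json

# Trimmed copy of the example payload above.
response_body = """
{
  "table": "customers",
  "sensitivity": "Confidential",
  "columns": [
    {
      "name": "email",
      "pii_classification": "PII",
      "suggested_rules": [
        {"type": "completeness", "threshold": 1.0},
        {"type": "format", "pattern": "email"}
      ]
    }
  ]
}
"""

result = json.loads(response_body)

# Columns the enrichment flagged as PII.
pii_columns = [
    col["name"]
    for col in result["columns"]
    if col.get("pii_classification") == "PII"
]

# Rule types suggested for each column.
rules_by_column = {
    col["name"]: [rule["type"] for rule in col.get("suggested_rules", [])]
    for col in result["columns"]
}

print(pii_columns)
print(rules_by_column)
```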

Next Steps

Data Profiling

Profile your data before enrichment to provide statistical context.

Workflows

Learn about the Prefect orchestration system that runs enrichment.

GDPR Compliance

Understand PII classification in the context of GDPR.

Data Classification

Review how enrichment outputs map to data classification levels.