Metadata Enrichment
The metadata enrichment pipeline uses LLMs to automatically generate descriptions, classifications, and data quality rule suggestions for profiled data assets. This transforms raw schema information into rich, governed metadata that helps data teams understand and trust their data.
What Gets Enriched
- Table Metadata
- Column Metadata
| Enrichment | Description |
|---|---|
| Business description | Human-readable description of the table’s purpose and contents |
| Domain classification | Business domain the table belongs to (e.g., Sales, Finance, HR) |
| Sensitivity level | Data classification level (Public, Internal, Confidential, Restricted) |
| Data steward suggestion | Recommended owner based on domain and content |
| Relationship summary | Description of how the table relates to other tables |
Enrichment Pipeline
The enrichment workflow is orchestrated by Prefect and uses the OpenAI API for LLM-powered analysis:
Input preparation
The workflow gathers schema information and profiling results for the target tables. Schema context (all tables and relationships) is assembled using prepare_tables_and_schema() to give the LLM full database context.
LLM enrichment
Schema and profiling data are sent to the LLM with structured prompts. The LLM generates metadata enrichments based on:
- Column names and types
- Profiling statistics (completeness, uniqueness, patterns)
- Table relationships and foreign keys
- Cross-table context (schema_info for understanding relationships)
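As a rough illustration of how the inputs listed above might be combined into one structured prompt, the sketch below builds a JSON payload and prepends instructions. The function name build_enrichment_prompt, the dictionary keys, and the instruction wording are all hypothetical; the pipeline's actual prompts are not documented here.

```python
import json


def build_enrichment_prompt(table, profile, schema_info):
    """Assemble a prompt from schema, profiling, and cross-table context.

    All names and keys are illustrative assumptions, not the pipeline's API.
    """
    payload = {
        "table": table["name"],
        # Column names and types
        "columns": [
            {"name": c["name"], "type": c["type"]} for c in table["columns"]
        ],
        # Profiling statistics (completeness, uniqueness, patterns)
        "profiling": profile,
        # Table relationships and foreign keys
        "foreign_keys": table.get("foreign_keys", []),
        # Cross-table context for understanding relationships
        "schema_info": schema_info,
    }
    instructions = (
        "Describe the table and each column, classify sensitivity, "
        "and suggest data quality rules. Respond in JSON."
    )
    return instructions + "\n\n" + json.dumps(payload, indent=2, sort_keys=True)


# Made-up example inputs
table = {
    "name": "orders",
    "columns": [
        {"name": "order_id", "type": "integer"},
        {"name": "customer_id", "type": "integer"},
    ],
    "foreign_keys": [{"column": "customer_id", "references": "customers.id"}],
}
profile = {"order_id": {"completeness": 1.0, "uniqueness": 1.0}}
schema_info = {"tables": ["orders", "customers"]}

prompt = build_enrichment_prompt(table, profile, schema_info)
```

Serializing the context as JSON keeps the prompt deterministic for a given input, which makes enrichment runs easier to compare.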
Validation
Generated metadata is validated against the schema. PII classifications are cross-checked with column patterns and profiling data.
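The cross-check described above could look something like the following minimal sketch. The regular expressions, the function name cross_check_pii, and the three result labels are assumptions made for illustration; they are not the pipeline's actual validation rules.

```python
import re

# Hypothetical column-name patterns that hint at PII.
PII_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}

# Hypothetical value patterns checked against profiled sample values.
PII_VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}


def cross_check_pii(column_name, llm_pii_type, sample_values):
    """Return 'confirmed', 'conflict', or 'unverified' for an LLM PII label.

    llm_pii_type is the PII type the LLM assigned, or None for "not PII".
    """
    name_hits = {
        kind for kind, rx in PII_NAME_PATTERNS.items() if rx.search(column_name)
    }
    values = [v for v in sample_values if v]
    value_hits = {
        kind
        for kind, rx in PII_VALUE_PATTERNS.items()
        if values and all(rx.match(v) for v in values)
    }
    evidence = name_hits | value_hits

    if llm_pii_type is None:
        # The LLM said "not PII": any pattern evidence contradicts that.
        return "conflict" if evidence else "confirmed"
    if llm_pii_type in evidence:
        return "confirmed"
    # Evidence for a different type is a conflict; no evidence is unverified.
    return "conflict" if evidence else "unverified"


status = cross_check_pii("customer_email", "email", ["a@example.com"])
```

A conflict or unverified result would be a natural candidate for human review rather than automatic acceptance.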
Schema Context for Better Enrichment
The enrichment pipeline provides full schema context to the LLM for higher-quality results.
Concurrency Control
Enrichment workflows use the same hybrid concurrency model as profiling:
| Level | Mechanism | Control |
|---|---|---|
| Flow-level | ConcurrentTaskRunner | Max concurrent table enrichment tasks |
| Task-level | asyncio.Semaphore | Max concurrent LLM API calls within each task |
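The task-level row can be sketched with the standard library alone. In this minimal example, fake_llm_call stands in for a real LLM request and the limit of 2 is an assumed value; the flow-level task runner is omitted so the snippet runs without Prefect installed.

```python
import asyncio

MAX_CONCURRENT_LLM_CALLS = 2  # assumed limit, for illustration only


async def fake_llm_call(column, semaphore, stats):
    """Stand-in for an LLM API call; records how many calls overlap."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulate network latency
        stats["active"] -= 1
        return f"description for {column}"


async def enrich_table(columns):
    """Enrich every column of one table while capping concurrent LLM calls."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_llm_call(c, semaphore, stats) for c in columns)
    )
    return results, stats["peak"]


results, peak = asyncio.run(
    enrich_table(["id", "email", "phone", "created_at"])
)
```

All four coroutines are scheduled at once, but the semaphore lets at most two of them hold an LLM slot at any moment, which is what keeps API rate limits in check within a single table task.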
LLM Configuration
The enrichment pipeline uses the OpenAI API (configurable):
| Setting | Source | Description |
|---|---|---|
| OPENAI_API_KEY | Environment variable | API key for LLM access |
| Model | config.yaml | Model identifier (e.g., gpt-4o) |
| Max tokens | config.yaml | Response token limit |
| Temperature | config.yaml | Sampling temperature (kept low for factual, consistent metadata) |
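One way to combine these two sources is sketched below. The llm section and its key names (model, max_tokens, temperature), the default values, and the LLMSettings class are assumptions; the actual layout of config.yaml is not shown on this page. The config argument is the YAML file already parsed into a dict.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    model: str
    max_tokens: int
    temperature: float


def load_llm_settings(config, env=os.environ):
    """Merge the API key from the environment with parsed config.yaml values.

    The key names and defaults here are hypothetical.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    llm = config.get("llm", {})
    return LLMSettings(
        api_key=api_key,
        model=llm.get("model", "gpt-4o"),
        max_tokens=int(llm.get("max_tokens", 1024)),
        temperature=float(llm.get("temperature", 0.1)),
    )


# Example with an in-memory config and a fake environment
settings = load_llm_settings(
    {"llm": {"model": "gpt-4o", "max_tokens": 2048, "temperature": 0.0}},
    env={"OPENAI_API_KEY": "sk-example"},
)
```

Keeping the secret in the environment and the tunable parameters in the config file means the config can be committed to version control without exposing credentials.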
DQ Rule Generation
As part of enrichment, the LLM suggests data quality (DQ) rules for each column:
| Rule Type | Example |
|---|---|
| Completeness | "email column should be 100% non-null" |
| Format | "phone column should match pattern XXX-XXX-XXXX" |
| Range | "age column should be between 0 and 150" |
| Uniqueness | "customer_id should be unique" |
| Referential | "product_id should exist in products table" |
| Consistency | "end_date should be after start_date" |
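To make the rule types concrete, here is a small sketch that evaluates one rule of each type against in-memory rows. The helper functions and the sample data are invented for illustration; how the platform actually stores and executes suggested rules is not covered here.

```python
import re
from datetime import date

# Made-up sample rows; the second row violates several rules on purpose.
rows = [
    {"customer_id": 1, "email": "a@example.com", "phone": "555-123-4567",
     "age": 34, "product_id": 10,
     "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)},
    {"customer_id": 2, "email": None, "phone": "5551234567",
     "age": 200, "product_id": 99,
     "start_date": date(2024, 3, 1), "end_date": date(2024, 2, 1)},
]
product_ids = {10, 11}  # stand-in for the products table


def is_complete(rows, column):
    return all(r[column] is not None for r in rows)


def matches_format(rows, column, pattern):
    regex = re.compile(pattern)
    return all(regex.fullmatch(r[column]) is not None for r in rows)


def in_range(rows, column, low, high):
    return all(low <= r[column] <= high for r in rows)


def is_unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def exists_in(rows, column, valid_keys):
    return all(r[column] in valid_keys for r in rows)


def is_after(rows, later, earlier):
    return all(r[later] > r[earlier] for r in rows)


report = {
    "completeness": is_complete(rows, "email"),
    "format": matches_format(rows, "phone", r"\d{3}-\d{3}-\d{4}"),
    "range": in_range(rows, "age", 0, 150),
    "uniqueness": is_unique(rows, "customer_id"),
    "referential": exists_in(rows, "product_id", product_ids),
    "consistency": is_after(rows, "end_date", "start_date"),
}
```

With this sample data only the uniqueness check passes; every other rule is violated by the second row.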
Enrichment Results
Access enrichment results via the Data Governance API.
Next Steps
- Data Profiling: Profile your data before enrichment to provide statistical context.
- Workflows: Learn about the Prefect orchestration system that runs enrichment.
- GDPR Compliance: Understand PII classification in the context of GDPR.
- Data Classification: Review how enrichment outputs map to data classification levels.

