Data Profiling
Data profiling is the foundation of the Data Governance solution. It analyzes connected databases to produce statistical profiles of tables and columns, identifying data quality issues, completeness gaps, and structural anomalies.
What Data Profiling Captures
- Column-Level Metrics
- Table-Level Metrics
For each column in a profiled table, the system computes:
| Metric | Description |
|---|---|
| Completeness | Percentage of non-null values |
| Uniqueness | Ratio of distinct values to total rows |
| Data type distribution | Actual types vs declared column type |
| Min / Max / Mean | Statistical bounds for numeric columns |
| Standard deviation | Spread of numeric values |
| Pattern analysis | Common formats (email, phone, date patterns) |
| Top N values | Most frequent values and their counts |
| Outlier detection | Values more than 2 standard deviations from the mean |
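The column-level metrics in the table above can be illustrated with a minimal in-memory sketch. This is not the platform's implementation: the function name `profile_column`, its parameters, and the sample data are hypothetical, and it assumes that `None` represents NULL, that uniqueness counts distinct non-null values, and that the population standard deviation is used.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values, top_n=3, outlier_sigma=2.0):
    """Compute a small column profile from raw values (None stands for NULL)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    profile = {
        # Completeness: percentage of non-null values
        "completeness_pct": 100.0 * len(non_null) / total if total else 0.0,
        # Uniqueness: distinct non-null values divided by total rows (assumption)
        "uniqueness": len(set(non_null)) / total if total else 0.0,
        # Top N values: most frequent values and their counts
        "top_values": Counter(non_null).most_common(top_n),
    }
    # Numeric statistics only apply to int/float values (bool is excluded)
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        mu = mean(numeric)
        sigma = pstdev(numeric)  # population standard deviation (assumption)
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=mu,
            stddev=sigma,
            # Outliers: values more than `outlier_sigma` deviations from the mean
            outliers=[v for v in numeric if sigma and abs(v - mu) > outlier_sigma * sigma],
        )
    return profile


# Hypothetical sample column with one NULL and one extreme value
profile = profile_column([10, 12, 11, 13, 12, None, 95, 11])
```

In a real deployment these statistics would more likely be computed by SQL aggregates in the source database than by pulling rows into memory; the sketch only shows what each metric means.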
Profiling Workflow
Data profiling is orchestrated as a Prefect workflow:
Select data assets
An administrator or data steward selects the tables and schemas to profile via the Data Governance API.
Profiling workflow starts
The Prefect flow launches with ConcurrentTaskRunner for parallel table processing. Each table is profiled as an independent task.
Column analysis
For each table, the workflow queries the database to compute column-level statistics. Queries are optimized using sampling for large tables.
Results stored
Profiling results are stored in the Data Governance database (datareadiness) with timestamps for historical tracking.
Concurrency Control
Profiling workflows use hybrid concurrency to process large databases efficiently without overwhelming the source:
| Level | Mechanism | Configuration |
|---|---|---|
| Flow-level | ConcurrentTaskRunner(max_workers=N) | config.yaml |
| Task-level | asyncio.Semaphore(N) | config.yaml |
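The task-level limit can be sketched with plain `asyncio`. This example does not use Prefect and is not the platform's code: `profile_table`, `profile_all`, the table names, and the limit of 2 are hypothetical stand-ins, and the `asyncio.sleep` call takes the place of real profiling queries. It shows how an `asyncio.Semaphore(N)` caps the number of tables being queried at once even when every table is scheduled as its own task.

```python
import asyncio

# Hypothetical limit; the document says the real value of N comes from config.yaml
MAX_CONCURRENT_QUERIES = 2


async def profile_table(table, semaphore, stats):
    """Profile one table while holding a semaphore slot."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real profiling queries
        stats["active"] -= 1
    return table, "profiled"


async def profile_all(tables, limit=MAX_CONCURRENT_QUERIES):
    """Schedule every table as an independent task, bounded by the semaphore."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(profile_table(t, semaphore, stats) for t in tables)
    )
    return dict(results), stats["peak"]


tables = ["customers", "orders", "invoices", "payments", "refunds"]
results, peak = asyncio.run(profile_all(tables))
```

Here `peak` records the highest number of tables that held a slot at the same time, which never exceeds the configured limit. The flow-level `max_workers` setting would bound parallelism separately, at the task runner.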
Integration with Data Connections
Profiling uses data connections registered in the platform’s Assets service:
- The data steward selects a registered data connection
- The profiling workflow retrieves connection credentials from the Secrets API at runtime
- A read-only database session is established for profiling queries
- All profiling queries use the read-only user configured during data source setup
Profiling Results
Profiling results are accessible via the Data Governance API.
Scheduling Profiles
Profiling can be scheduled for recurring execution via the platform’s scheduling system:
- Daily profiles: Track data quality trends over time
- Weekly profiles: Standard cadence for most datasets
- On-demand: Triggered manually for new data sources or after schema changes
Next Steps
Data Enrichment
Enrich metadata after profiling with LLM-powered descriptions.
Workflows
Learn about the Prefect workflow orchestration system.
Data Source Setup
Connect a database to start profiling.
Data Classification
Understand how profiling data is classified and protected.

