Text-to-SQL Agent

The Text2SQL Agent is a specialized A2A agent that converts natural-language questions into SQL queries. It uses schema-aware generation, SQL validation via sqlglot, and safe execution against customer databases.

Pipeline versions

Two Text2SQL pipelines are available, controlled by a single configuration switch:

Legacy pipeline (default, in production): the generate → validate → execute flow described below. A single schema is fetched per question, validated, and executed.
Multi-schema pipeline (in rollout, behind a flag): introduces schema selection and enrichment steps so questions spanning multiple schemas are handled without requiring the user to pre-select a schema. The architecture is decided (ADRs 004/005) and the scaffolding is in place; production rollout is in progress.

Switch via text2sql.pipeline.use_new_pipeline in config.yaml. The legacy pipeline is the supported path in current deployments. The multi-schema pipeline introduces:

Schema table selection (ADR 004): a schema-selection step narrows the catalog to the schemas relevant to the question, using semantic search over the data catalog.
Raw DDL storage (ADR 005): column-level DDL is stored and retrieved for the selected schemas, giving the LLM richer type and constraint context without fetching the entire catalog.
CatalogSearchResult model and pipeline state fields (pruned_schema, enriched_schema, etc.) carrying multi-schema context across steps.
A new allow_multi_schema flag on CatalogSchemaProvider that bypasses the single-schema guard when the pipeline is active.

The rest of this page describes the legacy pipeline, which remains the default.

Pipeline Overview

The Text2SQL agent follows a generate, validate, execute pipeline built on the commons.pipeline.Pipeline framework:

Schema retrieval

The agent fetches the target database schema (tables, columns, types, relationships, foreign keys) from the data connection. This schema context is included in the LLM prompt for accurate SQL generation.

SQL generation

The LLM generates a SQL query based on the natural-language question and schema context. The prompt includes:

Database schema (tables, columns, types)
Table relationships and foreign keys
Previous conversation context (for follow-up questions)
Database dialect (PostgreSQL)

SQL validation

The generated SQL is validated using sqlglot:

Syntax validation: Ensures the SQL is syntactically correct
Dialect validation: Confirms compatibility with the target database dialect
Safety checks: Prevents destructive operations (DROP, DELETE, TRUNCATE, ALTER)
If validation fails, the agent retries generation with the error feedback

Query execution

The validated SQL is executed against the customer database via the data connection. The agent uses a read-only connection with query timeouts to prevent runaway queries.

Result formatting

Query results are formatted as structured data (DataPart) and returned to the calling agent via A2A events.

Schema-Aware Generation

The quality of generated SQL depends heavily on the schema context provided to the LLM. The Text2SQL agent builds a comprehensive schema representation:

# Schema context includes:
# - Table names and descriptions
# - Column names, types, and constraints
# - Primary and foreign key relationships
# - Sample values for enum-like columns
# - Table-level comments and documentation

For better SQL generation quality, add comments and descriptions to your database schema. The Text2SQL agent uses these to understand the semantic meaning of tables and columns.

SQL Validation

The agent uses sqlglot for multi-layer validation:

Check	Purpose
Parse	Verify the SQL is syntactically valid
Transpile	Confirm compatibility with the PostgreSQL dialect
Safety	Block destructive statements (DDL, DML mutations)
Column validation	Verify referenced columns exist in the schema

Safety Rules

The Text2SQL agent enforces strict safety rules:

Allowed	Blocked
`SELECT` queries	`INSERT`, `UPDATE`, `DELETE`
`WITH` (CTEs)	`DROP`, `ALTER`, `TRUNCATE`
Aggregate functions	`CREATE`, `GRANT`, `REVOKE`
Window functions	Raw `EXECUTE`, dynamic SQL
Subqueries	Statements that modify data or schema

Pipeline Framework

The Text2SQL agent uses the commons.pipeline.Pipeline state machine for managing the generate-validate-execute flow:

from commons.pipeline import Pipeline, Transition, StepError

class Text2SQLPipeline(Pipeline):
    async def generate(self, context):
        sql = await self.llm.generate_sql(context.question, context.schema)
        return Transition(goto="validate", data={"sql": sql})

    async def validate(self, context):
        try:
            validated = sqlglot.transpile(context.sql, read="postgres")
            return Transition(goto="execute", data={"sql": validated})
        except sqlglot.errors.ParseError as e:
            # Retry generation with error feedback
            return Transition(goto="generate", data={"error": str(e)})

    async def execute(self, context):
        results = await self.db.execute(context.sql)
        return Transition(goto="break", data={"results": results})

Steps return Transition objects that specify the next step (goto), with "break" signaling pipeline completion and "error" signaling failure. Built-in cooperative cancellation allows long-running pipelines to be interrupted gracefully.

Event Streaming

The Text2SQL agent emits A2A events at each pipeline step for real-time progress:

Step	Event	Message
Schema retrieval	`TaskStatusUpdateEvent`	”Analyzing database schema…”
SQL generation	`TaskStatusUpdateEvent`	”Generating SQL query…”
Validation	`TaskStatusUpdateEvent`	”Validating query…”
Execution	`TaskStatusUpdateEvent`	”Executing query…”
Results	`TaskArtifactUpdateEvent`	Query results as DataPart

Error Handling

The agent uses typed error hierarchies for recoverable failures:

Error Type	Handling
`SQLValidationError`	Retry SQL generation with validation error context
`QueryExecutionError`	Return user-friendly error message with the failed SQL
`SchemaRetrievalError`	Check data connection health and credentials
`QueryTimeoutError`	Suggest query simplification or filtering
Unexpected errors	Wrapped in `StepError` by the pipeline framework

Agent Card

The Text2SQL agent exposes its capabilities via the standard A2A Agent Card:

{
  "name": "text2sql",
  "description": "Converts natural-language questions to SQL queries",
  "url": "http://text2sql:8001",
  "skills": [
    {
      "id": "nl-to-sql",
      "name": "Natural Language to SQL",
      "description": "Generate and execute SQL from natural-language questions",
      "tags": ["sql", "database", "analytics"]
    }
  ]
}

Sample Data Preview

The Talk2Data Service exposes a POST /talk2data/v1/sample endpoint that returns a preview of rows from a database table without running a full analytical query. This is used by the frontend to let users inspect table contents before composing a natural-language question. Request fields:

Field	Type	Description
`table_fqn`	string	Fully qualified table name: `database.schema.table`
`resource_uri`	string	Data connection resource URI (from the Assets service)
`limit`	integer	Number of rows to return (default varies by connection type)

Error responses:

Status	Cause
`400`	Invalid `table_fqn` format (must be `database.schema.table`)
`404`	Data connection not found for the given `resource_uri`
`501`	Connection type does not support row sampling
`503`	`EM_RUNTIME_ASSETS_URL` not configured

The endpoint authenticates via the standard platform JWT and respects project-scoped data connection access.

MCP API changes (2026-05-13)

Two breaking changes shipped in em-talk2data v3.2.0 (commit ecdc18a):

generate-sql: schema field is now required

GenerateSQLRequest changed from an optional list to a required single object:

{
  "question": "How many orders shipped last month?",
  "database": "analytics",
  "schemas": [{ "tables": [...] }]
}

The JSON field name is schema (not schema_ — that is the Python attribute alias)
The field is required; omitting it returns 422 Unprocessable Entity with an HTTPValidationError body

execute-query: 422 responses now include error detail

execute-query previously returned a generic 422 with no body schema on SQL validation failures. It now returns a structured HTTPValidationError body, making it easier to surface validation failure reasons in MCP clients.

Deployment: missing env vars in local/CI Helm values

EM_RUNTIME_ASSETS_URL and EM_RUNTIME_UTILS_URL were missing from the talk2data-service env block in values.local.yaml and values.ci.yaml. This caused execute_query calls to return 503. Update local and CI overrides to include:

talk2data-service:
  env:
    EM_RUNTIME_ASSETS_URL: "http://em-runtime-assets:8000"   # local: subchart
    EM_RUNTIME_UTILS_URL:  "http://em-runtime-utils:8000"    # local: subchart

In GKE (values.yaml), the URLs use the cross-namespace service DNS (em-runtime-assets.em-runtime.svc.cluster.local) — these were already correct before this fix.

Next Steps

Chat With Data

See how Text2SQL fits into the full conversational analytics pipeline.

Analysis Agent

Learn how the Insights Agent orchestrates analysis beyond SQL.

Visualizations

Understand how query results become interactive charts.

Data Source Setup

Connect a database for Text2SQL to query.

Data Insights

Data Governance

Semiconductor

Customer Admin Guide

Text-to-SQL

Text-to-SQL Agent

Pipeline versions

Pipeline Overview

Schema-Aware Generation

SQL Validation

Safety Rules

Pipeline Framework

Event Streaming

Error Handling

Agent Card

Sample Data Preview

MCP API changes (2026-05-13)

generate-sql: schema field is now required

execute-query: 422 responses now include error detail

Deployment: missing env vars in local/CI Helm values

Next Steps

Chat With Data

Analysis Agent

Visualizations

Data Source Setup

Data Insights

Data Governance

Semiconductor

Customer Admin Guide

Documentation Index

​Text-to-SQL Agent

​Pipeline versions

​Pipeline Overview

​Schema-Aware Generation

​SQL Validation

​Safety Rules

​Pipeline Framework

​Event Streaming

​Error Handling

​Agent Card

​Sample Data Preview

​MCP API changes (2026-05-13)

​generate-sql: schema field is now required

​execute-query: 422 responses now include error detail

​Deployment: missing env vars in local/CI Helm values

​Next Steps

Chat With Data

Analysis Agent

Visualizations

Data Source Setup

Text-to-SQL Agent

Pipeline versions

Pipeline Overview

Schema-Aware Generation

SQL Validation

Safety Rules

Pipeline Framework

Event Streaming

Error Handling

Agent Card

Sample Data Preview

MCP API changes (2026-05-13)

generate-sql: schema field is now required

execute-query: 422 responses now include error detail

Deployment: missing env vars in local/CI Helm values

Next Steps