Eval Harness
Agent quality degrades silently. A prompt change, a model upgrade, or a new tool can alter behaviour across hundreds of evaluation dimensions without surfacing as an exception. A structured eval harness catches regressions before they reach users.
CRAFT uses Langfuse for observability and evaluation. Before setting up your eval harness, complete the Langfuse Setup Guide to connect your agent to your Langfuse project.
What to Evaluate
For each agent, define evaluations in three categories:

| Category | Description | Example |
|---|---|---|
| Functional correctness | Does the agent produce the right output for a given input? | SQL matches expected query, chart type matches request |
| Tool-call accuracy | Does the agent call the right tools with correct arguments? | Calls get_schema before execute_sql; passes correct schema_fqn |
| Conversation behaviour | Does the agent handle edge cases gracefully? | Asks for clarification when datasource is missing; does not loop indefinitely |
Setting Up Langfuse Tracing
All agent frameworks on CRAFT support Langfuse tracing via the OTEL exporter.
Pydantic AI
Configure Langfuse via the standard OTEL exporter at application startup. Once a TracerProvider is configured, every agent run, tool call, and MCP request appears as a structured span.
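A minimal sketch of that startup configuration, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are installed and that Langfuse credentials are supplied through `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and optionally `LANGFUSE_HOST`. The helper names `langfuse_otlp_settings` and `configure_tracing` are illustrative, not part of CRAFT or Langfuse.

```python
import base64
import os


def langfuse_otlp_settings(public_key: str, secret_key: str,
                           host: str = "https://cloud.langfuse.com") -> dict:
    """Build the OTLP/HTTP endpoint and Basic-auth header for Langfuse."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "endpoint": f"{host.rstrip('/')}/api/public/otel/v1/traces",
        "headers": {"Authorization": f"Basic {token}"},
    }


def configure_tracing() -> None:
    """Install a global TracerProvider that exports spans to Langfuse."""
    # Imported lazily so the helper above works without the OTEL packages.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    settings = langfuse_otlp_settings(
        os.environ["LANGFUSE_PUBLIC_KEY"],
        os.environ["LANGFUSE_SECRET_KEY"],
        os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(**settings)))
    trace.set_tracer_provider(provider)
```

Call `configure_tracing()` once, before the first agent is constructed, so that spans from the first run are not lost.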
Google ADK
Configure Langfuse via the standard OTEL exporter. Agent runs then appear as google.adk.agent root spans in Langfuse.
Claude Agent SDK
The Anthropic SDK emits OTEL spans automatically when a TracerProvider is configured, so the same setup as Pydantic AI and ADK works without additional configuration.
Calls appear as anthropic.messages.create root spans in Langfuse, with tool_use child spans for each tool invocation.
LangGraph
LangGraph integrates with Langfuse via the LangChain callback handler (requires langfuse>=3.0).
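A minimal sketch of attaching the handler, assuming `langfuse>=3.0` is installed and credentials come from the `LANGFUSE_*` environment variables. `invoke_with_langfuse` is a hypothetical wrapper; `graph` stands for any compiled LangGraph graph.

```python
from typing import Any


def invoke_with_langfuse(graph: Any, inputs: dict, handler: Any = None) -> Any:
    """Invoke a compiled graph with a Langfuse callback handler attached."""
    if handler is None:
        # Import path used by langfuse>=3.0; imported lazily so tests can
        # inject a stand-in handler without the package installed.
        from langfuse.langchain import CallbackHandler

        handler = CallbackHandler()
    # LangChain-style callbacks are passed through the run config.
    return graph.invoke(inputs, config={"callbacks": [handler]})
```

Accepting the handler as a parameter keeps the wrapper testable and lets callers reuse one handler across several invocations.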
Traces appear in Langfuse with langgraph:node:<name> entries. Each node execution, LLM call, and tool invocation is captured separately.
Golden Traces
A golden trace is a recorded agent execution that represents the correct behaviour for a given input. Store golden traces in your test fixtures and replay them to catch regressions.
Recording a Golden Trace
Replaying Golden Traces
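Replaying can be sketched as re-running the agent on the recorded input and diffing the result against the fixture. The example assumes a fixture with `tool_calls` (each with `name` and `arguments`) and `final_output` keys; that layout and the function name `diff_against_golden` are assumptions for illustration.

```python
def diff_against_golden(golden: dict, actual_tool_calls: list[dict],
                        actual_output: str) -> list[str]:
    """Return human-readable differences; an empty list means no regression."""
    problems = []
    expected_calls = golden["tool_calls"]
    expected_names = [call["name"] for call in expected_calls]
    actual_names = [call["name"] for call in actual_tool_calls]

    if expected_names != actual_names:
        problems.append(
            f"tool sequence changed: expected {expected_names}, got {actual_names}"
        )
    else:
        # Only compare arguments when the call sequence itself still matches.
        for expected, actual in zip(expected_calls, actual_tool_calls):
            if expected["arguments"] != actual["arguments"]:
                problems.append(f"arguments changed for {expected['name']}")

    if golden["final_output"].strip() != actual_output.strip():
        problems.append("final output changed")
    return problems
```

Exact output comparison is strict; for free-text answers, a scoring function is usually a better fit than string equality.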
Langfuse Evaluators
Langfuse evaluators run scoring functions against traces. CRAFT agents use two types:
LLM-as-Judge Evaluators
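A minimal LLM-as-judge scorer might look like the sketch below. The judge model is injected as a plain `call_llm` callable so the example stays provider-neutral; the prompt wording, the JSON reply format, and the function names are assumptions, and attaching the resulting score to a Langfuse trace is left out.

```python
import json
import re
from typing import Callable


def build_judge_prompt(question: str, answer: str, dimension: str) -> str:
    return (
        f"Rate the answer for {dimension} on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON only, for example {"score": 0.5}.'
    )


def parse_judge_score(reply: str) -> float:
    """Extract the score from the judge reply and clamp it to [0, 1]."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge reply: {reply!r}")
    score = float(json.loads(match.group(0))["score"])
    return max(0.0, min(1.0, score))


def judge(call_llm: Callable[[str], str], question: str, answer: str,
          dimension: str = "correctness") -> float:
    """Score one agent answer on one dimension using the supplied LLM."""
    return parse_judge_score(call_llm(build_judge_prompt(question, answer, dimension)))
```

Clamping and strict parsing matter because judge models occasionally wrap the JSON in prose or return out-of-range values.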
Use a lightweight LLM to score agent outputs on dimensions like correctness, helpfulness, and groundedness.
Deterministic Evaluators
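Two deterministic checks in the spirit of the examples in the table above, sketched with hypothetical function names: an SQL comparison that ignores whitespace, case, and a trailing semicolon, and a tool-ordering check.

```python
import re


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Note: lowercasing also affects string literals, so this is only
    suitable when literal case does not matter.
    """
    collapsed = re.sub(r"\s+", " ", sql).strip()
    return collapsed.rstrip(";").strip().lower()


def sql_exact_match(expected: str, actual: str) -> float:
    return 1.0 if normalize_sql(expected) == normalize_sql(actual) else 0.0


def called_in_order(tool_names: list[str], first: str, then: str) -> float:
    """1.0 if `first` is called before the first call to `then`, else 0.0."""
    if first not in tool_names or then not in tool_names:
        return 0.0
    return 1.0 if tool_names.index(first) < tool_names.index(then) else 0.0
```

For example, `called_in_order(names, "get_schema", "execute_sql")` encodes the rule that the agent calls get_schema before execute_sql.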
Use deterministic evaluators for structured outputs where correctness can be computed without an LLM.
Regression Suite Structure
Organise your test fixtures by scenario type. For the text2sql agent, a well-structured regression suite includes:
Running Evals in CI
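A CI step can gate a pull request by comparing aggregate eval scores against minimum thresholds and returning a non-zero exit code on failure. The metric names, threshold values, and function names below are illustrative assumptions; how scores are produced is outside this sketch.

```python
import sys


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Describe every metric that is missing or below its threshold."""
    failures = []
    for name, minimum in sorted(thresholds.items()):
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: no score reported")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return failures


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 when all thresholds are met, else 1."""
    failures = failing_metrics(scores, thresholds)
    for line in failures:
        print(f"EVAL REGRESSION {line}", file=sys.stderr)
    return 1 if failures else 0
```

In the pipeline, the script would end with `sys.exit(gate(scores, thresholds))` so the job fails when any metric regresses.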
Add eval runs to your CI pipeline to catch regressions on pull requests.
Cost-Aware Evaluation
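A cost check can be sketched as summing tokens over an eval run and comparing the total with a stored baseline. The record shape (`input_tokens`, `output_tokens`), the 10% default tolerance, and the function names are assumptions for illustration.

```python
def total_tokens(runs: list[dict]) -> int:
    """Sum input and output tokens across all cases in one eval run."""
    return sum(run["input_tokens"] + run["output_tokens"] for run in runs)


def cost_regressed(baseline_tokens: int, current_tokens: int,
                   tolerance: float = 0.10) -> bool:
    """True when usage grew by more than `tolerance` over the baseline."""
    return current_tokens > baseline_tokens * (1 + tolerance)
```

A tolerance band avoids failing the build on normal run-to-run variation while still flagging a prompt or tool change that sharply increases token usage.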
Track token usage per eval run to catch cost regressions alongside quality regressions.
Next Steps
Debugging Agents
Inspect traces and diagnose failures when evals surface regressions.

