Eval Harness

Agent quality degrades silently. A prompt change, a model upgrade, or a new tool can change behaviour across many inputs without ever raising an exception. A structured eval harness catches these regressions before they reach users.

CRAFT uses Langfuse for observability and evaluation. Before setting up your eval harness, complete the Langfuse Setup Guide to connect your agent to your Langfuse project.

What to Evaluate

For each agent, define evaluations in three categories:

Category	Description	Example
Functional correctness	Does the agent produce the right output for a given input?	SQL matches expected query, chart type matches request
Tool-call accuracy	Does the agent call the right tools with correct arguments?	Calls `get_schema` before `execute_sql`; passes correct `schema_fqn`
Conversation behaviour	Does the agent handle edge cases gracefully?	Asks for clarification when datasource is missing; does not loop indefinitely
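
One way to keep the three categories explicit is to tag every eval case with its category, so a report can show which kind of check is missing or failing. The sketch below is illustrative only: `EvalCategory`, `EvalCase`, `CASES`, and `coverage` are hypothetical names, not part of CRAFT or Langfuse.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class EvalCategory(str, Enum):
    FUNCTIONAL = "functional_correctness"
    TOOL_CALL = "tool_call_accuracy"
    CONVERSATION = "conversation_behaviour"


@dataclass(frozen=True)
class EvalCase:
    name: str
    category: EvalCategory
    user_input: str


# Hypothetical sample cases, one per category
CASES = [
    EvalCase("revenue_by_quarter", EvalCategory.FUNCTIONAL, "Show revenue by quarter"),
    EvalCase("schema_before_sql", EvalCategory.TOOL_CALL, "How many orders last week?"),
    EvalCase("missing_datasource", EvalCategory.CONVERSATION, "Count the rows"),
]


def coverage(cases):
    """Count cases per category so gaps in coverage are easy to spot."""
    counts = Counter(case.category for case in cases)
    return {category.value: counts.get(category, 0) for category in EvalCategory}
```

A category that reports zero cases is a gap in the suite rather than a passing result.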

Setting Up Langfuse Tracing

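Every agent framework on CRAFT can send traces to Langfuse. Pydantic AI, Google ADK, and the Claude Agent SDK use the OTEL exporter; LangGraph uses the Langfuse LangChain callback handler.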

Pydantic AI

Configure Langfuse via the standard OTEL exporter. Add this at application startup:

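import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Hostname only, without the scheme; replace with your Langfuse host
LANGFUSE_HOST = "cloud.langfuse.com"
# Langfuse expects Basic auth over "<public key>:<secret key>", base64-encoded
langfuse_basic_auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()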

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=f"https://{LANGFUSE_HOST}/api/public/otel/v1/traces",
            headers={"Authorization": f"Basic {langfuse_basic_auth}"},
        )
    )
)
trace.set_tracer_provider(provider)

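Pydantic AI emits OTEL spans once instrumentation is enabled, for example by calling Agent.instrument_all() at startup or by passing instrument=True to an agent. With a TracerProvider configured, every agent run, tool call, and MCP request then appears as a structured span.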

Google ADK

Configure Langfuse via the standard OTEL exporter:

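import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Hostname only, without the scheme; replace with your Langfuse host
LANGFUSE_HOST = "cloud.langfuse.com"
# Langfuse expects Basic auth over "<public key>:<secret key>", base64-encoded
langfuse_basic_auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()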

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=f"https://{LANGFUSE_HOST}/api/public/otel/v1/traces",
            headers={"Authorization": f"Basic {langfuse_basic_auth}"},
        )
    )
)
trace.set_tracer_provider(provider)

ADK emits OTEL spans for each agent turn, tool invocation, and LLM call. Look for google.adk.agent root spans in Langfuse.

Claude Agent SDK

The Anthropic SDK emits OTEL spans automatically when a TracerProvider is configured — the same setup as Pydantic AI and ADK works without additional configuration:

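import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from anthropic import Anthropic

# Hostname only, without the scheme; replace with your Langfuse host
LANGFUSE_HOST = "cloud.langfuse.com"
# Langfuse expects Basic auth over "<public key>:<secret key>", base64-encoded
langfuse_basic_auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()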

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=f"https://{LANGFUSE_HOST}/api/public/otel/v1/traces",
            headers={"Authorization": f"Basic {langfuse_basic_auth}"},
        )
    )
)
trace.set_tracer_provider(provider)

# Anthropic SDK auto-emits spans (anthropic.messages.create root) when TracerProvider is set
client = Anthropic()

Look for anthropic.messages.create root spans in Langfuse, with tool_use child spans for each tool invocation.

LangGraph

LangGraph integrates with Langfuse via the LangChain callback handler (requires langfuse>=3.0):

# Requires: pip install "langfuse>=3.0"  (quote the spec so the shell does not treat > as a redirect)
from langfuse.langchain import CallbackHandler  # v3 import path (NOT langfuse.callback)
from langgraph.prebuilt import create_react_agent

langfuse_handler = CallbackHandler()  # reads LANGFUSE_PUBLIC_KEY/SECRET_KEY from env

graph = create_react_agent("anthropic:claude-opus-4-8", tools=[...])

# Pass handler per-invocation
result = await graph.ainvoke(
    {"messages": [{"role": "user", "content": "..."}]},
    config={"callbacks": [langfuse_handler]},
)

Spans appear in Langfuse as langgraph:node:<name> entries. Each node execution, LLM call, and tool invocation is captured separately.

Golden Traces

A golden trace is a recorded agent execution that represents the correct behaviour for a given input. Store golden traces in your test fixtures and replay them to catch regressions.

Recording a Golden Trace

import json
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch a specific trace by ID from Langfuse
# (v3 SDK call; on the v2 SDK use langfuse.fetch_trace(...).data instead)
trace = langfuse.api.trace.get("trace-id-from-langfuse-ui")

# Serialize to a fixture file
golden = {
    "input": trace.input,
    "expected_output": trace.output,
    "tool_calls": [
        {"name": span.name, "input": span.input, "output": span.output}
        for span in trace.observations
        # Observation names can be None, so guard before calling startswith
        if span.type == "SPAN" and (span.name or "").startswith("mcp.tool.")
    ],
}

with open("tests/fixtures/golden_revenue_query.json", "w") as f:
    json.dump(golden, f, indent=2)
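
Because every replay depends on the fixture shape written above (`input`, `expected_output`, `tool_calls`), it can help to validate fixtures when they are loaded instead of failing later inside an assertion. A minimal sketch, assuming that shape; `load_golden` and `REQUIRED_KEYS` are hypothetical helpers, not part of Langfuse or CRAFT:

```python
import json
from pathlib import Path

# Keys written by the recording script above
REQUIRED_KEYS = {"input", "expected_output", "tool_calls"}


def load_golden(path):
    """Load a golden-trace fixture and fail loudly if it is malformed."""
    golden = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - golden.keys()
    if missing:
        raise ValueError(f"{path}: missing keys {sorted(missing)}")
    for i, call in enumerate(golden["tool_calls"]):
        if "name" not in call:
            raise ValueError(f"{path}: tool_calls[{i}] has no 'name'")
    return golden
```

A malformed fixture then surfaces as a clear `ValueError` naming the file, not as a `KeyError` in the middle of a test run.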

Replaying Golden Traces

import pytest
import json
from my_agent.executor import MyAgentExecutor
from tests.harness import MockEventQueue, MockRequestContext


@pytest.mark.asyncio  # needs pytest-asyncio; redundant but harmless if asyncio_mode = "auto"
@pytest.mark.parametrize("fixture_path", [
    "tests/fixtures/golden_revenue_query.json",
    "tests/fixtures/golden_top_customers.json",
    "tests/fixtures/golden_missing_datasource.json",
])
async def test_golden_trace(fixture_path):
    with open(fixture_path) as f:
        golden = json.load(f)

    executor = MyAgentExecutor(...)
    event_queue = MockEventQueue()
    context = MockRequestContext(user_input=golden["input"])

    await executor.execute(context, event_queue)

    # Check final output
    final_event = event_queue.get_final_event()
    assert golden["expected_output"] in final_event.text

    # Check tool calls
    actual_tool_calls = event_queue.get_tool_calls()
    for expected in golden["tool_calls"]:
        assert any(
            tc["name"] == expected["name"] for tc in actual_tool_calls
        ), f"Expected tool call '{expected['name']}' not found"
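
The replay test above checks only that each expected tool was called at some point. If the order of calls also matters, a stricter comparison can treat the expected names as a subsequence of the actual names. This is a sketch of that variation, not the harness's own check; `unmatched_tool_calls` is a hypothetical helper.

```python
def unmatched_tool_calls(expected, actual):
    """Return expected tool names that do not appear in `actual` in order.

    Extra calls in `actual` are allowed; only the relative order of the
    expected calls is enforced.
    """
    position = 0
    unmatched = []
    for name in expected:
        try:
            # Search only after the previous match so order is respected
            position = actual.index(name, position) + 1
        except ValueError:
            unmatched.append(name)
    return unmatched
```

In the test, `expected` would be `[e["name"] for e in golden["tool_calls"]]` and `actual` the names from `event_queue.get_tool_calls()`; an empty result means every expected call was found in order.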

Langfuse Evaluators

Langfuse evaluators run scoring functions against traces. CRAFT agents use two types:

LLM-as-Judge Evaluators

Use a lightweight LLM to score agent outputs on dimensions like correctness, helpfulness, and groundedness:

import json  # used below to parse the judge's JSON reply

from langfuse import Langfuse

langfuse = Langfuse()

def create_sql_correctness_evaluator():
    """Create a Langfuse evaluator that scores SQL correctness."""
    return langfuse.create_score_config(
        name="sql_correctness",
        data_type="NUMERIC",
        min_value=0,
        max_value=1,
    )


def score_sql_output(trace_id: str, generated_sql: str, expected_sql: str):
    """Score a trace using an LLM judge."""
    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Score the SQL query correctness from 0 to 1.
Expected SQL: {expected_sql}
Generated SQL: {generated_sql}

Return only a JSON object: {{"score": 0.95, "reason": "..."}}"""
        }],
    )

    result = json.loads(response.content[0].text)

    # create_score is the v3 SDK method; the v2 SDK named it langfuse.score
    langfuse.create_score(
        trace_id=trace_id,
        name="sql_correctness",
        value=result["score"],
        comment=result["reason"],
    )
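
Judge models do not always return bare JSON: the object may be wrapped in prose, or the score may fall outside the requested range. A defensive parser avoids recording a bogus score in those cases. The sketch below assumes the `{"score": ..., "reason": ...}` format requested in the prompt above; `parse_judge_response` is a hypothetical helper.

```python
import json
import re


def parse_judge_response(text):
    """Extract {"score", "reason"} from judge output, clamping score to [0, 1].

    Returns None when no JSON object with a numeric score is found, so the
    caller can skip scoring instead of recording a bogus value.
    """
    # Take everything from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = payload.get("score")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return {
        "score": min(1.0, max(0.0, float(score))),
        "reason": str(payload.get("reason", "")),
    }
```

With this in place, `score_sql_output` could call `parse_judge_response(response.content[0].text)` and record a score only when the result is not `None`.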

Deterministic Evaluators

For structured outputs where correctness can be computed without an LLM:

def score_tool_call_order(trace_id: str, tool_calls: list[dict]):
    """Score whether the agent called tools in the correct order."""
    names = [tc["name"] for tc in tool_calls]

    # get_schema must precede execute_sql
    if "execute_sql" in names and "get_schema" in names:
        schema_idx = names.index("get_schema")
        sql_idx = names.index("execute_sql")
        score = 1.0 if schema_idx < sql_idx else 0.0
    elif "execute_sql" in names:
        score = 0.0  # called execute_sql without schema check
    else:
        score = 1.0  # no SQL — correct for non-SQL queries

    # create_score is the v3 SDK method; the v2 SDK named it langfuse.score
    langfuse.create_score(
        trace_id=trace_id,
        name="tool_call_order",
        value=score,
    )
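
The "does not loop indefinitely" behaviour from the evaluation categories can be checked the same way. A minimal sketch, assuming tool calls are dicts with `name` and `input` keys as in the golden fixtures; `longest_repeat_run`, `score_no_tool_loop`, and the `max_repeats` threshold are hypothetical and should be tuned per agent.

```python
import json


def longest_repeat_run(tool_calls):
    """Length of the longest run of consecutive identical tool calls."""
    longest = current = 0
    previous = None
    for call in tool_calls:
        # Serialise name + arguments so identical calls compare equal
        key = json.dumps([call["name"], call.get("input")], sort_keys=True, default=str)
        current = current + 1 if key == previous else 1
        longest = max(longest, current)
        previous = key
    return longest


def score_no_tool_loop(tool_calls, max_repeats=3):
    """Return 1.0 unless the same call repeats more than `max_repeats` times in a row."""
    return 1.0 if longest_repeat_run(tool_calls) <= max_repeats else 0.0
```

The returned value can be recorded against the trace under a name such as `no_tool_loop`, in the same way as `tool_call_order` above.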

Regression Suite Structure

Organise your test fixtures by scenario type. For the text2sql agent, a well-structured regression suite includes:

tests/
├── fixtures/
│   ├── happy_path/
│   │   ├── simple_count.json
│   │   ├── revenue_by_quarter.json
│   │   └── top_customers.json
│   ├── edge_cases/
│   │   ├── missing_datasource.json      # Agent should return state=failed
│   │   ├── unsupported_dialect.json
│   │   └── empty_result_set.json
│   └── cancellation/
│       └── cancel_mid_execution.json
├── test_golden_traces.py
├── test_tool_call_order.py
└── harness.py                           # MockEventQueue, MockRequestContext
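
The contents of `harness.py` are not shown in this guide. As a rough sketch of what the mocks might look like, assuming the executor pushes events through an async `enqueue_event` method and that events are either text or tool calls: `MockEvent`, its `kind` values, and `enqueue_event` are assumptions for illustration, so adapt them to the event interface your executor actually uses.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MockEvent:
    kind: str                          # "text" or "tool_call" (hypothetical kinds)
    text: str = ""
    tool_call: Optional[dict] = None   # e.g. {"name": "get_schema", "input": {...}}


@dataclass
class MockRequestContext:
    user_input: Any


@dataclass
class MockEventQueue:
    events: list = field(default_factory=list)

    async def enqueue_event(self, event):
        """Record every event the executor emits, in order."""
        self.events.append(event)

    def get_final_event(self):
        """Return the last text event, which the tests treat as the answer."""
        text_events = [e for e in self.events if e.kind == "text"]
        if not text_events:
            raise AssertionError("agent produced no text event")
        return text_events[-1]

    def get_tool_calls(self):
        """Return tool-call payloads in the order they were emitted."""
        return [e.tool_call for e in self.events if e.kind == "tool_call"]
```

Keeping the mocks in one module lets every test file share the same view of what the agent emitted.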

Running Evals in CI

Add eval runs to your CI pipeline to catch regressions on pull requests:

# .github/workflows/eval.yml
name: Agent Eval

on:
  pull_request:
    paths:
      - "packages/my_agent/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5   # provides the uv command used below
      - name: Run regression suite
        env:
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
        run: |
          uv run pytest tests/test_golden_traces.py -v \
            --tb=short \
            -x   # fail fast on first regression
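
Golden-trace tests pass or fail outright, but evaluator scores are graded. To gate a pull request on those scores, one option is a small check that compares the mean score per metric with a minimum. This is a sketch under assumed names (`failing_metrics`, the `scores` and `thresholds` dictionaries); how the scores are collected is left to your pipeline.

```python
from statistics import mean


def failing_metrics(scores, thresholds):
    """Return {metric: mean_score} for metrics whose mean is below its threshold.

    `scores` maps metric name -> list of per-case scores.
    A metric with no scores is reported as failing with a mean of 0.0.
    """
    failures = {}
    for metric, minimum in thresholds.items():
        values = scores.get(metric, [])
        average = mean(values) if values else 0.0
        if average < minimum:
            failures[metric] = average
    return failures
```

A CI step could call this after the eval run and exit with a non-zero status when the result is non-empty.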

Cost-Aware Evaluation

Track token usage per eval run to catch cost regressions alongside quality regressions:

# After each eval run, log usage to Langfuse
# (create_score is the v3 SDK method; the v2 SDK named it langfuse.score)
langfuse.create_score(
    trace_id=trace_id,
    name="input_tokens",
    value=usage.input_tokens,
    data_type="NUMERIC",
)
langfuse.create_score(
    trace_id=trace_id,
    name="output_tokens",
    value=usage.output_tokens,
    data_type="NUMERIC",
)
langfuse.create_score(
    trace_id=trace_id,
    name="tool_call_count",
    value=toolset.tool_call_count,
    data_type="NUMERIC",
)

Set budget alerts in Langfuse that fire when token usage per task exceeds a threshold. This catches prompt regressions that inflate the input context or cause tool-call loops.
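
The same threshold idea can run inside the test suite by comparing each run's usage with a stored baseline. A minimal sketch, assuming usage and baseline are plain dictionaries keyed by the score names above; `over_budget` and the 20% default tolerance are hypothetical choices.

```python
def over_budget(usage, baseline, tolerance=0.2):
    """Return metrics whose usage exceeds the baseline by more than `tolerance`.

    tolerance=0.2 allows a 20% increase over the baseline before flagging.
    """
    flagged = {}
    for metric, allowed in baseline.items():
        limit = allowed * (1 + tolerance)
        actual = usage.get(metric, 0)
        if actual > limit:
            flagged[metric] = {"actual": actual, "limit": limit}
    return flagged
```

An empty result means the run stayed within budget; a non-empty one names the metric that grew, which usually points at an inflated prompt or a tool-call loop.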

Next Steps

Debugging Agents

Inspect traces and diagnose failures when evals surface regressions.