
Evaluations

This example demonstrates Heartbit’s evaluation framework. Define test cases with expected behaviors, run them against an agent, and score the results using trajectory and keyword scorers.

Prerequisites

  • ANTHROPIC_API_KEY environment variable set with a valid API key

Run the example from a terminal:

```shell
export ANTHROPIC_API_KEY="sk-..."
cargo run -p heartbit --example eval
```
```rust
use std::sync::Arc;

use heartbit::{
    AgentRunner, AnthropicProvider, EvalCase, EvalRunner, EvalSummary, KeywordScorer,
    TrajectoryScorer, build_eval_agent,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key =
        std::env::var("ANTHROPIC_API_KEY").expect("set ANTHROPIC_API_KEY environment variable");
    let provider = Arc::new(AnthropicProvider::new(api_key, "claude-sonnet-4-20250514"));

    // Build an eval-ready agent with event collection for trajectory scoring.
    let builder = AgentRunner::builder(provider)
        .name("eval-agent")
        .system_prompt("You are a helpful assistant. Be concise.")
        .max_turns(3)
        .max_tokens(1024);
    let (agent, collector) = build_eval_agent(builder)?;

    // Define test cases.
    let cases = vec![
        EvalCase::new("greeting", "Say hello")
            .expect_no_tools()
            .expect_output_contains("hello"),
        EvalCase::new("math", "What is 7 * 6? Just the number.")
            .expect_no_tools()
            .expect_output_contains("42"),
    ];

    // Run evaluations and score.
    let runner = EvalRunner::new()
        .scorer(TrajectoryScorer)
        .scorer(KeywordScorer);
    let results = runner.run(&agent, &cases).await;

    // Use the collector for trajectory data on the last case.
    let tool_calls = EvalRunner::collected_tool_calls(&collector);
    eprintln!("[tool calls captured: {tool_calls:?}]");

    let summary = EvalSummary::from_results(&results);
    println!("{summary}");
    Ok(())
}
```

Eval-ready agent

build_eval_agent(builder) wraps a standard AgentRunner builder and returns the agent plus a collector that captures events (tool calls, LLM responses) during execution. This is needed for trajectory scoring.

Defining test cases

EvalCase::new(name, prompt) creates a test case. Chain expectations:

  • .expect_no_tools() — the agent should answer without calling any tools
  • .expect_output_contains("hello") — the final output must contain this keyword (case-insensitive)

You can also use .expect_tool_call("tool_name") to assert that a specific tool was invoked.
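For instance, a case that should exercise a tool might look like the following sketch. Here "search" is a hypothetical tool name chosen for illustration; the agent would need such a tool registered on its builder for this case to pass:

```
use heartbit::EvalCase;

// Hypothetical: "search" is not a tool registered in the example above.
let case = EvalCase::new("lookup", "Find the capital of France")
    .expect_tool_call("search")
    .expect_output_contains("paris");
```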

Scorers

The EvalRunner accepts multiple scorers that each produce a score:

  • TrajectoryScorer — evaluates whether the agent’s tool-call trajectory matches expectations (did it call the right tools in the right order?)
  • KeywordScorer — checks whether expected keywords appear in the output
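The keyword check is easy to picture: a case-insensitive substring match over the final output. Here is a minimal standalone sketch of that behavior (an assumption for illustration; the library's actual scorer may aggregate or weight matches differently):

```rust
/// Case-insensitive keyword check, sketched to mirror the behavior
/// described above. Assumed detail: the score here is the fraction of
/// expected keywords found in the output.
fn keyword_score(output: &str, expected: &[&str]) -> f64 {
    if expected.is_empty() {
        return 1.0; // nothing expected, trivially satisfied
    }
    let haystack = output.to_lowercase();
    let hits = expected
        .iter()
        .filter(|k| haystack.contains(&k.to_lowercase()))
        .count();
    hits as f64 / expected.len() as f64
}

fn main() {
    // "hello" is found case-insensitively in "Hello there!".
    println!("{:.2}", keyword_score("Hello there!", &["hello"])); // prints 1.00
}
```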

Running evals

runner.run(&agent, &cases) executes each case sequentially, collecting results. Each result includes per-scorer scores and pass/fail status.

Summary

EvalSummary::from_results(&results) aggregates results into a printable summary with the overall pass rate, per-case breakdowns, and scorer details.
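The headline pass rate is plain arithmetic over per-case pass/fail flags. A standalone sketch of that aggregation (an assumption for illustration; EvalSummary itself also tracks per-scorer detail):

```rust
/// Percentage of passing cases, formatted the way the summary prints it.
/// Sketch only; the real EvalSummary aggregates more than this.
fn pass_rate(passed: &[bool]) -> f64 {
    if passed.is_empty() {
        return 0.0;
    }
    let n = passed.iter().filter(|&&p| p).count();
    100.0 * n as f64 / passed.len() as f64
}

fn main() {
    println!("Pass rate: {:.1}%", pass_rate(&[true, true])); // prints Pass rate: 100.0%
}
```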

Tool call inspection

EvalRunner::collected_tool_calls(&collector) extracts the list of tool calls made during evaluation, which is useful for debugging unexpected agent behavior.

Both cases should pass since they are simple tasks that do not require tools:

```
[tool calls captured: []]
Eval Summary
============
Cases: 2 | Passed: 2 | Failed: 0
Pass rate: 100.0%
greeting: PASS (trajectory: 1.00, keyword: 1.00)
math: PASS (trajectory: 1.00, keyword: 1.00)
```