
Evaluations

This example demonstrates Heartbit’s evaluation framework. Define test cases with expected behaviors, run them against an agent, and score the results using trajectory and keyword scorers.

Prerequisites

  • ANTHROPIC_API_KEY environment variable set with a valid API key

Run the example from a terminal:

```shell
export ANTHROPIC_API_KEY="sk-..."
cargo run -p heartbit --example eval
```
```rust
use std::sync::Arc;

use heartbit::{
    AgentRunner, AnthropicProvider, EvalCase, EvalRunner, EvalSummary, KeywordScorer,
    TrajectoryScorer, build_eval_agent,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key =
        std::env::var("ANTHROPIC_API_KEY").expect("set ANTHROPIC_API_KEY environment variable");
    let provider = Arc::new(AnthropicProvider::new(api_key, "claude-sonnet-4-20250514"));

    // Build an eval-ready agent with event collection for trajectory scoring.
    let builder = AgentRunner::builder(provider)
        .name("eval-agent")
        .system_prompt("You are a helpful assistant. Be concise.")
        .max_turns(3)
        .max_tokens(1024);
    let (agent, collector) = build_eval_agent(builder)?;

    // Define test cases.
    let cases = vec![
        EvalCase::new("greeting", "Say hello")
            .expect_no_tools()
            .expect_output_contains("hello"),
        EvalCase::new("math", "What is 7 * 6? Just the number.")
            .expect_no_tools()
            .expect_output_contains("42"),
    ];

    // Run evaluations and score.
    let runner = EvalRunner::new()
        .scorer(TrajectoryScorer)
        .scorer(KeywordScorer);
    let results = runner.run(&agent, &cases).await;

    // Use the collector for trajectory data on the last case.
    let tool_calls = EvalRunner::collected_tool_calls(&collector);
    eprintln!("[tool calls captured: {tool_calls:?}]");

    let summary = EvalSummary::from_results(&results);
    println!("{summary}");
    Ok(())
}
```

Eval-ready agent

build_eval_agent(builder) wraps a standard AgentRunner builder and returns the agent plus a collector that captures events (tool calls, LLM responses) during execution. This is needed for trajectory scoring.

Defining test cases

EvalCase::new(name, prompt) creates a test case. Chain expectations:

  • .expect_no_tools() — the agent should answer without calling any tools
  • .expect_output_contains("hello") — the final output must contain this keyword (case-insensitive)

You can also use .expect_tool_call("tool_name") to assert that a specific tool was invoked.
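For instance, a case that should exercise a tool might look like the following sketch. Here "search" is a hypothetical tool name chosen for illustration; the agent would need such a tool registered on its builder for this case to pass:

```
use heartbit::EvalCase;

// Hypothetical: "search" is not a tool registered in the example above.
let case = EvalCase::new("lookup", "Find the capital of France")
    .expect_tool_call("search")
    .expect_output_contains("paris");
```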

Scorers

The EvalRunner accepts multiple scorers that each produce a score:

  • TrajectoryScorer — evaluates whether the agent’s tool-call trajectory matches expectations (did it call the right tools in the right order?)
  • KeywordScorer — checks whether expected keywords appear in the output
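The keyword check is easy to picture: a case-insensitive substring match over the final output. Here is a minimal standalone sketch of that behavior (an assumption for illustration; the library's actual scorer may aggregate or weight matches differently):

```rust
/// Case-insensitive keyword check, sketched to mirror the behavior
/// described above. Assumed detail: the score here is the fraction of
/// expected keywords found in the output.
fn keyword_score(output: &str, expected: &[&str]) -> f64 {
    if expected.is_empty() {
        return 1.0; // nothing expected, trivially satisfied
    }
    let haystack = output.to_lowercase();
    let hits = expected
        .iter()
        .filter(|k| haystack.contains(&k.to_lowercase()))
        .count();
    hits as f64 / expected.len() as f64
}

fn main() {
    // "hello" is found case-insensitively in "Hello there!".
    println!("{:.2}", keyword_score("Hello there!", &["hello"])); // prints 1.00
}
```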

Running evals

runner.run(&agent, &cases) executes each case sequentially, collecting results. Each result includes per-scorer scores and pass/fail status.

Summary

EvalSummary::from_results(&results) aggregates results into a printable summary with the overall pass rate, per-case breakdowns, and scorer details.
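The headline pass rate is plain arithmetic over per-case pass/fail flags. A standalone sketch of that aggregation (an assumption for illustration; EvalSummary itself also tracks per-scorer detail):

```rust
/// Percentage of passing cases, formatted the way the summary prints it.
/// Sketch only; the real EvalSummary aggregates more than this.
fn pass_rate(passed: &[bool]) -> f64 {
    if passed.is_empty() {
        return 0.0;
    }
    let n = passed.iter().filter(|&&p| p).count();
    100.0 * n as f64 / passed.len() as f64
}

fn main() {
    println!("Pass rate: {:.1}%", pass_rate(&[true, true])); // prints Pass rate: 100.0%
}
```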

Tool call inspection

EvalRunner::collected_tool_calls(&collector) extracts the list of tool calls made during evaluation, which is useful for debugging unexpected agent behavior.

Both cases should pass since they are simple tasks that do not require tools:

```
[tool calls captured: []]
Eval Summary
============
Cases: 2 | Passed: 2 | Failed: 0
Pass rate: 100.0%
greeting: PASS (trajectory: 1.00, keyword: 1.00)
math: PASS (trajectory: 1.00, keyword: 1.00)
```