Evaluations
This example demonstrates Heartbit’s evaluation framework. Define test cases with expected behaviors, run them against an agent, and score the results using trajectory and keyword scorers.
Prerequisites
ANTHROPIC_API_KEY environment variable set with a valid API key
Running
```
export ANTHROPIC_API_KEY="sk-..."
cargo run -p heartbit --example eval
```
Source
```rust
use std::sync::Arc;

use heartbit::{
    AgentRunner, AnthropicProvider, EvalCase, EvalRunner, EvalSummary, KeywordScorer,
    TrajectoryScorer, build_eval_agent,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("ANTHROPIC_API_KEY")
        .expect("set ANTHROPIC_API_KEY environment variable");
    let provider = Arc::new(AnthropicProvider::new(api_key, "claude-sonnet-4-20250514"));

    // Build an eval-ready agent with event collection for trajectory scoring.
    let builder = AgentRunner::builder(provider)
        .name("eval-agent")
        .system_prompt("You are a helpful assistant. Be concise.")
        .max_turns(3)
        .max_tokens(1024);
    let (agent, collector) = build_eval_agent(builder)?;

    // Define test cases.
    let cases = vec![
        EvalCase::new("greeting", "Say hello")
            .expect_no_tools()
            .expect_output_contains("hello"),
        EvalCase::new("math", "What is 7 * 6? Just the number.")
            .expect_no_tools()
            .expect_output_contains("42"),
    ];

    // Run evaluations and score.
    let runner = EvalRunner::new()
        .scorer(TrajectoryScorer)
        .scorer(KeywordScorer);
    let results = runner.run(&agent, &cases).await;

    // Use the collector for trajectory data on the last case.
    let tool_calls = EvalRunner::collected_tool_calls(&collector);
    eprintln!("[tool calls captured: {tool_calls:?}]");

    let summary = EvalSummary::from_results(&results);
    println!("{summary}");

    Ok(())
}
```
Walkthrough
Eval-ready agent — build_eval_agent(builder) wraps a standard AgentRunner builder and returns the agent plus a collector that captures events (tool calls, LLM responses) during execution. This is needed for trajectory scoring.
Defining test cases — EvalCase::new(name, prompt) creates a test case. Chain expectations:
.expect_no_tools() — the agent should answer without calling any tools
.expect_output_contains("hello") — the final output must contain this keyword (case-insensitive)
You can also use .expect_tool_call("tool_name") to assert that a specific tool was invoked.
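Conceptually, the keyword expectation reduces to a case-insensitive substring check over the agent's final output. A minimal standalone sketch (the function name and logic here are illustrative assumptions, not heartbit's actual implementation):

```rust
/// Case-insensitive keyword check, mirroring what an
/// `expect_output_contains` expectation conceptually does.
/// (Illustrative sketch; not heartbit's internals.)
fn output_contains(output: &str, keyword: &str) -> bool {
    output.to_lowercase().contains(&keyword.to_lowercase())
}

fn main() {
    // "Hello there!" contains "hello" once case is normalized.
    assert!(output_contains("Hello there!", "hello"));
    // No match: the keyword is absent entirely.
    assert!(!output_contains("Goodbye", "hello"));
    println!("ok");
}
```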
Scorers — The EvalRunner accepts multiple scorers that each produce a score:
TrajectoryScorer — evaluates whether the agent’s tool-call trajectory matches expectations (did it call the right tools in the right order?)
KeywordScorer — checks whether expected keywords appear in the output
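As a rough mental model, a keyword scorer can be thought of as the fraction of expected keywords found in the output. A standalone sketch under that assumption (the name and scoring rule are invented for illustration; heartbit's KeywordScorer may weight or gate results differently):

```rust
/// Fraction of expected keywords found (case-insensitively) in the output.
/// Illustrative only; not heartbit's actual KeywordScorer.
fn keyword_score(output: &str, keywords: &[&str]) -> f64 {
    if keywords.is_empty() {
        return 1.0; // nothing expected, trivially satisfied
    }
    let haystack = output.to_lowercase();
    let hits = keywords
        .iter()
        .filter(|k| haystack.contains(&k.to_lowercase()))
        .count();
    hits as f64 / keywords.len() as f64
}

fn main() {
    // All expected keywords present.
    assert_eq!(keyword_score("The answer is 42.", &["42"]), 1.0);
    // One of two keywords present.
    assert_eq!(keyword_score("Hello!", &["hello", "42"]), 0.5);
    println!("ok");
}
```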
Running evals — runner.run(&agent, &cases) executes each case sequentially, collecting results. Each result includes per-scorer scores and pass/fail status.
Summary — EvalSummary::from_results(&results) aggregates results into a printable summary with overall pass rate, per-case breakdowns, and scorer details.
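The arithmetic behind such a summary is straightforward to sketch. The types and names below are invented for illustration and are not heartbit's EvalSummary:

```rust
/// Minimal per-case result record (illustrative, not heartbit's type).
struct CaseResult {
    name: &'static str,
    passed: bool,
}

/// Overall pass rate as a percentage.
fn pass_rate(results: &[CaseResult]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let passed = results.iter().filter(|r| r.passed).count();
    100.0 * passed as f64 / results.len() as f64
}

fn main() {
    let results = [
        CaseResult { name: "greeting", passed: true },
        CaseResult { name: "math", passed: true },
    ];
    println!("Cases: {} | Pass rate: {:.1}%", results.len(), pass_rate(&results));
    for r in &results {
        println!("  {}: {}", r.name, if r.passed { "PASS" } else { "FAIL" });
    }
}
```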
Tool call inspection — EvalRunner::collected_tool_calls(&collector) extracts the list of tool calls made during evaluation, useful for debugging unexpected agent behavior.
What to expect
Both cases should pass since they are simple tasks that do not require tools:
```
[tool calls captured: []]
Eval Summary
============
Cases: 2 | Passed: 2 | Failed: 0
Pass rate: 100.0%

  greeting: PASS (trajectory: 1.00, keyword: 1.00)
  math: PASS (trajectory: 1.00, keyword: 1.00)
```