
Eval Framework

Heartbit includes a built-in evaluation framework for testing agent behavior systematically.

The eval framework provides three scoring dimensions:

  • Trajectory scoring — verify agents call the right tools in the right order
  • Keyword scoring — check that outputs contain expected keywords
  • Similarity scoring — cosine similarity to reference outputs
For example:

```rust
use heartbit::{EvalCase, EvalRunner, TrajectoryScorer, KeywordScorer, SimilarityScorer};

let case = EvalCase::new("research-task", "Find info about Rust")
    .expect_tool("websearch")
    .expect_output_contains("Rust")
    .reference_output("Rust is a systems programming language");

let runner = EvalRunner::new(vec![case])
    .scorer(TrajectoryScorer)   // tool call sequence matching
    .scorer(KeywordScorer)      // output keyword checking
    .scorer(SimilarityScorer);  // cosine similarity to reference

let summary = runner.run(agent).await?;
println!("Pass rate: {:.0}%", summary.pass_rate * 100.0);
```
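To make the similarity dimension concrete, here is a self-contained sketch of the cosine-similarity idea behind `SimilarityScorer`, using simple word-count vectors. The representation (bag of words over whitespace tokens) is an assumption for illustration; the real scorer may use embeddings or another vectorization.

```rust
use std::collections::HashMap;

// Build a word-frequency vector from text (illustrative bag-of-words).
fn word_counts(text: &str) -> HashMap<String, f64> {
    let mut counts = HashMap::new();
    for w in text.to_lowercase().split_whitespace() {
        *counts.entry(w.to_string()).or_insert(0.0) += 1.0;
    }
    counts
}

// Cosine similarity between two texts: dot product of their count
// vectors divided by the product of their magnitudes.
fn cosine_similarity(a: &str, b: &str) -> f64 {
    let (va, vb) = (word_counts(a), word_counts(b));
    let dot: f64 = va.iter().filter_map(|(k, x)| vb.get(k).map(|y| x * y)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(&va) * norm(&vb);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let score = cosine_similarity(
        "Rust is a systems programming language",
        "Rust is a programming language",
    );
    println!("similarity = {score:.2}");
}
```

A score of 1.0 means the texts have identical word distributions; 0.0 means no words in common.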

An EvalCase defines:

  • Name — identifier for the test case
  • Input — the task to give the agent
  • Expected tools — tools the agent should call (trajectory scoring)
  • Expected keywords — keywords that should appear in the output
  • Reference output — ideal output for similarity scoring
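The fields above suggest a shape like the following minimal sketch, which mirrors the builder calls from the example. The struct layout and field names here are illustrative assumptions, not Heartbit's actual internals.

```rust
// Illustrative model of an eval case; field names are assumptions.
#[derive(Debug, Default)]
struct EvalCase {
    name: String,                    // identifier for the test case
    input: String,                   // the task to give the agent
    expected_tools: Vec<String>,     // trajectory scoring
    expected_keywords: Vec<String>,  // keyword scoring
    reference_output: Option<String>, // similarity scoring
}

impl EvalCase {
    fn new(name: &str, input: &str) -> Self {
        Self { name: name.into(), input: input.into(), ..Default::default() }
    }
    fn expect_tool(mut self, tool: &str) -> Self {
        self.expected_tools.push(tool.into());
        self
    }
    fn expect_output_contains(mut self, keyword: &str) -> Self {
        self.expected_keywords.push(keyword.into());
        self
    }
    fn reference_output(mut self, text: &str) -> Self {
        self.reference_output = Some(text.into());
        self
    }
}

fn main() {
    let case = EvalCase::new("research-task", "Find info about Rust")
        .expect_tool("websearch")
        .expect_output_contains("Rust")
        .reference_output("Rust is a systems programming language");
    println!("{case:?}");
}
```

Each builder method consumes and returns `self`, which is what allows the chained-call style in the example above.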
| Scorer | What it checks |
| --- | --- |
| TrajectoryScorer | Tool call sequence matches expected tools |
| KeywordScorer | Output contains expected keywords |
| SimilarityScorer | Cosine similarity to reference output |

Scorers are composable — add as many as needed via .scorer() on the runner.
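One way composability like this is typically shaped in Rust is a common `Scorer` trait that each scorer implements, with the runner holding a list of trait objects. The trait, `RunRecord` type, and scoring logic below are a sketch under that assumption, not Heartbit's exact API.

```rust
// Illustrative record of one agent run (names are assumptions).
struct RunRecord {
    output: String,
}

// A scorer maps a run to a score in [0.0, 1.0].
trait Scorer {
    fn name(&self) -> &'static str;
    fn score(&self, run: &RunRecord) -> f64;
}

// Sketch of keyword checking: fraction of expected keywords found.
struct KeywordScorer {
    keywords: Vec<String>,
}

impl Scorer for KeywordScorer {
    fn name(&self) -> &'static str { "keyword" }
    fn score(&self, run: &RunRecord) -> f64 {
        let hits = self.keywords.iter()
            .filter(|k| run.output.contains(k.as_str()))
            .count();
        hits as f64 / self.keywords.len().max(1) as f64
    }
}

fn main() {
    // The runner can hold any number of scorers behind the trait.
    let scorers: Vec<Box<dyn Scorer>> = vec![
        Box::new(KeywordScorer { keywords: vec!["Rust".into()] }),
    ];
    let run = RunRecord { output: "Rust is a systems language".into() };
    for s in &scorers {
        println!("{}: {:.2}", s.name(), s.score(&run));
    }
}
```

Because every scorer satisfies the same trait, adding another dimension is just another `Box<dyn Scorer>` in the list.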

EvalSummary provides:

  • pass_rate — fraction of cases that passed (0.0 to 1.0)
  • Per-case EvalResult with individual scorer results for debugging
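The pass-rate arithmetic is straightforward: passed cases divided by total cases. A minimal sketch, assuming a per-case `passed` flag (the real `EvalResult` carries individual scorer results as well):

```rust
// Illustrative per-case result; the real type holds scorer details too.
struct EvalResult {
    passed: bool,
}

// Fraction of cases that passed, 0.0 to 1.0.
fn pass_rate(results: &[EvalResult]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    results.iter().filter(|r| r.passed).count() as f64 / results.len() as f64
}

fn main() {
    let results = vec![
        EvalResult { passed: true },
        EvalResult { passed: true },
        EvalResult { passed: false },
    ];
    println!("Pass rate: {:.0}%", pass_rate(&results) * 100.0);
}
```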

EventCollector captures AgentEvent variants during eval runs for detailed trajectory analysis. Use it to inspect tool call sequences, timing, and LLM responses.
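The collection pattern can be sketched as follows. The `AgentEvent` variants shown are a guess at the shape for illustration, not the full enum:

```rust
// Illustrative subset of agent events; the real enum has more variants.
#[derive(Debug, Clone)]
enum AgentEvent {
    ToolCall { name: String },
    LlmResponse { text: String },
}

// Records events as they happen so trajectories can be inspected later.
#[derive(Default)]
struct EventCollector {
    events: Vec<AgentEvent>,
}

impl EventCollector {
    fn record(&mut self, event: AgentEvent) {
        self.events.push(event);
    }

    // Extract just the tool call names, in order, for trajectory checks.
    fn tool_sequence(&self) -> Vec<&str> {
        self.events.iter()
            .filter_map(|e| match e {
                AgentEvent::ToolCall { name } => Some(name.as_str()),
                _ => None,
            })
            .collect()
    }
}

fn main() {
    let mut collector = EventCollector::default();
    collector.record(AgentEvent::ToolCall { name: "websearch".into() });
    collector.record(AgentEvent::LlmResponse { text: "Rust is ...".into() });
    assert_eq!(collector.tool_sequence(), vec!["websearch"]);
}
```

Filtering the recorded stream down to tool calls is exactly what trajectory scoring needs: the resulting sequence can be compared against an `EvalCase`'s expected tools.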