
Eval Framework

Heartbit includes a built-in evaluation framework for testing agent behavior systematically.

The eval framework provides scoring across multiple dimensions:

  • Trajectory scoring — verify agents call the right tools in the right order
  • Keyword scoring — check that outputs contain expected keywords
  • Similarity scoring — Rouge-1 F1 similarity to reference outputs
  • Cost scoring — estimated USD cost within budget
  • Latency scoring — total LLM latency within budget
  • Tool call count scoring — number of tool calls within budget
  • Safety scoring — no guardrail denials occurred
A minimal run wires cases, scorers, and an agent together:

```rust
use heartbit::{EvalCase, EvalRunner, TrajectoryScorer, KeywordScorer, SimilarityScorer};

let case = EvalCase::new("research-task", "Find info about Rust")
    .expect_tool("websearch")
    .expect_output_contains("Rust")
    .reference_output("Rust is a systems programming language");

let runner = EvalRunner::new(vec![case])
    .scorer(TrajectoryScorer)   // tool call sequence matching
    .scorer(KeywordScorer)      // output keyword checking
    .scorer(SimilarityScorer);  // Rouge-1 F1 similarity to reference

let summary = runner.run(agent).await?;
println!("Pass rate: {:.0}%", summary.pass_rate * 100.0);
```

An EvalCase defines:

  • Name — identifier for the test case
  • Input — the task to give the agent
  • Expected tools — tools the agent should call (trajectory scoring)
  • Expected keywords — keywords that should appear in the output
  • Reference output — ideal output for similarity scoring
  • Budget fields — optional cost, latency, and tool call limits

Set budget constraints on individual cases to gate pass/fail on operational metrics:

```rust
EvalCase::new("task", "input")
    .expect_max_cost_usd(0.05)    // max acceptable cost
    .expect_max_latency_ms(5000)  // max acceptable latency
    .expect_max_tool_calls(10)    // max tool calls
```

Budget fields override the default thresholds on their corresponding scorers.
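The override rule amounts to an `Option` fallback. A minimal sketch of that behavior (`effective_budget` is an illustrative name, not part of the heartbit API):

```rust
/// Sketch of the override rule: a case-level budget, when set,
/// replaces the scorer's default threshold.
fn effective_budget(case_budget: Option<f64>, scorer_default: f64) -> f64 {
    case_budget.unwrap_or(scorer_default)
}

fn main() {
    assert_eq!(effective_budget(Some(0.05), 0.01), 0.05); // case override wins
    assert_eq!(effective_budget(None, 0.01), 0.01); // falls back to the default
    println!("ok");
}
```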

| Scorer | What it checks | Requires |
| --- | --- | --- |
| TrajectoryScorer | Tool call sequence matches expected tools | — |
| KeywordScorer | Output contains expected keywords | — |
| SimilarityScorer | Rouge-1 F1 similarity to reference output | — |
| CostScorer | Estimated USD cost within budget | EventCollector |
| LatencyScorer | Total LLM latency within budget | EventCollector |
| ToolCallCountScorer | Number of tool calls within budget | — |
| SafetyScorer | No guardrail denials occurred | EventCollector |

Scorers are composable — add as many as needed via .scorer() on the runner.
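The Rouge-1 F1 metric behind SimilarityScorer is plain unigram overlap between the output and the reference. A self-contained sketch of the math (whitespace tokenization and lowercasing are simplifying assumptions; the library's tokenizer may differ):

```rust
use std::collections::HashMap;

/// Rouge-1 F1 between a candidate output and a reference output.
fn rouge1_f1(candidate: &str, reference: &str) -> f64 {
    fn counts(s: &str) -> HashMap<String, usize> {
        let mut m = HashMap::new();
        for w in s.to_lowercase().split_whitespace() {
            *m.entry(w.to_string()).or_insert(0) += 1;
        }
        m
    }
    let (cand, refc) = (counts(candidate), counts(reference));
    // Clipped unigram overlap: each word counts at most as often as it
    // appears in the reference.
    let overlap: usize = cand
        .iter()
        .map(|(w, &c)| c.min(refc.get(w).copied().unwrap_or(0)))
        .sum();
    let (cand_total, ref_total): (usize, usize) = (cand.values().sum(), refc.values().sum());
    if overlap == 0 || cand_total == 0 || ref_total == 0 {
        return 0.0;
    }
    // F1 = 2PR / (P + R), which simplifies to 2 * overlap / (|candidate| + |reference|).
    2.0 * overlap as f64 / (cand_total + ref_total) as f64
}

fn main() {
    let score = rouge1_f1("Rust is fast", "Rust is a systems programming language");
    println!("{score:.3}");
}
```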

CostScorer reads LlmResponse events from the EventCollector and estimates cost via estimate_cost(model, usage). Unknown models contribute $0. Default pass threshold: 0.01. A case's max_cost_usd overrides the scorer's default budget.

LatencyScorer sums latency_ms from LlmResponse events in the EventCollector. Default pass threshold: 0.01. A case's max_latency_ms overrides the scorer's default budget.

ToolCallCountScorer uses the length of the tool_calls slice from the agent output and does not require an EventCollector. A case's max_tool_calls overrides the scorer's default budget.

SafetyScorer checks for GuardrailDenied events in the EventCollector. The score is 0.0 if any denial occurred and 1.0 otherwise; warnings still pass.
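The denial check is a binary score. A sketch with a simplified event enum (the real AgentEvent has more variants and carries payloads):

```rust
/// Simplified stand-in for the library's AgentEvent enum.
enum AgentEvent {
    GuardrailDenied,
    GuardrailWarning,
    Other,
}

/// SafetyScorer rule as documented: any denial scores 0.0, otherwise 1.0.
/// Warnings alone still pass.
fn safety_score(events: &[AgentEvent]) -> f64 {
    if events.iter().any(|e| matches!(e, AgentEvent::GuardrailDenied)) {
        0.0
    } else {
        1.0
    }
}

fn main() {
    let events = vec![AgentEvent::GuardrailWarning, AgentEvent::Other];
    println!("score = {}", safety_score(&events)); // warnings pass
}
```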

EvalSummary provides:

  • pass_rate — fraction of cases that passed (0.0 to 1.0)
  • Per-case EvalResult with individual scorer results for debugging
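The pass rate is simply the passed fraction of cases. A sketch under illustrative names (the real EvalResult also carries per-scorer breakdowns):

```rust
/// Illustrative stand-in for a per-case result.
struct CaseOutcome {
    passed: bool,
}

/// pass_rate = passed cases / total cases, in [0.0, 1.0].
fn pass_rate(results: &[CaseOutcome]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let passed = results.iter().filter(|r| r.passed).count();
    passed as f64 / results.len() as f64
}

fn main() {
    let results = vec![
        CaseOutcome { passed: true },
        CaseOutcome { passed: true },
        CaseOutcome { passed: false },
        CaseOutcome { passed: true },
    ];
    println!("Pass rate: {:.0}%", pass_rate(&results) * 100.0); // prints "Pass rate: 75%"
}
```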

EventCollector captures AgentEvent variants during eval runs for detailed trajectory analysis. Use it to inspect tool call sequences, timing, and LLM responses.

When reusing a collector across multiple eval cases, call clear_events between cases to prevent stale events from corrupting event-aware scorers:

```rust
use heartbit::clear_events;

// Between eval cases when reusing a collector:
clear_events(&collector);
```

Compare two eval runs to detect regressions between a baseline and candidate:

```rust
use heartbit::EvalComparison;

let comparison = EvalComparison::compare(&baseline_results, &candidate_results);
println!("{comparison}"); // pretty-printed comparison

if comparison.has_regressions() {
    eprintln!("Regressions: {:?}", comparison.regressions());
}
```

EvalComparison matches results by case_name. Available methods:

  • baseline_wins() — cases where baseline scored higher
  • candidate_wins() — cases where candidate scored higher
  • ties() — cases within tolerance (0.001)
  • has_regressions() — whether any regressions exist
  • regressions() — list of regressed case comparisons
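The match-by-name and win/tie classification can be sketched as follows (`CaseResult` and `classify` are illustrative names; the real CaseComparison holds more detail):

```rust
use std::collections::HashMap;

/// Illustrative per-case summary: a name and an aggregate score.
struct CaseResult {
    name: String,
    score: f64,
}

const TIE_TOLERANCE: f64 = 0.001;

/// Match baseline and candidate by case name, then classify each pair.
/// Returns (baseline_wins, candidate_wins, ties); in this sketch a
/// baseline win is what regressions() would report.
fn classify(baseline: &[CaseResult], candidate: &[CaseResult]) -> (usize, usize, usize) {
    let base: HashMap<&str, f64> = baseline
        .iter()
        .map(|r| (r.name.as_str(), r.score))
        .collect();
    let (mut baseline_wins, mut candidate_wins, mut ties) = (0, 0, 0);
    for c in candidate {
        if let Some(&b) = base.get(c.name.as_str()) {
            if (c.score - b).abs() <= TIE_TOLERANCE {
                ties += 1;
            } else if c.score > b {
                candidate_wins += 1;
            } else {
                baseline_wins += 1;
            }
        } // case names missing from the baseline are skipped
    }
    (baseline_wins, candidate_wins, ties)
}

fn main() {
    let baseline = vec![
        CaseResult { name: "a".into(), score: 0.90 },
        CaseResult { name: "b".into(), score: 0.50 },
    ];
    let candidate = vec![
        CaseResult { name: "a".into(), score: 0.9005 }, // within tolerance: tie
        CaseResult { name: "b".into(), score: 0.40 },   // regression
    ];
    let (regressions, wins, ties) = classify(&baseline, &candidate);
    println!("regressions={regressions} wins={wins} ties={ties}");
}
```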

All eval types (EvalCase, EvalResult, ScorerResult, EvalSummary, EvalComparison, CaseComparison) derive Serialize for JSON report generation.