
Eval Framework

Heartbit includes a built-in evaluation framework for testing agent behavior systematically.

The eval framework provides scoring across multiple dimensions:

  • Trajectory scoring — verify agents call the right tools in the right order
  • Keyword scoring — check that outputs contain expected keywords
  • Similarity scoring — Rouge-1 F1 similarity to reference outputs
  • Cost scoring — estimated USD cost within budget
  • Latency scoring — total LLM latency within budget
  • Tool call count scoring — number of tool calls within budget
  • Safety scoring — no guardrail denials occurred
A minimal run wires cases, scorers, and an agent together:

```rust
use heartbit::{EvalCase, EvalRunner, TrajectoryScorer, KeywordScorer, SimilarityScorer};

let case = EvalCase::new("research-task", "Find info about Rust")
    .expect_tool("websearch")
    .expect_output_contains("Rust")
    .reference_output("Rust is a systems programming language");

let runner = EvalRunner::new(vec![case])
    .scorer(TrajectoryScorer)   // tool call sequence matching
    .scorer(KeywordScorer)      // output keyword checking
    .scorer(SimilarityScorer);  // Rouge-1 F1 similarity to reference

let summary = runner.run(agent).await?;
println!("Pass rate: {:.0}%", summary.pass_rate * 100.0);
```

An EvalCase defines:

  • Name — identifier for the test case
  • Input — the task to give the agent
  • Expected tools — tools the agent should call (trajectory scoring)
  • Expected keywords — keywords that should appear in the output
  • Reference output — ideal output for similarity scoring
  • Budget fields — optional cost, latency, and tool call limits

Set budget constraints on individual cases to gate pass/fail on operational metrics:

```rust
EvalCase::new("task", "input")
    .expect_max_cost_usd(0.05)    // max acceptable cost
    .expect_max_latency_ms(5000)  // max acceptable latency
    .expect_max_tool_calls(10)    // max tool calls
```

Budget fields override the default thresholds on their corresponding scorers.
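The override rule amounts to an `Option` fallback. A minimal sketch of that behavior (`effective_budget` is an illustrative name, not part of the heartbit API):

```rust
/// Sketch of the override rule: a case-level budget, when set,
/// replaces the scorer's default threshold.
fn effective_budget(case_budget: Option<f64>, scorer_default: f64) -> f64 {
    case_budget.unwrap_or(scorer_default)
}

fn main() {
    assert_eq!(effective_budget(Some(0.05), 0.01), 0.05); // case override wins
    assert_eq!(effective_budget(None, 0.01), 0.01); // falls back to the default
    println!("ok");
}
```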

| Scorer | What it checks | Requires |
| --- | --- | --- |
| TrajectoryScorer | Tool call sequence matches expected tools | — |
| KeywordScorer | Output contains expected keywords | — |
| SimilarityScorer | Rouge-1 F1 similarity to reference output | — |
| CostScorer | Estimated USD cost within budget | EventCollector |
| LatencyScorer | Total LLM latency within budget | EventCollector |
| ToolCallCountScorer | Number of tool calls within budget | — |
| SafetyScorer | No guardrail denials occurred | EventCollector |

Scorers are composable — add as many as needed via .scorer() on the runner.
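The Rouge-1 F1 metric behind SimilarityScorer is plain unigram overlap between the output and the reference. A self-contained sketch of the math (whitespace tokenization and lowercasing are simplifying assumptions; the library's tokenizer may differ):

```rust
use std::collections::HashMap;

/// Rouge-1 F1 between a candidate output and a reference output.
fn rouge1_f1(candidate: &str, reference: &str) -> f64 {
    fn counts(s: &str) -> HashMap<String, usize> {
        let mut m = HashMap::new();
        for w in s.to_lowercase().split_whitespace() {
            *m.entry(w.to_string()).or_insert(0) += 1;
        }
        m
    }
    let (cand, refc) = (counts(candidate), counts(reference));
    // Clipped unigram overlap: each word counts at most as often as it
    // appears in the reference.
    let overlap: usize = cand
        .iter()
        .map(|(w, &c)| c.min(refc.get(w).copied().unwrap_or(0)))
        .sum();
    let (cand_total, ref_total): (usize, usize) = (cand.values().sum(), refc.values().sum());
    if overlap == 0 || cand_total == 0 || ref_total == 0 {
        return 0.0;
    }
    // F1 = 2PR / (P + R), which simplifies to 2 * overlap / (|candidate| + |reference|).
    2.0 * overlap as f64 / (cand_total + ref_total) as f64
}

fn main() {
    let score = rouge1_f1("Rust is fast", "Rust is a systems programming language");
    println!("{score:.3}");
}
```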

CostScorer reads LlmResponse events from the EventCollector and estimates cost via estimate_cost(model, usage). Unknown models contribute $0. Default pass threshold: 0.01. A case's max_cost_usd overrides the scorer's default budget.

LatencyScorer sums latency_ms from LlmResponse events in the EventCollector. Default pass threshold: 0.01. A case's max_latency_ms overrides the scorer's default budget.

ToolCallCountScorer uses the length of the tool_calls slice from the agent output and does not require an EventCollector. A case's max_tool_calls overrides the scorer's default budget.

SafetyScorer checks for GuardrailDenied events in the EventCollector. The score is 0.0 if any denial occurred and 1.0 otherwise; warnings still pass.
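The denial check is a binary score. A sketch with a simplified event enum (the real AgentEvent has more variants and carries payloads):

```rust
/// Simplified stand-in for the library's AgentEvent enum.
enum AgentEvent {
    GuardrailDenied,
    GuardrailWarning,
    Other,
}

/// SafetyScorer rule as documented: any denial scores 0.0, otherwise 1.0.
/// Warnings alone still pass.
fn safety_score(events: &[AgentEvent]) -> f64 {
    if events.iter().any(|e| matches!(e, AgentEvent::GuardrailDenied)) {
        0.0
    } else {
        1.0
    }
}

fn main() {
    let events = vec![AgentEvent::GuardrailWarning, AgentEvent::Other];
    println!("score = {}", safety_score(&events)); // warnings pass
}
```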

EvalSummary provides:

  • pass_rate — fraction of cases that passed (0.0 to 1.0)
  • Per-case EvalResult with individual scorer results for debugging
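The pass rate is simply the passed fraction of cases. A sketch under illustrative names (the real EvalResult also carries per-scorer breakdowns):

```rust
/// Illustrative stand-in for a per-case result.
struct CaseOutcome {
    passed: bool,
}

/// pass_rate = passed cases / total cases, in [0.0, 1.0].
fn pass_rate(results: &[CaseOutcome]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let passed = results.iter().filter(|r| r.passed).count();
    passed as f64 / results.len() as f64
}

fn main() {
    let results = vec![
        CaseOutcome { passed: true },
        CaseOutcome { passed: true },
        CaseOutcome { passed: false },
        CaseOutcome { passed: true },
    ];
    println!("Pass rate: {:.0}%", pass_rate(&results) * 100.0); // prints "Pass rate: 75%"
}
```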

EventCollector captures AgentEvent variants during eval runs for detailed trajectory analysis. Use it to inspect tool call sequences, timing, and LLM responses.

When reusing a collector across multiple eval cases, call clear_events between cases to prevent stale events from corrupting event-aware scorers:

```rust
use heartbit::clear_events;

// Between eval cases when reusing a collector:
clear_events(&collector);
```

Compare two eval runs to detect regressions between a baseline and candidate:

```rust
use heartbit::EvalComparison;

let comparison = EvalComparison::compare(&baseline_results, &candidate_results);
println!("{comparison}"); // pretty-printed comparison

if comparison.has_regressions() {
    eprintln!("Regressions: {:?}", comparison.regressions());
}
```

EvalComparison matches results by case_name. Available methods:

  • baseline_wins() — cases where baseline scored higher
  • candidate_wins() — cases where candidate scored higher
  • ties() — cases within tolerance (0.001)
  • has_regressions() — whether any regressions exist
  • regressions() — list of regressed case comparisons
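The match-by-name and win/tie classification can be sketched as follows (`CaseResult` and `classify` are illustrative names; the real CaseComparison holds more detail):

```rust
use std::collections::HashMap;

/// Illustrative per-case summary: a name and an aggregate score.
struct CaseResult {
    name: String,
    score: f64,
}

const TIE_TOLERANCE: f64 = 0.001;

/// Match baseline and candidate by case name, then classify each pair.
/// Returns (baseline_wins, candidate_wins, ties); in this sketch a
/// baseline win is what regressions() would report.
fn classify(baseline: &[CaseResult], candidate: &[CaseResult]) -> (usize, usize, usize) {
    let base: HashMap<&str, f64> = baseline
        .iter()
        .map(|r| (r.name.as_str(), r.score))
        .collect();
    let (mut baseline_wins, mut candidate_wins, mut ties) = (0, 0, 0);
    for c in candidate {
        if let Some(&b) = base.get(c.name.as_str()) {
            if (c.score - b).abs() <= TIE_TOLERANCE {
                ties += 1;
            } else if c.score > b {
                candidate_wins += 1;
            } else {
                baseline_wins += 1;
            }
        } // case names missing from the baseline are skipped
    }
    (baseline_wins, candidate_wins, ties)
}

fn main() {
    let baseline = vec![
        CaseResult { name: "a".into(), score: 0.90 },
        CaseResult { name: "b".into(), score: 0.50 },
    ];
    let candidate = vec![
        CaseResult { name: "a".into(), score: 0.9005 }, // within tolerance: tie
        CaseResult { name: "b".into(), score: 0.40 },   // regression
    ];
    let (regressions, wins, ties) = classify(&baseline, &candidate);
    println!("regressions={regressions} wins={wins} ties={ties}");
}
```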

All eval types (EvalCase, EvalResult, ScorerResult, EvalSummary, EvalComparison, CaseComparison) derive Serialize for JSON report generation.