# Eval Framework
Heartbit includes a built-in evaluation framework for testing agent behavior systematically.
## Overview

The eval framework provides scoring across multiple dimensions:
- Trajectory scoring — verify agents call the right tools in the right order
- Keyword scoring — check that outputs contain expected keywords
- Similarity scoring — Rouge-1 F1 similarity to reference outputs
- Cost scoring — estimated USD cost within budget
- Latency scoring — total LLM latency within budget
- Tool call count scoring — number of tool calls within budget
- Safety scoring — no guardrail denials occurred
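The first dimension, trajectory scoring, verifies that expected tools appear in the agent's call sequence in order. A minimal sketch of one plausible matching rule (an ordered subsequence check; the library's exact rule is an assumption here):

```rust
/// Returns true if `expected` appears as an ordered subsequence of `actual`.
/// This is a sketch of trajectory matching, not Heartbit's internal code.
fn trajectory_matches(expected: &[&str], actual: &[&str]) -> bool {
    let mut calls = actual.iter();
    // Each expected tool must be found after the previously matched one.
    expected.iter().all(|e| calls.any(|a| a == e))
}
```

Under this rule, extra tool calls between expected ones are allowed, but out-of-order calls fail.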
## Quick Start

```rust
use heartbit::{EvalCase, EvalRunner, TrajectoryScorer, KeywordScorer, SimilarityScorer};

let case = EvalCase::new("research-task", "Find info about Rust")
    .expect_tool("websearch")
    .expect_output_contains("Rust")
    .reference_output("Rust is a systems programming language");

let runner = EvalRunner::new(vec![case])
    .scorer(TrajectoryScorer)   // tool call sequence matching
    .scorer(KeywordScorer)      // output keyword checking
    .scorer(SimilarityScorer);  // Rouge-1 F1 similarity to reference

let summary = runner.run(agent).await?;
println!("Pass rate: {:.0}%", summary.pass_rate * 100.0);
```

## Eval Cases

An `EvalCase` defines:
- Name — identifier for the test case
- Input — the task to give the agent
- Expected tools — tools the agent should call (trajectory scoring)
- Expected keywords — keywords that should appear in the output
- Reference output — ideal output for similarity scoring
- Budget fields — optional cost, latency, and tool call limits
## Budget Fields

Set budget constraints on individual cases to gate pass/fail on operational metrics:

```rust
EvalCase::new("task", "input")
    .expect_max_cost_usd(0.05)    // max acceptable cost
    .expect_max_latency_ms(5000)  // max acceptable latency
    .expect_max_tool_calls(10)    // max tool calls
```

Budget fields override the default thresholds on their corresponding scorers.
## Scorers

| Scorer | What it checks | Requires |
|---|---|---|
| `TrajectoryScorer` | Tool call sequence matches expected tools | — |
| `KeywordScorer` | Output contains expected keywords | — |
| `SimilarityScorer` | Rouge-1 F1 similarity to reference output | — |
| `CostScorer` | Estimated USD cost within budget | `EventCollector` |
| `LatencyScorer` | Total LLM latency within budget | `EventCollector` |
| `ToolCallCountScorer` | Number of tool calls within budget | — |
| `SafetyScorer` | No guardrail denials occurred | `EventCollector` |
Scorers are composable — add as many as needed via .scorer() on the runner.
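For reference, Rouge-1 F1 (the metric the `SimilarityScorer` row documents) measures unigram overlap between the output and the reference. A self-contained sketch, assuming naive whitespace tokenization (the library's tokenizer may differ):

```rust
use std::collections::HashMap;

/// Rouge-1 F1: harmonic mean of unigram precision and recall.
/// Illustrative sketch only, not Heartbit's internal implementation.
fn rouge1_f1(candidate: &str, reference: &str) -> f64 {
    fn counts(s: &str) -> HashMap<&str, usize> {
        let mut m = HashMap::new();
        for w in s.split_whitespace() {
            *m.entry(w).or_insert(0) += 1;
        }
        m
    }
    let (cand, refr) = (counts(candidate), counts(reference));
    let cand_total: usize = cand.values().sum();
    let ref_total: usize = refr.values().sum();
    if cand_total == 0 || ref_total == 0 {
        return 0.0;
    }
    // Multiset intersection: overlapping unigram count.
    let overlap: usize = cand
        .iter()
        .map(|(w, &c)| c.min(*refr.get(w).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / cand_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```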
### CostScorer

Reads `LlmResponse` events from the `EventCollector` and estimates cost via `estimate_cost(model, usage)`. Unknown models contribute $0. Default pass threshold: 0.01. A case's `max_cost_usd` overrides the scorer's default budget.
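The "unknown models contribute $0" rule can be sketched as a price-table lookup. The model names, rates, and `Usage` field names below are illustrative assumptions, not Heartbit's actual pricing data or types:

```rust
// Hypothetical token usage record (field names are assumptions).
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
}

/// Sketch of a cost estimator: per-million-token rates keyed by model,
/// with unknown models contributing $0, as documented.
fn estimate_cost(model: &str, usage: &Usage) -> f64 {
    // (input $/1M tokens, output $/1M tokens) — example values only.
    let rates = match model {
        "example-small" => (0.25, 1.25),
        "example-large" => (3.00, 15.00),
        _ => return 0.0, // unknown model: contributes nothing to the total
    };
    usage.input_tokens as f64 / 1e6 * rates.0
        + usage.output_tokens as f64 / 1e6 * rates.1
}
```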
### LatencyScorer

Sums `latency_ms` across `LlmResponse` events in the `EventCollector`. Default pass threshold: 0.01. A case's `max_latency_ms` overrides the scorer's default budget.
### ToolCallCountScorer

Uses the length of the `tool_calls` slice from the agent output. Does not require an `EventCollector`. A case's `max_tool_calls` overrides the scorer's default budget.
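Because this scorer only needs the output's `tool_calls` slice, its check reduces to a length comparison. A sketch (the pass/fail scoring convention of 1.0/0.0 is an assumption):

```rust
/// Sketch of a tool-call budget check: pass (1.0) if the number of
/// tool calls is within budget, fail (0.0) otherwise.
fn tool_call_count_score(tool_calls: &[&str], max_calls: usize) -> f64 {
    if tool_calls.len() <= max_calls { 1.0 } else { 0.0 }
}
```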
### SafetyScorer

Checks for `GuardrailDenied` events in the `EventCollector`. Score is 0.0 if any denial occurred, 1.0 otherwise. Guardrail warnings do not fail the case.
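The documented semantics are binary: any denial zeroes the score, warnings are ignored. A sketch with a simplified event type (the real `AgentEvent` variants carry more data):

```rust
// Simplified stand-in for guardrail-related events (an assumption;
// the real events are AgentEvent variants with payloads).
enum GuardrailEvent {
    Denied,
    Warning,
}

/// Sketch of the SafetyScorer rule: 0.0 on any denial, 1.0 otherwise.
fn safety_score(events: &[GuardrailEvent]) -> f64 {
    if events.iter().any(|e| matches!(e, GuardrailEvent::Denied)) {
        0.0
    } else {
        1.0
    }
}
```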
## Results

`EvalSummary` provides:

- `pass_rate` — fraction of cases that passed (0.0 to 1.0)
- Per-case `EvalResult`s with individual scorer results for debugging
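One plausible way `pass_rate` relates to the per-case results: a case passes only if every scorer on it passed. The struct shapes below are simplified assumptions for illustration:

```rust
// Simplified stand-ins for the real result types (field names assumed).
struct ScorerResult {
    passed: bool,
}
struct EvalResult {
    scorer_results: Vec<ScorerResult>,
}

/// Sketch: fraction of cases where all scorers passed.
fn pass_rate(results: &[EvalResult]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let passed = results
        .iter()
        .filter(|r| r.scorer_results.iter().all(|s| s.passed))
        .count();
    passed as f64 / results.len() as f64
}
```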
## Event Collection

`EventCollector` captures `AgentEvent` variants during eval runs for detailed trajectory analysis. Use it to inspect tool call sequences, timing, and LLM responses.
### clear_events

When reusing a collector across multiple eval cases, call `clear_events` between cases to prevent stale events from corrupting event-aware scorers:

```rust
use heartbit::clear_events;

// Between eval cases when reusing a collector:
clear_events(&collector);
```

## A/B Comparison

Compare two eval runs to detect regressions between a baseline and a candidate:
```rust
use heartbit::EvalComparison;

let comparison = EvalComparison::compare(&baseline_results, &candidate_results);
println!("{comparison}"); // pretty-printed comparison

if comparison.has_regressions() {
    eprintln!("Regressions: {:?}", comparison.regressions());
}
```

`EvalComparison` matches results by `case_name`. Available methods:

- `baseline_wins()` — cases where baseline scored higher
- `candidate_wins()` — cases where candidate scored higher
- `ties()` — cases within tolerance (0.001)
- `has_regressions()` — whether any regressions exist
- `regressions()` — list of regressed case comparisons
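The win/tie/regression split above follows directly from the documented 0.001 tolerance. A sketch of how a single matched case pair could be classified:

```rust
/// Tie tolerance documented for `ties()`.
const TOLERANCE: f64 = 0.001;

/// Sketch: classify one baseline/candidate score pair.
/// Within tolerance is a tie; otherwise the higher score wins, and a
/// lower candidate score is a regression.
fn classify(baseline: f64, candidate: f64) -> &'static str {
    let delta = candidate - baseline;
    if delta.abs() <= TOLERANCE {
        "tie"
    } else if delta > 0.0 {
        "candidate_win"
    } else {
        "regression"
    }
}
```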
## Serialize Support

All eval types (`EvalCase`, `EvalResult`, `ScorerResult`, `EvalSummary`, `EvalComparison`, `CaseComparison`) derive `Serialize` for JSON report generation.