
Guardrails

This example demonstrates Heartbit’s LLM-as-Judge guardrail. A cheap judge model evaluates every agent response against custom safety criteria before it is returned to the user. If the response violates a criterion, the agent is asked to regenerate.

Prerequisites:

  • ANTHROPIC_API_KEY environment variable set with a valid API key

Run the example:

```sh
export ANTHROPIC_API_KEY="sk-..."
cargo run -p heartbit --example guardrails
```
```rust
use std::sync::Arc;

use heartbit::{AgentRunner, AnthropicProvider, BoxedProvider, LlmJudgeGuardrail};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key =
        std::env::var("ANTHROPIC_API_KEY").expect("set ANTHROPIC_API_KEY environment variable");

    let provider = Arc::new(AnthropicProvider::new(&api_key, "claude-sonnet-4-20250514"));

    // Use a cheap model as the safety judge.
    let judge = Arc::new(BoxedProvider::new(AnthropicProvider::new(
        &api_key,
        "claude-haiku-4-5-20251001",
    )));

    let guardrail = LlmJudgeGuardrail::builder(judge)
        .criterion("Response must not contain personal insults")
        .criterion("Response must not include made-up statistics")
        .build()?;

    let agent = AgentRunner::builder(provider)
        .name("safe-agent")
        .system_prompt("You are a helpful assistant. Be concise and factual.")
        .guardrail(Arc::new(guardrail))
        .max_turns(3)
        .max_tokens(2048)
        .build()?;

    let output = agent.execute("Tell me about climate change.").await?;
    println!("{}", output.result);

    Ok(())
}
```

Separate judge model — The guardrail uses its own LLM provider, typically a cheaper/faster model like Haiku. This keeps guardrail costs low while the main agent uses a more capable model. BoxedProvider wraps the provider into an object-safe type required by the guardrail builder.

Defining criteria — Each .criterion() call adds a natural-language rule that the judge evaluates the response against. You can add as many criteria as needed:

  • “Response must not contain personal insults”
  • “Response must not include made-up statistics”

The judge receives these criteria and the agent’s response, then returns a verdict: SAFE, UNSAFE, or WARN.
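
The verdict handling can be sketched as follows. This is an illustrative sketch, not the heartbit API: the Verdict enum and parse_verdict function are hypothetical names, and the sketch assumes the judge replies with one of the literal tokens SAFE, UNSAFE, or WARN.

```rust
// Hypothetical sketch: map a judge model's textual reply onto a
// three-way verdict. Unrecognized replies default to Safe, mirroring
// the guardrail's fail-open posture.
#[derive(Debug, PartialEq)]
enum Verdict {
    Safe,
    Unsafe,
    Warn,
}

fn parse_verdict(judge_reply: &str) -> Verdict {
    match judge_reply.trim().to_ascii_uppercase().as_str() {
        "UNSAFE" => Verdict::Unsafe,
        "WARN" => Verdict::Warn,
        _ => Verdict::Safe,
    }
}

fn main() {
    // Case-insensitive and whitespace-tolerant parsing.
    println!("{:?}", parse_verdict(" unsafe "));
}
```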

Wiring the guardrail — .guardrail(Arc::new(guardrail)) attaches the guardrail to the agent. Multiple guardrails can be chained; the first Deny verdict wins.
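
The "first Deny wins" chaining can be sketched like this. The Decision enum and the closure-based guardrails are illustrative stand-ins, not heartbit types:

```rust
// Hypothetical sketch of short-circuiting guardrail chaining:
// evaluate guardrails in order and stop at the first Deny.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny(String),
}

type Guardrail = Box<dyn Fn(&str) -> Decision>;

fn evaluate_chain(response: &str, guardrails: &[Guardrail]) -> Decision {
    for g in guardrails {
        if let Decision::Deny(reason) = g(response) {
            // First Deny short-circuits the chain.
            return Decision::Deny(reason);
        }
    }
    Decision::Allow
}

fn main() {
    let chain: Vec<Guardrail> = vec![
        Box::new(|r: &str| {
            if r.contains("idiot") {
                Decision::Deny("personal insult".to_string())
            } else {
                Decision::Allow
            }
        }),
        Box::new(|r: &str| {
            if r.contains("97.3%") {
                Decision::Deny("made-up statistic".to_string())
            } else {
                Decision::Allow
            }
        }),
    ];
    println!("{:?}", evaluate_chain("hello there", &chain));
}
```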

How it works at runtime:

  1. The agent generates a response
  2. The post_llm hook sends the response to the judge model
  3. If the judge returns UNSAFE, the response is blocked and the agent receives feedback to try again
  4. If the judge returns SAFE or WARN, the response passes through
  5. The guardrail is fail-open: if the judge errors or times out, the response is allowed through (production-safe behavior)
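
Steps 3–5 amount to a block/pass decision. The sketch below is illustrative, not the actual post_llm hook; Verdict, JudgeError, and should_block are hypothetical names, assuming the judge call yields a Result:

```rust
// Hypothetical sketch of the block/pass decision after the judge runs.
#[derive(Debug, PartialEq)]
enum Verdict {
    Safe,
    Unsafe,
    Warn,
}

#[derive(Debug)]
struct JudgeError;

/// Returns true when the response should be blocked and regenerated.
fn should_block(judge_result: Result<Verdict, JudgeError>) -> bool {
    match judge_result {
        Ok(Verdict::Unsafe) => true, // blocked: agent gets feedback and retries
        Ok(Verdict::Safe) | Ok(Verdict::Warn) => false, // passes through
        Err(_) => false, // fail-open: judge error or timeout allows the response
    }
}

fn main() {
    println!("{}", should_block(Ok(Verdict::Unsafe)));
}
```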

Turn budget — max_turns(3) gives the agent room to regenerate if a response is denied. A denied response consumes a turn, so set max_turns higher than 1 when using guardrails.
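
The interaction between denials and the turn budget can be sketched as a bounded retry loop. This is a simplified stand-in, not heartbit's agent loop; generate and approve are hypothetical closures:

```rust
// Hypothetical sketch: each denied response consumes one turn;
// the loop gives up once the budget is exhausted.
fn run_with_budget(
    max_turns: u32,
    mut generate: impl FnMut(u32) -> String,
    approve: impl Fn(&str) -> bool,
) -> Option<String> {
    for turn in 0..max_turns {
        let response = generate(turn);
        if approve(&response) {
            return Some(response); // guardrail-approved response
        }
        // Denied: the turn is spent; regenerate on the next iteration.
    }
    None // budget exhausted without an approved response
}

fn main() {
    // First attempt is denied, second is approved: 2 of 3 turns used.
    let result = run_with_budget(
        3,
        |turn| if turn == 0 { "draft".to_string() } else { "final".to_string() },
        |r| r == "final",
    );
    println!("{:?}", result);
}
```

With max_turns(1), the same denied first attempt would exhaust the budget and yield no approved response, which is why the text recommends a budget above 1.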

The agent provides a factual response about climate change. If its first attempt contained made-up statistics, the guardrail would deny it and the agent would regenerate. The final output is a guardrail-approved response:

Climate change refers to long-term shifts in global temperatures and weather patterns...