
Response Caching

Heartbit includes a built-in LRU cache for LLM completion responses. When a request repeats a previously seen system prompt, message history, and tool configuration, the cached response is returned immediately, skipping the API call.

ResponseCache is a thread-safe LRU cache backed by a Vec with move-to-front on hit and eviction from the back. It uses std::sync::Mutex (never held across .await) for safe concurrent access.
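The move-to-front strategy described above can be sketched as follows. `MiniLru` and its methods are illustrative names, not heartbit's internals, and the real `ResponseCache` stores full completion responses rather than `String`s:

```rust
use std::sync::Mutex;

// Sketch of a Vec-backed LRU guarded by a Mutex: hits move the entry
// to the front; when full, the least recently used entry is popped
// from the back. The Mutex is only held for the duration of each
// method call, never across an .await point.
struct MiniLru {
    capacity: usize,
    entries: Mutex<Vec<(u64, String)>>, // front = most recently used
}

impl MiniLru {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: Mutex::new(Vec::new()) }
    }

    fn get(&self, key: u64) -> Option<String> {
        let mut entries = self.entries.lock().unwrap();
        let pos = entries.iter().position(|(k, _)| *k == key)?;
        let hit = entries.remove(pos); // O(n) scan + shift
        entries.insert(0, hit);        // move to front on hit
        Some(entries[0].1.clone())
    }

    fn put(&self, key: u64, value: String) {
        if self.capacity == 0 {
            return; // capacity 0 disables caching entirely
        }
        let mut entries = self.entries.lock().unwrap();
        if let Some(pos) = entries.iter().position(|(k, _)| *k == key) {
            entries.remove(pos);
        } else if entries.len() >= self.capacity {
            entries.pop(); // evict least recently used from the back
        }
        entries.insert(0, (key, value));
    }

    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

fn main() {
    let cache = MiniLru::new(2);
    cache.put(1, "one".into());
    cache.put(2, "two".into());
    cache.get(1);                  // 1 moves to the front
    cache.put(3, "three".into());  // evicts 2 from the back
    assert!(cache.get(2).is_none());
    assert!(cache.get(1).is_some());
    println!("len = {}", cache.len());
}
```

Because the interior `Mutex` provides the synchronization, `get` and `put` take `&self`, which is what allows the cache to be shared behind an `Arc` without an outer lock.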

Cache keys are computed from three components via FNV-1a hashing:

  1. System prompt — the full system prompt text
  2. Messages — the serialized conversation history
  3. Tool names — sorted before hashing (order-independent)

Two requests with the same system prompt, message history, and available tools will produce the same cache key regardless of tool ordering.
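A key derivation along these lines can be sketched with FNV-1a. Heartbit's exact field serialization may differ, so treat `compute_key` below as an illustrative stand-in rather than the library's implementation; the sort is what makes tool ordering irrelevant:

```rust
// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the
// FNV prime. Fast, allocation-free, and deterministic.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

// Hash the three key components in a fixed order, sorting tool names
// first so that ["tool_a", "tool_b"] and ["tool_b", "tool_a"] agree.
fn compute_key(system_prompt: &str, messages: &[&str], tools: &[&str]) -> u64 {
    let mut sorted_tools: Vec<&str> = tools.to_vec();
    sorted_tools.sort_unstable();
    let mut buf = Vec::new();
    buf.extend_from_slice(system_prompt.as_bytes());
    buf.push(0); // separator so adjacent fields can't run together
    for m in messages {
        buf.extend_from_slice(m.as_bytes());
        buf.push(0);
    }
    for t in &sorted_tools {
        buf.extend_from_slice(t.as_bytes());
        buf.push(0);
    }
    fnv1a(&buf)
}

fn main() {
    let a = compute_key("You are a researcher.", &["hi"], &["tool_a", "tool_b"]);
    let b = compute_key("You are a researcher.", &["hi"], &["tool_b", "tool_a"]);
    assert_eq!(a, b); // tool order does not change the key
    let c = compute_key("Different prompt.", &["hi"], &["tool_a", "tool_b"]);
    assert_ne!(a, c); // any change to a component changes the key
    println!("key = {a:#x}");
}
```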

```rust
use std::sync::Arc;

use heartbit::AgentRunner;
use heartbit::agent::ResponseCache;

let cache = Arc::new(ResponseCache::new(50)); // 50-entry LRU

let runner = AgentRunner::builder(provider)
    .name("researcher")
    .system_prompt("You are a researcher.")
    .response_cache(cache.clone())
    .build()?;
```

Wire the cache via AgentRunnerBuilder::response_cache(Arc<ResponseCache>).

| Use Case | Capacity | Notes |
| --- | --- | --- |
| Development/testing | 10-50 | Fast iteration, low memory |
| Deterministic pipelines | 50-100 | Repeated queries with same inputs |
| Interactive agents | 0 (disabled) | Conversations rarely repeat exactly |

The cache uses O(n) operations per access, which is efficient for typical capacities (10-100). For very large caches (1000+), consider an alternative approach.
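One such alternative is keying entries in a HashMap so that hits are O(1), with recency tracked by a monotonic counter. The sketch below (hypothetical names, not part of heartbit's API) still scans on eviction, but only when inserting into a full cache:

```rust
use std::collections::HashMap;

// HashMap-backed cache: O(1) get/put in the common case; the O(n)
// LRU scan only runs when inserting a new key into a full cache.
struct BigCache {
    capacity: usize,
    tick: u64, // monotonic access counter standing in for recency
    entries: HashMap<u64, (u64, String)>, // key -> (last_used, value)
}

impl BigCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    fn get(&mut self, key: u64) -> Option<String> {
        self.tick += 1;
        let tick = self.tick;
        match self.entries.get_mut(&key) {
            Some(e) => {
                e.0 = tick; // refresh recency on hit
                Some(e.1.clone())
            }
            None => None,
        }
    }

    fn put(&mut self, key: u64, value: String) {
        if self.capacity == 0 {
            return;
        }
        self.tick += 1;
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            // Evict the entry with the smallest (oldest) tick.
            let lru_key = self.entries.iter()
                .min_by_key(|entry| entry.1 .0)
                .map(|entry| *entry.0);
            if let Some(k) = lru_key {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (self.tick, value));
    }
}

fn main() {
    let mut cache = BigCache::new(2);
    cache.put(1, "one".into());
    cache.put(2, "two".into());
    cache.get(1);                  // refreshes key 1
    cache.put(3, "three".into());  // evicts key 2 (oldest tick)
    assert!(cache.get(2).is_none());
    assert!(cache.get(1).is_some());
}
```

Note the trade-off: this version takes `&mut self`, so sharing it across threads would require an outer lock, whereas the Vec-backed design synchronizes internally. A doubly linked list over the map entries would make eviction O(1) as well.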

Response caching is most effective for:

  • Deterministic tasks where the same input reliably produces the same useful output
  • Cost reduction on repeated queries during development or batch processing
  • Testing scenarios where you want consistent LLM responses without API calls

Response caching is not recommended for:

  • Interactive conversations (each turn produces unique message history)
  • Tasks where freshness matters (e.g., web search followed by analysis)
  • Agents with non-deterministic tool results that feed back into prompts
```rust
// Create with max entries
let cache = ResponseCache::new(100);

// Manual key computation
let key = ResponseCache::compute_key(
    "system prompt",
    &messages,
    &["tool_a", "tool_b"],
);

// Manual get/put
if let Some(response) = cache.get(key) {
    // Cache hit
} else {
    let response = provider.complete(request).await?;
    cache.put(key, response.clone());
}

// Utilities
cache.len();      // Current entry count
cache.is_empty(); // True if empty
cache.clear();    // Remove all entries
```

When wired via AgentRunnerBuilder, the cache is checked automatically before each LLM call and populated after each successful response.

The cache is wrapped in Arc, so you can share a single cache instance across multiple agents:

```rust
let shared_cache = Arc::new(ResponseCache::new(100));

let agent_a = AgentRunner::builder(provider.clone())
    .name("agent-a")
    .system_prompt("You analyze code.")
    .response_cache(shared_cache.clone())
    .build()?;

let agent_b = AgentRunner::builder(provider.clone())
    .name("agent-b")
    .system_prompt("You analyze code.") // Same prompt = cache hits
    .response_cache(shared_cache.clone())
    .build()?;
```

Agents with identical system prompts and tool sets will share cache hits. Agents with different prompts naturally produce different cache keys.