
Response Caching

Heartbit includes a built-in LRU cache for LLM completion responses. When a request repeats a previously seen system prompt, message history, and tool configuration, the cached response is returned immediately, skipping the API call.

ResponseCache is a thread-safe LRU cache backed by a Vec with move-to-front on hit and eviction from the back. It uses std::sync::Mutex (never held across .await) for safe concurrent access.
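The move-to-front strategy described above can be sketched as follows. `MiniLru` and its methods are illustrative names, not heartbit's internals, and the real `ResponseCache` stores full completion responses rather than `String`s:

```rust
use std::sync::Mutex;

// Sketch of a Vec-backed LRU guarded by a Mutex: hits move the entry
// to the front; when full, the least recently used entry is popped
// from the back. The Mutex is only held for the duration of each
// method call, never across an .await point.
struct MiniLru {
    capacity: usize,
    entries: Mutex<Vec<(u64, String)>>, // front = most recently used
}

impl MiniLru {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: Mutex::new(Vec::new()) }
    }

    fn get(&self, key: u64) -> Option<String> {
        let mut entries = self.entries.lock().unwrap();
        let pos = entries.iter().position(|(k, _)| *k == key)?;
        let hit = entries.remove(pos); // O(n) scan + shift
        entries.insert(0, hit);        // move to front on hit
        Some(entries[0].1.clone())
    }

    fn put(&self, key: u64, value: String) {
        if self.capacity == 0 {
            return; // capacity 0 disables caching entirely
        }
        let mut entries = self.entries.lock().unwrap();
        if let Some(pos) = entries.iter().position(|(k, _)| *k == key) {
            entries.remove(pos);
        } else if entries.len() >= self.capacity {
            entries.pop(); // evict least recently used from the back
        }
        entries.insert(0, (key, value));
    }

    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

fn main() {
    let cache = MiniLru::new(2);
    cache.put(1, "one".into());
    cache.put(2, "two".into());
    cache.get(1);                  // 1 moves to the front
    cache.put(3, "three".into());  // evicts 2 from the back
    assert!(cache.get(2).is_none());
    assert!(cache.get(1).is_some());
    println!("len = {}", cache.len());
}
```

Because the interior `Mutex` provides the synchronization, `get` and `put` take `&self`, which is what allows the cache to be shared behind an `Arc` without an outer lock.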

Cache keys are computed from three components via FNV-1a hashing:

  1. System prompt — the full system prompt text
  2. Messages — the serialized conversation history
  3. Tool names — sorted before hashing (order-independent)

Two requests with the same system prompt, message history, and available tools will produce the same cache key regardless of tool ordering.
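A key derivation along these lines can be sketched with FNV-1a. Heartbit's exact field serialization may differ, so treat `compute_key` below as an illustrative stand-in rather than the library's implementation; the sort is what makes tool ordering irrelevant:

```rust
// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the
// FNV prime. Fast, allocation-free, and deterministic.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

// Hash the three key components in a fixed order, sorting tool names
// first so that ["tool_a", "tool_b"] and ["tool_b", "tool_a"] agree.
fn compute_key(system_prompt: &str, messages: &[&str], tools: &[&str]) -> u64 {
    let mut sorted_tools: Vec<&str> = tools.to_vec();
    sorted_tools.sort_unstable();
    let mut buf = Vec::new();
    buf.extend_from_slice(system_prompt.as_bytes());
    buf.push(0); // separator so adjacent fields can't run together
    for m in messages {
        buf.extend_from_slice(m.as_bytes());
        buf.push(0);
    }
    for t in &sorted_tools {
        buf.extend_from_slice(t.as_bytes());
        buf.push(0);
    }
    fnv1a(&buf)
}

fn main() {
    let a = compute_key("You are a researcher.", &["hi"], &["tool_a", "tool_b"]);
    let b = compute_key("You are a researcher.", &["hi"], &["tool_b", "tool_a"]);
    assert_eq!(a, b); // tool order does not change the key
    let c = compute_key("Different prompt.", &["hi"], &["tool_a", "tool_b"]);
    assert_ne!(a, c); // any change to a component changes the key
    println!("key = {a:#x}");
}
```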

```rust
use std::sync::Arc;

use heartbit::AgentRunner;
use heartbit::agent::ResponseCache;

let cache = Arc::new(ResponseCache::new(50)); // 50-entry LRU

let runner = AgentRunner::builder(provider)
    .name("researcher")
    .system_prompt("You are a researcher.")
    .response_cache(cache.clone())
    .build()?;
```

Wire the cache via AgentRunnerBuilder::response_cache(Arc<ResponseCache>).

| Use Case | Capacity | Notes |
| --- | --- | --- |
| Development/testing | 10-50 | Fast iteration, low memory |
| Deterministic pipelines | 50-100 | Repeated queries with same inputs |
| Interactive agents | 0 (disabled) | Conversations rarely repeat exactly |

The cache uses O(n) operations per access, which is efficient for typical capacities (10-100). For very large caches (1000+), consider an alternative approach.
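One such alternative is keying entries in a HashMap so that hits are O(1), with recency tracked by a monotonic counter. The sketch below (hypothetical names, not part of heartbit's API) still scans on eviction, but only when inserting into a full cache:

```rust
use std::collections::HashMap;

// HashMap-backed cache: O(1) get/put in the common case; the O(n)
// LRU scan only runs when inserting a new key into a full cache.
struct BigCache {
    capacity: usize,
    tick: u64, // monotonic access counter standing in for recency
    entries: HashMap<u64, (u64, String)>, // key -> (last_used, value)
}

impl BigCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    fn get(&mut self, key: u64) -> Option<String> {
        self.tick += 1;
        let tick = self.tick;
        match self.entries.get_mut(&key) {
            Some(e) => {
                e.0 = tick; // refresh recency on hit
                Some(e.1.clone())
            }
            None => None,
        }
    }

    fn put(&mut self, key: u64, value: String) {
        if self.capacity == 0 {
            return;
        }
        self.tick += 1;
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            // Evict the entry with the smallest (oldest) tick.
            let lru_key = self.entries.iter()
                .min_by_key(|entry| entry.1 .0)
                .map(|entry| *entry.0);
            if let Some(k) = lru_key {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (self.tick, value));
    }
}

fn main() {
    let mut cache = BigCache::new(2);
    cache.put(1, "one".into());
    cache.put(2, "two".into());
    cache.get(1);                  // refreshes key 1
    cache.put(3, "three".into());  // evicts key 2 (oldest tick)
    assert!(cache.get(2).is_none());
    assert!(cache.get(1).is_some());
}
```

Note the trade-off: this version takes `&mut self`, so sharing it across threads would require an outer lock, whereas the Vec-backed design synchronizes internally. A doubly linked list over the map entries would make eviction O(1) as well.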

Response caching is most effective for:

  • Deterministic tasks where the same input reliably produces the same useful output
  • Cost reduction on repeated queries during development or batch processing
  • Testing scenarios where you want consistent LLM responses without API calls

Response caching is not recommended for:

  • Interactive conversations (each turn produces unique message history)
  • Tasks where freshness matters (e.g., web search followed by analysis)
  • Agents with non-deterministic tool results that feed back into prompts
```rust
// Create with max entries
let cache = ResponseCache::new(100);

// Manual key computation
let key = ResponseCache::compute_key(
    "system prompt",
    &messages,
    &["tool_a", "tool_b"],
);

// Manual get/put
if let Some(response) = cache.get(key) {
    // Cache hit
} else {
    let response = provider.complete(request).await?;
    cache.put(key, response.clone());
}

// Utilities
cache.len();      // Current entry count
cache.is_empty(); // True if empty
cache.clear();    // Remove all entries
```

When wired via AgentRunnerBuilder, the cache is checked automatically before each LLM call and populated after each successful response.

The cache is wrapped in Arc, so you can share a single cache instance across multiple agents:

```rust
let shared_cache = Arc::new(ResponseCache::new(100));

let agent_a = AgentRunner::builder(provider.clone())
    .name("agent-a")
    .system_prompt("You analyze code.")
    .response_cache(shared_cache.clone())
    .build()?;

let agent_b = AgentRunner::builder(provider.clone())
    .name("agent-b")
    .system_prompt("You analyze code.") // Same prompt = cache hits
    .response_cache(shared_cache.clone())
    .build()?;
```

Agents with identical system prompts and tool sets will share cache hits. Agents with different prompts naturally produce different cache keys.