# Prefix Caching
Automatic KV cache reuse for faster inference on repeated or similar prompts.
## Overview
Large language models process input tokens through a series of attention layers, producing intermediate key-value (KV) cache entries for each token. When two requests share the same prefix (for example, the same system prompt), the KV cache entries for that prefix are identical. Prefix caching avoids recomputing these shared entries by routing requests to backends that already have the relevant cache loaded.
Xerotier handles prefix caching automatically. No client-side configuration is needed -- all requests benefit from prefix caching by default.
## Benefits
- Reduced time-to-first-token (TTFT) -- Cached prefixes skip the prefill computation phase, reducing latency for the first generated token.
- Lower compute cost -- Fewer tokens need to be processed on the GPU, freeing capacity for other requests.
- Higher throughput -- Backends can serve more requests per second when prefill is partially skipped.
## How It Works
When a request arrives, Xerotier identifies which backends already hold cached KV entries for the request's prompt prefix and prefers routing to those backends. The backend's inference engine handles the actual KV cache storage and reuse. This means prefix caching works transparently across all backends without any per-request configuration.
Two prompts that share the same opening messages will automatically benefit from cache reuse when routed to the same backend. Cache entries are evicted on an LRU basis, so frequently used prefixes stay warm while rarely used ones are reclaimed.
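This routing preference can be sketched as a longest-prefix match over token sequences. The sketch below is illustrative only -- Xerotier does not expose its routing internals, and `pick_backend` and the cache map are hypothetical names:

```python
def longest_common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_backend(prompt_tokens, backend_caches):
    """Prefer the backend whose cached prefix overlaps the prompt most.

    backend_caches maps a backend name to the token prefix it has cached.
    Returns (backend, number_of_cached_tokens_reusable_for_this_prompt).
    """
    best, best_len = None, -1
    for backend, cached in backend_caches.items():
        match = longest_common_prefix_len(prompt_tokens, cached)
        if match > best_len:
            best, best_len = backend, match
    return best, best_len
```

A backend holding a longer matching prefix lets the engine skip more of the prefill, which is why identical opening messages matter so much.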
## Prompt Optimization
While prefix caching is automatic, you can structure your prompts to maximize cache hit rates. The key principle is: put stable content at the beginning and variable content at the end.
### Best Practices
| Practice | Why It Helps |
|---|---|
| Place system prompts first | System prompts are typically identical across requests. Placing them first ensures the shared prefix is as long as possible. |
| Use consistent prompt templates | Even small differences (extra whitespace, reworded instructions) create different cache keys and prevent cache reuse. |
| Keep shared context before variable content | Place few-shot examples, RAG context, or conversation history after the system prompt but before the user's new message. |
| Avoid per-request timestamps or IDs in early messages | Dynamic values in early messages invalidate all downstream cache entries, preventing any cache reuse. |
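The practices above amount to a simple rule: build every request from the same stable template and append only the variable part. A minimal sketch in Python, where `SYSTEM_PROMPT`, `FEW_SHOT`, and `build_messages` are illustrative names rather than part of any Xerotier API:

```python
# Stable across all requests -- this text forms the shared, cacheable prefix.
SYSTEM_PROMPT = (
    "You are a helpful customer support agent for Acme Corp. "
    "Always be polite and reference our return policy when relevant."
)

# Shared few-shot examples: also stable, placed before the variable content.
FEW_SHOT = [
    {"role": "user", "content": "Where is my package?"},
    {"role": "assistant", "content": "Could you share your order number?"},
]

def build_messages(user_message: str) -> list:
    """Assemble messages with stable content first, variable content last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *FEW_SHOT,
        {"role": "user", "content": user_message},
    ]
```

Because every request shares the same leading messages byte-for-byte, only the final user message falls outside the cached prefix.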
### Example: Optimized Prompt Structure
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful customer support agent for Acme Corp. Always be polite and reference our return policy when relevant."},
    {"role": "user", "content": "I want to return my order #12345."}
  ]
}
```
In this example, the system message is identical across all customer support requests. The backend can cache the KV entries for the system prompt and reuse them for every request it serves, regardless of the user's specific question.
### Example: Suboptimal Prompt Structure
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful agent. Current time: 2025-01-15T10:30:00Z. Session: abc-123."},
    {"role": "user", "content": "I want to return my order #12345."}
  ]
}
```
Here, the timestamp and session ID in the system prompt change with every request, preventing any prefix caching. Move dynamic values to a later message or to the user message if possible.
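One way to apply this fix is to keep the system prompt free of per-request values and attach dynamic metadata to the final user message. A sketch, assuming the model can consume the metadata inline; `build_messages_with_metadata` is a hypothetical helper, not a Xerotier API:

```python
from datetime import datetime, timezone

# No per-request values here: the system prompt stays identical, so it can
# be served from the prefix cache on every request.
STABLE_SYSTEM = "You are a helpful agent."

def build_messages_with_metadata(user_message: str, session_id: str) -> list:
    """Keep dynamic values out of the shared prefix by appending them last."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": STABLE_SYSTEM},
        # Dynamic metadata rides along with the user's message, so only the
        # final, always-recomputed segment of the prompt varies per request.
        {"role": "user",
         "content": f"{user_message}\n\n[session: {session_id}, time: {now}]"},
    ]
```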
## Multi-Turn Conversations
In multi-turn conversations, the entire conversation history up to the current turn forms the prefix. As the conversation grows, more of the prompt is shared between turns. The first few turns may not benefit much from caching, but longer conversations see increasing cache hit rates because each new turn only adds content at the end of the existing prefix.
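An append-only history preserves this property: each turn extends the previous prompt without rewriting it, so every earlier turn stays inside the cached prefix. A minimal sketch, where `append_turn` is an illustrative helper:

```python
def append_turn(history, user_message, assistant_reply):
    """Extend the conversation by appending only.

    Earlier messages are never modified, so the entire previous history
    remains a reusable cached prefix on the next request.
    """
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ]
```

Editing or reordering earlier messages (for example, rewriting the system prompt mid-conversation) would invalidate the cached prefix from that point onward.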
## Cached Tokens in Usage
The `usage` object in chat completion responses includes a `prompt_tokens_details` field that reports how many input tokens were served from the KV cache:
```json
{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 42,
    "total_tokens": 192,
    "prompt_tokens_details": {
      "cached_tokens": 128
    }
  }
}
```
In this example, 128 of the 150 prompt tokens were served from the KV cache, meaning the backend only needed to compute the remaining 22 tokens during prefill. The higher the `cached_tokens` ratio, the faster the time-to-first-token.
A `cached_tokens` value of 0 means no prefix cache was available for that request. This is normal for the first request with a given prefix, or when the backend's KV cache entries have been evicted.
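To track how effectively your prompts are caching, you can compute the hit ratio directly from the usage object. A small sketch; `cache_hit_ratio` is an illustrative helper, not part of any SDK:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from the KV cache (0.0 to 1.0)."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0
```

Using the response above, the ratio is 128 / 150, roughly 0.85; a consistently low value across requests suggests your prompt templates are not stable enough for caching.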
## Tips
- Use streaming for perceived latency -- Streaming delivers tokens as they are generated, improving perceived responsiveness even when total latency is unchanged. See the Streaming API.
- Choose the right tier -- GPU tiers offer the lowest latency for interactive workloads. CPU tiers are better suited for batch processing where latency is less critical. See Service Tiers.
- Monitor cached_tokens -- Track the `cached_tokens` field in usage responses over time. A consistently low ratio may indicate that your prompt templates are not stable enough for caching.
- Avoid unnecessary prompt variation -- Randomized few-shot example ordering, per-request metadata in system prompts, and dynamic timestamps all reduce cache effectiveness.