
Prefix Caching

Reuse the KV cache across requests that share a prefix. The result is lower time-to-first-token (TTFT) on long system prompts and cheaper retrieval-heavy chats, with no client code to write. Routing prefers the backend that already holds your prefix in memory.

client_code: 0 lines (on by default, every tier)
cached_token_price: 25% of the full per-token rate
routing_signal: prefix affinity (scored alongside latency, health, quota)
reported_in: prompt_tokens_details.cached_tokens (every chat completion response)

Overview

Large language models process input tokens through a series of attention layers, producing intermediate key-value (KV) cache entries for each token. When two requests share the same prefix (for example, the same system prompt), the KV cache entries for that prefix are identical. Prefix caching avoids recomputing these shared entries by routing requests to backends that already have the relevant cache loaded.

Xerotier handles prefix caching automatically. No client-side configuration is needed. All requests benefit from prefix caching by default.

Benefits

  • Reduced time-to-first-token (TTFT). Cached prefixes skip the prefill computation phase, reducing latency for the first generated token.
  • Lower compute cost. Fewer tokens need to be processed on the GPU, freeing capacity for other requests.
  • Higher throughput. Backends can serve more requests per second when prefill is partially skipped.

How It Works

When a request arrives, Xerotier identifies which backends already hold cached KV entries for the request's prompt prefix and prefers routing to those backends. The backend's inference engine handles the actual KV cache storage and reuse. This means prefix caching works transparently across all backends without any per-request configuration.

Two prompts that share the same opening messages will automatically benefit from cache reuse when routed to the same backend. Two LRU layers cooperate: the router maintains a per-worker shadow index of recently seen prefix block hashes (used purely for routing affinity), and the backend's inference engine maintains the actual paged-attention KV cache. Both evict on an LRU basis, so frequently used prefixes stay warm while rarely used ones are reclaimed.
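The routing side of this can be illustrated with a small sketch. It is not Xerotier's implementation: the block size of 16 tokens, the SHA-256 chaining, and the names prefix_block_hashes and ShadowIndex are all assumptions chosen for illustration. It shows why a shared prefix yields identical leading block hashes and how an LRU index can count how much of a prompt a worker has probably seen.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # assumed tokens per block; the real value is engine-specific


def prefix_block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the previous block's hash, so a
    block's hash depends on every token that precedes it."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ShadowIndex:
    """Hypothetical per-worker LRU set of recently seen block hashes.
    It holds no KV data; it is only a routing hint."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._seen = OrderedDict()

    def record(self, hashes):
        for h in hashes:
            self._seen[h] = True
            self._seen.move_to_end(h)  # mark as most recently used
            while len(self._seen) > self.capacity:
                self._seen.popitem(last=False)  # evict least recently used

    def matched_blocks(self, hashes):
        """Count leading blocks already seen; stop at the first miss."""
        count = 0
        for h in hashes:
            if h not in self._seen:
                break
            count += 1
        return count
```

Because each hash is chained to its parent, changing a single early token changes every later hash, which is why matched_blocks stops at the first miss instead of counting matches anywhere in the prompt.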

Prompt Optimization

While prefix caching is automatic, you can structure your prompts to maximize cache hit rates. The key principle is: put stable content at the beginning and variable content at the end.

Best Practices

Practice | Why It Helps
Place system prompts first | System prompts are typically identical across requests. Placing them first ensures the shared prefix is as long as possible.
Use consistent prompt templates | Even small differences (extra whitespace, reworded instructions) create different cache keys and prevent cache reuse.
Keep shared context before variable content | Place few-shot examples, RAG context, or conversation history after the system prompt but before the user's new message.
Avoid per-request timestamps or IDs in early messages | Dynamic values in early messages invalidate all downstream cache entries, preventing any cache reuse.

Example: Optimized Prompt Structure

JSON cache_friendly
{
  "messages": [
    {"role": "system", "content": "You are a helpful customer support agent for Acme Corp. Always be polite and reference our return policy when relevant."},
    {"role": "user", "content": "I want to return my order #12345."}
  ]
}

In this example, the system message is identical across all customer support requests. The backend keeps the KV entries for the system prompt in its cache, and the router steers later requests to that backend so the entries are reused regardless of the user's specific question.

Example: Suboptimal Prompt Structure

JSON cache_busting
{
  "messages": [
    {"role": "system", "content": "You are a helpful agent. Current time: 2025-01-15T10:30:00Z. Session: abc-123."},
    {"role": "user", "content": "I want to return my order #12345."}
  ]
}

Here, the timestamp and session ID in the system prompt change with every request, so everything from the timestamp onward must be recomputed and effectively nothing is reused. Move dynamic values to a later message, ideally the user message.

The single biggest cache-killer is a dynamic value in early messages. A timestamp, request ID, user ID, or random nonce in the system prompt invalidates every downstream cache entry, even if the rest of the prompt is byte-identical. Keep dynamic values in user messages or trailing tool results, not in the system prompt.
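One way to apply this rule is to build every request through a single helper that fixes the message order. The sketch below is illustrative only: build_messages, its parameters, and the bracketed metadata format are hypothetical, and the system prompt is borrowed from the cache-friendly example.

```python
import json

# Stable across all requests, so it forms the shared prefix.
SYSTEM_PROMPT = (
    "You are a helpful customer support agent for Acme Corp. "
    "Always be polite and reference our return policy when relevant."
)


def build_messages(user_text, shared_context=(), request_meta=None):
    """Order messages from most stable to most variable."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Shared context (few-shot examples, history) follows the system prompt.
    messages.extend(shared_context)
    # Per-request values travel with the final user message, after the prefix.
    if request_meta:
        meta = json.dumps(request_meta, sort_keys=True)
        user_text = f"{user_text}\n\n[request metadata: {meta}]"
    messages.append({"role": "user", "content": user_text})
    return messages
```

Two calls with different timestamps or session IDs then produce byte-identical system messages, and only the final user message differs.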

Multi-Turn Conversations

In multi-turn conversations, the entire conversation history up to the current turn forms the prefix. As the conversation grows, more of the prompt is shared between turns. The first few turns may not benefit much from caching, but longer conversations see increasing cache hit rates because each new turn only adds content at the end of the existing prefix.
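The growth of the shared prefix can be seen with a toy conversation. This sketch counts whole messages as a rough proxy; the real cache works on tokens, and shared_leading_messages is a hypothetical helper, not part of any API.

```python
def shared_leading_messages(previous, current):
    """Count messages at the start of `current` that match `previous` exactly."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


history = [{"role": "system", "content": "You are a helpful agent."}]
requests = []
for turn in range(1, 4):
    history.append({"role": "user", "content": f"question {turn}"})
    requests.append(list(history))  # snapshot sent as this turn's request
    history.append({"role": "assistant", "content": f"answer {turn}"})

for prev, curr in zip(requests, requests[1:]):
    shared = shared_leading_messages(prev, curr)
    print(f"{shared} of {len(curr)} messages repeat the previous request")
# prints:
# 2 of 4 messages repeat the previous request
# 4 of 6 messages repeat the previous request
```

The repeated share rises from one half to two thirds over these turns and keeps rising as the conversation grows, provided earlier messages are resent unchanged.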

Cached Tokens in Usage

The usage object in chat completion responses includes a prompt_tokens_details field that reports how many input tokens were served from the KV cache:

JSON
{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 42,
    "total_tokens": 192,
    "prompt_tokens_details": {
      "cached_tokens": 128,
      "audio_tokens": null
    }
  }
}

In this example, 128 of the 150 prompt tokens were served from the KV cache, meaning the backend only needed to compute the remaining 22 tokens during prefill. The higher the ratio of cached_tokens to prompt_tokens, the lower the time-to-first-token.

A cached_tokens value of 0 means no prefix cache was available for that request. This is normal for the first request with a given prefix, or when the prefix has since been evicted from the backend's KV cache.
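A minimal sketch of reading this field from a parsed response follows. The helper names are hypothetical, and treating missing or null fields as zero is a defensive assumption, not documented behavior.

```python
def cached_ratio(usage):
    """Fraction of prompt tokens served from cache for one response."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


def aggregate_ratio(usages):
    """Token-weighted cache ratio across many responses."""
    total_prompt = sum((u.get("prompt_tokens") or 0) for u in usages)
    total_cached = sum(
        ((u.get("prompt_tokens_details") or {}).get("cached_tokens") or 0)
        for u in usages
    )
    return total_cached / total_prompt if total_prompt else 0.0
```

For the usage object shown above, cached_ratio returns 128 / 150, roughly 0.85. The aggregate over many responses is usually the more useful number, since any single request may land on a cold backend.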

Cost Implications

On billable tiers, cached tokens are charged at a fraction of the full per-token rate. Each tier has a cached_token_cost_multiplier that determines the discount: a multiplier of 0.25 means cached tokens are charged at 25% of the standard input token price (a 75% discount). Free and Self-Hosted tiers do not meter cached tokens.

For example, on the GPU NVIDIA Shared tier ($1.25 / 1M tokens), cached tokens are billed at $0.3125 / 1M tokens. If a request has 1,000 prompt tokens and 800 are served from cache, you are charged for 200 full-price tokens plus 800 discounted tokens rather than 1,000 full-price tokens.

cost = (uncached_tokens / 1_000_000) * price_per_1m + (cached_tokens / 1_000_000) * price_per_1m * cached_token_cost_multiplier

The cached_tokens count is reported in usage.prompt_tokens_details.cached_tokens in every response, so you can monitor savings per request. See Service Tiers: Cached Token Pricing for the per-tier multiplier table.
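The formula above can be checked against the worked example with a few lines of Python. prompt_cost is a hypothetical helper; it covers prompt-side cost only, and the 0.25 multiplier is taken from the example, so confirm the value for your tier.

```python
def prompt_cost(prompt_tokens, cached_tokens, price_per_1m, multiplier):
    """Prompt-side cost in dollars, following the formula above."""
    uncached = prompt_tokens - cached_tokens
    full = uncached / 1_000_000 * price_per_1m
    discounted = cached_tokens / 1_000_000 * price_per_1m * multiplier
    return full + discounted


with_cache = prompt_cost(1000, 800, price_per_1m=1.25, multiplier=0.25)
without_cache = prompt_cost(1000, 0, price_per_1m=1.25, multiplier=0.25)
print(f"${with_cache:.6f} vs ${without_cache:.6f}")
# prints: $0.000500 vs $0.001250
```

With 800 of 1,000 prompt tokens cached, the prompt-side cost falls from $0.00125 to $0.0005, a 60% reduction for that request.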

Troubleshooting

Most "cache isn't working" reports trace back to one of three causes. Check them in order before opening a support thread.

Symptom | Likely cause | What to check
cached_tokens is always 0 | Dynamic value early in the prompt | Inspect the first 1-2 messages. Strip timestamps, request IDs, and per-request metadata; move them to the user message.
cached_tokens only sometimes > 0 | Prefix landed on a different backend | Affinity is preferred, not guaranteed. Under load or after restart, routing may pick a cold backend. The ratio averages out across requests; treat any single response as a sample, not a verdict.
Was caching, now isn't | Prompt template drift | Diff the current prompt against a known-good run. Extra whitespace, rewrapped lines, or a reordered tool list all break the prefix hash.
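For the template-drift case, a small script can locate where two prompts first diverge. This is a sketch under assumptions: first_divergence is a hypothetical helper, and it expects each message's content to be a plain string.

```python
def first_divergence(known_good, current):
    """Locate the first message and character where two prompts differ.

    Returns (message_index, char_index), or None if they are identical.
    """
    for i, (old, new) in enumerate(zip(known_good, current)):
        if old.get("role") != new.get("role"):
            return (i, 0)
        a, b = old.get("content", ""), new.get("content", "")
        if a != b:
            for j, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    return (i, j)
            # One string is a strict prefix of the other.
            return (i, min(len(a), len(b)))
    if len(known_good) != len(current):
        # One prompt has extra trailing messages.
        return (min(len(known_good), len(current)), 0)
    return None


good = [{"role": "system", "content": "You are a helpful agent."}]
drifted = [{"role": "system", "content": "You are a  helpful agent."}]
print(first_divergence(good, drifted))  # prints: (0, 10)
```

A divergence in the final user message is expected. A divergence in message 0 or 1, as with the doubled space here, points to template drift.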

Tips

  • Use streaming for perceived latency. Streaming delivers tokens as they are generated, improving perceived responsiveness even when total latency is unchanged. See the Streaming API.
  • Choose the right tier. GPU tiers offer the lowest latency for interactive workloads. CPU tiers are better suited for batch processing where latency is less critical. See Service Tiers.
  • Monitor cached_tokens. Track the cached_tokens field in usage responses over time. A consistently low ratio may indicate that your prompt templates are not stable enough for caching.
  • Avoid unnecessary prompt variation. Randomized few-shot example ordering, per-request metadata in system prompts, and dynamic timestamps all reduce cache effectiveness.