The Router
One OpenAI-compatible surface. Behind it, a scoring dispatcher that scores workers per request on fifteen independent signals, samples a winner with A-Res weighted-random selection, and dispatches over a mutually-authenticated CurveZMQ mesh. Managed pools, self-hosted agents, and public cloud are one fleet.
Overview
Every Xerotier inference request (chat, responses, embeddings, reranking, tool calls) arrives at the same router process that also serves the frontend page surface. The router holds a live picture of every connected worker: which models are loaded, how hot their KV cache is, how recently each one answered, how close it is to your traffic, and whether the project sending the request is over or under its fair share.
For each request, the router scores every eligible worker on more than a dozen independent signals, samples a winner with a weighted-random algorithm that still leaves room for exploration, and dispatches over an authenticated, encrypted mesh with built-in backpressure. The whole decision happens before your first token, typically in under a millisecond at the router itself.
flowchart LR
Client["Client request
(chat / responses / tools)"]
Admit["Admission
auth + credits + rate limits"]
Score["Composite scoring
15+ signals"]
Select["Weighted-random
sampling + exploration"]
Dispatch["Encrypted dispatch
over agent mesh"]
Stream["Streaming response
with credit-based flow control"]
Feedback["Post-request feedback
recalibrates the router"]
Client --> Admit
Admit --> Score
Score --> Select
Select --> Dispatch
Dispatch --> Stream
Stream -.-> Feedback
Feedback -.-> Score
Composite Scoring
Picking a worker by latency alone is a trap: the fastest worker a millisecond ago may be the one whose KV cache just evicted your prompt, the one whose queue just filled, or the one in a region a continent away. The router blends 15 distinct signals into a single composite score, then re-evaluates on every request.
The fifteen signals
- Model-cache affinity (affinity): Is the model already loaded? Cold-loading a 70B model is measured in minutes; a warm worker answers immediately.
- Prefix-cache match (prefix): How much of your prompt's KV cache the worker still holds from a prior request. More overlap, shorter prefill, faster first token.
- Latency prediction (ewma): A per-worker, exponentially-weighted moving average of recent end-to-end latency. Weighted harder when an SLO is attached.
- KV-cache pressure (pressure): Workers near their KV ceiling are deprioritized and, past a threshold, excluded entirely.
- Worker health (health): Heartbeat freshness, error rates, and probe results drive a continuous health score.
- Worker utilization (slots): How many slots remain available right now.
- Region affinity (region): Stay close to the caller when latency matters; spill out when capacity does.
- Per-project fairness (fairness): Projects under their fair share are favored; projects burning through burst capacity are softly deprioritized.
- Service-tier priority (tier): Priority requests outrank standard. Flex requests yield to both.
- Self-hosted ownership (ownership): If your project owns the agent, your requests get first pass at it.
- SLO bonus (slo): Workers consistently meeting your SLO win ties.
- Internal vs. external (origin): Internal traffic gets a configurable boost to keep system jobs from being starved by user traffic.
- Batch penalty (batch): Batch requests are weighted down so interactive traffic preempts them naturally.
- Metrics-staleness penalty (staleness): Workers whose telemetry is overdue are progressively distrusted until they re-report.
- Strong-prefix bonus (strong-prefix): A bonus tier kicks in when prefix overlap is high and the prefix-match prediction has proven accurate for that worker.
flowchart LR
subgraph Signals
S1["Model-cache affinity"]
S2["Prefix-cache match"]
S3["EWMA latency"]
S4["KV pressure"]
S5["Worker health"]
S6["Utilization"]
S7["Region affinity"]
S8["Per-project fairness"]
S9["Service tier"]
SA["...additional signals"]
end
Score{"Composite score
per worker"}
S1 --> Score
S2 --> Score
S3 --> Score
S4 --> Score
S5 --> Score
S6 --> Score
S7 --> Score
S8 --> Score
S9 --> Score
SA --> Score
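As an illustration of how several signals might collapse into one number, here is a minimal sketch. The weights, the four signals chosen, the multiplicative pressure term, and the 0.95 cutoff are all assumptions made for the example; the router's real weights and formula are not published here.

```python
from dataclasses import dataclass

# Hypothetical weights: illustrative only, not the router's real values.
WEIGHTS = {
    "model_affinity": 3.0,
    "prefix_match": 2.0,
    "health": 1.5,
    "free_slots": 1.0,
}
KV_PRESSURE_CUTOFF = 0.95  # assumed exclusion threshold


@dataclass
class WorkerSignals:
    model_affinity: float  # 1.0 if the model is already loaded, else 0.0
    prefix_match: float    # predicted fraction of prompt KV already cached
    health: float          # 0.0 (unhealthy) .. 1.0 (healthy)
    free_slots: float      # fraction of slots currently free
    kv_pressure: float     # 0.0 (empty) .. 1.0 (at the KV ceiling)


def composite_score(s: WorkerSignals) -> float:
    """Blend signals into one non-negative score; 0.0 means excluded."""
    if s.kv_pressure >= KV_PRESSURE_CUTOFF:
        return 0.0  # past the threshold the worker is excluded entirely
    base = sum(weight * getattr(s, name) for name, weight in WEIGHTS.items())
    # Below the cutoff, pressure only deprioritizes: scale the score down.
    return base * (1.0 - s.kv_pressure)
```

A warm worker with a good prefix match outscores a cold one even when the cold one has more free slots, which is the behavior the signal list describes.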
Weighted Selection & Exploration
A single best score is rarely the best answer. Always picking the top-scoring worker concentrates traffic, accelerates KV evictions on the favorite, and prevents the router from learning that a newly connected worker is actually faster. The router avoids this in two ways.
A-Res weighted-random sampling
Instead of picking the maximum score, the router draws from all
eligible workers using the A-Res algorithm (Efraimidis and
Spirakis). Each worker's selection probability is proportional to
its composite score, so the best worker wins most of the time.
A worker that is 90% as good still wins a meaningful share of
requests. Load spreads naturally, KV caches stay warm across the
pool, and a high scorer that briefly stumbles has competitors
ready to absorb the traffic.
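The selection step can be sketched in a few lines. A-Res assigns each item the key u ** (1 / w), where u is uniform random and w is the item's weight, and keeps the largest keys; with a reservoir of one, that is simply the maximum. The function name and the rule that zero-score workers are ineligible are assumptions for this example.

```python
import random


def a_res_pick(scores: dict[str, float], rng: random.Random) -> str:
    """Pick one worker with probability proportional to its score.

    A-Res (Efraimidis and Spirakis) gives each item the key u ** (1 / w)
    and keeps the items with the largest keys. With a reservoir of size
    one, that reduces to taking the single maximum key.
    """
    best_id, best_key = None, -1.0
    for worker_id, weight in scores.items():
        if weight <= 0.0:
            continue  # assumption: zero-score workers are not eligible
        key = rng.random() ** (1.0 / weight)
        if key > best_key:
            best_id, best_key = worker_id, key
    if best_id is None:
        raise ValueError("no eligible workers")
    return best_id
```

With scores of 1.0, 0.9, and 0.1, the three workers win roughly 50%, 45%, and 5% of draws, so the runner-up carries almost as much traffic as the leader.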
Epsilon-greedy exploration
A small slice of requests (five percent by default) bypasses the composite score entirely and chooses uniformly at random from the eligible pool. This keeps the router honest: newly connected workers, workers whose latency model is stale, and workers recovering from a degraded state all get sampled, measured, and folded back into the scoring model. The result is a router that keeps learning, instead of one that lock-steps to whatever was fastest an hour ago.
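A minimal sketch of the exploration branch follows. The function name and return shape are hypothetical, and the score-driven path uses the standard library's `random.choices` as a stand-in for the router's weighted sampler.

```python
import random

EXPLORE_RATE = 0.05  # the five-percent default described above


def select_worker(scores: dict[str, float], rng: random.Random,
                  epsilon: float = EXPLORE_RATE) -> tuple[str, bool]:
    """Return (worker_id, explored) for one request."""
    eligible = [w for w, s in scores.items() if s > 0.0]
    if not eligible:
        raise ValueError("no eligible workers")
    if rng.random() < epsilon:
        # Exploration: ignore scores, pick uniformly from the eligible pool.
        return rng.choice(eligible), True
    # Exploitation: stand-in for the score-weighted sampler.
    weights = [scores[w] for w in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0], False
```

Returning the `explored` flag alongside the choice lets the caller tag the resulting latency measurement, which is useful when judging whether exploration traffic is pulling its weight.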
Prefix-Cache Affinity
The router keeps a per-worker index of which prompt prefixes each worker has recently served. When a new request arrives, the router can predict, before dispatching, how much of the prompt's KV cache is likely already resident on each candidate, and prefer the candidates with the warmest match.
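One plausible shape for such an index is a set of chained block hashes per worker. The sketch below assumes a 16-token block size, SHA-256 hashing, and the helper names shown; none of these details come from the router itself.

```python
import hashlib

BLOCK_TOKENS = 16  # assumed KV block size in tokens


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block chained with every block before it."""
    hashes, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.hexdigest())
    return hashes


def predicted_match(index: set[str], token_ids: list[int]) -> float:
    """Fraction of the prompt's blocks expected to be resident."""
    hashes = block_hashes(token_ids)
    if not hashes:
        return 0.0
    hits = 0
    for h in hashes:
        if h not in index:
            break  # a prefix match ends at the first missing block
        hits += 1
    return hits / len(hashes)
```

Because each hash covers everything before it, two prompts that share a system prompt but diverge afterwards match only on the shared leading blocks.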
Self-calibrating accuracy
Every prediction is checked. After the request completes, the worker reports how many prefix blocks it actually hit, and the router compares that to what it predicted. The per-worker accuracy of those predictions is tracked with an exponential moving average and used to dampen or amplify future prefix-match scores. A worker whose cache map turns out to be a poor predictor gets less benefit of the doubt; a worker whose predictions are consistently right gets a strong-prefix bonus on top.
If accuracy collapses for any worker (usually because its cache state diverged from what the router believed), the router resets that worker's prefix index and rebuilds it from fresh evidence. No human intervention required.
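The calibration loop might look like the sketch below. The smoothing factor, the reset floor, the neutral restart value, and the class and method names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

ALPHA = 0.2        # assumed EWMA smoothing factor
RESET_BELOW = 0.3  # assumed accuracy floor that triggers a rebuild
NEUTRAL = 0.5      # assumed accuracy assigned right after a reset


@dataclass
class PrefixCalibration:
    accuracy: float = 1.0
    index: set = field(default_factory=set)
    resets: int = 0

    def observe(self, predicted: int, actual: int, total: int) -> None:
        """Fold one prediction-vs-reality comparison into the EWMA."""
        if total <= 0:
            return
        sample = 1.0 - abs(predicted - actual) / total
        self.accuracy = (1.0 - ALPHA) * self.accuracy + ALPHA * sample
        if self.accuracy < RESET_BELOW:
            self.index.clear()  # rebuild from fresh evidence
            self.accuracy = NEUTRAL
            self.resets += 1

    def adjusted(self, predicted_fraction: float) -> float:
        """Dampen a raw prefix-match prediction by tracked accuracy."""
        return predicted_fraction * self.accuracy
```

With these values, a worker that was fully trusted needs six consecutive complete mispredictions before its index is flushed, so a single bad report does not throw away a useful cache map.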
sequenceDiagram
participant Router
participant Worker
Router->>Router: Predict prefix match for request
Router->>Worker: Dispatch
Worker-->>Router: Response + actual hit/total blocks
Router->>Router: EWMA-update prediction accuracy
Note over Router: Accuracy too low?<br/>Reset this worker's index.
The user-facing effect is simple: when your prompts share stable prefixes (a system prompt, a few-shot block, a long document quoted across many turns) the router quietly funnels those requests to the workers most likely to skip the prefill, without you having to pin or tag anything. See Prefix Caching for prompt-structure best practices.
Per-Project Fairness
A shared router is only as fair as its weakest defense against a noisy neighbor. The router enforces fairness at three layers, so a project burning through its allowance cannot starve quieter projects sharing the same workers.
Concurrent-request fairness
Every project has a soft burst budget tied to its tier. Projects well under their budget are favored in worker selection; projects at or over their budget are progressively deprioritized: not blocked outright, but routed last when contention is high. Flex requests always yield to standard and priority. Soft fairness keeps low-volume projects responsive without ever leaving capacity idle.
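A soft penalty of this kind can be expressed as a score multiplier that never reaches zero. The curve, the constants, and the function name below are hypothetical; the point is only that over-budget projects are slowed, not blocked.

```python
UNDER_BONUS = 0.25  # assumed maximum boost for a fully idle project
STEEPNESS = 2.0     # assumed rate of decay once over budget


def fairness_multiplier(in_flight: int, burst_budget: int,
                        floor: float = 0.1) -> float:
    """Score multiplier: above 1.0 under budget, decaying toward floor over it."""
    if burst_budget <= 0:
        raise ValueError("burst_budget must be positive")
    usage = in_flight / burst_budget
    if usage < 1.0:
        return 1.0 + UNDER_BONUS * (1.0 - usage)
    # At or over budget: shrink smoothly, but never below the floor,
    # so the project is deprioritized rather than blocked.
    return max(floor, 1.0 / (1.0 + (usage - 1.0) * STEEPNESS))
```

Keeping the floor above zero is what prevents idle capacity: if the over-budget project is the only one asking, it still wins the worker.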
Rate-limit fairness
Above the soft layer, sliding-window rate limits apply per project, with a graduated grace zone: as you approach your limit, response headers begin signaling slowdown so you can throttle cleanly. Cross the hard limit and requests are rejected with standard HTTP semantics.
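The sliding window and grace zone can be sketched as follows. The 80% grace threshold, the class name, and the three return labels are assumptions; a real deployment would map `slow_down` to response headers and `reject` to an HTTP error such as 429.

```python
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_s: float = 60.0,
                 grace_start: float = 0.8):
        self.limit = limit
        self.window_s = window_s
        self.grace_start = grace_start  # assumed fraction where warnings begin
        self.hits: deque[float] = deque()

    def check(self, now: float) -> str:
        """Return 'ok', 'slow_down', or 'reject' for a request at time now."""
        # Drop timestamps that have slid out of the window.
        while self.hits and self.hits[0] <= now - self.window_s:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return "reject"
        self.hits.append(now)
        if len(self.hits) >= self.grace_start * self.limit:
            return "slow_down"  # still served, but the caller is warned
        return "ok"
```

Rejected requests are not recorded, so a client that keeps hammering a closed window does not extend its own lockout.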
Spend fairness
Every request passes through an atomic credit ledger before it can be dispatched. Credit holds are taken at admission, settled on completion, and serialized in the database so two concurrent requests can never both spend the last dollar. Spend caps are enforced consistently across every endpoint the router exposes.
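The hold-then-settle pattern can be illustrated with a single conditional UPDATE, which the database applies atomically. This sketch uses an in-memory SQLite table with a hypothetical schema and integer cents; the router's actual database and schema are not described in this document.

```python
import sqlite3


def open_ledger(balance_cents: int) -> sqlite3.Connection:
    """Create a one-account ledger (hypothetical schema)."""
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
               "available INTEGER NOT NULL, held INTEGER NOT NULL)")
    db.execute("INSERT INTO account VALUES (1, ?, 0)", (balance_cents,))
    return db


def take_hold(db: sqlite3.Connection, amount: int) -> bool:
    """Move amount from available to held; False if funds are short."""
    cur = db.execute(
        "UPDATE account SET available = available - ?, held = held + ? "
        "WHERE id = 1 AND available >= ?",
        (amount, amount, amount),
    )
    return cur.rowcount == 1  # zero rows updated means the guard failed


def settle(db: sqlite3.Connection, held_amount: int, actual_cost: int) -> None:
    """Release a hold, charging the actual cost and refunding the rest."""
    charged = min(actual_cost, held_amount)
    db.execute(
        "UPDATE account SET held = held - ?, available = available + ? "
        "WHERE id = 1",
        (held_amount, held_amount - charged),
    )
```

Because the balance check lives in the WHERE clause of the same statement that debits it, two concurrent holds cannot both pass the check against the same last dollar.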
Anti-starvation
Idle projects are never forgotten. Long-idle projects keep their fair-share position when traffic resumes, and a periodic reaper sweeps stale request accounting so a leaked counter cannot keep a project pinned to the back of the line indefinitely.
Secure Agent Protocol
Once a worker is chosen, the request travels over Xerotier's agent mesh, a purpose-built protocol that handles encryption, framing, flow control, and lease management without any of it being your problem.
CurveZMQ encryption
Every router-to-agent connection is mutually authenticated and
end-to-end encrypted with CurveZMQ. Each agent is identified by
the fingerprint of its public key, so the router can refuse to
talk to an agent whose identity it has not previously enrolled.
Keys are rotated on a schedule and held in protected memory on
the router side.
MessagePack framing
All control-plane messages use a compact, typed binary format
(MessagePack with a two-byte type prefix). The router can
dispatch on message type without decoding the body, which keeps
the hot path cheap. The choice of MessagePack over Python pickle
or other dynamic formats also closes off a well-known class of
deserialization vulnerabilities.
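The benefit of a fixed type prefix is that routing needs only the first two bytes. The sketch below assumes big-endian byte order and made-up type codes, and treats the body as opaque bytes; in the real protocol the body would be MessagePack, decoded only by the handler that needs it.

```python
import struct
from typing import Callable

# Hypothetical message type codes, for illustration only.
MSG_DISPATCH = 0x0001
MSG_HEARTBEAT = 0x0002


def frame(msg_type: int, body: bytes) -> bytes:
    """Prepend a two-byte type prefix (big-endian is an assumption)."""
    return struct.pack(">H", msg_type) + body


def peek_type(message: bytes) -> int:
    """Read the message type without touching the body."""
    if len(message) < 2:
        raise ValueError("frame shorter than its type prefix")
    return struct.unpack_from(">H", message)[0]


def route(message: bytes, handlers: dict[int, Callable]) -> object:
    """Dispatch on the prefix; only the chosen handler sees the body."""
    handler = handlers[peek_type(message)]
    return handler(memoryview(message)[2:])  # zero-copy view of the body
```

An unknown type code raises `KeyError` before any body bytes are parsed, so malformed or unexpected messages are rejected at the cheapest possible point.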
Lease-bound workers
Agents do not connect once and trust forever. Each agent holds a short-lived lease that it must renew on a heartbeat. If a lease lapses, the router moves the agent into a grace state and probes it directly; if probes fail, the agent is evicted and its in-flight work is rerouted. The result: a partitioned or hung agent vanishes from the eligible pool in seconds, not minutes.
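The lease lifecycle reads naturally as a three-state machine. In this sketch the state names, the three-probe eviction limit, and the rule that an evicted agent cannot simply renew are assumptions layered on the description above.

```python
from enum import Enum


class LeaseState(Enum):
    ACTIVE = "active"
    GRACE = "grace"
    EVICTED = "evicted"


class Lease:
    def __init__(self, ttl_s: float, max_failed_probes: int = 3,
                 now: float = 0.0):
        self.ttl_s = ttl_s
        self.max_failed_probes = max_failed_probes  # assumed limit
        self.expires_at = now + ttl_s
        self.state = LeaseState.ACTIVE
        self.failed_probes = 0

    def renew(self, now: float) -> None:
        """Heartbeat renewal; ignored once evicted (assumption)."""
        if self.state is LeaseState.EVICTED:
            return
        self.expires_at = now + self.ttl_s
        self.state = LeaseState.ACTIVE
        self.failed_probes = 0

    def tick(self, now: float) -> None:
        """Move a lapsed lease into the grace state."""
        if self.state is LeaseState.ACTIVE and now >= self.expires_at:
            self.state = LeaseState.GRACE

    def probe_result(self, ok: bool, now: float) -> None:
        """Apply a direct probe outcome to an agent in grace."""
        if self.state is not LeaseState.GRACE:
            return
        if ok:
            self.renew(now)
        else:
            self.failed_probes += 1
            if self.failed_probes >= self.max_failed_probes:
                self.state = LeaseState.EVICTED
```

Only `ACTIVE` agents would be considered eligible for dispatch, so a lapsed lease removes the agent from selection immediately while the probes decide its fate.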
Credit-based streaming backpressure
Streaming responses are governed by a credit-based flow control
protocol layered on top of ZeroMQ high-water marks. Each stream
has a window of bytes the sender is allowed to emit; the
receiver replenishes the window as it consumes data. If the
client slows down, the credit window contracts and the worker
naturally pauses generation rather than buffering unbounded
tokens upstream. Memory stays predictable even under SSE
backpressure.
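The credit window itself is a small piece of bookkeeping. The class below is a simplified sketch with hypothetical names; it models only the byte accounting, not the ZeroMQ transport underneath.

```python
class CreditWindow:
    """Sender-side view of one stream's byte credits."""

    def __init__(self, window_bytes: int):
        self.capacity = window_bytes
        self.credits = window_bytes

    def try_send(self, chunk: bytes) -> bool:
        """Spend credits for a chunk, or refuse if the window is too small.

        A chunk larger than the whole window can never be sent, so a
        real sender would split it first.
        """
        if len(chunk) > self.credits:
            return False  # the worker pauses generation instead of buffering
        self.credits -= len(chunk)
        return True

    def replenish(self, consumed_bytes: int) -> None:
        """Receiver acknowledged consumption; reopen the window."""
        self.credits = min(self.capacity, self.credits + consumed_bytes)
```

A slow client simply stops triggering `replenish`, `try_send` starts returning `False`, and the amount of unacknowledged data in flight is capped at the window size.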
sequenceDiagram
participant Router
participant Agent
Router->>Agent: CURVE handshake (mutual auth)
Agent->>Router: Lease renewal (heartbeat)
Router->>Agent: Dispatch (MessagePack, typed)
Agent-->>Router: Stream chunks (within credit window)
Router->>Agent: Replenish credits as client consumes
Note over Router,Agent: Client slow?<br/>Window contracts, worker pauses.
Resilience & Self-Healing
A production router has to assume things will break and keep serving traffic anyway. The router carries the usual primitives, and a few less-usual ones.
- Circuit breakers: Sensitive subsystems (such as memory extraction) sit behind per-endpoint breakers. Consecutive failures open the circuit and short-circuit further calls for a cooldown window, so a sick downstream cannot cascade into the request path.
- Lease probes: Agents in the grace state are actively pinged with round-trip latency measured. Workers that recover are reinstated; workers that do not are evicted cleanly.
- Tier fallback & spillover: Per-worker queues have bounded depth. When a tier saturates, requests can be resumed on adjacent capacity instead of deadlocking against a single hot worker.
- Cold provisioning: When no warm worker holds your model, eligible candidates can be warmed and routed around until the model is ready; cold-load orchestration is handled by the agent fleet, not the router request path.
- Self-healing prefix index: A worker whose prefix predictions stop being accurate gets its slice of the index flushed and rebuilt automatically.
- Stale-request reaper: A background sweep evicts request accounting that has been pending too long, preventing leaked counters from corrupting fairness over time.
- Enrollment rate limiting: The agent-enrollment routes pass through a per-source token-bucket limiter so a misbehaving network cannot flood the control plane. Enroll-initiate and enroll-complete are auth-gated by the standard middleware chain; the refresh route is rate-limited and verifies its bearer inline rather than through middleware auth.
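Of the primitives in this list, the circuit breaker is the easiest to show in isolation. The sketch below assumes a consecutive-failure threshold and a fixed cooldown, after which trial calls are allowed through; the class name and default values are hypothetical.

```python
from typing import Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow(self, now: float) -> bool:
        """Should a call be attempted at time now?"""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open).
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        """Report the outcome of a call that was allowed."""
        if ok:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial
```

A failed trial call restarts the cooldown from the moment of that failure, so a downstream that is still sick sees one probe per cooldown window rather than a resumed flood.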
Observability
The router is built for operators, not just for end users. Every request is observable end-to-end, every decision is auditable, and the cost of that observability is bounded.
- Distributed tracing: Standards-compliant W3C traceparent headers are extracted on ingress and propagated to downstream agents, so a request can be followed across the router, the inference worker, any tool calls, and back, in your existing tracing stack.
- Prometheus metrics with cardinality protection: All key router decisions are exported as Prometheus metrics. A built-in cardinality limiter caps the number of unique tenant and region labels and folds overflow into a shared bucket, so a runaway label cannot blow up your monitoring backend.
- Structured audit log: Approval gates and admin actions (agent management, admin API, storage-tier and cache controls, maintenance operations, and chat administration) are written to a structured audit log in the database, queryable for compliance reporting.
- Per-request usage accounting: Tokens in, tokens out, cached tokens, and (where applicable) research and reasoning tokens are returned in the response payload so you can attribute cost to features and users without scraping logs.
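The cardinality limiter mentioned in the list above amounts to a bounded set with an overflow bucket. This is a minimal sketch under assumed names: the first N distinct label values are kept as-is, and anything after that is reported under a shared label (here `"other"`).

```python
class LabelLimiter:
    """Cap the number of distinct values one metric label may take."""

    def __init__(self, max_values: int, overflow: str = "other"):
        self.max_values = max_values
        self.overflow = overflow  # assumed name of the shared bucket
        self.seen: set[str] = set()

    def clamp(self, value: str) -> str:
        """Return the label value to export for this observation."""
        if value in self.seen:
            return value
        if len(self.seen) < self.max_values:
            self.seen.add(value)
            return value
        return self.overflow  # fold new values into the shared bucket
```

Values admitted early keep their own series for the life of the process, so the total series count per label is bounded at `max_values + 1` regardless of how many tenants or regions appear.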
In Operator Terms
- You write to one OpenAI-compatible API. The router decides, per request, where it actually runs.
- Your prompts stay warm. Prefix-cache affinity, model affinity, and region affinity all stack, without you having to pin or tag anything.
- Your neighbors cannot starve you. Soft fairness, hard rate limits, and atomic credit holds keep noisy tenants from stealing your capacity.
- Failures are contained. Circuit breakers, lease probes, tier spillover, and a self-healing affinity index keep one bad worker from poisoning the pool.
- You can see what happened. W3C tracing across every hop, cardinality-safe Prometheus metrics, and a structured audit trail.
Bring your own agents, use our managed pools, or mix the two behind a single project. The router treats them as one tier-aware fleet, and gives every request a fair, fast, observable home.