The Router - Xerotier

Overview

Every Xerotier inference request (chat, responses, embeddings, reranking, tool calls) arrives at the same router process that also serves the frontend page surface. The router holds a live picture of every connected worker: which models are loaded, how hot their KV cache is, how recently each one answered, how close it is to your traffic, and whether the project sending the request is over or under its fair share.

For each request, the router scores every eligible worker on more than a dozen independent signals, samples a winner with a weighted-random algorithm that still leaves room for exploration, and dispatches over an authenticated, encrypted mesh with built-in backpressure. The whole decision happens before your first token, typically in under a millisecond at the router itself.

flowchart LR
    Client["Client request<br/>(chat / responses / tools)"]
    Admit["Admission<br/>auth + credits + rate limits"]
    Score["Composite scoring<br/>15+ signals"]
    Select["Weighted-random<br/>sampling + exploration"]
    Dispatch["Encrypted dispatch<br/>over agent mesh"]
    Stream["Streaming response<br/>with credit-based flow control"]
    Feedback["Post-request feedback<br/>recalibrates the router"]

    Client --> Admit
    Admit --> Score
    Score --> Select
    Select --> Dispatch
    Dispatch --> Stream
    Stream -.-> Feedback
    Feedback -.-> Score

Composite Scoring

Picking a worker by latency alone is a trap: the fastest worker a millisecond ago may be the one whose KV cache just evicted your prompt, the one whose queue just filled, or the one in a region a continent away. The router blends 15 distinct signals into a single composite score and recomputes it on every request.

The fifteen signals

Model-cache affinity (affinity): Is the model already loaded? Cold-loading a 70B model is measured in minutes; a warm worker answers immediately.
Prefix-cache match (prefix): How much of your prompt's KV cache the worker still holds from a prior request. More overlap, shorter prefill, faster first token.
Latency prediction (ewma): A per-worker, exponentially-weighted moving average of recent end-to-end latency. Weighted harder when an SLO is attached.
KV-cache pressure (pressure): Workers near their KV ceiling are deprioritized and, past a threshold, excluded entirely.
Worker health (health): Heartbeat freshness, error rates, and probe results drive a continuous health score.
Worker utilization (slots): How many slots remain available right now.
Region affinity (region): Stay close to the caller when latency matters; spill out when capacity does.
Per-project fairness (fairness): Projects under their fair share are favored; projects burning through burst capacity are softly deprioritized.
Service-tier priority (tier): Priority requests outrank standard. Flex requests yield to both.
Self-hosted ownership (ownership): If your project owns the agent, your requests get first pass at it.
SLO bonus (slo): Workers consistently meeting your SLO win ties.
Internal vs. external (origin): Internal traffic gets a configurable boost to keep system jobs from being starved by user traffic.
Batch penalty (batch): Batch requests are weighted down so interactive traffic preempts them naturally.
Metrics-staleness penalty (staleness): Workers whose telemetry is overdue are progressively distrusted until they re-report.
Strong-prefix bonus (strong-prefix): A bonus tier kicks in when prefix overlap is high and the prefix-match prediction has proven accurate for that worker.
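
As a rough illustration of how signals like these could combine, the sketch below uses a weighted sum over four of them plus a hard exclusion for KV pressure. The weights, the cutoff, the field names, and the choice of a weighted sum are all assumptions made for this example; the router's actual formula and values are not documented here.

```python
from dataclasses import dataclass

# Hypothetical weights; the real router's weights are not documented here.
WEIGHTS = {"affinity": 3.0, "prefix": 2.0, "health": 1.5, "slots": 1.0}
KV_PRESSURE_CUTOFF = 0.95  # assumed exclusion threshold


@dataclass
class WorkerSignals:
    name: str
    affinity: float     # 1.0 if the model is already loaded, else 0.0
    prefix: float       # predicted fraction of the prompt's KV already cached
    health: float       # 0.0 (failing) .. 1.0 (healthy)
    slots: float        # fraction of slots currently free
    kv_pressure: float  # 0.0 (empty) .. 1.0 (full)


def composite_score(w: WorkerSignals) -> float:
    """Return 0.0 for excluded workers, else a positive weighted sum."""
    if w.kv_pressure >= KV_PRESSURE_CUTOFF:
        return 0.0  # past the threshold: excluded entirely
    base = sum(weight * getattr(w, key) for key, weight in WEIGHTS.items())
    # Below the cutoff, pressure only scales the score down.
    return base * (1.0 - w.kv_pressure)


warm = WorkerSignals("warm", 1.0, 0.8, 1.0, 0.5, 0.30)
cold = WorkerSignals("cold", 0.0, 0.0, 1.0, 0.9, 0.10)
full = WorkerSignals("full", 1.0, 0.9, 1.0, 0.1, 0.97)
scores = {w.name: composite_score(w) for w in (warm, cold, full)}
```

With these made-up numbers the warm worker outscores the cold one, and the worker past the pressure cutoff scores zero regardless of how good its cache match is.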

flowchart LR
    subgraph Signals
        S1["Model-cache affinity"]
        S2["Prefix-cache match"]
        S3["EWMA latency"]
        S4["KV pressure"]
        S5["Worker health"]
        S6["Utilization"]
        S7["Region affinity"]
        S8["Per-project fairness"]
        S9["Service tier"]
        SA["...additional signals"]
    end
    Score{"Composite score<br/>per worker"}
    S1 --> Score
    S2 --> Score
    S3 --> Score
    S4 --> Score
    S5 --> Score
    S6 --> Score
    S7 --> Score
    S8 --> Score
    S9 --> Score
    SA --> Score

Weighted Selection & Exploration

A single best score is rarely the best answer. Always picking the top-scoring worker concentrates traffic, accelerates KV evictions on the favorite, and prevents the router from learning that a newly connected worker is actually faster. The router avoids this in two ways.

A-Res weighted-random sampling

Instead of picking the maximum score, the router draws from all eligible workers using the A-Res algorithm (Efraimidis and Spirakis). Each worker's selection probability is proportional to its composite score, so the best worker wins most of the time. A worker that is 90% as good still wins a meaningful share of requests. Load spreads naturally, KV caches stay warm across the pool, and a high scorer that briefly stumbles has competitors ready to absorb the traffic.
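
A minimal sketch of A-Res for the single-winner case: each candidate gets the key u^(1/w), where u is uniform on (0, 1) and w is its weight, and the largest key wins. The function name, the worker names, and the scores are illustrative assumptions, not the router's code.

```python
import random


def a_res_pick(scores: dict[str, float], rng: random.Random) -> str:
    """Pick one worker with probability proportional to its score.

    A-Res gives each item the key u ** (1 / weight), with u uniform on
    [0, 1), and keeps the item with the largest key.
    """
    best_name, best_key = None, -1.0
    for name, score in scores.items():
        if score <= 0:
            continue  # ineligible workers never win
        key = rng.random() ** (1.0 / score)
        if key > best_key:
            best_name, best_key = name, key
    if best_name is None:
        raise ValueError("no eligible workers")
    return best_name


rng = random.Random(0)
scores = {"a": 10.0, "b": 9.0, "c": 1.0}
wins = {name: 0 for name in scores}
for _ in range(20_000):
    wins[a_res_pick(scores, rng)] += 1
```

Because the weights here sum to 20, worker "a" should win about half of the draws, "b" about 45 percent, and "c" about 5 percent. That is the sense in which a worker that is 90% as good still carries a meaningful share.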

Epsilon-greedy exploration

A small slice of requests, five percent by default, bypasses the composite score entirely and chooses uniformly at random from the eligible pool. This keeps the router honest: newly connected workers, workers whose latency model is stale, and workers recovering from a degraded state all get sampled, measured, and folded back into the scoring model. The result is a router that keeps learning, instead of one that lock-steps to whatever was fastest an hour ago.
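
The exploration step can be sketched as a coin flip in front of the weighted draw. The five percent rate comes from the text above; the function name is hypothetical, and `random.choices` is used here only as a simple stand-in for the router's A-Res sampler.

```python
import random

EXPLORE_RATE = 0.05  # five percent by default, per the text above


def select_worker(scores, rng, explore_rate=EXPLORE_RATE):
    """Return (worker, explored) for one request."""
    eligible = [name for name, score in scores.items() if score > 0]
    if not eligible:
        raise ValueError("no eligible workers")
    if rng.random() < explore_rate:
        # Exploration: ignore scores, uniform over the eligible pool.
        return rng.choice(eligible), True
    # Exploitation: score-proportional draw (stand-in for A-Res).
    weights = [scores[name] for name in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0], False


rng = random.Random(1)
scores = {"established": 10.0, "new": 0.5}
explored = sum(select_worker(scores, rng)[1] for _ in range(20_000))
```

Even a worker with a very low score, such as one that has just connected, is picked half the time on exploration draws in this two-worker pool, which gives the scoring model fresh measurements to work with.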

Prefix-Cache Affinity

The router keeps a per-worker index of which prompt prefixes each worker has recently served. When a new request arrives, the router can predict, before dispatching, how much of the prompt's KV cache is likely already resident on each candidate, and prefer the candidates with the warmest match.

Self-calibrating accuracy

Every prediction is checked. After the request completes, the worker reports how many prefix blocks it actually hit, and the router compares that to what it predicted. The per-worker accuracy of those predictions is tracked with an exponential moving average and used to dampen or amplify future prefix-match scores. A worker whose cache map turns out to be a poor predictor gets less benefit of the doubt; a worker whose predictions are consistently right gets a strong-prefix bonus on top.

If accuracy collapses for any worker, usually because its cache state diverged from what the router believed, the router resets that worker's prefix index and rebuilds it from fresh evidence. No human intervention required.
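
The calibration loop might look roughly like the sketch below. The smoothing factor, the reset floor, the neutral starting value, and every name in the class are assumptions chosen for illustration; the text only establishes that accuracy is tracked with an exponential moving average and that a collapse triggers an index reset.

```python
from dataclasses import dataclass, field

ALPHA = 0.2        # assumed smoothing factor
RESET_BELOW = 0.3  # assumed accuracy floor that triggers a reset
NEUTRAL = 0.5      # assumed starting accuracy for a fresh index


@dataclass
class PrefixTracker:
    accuracy: float = NEUTRAL                # EWMA of prediction accuracy
    index: set = field(default_factory=set)  # hypothetical prefix hashes
    resets: int = 0

    def observe(self, predicted: int, actual: int, total: int) -> None:
        """Fold one completed request (counts in blocks) into the estimate."""
        if total <= 0:
            return
        error = min(abs(predicted - actual) / total, 1.0)
        sample = 1.0 - error
        self.accuracy = ALPHA * sample + (1 - ALPHA) * self.accuracy
        if self.accuracy < RESET_BELOW:
            self.index.clear()       # drop the stale picture of the cache
            self.accuracy = NEUTRAL  # rebuild trust from fresh evidence
            self.resets += 1

    def damped_score(self, predicted_fraction: float) -> float:
        """Scale a raw prefix-match score by how reliable predictions have been."""
        return predicted_fraction * self.accuracy
```

Starting from the neutral value, three consecutive predictions that miss completely push the average to 0.256, below the assumed floor, so the index is cleared on the third miss.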

sequenceDiagram
    participant Router
    participant Worker
    Router->>Router: Predict prefix match for request
    Router->>Worker: Dispatch
    Worker-->>Router: Response + actual hit/total blocks
    Router->>Router: EWMA-update prediction accuracy
    Note over Router: Accuracy too low?<br/>Reset this worker's index.

The user-facing effect is simple: when your prompts share stable prefixes (a system prompt, a few-shot block, a long document quoted across many turns) the router quietly funnels those requests to the workers most likely to skip the prefill, without you having to pin or tag anything. See Prefix Caching for prompt-structure best practices.

Per-Project Fairness

A shared router is only as fair as its weakest defense against a noisy neighbor. The router enforces fairness at three layers, so a project burning through its allowance cannot starve quieter projects sharing the same workers.

Concurrent-request fairness

Every project has a soft burst budget tied to its tier. Projects well under their budget are favored in worker selection. Projects at or over their budget are progressively deprioritized: they are not blocked outright, but they are routed last when contention is high. Flex requests always yield to standard and priority. Soft fairness keeps low-volume projects responsive without ever leaving capacity idle.
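
One way to express "deprioritized but never blocked" is a score multiplier that rises above 1 under budget and decays toward a non-zero floor past it. The curve, the 1.5x ceiling, the floor, and the function name below are illustrative assumptions only.

```python
def fairness_multiplier(in_flight: int, burst_budget: int, floor: float = 0.1) -> float:
    """Score multiplier: above 1 under budget, below 1 over budget, never 0."""
    if burst_budget <= 0:
        raise ValueError("burst_budget must be positive")
    usage = in_flight / burst_budget
    if usage < 1.0:
        return 1.0 + 0.5 * (1.0 - usage)  # up to a 1.5x boost when idle
    return max(floor, 1.0 / usage)        # decays past budget, floored above zero
```

Because the multiplier never reaches zero, an over-budget project can still be served whenever no other project is competing for the same worker, so capacity is not left idle.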

Rate-limit fairness

Above the soft layer, sliding-window rate limits apply per project, with a graduated grace zone: as you approach your limit, response headers begin signaling slowdown so you can throttle cleanly. Cross the hard limit and requests are rejected with standard HTTP semantics.
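
A sliding-window limiter with a grace zone can be sketched as follows. The 80 percent grace ratio and the `X-RateLimit-*` header names are hypothetical placeholders, since the actual header names are not given here; status 429 and `Retry-After` are standard HTTP.

```python
import math
from collections import deque


class SlidingWindowLimiter:
    """Per-project sliding window with a soft warning zone below the hard limit."""

    def __init__(self, limit: int, window_s: float, grace_ratio: float = 0.8):
        self.limit = limit
        self.window_s = window_s
        self.grace_at = int(limit * grace_ratio)  # assumed start of the grace zone
        self.stamps = deque()

    def check(self, now: float) -> tuple[int, dict]:
        """Return (HTTP status, headers) for a request arriving at `now`."""
        # Drop timestamps that have slid out of the window.
        while self.stamps and self.stamps[0] <= now - self.window_s:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            retry = self.stamps[0] + self.window_s - now
            return 429, {"Retry-After": str(max(1, math.ceil(retry)))}
        self.stamps.append(now)
        headers = {"X-RateLimit-Remaining": str(self.limit - len(self.stamps))}
        if len(self.stamps) > self.grace_at:
            headers["X-RateLimit-Slowdown"] = "true"  # hypothetical header name
        return 200, headers
```

A client that watches for the slowdown signal can back off before it ever sees a 429.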

Spend fairness

Every request passes through an atomic credit ledger before it can be dispatched. Credit holds are taken at admission, settled on completion, and serialized in the database so two concurrent requests can never both spend the last dollar. Spend caps are enforced consistently across every endpoint the router exposes.
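
The hold-then-settle pattern can be illustrated with a conditional `UPDATE`, which makes the balance check and the reservation a single atomic statement. SQLite stands in for the real database here, and the table, columns, and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ledger (project TEXT PRIMARY KEY, balance INTEGER, held INTEGER)"
)
db.execute("INSERT INTO ledger VALUES ('demo', 100, 0)")  # amounts in cents


def hold(project: str, amount: int) -> bool:
    """Reserve `amount` at admission; fails if unreserved funds are short."""
    with db:  # one transaction; the WHERE clause does check-and-reserve atomically
        cur = db.execute(
            "UPDATE ledger SET held = held + ? "
            "WHERE project = ? AND balance - held >= ?",
            (amount, project, amount),
        )
    return cur.rowcount == 1


def settle(project: str, held_amount: int, actual_cost: int) -> None:
    """Release the hold on completion and charge what was actually used."""
    with db:
        db.execute(
            "UPDATE ledger SET held = held - ?, balance = balance - ? "
            "WHERE project = ?",
            (held_amount, actual_cost, project),
        )
```

With a balance of 100, a first `hold("demo", 60)` succeeds and a second identical hold fails, because only 40 remains unreserved until the first request settles.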

Anti-starvation

Idle projects are never forgotten. Long-idle projects keep their fair-share position when traffic resumes, and a periodic reaper sweeps stale request accounting so a leaked counter cannot keep a project pinned to the back of the line indefinitely.

Secure Agent Protocol

Once a worker is chosen, the request travels over Xerotier's agent mesh, a purpose-built protocol that handles encryption, framing, flow control, and lease management without any of it being your problem.

CurveZMQ encryption

Every router-to-agent connection is mutually authenticated and end-to-end encrypted with CurveZMQ. Each agent is identified by the fingerprint of its public key, so the router can refuse to talk to an agent whose identity it has not previously enrolled. Keys are rotated on a schedule and held in protected memory on the router side.

MessagePack framing

All control-plane messages use a compact, typed binary format (MessagePack with a two-byte type prefix). The router can dispatch on message type without decoding the body, which keeps the hot path cheap. The choice of MessagePack over Python pickle or other dynamic formats also closes off a well-known class of deserialization vulnerabilities.
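
To show why a type prefix keeps the hot path cheap, the sketch below routes a frame by reading only its first two bytes and passes the body along still encoded. The type codes, the big-endian byte order, and the function names are assumptions; only the two-byte prefix itself comes from the text.

```python
import struct

# Hypothetical type codes; the real protocol's values are not documented here.
MSG_DISPATCH = 0x0001
MSG_HEARTBEAT = 0x0002

_PREFIX = struct.Struct(">H")  # two-byte unsigned prefix (byte order assumed)


def frame(msg_type: int, body: bytes) -> bytes:
    """Prepend the type prefix to an already-encoded body."""
    return _PREFIX.pack(msg_type) + body


def route(frame_bytes: bytes, handlers: dict):
    """Dispatch on the prefix; the body is handed over without being decoded."""
    if len(frame_bytes) < _PREFIX.size:
        raise ValueError("frame too short")
    (msg_type,) = _PREFIX.unpack_from(frame_bytes)
    handler = handlers.get(msg_type)
    if handler is None:
        raise ValueError(f"unknown message type {msg_type:#06x}")
    return handler(frame_bytes[_PREFIX.size:])


# b"\x81\xa2id\x07" is the MessagePack encoding of {"id": 7}.
body = b"\x81\xa2id\x07"
result = route(frame(MSG_DISPATCH, body), {MSG_DISPATCH: lambda b: ("dispatch", b)})
```

Unknown or truncated frames are rejected before any body parsing happens, and the body is only ever treated as data to decode, not as instructions to execute.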

Lease-bound workers

Agents do not connect once and trust forever. Each agent holds a short-lived lease that it must renew on a heartbeat. If a lease lapses, the router moves the agent into a grace state and probes it directly; if probes fail, the agent is evicted and its in-flight work is rerouted. The result: a partitioned or hung agent vanishes from the eligible pool in seconds, not minutes.
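
The lease lifecycle described above maps onto a three-state machine. In this sketch the state names, the TTL, the probe budget, and the rule that an evicted agent cannot simply renew are all illustrative assumptions.

```python
from enum import Enum


class LeaseState(Enum):
    ACTIVE = "active"
    GRACE = "grace"
    EVICTED = "evicted"


class Lease:
    def __init__(self, now: float, ttl_s: float = 5.0, max_failed_probes: int = 3):
        self.ttl_s = ttl_s
        self.max_failed_probes = max_failed_probes
        self.expires_at = now + ttl_s
        self.failed_probes = 0
        self.state = LeaseState.ACTIVE

    def renew(self, now: float) -> None:
        """Heartbeat: extend the lease and clear any grace state."""
        if self.state is LeaseState.EVICTED:
            return  # assumed: an evicted agent must re-enroll, not just renew
        self.expires_at = now + self.ttl_s
        self.failed_probes = 0
        self.state = LeaseState.ACTIVE

    def tick(self, now: float) -> None:
        """Move an active lease into GRACE once it has lapsed."""
        if self.state is LeaseState.ACTIVE and now >= self.expires_at:
            self.state = LeaseState.GRACE

    def record_probe(self, ok: bool, now: float) -> None:
        """Apply a direct probe result to an agent in the grace state."""
        if self.state is not LeaseState.GRACE:
            return
        if ok:
            self.renew(now)  # recovered: reinstate
            return
        self.failed_probes += 1
        if self.failed_probes >= self.max_failed_probes:
            self.state = LeaseState.EVICTED

    @property
    def eligible(self) -> bool:
        return self.state is LeaseState.ACTIVE
```

Only `ACTIVE` leases are eligible for selection, so an agent that stops sending heartbeats drops out of the pool as soon as its TTL lapses, before any probe has even failed.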

Credit-based streaming backpressure

Streaming responses are governed by a credit-based flow control protocol layered on top of ZeroMQ high-water marks. Each stream has a window of bytes the sender is allowed to emit; the receiver replenishes the window as it consumes data. If the client slows down, the credit window contracts and the worker naturally pauses generation rather than buffering unbounded tokens upstream. Memory stays predictable even under SSE backpressure.
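
The core of a credit window fits in a few lines: the sender spends credits per byte and stops when they run out, and the receiver hands credits back as it consumes. The class below is a simplified, sender-side sketch with hypothetical names; it leaves out the ZeroMQ layer entirely.

```python
class CreditWindow:
    """Sender-side view of a byte-credit window for one stream."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def try_send(self, chunk: bytes) -> bool:
        """Spend credits for a chunk, or refuse so the producer pauses."""
        if len(chunk) > self.credits:
            return False
        self.credits -= len(chunk)
        return True

    def replenish(self, consumed_bytes: int) -> None:
        """The receiver consumed data; reopen that much of the window."""
        self.credits += consumed_bytes


window = CreditWindow(initial_credits=8)
sent = [window.try_send(b"abcd") for _ in range(3)]  # third chunk must wait
window.replenish(4)                                  # client consumed one chunk
resumed = window.try_send(b"abcd")
```

The amount of unconsumed data in flight can never exceed the window size, which is what keeps memory bounded when a client reads slowly.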

sequenceDiagram
    participant Router
    participant Agent
    Router->>Agent: CURVE handshake (mutual auth)
    Agent->>Router: Lease renewal (heartbeat)
    Router->>Agent: Dispatch (MessagePack, typed)
    Agent-->>Router: Stream chunks (within credit window)
    Router->>Agent: Replenish credits as client consumes
    Note over Router,Agent: Client slow?<br/>Window contracts, worker pauses.

Resilience & Self-Healing

A production router has to assume things will break and keep serving traffic anyway. The router carries the usual primitives, and a few less-usual ones.

Circuit breakers: Sensitive subsystems (such as memory extraction) sit behind per-endpoint breakers. Consecutive failures open the circuit and short-circuit further calls for a cooldown window, so a sick downstream cannot cascade into the request path.
Lease probes: Agents in the grace state are actively pinged with round-trip latency measured. Workers that recover are reinstated; workers that do not are evicted cleanly.
Tier fallback & spillover: Per-worker queues have bounded depth. When a tier saturates, requests can be resumed on adjacent capacity instead of deadlocking against a single hot worker.
Cold provisioning: When no warm worker holds your model, eligible candidates can be warmed and routed around until the model is ready; cold-load orchestration is handled by the agent fleet, not the router request path.
Self-healing prefix index: A worker whose prefix predictions stop being accurate gets its slice of the index flushed and rebuilt automatically.
Stale-request reaper: A background sweep evicts request accounting that has been pending too long, preventing leaked counters from corrupting fairness over time.
Enrollment rate limiting: The agent-enrollment routes pass through a per-source token-bucket limiter so a misbehaving network cannot flood the control plane. Enroll-initiate and enroll-complete are auth-gated by the standard middleware chain; the refresh route is rate-limited and verifies its bearer inline rather than through middleware auth.
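
Of the primitives listed above, the circuit breaker is the easiest to show in isolation. This is a generic consecutive-failure breaker with a cooldown, written as a sketch; the threshold, the cooldown length, and the single-trial-call behavior after cooldown are assumptions rather than the router's documented settings.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """Should a call be attempted at time `now`?"""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through to test recovery.
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        """Report the outcome of an attempted call."""
        if ok:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial
```

While the circuit is open, callers skip the downstream entirely, so a failing subsystem costs the request path a boolean check instead of a timeout.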

Observability

The router is built for operators, not just for end users. Every request is observable end-to-end, every decision is auditable, and the cost of that observability is bounded.

Distributed tracing: Standards-compliant W3C traceparent headers are extracted on ingress and propagated to downstream agents, so a request can be followed across the router, the inference worker, any tool calls, and back, in your existing tracing stack.
Prometheus metrics with cardinality protection: All key router decisions are exported as Prometheus metrics. A built-in cardinality limiter caps the number of unique tenant and region labels and folds overflow into a shared bucket, so a runaway label cannot blow up your monitoring backend.
Structured audit log: Approval gates and admin actions (agent management, admin API, storage-tier and cache controls, maintenance operations, and chat administration) are written to a structured audit log in the database, queryable for compliance reporting.
Per-request usage accounting: Tokens in, tokens out, cached tokens, and (where applicable) research and reasoning tokens are returned in the response payload so you can attribute cost to features and users without scraping logs.
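
The cardinality limiter mentioned in the list above can be sketched as a bounded set of admitted label values with a shared overflow bucket. The class name, the cap, and the "other" bucket name are hypothetical.

```python
class LabelLimiter:
    """Admit at most `max_values` distinct label values; fold the rest together."""

    def __init__(self, max_values: int, overflow: str = "other"):
        self.max_values = max_values
        self.overflow = overflow  # assumed name of the shared bucket
        self.seen = set()

    def normalize(self, value: str) -> str:
        if value in self.seen:
            return value  # already admitted: keep its own series
        if len(self.seen) < self.max_values:
            self.seen.add(value)
            return value
        return self.overflow  # cap reached: fold into the shared bucket


limiter = LabelLimiter(max_values=2)
labels = [limiter.normalize(v) for v in ["a", "b", "c", "a", "d"]]
```

Values admitted before the cap keep their own time series, and everything after it shares one, so the number of series per metric stays bounded no matter how many tenants appear.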

In Operator Terms

You write to one OpenAI-compatible API. The router decides, per request, where it actually runs.
Your prompts stay warm. Prefix-cache affinity, model affinity, and region affinity all stack, without you having to pin or tag anything.
Your neighbors cannot starve you. Soft fairness, hard rate limits, and atomic credit holds keep noisy tenants from stealing your capacity.
Failures are contained. Circuit breakers, lease probes, tier spillover, and a self-healing affinity index keep one bad worker from poisoning the pool.
You can see what happened. W3C tracing across every hop, cardinality-safe Prometheus metrics, and a structured audit trail.

Bring your own agents, use our managed pools, or mix the two behind a single project. The router treats them as one tier-aware fleet, and gives every request a fair, fast, observable home.