// Getting Started

The Router

One OpenAI-compatible surface. Behind it, a scoring dispatcher picks a worker per request from fifteen independent signals, samples with A-Res weighted-random selection, and dispatches over a mutually authenticated CurveZMQ mesh. Managed pools, self-hosted agents, and public cloud are one fleet.

Overview

Every Xerotier inference request (chat, responses, embeddings, reranking, tool calls) arrives at the same router process that also serves the frontend page surface. The router holds a live picture of every connected worker: which models are loaded, how hot their KV cache is, how recently each one answered, how close it is to your traffic, and whether the project sending the request is over or under its fair share.

For each request, the router scores every eligible worker on fifteen independent signals, samples a winner with a weighted-random algorithm that still leaves room for exploration, and dispatches over an authenticated, encrypted mesh with built-in backpressure. The whole decision happens before your first token, typically in under a millisecond at the router itself.

Composite Scoring

Picking a worker by latency alone is a trap: the fastest worker a millisecond ago may be the one whose KV cache just evicted your prompt, the one whose queue just filled, or the one in a region a continent away. The router blends fifteen distinct signals into a single composite score and recomputes it on every request.

The fifteen signals

// affinity
Model-cache affinity
Is the model already loaded? Cold-loading a 70B model is measured in minutes; a warm worker answers immediately.
// prefix
Prefix-cache match
How much of your prompt's KV cache the worker still holds from a prior request. More overlap, shorter prefill, faster first token.
// ewma
Latency prediction
A per-worker, exponentially-weighted moving average of recent end-to-end latency. Weighted harder when an SLO is attached.
// pressure
KV-cache pressure
Workers near their KV ceiling are deprioritized and, past a threshold, excluded entirely.
// health
Worker health
Heartbeat freshness, error rates, and probe results drive a continuous health score.
// slots
Worker utilization
How many slots remain available right now.
// region
Region affinity
Stay close to the caller when latency matters; spill out when capacity does.
// fairness
Per-project fairness
Projects under their fair share are favored; projects burning through burst capacity are softly deprioritized.
// tier
Service-tier priority
Priority requests outrank standard. Flex requests yield to both.
// ownership
Self-hosted ownership
If your project owns the agent, your requests get first pass at it.
// slo
SLO bonus
Workers consistently meeting your SLO win ties.
// origin
Internal vs. external
Internal traffic gets a configurable boost to keep system jobs from being starved by user traffic.
// batch
Batch penalty
Batch requests are weighted down so interactive traffic preempts them naturally.
// staleness
Metrics-staleness penalty
Workers whose telemetry is overdue are progressively distrusted until they re-report.
// strong-prefix
Strong-prefix bonus
A bonus tier kicks in when prefix overlap is high and the prefix-match prediction has proven accurate for that worker.
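
The exact blending formula is not documented here. As a minimal sketch, assume a linear weighted sum over a handful of normalized signals, with KV-cache pressure acting both as a soft penalty and, past a threshold, as a hard exclusion. The weights, the threshold, and every name below (`WEIGHTS`, `KV_EXCLUDE_THRESHOLD`, `WorkerSignals`, `composite_score`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights: the router's real weights and formula are not published.
WEIGHTS = {
    "model_affinity": 3.0,
    "prefix_match": 2.0,
    "health": 1.5,
    "free_slots": 1.0,
    "region_affinity": 1.0,
}
KV_EXCLUDE_THRESHOLD = 0.95  # assumed cutoff above which a worker is ineligible


@dataclass
class WorkerSignals:
    name: str
    signals: dict       # each signal normalized to the range [0, 1]
    kv_pressure: float  # fraction of KV cache in use, in [0, 1]


def composite_score(worker: WorkerSignals) -> Optional[float]:
    """Return a non-negative score, or None if the worker is excluded outright."""
    if worker.kv_pressure >= KV_EXCLUDE_THRESHOLD:
        return None  # hard exclusion past the threshold
    base = sum(w * worker.signals.get(key, 0.0) for key, w in WEIGHTS.items())
    # Soft penalty: pressure below the cutoff still scales the score down.
    return base * (1.0 - worker.kv_pressure)
```

A worker that returns `None` never enters the sampling step; every other worker carries its score forward as a selection weight.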

Weighted Selection & Exploration

A single best score is rarely the best answer. Always picking the top-scoring worker concentrates traffic, accelerates KV evictions on the favorite, and prevents the router from learning that a newly connected worker is actually faster. The router avoids this in two ways.

A-Res weighted-random sampling

Instead of picking the maximum score, the router draws from all eligible workers using the A-Res algorithm (Efraimidis and Spirakis). Each worker's selection probability is proportional to its composite score, so the best worker wins most of the time. A worker that is 90% as good still wins a meaningful share of requests. Load spreads naturally, KV caches stay warm across the pool, and a high scorer that briefly stumbles has competitors ready to absorb the traffic.
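
For a single pick, A-Res reduces to a short loop: draw a uniform random number per worker, raise it to the power `1 / weight`, and keep the worker with the largest key. The sketch below assumes scores are already positive composite scores; `a_res_pick` is an illustrative name, not the router's API.

```python
import random


def a_res_pick(scores: dict, rng: random.Random) -> str:
    """Pick one worker with probability proportional to its positive score."""
    best_name, best_key = None, -1.0
    for name, weight in scores.items():
        if weight <= 0:
            continue  # zero-weight workers can never be selected
        # A-Res key: u ** (1 / w), with u drawn uniformly from [0, 1).
        key = rng.random() ** (1.0 / weight)
        if key > best_key:
            best_name, best_key = name, key
    if best_name is None:
        raise ValueError("no eligible workers")
    return best_name
```

With two workers scored 10 and 9, the first wins about 53% of draws and the second about 47%, which is the load-spreading behavior described above.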

Epsilon-greedy exploration

A small slice of requests, five percent by default, bypasses the composite score entirely and chooses uniformly at random from the eligible pool. This keeps the router honest: newly connected workers, workers whose latency model is stale, and workers recovering from a degraded state all get sampled, measured, and folded back into the scoring model. The result is a router that keeps learning, instead of one that lock-steps to whatever was fastest an hour ago.
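
A rough sketch of the exploration branch, assuming a dictionary of eligible workers and their positive scores. The exploit path uses the standard library's `random.choices` as a stand-in for the weighted sampler; `choose_worker` and `EPSILON` are hypothetical names.

```python
import random

EPSILON = 0.05  # five percent by default, per the text above


def choose_worker(scores: dict, rng: random.Random, epsilon: float = EPSILON):
    """Return (worker, explored). `scores` maps eligible workers to scores > 0."""
    eligible = list(scores)
    if not eligible:
        raise ValueError("no eligible workers")
    if rng.random() < epsilon:
        # Exploration: ignore the scores and pick uniformly.
        return rng.choice(eligible), True
    # Exploitation: weighted pick (stand-in for the A-Res sampler).
    weights = [scores[name] for name in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0], False
```

Returning the `explored` flag makes it possible to tag the resulting latency sample, so exploration traffic can be analyzed separately if needed.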

Prefix-Cache Affinity

The router keeps a per-worker index of which prompt prefixes each worker has recently served. When a new request arrives, the router can predict, before dispatching, how much of the prompt's KV cache is likely already resident on each candidate, and prefer the candidates with the warmest match.
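
One way such an index could work, sketched under assumptions: prompts are split into fixed-size token blocks, each block is chain-hashed so a hash identifies the entire prefix up to that point, and the prediction is the number of leading blocks a worker has already served. The block size, the hashing scheme, and the `PrefixIndex` class are illustrative; the router's actual data structure is not described here.

```python
import hashlib

BLOCK_TOKENS = 16  # assumed block size; the real granularity is not documented


def prefix_block_hashes(tokens):
    """Chain-hash full blocks so each hash covers the whole prefix before it."""
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = tokens[start:start + BLOCK_TOKENS]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.hexdigest())
    return hashes


class PrefixIndex:
    def __init__(self):
        self._seen = {}  # worker name -> set of prefix hashes it has served

    def record(self, worker, tokens):
        self._seen.setdefault(worker, set()).update(prefix_block_hashes(tokens))

    def predict_blocks(self, worker, tokens):
        """Count leading blocks of `tokens` that the worker likely still holds."""
        seen = self._seen.get(worker, set())
        hits = 0
        for block_hash in prefix_block_hashes(tokens):
            if block_hash not in seen:
                break  # a prefix match must be contiguous from the start
            hits += 1
        return hits
```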

Self-calibrating accuracy

Every prediction is checked. After the request completes, the worker reports how many prefix blocks it actually hit, and the router compares that to what it predicted. The per-worker accuracy of those predictions is tracked with an exponential moving average and used to dampen or amplify future prefix-match scores. A worker whose cache map turns out to be a poor predictor gets less benefit of the doubt; a worker whose predictions are consistently right gets a strong-prefix bonus on top.

If accuracy collapses for any worker, usually because its cache state diverged from what the router believed, the router resets that worker's prefix index and rebuilds it from fresh evidence. No human intervention required.
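
The calibration loop can be sketched as an exponential moving average over a per-request accuracy sample, with a reset once the average drops below a floor. The smoothing factor, the reset threshold, the neutral prior after a reset, and the `PrefixCalibrator` name are all assumptions made for illustration.

```python
class PrefixCalibrator:
    """Tracks how well prefix-hit predictions match what one worker reports."""

    def __init__(self, alpha=0.2, reset_below=0.3):
        self.alpha = alpha              # assumed EMA smoothing factor
        self.reset_below = reset_below  # assumed collapse threshold
        self.accuracy = 1.0
        self.resets = 0

    def observe(self, predicted_blocks, actual_blocks):
        """Fold in one completed request. Returns True if the index was reset."""
        top = max(predicted_blocks, actual_blocks)
        sample = 1.0 if top == 0 else min(predicted_blocks, actual_blocks) / top
        self.accuracy = (1 - self.alpha) * self.accuracy + self.alpha * sample
        if self.accuracy < self.reset_below:
            self.resets += 1
            self.accuracy = 0.5  # assumed neutral prior after a rebuild
            return True
        return False

    def damped(self, raw_prefix_score):
        """Scale a raw prefix-match score by the tracked accuracy."""
        return raw_prefix_score * self.accuracy
```

When `observe` returns `True`, the caller would drop that worker's slice of the prefix index and let it refill from new traffic.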

The user-facing effect is simple: when your prompts share stable prefixes (a system prompt, a few-shot block, a long document quoted across many turns) the router quietly funnels those requests to the workers most likely to skip the prefill, without you having to pin or tag anything. See Prefix Caching for prompt-structure best practices.

Per-Project Fairness

A shared router is only as fair as its weakest defense against a noisy neighbor. The router enforces fairness at three layers, so a project burning through its allowance cannot starve quieter projects sharing the same workers.

Concurrent-request fairness

Every project has a soft burst budget tied to its tier. Projects well under their budget are favored in worker selection. Projects at or over their budget are progressively deprioritized: they are not blocked outright, but they are routed last when contention is high. Flex requests always yield to standard and priority. Soft fairness keeps low-volume projects responsive without ever leaving capacity idle.
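
The shape of the soft penalty is not specified. As one plausible sketch, the multiplier below favors projects under budget, decays quadratically once a project exceeds its budget, and never reaches zero, so an over-budget project is deprioritized rather than blocked. The curve, the floor, and the function name are assumptions.

```python
FLOOR = 0.1  # assumed minimum multiplier: over-budget projects are never zeroed


def fairness_multiplier(in_flight: int, burst_budget: int) -> float:
    """Score multiplier for a project with `in_flight` concurrent requests."""
    if burst_budget <= 0:
        raise ValueError("burst_budget must be positive")
    usage = in_flight / burst_budget
    if usage < 1.0:
        # Under budget: favored, up to 1.5x for an idle project.
        return 1.0 + 0.5 * (1.0 - usage)
    # At or over budget: progressively deprioritized, bounded below by FLOOR.
    return max(FLOOR, 1.0 / (usage ** 2))
```

For a budget of 10, this yields 1.5 at zero in-flight requests, 1.0 at the budget, 0.25 at twice the budget, and the floor of 0.1 well beyond it.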

Rate-limit fairness

Above the soft layer, sliding-window rate limits apply per project, with a graduated grace zone: as you approach your limit, response headers begin signaling slowdown so you can throttle cleanly. Cross the hard limit and requests are rejected with standard HTTP semantics.
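
A minimal sliding-window limiter with a grace zone might look like the following. It returns a status string rather than setting headers, because the actual header names and status codes are not given here; a `"reject"` would typically map to something like HTTP 429, which is an assumption. The 80% grace fraction is also illustrative.

```python
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_s=60.0, grace_fraction=0.8):
        self.limit = limit
        self.window_s = window_s
        self.grace_fraction = grace_fraction  # assumed start of the grace zone
        self._hits = deque()  # timestamps of admitted requests

    def check(self, now):
        """Admit or reject one request at time `now` (seconds)."""
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return "reject"
        self._hits.append(now)
        if len(self._hits) >= self.limit * self.grace_fraction:
            return "slow_down"  # admitted, but the caller should throttle
        return "ok"
```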

Spend fairness

Every request passes through an atomic credit ledger before it can be dispatched. Credit holds are taken at admission, settled on completion, and serialized in the database so two concurrent requests can never both spend the last dollar. Spend caps are enforced consistently across every endpoint the router exposes.
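
The hold-then-settle pattern can be illustrated in memory. In this sketch a `threading.Lock` stands in for the database-level serialization the text describes, amounts are integer cents, and the charge is capped at the held amount. All of these choices, and the `CreditLedger` name, are assumptions.

```python
import threading


class CreditLedger:
    def __init__(self, balance_cents):
        self._lock = threading.Lock()  # stand-in for database serialization
        self._available = balance_cents
        self._holds = {}  # request id -> held amount

    @property
    def available(self):
        with self._lock:
            return self._available

    def hold(self, request_id, estimate_cents):
        """Reserve credit at admission. Returns False if funds are insufficient."""
        with self._lock:
            if estimate_cents > self._available:
                return False
            self._available -= estimate_cents
            self._holds[request_id] = estimate_cents
            return True

    def settle(self, request_id, actual_cents):
        """Charge the actual cost on completion and release the unused hold."""
        with self._lock:
            held = self._holds.pop(request_id)
            charge = min(actual_cents, held)  # assumed: never charge past the hold
            self._available += held - charge
```

Because the balance check and the deduction happen under one lock, two concurrent requests cannot both reserve the last of the balance.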

Anti-starvation

Idle projects are never forgotten. Long-idle projects keep their fair-share position when traffic resumes, and a periodic reaper sweeps stale request accounting so a leaked counter cannot keep a project pinned to the back of the line indefinitely.

Secure Agent Protocol

Once a worker is chosen, the request travels over Xerotier's agent mesh, a purpose-built protocol that handles encryption, framing, flow control, and lease management without any of it being your problem.

CurveZMQ encryption

Every router-to-agent connection is mutually authenticated and end-to-end encrypted with CurveZMQ. Each agent is identified by the fingerprint of its public key, so the router can refuse to talk to an agent whose identity it has not previously enrolled. Keys are rotated on a schedule and held in protected memory on the router side.

MessagePack framing

All control-plane messages use a compact, typed binary format (MessagePack with a two-byte type prefix). The router can dispatch on message type without decoding the body, which keeps the hot path cheap. The choice of MessagePack over Python pickle or other dynamic formats also closes off a well-known class of deserialization vulnerabilities.
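
To show why a type prefix keeps dispatch cheap, here is a sketch that reads only the first two bytes and hands the body over undecoded. The big-endian byte order and the type codes are hypothetical; the body is treated as opaque bytes that a handler would decode with a MessagePack library.

```python
import struct

# Hypothetical type codes; the real protocol's values are not documented here.
MSG_HEARTBEAT = 0x0001
MSG_DISPATCH = 0x0002


def frame(msg_type: int, body: bytes) -> bytes:
    """Prefix an already-encoded body with a two-byte type (assumed big-endian)."""
    return struct.pack(">H", msg_type) + body


def peek_type(frame_bytes: bytes) -> int:
    """Read the message type without touching the body."""
    if len(frame_bytes) < 2:
        raise ValueError("frame too short for a type prefix")
    return struct.unpack_from(">H", frame_bytes)[0]


def route(frame_bytes: bytes, handlers: dict):
    """Dispatch on the prefix and pass the undecoded body to the handler."""
    handler = handlers[peek_type(frame_bytes)]
    return handler(memoryview(frame_bytes)[2:])
```

Messages of a type the receiver ignores or merely forwards never pay the cost of decoding.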

Lease-bound workers

Agents do not connect once and trust forever. Each agent holds a short-lived lease that it must renew on a heartbeat. If a lease lapses, the router moves the agent into a grace state and probes it directly; if probes fail, the agent is evicted and its in-flight work is rerouted. The result: a partitioned or hung agent vanishes from the eligible pool in seconds, not minutes.
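
The lease lifecycle described above is a small state machine. The sketch below models it with explicit timestamps so it can be stepped deterministically; the TTL, the probe-failure limit, and the rule that an evicted agent must re-enroll rather than heartbeat its way back are assumptions.

```python
from enum import Enum


class LeaseState(Enum):
    ACTIVE = "active"
    GRACE = "grace"
    EVICTED = "evicted"


class AgentLease:
    def __init__(self, ttl_s=5.0, max_failed_probes=3, now=0.0):
        self.ttl_s = ttl_s
        self.max_failed_probes = max_failed_probes
        self.state = LeaseState.ACTIVE
        self.expires_at = now + ttl_s
        self.failed_probes = 0

    def _renew(self, now):
        self.state = LeaseState.ACTIVE
        self.expires_at = now + self.ttl_s
        self.failed_probes = 0

    def heartbeat(self, now):
        """A heartbeat renews the lease unless the agent is already evicted."""
        if self.state is not LeaseState.EVICTED:
            self._renew(now)

    def tick(self, now, probe_ok=None):
        """Advance the state. `probe_ok` is a probe result, used only in GRACE."""
        if self.state is LeaseState.ACTIVE:
            if now >= self.expires_at:
                self.state = LeaseState.GRACE
        elif self.state is LeaseState.GRACE and probe_ok is not None:
            if probe_ok:
                self._renew(now)  # recovered: reinstate
            else:
                self.failed_probes += 1
                if self.failed_probes >= self.max_failed_probes:
                    self.state = LeaseState.EVICTED
        return self.state
```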

Credit-based streaming backpressure

Streaming responses are governed by a credit-based flow control protocol layered on top of ZeroMQ high-water marks. Each stream has a window of bytes the sender is allowed to emit; the receiver replenishes the window as it consumes data. If the client slows down, the credit window contracts and the worker naturally pauses generation rather than buffering unbounded tokens upstream. Memory stays predictable even under SSE backpressure.
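
The core of credit-based flow control fits in a few lines. In this sketch the sender checks its remaining credit before emitting a chunk, and the receiver returns credit as it consumes data. The `StreamWindow` class and the byte-granular accounting are illustrative; the ZeroMQ layer underneath is omitted.

```python
class StreamWindow:
    """Sender-side view of one stream's credit window, in bytes."""

    def __init__(self, window_bytes):
        self.window_bytes = window_bytes
        self.credits = window_bytes

    def try_send(self, chunk: bytes) -> bool:
        """Spend credit for a chunk, or return False so the sender pauses."""
        if len(chunk) > self.credits:
            return False
        self.credits -= len(chunk)
        return True

    def replenish(self, consumed_bytes):
        """Called when the receiver reports consumption. Capped at the window."""
        self.credits = min(self.window_bytes, self.credits + consumed_bytes)
```

A worker that gets `False` from `try_send` stops generating until credit returns, which is what keeps upstream buffers bounded when a client reads slowly.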

Resilience & Self-Healing

A production router has to assume things will break and keep serving traffic anyway. The router carries the usual primitives, and a few less-usual ones.

  • Circuit breakers: Sensitive subsystems (such as memory extraction) sit behind per-endpoint breakers. Consecutive failures open the circuit and short-circuit further calls for a cooldown window, so a sick downstream cannot cascade into the request path.
  • Lease probes: Agents in the grace state are actively pinged with round-trip latency measured. Workers that recover are reinstated; workers that do not are evicted cleanly.
  • Tier fallback & spillover: Per-worker queues have bounded depth. When a tier saturates, requests can be resumed on adjacent capacity instead of deadlocking against a single hot worker.
  • Cold provisioning: When no warm worker holds your model, eligible candidates can be warmed and routed around until the model is ready; cold-load orchestration is handled by the agent fleet, not the router request path.
  • Self-healing prefix index: A worker whose prefix predictions stop being accurate gets its slice of the index flushed and rebuilt automatically.
  • Stale-request reaper: A background sweep evicts request accounting that has been pending too long, preventing leaked counters from corrupting fairness over time.
  • Enrollment rate limiting: The agent-enrollment routes pass through a per-source token-bucket limiter so a misbehaving network cannot flood the control plane. Enroll-initiate and enroll-complete are auth-gated by the standard middleware chain; the refresh route is rate-limited and verifies its bearer inline rather than through middleware auth.
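
The circuit-breaker behavior in the list above (consecutive failures open the circuit, calls are short-circuited for a cooldown window) can be sketched as follows. The threshold, the cooldown length, and the single trial call after cooldown are assumptions, not documented router settings.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # time the circuit opened, or None if closed

    def allow(self, now) -> bool:
        """Return True if a call may proceed at time `now` (seconds)."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open).
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now) -> None:
        """Report the outcome of a call that was allowed."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
```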

Observability

The router is built for operators, not just for end users. Every request is observable end-to-end, every decision is auditable, and the cost of that observability is bounded.

  • Distributed tracing: Standards-compliant W3C traceparent headers are extracted on ingress and propagated to downstream agents, so a request can be followed across the router, the inference worker, any tool calls, and back, in your existing tracing stack.
  • Prometheus metrics with cardinality protection: All key router decisions are exported as Prometheus metrics. A built-in cardinality limiter caps the number of unique tenant and region labels and folds overflow into a shared bucket, so a runaway label cannot blow up your monitoring backend.
  • Structured audit log: Approval gates and admin actions (agent management, admin API, storage-tier and cache controls, maintenance operations, and chat administration) are written to a structured audit log in the database, queryable for compliance reporting.
  • Per-request usage accounting: Tokens in, tokens out, cached tokens, and (where applicable) research and reasoning tokens are returned in the response payload so you can attribute cost to features and users without scraping logs.
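
The cardinality limiter mentioned in the list above can be sketched as a bounded set of admitted label values, with everything past the cap folded into one shared bucket. The cap, the overflow label name, and the `LabelLimiter` class are hypothetical.

```python
class LabelLimiter:
    """Caps the number of distinct values a metric label may take."""

    def __init__(self, max_values, overflow="__overflow__"):
        self.max_values = max_values
        self.overflow = overflow  # assumed name for the shared bucket
        self._seen = set()

    def label(self, value: str) -> str:
        """Return the value to export: the original, or the overflow bucket."""
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow
```

Values admitted before the cap keep their own time series; later arrivals share one, so the total series count stays bounded regardless of how many tenants appear.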

In Operator Terms

  • You write to one OpenAI-compatible API. The router decides, per request, where it actually runs.
  • Your prompts stay warm. Prefix-cache affinity, model affinity, and region affinity all stack, without you having to pin or tag anything.
  • Your neighbors cannot starve you. Soft fairness, hard rate limits, and atomic credit holds keep noisy tenants from stealing your capacity.
  • Failures are contained. Circuit breakers, lease probes, tier spillover, and a self-healing affinity index keep one bad worker from poisoning the pool.
  • You can see what happened. W3C tracing across every hop, cardinality-safe Prometheus metrics, and a structured audit trail.

Bring your own agents, use our managed pools, or mix the two behind a single project. The router treats them as one tier-aware fleet, and gives every request a fair, fast, observable home.