Documentation
An OpenAI-compatible inference surface, fully documented. Existing OpenAI SDKs and HTTP clients target it after a base-URL and key swap.
Overview
Xerotier.ai is a multi-tenant inference platform that serves open-source AI models behind an OpenAI-compatible API. Change your base URL and API key to start issuing requests against Xerotier-hosted endpoints or against your own XIM nodes.
Key Features
- Drop-in OpenAI SDKs - Point the official Python or Node.js client at the Xerotier base URL; no fork, no shim.
- Xerotier Inference Microservice (XIM) - Self-host the same workers Xerotier runs, behind your own perimeter, registered into the router.
- Custom Model Upload - Push a HuggingFace directory or archive; the router preflights GPU fit and serves it under your project.
- Streaming Support - SSE token stream with vendor-prefixed `x_*` events for reasoning, tool calls, and metadata.
- Native KV Cache Offload - vLLM CPU KV offload plus prefix caching, on by default, to cut time-to-first-token on repeated prompts.
Ready to issue a request? Head to the Quickstart for a working curl call, or read Error Handling first if you maintain a strict SDK integration.
Getting Started
Account to first response in five minutes. Authenticate, point your client at a Xerotier base URL, send a request.
Introduction
Platform shape, base URLs per surface, and which OpenAI SDK calls cross over unchanged.
Quickstart
Create an account, provision an endpoint, send a chat completion. Copy-paste curl included.
Authentication
Scoped API keys, rotation, and the trade-offs between project-wide and endpoint-bound scopes.
The Router
Thirteen-signal composite scoring with A-Res weighted-random selection and 5% epsilon-greedy exploration.
API Reference
Every documented endpoint, request shape, and SSE event the router accepts and emits. Identical to OpenAI where the spec is identical; vendor-prefixed where it diverges.
Chat Completions
POST /v1/chat/completions with full tool-call, streaming, and reasoning-content support.
Responses API
POST /v1/responses for higher-level agentic flows with server-managed turns and built-in tools.
Streaming API
SSE frame format, the response.* and x_* event family, and SDK-side handling notes.
Tool Calling
Tool schemas, parallel calls, and how function results round-trip back through the model.
MCP Integration
Attach external Model Context Protocol servers; their tools surface as native tool calls.
Server-Side Tools
Web search, code interpreter, and file-search hosted by the router, no client wiring.
Web Search
Built-in web fetch with citations; usage is metered separately and surfaced in x_chat.metadata.
Embeddings
POST /v1/embeddings with batched input, base64 output, and per-model dimension control.
Reranking & Scoring
Score query-document pairs or rerank a candidate list against a query, batch-friendly.
Conversations
Server-stored multi-turn threads; resume by id with full reasoning and tool-call history.
Files API
Upload, list, and reference files by id from chat completions, responses, and batch.
Batch API
Submit a JSONL of requests for asynchronous processing at lower per-token cost.
Uploads API
Resumable multipart upload for files larger than the synchronous POST cap.
Stored Completions
List, retrieve, and export completions retained against a project for audit and replay.
Error Handling
OpenAI-shaped error envelopes, retry-after semantics, and which 5xx codes are safe to retry.
Features
Capabilities that extend the OpenAI surface: memory, document workspace, prefix caching, service tiers, SLO targets, and a workspace graph the router uses to route.
Chat Memory
Project-scoped memories distilled from prior turns and injected into matching new requests.
Document Workspace
Per-chat artifact area; uploads are chunked, embedded, and recallable from the same conversation.
Prefix Caching
vLLM prefix-cache hits route to the worker that already holds the KV, cutting TTFT on warm prompts.
Service Tiers
Pick latency, throughput, or cost-optimized routing per request via the service_tier field.
SLO Tracking
Configure p95 latency and error-rate budgets per endpoint; consumption rolls up to /usage.
Workspace Graph
Force-directed view of projects, endpoints, models, and artifacts; the same data the router scores against.
Chat Branching
Fork at any turn; each branch keeps its own reasoning and tool-call lineage for compare runs.
Model Management
Upload, share, version, and discover models.
Model Management
Full lifecycle: upload, inspect, pin, retire, and surface custom models inside the project.
Model Sharing
Publish a private model to the public catalog with discoverability and licensing controls.
Model Upload
Push a HuggingFace directory or tar archive; the router preflights GPU fit before activation.
Model Versioning
Semantic version pins, staged rollouts, and atomic switch-over without dropping in-flight requests.
Model Catalog
Filterable index of public and project-scoped models with context length and license per entry.
Guides
Field-tested patterns, integrations, and the advanced flags the router exposes.
Usage Guides
Streaming patterns, rate-limit backoff, retry envelopes, and idempotency-key wiring.
SDK & Integrations
Per-SDK base-URL config: openai-python, openai-node, LangChain, LlamaIndex, opencode, Cursor.
OpenCode Integration
Drive Xerotier endpoints from the OpenCode CLI with one-shot config import.
Platform
Operator surfaces around the inference surface: teams, auth, storage, billing, status, and the webhooks that tie them to your own systems.
Teams & User Management
Owner, admin, member, viewer roles; project-scoped invitations with email or join-key flow.
Authentication & Security
Session cookies for the dashboard, scoped API keys for clients, rotation without dropping live traffic.
Endpoint CORS
Per-endpoint origin allowlists with preflight handling; browsers hit the inference URL directly.
Storage
S3-compatible object store backing the Files, Uploads, and Document Workspace surfaces.
Usage Tracking & Billing
Per-token metering, daily rollups in daily_usage_rollups, and a credit ledger with atomic debits.
Billing & Subscriptions
Plan tiers, prepaid credit top-ups, invoice retrieval, and payment-method rotation.
Budget Alerts
Per-project spend thresholds with webhook and email delivery; soft warns before hard caps.
Data Export
JSONL export of conversations, artifacts, and rollups; signed-URL download with TTL.
Status & Maintenance
Live status feed and maintenance-window calendar; subscribe via webhook or RSS.
Webhooks
Signed POST callbacks for platform events: usage, approvals, status, and XEM lifecycle.
Infrastructure
Choose between Xerotier-hosted inference and self-hosted Xerotier Inference Microservice (XIM) nodes you run on your own GPUs.
XIM Deployment
Docker-based deploy of a XIM node on CUDA or ROCm GPUs; auto-registers with the router on boot.
XIM on macOS
Native Apple Silicon app with Metal-accelerated vLLM. Download the DMG, paste a join key, enroll. No container.
XIM Advanced
Scheduler tuning, KV cache sizing, paged-attention flags, and per-runtime vLLM overrides.
Private Agents
Run XIM or XEM workers inside your VPC; mesh dials out to the router over CurveZMQ.
Hosted vs Self-Hosted
Side-by-side comparison: latency, fleet ops, cost model, and which knobs each surface exposes.
XIM Guides
End-to-end walkthroughs for operating self-hosted XIM nodes.
Execution Management (XEM)
Long-running agent workflows that the router schedules, approves, retries, and reports on. Same authentication, same observability, same envelopes as the inference surface.
XEM Overview
Execution graph model, lifecycle states, and where XEM fits next to a one-shot chat completion.
Error Codes
Numeric code table with retry guidance per code; structured envelope matches /v1/chat errors.
Webhook Events
exec.started, exec.approval_requested, exec.completed and friends; signed and idempotency-keyed.
Approvals
Pause execution before any destructive tool call; resume via approval URL or webhook reply.
Environment Variables
XEROTIER_XEM_* env table with defaults, ranges, and which knobs reload without a restart.
Config Files
YAML execution-graph schema; nodes, edges, tool bindings, and approval gates.
Glossary
Terms used across the XEM surface: lease, credit, iteration, approval, gate, runner.
XEM Guides
Task-oriented walkthroughs for building on XEM.
Deploy Your First XEM
From a single YAML to a running execution graph with one xeroctl exec submit.
Author a Chat Template
Reusable prompt and tool bundles versioned alongside your execution graphs.
Slack Approval Bot
Wire exec.approval_requested webhooks into a Slack workflow with interactive approve / deny.
Incident Response
Runbooks for stuck iterations, exhausted approvals, and credit-ledger reconciliation.
Credential Rotation
Rotate API keys consumed by long-lived XEM executions without dropping the run.
Tools
Same surface the dashboard speaks, scripted. The CLI carries chat, responses, models, embeddings, rerank, batches, files, conversations, webhooks, keys, agents, slos, uploads, config, platform ops, exec, templates, approvals, and learnings.
Most-used subcommands
chat
Send a chat completion from the terminal; stream output to stdout or pipe to jq.
responses
Drive the Responses API and its tool surface from the shell, suitable for cron pipelines.
models
List catalog entries, inspect a model card, and pin a version for the project default.
keys
Create, scope, and rotate API keys; emits the raw key once and the id-only thereafter.
agents
Enroll a XIM or XEM agent, inspect its lease, and revoke a worker without a restart.
exec
Submit, monitor, and approve XEM executions; tails the SSE event stream by default.
All 19 subcommand pages are listed in the sidebar under Tools.