Deploy Your First XIM
Blank GPU host to a XIM (Xerotier Inference Mesh node) serving an OpenAI-compatible chat completion through your router, in about thirty minutes. Six numbered steps, real shell transcripts, one curl at the end. CUDA, ROCm, and CPU paths inline.
Prerequisites
- A Linux host with one of: NVIDIA CUDA GPU (compute capability 7.5+), AMD ROCm GPU (gfx906+), or AMD EPYC CPU with ZenDNN. CPU-only runs are supported for development but are too slow for production.
- Container runtime: podman or docker with the appropriate GPU toolkit (nvidia-container-toolkit or ROCm device-plugin).
- At least 50 GiB free disk for the model cache (more for larger models). The agent caches models under ~/.cache/xerotier/models by default; override with XEROTIER_AGENT_MODEL_CACHE_PATH or by mounting a volume at that location.
- Outbound HTTPS to the router URL (for enrollment) and outbound TCP to the router's CurveZMQ port (for the data plane).
- An API key with the management scope on your workstation, for minting the join key.
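The disk-space prerequisite can be checked before anything is installed. The following is a minimal sketch, not part of the Xerotier tooling: free_gib and check_cache_space are hypothetical helper names, and the 50 GiB default simply mirrors the guideline above.

```shell
# Sketch: check free disk space against the 50 GiB guideline.
# free_gib and check_cache_space are hypothetical helper names.

# Print whole GiB available on the filesystem that holds path $1.
free_gib() {
  # -P forces one POSIX-format line per filesystem; -k reports 1 KiB blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when path $1 has at least $2 GiB free (default 50).
check_cache_space() {
  local path="$1" need="${2:-50}" have
  have="$(free_gib "$path")"
  [ -n "$have" ] || { echo "cannot inspect ${path}" >&2; return 1; }
  if [ "$have" -ge "$need" ]; then
    echo "ok: ${have} GiB free at ${path}"
  else
    echo "short: ${have} GiB free at ${path}, need ${need}" >&2
    return 1
  fi
}
```

Point it at an existing directory on the filesystem that will hold the cache, for example check_cache_space "$HOME", since the default cache directory may not exist yet.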
Step 1: Create a Join Key
From your workstation, mint a join key against the
router that the XIM should enroll with. The
--router-addr value is the CurveZMQ
address the agent will dial after enrollment.
xeroctl agents join-keys --create \
--name xim-gpu-pool \
--region us-east \
--router-addr tcp://router.example.com:5555 \
--ttl-seconds 900 \
--supported-tiers gpu
Transcript:
Join key 'xjk_demo_abcdef0123456789wxyz' created successfully.
Join Token: eyJhbGciOi...
Save this token, it will not be shown again.
ID: xjk_demo_abcdef0123456789wxyz
Name: xim-gpu-pool
Region: us-east
Max Enrollments: 1
Expires: 2026-04-22T03:30:00Z
Copy the join token (a JWT). Its lifetime is bounded by --ttl-seconds and its use count by --max-enrollments; once consumed, it cannot be reused. The router assigns each enrolled XIM a stable worker identity, which is persisted in the agent's state directory.
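Because the token is short-lived, it can help to confirm how much time remains before carrying it to the XIM host. The sketch below computes the seconds left from the Expires value in the transcript. seconds_until is a hypothetical helper, and it assumes GNU date (the -d option), which is typical on Linux.

```shell
# Sketch: seconds left before a join key's Expires timestamp.
# seconds_until is a hypothetical helper; it requires GNU date for -d.
#   $1 = ISO 8601 expiry, $2 = optional "now" as epoch seconds.
seconds_until() {
  local expires="$1" now="${2:-$(date -u +%s)}" exp
  exp="$(date -u -d "$expires" +%s)" || return 1
  echo $((exp - now))
}
```

For example, seconds_until 2026-04-22T03:30:00Z prints a positive number while the key is still valid and a negative one after it has expired.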
The minting account's API key must carry the
management scope. Mint one from the
API Keys page if you do not
already have one.
The flags shown above cover the create path only.
Run xeroctl agents join-keys --help for
the full surface, including --list,
--revoke, --include-terminal,
--force, --limit, and
--after.
Step 2: Prepare the Host
Clone the
cloudnull/xerotier-public
repository on the XIM host (or copy just the
compose/ directory) so the compose
files referenced below are available locally.
git clone https://github.com/cloudnull/xerotier-public.git
cd xerotier-public/compose
On the XIM host, create the directory layout:
sudo install -d -m 0755 /etc/xerotier
sudo install -d -m 0755 /var/lib/inference
sudo install -d -m 0755 /var/log/xerotier
Pick the deployment that matches your accelerator. All three options use the published container image from cloudnull/xerotier-public; pick exactly one. On an Apple Silicon Mac, stop here and use the native application instead; see XIM on macOS.
Option A: NVIDIA CUDA
docker compose -f compose.agent-nvidia.yaml up -d
Requires nvidia-container-toolkit and a
driver compatible with the bundled CUDA runtime
(CUDA 12.x).
Option B: AMD ROCm
docker compose -f compose.agent-amd-rocm.yaml up -d
Requires the ROCm runtime visible at
/dev/kfd and /dev/dri/*.
The image targets gfx906+ and includes the
xerotier-vllm wrapper.
Option C: CPU (AMD EPYC + ZenDNN)
docker compose -f compose.agent-amd-cpu-zendnn.yaml up -d
For development and testing only. KV cache offload flags do not apply on CPU backends; vLLM serves inference directly from host RAM.
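If you script the choice between the three options, one approach is to look at which device nodes exist on the host. This is only a sketch: pick_compose_file is a hypothetical helper, and it assumes that /dev/nvidia0 or /dev/kfd being present means the matching driver stack is usable, which a real preflight should verify separately.

```shell
# Sketch: suggest a compose file from the device nodes present.
# pick_compose_file is a hypothetical helper; $1 is an optional root
# prefix (empty for the real /dev), which also allows a dry run.
pick_compose_file() {
  local root="${1:-}"
  if [ -e "$root/dev/nvidia0" ]; then
    echo compose.agent-nvidia.yaml
  elif [ -e "$root/dev/kfd" ]; then
    echo compose.agent-amd-rocm.yaml
  else
    # No GPU node found: fall back to the CPU (ZenDNN) option.
    echo compose.agent-amd-cpu-zendnn.yaml
  fi
}
```

The result can be passed straight to the commands above, for example docker compose -f "$(pick_compose_file)" up -d.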
Step 3: Configure the Environment
Compose file: the example commands in this and the next step use compose.agent-nvidia.yaml. If you chose ROCm or CPU in Step 2, substitute compose.agent-amd-rocm.yaml or compose.agent-amd-cpu-zendnn.yaml in every command below.
Create a .env file in the
compose/ directory. Docker Compose reads
it automatically for variable substitution, so the
agent picks up the join key on the next
up.
# Enrollment (first run only; remove after successful enrollment)
XEROTIER_AGENT_JOIN_KEY=eyJhbGciOi...
# KV cache CPU offloading (optional, NVIDIA CUDA only)
# Default: 25% of system RAM, clamped [4, 128] GiB. Set 0 to disable.
# Ignored on AMD ROCm and CPU backends.
XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB=
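To see what the default would be on a given host, the rule in the comment above (25% of system RAM, clamped to 4 through 128 GiB) can be worked through by hand. The sketch below is illustrative only: both function names are hypothetical, and the use of integer (floor) division is an assumption, since the agent's exact rounding is not stated here.

```shell
# Sketch of the documented default: 25% of system RAM, clamped to [4, 128] GiB.
# Floor division is an assumption; the agent's rounding is not documented here.
#   $1 = total system RAM in whole GiB.
kv_offload_default_gib() {
  local total_gib="$1" size
  size=$((total_gib / 4))
  [ "$size" -lt 4 ] && size=4       # lower clamp
  [ "$size" -gt 128 ] && size=128   # upper clamp
  echo "$size"
}

# Total system RAM in whole GiB, read from /proc/meminfo (MemTotal is in kB).
system_ram_gib() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}
```

Under these assumptions a 64 GiB host gets 16 GiB, an 8 GiB host is raised to the 4 GiB floor, and a 1 TiB host is capped at 128 GiB. On the XIM host itself, kv_offload_default_gib "$(system_ram_gib)" gives the estimate.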
The agent does not need a router address or CURVE
public key: both arrive in the enrollment response
and are persisted under
/var/lib/inference/ alongside the
agent's own auto-generated CURVE keypair.
Treat this file as a secret: the join token is a bearer credential. Run chmod 600 .env, never commit it to a repository, and never copy it onto an operator workstation.
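One way to avoid a window in which the file is readable by other users is to create it under a restrictive umask rather than tightening permissions afterwards. This is a sketch under that assumption; write_env_file is a hypothetical helper, not part of the compose tooling.

```shell
# Sketch: create .env so it is never group- or world-readable.
# write_env_file is a hypothetical helper: $1 = directory, $2 = join token.
write_env_file() {
  local dir="$1" token="$2"
  (
    umask 077   # files created in this subshell get mode 0600
    printf 'XEROTIER_AGENT_JOIN_KEY=%s\n' "$token" > "$dir/.env"
  ) || return 1
  # Covers the case where .env already existed with looser permissions.
  chmod 600 "$dir/.env"
}
```

Note that this overwrites any existing .env in the target directory, so add other variables, such as XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB, after calling it.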
Recreate the container so the new env file takes effect:
docker compose -f compose.agent-nvidia.yaml up -d --force-recreate
Step 4: Start the Agent
docker compose -f compose.agent-nvidia.yaml up -d
docker compose -f compose.agent-nvidia.yaml logs -f agent
Expected log lines on success (paraphrased; the agent emits structured log records with these fields, and exact wording may vary):
[info] consuming join key from env XEROTIER_AGENT_JOIN_KEY
[info] enrollment succeeded, worker_id=wkr_01HX..., region=us-east
[info] detected accelerator: nvidiaCUDA, gpu_count=1, vram=24576MiB
[info] KV offload enabled size_gib=16 accelerator=nvidiaCUDA
[info] vLLM process started pid=42
[info] vLLM engine ready, listening on /tmp/xerotier-engine.sock
[info] lease established, heartbeat every 10s
The join key is consumed; CURVE keys and the worker
state file are written to
/var/lib/inference/. Do not re-use the
join key even if the enrollment seems to have
failed.
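The .env comment in Step 3 marks the join key as first-run only. Once the log shows that enrollment succeeded, the line can be removed without touching the rest of the file. The sketch below is one way to do that; scrub_join_key is a hypothetical helper.

```shell
# Sketch: drop the one-time join key from .env after enrollment succeeds.
# scrub_join_key is a hypothetical helper: $1 = path to the .env file.
scrub_join_key() {
  local file="$1" tmp
  [ -f "$file" ] || return 1
  tmp="$(mktemp "${file}.XXXXXX")" || return 1   # mktemp creates the file 0600
  # grep exits 1 when every line is filtered out; that is not an error here.
  grep -v '^XEROTIER_AGENT_JOIN_KEY=' "$file" > "$tmp" || [ $? -eq 1 ] || {
    rm -f "$tmp"
    return 1
  }
  # Rename within the same directory so the replacement is atomic.
  mv "$tmp" "$file"
}
```

Run it from the compose/ directory as scrub_join_key .env. Writing to a temporary file and renaming keeps the other settings intact even if the command is interrupted.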
If something stalls here, see Troubleshooting for enrollment rejected, no accelerator detected, and vLLM startup timed out: the three failures that occur most often at this step.
Step 5: Load a First Model
Models are project-scoped resources. Importing a new model from a registry (HuggingFace and similar) is a one-time setup performed in the dashboard; once a model exists in the project, an endpoint binds it to a service tier and schedules it onto an enrolled XIM.
5a: Import the model (one-time)
Open the Models page in the
dashboard, click Add Model, paste
the registry path
(meta-llama/Llama-3.1-8B-Instruct),
and confirm. The Frontend records the model in the
project catalog and assigns it a UUID.
To list project models from the CLI:
xeroctl models
The first column of the output is the model UUID;
copy it into --model-id below.
5b: Create an endpoint that binds the model
Either create the endpoint in the Endpoints dashboard, or from the CLI:
xeroctl endpoints create \
--name "Llama 3.1 8B" \
--slug llama-31-8b \
--model-id <model-uuid> \
--tier-id gpu \
--task-mode generate
5c: Provision the endpoint to a XIM
Provisioning dispatches the model to the matching XIM. The router streams weights to the agent and waits for the vLLM engine to report ready.
xeroctl endpoints provision <endpoint-uuid>
xeroctl endpoints list
When the endpoint status reaches
active, the model is live and routable.
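Provisioning can take a while because weights have to be streamed, so rather than rerunning the list command by hand you may want to poll. The sketch below is a generic polling helper, not a xeroctl feature: wait_for_match is a hypothetical name, and it only inspects whatever text the wrapped command prints.

```shell
# Sketch: poll a command until its output matches a pattern.
# wait_for_match is a hypothetical helper:
#   $1 = extended regex, $2 = max attempts, $3 = seconds between attempts,
#   remaining arguments = the command to run on each attempt.
wait_for_match() {
  local pattern="$1" tries="$2" delay="$3" i=1
  shift 3
  while [ "$i" -le "$tries" ]; do
    if "$@" 2>/dev/null | grep -Eq -- "$pattern"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1   # never matched within the attempt budget
}
```

A call such as wait_for_match 'llama-31-8b.*active' 60 10 xeroctl endpoints list would wait up to ten minutes. That pattern assumes the list output prints the slug and the status on the same line, which is not confirmed here; adjust it to the output you actually see.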
Step 6: Run a Chat Completion
Hit the OpenAI-compatible API on the router:
curl https://router.example.com/v1/chat/completions \
-H "Authorization: Bearer $XEROTIER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Say hi in one word."}
]
}'
Expected response:
{
"id": "chatcmpl-01HX...",
"object": "chat.completion",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Hi."},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 14, "completion_tokens": 2, "total_tokens": 16}
}
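For a scripted smoke test, the reply text can be extracted from the response instead of being read by eye. This is a minimal sketch that assumes jq is installed on the machine running curl; reply_text is a hypothetical helper name.

```shell
# Sketch: pull the assistant's reply out of a chat.completion document.
# reply_text is a hypothetical helper; it reads JSON on stdin and needs jq.
reply_text() {
  # -e makes jq exit non-zero when the field is missing or null;
  # -r prints the string without JSON quoting.
  jq -er '.choices[0].message.content'
}
```

Pipe the curl command above into reply_text, adding -fsS to curl so that HTTP errors fail the pipeline instead of printing an error body. With the response shown above it prints Hi. and exits 0; an error payload without a choices array makes it exit non-zero.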
Congratulations, your first XIM is serving model inference through the router. Next, tune the inference stack for your workload: XIM Advanced Configuration.
Troubleshooting
See XIM Advanced Configuration for the full operational guide. Common first-deploy issues:
- Enrollment rejected: the enrollment response carries a non-zero error code and message. Typical causes are an expired or already-consumed join token, or a region/tier mismatch. Mint a new join key and retry.
- No accelerator detected: the container cannot see the GPU. Verify nvidia-container-toolkit is installed and the compose file exposes /dev/nvidia*, or for ROCm verify /dev/kfd and /dev/dri/* are mounted into the container.
- vLLM startup timed out: the agent log emits vLLM startup timed out when the engine fails to report ready within its inactivity grace or absolute ceiling. Usually the model is too large for available VRAM: reduce --gpu-memory-utilization, raise XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB (NVIDIA only), or switch to a smaller or quantized model.
- Model pull failure: the agent could not stream weights from the router or upstream registry. Confirm outbound HTTPS works from the host, check that the model cache path (~/.cache/xerotier/models by default) has free space, and inspect the router log for the upstream HTTP error.
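For the no-accelerator case, it helps to see exactly which of the device nodes named above are visible. The sketch below lists them; list_accel_nodes is a hypothetical helper, and it checks only that the nodes exist, not that the driver behind them works.

```shell
# Sketch: list the accelerator device nodes visible under a root.
# list_accel_nodes is a hypothetical helper; $1 is an optional root prefix
# (empty for the real /dev). Returns non-zero when none are found.
list_accel_nodes() {
  local root="${1:-}" found=1 node
  for node in "$root"/dev/nvidia* "$root"/dev/kfd "$root"/dev/dri/*; do
    [ -e "$node" ] || continue   # an unmatched glob stays literal; skip it
    echo "$node"
    found=0
  done
  return "$found"
}
```

Run it on the host first. If nodes appear there but the agent still reports no accelerator, the gap is more likely in the GPU toolkit or in the device mappings of the compose file than in the driver.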