Deploy your first XIM

Prerequisites

A Linux host with one of: NVIDIA CUDA GPU (compute capability 7.5+), AMD ROCm GPU (gfx906+), or AMD EPYC CPU with ZenDNN. CPU-only runs are supported for development but are too slow for production.
Container runtime: podman or docker with the appropriate GPU toolkit (nvidia-container-toolkit or ROCm device-plugin).
At least 50 GiB free disk for the model cache (more for larger models). The agent caches models under ~/.cache/xerotier/models by default; override with XEROTIER_AGENT_MODEL_CACHE_PATH or by mounting a volume at that location.
Outbound HTTPS to the router URL (for enrollment) and outbound TCP to the router's CurveZMQ port (for the data plane).
An API key with the management scope, available on your workstation, for minting the join key.
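
The disk, runtime, and connectivity prerequisites above can be checked before you start. The following is a minimal sketch, not part of the product: every function name is hypothetical, router.example.com and port 5555 are the placeholder values used in this guide, and it assumes GNU coreutils plus curl and bash on the host.

```shell
# Print free space, in whole GiB, on the filesystem that holds the nearest
# existing ancestor of the given path (the cache directory may not exist yet).
free_gib() {
    local path="$1"
    while [ ! -e "$path" ]; do
        path="$(dirname "$path")"
    done
    # df -Pk prints one POSIX-format line per filesystem, sizes in KiB;
    # column 4 is the available space.
    df -Pk "$path" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed if docker or podman is on PATH.
have_container_runtime() {
    command -v docker >/dev/null 2>&1 || command -v podman >/dev/null 2>&1
}

# Succeed if the URL answers with any HTTP response within 5 seconds.
can_reach_https() {
    curl -sS -o /dev/null --max-time 5 "$1"
}

# Succeed if a TCP connection to host ($1) and port ($2) opens within
# 5 seconds. Relies on bash's /dev/tcp pseudo-device.
can_reach_tcp() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run all checks and print a warning for each one that fails.
preflight() {
    local cache free
    cache="${XEROTIER_AGENT_MODEL_CACHE_PATH:-$HOME/.cache/xerotier/models}"
    free="$(free_gib "$cache")"
    [ "$free" -ge 50 ] || echo "WARN: ${free} GiB free for $cache, want 50 or more"
    have_container_runtime || echo "WARN: neither docker nor podman found"
    can_reach_https https://router.example.com || echo "WARN: no HTTPS to router"
    can_reach_tcp router.example.com 5555 || echo "WARN: no TCP to CurveZMQ port"
}

# Usage: source this file on the XIM host, then run: preflight
```

The sketch does not check GPU compute capability or the GPU toolkit; those are easiest to confirm with the vendor's own tools.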

Step 1: Create a Join Key

From your workstation, mint a join key against the router that the XIM should enroll with. The --router-addr value is the CurveZMQ address the agent will dial after enrollment.

bash

xeroctl agents join-keys --create \
    --name xim-gpu-pool \
    --region us-east \
    --router-addr tcp://router.example.com:5555 \
    --ttl-seconds 900 \
    --supported-tiers gpu

Transcript:

text

Join key 'xjk_demo_abcdef0123456789wxyz' created successfully.

Join Token: eyJhbGciOi...
Save this token, it will not be shown again.

  ID: xjk_demo_abcdef0123456789wxyz
  Name: xim-gpu-pool
  Region: us-east
  Max Enrollments: 1
  Expires: 2026-04-22T03:30:00Z

Copy the JWT. The token is TTL-bounded by --ttl-seconds and capped to --max-enrollments. Once consumed it cannot be reused. The router assigns each enrolled XIM a stable worker identity, which is persisted in the agent's state directory.
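
Because the token is shown only once, it can be worth confirming that what you copied is a complete JWT before pasting it anywhere. The sketch below checks only the shape (three dot-separated segments, with a header that decodes to a JSON object); it does not verify the signature and makes no assumption about which claims the router puts in the payload. The function names are hypothetical and base64 -d assumes GNU coreutils.

```shell
# Decode one base64url-encoded JWT segment to stdout.
b64url_decode() {
    local seg
    # Map the URL-safe alphabet back to standard base64.
    seg="$(printf '%s' "$1" | tr '_-' '/+')"
    # Restore the '=' padding that JWT encoding strips.
    case $(( ${#seg} % 4 )) in
        2) seg="${seg}==" ;;
        3) seg="${seg}=" ;;
    esac
    printf '%s' "$seg" | base64 -d
}

# Succeed if the argument has exactly three dot-separated segments and its
# first segment decodes to something that looks like a JSON object.
looks_like_jwt() {
    local token="$1" header
    case "$token" in
        *.*.*.*) return 1 ;;   # three or more dots: too many segments
        *.*.*) ;;              # exactly two dots
        *) return 1 ;;
    esac
    header="$(b64url_decode "${token%%.*}" 2>/dev/null)" || return 1
    case "$header" in
        '{'*'}') return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: looks_like_jwt "$JOIN_TOKEN" && echo "token shape looks complete"
```

Run the check locally; a bearer credential should never be pasted into an online decoder.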

The minting account's API key must carry the management scope. Mint one from the API Keys page if you do not already have one.

The flags shown above cover the create path only. Run xeroctl agents join-keys --help for the full surface, including --list, --revoke, --include-terminal, --force, --limit, and --after.

Step 2: Prepare the Host

Clone the cloudnull/xerotier-public repository on the XIM host (or copy just the compose/ directory) so the compose files referenced below are available locally.

bash

git clone https://github.com/cloudnull/xerotier-public.git
cd xerotier-public/compose

On the XIM host, create the directory layout:

bash

sudo install -d -m 0755 /etc/xerotier
sudo install -d -m 0755 /var/lib/inference
sudo install -d -m 0755 /var/log/xerotier

                
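To confirm the layout before starting a container, a check along these lines may help. It is a sketch with hypothetical function names, and stat -c assumes GNU coreutils.

```shell
# Print "<octal mode> <file type>" for a path, for example "755 directory".
mode_and_type() {
    stat -c '%a %F' "$1"
}

# Report every listed path that is missing, is not a directory, or is not
# mode 0755. Returns non-zero if any path fails the check.
check_layout() {
    local dir status=0
    for dir in "$@"; do
        if [ "$(mode_and_type "$dir" 2>/dev/null)" != "755 directory" ]; then
            echo "unexpected or missing: $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage: check_layout /etc/xerotier /var/lib/inference /var/log/xerotier
```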

Pick the one deployment that matches your accelerator. All three options use the published container image from cloudnull/xerotier-public. On an Apple Silicon Mac, stop here and use the native application instead; see XIM on macOS.

Option A: NVIDIA CUDA

bash

docker compose -f compose.agent-nvidia.yaml up -d

Requires nvidia-container-toolkit and a driver compatible with the bundled CUDA runtime (CUDA 12.x).

Option B: AMD ROCm

bash

docker compose -f compose.agent-amd-rocm.yaml up -d

Requires the ROCm runtime visible at /dev/kfd and /dev/dri/*. The image targets gfx906+ and includes the xerotier-vllm wrapper.

Option C: CPU (AMD EPYC + ZenDNN)

bash

docker compose -f compose.agent-amd-cpu-zendnn.yaml up -d

For development and testing only. KV cache offload flags do not apply on CPU backends; vLLM serves inference directly from host RAM.
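
If you script the choice among the three options, one possible approach is to look at the device nodes the host exposes. The helper below is hypothetical and deliberately simple: it only tests for /dev/nvidia* and for /dev/kfd with /dev/dri, it does not check compute capability or the gfx target, and its fallback assumes the host is an AMD EPYC machine suited to the ZenDNN image.

```shell
# Print the compose file that matches the device nodes found under a
# /dev-like root. The root is a parameter only so the logic can be tested
# against a scratch directory; it defaults to /dev.
pick_compose_file() {
    local dev="${1:-/dev}" node
    # NVIDIA GPUs appear as /dev/nvidia0, /dev/nvidia1, and so on.
    for node in "$dev"/nvidia[0-9]*; do
        if [ -e "$node" ]; then
            echo compose.agent-nvidia.yaml
            return 0
        fi
    done
    # ROCm needs the kfd compute node plus the dri render nodes.
    if [ -e "$dev/kfd" ] && [ -d "$dev/dri" ]; then
        echo compose.agent-amd-rocm.yaml
    else
        echo compose.agent-amd-cpu-zendnn.yaml
    fi
}

# Usage: docker compose -f "$(pick_compose_file)" up -d
```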

Step 3: Configure the Environment

Compose file: the example commands in this step and the next use compose.agent-nvidia.yaml. If you chose ROCm or CPU in Step 2, substitute compose.agent-amd-rocm.yaml or compose.agent-amd-cpu-zendnn.yaml in every command below.

Create a .env file in the compose/ directory. Docker Compose reads it automatically for variable substitution, so the agent picks up the join key on the next up.

compose/.env

# Enrollment (first run only; remove after successful enrollment)
XEROTIER_AGENT_JOIN_KEY=eyJhbGciOi...

# KV cache CPU offloading (optional, NVIDIA CUDA only)
# Default: 25% of system RAM, clamped [4, 128] GiB. Set 0 to disable.
# Ignored on AMD ROCm and CPU backends.
XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB=

                
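The default described in the .env comments (25% of system RAM, clamped to between 4 and 128 GiB) can be worked out ahead of time to see what an empty XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB will resolve to. This is a sketch of that arithmetic only: the function names are hypothetical, and rounding down to whole GiB is an assumption, since the agent's exact rounding is not stated here.

```shell
# Default KV offload size, in GiB, for a host with the given total RAM in
# GiB: one quarter of RAM, clamped to the range [4, 128].
# Integer division (rounding down) is an assumption.
kv_offload_default_gib() {
    local size=$(( $1 / 4 ))
    if [ "$size" -lt 4 ]; then
        size=4
    elif [ "$size" -gt 128 ]; then
        size=128
    fi
    echo "$size"
}

# Total system RAM in whole GiB, read from /proc/meminfo (Linux only).
total_ram_gib() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Usage: kv_offload_default_gib "$(total_ram_gib)"
```

Under these assumptions a 64 GiB host resolves to 16 GiB, an 8 GiB host is raised to the 4 GiB floor, and a 1 TiB host is capped at 128 GiB.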

The agent does not need a router address or CURVE public key: both arrive in the enrollment response and are persisted under /var/lib/inference/ alongside the agent's own auto-generated CURVE keypair.

Treat this file as a secret: the join token is a bearer credential. Run chmod 600 .env; never commit the file to a repository and never copy it onto an operator workstation.
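
One way to avoid a window in which the file is world-readable is to create it under a restrictive umask rather than tightening it afterwards. The helpers below are a hypothetical sketch (stat -c assumes GNU coreutils), and they write only the join key line; add any other variables you need.

```shell
# Write the join key to an env file that is owner-only from creation.
# The trailing chmod covers the case where the file already existed with
# looser permissions, because redirection does not change an existing mode.
write_env_file() {
    local path="$1" join_key="$2"
    (
        umask 077
        printf 'XEROTIER_AGENT_JOIN_KEY=%s\n' "$join_key" > "$path"
    )
    chmod 600 "$path"
}

# Succeed only if the file mode is exactly 600.
is_private() {
    [ "$(stat -c '%a' "$1")" = "600" ]
}

# Usage, from the compose/ directory:
#   write_env_file .env "$JOIN_TOKEN" && is_private .env
```

If the compose/ directory lives inside a git checkout, git check-ignore -q .env exits 0 when the file is covered by an ignore rule.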

Recreate the container so the new env file takes effect:

bash

docker compose -f compose.agent-nvidia.yaml up -d --force-recreate

Step 4: Start the Agent

bash

docker compose -f compose.agent-nvidia.yaml up -d
docker compose -f compose.agent-nvidia.yaml logs -f agent

Expected log lines on success (paraphrased; the agent emits structured log records with these fields, and the exact wording may vary):

text

[info] consuming join key from env XEROTIER_AGENT_JOIN_KEY
[info] enrollment succeeded, worker_id=wkr_01HX..., region=us-east
[info] detected accelerator: nvidiaCUDA, gpu_count=1, vram=24576MiB
[info] KV offload enabled size_gib=16 accelerator=nvidiaCUDA
[info] vLLM process started pid=42
[info] vLLM engine ready, listening on /tmp/xerotier-engine.sock
[info] lease established, heartbeat every 10s

The join key is consumed; CURVE keys and the worker state file are written to /var/lib/inference/. Do not re-use the join key even if the enrollment seems to have failed.
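
Since the key is single-use and the .env comment says to remove it after enrollment, the cleanup can be tied to the success markers in the log. This is a sketch under assumptions: the two grep patterns come from the paraphrased sample log above and may not match your agent's exact wording, the function names are hypothetical, and sed -i assumes GNU sed.

```shell
# Succeed if the log text on stdin contains both success markers.
# The patterns are assumptions based on the paraphrased sample log.
enrollment_confirmed() {
    local log
    log="$(cat)"
    printf '%s\n' "$log" | grep -q 'enrollment succeeded' &&
        printf '%s\n' "$log" | grep -q 'lease established'
}

# Delete the join key line from an env file, leaving other lines untouched.
scrub_join_key() {
    sed -i '/^XEROTIER_AGENT_JOIN_KEY=/d' "$1"
}

# Usage, from the compose/ directory:
#   docker compose -f compose.agent-nvidia.yaml logs agent \
#       | enrollment_confirmed && scrub_join_key .env
```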

If something stalls here, see Troubleshooting for enrollment rejected, no accelerator detected, and vLLM startup timed out: the three failures that occur most often at this step.

Step 5: Load a First Model

Models are project-scoped resources. Importing a new model from a registry (HuggingFace and similar) is a one-time setup performed in the dashboard; once a model exists in the project, an endpoint binds it to a service tier and schedules it onto an enrolled XIM.

5a: Import the model (one-time)

Open the Models page in the dashboard, click Add Model, paste the registry path (meta-llama/Llama-3.1-8B-Instruct), and confirm. The Frontend records the model in the project catalog and assigns it a UUID.

To list project models from the CLI:

bash

xeroctl models

The first column of the output is the model UUID; copy it into --model-id below.
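
To pick the UUID out without copying by hand, the first column can be extracted with awk. The sketch assumes the listing is plain whitespace-separated text with one model per row and the registry path somewhere on that row; the actual output format may differ, and the function name is hypothetical.

```shell
# Read a model listing on stdin and print the first column of the first
# row that mentions the given registry path.
model_uuid_for() {
    awk -v path="$1" 'index($0, path) { print $1; exit }'
}

# Usage:
#   MODEL_ID="$(xeroctl models | model_uuid_for meta-llama/Llama-3.1-8B-Instruct)"
```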

5b: Create an endpoint that binds the model

Either create the endpoint in the Endpoints dashboard, or from the CLI:

bash

xeroctl endpoints create \
    --name "Llama 3.1 8B" \
    --slug llama-31-8b \
    --model-id <model-uuid> \
    --tier-id gpu \
    --task-mode generate

5c: Provision the endpoint to a XIM

Provisioning dispatches the model to the matching XIM. The router streams weights to the agent and waits for the vLLM engine to report ready.

bash

xeroctl endpoints provision <endpoint-uuid>
xeroctl endpoints list

When the endpoint status reaches active, the model is live and routable.
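
Streaming weights can take a while, so rather than re-running the list command by hand you may prefer to poll. The following is a generic retry helper plus a hypothetical predicate; the predicate assumes the list output is whitespace-separated text with the slug and a status field of exactly active on the same row, which is a guess about the format rather than documented behavior.

```shell
# Re-run a command until it succeeds or the timeout elapses.
# $1 = timeout in seconds, $2 = seconds between attempts, rest = command.
wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep "$interval"
    done
}

# Hypothetical predicate: succeed once a row mentioning the slug has a
# field equal to "active". Comparing whole fields avoids matching a status
# such as "inactive".
endpoint_active() {
    xeroctl endpoints list | awk -v slug="$1" '
        index($0, slug) { for (i = 1; i <= NF; i++) if ($i == "active") found = 1 }
        END { exit !found }
    '
}

# Usage: wait up to 30 minutes, checking every 15 seconds.
#   wait_until 1800 15 endpoint_active llama-31-8b
```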

Step 6: Run a Chat Completion

Hit the OpenAI-compatible API on the router:

bash

curl https://router.example.com/v1/chat/completions \
    -H "Authorization: Bearer $XEROTIER_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Say hi in one word."}
        ]
    }'

Expected response:

json

{
    "id": "chatcmpl-01HX...",
    "object": "chat.completion",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hi."},
        "finish_reason": "stop"
    }],
    "usage": {"prompt_tokens": 14, "completion_tokens": 2, "total_tokens": 16}
}

                
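In a script you usually want only the reply text, plus a check that generation ended normally. The sketch below uses jq, which is an extra dependency not mentioned elsewhere in this guide; the function names are hypothetical, and the field paths follow the response shape shown above.

```shell
# Print the assistant's reply text from a chat completion response on stdin.
reply_text() {
    jq -r '.choices[0].message.content'
}

# Succeed if the first choice finished with reason "stop".
# jq -e sets the exit status from the last value it outputs.
completed_normally() {
    jq -e '.choices[0].finish_reason == "stop"' >/dev/null
}

# Usage: append "| reply_text" to the curl command above, with curl's -sS
# flags added so progress output does not reach the terminal.
```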

Congratulations, your first XIM is serving model inference through the router. Next, tune the inference stack for your workload: XIM Advanced Configuration.

Troubleshooting

See XIM Advanced Configuration for the full operational guide. Common first-deploy issues:

Enrollment rejected: the enrollment response carries a non-zero error code and message. Typical causes are an expired or already-consumed join token, or a region/tier mismatch. Mint a new join key and retry.
No accelerator detected: the container cannot see the GPU. Verify nvidia-container-toolkit is installed and the compose file exposes /dev/nvidia*, or for ROCm verify /dev/kfd and /dev/dri/* are mounted into the container.
vLLM startup timed out: the agent log emits vLLM startup timed out when the engine fails to report ready within its inactivity grace period or its absolute ceiling. Usually the model is too large for available VRAM: reduce --gpu-memory-utilization, raise XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB (NVIDIA only), or switch to a smaller or quantized model.
Model pull failure: the agent could not stream weights from the router or upstream registry. Confirm outbound HTTPS works from the host, check that the model cache path (~/.cache/xerotier/models by default) has free space, and inspect the router log for the upstream HTTP error.
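
A first pass over the agent log can be automated by scanning for the failure names listed above. This is only a sketch: vLLM startup timed out is the one string this guide says the agent emits, while the other two patterns are guesses at the log wording and may need adjusting. The function name is hypothetical.

```shell
# Read agent log text on stdin and print the name of each troubleshooting
# entry whose pattern appears, in the order the entries are listed.
# Matching is case-insensitive. Only the vLLM pattern is quoted from the
# guide; the other two are assumed wordings.
triage_log() {
    awk '
        { line = tolower($0) }
        line ~ /enrollment.*rejected/    { enroll = 1 }
        line ~ /no accelerator detected/ { accel = 1 }
        line ~ /vllm startup timed out/  { vllm = 1 }
        END {
            if (enroll) print "Enrollment rejected"
            if (accel)  print "No accelerator detected"
            if (vllm)   print "vLLM startup timed out"
        }
    '
}

# Usage: docker compose -f compose.agent-nvidia.yaml logs agent | triage_log
```

No output means none of the three patterns matched, which does not by itself prove the agent is healthy.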