Xerotier Inference Microservice (XIM)
Deploy a XIM node on your own infrastructure using Docker containers with NVIDIA or AMD GPU support. XIM nodes give you full control over your inference hardware while leveraging Xerotier.ai's routing and management capabilities.
Prerequisites
Before deploying a XIM node, ensure your infrastructure meets the following requirements.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA A30/H100+ or RTX 3090+ |
| System RAM | 32GB | 64GB+ |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| Network | 100 Mbps | 1 Gbps+ |
Software Requirements
| Software | Version |
|---|---|
| Docker | 24.0+ |
| Docker Compose | 2.20+ |
| NVIDIA Driver | 535+ |
| NVIDIA Container Toolkit | 1.14+ |
NVIDIA Container Toolkit Installation
Install the NVIDIA Container Toolkit to enable GPU access in Docker containers:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU Access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Container Setup
The Xerotier.ai backend agent is distributed as a Docker image that includes vLLM for model inference.
Container Details
| Property | Value |
|---|---|
| Image | xerotier/backend-agent:latest |
| Home Directory | /var/lib/inference |
| Model Cache | /var/lib/inference/.cache/xerotier/models |
| Config Directory | /var/lib/inference/.config/xerotier |
Pull the Image
docker pull xerotier/backend-agent:latest
Environment Variables
Configure the XIM node using environment variables. The following tables list the options needed for deployment.
Required Variables
| Variable | Description |
|---|---|
| XEROTIER_AGENT_JOIN_KEY | Join key for enrolling with the Xerotier router mesh. Obtain from the Agents dashboard. |
Agent Configuration
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MAX_CONCURRENT | 8 | Maximum concurrent inference requests |
| XEROTIER_AGENT_LOG_LEVEL | info | Log level: trace, debug, info, warning, error |
vLLM Configuration
| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.2-1B-Instruct | HuggingFace model ID or local path |
| XEROTIER_AGENT_MAX_MODEL_LEN | auto | Maximum sequence length (uses the model default if unset) |
| XEROTIER_AGENT_TENSOR_PARALLEL_SIZE | auto | Tensor parallel size for multi-GPU (auto-configured from visible devices) |
| XEROTIER_AGENT_GPU_MEMORY_UTILIZATION | 0.90 | GPU memory utilization (0.0-1.0) |
| SHM_SIZE | 8589934592 | Docker Compose shm_size setting (not an agent environment variable). Shared memory allocated to the container, in bytes (8GB default). |
| XEROTIER_AGENT_CUDA_VISIBLE_DEVICES | 0 | GPU devices to use (comma-separated indices) |
Cache Configuration
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_CACHE_PATH | /var/lib/inference/.cache/xerotier/models | Local model cache directory |
| XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB | 100 | Maximum cache size in gigabytes |
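Putting the variables above together, a first deployment can be driven from a single .env file. A minimal sketch (all values are illustrative placeholders drawn from the tables above; substitute your own join key):

```shell
# .env - example XIM node configuration (illustrative values)
XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
XEROTIER_AGENT_MAX_CONCURRENT=8
XEROTIER_AGENT_LOG_LEVEL=info
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB=100
```

Docker Compose reads a .env file from the working directory automatically, so these values feed the `${VAR:-default}` substitutions in the compose files shown later in this guide.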
GPU Configuration
Configure GPU access and memory allocation for optimal performance.
Single GPU Setup
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=1
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
Multi-GPU Setup
For models that require multiple GPUs (tensor parallelism):
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0,1
Specific GPU Selection
To use specific GPUs (e.g., GPUs 2 and 3 on a 4-GPU system):
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=2,3
Shared Memory Configuration
Larger models require more shared memory. Adjust shm_size (Docker Compose config) based on model size:
| Model Size | Recommended SHM_SIZE |
|---|---|
| 1-8B parameters | 8GB (8589934592) |
| 13-34B parameters | 16GB (17179869184) |
| 70B+ parameters | 32GB (34359738368) |
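The byte values in the table are plain GiB-to-bytes conversions. A small shell helper (illustrative, not part of the agent) reproduces them for other sizes:

```shell
# Convert a GiB count to the byte value expected by shm_size
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8    # 8589934592
gib_to_bytes 16   # 17179869184
gib_to_bytes 32   # 34359738368
```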
Model Storage
Configure persistent storage for downloaded models to avoid re-downloading on container restart.
Host Directory Setup
Create directories on the host for persistent model and configuration storage:
sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
Docker Compose with Volume Mounts
# SPDX-License-Identifier: MIT
# Xerotier Agent - NVIDIA GPU Stack (with LMCache + Valkey)
#
# Deploys a XIM GPU node with Valkey-backed LMCache for
# KV cache sharing across multiple XIM nodes on the same host.
#
# QUICK START:
# 1. Get a join key from your Xerotier dashboard:
# Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
# 2. Create host directories with correct permissions:
# sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
# sudo chown -R 5152:5152 /data/xerotier
#
# 3. Set environment variables or create a .env file:
# export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
# 4. Start the agent:
# docker compose -f docker-compose.agent-nvidia.yaml up -d
#
# ENROLLMENT WORKFLOW:
# - On first start, the agent enrolls using your join key
# - Enrollment state is persisted to /data/xerotier/config
# - On subsequent restarts, the agent reconnects automatically
# - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
# XEROTIER_AGENT_JOIN_KEY [REQUIRED] Join key from Xerotier dashboard (first run only)
# XEROTIER_AGENT_MAX_CONCURRENT Optional ceiling for concurrent requests (auto-configured when not set)
# XEROTIER_AGENT_TENSOR_PARALLEL_SIZE Tensor parallel size for multi-GPU (default: 1)
# XEROTIER_AGENT_GPU_MEMORY_UTILIZATION GPU memory utilization fraction (default: 0.90)
# XEROTIER_AGENT_MAX_MODEL_LEN Maximum sequence length (default: model default)
# XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB Local model cache size in GB (default: 100)
# XEROTIER_AGENT_LOG_LEVEL Logging level: trace, debug, info, warning, error (default: info)
# SHM_SIZE Shared memory size in bytes (default: 8589934592 = 8GB)
# DOCKER_GPU_COUNT Number of GPUs to reserve (default: 1)
#
# KERNEL SOCKET BUFFER TUNING (optional):
# The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
# Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
# which silently caps the requested buffer size. The agent will attempt to
# raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
# If the container is not privileged, set these on the host before starting:
# sudo sysctl -w net.core.wmem_max=4194304
# sudo sysctl -w net.core.rmem_max=4194304
#
# To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
# net.core.wmem_max = 4194304
# net.core.rmem_max = 4194304
services:
agent:
image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-cu:${VERSION:-latest}
container_name: xim-vllm-cu
network_mode: host
ipc: host
privileged: true
    shm_size: ${SHM_SIZE:-8589934592}
deploy:
resources:
reservations:
devices:
            - driver: nvidia
              count: ${DOCKER_GPU_COUNT:-1}
              capabilities: [gpu]
volumes:
# Persistent model cache
- /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
# Persistent enrollment state
- /data/xerotier/config:/var/lib/inference/.config/xerotier
# Caching
- /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: "${XEROTIER_AGENT_JOIN_KEY:-}"
      # Model is assigned from the dashboard during enrollment
      XEROTIER_AGENT_MAX_MODEL_LEN: "${XEROTIER_AGENT_MAX_MODEL_LEN:-}"
      XEROTIER_AGENT_TENSOR_PARALLEL_SIZE: "${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}"
      XEROTIER_AGENT_GPU_MEMORY_UTILIZATION: "${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}"
      XEROTIER_AGENT_MAX_CONCURRENT: "${XEROTIER_AGENT_MAX_CONCURRENT:-}"
      XEROTIER_AGENT_LOG_LEVEL: "${XEROTIER_AGENT_LOG_LEVEL:-info}"
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: "true"
      XEROTIER_AGENT_LMCACHE_REDIS_URL: "${XEROTIER_AGENT_LMCACHE_REDIS_URL:-}"
restart: unless-stopped
LMCache Setup
LMCache provides multi-tiered KV cache sharing for vLLM, reducing Time-to-First-Token (TTFT) for repeated prompt prefixes. The XIM node natively manages LMCache configuration.
Benefits
- Reduced Time-to-First-Token: Cache hits can reduce TTFT by 50-90% for repeated prompt prefixes
- Multi-Tier Caching: Three cache tiers with different speed/capacity tradeoffs
- Horizontal Scalability: Multiple XIM nodes can share a remote Redis/Valkey cache
- Graceful Degradation: XIM node continues without cache if initialization fails
Cache Tiers
| Tier | Speed | Default Size | Use Case |
|---|---|---|---|
| CPU Memory | ~100 GB/s | 10% of system RAM | Hot cache, frequently accessed prefixes |
| Local Disk | ~5 GB/s (NVMe) | 10% of partition | Warm cache, persistent across restarts |
| Remote Redis | ~1 GB/s (network) | Valkey maxmemory | Shared cache across multiple XIM nodes |
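To use the remote tier, every XIM node must point at the same Valkey instance. A sketch, assuming the standard redis:// URL scheme (the host address is a placeholder):

```shell
# Set on every XIM node that should share the cache
# (10.0.0.5:6379 is an illustrative Valkey address)
XEROTIER_AGENT_LMCACHE_ENABLED=true
XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://10.0.0.5:6379/0
```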
Quick Start (Local Only)
Enable LMCache with local CPU and disk caching only (no Redis required):
# Enable LMCache with auto-calculated sizes
XEROTIER_AGENT_LMCACHE_ENABLED=true
Tip: When LMCache is enabled without explicit size overrides, the XIM node auto-calculates optimal values (10% of system resources). This works well for most deployments.
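As a rough illustration of that 10% heuristic (the agent's exact sizing logic may differ), a hypothetical helper:

```shell
# Illustrative only: 10% of a resource size in GB, rounded down
auto_cache_gb() {
  echo $(( $1 / 10 ))
}

auto_cache_gb 64    # 6  -> CPU cache on a host with 64GB RAM
auto_cache_gb 500   # 50 -> disk cache on a 500GB partition
```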
AMD ROCm GPU Setup with Valkey
For AMD ROCm GPU deployments with Valkey for shared KV cache:
# SPDX-License-Identifier: MIT
# Xerotier Agent - AMD GPU ROCm Stack (with LMCache + Valkey)
#
# Deploys a XIM AMD GPU node using ROCm with Valkey-backed
# LMCache for KV cache sharing across multiple XIM nodes on the same host.
#
# QUICK START:
# 1. Get a join key from your Xerotier dashboard:
# Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
# 2. Create host directories with correct permissions:
# sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
# sudo chown -R 5152:5152 /data/xerotier
#
# 3. Set environment variables or create a .env file:
# export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
# 4. Start the agent:
# docker compose -f docker-compose.agent-amd-rocm.yaml up -d
#
# PREREQUISITES:
# - AMD GPU with ROCm support (MI210, MI250, MI300, RX 7900, etc.)
# - ROCm driver installed on the host
# - /dev/kfd and /dev/dri devices available
# - User in video and render groups on the host
#
# ENROLLMENT WORKFLOW:
# - On first start, the agent enrolls using your join key
# - Enrollment state is persisted to /data/xerotier/config
# - On subsequent restarts, the agent reconnects automatically
# - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
# XEROTIER_AGENT_JOIN_KEY [REQUIRED] Join key from Xerotier dashboard (first run only)
# HIP_VISIBLE_DEVICES GPU device indices to use (default: 0)
# XEROTIER_AGENT_MAX_CONCURRENT Optional ceiling for concurrent requests (auto-configured when not set)
# XEROTIER_AGENT_TENSOR_PARALLEL_SIZE Tensor parallel size for multi-GPU (default: 1)
# XEROTIER_AGENT_GPU_MEMORY_UTILIZATION GPU memory utilization fraction (default: 0.90)
# XEROTIER_AGENT_MAX_MODEL_LEN Maximum sequence length (default: model default)
# XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB Local model cache size in GB (default: 100)
# XEROTIER_AGENT_LOG_LEVEL Logging level: trace, debug, info, warning, error (default: info)
# SHM_SIZE Shared memory size in bytes (default: 8589934592 = 8GB)
#
# KERNEL SOCKET BUFFER TUNING (optional):
# The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
# Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
# which silently caps the requested buffer size. The agent will attempt to
# raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
# If the container is not privileged, set these on the host before starting:
# sudo sysctl -w net.core.wmem_max=4194304
# sudo sysctl -w net.core.rmem_max=4194304
#
# To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
# net.core.wmem_max = 4194304
# net.core.rmem_max = 4194304
services:
agent:
image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-rocm:${VERSION:-latest}
container_name: xim-vllm-rocm
network_mode: host
ipc: host
privileged: true
    shm_size: ${SHM_SIZE:-8589934592}
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
security_opt:
- seccomp=unconfined
group_add:
- video
- render
volumes:
# Persistent model cache
- /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
# Persistent enrollment state
- /data/xerotier/config:/var/lib/inference/.config/xerotier
# Caching
- /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
environment:
# Agent Enrollment [REQUIRED for first run]
XEROTIER_AGENT_JOIN_KEY: "${XEROTIER_AGENT_JOIN_KEY:-}"
# ROCm GPU Configuration
HIP_VISIBLE_DEVICES: "${HIP_VISIBLE_DEVICES:-0}"
# Model is assigned from the dashboard during enrollment
XEROTIER_AGENT_MAX_MODEL_LEN: "${XEROTIER_AGENT_MAX_MODEL_LEN:-}"
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE: "${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}"
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION: "${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}"
XEROTIER_AGENT_MAX_CONCURRENT: "${XEROTIER_AGENT_MAX_CONCURRENT:-}"
XEROTIER_AGENT_LOG_LEVEL: "${XEROTIER_AGENT_LOG_LEVEL:-info}"
# LMCache Configuration
XEROTIER_AGENT_LMCACHE_ENABLED: "false"
XEROTIER_AGENT_LMCACHE_REDIS_URL: ""
restart: unless-stopped
Sizing Recommendations
| Deployment Size | CPU Cache | Disk Cache | Valkey Memory |
|---|---|---|---|
| Small (32GB RAM, single XIM node) | 2-4 GB | 20 GB | 4 GB |
| Medium (64GB RAM, 2-4 XIM nodes) | 4-8 GB per node | 50 GB per node | 8 GB |
| Large (128GB+ RAM, 4+ XIM nodes) | 8-16 GB per node | 100 GB per node | 16-32 GB |
Verification
After starting your XIM node, verify that it is running correctly.
Check Container Logs
docker compose logs -f agent
Verify LMCache
If LMCache is enabled, check the logs for successful initialization:
docker compose logs agent | grep -i lmcache
# Expected output:
# LMCache enabled
# Wrote LMCache configuration
# config_path=/var/lib/inference/.config/xerotier/lmcache_config.yaml