Xerotier Inference Microservice (XIM)

Deploy a XIM node on your own infrastructure using Docker containers with NVIDIA GPU support. XIM nodes give you full control over your inference hardware while leveraging Xerotier.ai's routing and management capabilities.

Prerequisites

Before deploying a XIM node, ensure your infrastructure meets the following requirements.

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA A30/H100+ or RTX 3090+ |
| System RAM | 32GB | 64GB+ |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| Network | 100 Mbps | 1 Gbps+ |

Software Requirements

| Software | Version |
|---|---|
| Docker | 24.0+ |
| Docker Compose | 2.20+ |
| NVIDIA Driver | 535+ |
| NVIDIA Container Toolkit | 1.14+ |
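
As a quick preflight, you can check which of these tools are already installed. This is a generic shell sketch, not part of the Xerotier tooling; it simply reports the version of each command or flags it as missing:

```bash
# Print the version of each required tool, or flag it as missing.
check_tool() {
  # $1 is the command name; any remaining arguments are its version flags.
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "$1: not found"
  fi
}

check_tool docker --version
check_tool docker compose version
check_tool nvidia-ctk --version
```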

NVIDIA Container Toolkit Installation

Install the NVIDIA Container Toolkit to enable GPU access in Docker containers:

Ubuntu/Debian

```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU Access

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Container Setup

The Xerotier.ai backend agent is distributed as a Docker image that includes vLLM for model inference.

Container Details

| Property | Value |
|---|---|
| Image | xerotier/backend-agent:latest |
| Home Directory | /var/lib/inference |
| Model Cache | /var/lib/inference/.cache/xerotier/models |
| Config Directory | /var/lib/inference/.config/xerotier |

Pull the Image

```bash
docker pull xerotier/backend-agent:latest
```

Environment Variables

Configure the XIM node using environment variables. The following tables list the required and optional settings.

Required Variables

| Variable | Description |
|---|---|
| XEROTIER_AGENT_JOIN_KEY | Join key for enrolling with the Xerotier router mesh. Obtain from the Agents dashboard. |
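
Because the join key is the only required variable, it can help to fail fast when it is missing before launching the container. The check below is a standard POSIX parameter-expansion idiom, not an agent feature, and the key value shown is a placeholder:

```bash
# Abort early if the join key is unset or empty. ":?" is POSIX parameter
# expansion, so this works in any sh-compatible shell.
require_join_key() {
  : "${XEROTIER_AGENT_JOIN_KEY:?set XEROTIER_AGENT_JOIN_KEY before starting the agent}"
}

# Placeholder key so the check passes here; substitute your real xjk_... key.
XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
require_join_key && echo "join key present"
```

Run this in the same shell that invokes docker compose so an empty key stops the deployment with a clear message instead of a failed enrollment.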

Agent Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MAX_CONCURRENT | 8 | Maximum concurrent inference requests |
| XEROTIER_AGENT_LOG_LEVEL | info | Log level: debug, info, warn, error |

vLLM Configuration

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.2-1B-Instruct | HuggingFace model ID or local path |
| XEROTIER_AGENT_MAX_MODEL_LEN | auto | Maximum sequence length (uses model default if unset) |
| XEROTIER_AGENT_TENSOR_PARALLEL_SIZE | auto | Tensor parallel size for multi-GPU (auto-configured from visible devices) |
| XEROTIER_AGENT_GPU_MEMORY_UTILIZATION | 0.90 | GPU memory utilization (0.0-1.0) |
| SHM_SIZE | 8589934592 | Docker Compose shm_size setting, not an environment variable. Shared memory allocated to the container, in bytes (8GB default). |
| XEROTIER_AGENT_CUDA_VISIBLE_DEVICES | 0 | Specific GPU devices to use (comma-separated) |

Cache Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_CACHE_PATH | /var/lib/inference/.cache/xerotier/models | Local model cache directory |
| XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB | 100 | Maximum cache size in gigabytes |
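
Putting the tables together, one way to capture a node's settings is a generated .env file. The values below are illustrative: only XEROTIER_AGENT_JOIN_KEY is required, and any other line can be omitted to accept the documented default.

```bash
# Write an example .env for docker compose. All values besides the join key
# are optional; omit a line to keep the documented default.
cat > .env <<'EOF'
XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
XEROTIER_AGENT_MAX_CONCURRENT=8
XEROTIER_AGENT_LOG_LEVEL=info
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB=100
EOF
grep -c '=' .env   # one KEY=VALUE pair per line
```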

GPU Configuration

Configure GPU access and memory allocation for optimal performance.

Single GPU Setup

.env

```env
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=1
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
```

Multi-GPU Setup

For models that require multiple GPUs (tensor parallelism):

.env

```env
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0,1
```

Specific GPU Selection

To use specific GPUs (e.g., GPUs 2 and 3 on a 4-GPU system):

.env

```env
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=2,3
```

Shared Memory Configuration

Larger models require more shared memory. Adjust shm_size (Docker Compose config) based on model size:

| Model Size | Recommended SHM_SIZE |
|---|---|
| 1-8B parameters | 8GB (8589934592) |
| 13-34B parameters | 16GB (17179869184) |
| 70B+ parameters | 32GB (34359738368) |
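
The byte values above are plain powers of two, so shell arithmetic reproduces them. This is handy when exporting SHM_SIZE for a size not listed in the table:

```bash
# Convert GiB to bytes for SHM_SIZE (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8    # 8589934592
gib_to_bytes 16   # 17179869184
gib_to_bytes 32   # 34359738368
export SHM_SIZE=$(gib_to_bytes 16)
```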

Model Storage

Configure persistent storage for downloaded models to avoid re-downloading on container restart.

Host Directory Setup

Create directories on the host for persistent model and configuration storage:

```bash
sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
# Grant ownership to the container user (UID/GID 5152)
sudo chown -R 5152:5152 /data/xerotier
```

Docker Compose with Volume Mounts

docker-compose.agent-nvidia.yaml
```yaml
# SPDX-License-Identifier: MIT
# Xerotier Agent - NVIDIA GPU Stack (with LMCache + Valkey)
#
# Deploys a XIM GPU node with Valkey-backed LMCache for
# KV cache sharing across multiple XIM nodes on the same host.
#
# QUICK START:
#   1. Get a join key from your Xerotier dashboard:
#      Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
#   2. Create host directories with correct permissions:
#      sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
#      sudo chown -R 5152:5152 /data/xerotier
#
#   3. Set environment variables or create a .env file:
#      export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
#   4. Start the agent:
#      docker compose -f docker-compose.agent-nvidia.yaml up -d
#
# ENROLLMENT WORKFLOW:
#   - On first start, the agent enrolls using your join key
#   - Enrollment state is persisted to /data/xerotier/config
#   - On subsequent restarts, the agent reconnects automatically
#   - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
#   XEROTIER_AGENT_JOIN_KEY                 [REQUIRED] Join key from Xerotier dashboard (first run only)
#   XEROTIER_AGENT_MAX_CONCURRENT           Optional ceiling for concurrent requests (auto-configured when not set)
#   XEROTIER_AGENT_TENSOR_PARALLEL_SIZE     Tensor parallel size for multi-GPU (default: 1)
#   XEROTIER_AGENT_GPU_MEMORY_UTILIZATION   GPU memory utilization fraction (default: 0.90)
#   XEROTIER_AGENT_MAX_MODEL_LEN            Maximum sequence length (default: model default)
#   XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB  Local model cache size in GB (default: 100)
#   XEROTIER_AGENT_LOG_LEVEL                Logging level: trace, debug, info, warning, error (default: info)
#   SHM_SIZE                                Shared memory size in bytes (default: 8589934592 = 8GB)
#   DOCKER_GPU_COUNT                        Number of GPUs to reserve (default: 1)
#
# KERNEL SOCKET BUFFER TUNING (optional):
#   The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
#   Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
#   which silently caps the requested buffer size. The agent will attempt to
#   raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
#   If the container is not privileged, set these on the host before starting:
#     sudo sysctl -w net.core.wmem_max=4194304
#     sudo sysctl -w net.core.rmem_max=4194304
#
#   To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
#     net.core.wmem_max = 4194304
#     net.core.rmem_max = 4194304

services:
  agent:
    image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-cu:${VERSION:-latest}
    container_name: xim-vllm-cu
    network_mode: host
    ipc: host
    privileged: true
    shm_size: ${SHM_SIZE:-90g}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    volumes:
      # Persistent model cache
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      # Persistent enrollment state
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      # Caching
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: "${XEROTIER_AGENT_JOIN_KEY:-}"
      # Model is assigned from the dashboard during enrollment
      XEROTIER_AGENT_LOG_LEVEL: "${XEROTIER_AGENT_LOG_LEVEL:-info}"
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: true
      XEROTIER_AGENT_LMCACHE_REDIS_URL: "${XEROTIER_AGENT_LMCACHE_REDIS_URL:-}"
    restart: unless-stopped
```

LMCache Setup

LMCache provides multi-tiered KV cache sharing for vLLM, reducing Time-to-First-Token (TTFT) for repeated prompt prefixes. The XIM node natively manages LMCache configuration.

Benefits

  • Reduced Time-to-First-Token: Cache hits can reduce TTFT by 50-90% for repeated prompt prefixes
  • Multi-Tier Caching: Three cache tiers with different speed/capacity tradeoffs
  • Horizontal Scalability: Multiple XIM nodes can share a remote Redis/Valkey cache
  • Graceful Degradation: XIM node continues without cache if initialization fails

Cache Tiers

| Tier | Speed | Default Size | Use Case |
|---|---|---|---|
| CPU Memory | ~100 GB/s | 10% of system RAM | Hot cache, frequently accessed prefixes |
| Local Disk | ~5 GB/s (NVMe) | 10% of partition | Warm cache, persistent across restarts |
| Remote Redis | ~1 GB/s (network) | Valkey maxmemory | Shared cache across multiple XIM nodes |
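
The "10% of system RAM" default can be previewed on a Linux host. This sketch only mirrors the documented sizing rule by reading MemTotal from /proc/meminfo; it is not how the agent itself computes the value:

```bash
# Estimate the default CPU cache tier size: 10% of MemTotal (Linux only).
awk '/^MemTotal:/ { printf "cpu cache tier ~ %d MiB\n", ($2 * 0.10) / 1024 }' /proc/meminfo
```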

Quick Start (Local Only)

Enable LMCache with local CPU and disk caching only (no Redis required):

.env

```env
# Enable LMCache with auto-calculated sizes
XEROTIER_AGENT_LMCACHE_ENABLED=true
```

Tip: When LMCache is enabled without explicit size overrides, the XIM node auto-calculates optimal values (10% of system resources). This works well for most deployments.

AMD ROCm GPU Setup with Valkey

For AMD ROCm GPU deployments with Valkey for shared KV cache:

docker-compose.agent-amd-rocm.yaml
```yaml
# SPDX-License-Identifier: MIT
# Xerotier Agent - AMD GPU ROCm Stack (with LMCache + Valkey)
#
# Deploys a XIM AMD GPU node using ROCm with Valkey-backed
# LMCache for KV cache sharing across multiple XIM nodes on the same host.
#
# QUICK START:
#   1. Get a join key from your Xerotier dashboard:
#      Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
#   2. Create host directories with correct permissions:
#      sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
#      sudo chown -R 5152:5152 /data/xerotier
#
#   3. Set environment variables or create a .env file:
#      export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
#   4. Start the agent:
#      docker compose -f docker-compose.agent-amd-rocm.yaml up -d
#
# PREREQUISITES:
#   - AMD GPU with ROCm support (MI210, MI250, MI300, RX 7900, etc.)
#   - ROCm driver installed on the host
#   - /dev/kfd and /dev/dri devices available
#   - User in video and render groups on the host
#
# ENROLLMENT WORKFLOW:
#   - On first start, the agent enrolls using your join key
#   - Enrollment state is persisted to /data/xerotier/config
#   - On subsequent restarts, the agent reconnects automatically
#   - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
#   XEROTIER_AGENT_JOIN_KEY                 [REQUIRED] Join key from Xerotier dashboard (first run only)
#   HIP_VISIBLE_DEVICES                     GPU device indices to use (default: 0)
#   XEROTIER_AGENT_MAX_CONCURRENT           Optional ceiling for concurrent requests (auto-configured when not set)
#   XEROTIER_AGENT_TENSOR_PARALLEL_SIZE     Tensor parallel size for multi-GPU (default: 1)
#   XEROTIER_AGENT_GPU_MEMORY_UTILIZATION   GPU memory utilization fraction (default: 0.90)
#   XEROTIER_AGENT_MAX_MODEL_LEN            Maximum sequence length (default: model default)
#   XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB  Local model cache size in GB (default: 100)
#   XEROTIER_AGENT_LOG_LEVEL                Logging level: trace, debug, info, warning, error (default: info)
#   SHM_SIZE                                Shared memory size in bytes (default: 8589934592 = 8GB)
#
# KERNEL SOCKET BUFFER TUNING (optional):
#   The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
#   Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
#   which silently caps the requested buffer size. The agent will attempt to
#   raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
#   If the container is not privileged, set these on the host before starting:
#     sudo sysctl -w net.core.wmem_max=4194304
#     sudo sysctl -w net.core.rmem_max=4194304
#
#   To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
#     net.core.wmem_max = 4194304
#     net.core.rmem_max = 4194304

services:
  agent:
    image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-rocm:${VERSION:-latest}
    container_name: xim-vllm-rocm
    network_mode: host
    ipc: host
    privileged: true
    shm_size: ${SHM_SIZE:-90g}
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    security_opt:
      - seccomp=unconfined
    group_add:
      - video
      - render
    volumes:
      # Persistent model cache
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      # Persistent enrollment state
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      # Caching
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: "${XEROTIER_AGENT_JOIN_KEY:-}"
      # ROCm GPU Configuration
      HIP_VISIBLE_DEVICES: "${HIP_VISIBLE_DEVICES:-0}"
      # Model is assigned from the dashboard during enrollment
      XEROTIER_AGENT_MAX_MODEL_LEN: "${XEROTIER_AGENT_MAX_MODEL_LEN:-}"
      XEROTIER_AGENT_TENSOR_PARALLEL_SIZE: "${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}"
      XEROTIER_AGENT_GPU_MEMORY_UTILIZATION: "${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}"
      XEROTIER_AGENT_MAX_CONCURRENT: "${XEROTIER_AGENT_MAX_CONCURRENT:-}"
      XEROTIER_AGENT_LOG_LEVEL: "${XEROTIER_AGENT_LOG_LEVEL:-info}"
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: "false"
      XEROTIER_AGENT_LMCACHE_REDIS_URL: ""
    restart: unless-stopped
```

Sizing Recommendations

| Deployment Size | CPU Cache | Disk Cache | Valkey Memory |
|---|---|---|---|
| Small (32GB RAM, single XIM node) | 2-4 GB | 20 GB | 4 GB |
| Medium (64GB RAM, 2-4 XIM nodes) | 4-8 GB per node | 50 GB per node | 8 GB |
| Large (128GB+ RAM, 4+ XIM nodes) | 8-16 GB per node | 100 GB per node | 16-32 GB |
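
If you operate the Valkey instance yourself, the Valkey Memory column maps onto standard Redis-family configuration directives. A minimal sketch for a medium deployment follows; the file name and the eviction policy are choices on your part, not Xerotier requirements:

```conf
# valkey.conf (illustrative): cap cache memory at the sizing-table value
# and evict least-recently-used keys once the cap is reached.
maxmemory 8gb
maxmemory-policy allkeys-lru
```

An LRU eviction policy suits a prefix cache, since stale prompt prefixes age out naturally as new ones are written.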

Verification

After starting your XIM node, verify that it is running correctly.

Check Container Logs

```bash
docker compose -f docker-compose.agent-nvidia.yaml logs -f agent
```

Verify LMCache

If LMCache is enabled, check the logs for successful initialization:

```bash
docker compose -f docker-compose.agent-nvidia.yaml logs agent | grep -i lmcache
# Expected output:
#   LMCache enabled
#   Wrote LMCache configuration
#   config_path=/var/lib/inference/.config/xerotier/lmcache_config.yaml
```