Self-Hosted Agent Deployment

Deploy a self-hosted Xerotier.ai agent on your own infrastructure using Docker containers with NVIDIA GPU support. Self-hosted agents give you full control over your inference hardware while leveraging Xerotier.ai routing and management capabilities.

Prerequisites

Before deploying a self-hosted agent, ensure your infrastructure meets the following requirements.

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA A30/H100+ or RTX 3090+ |
| System RAM | 32GB | 64GB+ |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| Network | 100 Mbps | 1 Gbps+ |

Software Requirements

| Software | Version |
|---|---|
| Docker | 24.0+ |
| Docker Compose | 2.20+ |
| NVIDIA Driver | 535+ |
| NVIDIA Container Toolkit | 1.14+ |
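
To confirm that your host meets these versions, commands along the following lines should work on most systems (the Container Toolkit check only succeeds after the installation step below):

```bash
docker --version                # Docker Engine version
docker-compose version          # or: docker compose version, depending on how Compose is installed
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver version
nvidia-ctk --version            # NVIDIA Container Toolkit (after installation below)
```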

NVIDIA Container Toolkit Installation

Install the NVIDIA Container Toolkit to enable GPU access in Docker containers:

Ubuntu/Debian
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU Access

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Container Setup

The Xerotier.ai backend agent is distributed as a Docker image that includes vLLM for model inference.

Container Details

| Property | Value |
|---|---|
| Image | xerotier/backend-agent:latest |
| User | inference (UID:GID 5152:5152) |
| Home Directory | /var/lib/inference |
| Model Cache | /var/lib/inference/.cache/xerotier/models |
| LMCache Directory | /var/lib/inference/.cache/lmcache |
| Config Directory | /var/lib/inference/.config/xerotier |

Pull the Image

docker pull xerotier/backend-agent:latest

Environment Variables

Configure the agent using environment variables. The following tables list all available options.

Required Variables

| Variable | Description |
|---|---|
| XEROTIER_AGENT_JOIN_KEY | Join key for enrolling with the Xerotier router mesh. Obtain from the Agents dashboard. |

Agent Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MAX_CONCURRENT | 8 | Maximum concurrent inference requests |
| XEROTIER_AGENT_LOG_LEVEL | info | Log level: debug, info, warn, error |
| XEROTIER_AGENT_HEARTBEAT_MS | 5000 | Heartbeat interval in milliseconds |

vLLM Configuration

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.2-1B-Instruct | HuggingFace model ID or local path |
| XEROTIER_AGENT_MAX_MODEL_LEN | auto | Maximum sequence length (uses model default if unset) |
| XEROTIER_AGENT_TENSOR_PARALLEL_SIZE | auto | Tensor parallel size for multi-GPU (auto-configured from visible devices) |
| XEROTIER_AGENT_GPU_MEMORY_UTILIZATION | 0.90 | GPU memory utilization (0.0-1.0) |
| SHM_SIZE | 8589934592 | Docker Compose shm_size setting (not an environment variable). Controls shared memory allocation for the container in bytes (8GB default). |
| XEROTIER_AGENT_CUDA_VISIBLE_DEVICES | 0 | Specific GPU devices to use (comma-separated) |

Cache Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_CACHE_PATH | /var/lib/inference/.cache/xerotier/models | Local model cache directory |
| XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB | 100 | Maximum cache size in gigabytes |
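
Putting these together, a minimal .env for a single-GPU agent might look like the following sketch. The join key is a placeholder; the remaining values simply restate the defaults from the tables above:

```bash
# .env - illustrative values only
XEROTIER_AGENT_JOIN_KEY=<your-join-key>       # required; from the Agents dashboard
XEROTIER_AGENT_MAX_CONCURRENT=8
XEROTIER_AGENT_LOG_LEVEL=info
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB=100
```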

LMCache Configuration

LMCache provides multi-tiered KV cache sharing for reduced Time-to-First-Token (TTFT). The agent natively manages LMCache configuration and passes it to vLLM at startup.

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LMCACHE_ENABLED | true | Enable LMCache KV cache sharing |
| XEROTIER_AGENT_LMCACHE_REDIS_URL | - | Redis URL (redis://host:port or redis://:pass@host:port) |
| XEROTIER_AGENT_LMCACHE_MAX_CPU_MB | auto (10% of RAM) | Maximum CPU memory cache size in MB |
| XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB | auto (10% of disk) | Maximum disk cache size in GB |
| XEROTIER_AGENT_LMCACHE_DISK_PATH | /var/lib/inference/.cache/lmcache | Disk cache storage directory |

GPU Configuration

Configure GPU access and memory allocation for optimal performance.

Single GPU Setup

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=1
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
```

Multi-GPU Setup

For models that require multiple GPUs (tensor parallelism):

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0,1
```

Specific GPU Selection

To use specific GPUs (e.g., GPUs 2 and 3 on a 4-GPU system):

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=2,3
```

Shared Memory Configuration

Larger models require more shared memory. Adjust shm_size (Docker Compose config) based on model size:

| Model Size | Recommended SHM_SIZE |
|---|---|
| 1-8B parameters | 8GB (8589934592) |
| 13-34B parameters | 16GB (17179869184) |
| 70B+ parameters | 32GB (34359738368) |
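
For example, a 70B-class model would use the 32GB value. Setting it in your .env lets the Compose files below pass it through to shm_size:

```bash
# .env - shared memory for a 70B+ parameter model (32GB in bytes)
SHM_SIZE=34359738368
```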

Model Storage

Configure persistent storage for downloaded models to avoid re-downloading on container restart.

Host Directory Setup

Create directories on the host with correct permissions matching the container user (UID:GID 5152:5152):

```bash
sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
sudo chown -R 5152:5152 /data/xerotier
```

Important: The container runs as the inference user with UID:GID 5152:5152. Host directories must be owned by this user for the agent to read and write model files.

Docker Compose with Volume Mounts

docker-compose.yaml
```yaml
services:
  agent:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${DOCKER_GPU_COUNT:-1}
              capabilities: [gpu]
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
      - XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```

LMCache Setup

LMCache provides multi-tiered KV cache sharing for vLLM, dramatically reducing Time-to-First-Token (TTFT) for repeated prompt prefixes. The agent natively manages LMCache configuration.

Benefits

  • Reduced Time-to-First-Token: Cache hits can reduce TTFT by 50-90% for repeated prompt prefixes
  • Multi-Tier Caching: Three cache tiers with different speed/capacity tradeoffs
  • Horizontal Scalability: Multiple agents can share a remote Redis/Valkey cache
  • Graceful Degradation: Agent continues without cache if initialization fails

Cache Tiers

| Tier | Speed | Default Size | Use Case |
|---|---|---|---|
| CPU Memory | ~100 GB/s | 10% of system RAM | Hot cache, frequently accessed prefixes |
| Local Disk | ~5 GB/s (NVMe) | 10% of partition | Warm cache, persistent across restarts |
| Remote Redis | ~1 GB/s (network) | Valkey maxmemory | Shared cache across multiple agents |

Quick Start (Local Only)

Enable LMCache with local CPU and disk caching only (no Redis required):

.env
```bash
# Enable LMCache with auto-calculated sizes
XEROTIER_AGENT_LMCACHE_ENABLED=true

# Optional: Override auto-calculated sizes
# XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=4096    # 4GB CPU cache
# XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=50    # 50GB disk cache
```

Production Setup with Valkey

For multi-agent deployments, add Valkey for shared KV cache:

docker-compose.yaml
```yaml
services:
  valkey:
    image: valkey/valkey:8.0
    ports:
      - "6379:6379"
    command:
      - valkey-server
      - --requirepass
      - "${VALKEY_PASSWORD}"
      - --maxmemory
      - 8gb
      - --maxmemory-policy
      - allkeys-lru
      - --save
      - ""
      - --appendonly
      - "no"
    healthcheck:
      test: ["CMD", "valkey-cli", "-a", "${VALKEY_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  agent:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    depends_on:
      valkey:
        condition: service_healthy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${DOCKER_GPU_COUNT:-1}
              capabilities: [gpu]
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
      - XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
      # LMCache Configuration
      - XEROTIER_AGENT_LMCACHE_ENABLED=true
      - XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:${VALKEY_PASSWORD}@valkey:6379
      - XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=${XEROTIER_AGENT_LMCACHE_MAX_CPU_MB:-}
      - XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=${XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB:-}
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```

Sizing Recommendations

| Deployment Size | CPU Cache | Disk Cache | Valkey Memory |
|---|---|---|---|
| Small (32GB RAM, single agent) | 2-4 GB | 20 GB | 4 GB |
| Medium (64GB RAM, 2-4 agents) | 4-8 GB per agent | 50 GB per agent | 8 GB |
| Large (128GB+ RAM, 4+ agents) | 8-16 GB per agent | 100 GB per agent | 16-32 GB |

Tip: If you leave XEROTIER_AGENT_LMCACHE_MAX_CPU_MB and XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB unset, the agent auto-calculates optimal values (10% of system resources). This works well for most deployments.

Redis Authentication

For secured Redis/Valkey deployments, include the password in the URL:

```bash
# With password
XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:your-secret-password@valkey:6379

# Standard format
# redis://[:password@]host:port
```

Verifying LMCache

Check agent logs to verify LMCache initialization:

```bash
docker-compose logs agent | grep -i lmcache

# Expected output:
# LMCache enabled
# Wrote LMCache configuration
#   config_path=/var/lib/inference/.config/xerotier/lmcache_config.yaml
```

Monitoring Cache Performance

Monitor Valkey cache metrics using redis-cli:

```bash
# Connect to Valkey
docker-compose exec valkey valkey-cli

# Check memory usage
INFO memory

# Check cache hit rate
INFO stats
# Look for: keyspace_hits and keyspace_misses

# Monitor real-time operations
MONITOR
```

Multi-Tenant Note: If you serve multiple tenants, ensure XEROTIER_AGENT_VLLM_SALT_SECRET is configured on the agent. This generates per-tenant cache salts that isolate cache keyspaces, preventing cross-tenant data leakage.
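
One way to set this is to generate a random value and add it to the agent's environment. The openssl command below is just one option for producing a secret; any sufficiently random string should work:

```bash
# Generate a random secret (example approach)
openssl rand -hex 32

# .env
XEROTIER_AGENT_VLLM_SALT_SECRET=<generated-secret>
```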

AMD CPU Deployment (ZenDNN)

Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.

Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |

Memory Requirements by Model Size

CPU inference requires significantly more system RAM than an equivalent GPU deployment needs in VRAM:

| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |

Building the Docker Images

Building CPU-optimized vLLM with ZenDNN requires a multi-step process. For detailed instructions, see the AMD EPYC inference guide.

Step 1: Clone vLLM Repository

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0
```

Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.

Step 2: Build vLLM CPU Base Image

```bash
docker build -f docker/Dockerfile.cpu \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu:local \
  --target vllm-openai .
```

Step 3: Create ZenDNN Dockerfile

Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:

docker/Dockerfile.cpu-amd
```dockerfile
FROM vllm-cpu:local

RUN apt-get update -y && apt-get install -y --no-install-recommends \
        make cmake ccache git curl wget ca-certificates gcc-12 g++-12 \
        libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 \
        jq lsof libjemalloc2 gfortran && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
        --slave /usr/bin/g++ g++ /usr/bin/g++-12

RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
    cd ZenDNN-pytorch-plugin && \
    uv pip install -r requirements.txt && \
    CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
    uv pip install dist/*.whl

ENTRYPOINT ["vllm", "serve"]
```

Step 4: Build ZenDNN-Optimized Image

```bash
docker build -f docker/Dockerfile.cpu-amd \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu-zentorch:local .
```

Step 5: Build Xerotier.ai Agent Image

From the Xerotier.ai repository root, build the CPU agent:

```bash
docker build -f deploy/docker/Dockerfile.agent-amd-cpu \
  --tag xerotier/backend-agent-cpu:local .
```

CPU-Specific Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable ZenDNN optimization plugin |
| VLLM_CPU_KVCACHE_SPACE | - | KV cache size in GB (~75% of RAM) |
| VLLM_CPU_OMP_THREADS_BIND | - | CPU core binding range (e.g., 0-23) |
| VLLM_CPU_NUM_OF_RESERVED_CPU | 1 | CPUs reserved for OS operations |
| VLLM_CPU_OMP_NUM_THREADS | 16 | Number of OpenMP threads |

Docker Compose for CPU Agent

docker-compose-cpu.yaml
```yaml
services:
  agent:
    image: xerotier/backend-agent-cpu:local
    network_mode: host
    ipc: host
    privileged: true
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-5}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
      - VLLM_PLUGINS=zentorch
      - VLLM_CPU_KVCACHE_SPACE=${VLLM_CPU_KVCACHE_SPACE:-50}
      - VLLM_CPU_OMP_THREADS_BIND=${VLLM_CPU_OMP_THREADS_BIND:-0-23}
      - VLLM_CPU_NUM_OF_RESERVED_CPU=1
    shm_size: ${SHM_SIZE:-90g}
    restart: unless-stopped
```

Calculating Environment Values

Use these commands to calculate optimal values for your system:

```bash
# Calculate KV cache space (~75% of RAM minus overhead)
export VLLM_CPU_KVCACHE_SPACE="$(($(free -g | awk '/Mem/ {print $2}') * 75 / 100))"

# Calculate CPU core binding (all but one core)
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 2))"

# Calculate shared memory (total RAM minus 1GB buffer)
export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))m"

echo "KVCACHE_SPACE: ${VLLM_CPU_KVCACHE_SPACE}GB"
echo "CPU_THREADS_BIND: ${VLLM_CPU_OMP_THREADS_BIND}"
echo "SHM_SIZE: ${SHM_SIZE}"
```

Memory Tuning: Setting VLLM_CPU_KVCACHE_SPACE too high may cause out-of-memory errors. Start conservatively at 50-60% of available RAM and increase based on observed memory usage during inference.

Performance Considerations

  • Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
  • Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
  • Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
  • Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.

Troubleshooting

Common issues and their solutions when deploying self-hosted agents.

Agent Fails to Start

| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to the Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |

GPU Not Detected

```bash
# Verify NVIDIA driver
nvidia-smi

# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Check Docker runtime configuration
docker info | grep -i runtime
```

Model Loading Fails

| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify VLLM_MODEL is a valid HuggingFace model ID |

Permission Denied Errors

```bash
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier

# Verify permissions
ls -la /data/xerotier
```

Out of Memory (OOM)

  • Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower (see the example after this list)
  • Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
  • Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
  • Use a smaller model or add more GPUs
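
A conservative starting point that combines these adjustments might look like the following; the values are illustrative and should be tuned for your model and context length:

```bash
# .env - conservative settings to reduce OOM risk (illustrative values)
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_CONCURRENT=4
XEROTIER_AGENT_MAX_MODEL_LEN=8192
```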

LMCache Issues

| Symptom | Solution |
|---|---|
| LMCache not enabled (no logs) | Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in the environment |
| Redis connection failed | Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct |
| Config write failed | Ensure /var/lib/inference/.config/xerotier is writable |
| High disk usage | Set an explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit |
| No cache hits across agents | Verify all agents use the same XEROTIER_AGENT_LMCACHE_REDIS_URL |

Check LMCache status in logs:

```bash
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache

# Check the config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml

# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping
```

Common Commands

| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |

Frequently Asked Questions

How do I get a join key?

Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.

Can I run multiple models on one GPU?

The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
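
A sketch of one way to run one agent per GPU with Docker Compose, assuming a two-GPU host; whether each agent needs its own join key depends on your enrollment setup, and volumes and GPU reservations are omitted for brevity (see the full Compose example above):

```yaml
services:
  agent-gpu0:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY_0}
      - XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped

  agent-gpu1:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY_1}
      - XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=1
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```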

How do I update the agent?

Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.

What models are supported?

Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.

How much VRAM do I need?

As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
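
As a back-of-the-envelope check, FP16/BF16 weights take about 2 bytes per parameter; adding roughly 20% for runtime overhead reproduces the 7B figure (KV cache needs additional headroom on top of this):

```bash
# 7B parameters x 2 bytes/param x ~1.2 overhead factor, in GB
echo $(( 7 * 2 * 120 / 100 ))   # prints 16
```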

Can I use AMD GPUs?

AMD ROCm GPU support is planned for a future release. However, you can run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.

Why do I need to build my own Docker image for CPU inference?

CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.

What vLLM version should I use with ZenDNN?

Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.

Is my data secure?

Yes. Self-hosted agents only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.

What happens if my agent goes offline?

Requests are automatically routed to other available agents. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your agent reconnects.

Do I need LMCache?

LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.

Can I use LMCache without Valkey/Redis?

Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-agent deployments. Add Valkey when you need cache sharing across multiple agents.

What happens if LMCache fails to initialize?

The agent degrades gracefully: it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check the logs for initialization errors if you expected LMCache to be enabled.

How much memory should I allocate for LMCache?

The agent auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.
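
For Valkey-backed deployments, eviction and hit/miss counters can be read from INFO stats; assuming VALKEY_PASSWORD is exported in your shell, something like this works:

```bash
docker-compose exec valkey valkey-cli -a "$VALKEY_PASSWORD" INFO stats \
  | grep -E 'evicted_keys|keyspace_hits|keyspace_misses'
```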