Self-Hosted Agent Deployment
Deploy a self-hosted Xerotier.ai agent on your own infrastructure using Docker containers with NVIDIA GPU support. Self-hosted agents give you full control over your inference hardware while still using Xerotier.ai's routing and management capabilities.
Prerequisites
Before deploying a self-hosted agent, ensure your infrastructure meets the following requirements.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA A30/H100+ or RTX 3090+ |
| System RAM | 32GB | 64GB+ |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| Network | 100 Mbps | 1 Gbps+ |
Software Requirements
| Software | Version |
|---|---|
| Docker | 24.0+ |
| Docker Compose | 2.20+ |
| NVIDIA Driver | 535+ |
| NVIDIA Container Toolkit | 1.14+ |
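To confirm the installed versions against this table, the standard version checks are:
docker --version
docker compose version
# Reports the installed driver and GPU details
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv,noheader
# Available once the NVIDIA Container Toolkit (next section) is installed
nvidia-ctk --version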
NVIDIA Container Toolkit Installation
Install the NVIDIA Container Toolkit to enable GPU access in Docker containers:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU Access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Container Setup
The Xerotier.ai backend agent is distributed as a Docker image that includes vLLM for model inference.
Container Details
| Property | Value |
|---|---|
| Image | xerotier/backend-agent:latest |
| User | inference (UID:GID 5152:5152) |
| Home Directory | /var/lib/inference |
| Model Cache | /var/lib/inference/.cache/xerotier/models |
| LMCache Directory | /var/lib/inference/.cache/lmcache |
| Config Directory | /var/lib/inference/.config/xerotier |
Pull the Image
docker pull xerotier/backend-agent:latest
Environment Variables
Configure the agent using environment variables. The following tables list all available options.
Required Variables
| Variable | Description |
|---|---|
| XEROTIER_AGENT_JOIN_KEY | Join key for enrolling with the Xerotier router mesh. Obtain from the Agents dashboard. |
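As a minimal sketch, the join key can be passed straight to docker run; the key value here is a placeholder, and the Docker Compose files later in this guide are the recommended way to run the agent:
docker run -d --name xerotier-agent \
  --gpus all \
  --shm-size 8g \
  -e XEROTIER_AGENT_JOIN_KEY=<your-join-key> \
  xerotier/backend-agent:latest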
Agent Configuration
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MAX_CONCURRENT | 8 | Maximum concurrent inference requests |
| XEROTIER_AGENT_LOG_LEVEL | info | Log level: debug, info, warn, error |
| XEROTIER_AGENT_HEARTBEAT_MS | 5000 | Heartbeat interval in milliseconds |
vLLM Configuration
| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.2-1B-Instruct | HuggingFace model ID or local path |
| XEROTIER_AGENT_MAX_MODEL_LEN | auto | Maximum sequence length (uses model default if unset) |
| XEROTIER_AGENT_TENSOR_PARALLEL_SIZE | auto | Tensor parallel size for multi-GPU (auto-configured from visible devices) |
| XEROTIER_AGENT_GPU_MEMORY_UTILIZATION | 0.90 | GPU memory utilization (0.0-1.0) |
| SHM_SIZE | 8589934592 | Docker Compose shm_size setting (not an environment variable). Controls shared memory allocation for the container in bytes (8GB default). |
| XEROTIER_AGENT_CUDA_VISIBLE_DEVICES | 0 | Specific GPU devices to use (comma-separated) |
Cache Configuration
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_CACHE_PATH | /var/lib/inference/.cache/xerotier/models | Local model cache directory |
| XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB | 100 | Maximum cache size in gigabytes |
LMCache Configuration
LMCache provides multi-tiered KV cache sharing for reduced Time-to-First-Token (TTFT). The agent natively manages LMCache configuration and passes it to vLLM at startup.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LMCACHE_ENABLED | true | Enable LMCache KV cache sharing |
| XEROTIER_AGENT_LMCACHE_REDIS_URL | - | Redis URL (redis://host:port or redis://:pass@host:port) |
| XEROTIER_AGENT_LMCACHE_MAX_CPU_MB | auto (10% RAM) | Maximum CPU memory cache size in MB |
| XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB | auto (10% disk) | Maximum disk cache size in GB |
| XEROTIER_AGENT_LMCACHE_DISK_PATH | /var/lib/inference/.cache/lmcache | Disk cache storage directory |
GPU Configuration
Configure GPU access and memory allocation for optimal performance.
Single GPU Setup
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=1
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
Multi-GPU Setup
For models that require multiple GPUs (tensor parallelism):
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0,1
Specific GPU Selection
To use specific GPUs (e.g., GPUs 2 and 3 on a 4-GPU system):
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=2,3
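If you are unsure which index maps to which physical GPU, nvidia-smi can list the devices the driver sees:
# List GPUs with their indices
nvidia-smi -L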
Shared Memory Configuration
Larger models require more shared memory. Adjust shm_size (Docker Compose config) based on model size:
| Model Size | Recommended SHM_SIZE |
|---|---|
| 1-8B parameters | 8GB (8589934592) |
| 13-34B parameters | 16GB (17179869184) |
| 70B+ parameters | 32GB (34359738368) |
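Rather than copying the byte values, you can compute them with shell arithmetic; for example, exporting a 16GB shm_size for the Compose files below:
# 16GB expressed in bytes for the SHM_SIZE variable
export SHM_SIZE=$((16 * 1024 * 1024 * 1024))
echo "$SHM_SIZE"   # 17179869184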
Model Storage
Configure persistent storage for downloaded models to avoid re-downloading on container restart.
Host Directory Setup
Create directories on the host with correct permissions matching the container user (UID:GID 5152:5152):
sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
sudo chown -R 5152:5152 /data/xerotier
Important: The container runs as the inference user with UID:GID 5152:5152. Host directories must be owned by this user for the agent to read and write model files.
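One way to confirm ownership before starting the agent is to write a test file as the container's UID from a throwaway container (the test filename is arbitrary):
docker run --rm --user 5152:5152 \
  -v /data/xerotier/models:/var/lib/inference/.cache/xerotier/models \
  alpine sh -c 'touch /var/lib/inference/.cache/xerotier/models/.write-test && echo OK'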
Docker Compose with Volume Mounts
services:
agent:
image: xerotier/backend-agent:latest
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: ${DOCKER_GPU_COUNT:-1}
capabilities: [gpu]
volumes:
- /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
- /data/xerotier/config:/var/lib/inference/.config/xerotier
- /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
environment:
- XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
# Model is assigned from the dashboard during enrollment
- XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
- XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
- XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
- XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
- XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
shm_size: ${SHM_SIZE:-8589934592}
restart: unless-stopped
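With the Compose file saved as docker-compose.yml, a typical bring-up looks like this (the join key value is a placeholder):
# Provide the join key via an .env file next to docker-compose.yml
echo "XEROTIER_AGENT_JOIN_KEY=<your-join-key>" > .env
# Start the agent and follow its logs
docker-compose up -d
docker-compose logs -f agent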
LMCache Setup
LMCache provides multi-tiered KV cache sharing for vLLM, dramatically reducing Time-to-First-Token (TTFT) for repeated prompt prefixes. The agent natively manages LMCache configuration.
Benefits
- Reduced Time-to-First-Token: Cache hits can reduce TTFT by 50-90% for repeated prompt prefixes
- Multi-Tier Caching: Three cache tiers with different speed/capacity tradeoffs
- Horizontal Scalability: Multiple agents can share a remote Redis/Valkey cache
- Graceful Degradation: Agent continues without cache if initialization fails
Cache Tiers
| Tier | Speed | Default Size | Use Case |
|---|---|---|---|
| CPU Memory | ~100 GB/s | 10% of system RAM | Hot cache, frequently accessed prefixes |
| Local Disk | ~5 GB/s (NVMe) | 10% of partition | Warm cache, persistent across restarts |
| Remote Redis | ~1 GB/s (network) | Valkey maxmemory | Shared cache across multiple agents |
Quick Start (Local Only)
Enable LMCache with local CPU and disk caching only (no Redis required):
# Enable LMCache with auto-calculated sizes
XEROTIER_AGENT_LMCACHE_ENABLED=true
# Optional: Override auto-calculated sizes
# XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=4096 # 4GB CPU cache
# XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=50 # 50GB disk cache
Production Setup with Valkey
For multi-agent deployments, add Valkey for shared KV cache:
services:
valkey:
image: valkey/valkey:8.0
ports:
- "6379:6379"
command:
- valkey-server
- --requirepass
- "${VALKEY_PASSWORD}"
- --maxmemory
- 8gb
- --maxmemory-policy
- allkeys-lru
- --save
- ""
- --appendonly
- "no"
healthcheck:
test: ["CMD", "valkey-cli", "-a", "${VALKEY_PASSWORD}", "ping"]
interval: 10s
timeout: 5s
retries: 3
restart: unless-stopped
agent:
image: xerotier/backend-agent:latest
runtime: nvidia
depends_on:
valkey:
condition: service_healthy
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: ${DOCKER_GPU_COUNT:-1}
capabilities: [gpu]
volumes:
- /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
- /data/xerotier/config:/var/lib/inference/.config/xerotier
- /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
environment:
- XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
# Model is assigned from the dashboard during enrollment
- XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
- XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
- XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
- XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
- XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
# LMCache Configuration
- XEROTIER_AGENT_LMCACHE_ENABLED=true
- XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:${VALKEY_PASSWORD}@valkey:6379
- XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=${XEROTIER_AGENT_LMCACHE_MAX_CPU_MB:-}
- XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=${XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB:-}
shm_size: ${SHM_SIZE:-8589934592}
restart: unless-stopped
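The VALKEY_PASSWORD referenced above must be supplied via the environment or an .env file; a simple way to generate one, assuming openssl is available on the host:
# Generate a random password and store it alongside the join key in .env
echo "VALKEY_PASSWORD=$(openssl rand -hex 24)" >> .env
echo "XEROTIER_AGENT_JOIN_KEY=<your-join-key>" >> .env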
Sizing Recommendations
| Deployment Size | CPU Cache | Disk Cache | Valkey Memory |
|---|---|---|---|
| Small (32GB RAM, single agent) | 2-4 GB | 20 GB | 4 GB |
| Medium (64GB RAM, 2-4 agents) | 4-8 GB per agent | 50 GB per agent | 8 GB |
| Large (128GB+ RAM, 4+ agents) | 8-16 GB per agent | 100 GB per agent | 16-32 GB |
Tip: If you leave XEROTIER_AGENT_LMCACHE_MAX_CPU_MB and XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB unset, the agent auto-calculates optimal values (10% of system resources). This works well for most deployments.
Redis Authentication
For secured Redis/Valkey deployments, include the password in the URL:
# With password
XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:your-secret-password@valkey:6379
# Standard format
# redis://[:password@]host:port
Verifying LMCache
Check agent logs to verify LMCache initialization:
docker-compose logs agent | grep -i lmcache
# Expected output:
# LMCache enabled
# Wrote LMCache configuration
# config_path=/var/lib/inference/.config/xerotier/lmcache_config.yaml
Monitoring Cache Performance
Monitor Valkey cache metrics using redis-cli:
# Connect to Valkey
docker-compose exec valkey valkey-cli
# Check memory usage
INFO memory
# Check cache hit rate
INFO stats
# Look for: keyspace_hits and keyspace_misses
# Monitor real-time operations
MONITOR
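As a rough convenience, the hit rate can be computed directly from those counters; this one-liner assumes the Valkey service from the Compose example above and the password in VALKEY_PASSWORD:
docker-compose exec valkey valkey-cli -a "$VALKEY_PASSWORD" INFO stats \
  | awk -F: '/keyspace_hits/ {h=$2} /keyspace_misses/ {m=$2} END {if (h+m > 0) printf "hit rate: %.1f%%\n", 100*h/(h+m); else print "no lookups yet"}'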
Multi-Tenant Note: If you serve multiple tenants, ensure XEROTIER_AGENT_VLLM_SALT_SECRET is configured on the agent. This generates per-tenant cache salts that isolate cache keyspaces, preventing cross-tenant data leakage.
AMD CPU Deployment (ZenDNN)
Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.
Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |
Memory Requirements by Model Size
CPU inference requires significantly more system RAM than GPU VRAM:
| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |
Building the Docker Images
Building CPU-optimized vLLM with ZenDNN is a multi-step process. For detailed instructions, see the AMD EPYC inference guide.
Step 1: Clone vLLM Repository
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0
Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.
Step 2: Build vLLM CPU Base Image
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu:local \
--target vllm-openai .
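The build flags above assume the CPU exposes AVX-512 with BF16 and VNNI extensions; one way to verify this on a Linux host before building:
# Should print avx512f, avx512_bf16 and avx512_vnni on supported EPYC parts
grep -o -E 'avx512(f|_bf16|_vnni)' /proc/cpuinfo | sort -u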
Step 3: Create ZenDNN Dockerfile
Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:
FROM vllm-cpu:local
RUN apt-get update -y && apt-get install -y --no-install-recommends \
make cmake ccache git curl wget ca-certificates gcc-12 g++-12 \
libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 \
jq lsof libjemalloc2 gfortran && \
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
--slave /usr/bin/g++ g++ /usr/bin/g++-12
RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
cd ZenDNN-pytorch-plugin && uv pip install -r requirements.txt && \
CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
uv pip install dist/*.whl
ENTRYPOINT ["vllm", "serve"]
Step 4: Build ZenDNN-Optimized Image
docker build -f docker/Dockerfile.cpu-amd \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu-zentorch:local .
Step 5: Build Xerotier.ai Agent Image
From the Xerotier.ai repository root, build the CPU agent:
docker build -f deploy/docker/Dockerfile.agent-amd-cpu \
--tag xerotier/backend-agent-cpu:local .
CPU-Specific Environment Variables
| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable ZenDNN optimization plugin |
| VLLM_CPU_KVCACHE_SPACE | - | KV cache size in GB (~75% of RAM) |
| VLLM_CPU_OMP_THREADS_BIND | - | CPU core binding range (e.g., 0-23) |
| VLLM_CPU_NUM_OF_RESERVED_CPU | 1 | CPUs reserved for OS operations |
| VLLM_CPU_OMP_NUM_THREADS | 16 | Number of OpenMP threads |
Docker Compose for CPU Agent
services:
agent:
image: xerotier/backend-agent-cpu:local
network_mode: host
ipc: host
privileged: true
volumes:
- /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
- /data/xerotier/config:/var/lib/inference/.config/xerotier
environment:
- XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
# Model is assigned from the dashboard during enrollment
- XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
- XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-5}
- XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
- VLLM_PLUGINS=zentorch
- VLLM_CPU_KVCACHE_SPACE=${VLLM_CPU_KVCACHE_SPACE:-50}
- VLLM_CPU_OMP_THREADS_BIND=${VLLM_CPU_OMP_THREADS_BIND:-0-23}
- VLLM_CPU_NUM_OF_RESERVED_CPU=1
shm_size: ${SHM_SIZE:-90g}
restart: unless-stopped
Calculating Environment Values
Use these commands to calculate optimal values for your system:
# Calculate KV cache space (~75% of RAM minus overhead)
export VLLM_CPU_KVCACHE_SPACE="$(($(free -g | awk '/Mem/ {print $2}') * 75 / 100))"
# Calculate CPU core binding (all but one core)
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 2))"
# Calculate shared memory (total RAM minus 1GB buffer)
export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))m"
echo "KVCACHE_SPACE: ${VLLM_CPU_KVCACHE_SPACE}GB"
echo "CPU_THREADS_BIND: ${VLLM_CPU_OMP_THREADS_BIND}"
echo "SHM_SIZE: ${SHM_SIZE}"
Memory Tuning: Setting VLLM_CPU_KVCACHE_SPACE too high may cause out-of-memory errors. Start conservatively at 50-60% of available RAM and increase based on observed memory usage during inference.
Performance Considerations
- Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
- Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
- Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
- Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.
Troubleshooting
Common issues and their solutions when deploying self-hosted agents.
Agent Fails to Start
| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |
GPU Not Detected
# Verify NVIDIA driver
nvidia-smi
# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check Docker runtime configuration
docker info | grep -i runtime
Model Loading Fails
| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify VLLM_MODEL is a valid HuggingFace model ID |
Permission Denied Errors
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier
# Verify permissions
ls -la /data/xerotier
Out of Memory (OOM)
- Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower
- Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
- Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
- Use a smaller model or add more GPUs
LMCache Issues
| Symptom | Solution |
|---|---|
| LMCache not enabled (no logs) | Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in environment |
| Redis connection failed | Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct |
| Config write failed | Ensure /var/lib/inference/.config/xerotier is writable |
| High disk usage | Set explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit |
| No cache hits across agents | Verify all agents use same XEROTIER_AGENT_LMCACHE_REDIS_URL |
Check LMCache status in logs:
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache
# Check config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml
# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping
Common Commands
| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |
Frequently Asked Questions
How do I get a join key?
Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.
Can I run multiple models on one GPU?
The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
How do I update the agent?
Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.
What models are supported?
Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.
How much VRAM do I need?
As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
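These figures roughly follow from weight storage at 16-bit precision (about 2 bytes per parameter) before KV cache and activation overhead; a back-of-the-envelope check:
# Approximate VRAM needed for the weights alone at 2 bytes per parameter
PARAMS_B=7   # model size in billions of parameters
echo "~$(( PARAMS_B * 2 )) GB for weights, plus KV cache and activation headroom"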
Can I use AMD GPUs?
AMD ROCm GPU support is planned for a future release. However, you can run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.
Why do I need to build my own Docker image for CPU inference?
CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.
What vLLM version should I use with ZenDNN?
Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.
Is my data secure?
Yes. Self-hosted agents only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.
What happens if my agent goes offline?
Requests are automatically routed to other available agents. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your agent reconnects.
Do I need LMCache?
LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.
Can I use LMCache without Valkey/Redis?
Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-agent deployments. Add Valkey when you need cache sharing across multiple agents.
What happens if LMCache fails to initialize?
The agent degrades gracefully: it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check logs for initialization errors if you expected LMCache to be enabled.
How much memory should I allocate for LMCache?
The agent auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.