Self-Hosted Agent Deployment

Deploy a self-hosted Xerotier.ai agent on your own infrastructure using Docker containers with NVIDIA GPU support. Self-hosted agents give you full control over your inference hardware while leveraging Xerotier.ai routing and management capabilities.

Prerequisites

Before deploying a self-hosted agent, ensure your infrastructure meets the following requirements.

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA A30/H100+ or RTX 3090+ |
| System RAM | 32GB | 64GB+ |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| Network | 100 Mbps | 1 Gbps+ |

Software Requirements

| Software | Version |
|---|---|
| Docker | 24.0+ |
| Docker Compose | 2.20+ |
| NVIDIA Driver | 535+ |
| NVIDIA Container Toolkit | 1.14+ |
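
To confirm that your host meets these versions, commands along the following lines should work on most systems (the Container Toolkit check only succeeds after the installation step below):

```bash
docker --version                # Docker Engine version
docker-compose version          # or: docker compose version, depending on how Compose is installed
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver version
nvidia-ctk --version            # NVIDIA Container Toolkit (after installation below)
```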

NVIDIA Container Toolkit Installation

Install the NVIDIA Container Toolkit to enable GPU access in Docker containers:

Ubuntu/Debian
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU Access

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

Container Setup

The Xerotier.ai backend agent is distributed as a Docker image that includes vLLM for model inference.

Container Details

| Property | Value |
|---|---|
| Image | xerotier/backend-agent:latest |
| User | inference (UID:GID 5152:5152) |
| Home Directory | /var/lib/inference |
| Model Cache | /var/lib/inference/.cache/xerotier/models |
| LMCache Directory | /var/lib/inference/.cache/lmcache |
| Config Directory | /var/lib/inference/.config/xerotier |

Pull the Image

docker pull xerotier/backend-agent:latest

Environment Variables

Configure the agent using environment variables. The following tables list all available options.

Required Variables

| Variable | Description |
|---|---|
| XEROTIER_AGENT_JOIN_KEY | Join key for enrolling with the Xerotier router mesh. Obtain from the Agents dashboard. |

Agent Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MAX_CONCURRENT | 8 | Maximum concurrent inference requests |
| XEROTIER_AGENT_LOG_LEVEL | info | Log level: debug, info, warn, error |
| XEROTIER_AGENT_HEARTBEAT_MS | 5000 | Heartbeat interval in milliseconds |

vLLM Configuration

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.2-1B-Instruct | HuggingFace model ID or local path |
| XEROTIER_AGENT_MAX_MODEL_LEN | auto | Maximum sequence length (uses model default if unset) |
| XEROTIER_AGENT_TENSOR_PARALLEL_SIZE | auto | Tensor parallel size for multi-GPU (auto-configured from visible devices) |
| XEROTIER_AGENT_GPU_MEMORY_UTILIZATION | 0.90 | GPU memory utilization (0.0-1.0) |
| SHM_SIZE | 8589934592 | Docker Compose shm_size setting (not an environment variable). Controls shared memory allocation for the container in bytes (8GB default). |
| XEROTIER_AGENT_CUDA_VISIBLE_DEVICES | 0 | Specific GPU devices to use (comma-separated) |

Cache Configuration

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_CACHE_PATH | /var/lib/inference/.cache/xerotier/models | Local model cache directory |
| XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB | 100 | Maximum cache size in gigabytes |
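
Putting these together, a minimal .env for a single-GPU agent might look like the following sketch. The join key is a placeholder; the remaining values simply restate the defaults from the tables above:

```bash
# .env - illustrative values only
XEROTIER_AGENT_JOIN_KEY=<your-join-key>       # required; from the Agents dashboard
XEROTIER_AGENT_MAX_CONCURRENT=8
XEROTIER_AGENT_LOG_LEVEL=info
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB=100
```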

LMCache Configuration

LMCache provides multi-tiered KV cache sharing for reduced Time-to-First-Token (TTFT). The agent natively manages LMCache configuration and passes it to vLLM at startup.

| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LMCACHE_ENABLED | true | Enable LMCache KV cache sharing |
| XEROTIER_AGENT_LMCACHE_REDIS_URL | - | Redis URL (redis://host:port or redis://:pass@host:port) |
| XEROTIER_AGENT_LMCACHE_MAX_CPU_MB | auto (10% of RAM) | Maximum CPU memory cache size in MB |
| XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB | auto (10% of disk) | Maximum disk cache size in GB |
| XEROTIER_AGENT_LMCACHE_DISK_PATH | /var/lib/inference/.cache/lmcache | Disk cache storage directory |

GPU Configuration

Configure GPU access and memory allocation for optimal performance.

Single GPU Setup

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=1
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
```

Multi-GPU Setup

For models that require multiple GPUs (tensor parallelism):

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.90
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0,1
```

Specific GPU Selection

To use specific GPUs (e.g., GPUs 2 and 3 on a 4-GPU system):

.env
```bash
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=2,3
```

Shared Memory Configuration

Larger models require more shared memory. Adjust shm_size (Docker Compose config) based on model size:

| Model Size | Recommended SHM_SIZE |
|---|---|
| 1-8B parameters | 8GB (8589934592) |
| 13-34B parameters | 16GB (17179869184) |
| 70B+ parameters | 32GB (34359738368) |
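
For example, a 70B-class model would use the 32GB value. Setting it in your .env lets the Compose files below pass it through to shm_size:

```bash
# .env - shared memory for a 70B+ parameter model (32GB in bytes)
SHM_SIZE=34359738368
```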

Model Storage

Configure persistent storage for downloaded models to avoid re-downloading on container restart.

Host Directory Setup

Create directories on the host with correct permissions matching the container user (UID:GID 5152:5152):

```bash
sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
sudo chown -R 5152:5152 /data/xerotier
```

Important: The container runs as the inference user with UID:GID 5152:5152. Host directories must be owned by this user for the agent to read and write model files.

Docker Compose with Volume Mounts

docker-compose.yaml
```yaml
services:
  agent:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${DOCKER_GPU_COUNT:-1}
              capabilities: [gpu]
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
      - XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```

LMCache Setup

LMCache provides multi-tiered KV cache sharing for vLLM, dramatically reducing Time-to-First-Token (TTFT) for repeated prompt prefixes. The agent natively manages LMCache configuration.

Benefits

  • Reduced Time-to-First-Token: Cache hits can reduce TTFT by 50-90% for repeated prompt prefixes
  • Multi-Tier Caching: Three cache tiers with different speed/capacity tradeoffs
  • Horizontal Scalability: Multiple agents can share a remote Redis/Valkey cache
  • Graceful Degradation: Agent continues without cache if initialization fails

Cache Tiers

| Tier | Speed | Default Size | Use Case |
|---|---|---|---|
| CPU Memory | ~100 GB/s | 10% of system RAM | Hot cache, frequently accessed prefixes |
| Local Disk | ~5 GB/s (NVMe) | 10% of partition | Warm cache, persistent across restarts |
| Remote Redis | ~1 GB/s (network) | Valkey maxmemory | Shared cache across multiple agents |

Quick Start (Local Only)

Enable LMCache with local CPU and disk caching only (no Redis required):

.env
```bash
# Enable LMCache with auto-calculated sizes
XEROTIER_AGENT_LMCACHE_ENABLED=true

# Optional: Override auto-calculated sizes
# XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=4096    # 4GB CPU cache
# XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=50    # 50GB disk cache
```

Production Setup with Valkey

For multi-agent deployments, add Valkey for shared KV cache:

docker-compose.yaml
```yaml
services:
  valkey:
    image: valkey/valkey:8.0
    ports:
      - "6379:6379"
    command:
      - valkey-server
      - --requirepass
      - "${VALKEY_PASSWORD}"
      - --maxmemory
      - 8gb
      - --maxmemory-policy
      - allkeys-lru
      - --save
      - ""
      - --appendonly
      - "no"
    healthcheck:
      test: ["CMD", "valkey-cli", "-a", "${VALKEY_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  agent:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    depends_on:
      valkey:
        condition: service_healthy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${DOCKER_GPU_COUNT:-1}
              capabilities: [gpu]
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=${XEROTIER_AGENT_TENSOR_PARALLEL_SIZE:-1}
      - XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=${XEROTIER_AGENT_GPU_MEMORY_UTILIZATION:-0.90}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-8}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
      # LMCache Configuration
      - XEROTIER_AGENT_LMCACHE_ENABLED=true
      - XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:${VALKEY_PASSWORD}@valkey:6379
      - XEROTIER_AGENT_LMCACHE_MAX_CPU_MB=${XEROTIER_AGENT_LMCACHE_MAX_CPU_MB:-}
      - XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB=${XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB:-}
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```

Sizing Recommendations

| Deployment Size | CPU Cache | Disk Cache | Valkey Memory |
|---|---|---|---|
| Small (32GB RAM, single agent) | 2-4 GB | 20 GB | 4 GB |
| Medium (64GB RAM, 2-4 agents) | 4-8 GB per agent | 50 GB per agent | 8 GB |
| Large (128GB+ RAM, 4+ agents) | 8-16 GB per agent | 100 GB per agent | 16-32 GB |

Tip: If you leave XEROTIER_AGENT_LMCACHE_MAX_CPU_MB and XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB unset, the agent auto-calculates optimal values (10% of system resources). This works well for most deployments.

Redis Authentication

For secured Redis/Valkey deployments, include the password in the URL:

```bash
# With password
XEROTIER_AGENT_LMCACHE_REDIS_URL=redis://:your-secret-password@valkey:6379

# Standard format
# redis://[:password@]host:port
```

Verifying LMCache

Check agent logs to verify LMCache initialization:

```bash
docker-compose logs agent | grep -i lmcache

# Expected output:
# LMCache enabled
# Wrote LMCache configuration
#   config_path=/var/lib/inference/.config/xerotier/lmcache_config.yaml
```

Monitoring Cache Performance

Monitor Valkey cache metrics using redis-cli:

```bash
# Connect to Valkey
docker-compose exec valkey valkey-cli

# Check memory usage
INFO memory

# Check cache hit rate
INFO stats
# Look for: keyspace_hits and keyspace_misses

# Monitor real-time operations
MONITOR
```

Multi-Tenant Note: If you serve multiple tenants, ensure XEROTIER_AGENT_VLLM_SALT_SECRET is configured on the agent. This generates per-tenant cache salts that isolate cache keyspaces, preventing cross-tenant data leakage.
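
One way to set this is to generate a random value and add it to the agent's environment. The openssl command below is just one option for producing a secret; any sufficiently random string should work:

```bash
# Generate a random secret (example approach)
openssl rand -hex 32

# .env
XEROTIER_AGENT_VLLM_SALT_SECRET=<generated-secret>
```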

AMD CPU Deployment (ZenDNN)

Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.

Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |

Memory Requirements by Model Size

CPU inference requires significantly more system RAM than an equivalent GPU deployment needs in VRAM:

| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |

Building the Docker Images

Building CPU-optimized vLLM with ZenDNN requires a multi-step process. For detailed instructions, see the AMD EPYC inference guide.

Step 1: Clone vLLM Repository

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0
```

Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.

Step 2: Build vLLM CPU Base Image

```bash
docker build -f docker/Dockerfile.cpu \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu:local \
  --target vllm-openai .
```

Step 3: Create ZenDNN Dockerfile

Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:

docker/Dockerfile.cpu-amd
```dockerfile
FROM vllm-cpu:local

RUN apt-get update -y && apt-get install -y --no-install-recommends \
        make cmake ccache git curl wget ca-certificates gcc-12 g++-12 \
        libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 \
        jq lsof libjemalloc2 gfortran && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
        --slave /usr/bin/g++ g++ /usr/bin/g++-12

RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
    cd ZenDNN-pytorch-plugin && \
    uv pip install -r requirements.txt && \
    CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
    uv pip install dist/*.whl

ENTRYPOINT ["vllm", "serve"]
```

Step 4: Build ZenDNN-Optimized Image

```bash
docker build -f docker/Dockerfile.cpu-amd \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu-zentorch:local .
```

Step 5: Build Xerotier.ai Agent Image

From the Xerotier.ai repository root, build the CPU agent:

```bash
docker build -f deploy/docker/Dockerfile.agent-amd-cpu \
  --tag xerotier/backend-agent-cpu:local .
```

CPU-Specific Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable ZenDNN optimization plugin |
| VLLM_CPU_KVCACHE_SPACE | - | KV cache size in GB (~75% of RAM) |
| VLLM_CPU_OMP_THREADS_BIND | - | CPU core binding range (e.g., 0-23) |
| VLLM_CPU_NUM_OF_RESERVED_CPU | 1 | CPUs reserved for OS operations |
| VLLM_CPU_OMP_NUM_THREADS | 16 | Number of OpenMP threads |

Docker Compose for CPU Agent

docker-compose-cpu.yaml
```yaml
services:
  agent:
    image: xerotier/backend-agent-cpu:local
    network_mode: host
    ipc: host
    privileged: true
    volumes:
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY}
      # Model is assigned from the dashboard during enrollment
      - XEROTIER_AGENT_MAX_MODEL_LEN=${XEROTIER_AGENT_MAX_MODEL_LEN:-}
      - XEROTIER_AGENT_MAX_CONCURRENT=${XEROTIER_AGENT_MAX_CONCURRENT:-5}
      - XEROTIER_AGENT_LOG_LEVEL=${XEROTIER_AGENT_LOG_LEVEL:-info}
      - VLLM_PLUGINS=zentorch
      - VLLM_CPU_KVCACHE_SPACE=${VLLM_CPU_KVCACHE_SPACE:-50}
      - VLLM_CPU_OMP_THREADS_BIND=${VLLM_CPU_OMP_THREADS_BIND:-0-23}
      - VLLM_CPU_NUM_OF_RESERVED_CPU=1
    shm_size: ${SHM_SIZE:-90g}
    restart: unless-stopped
```

Calculating Environment Values

Use these commands to calculate optimal values for your system:

```bash
# Calculate KV cache space (~75% of RAM minus overhead)
export VLLM_CPU_KVCACHE_SPACE="$(($(free -g | awk '/Mem/ {print $2}') * 75 / 100))"

# Calculate CPU core binding (all but one core)
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 2))"

# Calculate shared memory (total RAM minus 1GB buffer)
export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))m"

echo "KVCACHE_SPACE: ${VLLM_CPU_KVCACHE_SPACE}GB"
echo "CPU_THREADS_BIND: ${VLLM_CPU_OMP_THREADS_BIND}"
echo "SHM_SIZE: ${SHM_SIZE}"
```

Memory Tuning: Setting VLLM_CPU_KVCACHE_SPACE too high may cause out-of-memory errors. Start conservatively at 50-60% of available RAM and increase based on observed memory usage during inference.

Performance Considerations

  • Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
  • Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
  • Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
  • Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.

Troubleshooting

Common issues and their solutions when deploying self-hosted agents.

Agent Fails to Start

| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to the Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |

GPU Not Detected

```bash
# Verify NVIDIA driver
nvidia-smi

# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Check Docker runtime configuration
docker info | grep -i runtime
```

Model Loading Fails

| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify VLLM_MODEL is a valid HuggingFace model ID |

Permission Denied Errors

```bash
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier

# Verify permissions
ls -la /data/xerotier
```

Out of Memory (OOM)

  • Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower (see the example after this list)
  • Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
  • Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
  • Use a smaller model or add more GPUs
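
A conservative starting point that combines these adjustments might look like the following; the values are illustrative and should be tuned for your model and context length:

```bash
# .env - conservative settings to reduce OOM risk (illustrative values)
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_CONCURRENT=4
XEROTIER_AGENT_MAX_MODEL_LEN=8192
```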

LMCache Issues

| Symptom | Solution |
|---|---|
| LMCache not enabled (no logs) | Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in the environment |
| Redis connection failed | Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct |
| Config write failed | Ensure /var/lib/inference/.config/xerotier is writable |
| High disk usage | Set an explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit |
| No cache hits across agents | Verify all agents use the same XEROTIER_AGENT_LMCACHE_REDIS_URL |

Check LMCache status in logs:

```bash
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache

# Check the config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml

# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping
```

Common Commands

| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |

Frequently Asked Questions

How do I get a join key?

Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.

Can I run multiple models on one GPU?

The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
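
A sketch of one way to run one agent per GPU with Docker Compose, assuming a two-GPU host; whether each agent needs its own join key depends on your enrollment setup, and volumes and GPU reservations are omitted for brevity (see the full Compose example above):

```yaml
services:
  agent-gpu0:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY_0}
      - XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=0
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped

  agent-gpu1:
    image: xerotier/backend-agent:latest
    runtime: nvidia
    environment:
      - XEROTIER_AGENT_JOIN_KEY=${XEROTIER_AGENT_JOIN_KEY_1}
      - XEROTIER_AGENT_CUDA_VISIBLE_DEVICES=1
    shm_size: ${SHM_SIZE:-8589934592}
    restart: unless-stopped
```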

How do I update the agent?

Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.

What models are supported?

Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.

How much VRAM do I need?

As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
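
As a back-of-the-envelope check, FP16/BF16 weights take about 2 bytes per parameter; adding roughly 20% for runtime overhead reproduces the 7B figure (KV cache needs additional headroom on top of this):

```bash
# 7B parameters x 2 bytes/param x ~1.2 overhead factor, in GB
echo $(( 7 * 2 * 120 / 100 ))   # prints 16
```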

Can I use AMD GPUs?

AMD ROCm GPU support is planned for a future release. However, you can run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.

Why do I need to build my own Docker image for CPU inference?

CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.

What vLLM version should I use with ZenDNN?

Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.

Is my data secure?

Yes. Self-hosted agents only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.

What happens if my agent goes offline?

Requests are automatically routed to other available agents. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your agent reconnects.

Do I need LMCache?

LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.

Can I use LMCache without Valkey/Redis?

Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-agent deployments. Add Valkey when you need cache sharing across multiple agents.

What happens if LMCache fails to initialize?

The agent degrades gracefully: it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check the logs for initialization errors if you expected LMCache to be enabled.

How much memory should I allocate for LMCache?

The agent auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.
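
For Valkey-backed deployments, eviction and hit/miss counters can be read from INFO stats; assuming VALKEY_PASSWORD is exported in your shell, something like this works:

```bash
docker-compose exec valkey valkey-cli -a "$VALKEY_PASSWORD" INFO stats \
  | grep -E 'evicted_keys|keyspace_hits|keyspace_misses'
```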