XIM Advanced Configuration
Fine-tune quantization, speculative decoding, flow control, enrollment, and other advanced settings for your Xerotier Inference Microservice.
Quantization
The XIM node supports both automatic and manual quantization to fit large models into limited GPU VRAM. Auto-quantization inspects the model size and available VRAM at startup and selects the best method automatically.
Quantization Override
Quantization is determined by the database value set when the endpoint is created. The agent CLI can override this with an explicit method:
# Force a specific quantization method
XEROTIER_AGENT_VLLM_QUANTIZATION=bitsandbytes
Quantization Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_QUANTIZATION | - | Force a specific quantization method (overrides the database value). Supported values: bitsandbytes, bitsandbytes-fp4, fp8, awq, gptq, and more. |
Pre-Quantized Models: If the model is already quantized (AWQ, GPTQ format on HuggingFace), the XIM node detects this and passes the appropriate flag to vLLM. No additional configuration is needed.
Speculative Decoding
Speculative decoding can improve generation throughput by cheaply proposing several draft tokens and then verifying them in a single forward pass of the target model. The XIM node supports several speculative methods.
Opt-In Required: Speculative decoding is disabled by default. You must set XEROTIER_AGENT_SPECULATIVE_ENABLED=1 to activate it. The method is auto-detected from the model architecture when not explicitly specified.
Supported Methods
| Method | Description |
|---|---|
| deepseek_mtp | Multi-Token Prediction (MTP) for DeepSeek models that natively support it. |
| ngram | N-gram based speculation using prompt history. No draft model required. |
| eagle | EAGLE speculative decoding using an external draft model. |
Speculative Decoding Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_SPECULATIVE_ENABLED | disabled | Set to 1 or true to enable speculative decoding. Required. |
| XEROTIER_AGENT_SPECULATIVE_METHOD | auto | Force a speculative method: deepseek_mtp, ngram, eagle, etc. Auto-detected from the model architecture when not set. |
| XEROTIER_AGENT_SPECULATIVE_TOKENS | method default | Number of speculative tokens per step. Higher values increase throughput at the cost of verification overhead. |
| XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK | disabled | Set to 1 or true to enable n-gram fallback when the primary method is unavailable. |
| XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH | - | Filesystem path to an external draft model. Required for eagle and medusa methods. |
Example: MTP on DeepSeek
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
# Method is auto-detected for DeepSeek models (deepseek_mtp)
# Override if needed:
# XEROTIER_AGENT_SPECULATIVE_METHOD=deepseek_mtp
Example: N-Gram Fallback
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
XEROTIER_AGENT_SPECULATIVE_METHOD=ngram
XEROTIER_AGENT_SPECULATIVE_TOKENS=3
XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK=1
MoE Kernel Tuning
For Mixture-of-Experts (MoE) models, the XIM node can automatically generate and apply optimized kernel tuning configurations. This improves expert dispatch performance on your specific GPU hardware.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MOE_CONFIG_ENABLED | enabled | Enable automatic MoE kernel tuning config generation. Set to 0 or false to disable. |
| XEROTIER_AGENT_MOE_CONFIG_PATH | auto | Custom path for MoE tuned config files. When unset, the XIM node stores configs alongside the model cache. |
Non-MoE Models: These settings have no effect on dense (non-MoE) models. The XIM node detects whether the loaded model uses MoE architecture and only generates configs when applicable.
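For example, tuned configs can be kept on a persistent volume so they survive container rebuilds. The path below is illustrative; substitute your own mount point:

```shell
# Keep automatic MoE kernel tuning enabled (the default)
XEROTIER_AGENT_MOE_CONFIG_ENABLED=1
# Hypothetical shared-volume path for the generated tuning configs
XEROTIER_AGENT_MOE_CONFIG_PATH=/data/xerotier/moe-configs
```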
Auto-Configuration
The XIM node includes a dynamic auto-configuration system that inspects the model architecture, GPU hardware, and available memory at startup to select optimal vLLM parameters. This covers tensor parallelism, quantization, context length, and CUDA graph settings.
Auto-Configuration Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_AUTO_CONFIG | enabled | Master toggle for dynamic auto-configuration. Set to 0 or false to disable all auto-tuning and use only explicit settings. |
| XEROTIER_AGENT_AUTO_CONFIGURE_GPU | enabled | Auto-configure tensor parallelism from the detected GPU count. Set to 0 or false to use the explicit XEROTIER_AGENT_TENSOR_PARALLEL_SIZE value. |
| XEROTIER_AGENT_AUTO_CUDA_MITIGATION | enabled | Auto-apply CUDA graph mitigations for known GPU issues (A30, A40, L40 with TP>1). Set to 0 or false to disable. |
Disabling Auto-Configuration
To take full manual control of vLLM parameters, disable all auto-configuration:
# Disable all auto-configuration
XEROTIER_AGENT_AUTO_CONFIG=0
XEROTIER_AGENT_AUTO_CONFIGURE_GPU=0
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0
# Then set explicit values
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_MODEL_LEN=8192
CUDA Graph Workarounds
Some GPU models (A30, A40, L40) experience CUDA graph capture failures under specific tensor parallelism configurations. The XIM node detects these cases and applies mitigations automatically when XEROTIER_AGENT_AUTO_CUDA_MITIGATION is enabled.
For manual control of CUDA graph behavior:
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS | disabled | Force-disable CUDA graphs (enforce eager execution). Set to 1 or true if experiencing CUDA graph capture failures. |
| XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE | disabled | Force-disable the custom all-reduce optimization. Set to 1 or true for multi-GPU P2P issues on A30/A40 GPUs. |
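If you prefer to manage these mitigations yourself on affected hardware, the overrides can be combined. The combination below is an illustrative sketch for an A30/A40 multi-GPU setup, not a prescribed configuration:

```shell
# Take manual control: turn off automatic mitigation, then apply both
# workarounds explicitly (eager execution + no custom all-reduce)
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0
XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS=1
XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE=1
```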
Flow Control
The XIM node implements credit-based flow control for streaming inference to prevent buffer overflow on the router side. Flow control ensures the XIM node does not send chunks faster than the router can forward them to clients.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_STREAMING_SOCKET_ENABLED | true | Enable the dedicated streaming DEALER socket. When enabled, inference chunks use a separate socket to avoid head-of-line blocking on the control channel. |
| XEROTIER_AGENT_FLOW_CONTROL_ENABLED | true | Enable credit-based flow control for streaming inference. Set to 0 or false to disable backpressure. |
| XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES | 65536 | Initial credit window size in bytes per stream. The XIM node can send up to this many bytes before waiting for the router to replenish credits. |
| XEROTIER_AGENT_FLOW_CONTROL_REPLENISH_THRESHOLD | 0.5 | Fraction of the window that triggers a credit replenishment from the router (0.0-1.0). |
| XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES | 262144 | Total unacknowledged bytes across all streams before the XIM node pauses sending. Acts as a global safety valve. |
| XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS | 30 | Seconds to wait for credit recovery before considering the stream stalled. |
Defaults are production-tuned: The default flow control settings work well for most deployments. Only adjust these if you observe streaming stalls or excessive backpressure pauses in the XIM node logs.
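If you do observe stalls on high-latency links, a larger per-stream window and a longer stall timeout are the usual first adjustments. The values below are illustrative starting points, not recommendations:

```shell
# Quadruple the per-stream credit window for high-latency links
XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES=262144
# Raise the global pause threshold so one large stream cannot trip it alone
XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES=1048576
# Allow more time for credit recovery before declaring a stall
XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS=60
```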
Lease Configuration
The XIM node maintains a lease with the router mesh via periodic heartbeats. If the router does not receive a renewal within the lease duration window, it marks the XIM node as expired and stops routing requests to it.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS | 10000 | How often (in milliseconds) the XIM node sends a lease renewal heartbeat to the router. Default: 10 seconds. |
| XEROTIER_AGENT_LEASE_DURATION_MS | 30000 | Requested lease duration in milliseconds. If the router does not receive a renewal within this window, the XIM node is marked expired. Default: 30 seconds. |
Keep the ratio safe: The lease duration should be at least 2-3x the renewal interval. A tight margin increases the risk of false lease expirations during transient network issues.
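For example, detection time can be shortened while keeping the 3x safety margin intact. These values are illustrative:

```shell
# Faster expiry detection with the same 3x duration-to-renewal ratio
XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS=5000
XEROTIER_AGENT_LEASE_DURATION_MS=15000
```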
Model Pull Retry
When the XIM node downloads a model from the router or a remote registry, it uses exponential backoff retries on failure. These settings control retry behavior.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS | 5 | Maximum number of retry attempts for a failed model pull before giving up. |
| XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS | 1000 | Base delay in milliseconds between retries. Each subsequent retry doubles the delay (exponential backoff). |
| XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS | 15000 | Maximum delay in milliseconds between retries. The backoff delay is capped at this value. |
Retry Sequence: With defaults, retries occur at approximately 1s, 2s, 4s, 8s, 15s (capped). Increase XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS for unreliable networks.
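The retry schedule follows from the three settings; this sketch assumes the delay formula is min(base * 2^(n-1), max), which matches the documented sequence:

```shell
#!/bin/sh
# Reproduce the default backoff schedule: delay_n = min(base * 2^(n-1), cap)
base=1000    # XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS
cap=15000    # XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS
attempts=5   # XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS

schedule=""
delay=$base
n=1
while [ "$n" -le "$attempts" ]; do
  schedule="$schedule $delay"
  delay=$(( delay * 2 ))                            # double each attempt
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi   # cap at the maximum
  n=$(( n + 1 ))
done
echo "delays_ms:$schedule"   # delays_ms: 1000 2000 4000 8000 15000
```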
Enrollment Options
Additional options for the enrollment and startup process.
Insecure Enrollment
By default, enrollment requires HTTPS endpoints. For development or trusted network environments, you can allow non-HTTPS enrollment:
# Via CLI flag
xerotier-backend-agent enroll --join-key xjk_abc123... --insecure
# Via environment variable
XEROTIER_AGENT_ALLOW_INSECURE=1 xerotier-backend-agent enroll --join-key xjk_abc123...
Security Risk: Insecure enrollment transmits your join key and agent credentials over an unencrypted connection. Only use this in isolated development environments. Never use --insecure in production.
Initial Model Path
Preload a local model on startup instead of waiting for the router to stream one. This is useful when you have already downloaded a model to disk:
# Load a model from a local path on startup
XEROTIER_AGENT_INITIAL_MODEL_PATH=/data/models/my-model
When set, the XIM node starts vLLM with this model immediately after registration. The model must already exist at the specified path in a format vLLM can load (safetensors or equivalent).
Tenant Cache Isolation
In multi-tenant deployments where a single XIM node serves requests from different tenants, set a server secret to generate per-tenant cache salts. This prevents cross-tenant data leakage in the KV cache:
# Server secret for tenant-isolated cache keys
XEROTIER_AGENT_VLLM_SALT_SECRET=your-secret-string-here
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_SALT_SECRET | built-in default | Server secret for generating per-tenant cache isolation salts. Production deployments should always set a unique secret. |
| XEROTIER_AGENT_ALLOW_INSECURE | disabled | Set to 1 or true to allow non-HTTPS enrollment. Development only. |
| XEROTIER_AGENT_INITIAL_MODEL_PATH | - | Filesystem path to a local model to preload on startup. The XIM node starts vLLM with this model immediately after registration. |
Single-Node Queuing
When a project has exactly one compatible XIM node and that node is at capacity, the router automatically queues incoming requests instead of returning an immediate 503 error. This improves the experience for small deployments running a single XIM node.
How It Works
- A request arrives and the router finds exactly one compatible XIM node.
- That node is at its concurrent request limit.
- Instead of failing, the router parks the request in a waiting queue.
- When the node completes an in-flight request and frees a slot, the queued request is dispatched.
- If the queue timeout expires before capacity becomes available, the request fails with a 503.
Queue Limits
| Parameter | Value | Description |
|---|---|---|
| Max queued per node | 10 | Maximum number of requests that can wait for a single node to free capacity. |
| Queue timeout | 30 seconds | Maximum time a request waits in the queue before receiving a 503 response. |
Router-Side Setting: The queue timeout (default: 30 seconds) is configured on the Xerotier router, not on the agent side. Contact support if you need to adjust the timeout for your deployment.
This behavior is transparent to clients. During queuing, the router holds the HTTP connection open. If the request is dispatched successfully, the client receives a normal response. The X-Request-ID response header can be used to trace queued requests in logs.
AMD CPU Deployment (ZenDNN)
Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.
Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |
Memory Requirements by Model Size
CPU inference requires significantly more system RAM than GPU VRAM:
| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |
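A rough rule of thumb fitted to the table above is a ~32GB baseline plus ~8GB per billion parameters. This is an illustration of the table's scaling, not an official sizing formula; validate against your workload:

```shell
# Illustrative fit of the sizing table: ram_gb ~ 32 + 8 * params_in_billions
params_b=8                       # e.g. an 8B-parameter model
ram_gb=$(( 32 + 8 * params_b ))
echo "${ram_gb}GB"               # 96GB, matching the 7-8B row above
```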
Building the Docker Images
Building CPU-optimized vLLM with ZenDNN requires a multi-step process. For detailed instructions, see the AMD EPYC inference guide.
Step 1: Clone vLLM Repository
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0
Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.
Step 2: Build vLLM CPU Base Image
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu:local \
--target vllm-openai .
Step 3: Create ZenDNN Dockerfile
Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:
# syntax=docker/dockerfile:1
# SPDX-License-Identifier: MIT
# ZenDNN Optimization Layer for vLLM CPU Inference
# Builds on top of the vLLM CPU base image to add AMD ZenDNN support
#
# Prerequisites:
# Build the vLLM CPU base image first:
# docker build -f docker/Dockerfile.cpu \
# --build-arg VLLM_CPU_AVX512BF16=1 \
# --build-arg VLLM_CPU_AVX512VNNI=1 \
# --build-arg VLLM_CPU_DISABLE_AVX512=0 \
# --tag vllm-cpu:local \
# --target vllm-openai .
#
# Build:
# docker build -f docker/Dockerfile.cpu-amd --tag vllm-cpu-zentorch:local .
# Base image stages
FROM ghcr.io/cloudnull/xerotier/xerotier:latest AS xerotier_base
FROM vllm-cpu:local AS vllm_base
# NOTE(cloudnull): Both of these should be kept in sync with the versions used in the
# vLLM repository's submodules for ZenDNN and ZenTorch.
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09
RUN apt-get update -y \
&& apt-get install -y --no-install-recommends \
make ninja-build ccache git curl wget ca-certificates \
gcc-12 g++-12 gfortran \
libtcmalloc-minimal4 libnuma-dev libjemalloc2 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
--slave /usr/bin/g++ g++ /usr/bin/g++-12 \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --break-system-packages "cmake>=3.25"
WORKDIR /build
RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git
WORKDIR /build/ZenDNN-pytorch-plugin
RUN git checkout ${ZENTORCH_VERSION}
# NOTE(cloudnull): Patch ZenDNN CMake files to use the same version for both the main plugin and the ZenDNN submodule.
# This ensures that the plugin and the ZenDNN library it builds against are always in sync, preventing
# potential compatibility issues.
RUN sed -i -E "s/ZENDNN_TAG\s\".+\"/ZENDNN_TAG \"${ZENDNN_VERSION}\"/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnn.cmake
RUN sed -i -E "s/GIT_TAG\s.+/GIT_TAG ${ZENDNN_VERSION}/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnnl.cmake
RUN --mount=type=cache,target=/root/.cache/pip \
uv pip install -r requirements.txt && \
uv pip install --extra-index-url https://download.pytorch.org/whl/cpu --force-reinstall torch torchvision torchaudio
# NOTE(cloudnull): This is just a dirty hack to make zentorch 15.2RC work.
# Stub out DLRM EmbeddingBag/QuantEmbedBag ops - their ZenDNN kernel
# implementations are incompatible with the ZenDNN version CMake
# downloads. These are recommender-system ops not needed for LLM
# inference (PagedAttention compiles fine). The source files are kept
# in CMakeLists.txt so that symbols referenced by the dispatch table
# remain defined; the stubs throw at runtime if ever called.
RUN printf '#include <ATen/ATen.h>\n\
#include <torch/library.h>\n\
namespace zentorch {\n\
at::Tensor zentorch_get_packed_embedding_weight(\n\
at::Tensor& /*w*/, at::Tensor& /*o*/, at::Tensor& /*p*/) {\n\
TORCH_CHECK(false,\n\
"zentorch EmbeddingBag ops disabled (not needed for LLM inference)");\n\
return {};\n\
}\n\
} // namespace zentorch\n' > src/cpu/cpp/EmbeddingBag.cpp \
&& > src/cpu/cpp/QuantEmbedBag.cpp
RUN CC=gcc CXX=g++ python setup.py bdist_wheel
FROM vllm-cpu:local
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09
ARG BUILD_DATE=unknown
ARG VLLM_VERSION=unknown
# NOTE(cloudnull): Install runtime dependencies
# - libzmq5: ZeroMQ runtime library (required for dynamic linking)
# - ca-certificates: For HTTPS connections to vLLM
# - curl: For health checks
RUN apt-get update && apt-get install -y \
libzmq5 \
ca-certificates \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY --from=vllm_base /build/ZenDNN-pytorch-plugin/dist/*.whl /tmp/
RUN uv pip install /tmp/*.whl lmcache llmcompressor
# NOTE(cloudnull): This is another dirty hack to make zentorch 15.2RC work.
# Wrap _meta_registrations import so zentorch loads even when
# EmbeddingBag ops are stubbed out. Meta registrations are only
# needed for torch.compile tracing, not eager-mode inference.
RUN sed -i 's/^from \._meta_registrations import \*/try:\n from ._meta_registrations import * # noqa\nexcept (AttributeError, ImportError):\n pass # EmbeddingBag ops stubbed out/' \
/opt/venv/lib/python3.12/site-packages/zentorch/__init__.py
# Create non-root user
RUN useradd -l -r -u 5152 -m inference --home-dir /var/lib/inference
RUN mkdir -p /var/lib/inference/.cache/xerotier && mkdir -p /var/lib/inference/.config/xerotier && chown -R inference:inference /var/lib/inference
# Copy binary from xerotier_base
COPY --from=xerotier_base /usr/local/bin/xerotier-backend-agent /usr/local/bin/
# Copy entrypoint script
COPY --from=xerotier_base /usr/local/bin/entrypoint-agent.sh /usr/local/bin/
# Copy vLLM wrapper that patches huggingface_hub for local model paths
COPY --from=xerotier_base /usr/local/bin/xerotier-vllm /usr/local/bin/xerotier-vllm
# Set ownership and permissions
RUN chown inference:inference /usr/local/bin/xerotier-backend-agent && \
chown inference:inference /usr/local/bin/entrypoint-agent.sh && \
chown inference:inference /usr/local/bin/xerotier-vllm && \
chmod +x /usr/local/bin/entrypoint-agent.sh && \
chmod +x /usr/local/bin/xerotier-vllm
# Switch to non-root user
USER inference
# Configure environment
ENV HOME=/var/lib/inference
ENV PYTORCH_ALLOC_CONF=expandable_segments:True
ENV XEROTIER_AGENT_VLLM_PATH=/usr/local/bin/xerotier-vllm
ENV VLLM_PLUGINS="zentorch"
ENV LMCACHE_CONFIG_FILE=/var/lib/inference/.config/xerotier/lmcache_config.yaml
LABEL org.opencontainers.image.authors="Cloudnull" \
org.opencontainers.image.url="https://xerotier.com" \
org.opencontainers.image.documentation="https://xerotier.com/docs" \
org.opencontainers.image.source="https://github.com/cloudnull/xerotier-public" \
org.opencontainers.image.title="Xerotier Inference Agent (vLLM ZenDNN)" \
org.opencontainers.image.description="Xerotier Inference Agent image optimized for AMD EPYC CPUs using vLLM and ZenDNN. This image includes the necessary runtime dependencies to run the agent with vLLM on AMD CPU hardware. It is designed to be used with the Xerotier Inference Router for distributed model inference across AMD CPU-equipped nodes." \
org.opencontainers.image.licenses="MIT" \
org.opencontainers.image.base.name="vllm-cpu:local" \
org.opencontainers.image.base.vllm_version="${VLLM_VERSION:-unknown}" \
org.opencontainers.image.base.zendnn_version="${ZENDNN_VERSION}" \
org.opencontainers.image.base.zentorch_version="${ZENTORCH_VERSION}" \
org.opencontainers.image.vendor="Xerotier" \
org.opencontainers.image.build_type="production" \
org.opencontainers.image.created="${BUILD_DATE}"
# Entrypoint handles enrollment on first run, then starts the agent
# Override with: docker run ... xerotier-backend-agent enroll --help
ENTRYPOINT ["/usr/local/bin/entrypoint-agent.sh"]
Step 4: Build ZenDNN-Optimized Image
docker build -f docker/Dockerfile.cpu-amd \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu-zentorch:local .
Step 5: Build Xerotier.ai XIM Image
From the Xerotier.ai repository root, build the CPU XIM node:
docker build -f deploy/docker/Dockerfile.xim-vllm-zendnn \
--tag xerotier/backend-agent-cpu:local .
CPU-Specific Environment Variables
| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable the ZenDNN optimization plugin. |
| VLLM_CPU_KVCACHE_SPACE | auto | KV cache size in GB (auto: total RAM - model - overhead). |
| VLLM_CPU_OMP_THREADS_BIND | auto | CPU core binding range (auto: 0-{nproc-1}). |
| VLLM_CPU_NUM_OF_RESERVED_CPU | auto | CPUs reserved for OS operations (auto: 1). |
Auto-Tuning: The XIM node automatically computes VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, and VLLM_CPU_NUM_OF_RESERVED_CPU at startup based on system resources. No manual calculation is required. Override via XEROTIER_AGENT_VLLM_ENV if the defaults are unsuitable.
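When an auto-computed value does need adjusting, override it via XEROTIER_AGENT_VLLM_ENV. A single KEY=VALUE assignment is the form shown elsewhere in this guide; the value 40 below is illustrative:

```shell
# Pin the KV cache to 40GB instead of the auto-computed value
XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=40"
```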
Docker Compose for CPU XIM Node
# SPDX-License-Identifier: MIT
# Xerotier Agent - CPU Inference Stack
#
# Deploys a XIM CPU node for inference without GPU acceleration.
# ZenTorch optimization for AMD EPYC CPUs is available as an opt-in image tag.
#
# IMAGE OPTIONS:
# AMD ZenTorch optimized:
# DOCKER_REGISTRY=ghcr.io/cloudnull/xerotier VERSION=latest
# XEROTIER_CPU_IMAGE=xerotier-xim-amd
# -> ghcr.io/cloudnull/xerotier-public/xerotier-xim-amd-cpu-zendnn:latest
#
# QUICK START:
# 1. Get a join key from your Xerotier dashboard:
# Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
# 2. Set your join key:
# export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
# 3. Create host directories with correct permissions:
# sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
# sudo chown -R 5152:5152 /data/xerotier
#
# 4. Start the agent:
# docker compose -f docker-compose.agent-amd-cpu-zendnn.yaml up -d
#
# ENROLLMENT WORKFLOW:
# - On first start, the agent enrolls using your join key
# - Enrollment state is persisted to /data/xerotier/config
# - On subsequent restarts, the agent reconnects automatically
# - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
# XEROTIER_AGENT_JOIN_KEY [REQUIRED] Join key from Xerotier dashboard (first run only)
# XEROTIER_CPU_IMAGE Container image name (default: xerotier-xim-amd-cpu-zendnn)
# XEROTIER_AGENT_MAX_CONCURRENT Optional ceiling for concurrent requests (auto-configured when not set)
# XEROTIER_AGENT_LOG_LEVEL Logging level: trace, debug, info, warning, error (default: info)
# XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB Local model cache size in GB (default: 100)
# SHM_SIZE Shared memory size (default: 90g)
#
# CPU TUNING (auto-computed by the agent; override via XEROTIER_AGENT_VLLM_ENV if needed):
# VLLM_CPU_OMP_THREADS_BIND CPU thread binding range (auto: 0-{nproc-1})
# VLLM_CPU_NUM_OF_RESERVED_CPU Reserved CPUs for system (auto: 1)
# VLLM_CPU_KVCACHE_SPACE KV cache memory in GB (auto: totalRAM - model - overhead)
# OMP_NUM_THREADS OpenMP threads (auto: physical core count)
# OMP_PROC_BIND Pin threads to cores (auto: TRUE)
# OMP_WAIT_POLICY Thread wait strategy (auto: PASSIVE)
#
# KERNEL SOCKET BUFFER TUNING (optional):
# The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
# Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
# which silently caps the requested buffer size. The agent will attempt to
# raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
# If the container is not privileged, set these on the host before starting:
# sudo sysctl -w net.core.wmem_max=4194304
# sudo sysctl -w net.core.rmem_max=4194304
#
# To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
# net.core.wmem_max = 4194304
# net.core.rmem_max = 4194304
services:
  agent:
    image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-zendnn:${VERSION:-latest}
    container_name: xim-vllm-zendnn
    network_mode: host
    ipc: host
    privileged: true
    shm_size: ${SHM_SIZE:-90g}
    # No command - entrypoint handles enrollment + run automatically
    volumes:
      # Persistent model cache
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      # Persistent enrollment state
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      # Caching
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: ${XEROTIER_AGENT_JOIN_KEY:-}
      # Agent Configuration
      XEROTIER_AGENT_LOG_LEVEL: ${XEROTIER_AGENT_LOG_LEVEL:-info}
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: "true"
      XEROTIER_AGENT_LMCACHE_REDIS_URL: "${XEROTIER_AGENT_LMCACHE_REDIS_URL:-}"
    restart: unless-stopped
Memory Tuning: If the auto-computed VLLM_CPU_KVCACHE_SPACE causes out-of-memory errors, override it to a lower value via XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=<GB>".
Performance Considerations
- Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
- Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
- Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
- Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.
Troubleshooting
Common issues and their solutions when deploying XIM nodes.
Agent Fails to Start
| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |
GPU Not Detected
# Verify NVIDIA driver
nvidia-smi
# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check Docker runtime configuration
docker info | grep -i runtime
Model Loading Fails
| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify VLLM_MODEL is a valid HuggingFace model ID |
Permission Denied Errors
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier
# Verify permissions
ls -la /data/xerotier
Out of Memory (OOM)
- Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower
- Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
- Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
- Use a smaller model or add more GPUs
LMCache Issues
| Symptom | Solution |
|---|---|
| LMCache not enabled (no logs) | Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in environment |
| Redis connection failed | Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct |
| Config write failed | Ensure /var/lib/inference/.config/xerotier is writable |
| High disk usage | Set explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit |
| No cache hits across nodes | Verify all nodes use same XEROTIER_AGENT_LMCACHE_REDIS_URL |
Check LMCache status in logs:
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache
# Check config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml
# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping
Common Commands
| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |
Frequently Asked Questions
How do I get a join key?
Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.
Can I run multiple models on one GPU?
The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
How do I update the agent?
Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.
What models are supported?
Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.
How much VRAM do I need?
As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
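The ~16GB figure for a 7B model follows from fp16/bf16 weights at 2 bytes per parameter plus runtime overhead. A hedged back-of-the-envelope check (the ~15% overhead factor is illustrative):

```shell
# fp16/bf16 weights: 2 bytes per parameter, plus ~15% runtime overhead
params_b=7
vram_gb=$(( params_b * 2 * 115 / 100 ))
echo "${vram_gb}GB"   # 16GB
```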
Can I use AMD GPUs?
Yes. AMD ROCm GPU support is available. See the AMD ROCm GPU Setup section for the Docker Compose configuration. You can also run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.
Why do I need to build my own Docker image for CPU inference?
CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.
What vLLM version should I use with ZenDNN?
Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.
Is my data secure?
Yes. XIM nodes only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.
What happens if my XIM node goes offline?
Requests are automatically routed to other available XIM nodes. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your node reconnects.
Do I need LMCache?
LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.
Can I use LMCache without Valkey/Redis?
Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-node deployments. Add Valkey when you need cache sharing across multiple nodes.
What happens if LMCache fails to initialize?
The XIM node gracefully degrades - it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check logs for initialization errors if you expected LMCache to be enabled.
How much memory should I allocate for LMCache?
The XIM node auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.