XIM Advanced Configuration

Fine-tune quantization, speculative decoding, flow control, enrollment, and other advanced settings for your Xerotier Inference Microservice.

Quantization

The XIM node supports both automatic and manual quantization to fit large models into limited GPU VRAM. Auto-quantization inspects the model size and available VRAM at startup and selects the best method automatically.

Quantization Override

Quantization is determined by the database value set when the endpoint is created. The agent CLI can override this with an explicit method:

.env
# Force a specific quantization method
XEROTIER_AGENT_VLLM_QUANTIZATION=bitsandbytes

Quantization Environment Variables

Variable Default Description
XEROTIER_AGENT_VLLM_QUANTIZATION - Force a specific quantization method (overrides database value). Supported values: bitsandbytes, bitsandbytes-fp4, fp8, awq, gptq, and more...

Pre-Quantized Models: If the model is already quantized (AWQ, GPTQ format on HuggingFace), the XIM node detects this and passes the appropriate flag to vLLM. No additional configuration is needed.

Speculative Decoding

Speculative decoding can improve generation throughput by drafting multiple candidate tokens ahead of time and verifying them in a single forward pass. The XIM node supports several speculative methods.

Opt-In Required: Speculative decoding is disabled by default. You must set XEROTIER_AGENT_SPECULATIVE_ENABLED=1 to activate it. The method is auto-detected from the model architecture when not explicitly specified.

Supported Methods

Method Description
deepseek_mtp Multi-Token Prediction (MTP) for DeepSeek models that natively support it.
ngram N-gram based speculation using prompt history. No draft model required.
eagle EAGLE speculative decoding using an external draft model.
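To make the ngram method concrete, here is an illustrative sketch of the idea (not the vLLM implementation): look up the last n tokens in the sequence so far, and if that n-gram occurred earlier in the context, propose the tokens that followed it as draft tokens.

```python
# Toy n-gram speculation: find an earlier occurrence of the trailing
# n-gram and speculate its continuation. No draft model required.
def ngram_propose(tokens: list[int], n: int = 2, k: int = 3) -> list[int]:
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan history (excluding the tail itself) for an earlier match.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]  # speculate up to k tokens
    return []

# After seeing token 10 again, propose the tokens that followed it before.
seq = [10, 11, 12, 13, 10]
print(ngram_propose(seq, n=1, k=3))  # [11, 12, 13]
```

The verifier then accepts the longest prefix of the proposal that matches the model's own predictions, so a wrong guess costs only the verification pass.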

Speculative Decoding Environment Variables

Variable Default Description
XEROTIER_AGENT_SPECULATIVE_ENABLED disabled Set to 1 or true to enable speculative decoding. Required.
XEROTIER_AGENT_SPECULATIVE_METHOD auto Force a speculative method: deepseek_mtp, ngram, eagle, etc. Auto-detected from model architecture when not set.
XEROTIER_AGENT_SPECULATIVE_TOKENS method default Number of speculative tokens per step. Higher values increase throughput at the cost of verification overhead.
XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK disabled Set to 1 or true to enable n-gram fallback when the primary method is unavailable.
XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH - Filesystem path to an external draft model. Required for eagle and medusa methods.

Example: MTP on DeepSeek

.env
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
# Method is auto-detected for DeepSeek models (deepseek_mtp)
# Override if needed:
# XEROTIER_AGENT_SPECULATIVE_METHOD=deepseek_mtp

Example: N-Gram Fallback

.env
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
XEROTIER_AGENT_SPECULATIVE_METHOD=ngram
XEROTIER_AGENT_SPECULATIVE_TOKENS=3
XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK=1

MoE Kernel Tuning

For Mixture-of-Experts (MoE) models, the XIM node can automatically generate and apply optimized kernel tuning configurations. This improves expert dispatch performance on your specific GPU hardware.

Variable Default Description
XEROTIER_AGENT_MOE_CONFIG_ENABLED enabled Enable automatic MoE kernel tuning config generation. Set to 0 or false to disable.
XEROTIER_AGENT_MOE_CONFIG_PATH auto Custom path for MoE tuned config files. When unset, the XIM node stores configs alongside the model cache.

Non-MoE Models: These settings have no effect on dense (non-MoE) models. The XIM node detects whether the loaded model uses MoE architecture and only generates configs when applicable.

Auto-Configuration

The XIM node includes a dynamic auto-configuration system that inspects the model architecture, GPU hardware, and available memory at startup to select optimal vLLM parameters. This covers tensor parallelism, quantization, context length, and CUDA graph settings.
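As an illustration of one such decision, a simplified tensor-parallelism heuristic might look like the following. The function and the power-of-two rule are assumptions for illustration, not the agent's actual algorithm:

```python
# Illustrative auto-TP heuristic: use the largest power of two that does
# not exceed the detected GPU count. vLLM requires the model's attention
# head count to be divisible by the TP size, and powers of two are the
# safe common case.
def auto_tensor_parallel(gpu_count: int) -> int:
    tp = 1
    while tp * 2 <= gpu_count:
        tp *= 2
    return tp

print(auto_tensor_parallel(1))  # 1
print(auto_tensor_parallel(3))  # 2 (third GPU left unused by TP)
print(auto_tensor_parallel(8))  # 8
```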

Auto-Configuration Environment Variables

Variable Default Description
XEROTIER_AGENT_AUTO_CONFIG enabled Master toggle for dynamic auto-configuration. Set to 0 or false to disable all auto-tuning and use only explicit settings.
XEROTIER_AGENT_AUTO_CONFIGURE_GPU enabled Auto-configure tensor parallelism from detected GPU count. Set to 0 or false to use the explicit XEROTIER_AGENT_TENSOR_PARALLEL_SIZE value.
XEROTIER_AGENT_AUTO_CUDA_MITIGATION enabled Auto-apply CUDA graph mitigations for known GPU issues (A30, A40, L40 with TP>1). Set to 0 or false to disable.

Disabling Auto-Configuration

To take full manual control of vLLM parameters, disable all auto-configuration:

.env
# Disable all auto-configuration
XEROTIER_AGENT_AUTO_CONFIG=0
XEROTIER_AGENT_AUTO_CONFIGURE_GPU=0
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0

# Then set explicit values
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_MODEL_LEN=8192

CUDA Graph Workarounds

Some GPU models (A30, A40, L40) experience CUDA graph capture failures under specific tensor parallelism configurations. The XIM node detects these cases and applies mitigations automatically when XEROTIER_AGENT_AUTO_CUDA_MITIGATION is enabled.

For manual control of CUDA graph behavior:

Variable Default Description
XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS disabled Force disable CUDA graphs (enforce eager execution). Set to 1 or true if experiencing CUDA graph capture failures.
XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE disabled Force disable custom all-reduce optimization. Set to 1 or true for multi-GPU P2P issues on A30/A40 GPUs.

Flow Control

The XIM node implements credit-based flow control for streaming inference to prevent buffer overflow on the router side. Flow control ensures the XIM node does not send chunks faster than the router can forward them to clients.
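The credit mechanism can be modeled in a few lines. This toy model only illustrates the bookkeeping described above; it is not the agent's implementation:

```python
# Toy model of credit-based flow control: the sender spends credits as it
# emits chunks and must wait when the window is exhausted; the router
# replenishes credits as it drains its forwarding buffer.
class CreditWindow:
    def __init__(self, window_bytes: int = 65536):
        self.credits = window_bytes

    def can_send(self, chunk_len: int) -> bool:
        return self.credits >= chunk_len

    def send(self, chunk_len: int) -> None:
        assert self.can_send(chunk_len), "must wait for replenishment"
        self.credits -= chunk_len

    def replenish(self, n: int) -> None:
        # Router grants credits back after forwarding chunks to the client.
        self.credits += n

w = CreditWindow(1024)
w.send(800)
print(w.can_send(800))  # False: only 224 credits remain
w.replenish(800)
print(w.can_send(800))  # True
```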

Variable Default Description
XEROTIER_AGENT_STREAMING_SOCKET_ENABLED true Enable the dedicated streaming DEALER socket. When enabled, inference chunks use a separate socket to avoid head-of-line blocking on the control channel.
XEROTIER_AGENT_FLOW_CONTROL_ENABLED true Enable credit-based flow control for streaming inference. Set to 0 or false to disable backpressure.
XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES 65536 Initial credit window size in bytes per stream. The XIM node can send up to this many bytes before waiting for the router to replenish credits.
XEROTIER_AGENT_FLOW_CONTROL_REPLENISH_THRESHOLD 0.5 Fraction of the window that triggers a credit replenishment from the router (0.0-1.0).
XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES 262144 Total unacknowledged bytes across all streams before the XIM node pauses sending. Acts as a global safety valve.
XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS 30 Seconds to wait for credit recovery before considering the stream stalled.

Defaults are production-tuned: The default flow control settings work well for most deployments. Only adjust these if you observe streaming stalls or excessive backpressure pauses in the XIM node logs.

Lease Configuration

The XIM node maintains a lease with the router mesh via periodic heartbeats. If the router does not receive a renewal within the lease duration window, it marks the XIM node as expired and stops routing requests to it.

Variable Default Description
XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS 10000 How often (in milliseconds) the XIM node sends a lease renewal heartbeat to the router. Default: 10 seconds.
XEROTIER_AGENT_LEASE_DURATION_MS 30000 Requested lease duration in milliseconds. If the router does not receive a renewal within this window, the XIM node is marked expired. Default: 30 seconds.

Keep the ratio safe: The lease duration should be at least 2-3x the renewal interval. A tight margin increases the risk of false lease expirations during transient network issues.
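A quick way to sanity-check your own values against this guidance (a convenience sketch, not part of the agent):

```python
# Check that the lease duration leaves enough margin over the renewal
# interval, per the 2-3x guidance above.
def lease_margin_ok(renewal_ms: int, duration_ms: int,
                    min_ratio: float = 2.0) -> bool:
    return duration_ms >= renewal_ms * min_ratio

print(lease_margin_ok(10000, 30000))  # True: defaults give a 3x margin
print(lease_margin_ok(10000, 12000))  # False: too tight
```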

Model Pull Retry

When the XIM node downloads a model from the router or a remote registry, it uses exponential backoff retries on failure. These settings control retry behavior.

Variable Default Description
XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS 5 Maximum number of retry attempts for a failed model pull before giving up.
XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS 1000 Base delay in milliseconds between retries. Each subsequent retry doubles the delay (exponential backoff).
XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS 15000 Maximum delay in milliseconds between retries. The backoff delay is capped at this value.

Retry Sequence: With defaults, retries occur at approximately 1s, 2s, 4s, 8s, 15s (capped). Increase XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS for unreliable networks.
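The delay schedule follows directly from the three variables above; this sketch reproduces it:

```python
# Capped exponential backoff, matching the defaults in the table
# (delays in milliseconds: base * 2^attempt, capped at the maximum).
def pull_delays(max_attempts: int = 5, base_ms: int = 1000,
                cap_ms: int = 15000) -> list[int]:
    return [min(base_ms * 2 ** i, cap_ms) for i in range(max_attempts)]

print(pull_delays())  # [1000, 2000, 4000, 8000, 15000]
```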

Enrollment Options

Additional options for the enrollment and startup process.

Insecure Enrollment

By default, enrollment requires HTTPS endpoints. For development or trusted network environments, you can allow non-HTTPS enrollment:

bash
# Via CLI flag
xerotier-backend-agent enroll --join-key xjk_abc123... --insecure

# Via environment variable
XEROTIER_AGENT_ALLOW_INSECURE=1 xerotier-backend-agent enroll --join-key xjk_abc123...

Security Risk: Insecure enrollment transmits your join key and agent credentials over an unencrypted connection. Only use this in isolated development environments. Never use --insecure in production.

Initial Model Path

Preload a local model on startup instead of waiting for the router to stream one. This is useful when you have already downloaded a model to disk:

.env
# Load a model from a local path on startup
XEROTIER_AGENT_INITIAL_MODEL_PATH=/data/models/my-model

When set, the XIM node starts vLLM with this model immediately after registration. The model must already exist at the specified path in a format vLLM can load (safetensors or equivalent).

Tenant Cache Isolation

In multi-tenant deployments where a single XIM node serves requests from different tenants, set a server secret to generate per-tenant cache salts. This prevents cross-tenant data leakage in the KV cache:

.env
# Server secret for tenant-isolated cache keys
XEROTIER_AGENT_VLLM_SALT_SECRET=your-secret-string-here

Variable Default Description
XEROTIER_AGENT_VLLM_SALT_SECRET built-in default Server secret for generating per-tenant cache isolation salts. Production deployments should always set a unique secret.
XEROTIER_AGENT_ALLOW_INSECURE disabled Set to 1 or true to allow non-HTTPS enrollment. Development only.
XEROTIER_AGENT_INITIAL_MODEL_PATH - Filesystem path to a local model to preload on startup. The XIM node starts vLLM with this model immediately after registration.
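The idea behind the salt can be illustrated with a standard HMAC derivation. This is a hypothetical sketch of the concept, not the agent's actual key derivation:

```python
import hmac
import hashlib

# Derive a deterministic per-tenant salt from the server secret, so two
# tenants can never produce the same KV-cache keys. Hypothetical sketch.
def tenant_cache_salt(server_secret: str, tenant_id: str) -> str:
    return hmac.new(server_secret.encode(), tenant_id.encode(),
                    hashlib.sha256).hexdigest()

a = tenant_cache_salt("your-secret-string-here", "tenant-a")
b = tenant_cache_salt("your-secret-string-here", "tenant-b")
print(a != b)  # True: distinct tenants get distinct salts
```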

Single-Node Queuing

When a project has exactly one compatible XIM node and that node is at capacity, the router automatically queues incoming requests instead of returning an immediate 503 error. This improves the experience for small deployments running a single XIM node.

How It Works

  1. A request arrives and the router finds exactly one compatible XIM node.
  2. That node is at its concurrent request limit.
  3. Instead of failing, the router parks the request in a waiting queue.
  4. When the node completes an in-flight request and frees a slot, the queued request is dispatched.
  5. If the queue timeout expires before capacity becomes available, the request fails with a 503.

Queue Limits

Parameter Value Description
Max queued per node 10 Maximum number of requests that can wait for a single node to free capacity.
Queue timeout 30 seconds Maximum time a request waits in the queue before receiving a 503 response.

Router-Side Setting: The queue timeout (default: 30 seconds) is configured on the Xerotier router, not on the agent side. Contact support if you need to adjust the timeout for your deployment.

This behavior is transparent to clients. During queuing, the router holds the HTTP connection open. If the request is dispatched successfully, the client receives a normal response. The X-Request-ID response header can be used to trace queued requests in logs.

AMD CPU Deployment (ZenDNN)

Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.

Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.

Hardware Requirements

Component Minimum Recommended
CPU AMD EPYC with AVX-512 AMD EPYC 9454 (Genoa) or newer
System RAM 64GB 96GB+ (scales with model size)
Disk Space 100GB SSD 500GB+ NVMe SSD
CPU Cores 16 cores 24+ cores

Memory Requirements by Model Size

CPU inference requires significantly more system RAM than GPU VRAM:

Model Size System RAM Required
Sub-1B parameters ~32GB
3-4B parameters ~64GB
7-8B parameters ~96GB

Building the Docker Images

Building CPU-optimized vLLM with ZenDNN requires a multi-step process. For detailed instructions, see the AMD EPYC inference guide.

Step 1: Clone vLLM Repository

bash
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0

Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.

Step 2: Build vLLM CPU Base Image

bash
docker build -f docker/Dockerfile.cpu \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu:local \
  --target vllm-openai .

Step 3: Create ZenDNN Dockerfile

Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:

docker/Dockerfile.cpu-amd
# syntax=docker/dockerfile:1
# SPDX-License-Identifier: MIT
# ZenDNN Optimization Layer for vLLM CPU Inference
# Builds on top of the vLLM CPU base image to add AMD ZenDNN support
#
# Prerequisites:
#   Build the vLLM CPU base image first:
#   docker build -f docker/Dockerfile.cpu \
#     --build-arg VLLM_CPU_AVX512BF16=1 \
#     --build-arg VLLM_CPU_AVX512VNNI=1 \
#     --build-arg VLLM_CPU_DISABLE_AVX512=0 \
#     --tag vllm-cpu:local \
#     --target vllm-openai .
#
# Build:
#   docker build -f docker/Dockerfile.cpu-amd --tag vllm-cpu-zentorch:local .

# Set cache stage
FROM ghcr.io/cloudnull/xerotier/xerotier:latest AS xerotier_base
FROM vllm-cpu:local AS vllm_base

# NOTE(cloudnull): Both of these should be kept in sync with the versions used
# in the vLLM repository's submodules for ZenDNN and ZenTorch.
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
        make ninja-build ccache git curl wget ca-certificates \
        gcc-12 g++-12 gfortran \
        libtcmalloc-minimal4 libnuma-dev libjemalloc2 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
        --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --break-system-packages "cmake>=3.25"

WORKDIR /build
RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git
WORKDIR /build/ZenDNN-pytorch-plugin
RUN git checkout ${ZENTORCH_VERSION}

# NOTE(cloudnull): Patch ZenDNN CMake files to use the same version for both
# the main plugin and the ZenDNN submodule. This ensures that the plugin and
# the ZenDNN library it builds against are always in sync, preventing
# potential compatibility issues.
RUN sed -i -E "s/ZENDNN_TAG\s\".+\"/ZENDNN_TAG \"${ZENDNN_VERSION}\"/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnn.cmake
RUN sed -i -E "s/GIT_TAG\s.+/GIT_TAG ${ZENDNN_VERSION}/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnnl.cmake

RUN --mount=type=cache,target=/root/.cache/pip \
    uv pip install -r requirements.txt && \
    uv pip install --extra-index-url https://download.pytorch.org/whl/cpu --force-reinstall torch torchvision torchaudio

# NOTE(cloudnull): This is just a dirty hack to make zentorch 15.2RC work.
# Stub out DLRM EmbeddingBag/QuantEmbedBag ops - their ZenDNN kernel
# implementations are incompatible with the ZenDNN version CMake
# downloads. These are recommender-system ops not needed for LLM
# inference (PagedAttention compiles fine). The source files are kept
# in CMakeLists.txt so that symbols referenced by the dispatch table
# remain defined; the stubs throw at runtime if ever called.
RUN printf '#include <ATen/ATen.h>\n\
#include <torch/library.h>\n\
namespace zentorch {\n\
at::Tensor zentorch_get_packed_embedding_weight(\n\
    at::Tensor& /*w*/, at::Tensor& /*o*/, at::Tensor& /*p*/) {\n\
  TORCH_CHECK(false,\n\
      "zentorch EmbeddingBag ops disabled (not needed for LLM inference)");\n\
  return {};\n\
}\n\
} // namespace zentorch\n' > src/cpu/cpp/EmbeddingBag.cpp \
    && > src/cpu/cpp/QuantEmbedBag.cpp

RUN CC=gcc CXX=g++ python setup.py bdist_wheel

FROM vllm-cpu:local
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09
ARG BUILD_DATE=unknown

# NOTE(cloudnull): Install runtime dependencies
# - libzmq5: ZeroMQ runtime library (required for dynamic linking)
# - ca-certificates: For HTTPS connections to vLLM
# - curl: For health checks
RUN apt-get update && apt-get install -y \
    libzmq5 \
    ca-certificates \
    curl \
    && rm -rf /var/lib/apt/lists/*

COPY --from=vllm_base /build/ZenDNN-pytorch-plugin/dist/*.whl /tmp/
RUN uv pip install /tmp/*.whl lmcache llmcompressor

# NOTE(cloudnull): This is another dirty hack to make zentorch 15.2RC work.
# Wrap _meta_registrations import so zentorch loads even when
# EmbeddingBag ops are stubbed out. Meta registrations are only
# needed for torch.compile tracing, not eager-mode inference.
RUN sed -i 's/^from \._meta_registrations import \*/try:\n from ._meta_registrations import * # noqa\nexcept (AttributeError, ImportError):\n pass # EmbeddingBag ops stubbed out/' \
    /opt/venv/lib/python3.12/site-packages/zentorch/__init__.py

# Create non-root user
RUN useradd -l -r -u 5152 -m inference --home-dir /var/lib/inference
RUN mkdir -p /var/lib/inference/.cache/xerotier \
    && mkdir -p /var/lib/inference/.config/xerotier \
    && chown -R inference:inference /var/lib/inference

# Copy binary from xerotier_base
COPY --from=xerotier_base /usr/local/bin/xerotier-backend-agent /usr/local/bin/

# Copy entrypoint script
COPY --from=xerotier_base /usr/local/bin/entrypoint-agent.sh /usr/local/bin/

# Copy vLLM wrapper that patches huggingface_hub for local model paths
COPY --from=xerotier_base /usr/local/bin/xerotier-vllm /usr/local/bin/xerotier-vllm

# Set ownership and permissions
RUN chown inference:inference /usr/local/bin/xerotier-backend-agent && \
    chown inference:inference /usr/local/bin/entrypoint-agent.sh && \
    chown inference:inference /usr/local/bin/xerotier-vllm && \
    chmod +x /usr/local/bin/entrypoint-agent.sh && \
    chmod +x /usr/local/bin/xerotier-vllm

# Switch to non-root user
USER inference

# Configure environment
ENV HOME=/var/lib/inference
ENV PYTORCH_ALLOC_CONF=expandable_segments:True
ENV XEROTIER_AGENT_VLLM_PATH=/usr/local/bin/xerotier-vllm
ENV VLLM_PLUGINS="zentorch"
ENV LMCACHE_CONFIG_FILE=/var/lib/inference/.config/xerotier/lmcache_config.yaml

LABEL org.opencontainers.image.authors="Cloudnull" \
      org.opencontainers.image.url="https://xerotier.com" \
      org.opencontainers.image.documentation="https://xerotier.com/docs" \
      org.opencontainers.image.source="https://github.com/cloudnull/xerotier-public" \
      org.opencontainers.image.title="Xerotier Inference Agent (vLLM ZenDNN)" \
      org.opencontainers.image.description="Xerotier Inference Agent image optimized for AMD EPYC CPUs using vLLM and ZenDNN. This image includes the necessary runtime dependencies to run the agent with vLLM on AMD hardware. It is designed to be used with the Xerotier Inference Router for distributed model inference across AMD CPU-equipped nodes." \
      org.opencontainers.image.licenses="MIT" \
      org.opencontainers.image.base.name="vllm/vllm-openai-zendnn:latest" \
      org.opencontainers.image.base.vllm_version="$(uv pip show vllm | grep ^Version: | awk '{print \$2}')" \
      org.opencontainers.image.base.zendnn_version="${ZENDNN_VERSION}" \
      org.opencontainers.image.base.zentorch_version="${ZENTORCH_VERSION}" \
      org.opencontainers.image.vendor="Xerotier" \
      org.opencontainers.image.build_type="production" \
      org.opencontainers.image.created="${BUILD_DATE}"

# Entrypoint handles enrollment on first run, then starts the agent
# Override with: docker run ... xerotier-backend-agent enroll --help
ENTRYPOINT ["/usr/local/bin/entrypoint-agent.sh"]

Step 4: Build ZenDNN-Optimized Image

bash
docker build -f docker/Dockerfile.cpu-amd \
  --build-arg VLLM_CPU_AVX512BF16=1 \
  --build-arg VLLM_CPU_AVX512VNNI=1 \
  --build-arg VLLM_CPU_DISABLE_AVX512=0 \
  --tag vllm-cpu-zentorch:local .

Step 5: Build Xerotier.ai XIM Image

From the Xerotier.ai repository root, build the CPU XIM node:

bash
docker build -f deploy/docker/Dockerfile.xim-vllm-zendnn \
  --tag xerotier/backend-agent-cpu:local .

CPU-Specific Environment Variables

Variable Default Description
VLLM_PLUGINS zentorch Enable ZenDNN optimization plugin
VLLM_CPU_KVCACHE_SPACE auto KV cache size in GB (auto: totalRAM - model - overhead)
VLLM_CPU_OMP_THREADS_BIND auto CPU core binding range (auto: 0-{nproc-1})
VLLM_CPU_NUM_OF_RESERVED_CPU auto CPUs reserved for OS operations (auto: 1)

Auto-Tuning: The XIM node automatically computes VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, and VLLM_CPU_NUM_OF_RESERVED_CPU at startup based on system resources. No manual calculation is required. Override via XEROTIER_AGENT_VLLM_ENV if the defaults are unsuitable.
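The KV cache formula referenced above ("totalRAM - model - overhead") can be sketched as follows. The 8 GB overhead figure is an assumption for illustration; the agent's actual heuristics may differ:

```python
# Sketch of KV cache sizing for CPU inference: reserve the model's
# footprint plus a fixed runtime overhead, and give the remainder to the
# KV cache, with a floor of 1 GB.
def kvcache_space_gb(total_ram_gb: int, model_gb: float,
                     overhead_gb: float = 8.0) -> int:
    return max(int(total_ram_gb - model_gb - overhead_gb), 1)

print(kvcache_space_gb(96, 16))  # 72 GB left for the KV cache
```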

Docker Compose for CPU XIM Node

docker-compose.agent-amd-cpu-zendnn.yaml
# SPDX-License-Identifier: MIT
# Xerotier Agent - CPU Inference Stack
#
# Deploys a XIM CPU node for inference without GPU acceleration.
# ZenTorch optimization for AMD EPYC CPUs is available as an opt-in image tag.
#
# IMAGE OPTIONS:
#   AMD ZenTorch optimized:
#     DOCKER_REGISTRY=ghcr.io/cloudnull/xerotier VERSION=latest
#     XEROTIER_CPU_IMAGE=xerotier-xim-amd
#     -> ghcr.io/cloudnull/xerotier-public/xerotier-xim-amd-cpu-zendnn:latest
#
# QUICK START:
#   1. Get a join key from your Xerotier dashboard:
#      Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
#   2. Set your join key:
#      export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
#   3. Create host directories with correct permissions:
#      sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
#      sudo chown -R 5152:5152 /data/xerotier
#
#   4. Start the agent:
#      docker compose -f docker-compose.agent-amd-cpu-zendnn.yaml up -d
#
# ENROLLMENT WORKFLOW:
#   - On first start, the agent enrolls using your join key
#   - Enrollment state is persisted to /data/xerotier/config
#   - On subsequent restarts, the agent reconnects automatically
#   - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
#   XEROTIER_AGENT_JOIN_KEY                 [REQUIRED] Join key from Xerotier dashboard (first run only)
#   XEROTIER_CPU_IMAGE                      Container image name (default: xerotier-xim-amd-cpu-zendnn)
#   XEROTIER_AGENT_MAX_CONCURRENT           Optional ceiling for concurrent requests (auto-configured when not set)
#   XEROTIER_AGENT_LOG_LEVEL                Logging level: trace, debug, info, warning, error (default: info)
#   XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB  Local model cache size in GB (default: 100)
#   SHM_SIZE                                Shared memory size (default: 90g)
#
# CPU TUNING (auto-computed by the agent; override via XEROTIER_AGENT_VLLM_ENV if needed):
#   VLLM_CPU_OMP_THREADS_BIND      CPU thread binding range (auto: 0-{nproc-1})
#   VLLM_CPU_NUM_OF_RESERVED_CPU   Reserved CPUs for system (auto: 1)
#   VLLM_CPU_KVCACHE_SPACE         KV cache memory in GB (auto: totalRAM - model - overhead)
#   OMP_NUM_THREADS                OpenMP threads (auto: physical core count)
#   OMP_PROC_BIND                  Pin threads to cores (auto: TRUE)
#   OMP_WAIT_POLICY                Thread wait strategy (auto: PASSIVE)
#
# KERNEL SOCKET BUFFER TUNING (optional):
#   The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
#   Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
#   which silently caps the requested buffer size. The agent will attempt to
#   raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
#   If the container is not privileged, set these on the host before starting:
#     sudo sysctl -w net.core.wmem_max=4194304
#     sudo sysctl -w net.core.rmem_max=4194304
#
#   To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
#     net.core.wmem_max = 4194304
#     net.core.rmem_max = 4194304

services:
  agent:
    image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-zendnn:${VERSION:-latest}
    container_name: xim-vllm-zendnn
    network_mode: host
    ipc: host
    privileged: true
    shm_size: ${SHM_SIZE:-90g}
    # No command - entrypoint handles enrollment + run automatically
    volumes:
      # Persistent model cache
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      # Persistent enrollment state
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      # Caching
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: ${XEROTIER_AGENT_JOIN_KEY:-}
      # Agent Configuration
      XEROTIER_AGENT_LOG_LEVEL: ${XEROTIER_AGENT_LOG_LEVEL:-info}
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: "true"
      XEROTIER_AGENT_LMCACHE_REDIS_URL: "${XEROTIER_AGENT_LMCACHE_REDIS_URL:-}"
    restart: unless-stopped

Memory Tuning: If the auto-computed VLLM_CPU_KVCACHE_SPACE causes out-of-memory errors, override it to a lower value via XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=<GB>".

Performance Considerations

  • Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
  • Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
  • Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
  • Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.

Troubleshooting

Common issues and their solutions when deploying XIM nodes.

Agent Fails to Start

Symptom Solution
Join key expired Generate a new join key from the Agents dashboard
Connection refused Verify network connectivity to Xerotier router mesh
Invalid join key format Ensure the complete key is provided without truncation

GPU Not Detected

bash
# Verify NVIDIA driver
nvidia-smi

# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Check Docker runtime configuration
docker info | grep -i runtime

Model Loading Fails

Symptom Solution
Out of disk space Increase disk allocation or reduce cache size
Model not found Verify VLLM_MODEL is a valid HuggingFace model ID

Permission Denied Errors

bash
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier

# Verify permissions
ls -la /data/xerotier

Out of Memory (OOM)

  • Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower
  • Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
  • Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
  • Use a smaller model or add more GPUs

LMCache Issues

Symptom Solution
LMCache not enabled (no logs) Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in environment
Redis connection failed Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct
Config write failed Ensure /var/lib/inference/.config/xerotier is writable
High disk usage Set explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit
No cache hits across nodes Verify all nodes use same XEROTIER_AGENT_LMCACHE_REDIS_URL

Check LMCache status in logs:

bash
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache

# Check config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml

# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping

Common Commands

Command Description
docker-compose logs -f agent View agent logs
docker-compose restart agent Restart the agent
docker-compose down Stop all services
nvidia-smi Monitor GPU utilization
docker stats Monitor container resource usage

Frequently Asked Questions

How do I get a join key?

Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.

Can I run multiple models on one GPU?

The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).

How do I update the agent?

Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.

What models are supported?

Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.

How much VRAM do I need?

As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
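These figures follow from a back-of-envelope calculation: fp16/bf16 weights take about 2 bytes per parameter, with the KV cache and runtime overhead on top:

```python
# Rough weight-memory estimate: parameters (in billions) times bytes per
# parameter approximates GB of weights. KV cache and runtime overhead
# add a few GB on top, which is why a 7B model lands near 16GB.
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * bytes_per_param

print(weights_gb(7))   # 14.0 - overhead pushes this toward ~16GB
print(weights_gb(70))  # 140.0 - hence the multi-GPU requirement
```

Quantizing to 4-bit (bytes_per_param ≈ 0.5) shows why GPTQ/AWQ models fit in a fraction of the VRAM.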

Can I use AMD GPUs?

Yes. AMD ROCm GPU support is available. See the AMD ROCm GPU Setup section for the Docker Compose configuration. You can also run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.

Why do I need to build my own Docker image for CPU inference?

CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.

What vLLM version should I use with ZenDNN?

Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.

Is my data secure?

Yes. XIM nodes only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.

What happens if my XIM node goes offline?

Requests are automatically routed to other available XIM nodes. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your node reconnects.

Do I need LMCache?

LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.

Can I use LMCache without Valkey/Redis?

Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-node deployments. Add Valkey when you need cache sharing across multiple nodes.

What happens if LMCache fails to initialize?

The XIM node gracefully degrades - it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check logs for initialization errors if you expected LMCache to be enabled.

How much memory should I allocate for LMCache?

The XIM node auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.