XIM Advanced Configuration
Fine-tune quantization, speculative decoding, flow control, enrollment, and other advanced settings for your Xerotier Inference Microservice.
Quantization
The XIM node supports both automatic and manual quantization to fit large models into limited GPU VRAM. Auto-quantization inspects the model size and available VRAM at startup and selects the best method automatically.
Quantization Override
Quantization is determined by the database value set when the endpoint is created. The agent CLI can override this with an explicit method:
# Force a specific quantization method
XEROTIER_AGENT_VLLM_QUANTIZATION=bitsandbytes
Quantization Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_QUANTIZATION | - | Force a specific quantization method (overrides the database value). Supported values: bitsandbytes, bitsandbytes-fp4, fp8, awq, gptq, and more. |
Pre-Quantized Models: If the model is already quantized (AWQ, GPTQ format on HuggingFace), the XIM node detects this and passes the appropriate flag to vLLM. No additional configuration is needed.
Speculative Decoding
Speculative decoding can improve generation throughput by cheaply proposing several draft tokens and then verifying them in a single forward pass of the target model. The XIM node supports several speculative methods.
Opt-In Required: Speculative decoding is disabled by default. You must set XEROTIER_AGENT_SPECULATIVE_ENABLED=1 to activate it. The method is auto-detected from the model architecture when not explicitly specified.
Supported Methods
| Method | Description |
|---|---|
| deepseek_mtp | Multi-Token Prediction (MTP) for DeepSeek models that natively support it. |
| ngram | N-gram based speculation using prompt history. No draft model required. |
| eagle | EAGLE speculative decoding using an external draft model. |
Speculative Decoding Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_SPECULATIVE_ENABLED | disabled | Set to 1 or true to enable speculative decoding. Required. |
| XEROTIER_AGENT_SPECULATIVE_METHOD | auto | Force a speculative method: deepseek_mtp, ngram, eagle, etc. Auto-detected from the model architecture when not set. |
| XEROTIER_AGENT_SPECULATIVE_TOKENS | method default | Number of speculative tokens per step. Higher values increase throughput at the cost of verification overhead. |
| XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK | disabled | Set to 1 or true to enable n-gram fallback when the primary method is unavailable. |
| XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH | - | Filesystem path to an external draft model. Required for eagle and medusa methods. |
Example: MTP on DeepSeek
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
# Method is auto-detected for DeepSeek models (deepseek_mtp)
# Override if needed:
# XEROTIER_AGENT_SPECULATIVE_METHOD=deepseek_mtp
Example: N-Gram Fallback
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
XEROTIER_AGENT_SPECULATIVE_METHOD=ngram
XEROTIER_AGENT_SPECULATIVE_TOKENS=3
XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK=1
MoE Kernel Tuning
For Mixture-of-Experts (MoE) models, the XIM node can automatically generate and apply optimized kernel tuning configurations. This improves expert dispatch performance on your specific GPU hardware.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MOE_CONFIG_ENABLED | enabled | Enable automatic MoE kernel tuning config generation. Set to 0 or false to disable. |
| XEROTIER_AGENT_MOE_CONFIG_PATH | auto | Custom path for MoE tuned config files. When unset, the XIM node stores configs alongside the model cache. |
Non-MoE Models: These settings have no effect on dense (non-MoE) models. The XIM node detects whether the loaded model uses MoE architecture and only generates configs when applicable.
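For example, tuned configs can be kept on a persistent volume so they survive container rebuilds. The path below is illustrative; substitute your own mount point:

```shell
# Keep automatic MoE kernel tuning enabled (the default)
XEROTIER_AGENT_MOE_CONFIG_ENABLED=1
# Hypothetical shared-volume path for the generated tuning configs
XEROTIER_AGENT_MOE_CONFIG_PATH=/data/xerotier/moe-configs
```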
Auto-Configuration
The XIM node includes a dynamic auto-configuration system that inspects the model architecture, GPU hardware, and available memory at startup to select optimal vLLM parameters. This covers tensor parallelism, quantization, context length, and CUDA graph settings.
Auto-Configuration Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_AUTO_CONFIG | enabled | Master toggle for dynamic auto-configuration. Set to 0 or false to disable all auto-tuning and use only explicit settings. |
| XEROTIER_AGENT_AUTO_CONFIGURE_GPU | enabled | Auto-configure tensor parallelism from the detected GPU count. Set to 0 or false to use the explicit XEROTIER_AGENT_TENSOR_PARALLEL_SIZE value. |
| XEROTIER_AGENT_AUTO_CUDA_MITIGATION | enabled | Auto-apply CUDA graph mitigations for known GPU issues (A30, A40, L40 with TP>1). Set to 0 or false to disable. |
Disabling Auto-Configuration
To take full manual control of vLLM parameters, disable all auto-configuration:
# Disable all auto-configuration
XEROTIER_AGENT_AUTO_CONFIG=0
XEROTIER_AGENT_AUTO_CONFIGURE_GPU=0
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0
# Then set explicit values
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_MODEL_LEN=8192
CUDA Graph Workarounds
Some GPU models (A30, A40, L40) experience CUDA graph capture failures under specific tensor parallelism configurations. The XIM node detects these cases and applies mitigations automatically when XEROTIER_AGENT_AUTO_CUDA_MITIGATION is enabled.
For manual control of CUDA graph behavior:
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS | disabled | Force-disable CUDA graphs (enforce eager execution). Set to 1 or true if experiencing CUDA graph capture failures. |
| XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE | disabled | Force-disable the custom all-reduce optimization. Set to 1 or true for multi-GPU P2P issues on A30/A40 GPUs. |
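If you prefer to manage these mitigations yourself on affected hardware, the overrides can be combined. The combination below is an illustrative sketch for an A30/A40 multi-GPU setup, not a prescribed configuration:

```shell
# Take manual control: turn off automatic mitigation, then apply both
# workarounds explicitly (eager execution + no custom all-reduce)
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0
XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS=1
XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE=1
```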
Flow Control
The XIM node implements credit-based flow control for streaming inference to prevent buffer overflow on the router side. Flow control ensures the XIM node does not send chunks faster than the router can forward them to clients.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_STREAMING_SOCKET_ENABLED | true | Enable the dedicated streaming DEALER socket. When enabled, inference chunks use a separate socket to avoid head-of-line blocking on the control channel. |
| XEROTIER_AGENT_FLOW_CONTROL_ENABLED | true | Enable credit-based flow control for streaming inference. Set to 0 or false to disable backpressure. |
| XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES | 65536 | Initial credit window size in bytes per stream. The XIM node can send up to this many bytes before waiting for the router to replenish credits. |
| XEROTIER_AGENT_FLOW_CONTROL_REPLENISH_THRESHOLD | 0.5 | Fraction of the window that triggers a credit replenishment from the router (0.0-1.0). |
| XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES | 262144 | Total unacknowledged bytes across all streams before the XIM node pauses sending. Acts as a global safety valve. |
| XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS | 30 | Seconds to wait for credit recovery before considering the stream stalled. |
Defaults are production-tuned: The default flow control settings work well for most deployments. Only adjust these if you observe streaming stalls or excessive backpressure pauses in the XIM node logs.
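If you do observe stalls on high-latency links, a larger per-stream window and a longer stall timeout are the usual first adjustments. The values below are illustrative starting points, not recommendations:

```shell
# Quadruple the per-stream credit window for high-latency links
XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES=262144
# Raise the global pause threshold so one large stream cannot trip it alone
XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES=1048576
# Allow more time for credit recovery before declaring a stall
XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS=60
```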
Lease Configuration
The XIM node maintains a lease with the router mesh via periodic heartbeats. If the router does not receive a renewal within the lease duration window, it marks the XIM node as expired and stops routing requests to it.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS | 10000 | How often (in milliseconds) the XIM node sends a lease renewal heartbeat to the router. Default: 10 seconds. |
| XEROTIER_AGENT_LEASE_DURATION_MS | 30000 | Requested lease duration in milliseconds. If the router does not receive a renewal within this window, the XIM node is marked expired. Default: 30 seconds. |
Keep the ratio safe: The lease duration should be at least 2-3x the renewal interval. A tight margin increases the risk of false lease expirations during transient network issues.
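For example, detection time can be shortened while keeping the 3x safety margin intact. These values are illustrative:

```shell
# Faster expiry detection with the same 3x duration-to-renewal ratio
XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS=5000
XEROTIER_AGENT_LEASE_DURATION_MS=15000
```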
Model Pull Retry
When the XIM node downloads a model from the router or a remote registry, it uses exponential backoff retries on failure. These settings control retry behavior.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS | 5 | Maximum number of retry attempts for a failed model pull before giving up. |
| XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS | 1000 | Base delay in milliseconds between retries. Each subsequent retry doubles the delay (exponential backoff). |
| XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS | 15000 | Maximum delay in milliseconds between retries. The backoff delay is capped at this value. |
Retry Sequence: With defaults, retries occur at approximately 1s, 2s, 4s, 8s, 15s (capped). Increase XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS for unreliable networks.
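The retry schedule follows from the three settings; this sketch assumes the delay formula is min(base * 2^(n-1), max), which matches the documented sequence:

```shell
#!/bin/sh
# Reproduce the default backoff schedule: delay_n = min(base * 2^(n-1), cap)
base=1000    # XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS
cap=15000    # XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS
attempts=5   # XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS

schedule=""
delay=$base
n=1
while [ "$n" -le "$attempts" ]; do
  schedule="$schedule $delay"
  delay=$(( delay * 2 ))                            # double each attempt
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi   # cap at the maximum
  n=$(( n + 1 ))
done
echo "delays_ms:$schedule"   # delays_ms: 1000 2000 4000 8000 15000
```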
Enrollment Options
Additional options for the enrollment and startup process.
Insecure Enrollment
By default, enrollment requires HTTPS endpoints. For development or trusted network environments, you can allow non-HTTPS enrollment:
# Via CLI flag
xerotier-backend-agent enroll --join-key xjk_abc123... --insecure
# Via environment variable
XEROTIER_AGENT_ALLOW_INSECURE=1 xerotier-backend-agent enroll --join-key xjk_abc123...
Security Risk: Insecure enrollment transmits your join key and agent credentials over an unencrypted connection. Only use this in isolated development environments. Never use --insecure in production.
Initial Model Path
Preload a local model on startup instead of waiting for the router to stream one. This is useful when you have already downloaded a model to disk:
# Load a model from a local path on startup
XEROTIER_AGENT_INITIAL_MODEL_PATH=/data/models/my-model
When set, the XIM node starts vLLM with this model immediately after registration. The model must already exist at the specified path in a format vLLM can load (safetensors or equivalent).
Tenant Cache Isolation
In multi-tenant deployments where a single XIM node serves requests from different tenants, set a server secret to generate per-tenant cache salts. This prevents cross-tenant data leakage in the KV cache:
# Server secret for tenant-isolated cache keys
XEROTIER_AGENT_VLLM_SALT_SECRET=your-secret-string-here
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_SALT_SECRET | built-in default | Server secret for generating per-tenant cache isolation salts. Production deployments should always set a unique secret. |
| XEROTIER_AGENT_ALLOW_INSECURE | disabled | Set to 1 or true to allow non-HTTPS enrollment. Development only. |
| XEROTIER_AGENT_INITIAL_MODEL_PATH | - | Filesystem path to a local model to preload on startup. The XIM node starts vLLM with this model immediately after registration. |
Single-Node Queuing
When a project has exactly one compatible XIM node and that node is at capacity, the router automatically queues incoming requests instead of returning an immediate 503 error. This improves the experience for small deployments running a single XIM node.
How It Works
- A request arrives and the router finds exactly one compatible XIM node.
- That node is at its concurrent request limit.
- Instead of failing, the router parks the request in a waiting queue.
- When the node completes an in-flight request and frees a slot, the queued request is dispatched.
- If the queue timeout expires before capacity becomes available, the request fails with a 503.
Queue Limits
| Parameter | Value | Description |
|---|---|---|
| Max queued per node | 10 | Maximum number of requests that can wait for a single node to free capacity. |
| Queue timeout | 30 seconds | Maximum time a request waits in the queue before receiving a 503 response. |
Router-Side Setting: The queue timeout (default: 30 seconds) is configured on the Xerotier router, not on the agent side. Contact support if you need to adjust the timeout for your deployment.
This behavior is transparent to clients. During queuing, the router holds the HTTP connection open. If the request is dispatched successfully, the client receives a normal response. The X-Request-ID response header can be used to trace queued requests in logs.
AMD CPU Deployment (ZenDNN)
Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. This requires building custom Docker images locally.
Build Required: Unlike GPU deployment, CPU-based inference requires you to build the Docker images locally. There is no pre-built image available due to the specialized build requirements for AMD CPU optimization.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |
Memory Requirements by Model Size
CPU inference requires significantly more system RAM than GPU VRAM:
| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |
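A rough rule of thumb fitted to the table above is a ~32GB baseline plus ~8GB per billion parameters. This is an illustration of the table's scaling, not an official sizing formula; validate against your workload:

```shell
# Illustrative fit of the sizing table: ram_gb ~ 32 + 8 * params_in_billions
params_b=8                       # e.g. an 8B-parameter model
ram_gb=$(( 32 + 8 * params_b ))
echo "${ram_gb}GB"               # 96GB, matching the 7-8B row above
```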
Building the Docker Images
Building CPU-optimized vLLM with ZenDNN requires a multi-step process. For detailed instructions, see the AMD EPYC inference guide.
Step 1: Clone vLLM Repository
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.11.0
Version Compatibility: ZenTorch requires specific vLLM versions. At the time of writing, v0.11.0 is the recommended version. Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix.
Step 2: Build vLLM CPU Base Image
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu:local \
--target vllm-openai .
Step 3: Create ZenDNN Dockerfile
Create docker/Dockerfile.cpu-amd to add ZenDNN optimization:
# syntax=docker/dockerfile:1
# SPDX-License-Identifier: MIT
# ZenDNN Optimization Layer for vLLM CPU Inference
# Builds on top of the vLLM CPU base image to add AMD ZenDNN support
#
# Prerequisites:
# Build the vLLM CPU base image first:
# docker build -f docker/Dockerfile.cpu \
# --build-arg VLLM_CPU_AVX512BF16=1 \
# --build-arg VLLM_CPU_AVX512VNNI=1 \
# --build-arg VLLM_CPU_DISABLE_AVX512=0 \
# --tag vllm-cpu:local \
# --target vllm-openai .
#
# Build:
# docker build -f docker/Dockerfile.cpu-amd --tag vllm-cpu-zentorch:local .
# Base image stages
FROM ghcr.io/cloudnull/xerotier/xerotier:latest AS xerotier_base
FROM vllm-cpu:local AS vllm_base
# NOTE(cloudnull): Both of these should be kept in sync with the versions used in the
# vLLM repository's submodules for ZenDNN and ZenTorch.
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09
RUN apt-get update -y \
&& apt-get install -y --no-install-recommends \
make ninja-build ccache git curl wget ca-certificates \
gcc-12 g++-12 gfortran \
libtcmalloc-minimal4 libnuma-dev libjemalloc2 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
--slave /usr/bin/g++ g++ /usr/bin/g++-12 \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --break-system-packages "cmake>=3.25"
WORKDIR /build
RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git
WORKDIR /build/ZenDNN-pytorch-plugin
RUN git checkout ${ZENTORCH_VERSION}
# NOTE(cloudnull): Patch ZenDNN CMake files to use the same version for both the main plugin and the ZenDNN submodule.
# This ensures that the plugin and the ZenDNN library it builds against are always in sync, preventing
# potential compatibility issues.
RUN sed -i -E "s/ZENDNN_TAG\s\".+\"/ZENDNN_TAG \"${ZENDNN_VERSION}\"/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnn.cmake
RUN sed -i -E "s/GIT_TAG\s.+/GIT_TAG ${ZENDNN_VERSION}/" /build/ZenDNN-pytorch-plugin/cmake/modules/zendnnl.cmake
RUN --mount=type=cache,target=/root/.cache/pip \
uv pip install -r requirements.txt && \
uv pip install --extra-index-url https://download.pytorch.org/whl/cpu --force-reinstall torch torchvision torchaudio
# NOTE(cloudnull): This is just a dirty hack to make zentorch 15.2RC work.
# Stub out DLRM EmbeddingBag/QuantEmbedBag ops - their ZenDNN kernel
# implementations are incompatible with the ZenDNN version CMake
# downloads. These are recommender-system ops not needed for LLM
# inference (PagedAttention compiles fine). The source files are kept
# in CMakeLists.txt so that symbols referenced by the dispatch table
# remain defined; the stubs throw at runtime if ever called.
RUN printf '#include <ATen/ATen.h>\n\
#include <torch/library.h>\n\
namespace zentorch {\n\
at::Tensor zentorch_get_packed_embedding_weight(\n\
at::Tensor& /*w*/, at::Tensor& /*o*/, at::Tensor& /*p*/) {\n\
TORCH_CHECK(false,\n\
"zentorch EmbeddingBag ops disabled (not needed for LLM inference)");\n\
return {};\n\
}\n\
} // namespace zentorch\n' > src/cpu/cpp/EmbeddingBag.cpp \
&& > src/cpu/cpp/QuantEmbedBag.cpp
RUN CC=gcc CXX=g++ python setup.py bdist_wheel
FROM vllm-cpu:local
ARG ZENDNN_VERSION=zendnn-2026-WW09
ARG ZENTORCH_VERSION=zentorch-2026-WW09
ARG BUILD_DATE=unknown
ARG VLLM_VERSION=unknown
# NOTE(cloudnull): Install runtime dependencies
# - libzmq5: ZeroMQ runtime library (required for dynamic linking)
# - ca-certificates: For HTTPS connections to vLLM
# - curl: For health checks
RUN apt-get update && apt-get install -y \
libzmq5 \
ca-certificates \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY --from=vllm_base /build/ZenDNN-pytorch-plugin/dist/*.whl /tmp/
RUN uv pip install /tmp/*.whl lmcache llmcompressor
# NOTE(cloudnull): This is another dirty hack to make zentorch 15.2RC work.
# Wrap _meta_registrations import so zentorch loads even when
# EmbeddingBag ops are stubbed out. Meta registrations are only
# needed for torch.compile tracing, not eager-mode inference.
RUN sed -i 's/^from \._meta_registrations import \*/try:\n from ._meta_registrations import * # noqa\nexcept (AttributeError, ImportError):\n pass # EmbeddingBag ops stubbed out/' \
/opt/venv/lib/python3.12/site-packages/zentorch/__init__.py
# Create non-root user
RUN useradd -l -r -u 5152 -m inference --home-dir /var/lib/inference
RUN mkdir -p /var/lib/inference/.cache/xerotier && mkdir -p /var/lib/inference/.config/xerotier && chown -R inference:inference /var/lib/inference
# Copy binary from xerotier_base
COPY --from=xerotier_base /usr/local/bin/xerotier-backend-agent /usr/local/bin/
# Copy entrypoint script
COPY --from=xerotier_base /usr/local/bin/entrypoint-agent.sh /usr/local/bin/
# Copy vLLM wrapper that patches huggingface_hub for local model paths
COPY --from=xerotier_base /usr/local/bin/xerotier-vllm /usr/local/bin/xerotier-vllm
# Set ownership and permissions
RUN chown inference:inference /usr/local/bin/xerotier-backend-agent && \
chown inference:inference /usr/local/bin/entrypoint-agent.sh && \
chown inference:inference /usr/local/bin/xerotier-vllm && \
chmod +x /usr/local/bin/entrypoint-agent.sh && \
chmod +x /usr/local/bin/xerotier-vllm
# Switch to non-root user
USER inference
# Configure environment
ENV HOME=/var/lib/inference
ENV PYTORCH_ALLOC_CONF=expandable_segments:True
ENV XEROTIER_AGENT_VLLM_PATH=/usr/local/bin/xerotier-vllm
ENV VLLM_PLUGINS="zentorch"
ENV LMCACHE_CONFIG_FILE=/var/lib/inference/.config/xerotier/lmcache_config.yaml
LABEL org.opencontainers.image.authors="Cloudnull" \
org.opencontainers.image.url="https://xerotier.com" \
org.opencontainers.image.documentation="https://xerotier.com/docs" \
org.opencontainers.image.source="https://github.com/cloudnull/xerotier-public" \
org.opencontainers.image.title="Xerotier Inference Agent (vLLM ZenDNN)" \
org.opencontainers.image.description="Xerotier Inference Agent image optimized for AMD EPYC CPUs using vLLM and ZenDNN. This image includes the necessary runtime dependencies to run the agent with vLLM on AMD CPU hardware. It is designed to be used with the Xerotier Inference Router for distributed model inference across AMD CPU-equipped nodes." \
org.opencontainers.image.licenses="MIT" \
org.opencontainers.image.base.name="vllm-cpu:local" \
org.opencontainers.image.base.vllm_version="${VLLM_VERSION:-unknown}" \
org.opencontainers.image.base.zendnn_version="${ZENDNN_VERSION}" \
org.opencontainers.image.base.zentorch_version="${ZENTORCH_VERSION}" \
org.opencontainers.image.vendor="Xerotier" \
org.opencontainers.image.build_type="production" \
org.opencontainers.image.created="${BUILD_DATE}"
# Entrypoint handles enrollment on first run, then starts the agent
# Override with: docker run ... xerotier-backend-agent enroll --help
ENTRYPOINT ["/usr/local/bin/entrypoint-agent.sh"]
Step 4: Build ZenDNN-Optimized Image
docker build -f docker/Dockerfile.cpu-amd \
--build-arg VLLM_CPU_AVX512BF16=1 \
--build-arg VLLM_CPU_AVX512VNNI=1 \
--build-arg VLLM_CPU_DISABLE_AVX512=0 \
--tag vllm-cpu-zentorch:local .
Step 5: Build Xerotier.ai XIM Image
From the Xerotier.ai repository root, build the CPU XIM node:
docker build -f deploy/docker/Dockerfile.xim-vllm-zendnn \
--tag xerotier/backend-agent-cpu:local .
CPU-Specific Environment Variables
| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable the ZenDNN optimization plugin. |
| VLLM_CPU_KVCACHE_SPACE | auto | KV cache size in GB (auto: total RAM - model - overhead). |
| VLLM_CPU_OMP_THREADS_BIND | auto | CPU core binding range (auto: 0-{nproc-1}). |
| VLLM_CPU_NUM_OF_RESERVED_CPU | auto | CPUs reserved for OS operations (auto: 1). |
Auto-Tuning: The XIM node automatically computes VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, and VLLM_CPU_NUM_OF_RESERVED_CPU at startup based on system resources. No manual calculation is required. Override via XEROTIER_AGENT_VLLM_ENV if the defaults are unsuitable.
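When an auto-computed value does need adjusting, override it via XEROTIER_AGENT_VLLM_ENV. A single KEY=VALUE assignment is the form shown elsewhere in this guide; the value 40 below is illustrative:

```shell
# Pin the KV cache to 40GB instead of the auto-computed value
XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=40"
```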
Docker Compose for CPU XIM Node
# SPDX-License-Identifier: MIT
# Xerotier Agent - CPU Inference Stack
#
# Deploys a XIM CPU node for inference without GPU acceleration.
# ZenTorch optimization for AMD EPYC CPUs is available as an opt-in image tag.
#
# IMAGE OPTIONS:
# AMD ZenTorch optimized:
# DOCKER_REGISTRY=ghcr.io/cloudnull/xerotier VERSION=latest
# XEROTIER_CPU_IMAGE=xerotier-xim-amd
# -> ghcr.io/cloudnull/xerotier-public/xerotier-xim-amd-cpu-zendnn:latest
#
# QUICK START:
# 1. Get a join key from your Xerotier dashboard:
# Dashboard -> Infrastructure -> Agents -> Generate Join Key
#
# 2. Set your join key:
# export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
#
# 3. Create host directories with correct permissions:
# sudo mkdir -p /data/xerotier/models /data/xerotier/config /data/xerotier/lmcache
# sudo chown -R 5152:5152 /data/xerotier
#
# 4. Start the agent:
# docker compose -f docker-compose.agent-amd-cpu-zendnn.yaml up -d
#
# ENROLLMENT WORKFLOW:
# - On first start, the agent enrolls using your join key
# - Enrollment state is persisted to /data/xerotier/config
# - On subsequent restarts, the agent reconnects automatically
# - You can remove XEROTIER_AGENT_JOIN_KEY after successful enrollment
#
# ENVIRONMENT VARIABLES:
# XEROTIER_AGENT_JOIN_KEY [REQUIRED] Join key from Xerotier dashboard (first run only)
# XEROTIER_CPU_IMAGE Container image name (default: xerotier-xim-amd-cpu-zendnn)
# XEROTIER_AGENT_MAX_CONCURRENT Optional ceiling for concurrent requests (auto-configured when not set)
# XEROTIER_AGENT_LOG_LEVEL Logging level: trace, debug, info, warning, error (default: info)
# XEROTIER_AGENT_MODEL_CACHE_MAX_SIZE_GB Local model cache size in GB (default: 100)
# SHM_SIZE Shared memory size (default: 90g)
#
# CPU TUNING (auto-computed by the agent; override via XEROTIER_AGENT_VLLM_ENV if needed):
# VLLM_CPU_OMP_THREADS_BIND CPU thread binding range (auto: 0-{nproc-1})
# VLLM_CPU_NUM_OF_RESERVED_CPU Reserved CPUs for system (auto: 1)
# VLLM_CPU_KVCACHE_SPACE KV cache memory in GB (auto: totalRAM - model - overhead)
# OMP_NUM_THREADS OpenMP threads (auto: physical core count)
# OMP_PROC_BIND Pin threads to cores (auto: TRUE)
# OMP_WAIT_POLICY Thread wait strategy (auto: PASSIVE)
#
# KERNEL SOCKET BUFFER TUNING (optional):
# The agent sets ZeroMQ socket buffers to 4 MiB for streaming throughput.
# Linux defaults net.core.wmem_max and net.core.rmem_max to 212992 bytes,
# which silently caps the requested buffer size. The agent will attempt to
# raise these limits on startup (requires privileged mode or CAP_SYS_ADMIN).
#
# If the container is not privileged, set these on the host before starting:
# sudo sysctl -w net.core.wmem_max=4194304
# sudo sysctl -w net.core.rmem_max=4194304
#
# To persist across reboots, add to /etc/sysctl.d/99-xerotier.conf:
# net.core.wmem_max = 4194304
# net.core.rmem_max = 4194304
services:
  agent:
    image: ${DOCKER_REGISTRY:-ghcr.io/cloudnull/xerotier}-public/xim-vllm-zendnn:${VERSION:-latest}
    container_name: xim-vllm-zendnn
    network_mode: host
    ipc: host
    privileged: true
    shm_size: ${SHM_SIZE:-90g}
    # No command - entrypoint handles enrollment + run automatically
    volumes:
      # Persistent model cache
      - /data/xerotier/models:/var/lib/inference/.cache/xerotier/models
      # Persistent enrollment state
      - /data/xerotier/config:/var/lib/inference/.config/xerotier
      # Caching
      - /data/xerotier/lmcache:/var/lib/inference/.cache/lmcache
    environment:
      # Agent Enrollment [REQUIRED for first run]
      XEROTIER_AGENT_JOIN_KEY: ${XEROTIER_AGENT_JOIN_KEY:-}
      # Agent Configuration
      XEROTIER_AGENT_LOG_LEVEL: ${XEROTIER_AGENT_LOG_LEVEL:-info}
      # LMCache Configuration
      XEROTIER_AGENT_LMCACHE_ENABLED: "true"
      XEROTIER_AGENT_LMCACHE_REDIS_URL: "${XEROTIER_AGENT_LMCACHE_REDIS_URL:-}"
    restart: unless-stopped
Memory Tuning: If the auto-computed VLLM_CPU_KVCACHE_SPACE causes out-of-memory errors, override it to a lower value via XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=<GB>".
Performance Considerations
- Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
- Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
- Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
- Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.
Troubleshooting
Common issues and their solutions when deploying XIM nodes.
Agent Fails to Start
| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |
GPU Not Detected
# Verify NVIDIA driver
nvidia-smi
# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check Docker runtime configuration
docker info | grep -i runtime
Model Loading Fails
| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify VLLM_MODEL is a valid HuggingFace model ID |
Permission Denied Errors
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier
# Verify permissions
ls -la /data/xerotier
Out of Memory (OOM)
- Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower
- Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
- Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
- Use a smaller model or add more GPUs
LMCache Issues
| Symptom | Solution |
|---|---|
| LMCache not enabled (no logs) | Verify XEROTIER_AGENT_LMCACHE_ENABLED=true is set in environment |
| Redis connection failed | Check Valkey is running and XEROTIER_AGENT_LMCACHE_REDIS_URL is correct |
| Config write failed | Ensure /var/lib/inference/.config/xerotier is writable |
| High disk usage | Set explicit XEROTIER_AGENT_LMCACHE_DISK_SIZE_GB limit |
| No cache hits across nodes | Verify all nodes use same XEROTIER_AGENT_LMCACHE_REDIS_URL |
Check LMCache status in logs:
# Verify LMCache initialization
docker-compose logs agent | grep -i lmcache
# Check config file was created
docker-compose exec agent cat /var/lib/inference/.config/xerotier/lmcache_config.yaml
# Test Valkey connectivity
docker-compose exec valkey valkey-cli ping
Common Commands
| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |
Frequently Asked Questions
How do I get a join key?
Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.
Can I run multiple models on one GPU?
The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
How do I update the agent?
Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.
What models are supported?
Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.
How much VRAM do I need?
As a rough guide: 7B models need ~16GB, 13B models need ~32GB, 70B models need ~140GB (multiple GPUs). Quantized models (GPTQ, AWQ) reduce requirements significantly.
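The ~16GB figure for a 7B model follows from fp16/bf16 weights at 2 bytes per parameter plus runtime overhead. A hedged back-of-the-envelope check (the ~15% overhead factor is illustrative):

```shell
# fp16/bf16 weights: 2 bytes per parameter, plus ~15% runtime overhead
params_b=7
vram_gb=$(( params_b * 2 * 115 / 100 ))
echo "${vram_gb}GB"   # 16GB
```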
Can I use AMD GPUs?
Yes. AMD ROCm GPU support is available. See the AMD ROCm GPU Setup section for the Docker Compose configuration. You can also run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.
Why do I need to build my own Docker image for CPU inference?
CPU-optimized vLLM with ZenDNN requires specific build flags (AVX-512BF16, AVX-512VNNI) that must match your CPU architecture. Pre-built images cannot provide these optimizations for all CPU variants.
What vLLM version should I use with ZenDNN?
Check the ZenDNN-pytorch-plugin repository for the latest compatibility matrix. At the time of writing, vLLM v0.11.0 is recommended. Version mismatches cause plugin loading failures.
Is my data secure?
Yes. XIM nodes only receive requests from your project. All connections use CURVE encryption (ZMQ). Your inference data never leaves your infrastructure.
What happens if my XIM node goes offline?
Requests are automatically routed to other available XIM nodes. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your node reconnects.
Do I need LMCache?
LMCache is optional but recommended for production deployments. It significantly reduces TTFT for repeated prompt prefixes (e.g., system prompts, few-shot examples). If your workload has many unique prompts with no shared prefixes, the benefit is reduced.
Can I use LMCache without Valkey/Redis?
Yes. Set XEROTIER_AGENT_LMCACHE_ENABLED=true without XEROTIER_AGENT_LMCACHE_REDIS_URL to use only local CPU memory and disk caching. This works well for single-node deployments. Add Valkey when you need cache sharing across multiple nodes.
What happens if LMCache fails to initialize?
The XIM node gracefully degrades - it logs a warning and continues without KV cache sharing. Inference still works, just without the TTFT optimization. Check logs for initialization errors if you expected LMCache to be enabled.
How much memory should I allocate for LMCache?
The XIM node auto-calculates 10% of system resources by default, which works for most deployments. For high-traffic systems, consider 10-20% of RAM for CPU cache and 50-100GB for disk cache. Monitor eviction rates to tune sizing.