XIM Advanced
Knobs for the operators who need them. Quantization, speculative decoding, flow control, lease windows, vLLM startup timeouts, and per-tenant cache salts. Touch these when a symptom forces your hand.
Quantization
The XIM node supports both automatic and manual quantization to fit large models into limited GPU VRAM. Auto-quantization inspects the model size and available VRAM at startup and selects the best method automatically.
Quantization Override
Quantization is determined by the database value set when the endpoint is created. The agent can override this with an explicit method, set through an environment variable:
# Force a specific quantization method
XEROTIER_AGENT_VLLM_QUANTIZATION=bitsandbytes
Quantization Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_QUANTIZATION | - | Force a specific quantization method (overrides database value). Supported values: bitsandbytes, bitsandbytes-fp4, fp8, awq, gptq, and more... |
Pre-Quantized Models: If the model is already quantized (AWQ, GPTQ format on HuggingFace), the XIM node detects this and passes the appropriate flag to vLLM. No additional configuration is needed.
Speculative Decoding
Speculative decoding can improve generation throughput by drafting several tokens ahead and then verifying them in a single forward pass of the target model. The XIM node supports several speculative methods.
Opt-In Required: Speculative decoding is disabled by default. You must set XEROTIER_AGENT_SPECULATIVE_ENABLED=1 to activate it. The method is auto-detected from the model architecture when not explicitly specified.
Supported Methods
| Method | Draft Model | Description |
|---|---|---|
| deepseek_mtp | not required | Native Multi-Token Prediction for DeepSeek V3/R1 models that ship MTP layers. |
| qwen3_next_mtp | not required | Native Multi-Token Prediction for Qwen3-Next models. |
| mtp | not required | Native Multi-Token Prediction for GLM-4.5 MoE models. |
| mimo_mtp | not required | Native Multi-Token Prediction for MIMO models. |
| ngram | not required | N-gram based speculation using prompt history. Default tokens per step: 5. |
| suffix | not required | Suffix-array based speculation. Default tokens per step: 16. |
| eagle | required | EAGLE speculative decoding using an external draft model. Default tokens per step: 3. |
| medusa | required | Medusa speculative decoding using an external draft model. Default tokens per step: 3. |
| draft_model | required | Generic draft-model speculative decoding. Default tokens per step: 5. |
| mlp_speculator | required | MLP-speculator based speculative decoding. Default tokens per step: 3. |
Speculative Decoding Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_SPECULATIVE_ENABLED | disabled | Set to 1 or true to enable speculative decoding. Required. |
| XEROTIER_AGENT_SPECULATIVE_METHOD | auto | Force a speculative method. Supported values: deepseek_mtp, qwen3_next_mtp, mtp, mimo_mtp, ngram, suffix, eagle, medusa, draft_model, mlp_speculator. Auto-detected from model architecture when not set. |
| XEROTIER_AGENT_SPECULATIVE_TOKENS | method default | Number of speculative tokens per step. Higher values increase throughput at the cost of verification overhead. Per-method defaults: deepseek_mtp, mtp, mimo_mtp = 1; qwen3_next_mtp = 2; eagle, medusa, mlp_speculator = 3; ngram, draft_model = 5; suffix = 16. |
| XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK | disabled | Set to 1 or true to enable n-gram fallback when the primary method is unavailable. |
| XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH | - | Filesystem path to an external draft model. Required for the methods listed as needing a draft model under Supported Methods (eagle, medusa, draft_model, mlp_speculator). |
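The interaction between these variables can be sketched in a few lines of Python. This is an illustration only, not the agent's implementation: the function name resolve_speculative and the detected_method parameter (a stand-in for the node's architecture auto-detection) are hypothetical, while the per-method defaults and draft-model requirements are copied from the tables in this section.

```python
# Per-method token defaults and draft-model requirements, copied from the tables.
DEFAULT_TOKENS = {
    "deepseek_mtp": 1, "mtp": 1, "mimo_mtp": 1,
    "qwen3_next_mtp": 2,
    "eagle": 3, "medusa": 3, "mlp_speculator": 3,
    "ngram": 5, "draft_model": 5,
    "suffix": 16,
}
NEEDS_DRAFT_MODEL = {"eagle", "medusa", "draft_model", "mlp_speculator"}


def resolve_speculative(env, detected_method=None):
    """Return (method, tokens, draft_path), or None when speculation is off.

    `env` is a plain dict of environment variables. `detected_method` is a
    hypothetical stand-in for the node's architecture auto-detection.
    """
    if env.get("XEROTIER_AGENT_SPECULATIVE_ENABLED") not in ("1", "true"):
        return None  # opt-in was not given
    method = env.get("XEROTIER_AGENT_SPECULATIVE_METHOD") or detected_method
    if method not in DEFAULT_TOKENS:
        raise ValueError(f"unknown or undetected speculative method: {method!r}")
    draft_path = env.get("XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH")
    if method in NEEDS_DRAFT_MODEL and not draft_path:
        raise ValueError(f"{method} needs XEROTIER_AGENT_SPECULATIVE_DRAFT_MODEL_PATH")
    # An explicit token count wins; otherwise fall back to the method default.
    tokens = int(env.get("XEROTIER_AGENT_SPECULATIVE_TOKENS", DEFAULT_TOKENS[method]))
    return method, tokens, draft_path
```

Running a planned environment through a check like this before deployment can catch a missing draft-model path or a misspelled method name early.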
Example: MTP on DeepSeek
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
# Method is auto-detected for DeepSeek V3/R1 (deepseek_mtp)
# Override if needed:
# XEROTIER_AGENT_SPECULATIVE_METHOD=deepseek_mtp
Qwen3 carve-out: Once speculative decoding is enabled, native MTP is auto-detected for the DeepSeek, GLM-4.5 MoE, and MIMO families. The Qwen3 family (including Qwen3.5/3.6 MoE) is intentionally excluded from this auto-detection; to use native MTP on Qwen3-Next, set XEROTIER_AGENT_SPECULATIVE_ENABLED=1 and select the method explicitly with XEROTIER_AGENT_SPECULATIVE_METHOD=qwen3_next_mtp.
Example: N-Gram Fallback
XEROTIER_AGENT_SPECULATIVE_ENABLED=1
XEROTIER_AGENT_SPECULATIVE_METHOD=ngram
XEROTIER_AGENT_SPECULATIVE_TOKENS=3
XEROTIER_AGENT_SPECULATIVE_NGRAM_FALLBACK=1
MoE Kernel Tuning
For Mixture-of-Experts (MoE) models, the XIM node can automatically generate and apply optimized kernel tuning configurations. This improves expert dispatch performance on your specific GPU hardware.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MOE_CONFIG_ENABLED | enabled | Enable automatic MoE kernel tuning config generation. Set to 0 or false to disable. |
| XEROTIER_AGENT_MOE_CONFIG_PATH | auto | Custom path for MoE tuned config files. When unset, the XIM node stores configs alongside the model cache. |
Non-MoE Models: These settings have no effect on dense (non-MoE) models. The XIM node detects whether the loaded model uses MoE architecture and only generates configs when applicable.
Auto-Configuration
The XIM node includes a dynamic auto-configuration system that inspects the model architecture, GPU hardware, and available memory at startup to select optimal vLLM parameters. This covers tensor parallelism, quantization, context length, and CUDA graph settings.
Auto-Configuration Environment Variables
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_AUTO_CONFIG | enabled | Master toggle for dynamic auto-configuration. Set to 0 or false to disable all auto-tuning and use only explicit settings. |
| XEROTIER_AGENT_AUTO_CONFIGURE_GPU | enabled | Auto-configure tensor parallelism from detected GPU count. Set to 0 or false to use the explicit XEROTIER_AGENT_TENSOR_PARALLEL_SIZE value. |
| XEROTIER_AGENT_AUTO_CUDA_MITIGATION | enabled | Auto-apply CUDA graph mitigations for known GPU issues (A30, A40, L40 with TP>1). Set to 0 or false to disable. |
Disabling Auto-Configuration
To take full manual control of vLLM parameters, disable all auto-configuration:
# Disable all auto-configuration
XEROTIER_AGENT_AUTO_CONFIG=0
XEROTIER_AGENT_AUTO_CONFIGURE_GPU=0
XEROTIER_AGENT_AUTO_CUDA_MITIGATION=0
# Then set explicit values (the default for GPU_MEMORY_UTILIZATION is 0.95;
# the value below is an example override for memory-constrained deployments).
XEROTIER_AGENT_TENSOR_PARALLEL_SIZE=2
XEROTIER_AGENT_GPU_MEMORY_UTILIZATION=0.85
XEROTIER_AGENT_MAX_MODEL_LEN=8192
Boolean parsing is case-sensitive: The XEROTIER_AGENT_AUTO_CONFIG, XEROTIER_AGENT_AUTO_CONFIGURE_GPU, XEROTIER_AGENT_AUTO_CUDA_MITIGATION, XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS, and XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE env vars compare against the literal strings 0, 1, false, and true. Values such as True, TRUE, FALSE, off, or yes are silently ignored. Use the exact lower-case form.
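The exact-match rule is easy to get wrong, so here is a minimal Python sketch of it. The helper name parse_bool_env is hypothetical, and treating "silently ignored" as "falls back to the default" is an assumption about the agent's behavior rather than something taken from its source.

```python
import os

_TRUE = {"1", "true"}
_FALSE = {"0", "false"}


def parse_bool_env(name, default, environ=os.environ):
    """Exact-match boolean parsing; any other spelling leaves the default in place."""
    raw = environ.get(name)
    if raw in _TRUE:
        return True
    if raw in _FALSE:
        return False
    # Unset, or an unrecognized spelling such as "TRUE", "False", "off", "yes".
    return default


# "FALSE" is not recognized, so an enabled-by-default toggle stays enabled.
env = {"XEROTIER_AGENT_AUTO_CONFIG": "FALSE"}
print(parse_bool_env("XEROTIER_AGENT_AUTO_CONFIG", True, env))  # True
```

The practical consequence: an operator who writes XEROTIER_AGENT_AUTO_CONFIG=FALSE may believe auto-configuration is off while it is still running.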
CUDA Graph Workarounds
Some GPU models (A30, A40, L40) experience CUDA graph capture failures under specific tensor parallelism configurations. The XIM node detects these cases and applies mitigations automatically when XEROTIER_AGENT_AUTO_CUDA_MITIGATION is enabled.
For manual control of CUDA graph behavior:
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_DISABLE_CUDA_GRAPHS | disabled | Force disable CUDA graphs (enforce eager execution). Set to 1 or true if experiencing CUDA graph capture failures. |
| XEROTIER_AGENT_VLLM_DISABLE_CUSTOM_ALL_REDUCE | disabled | Force disable custom all-reduce optimization. Set to 1 or true for multi-GPU P2P issues on A30/A40 GPUs. |
Flow Control
The XIM node implements credit-based flow control for streaming inference to prevent buffer overflow on the router side. Flow control ensures the XIM node does not send chunks faster than the router can forward them to clients.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_STREAMING_SOCKET_ENABLED | true | Enable the dedicated streaming DEALER socket. When enabled, inference chunks use a separate socket to avoid head-of-line blocking on the control channel. |
| XEROTIER_AGENT_FLOW_CONTROL_ENABLED | true | Enable credit-based flow control for streaming inference. Set to 0 or false to disable backpressure. |
| XEROTIER_AGENT_FLOW_CONTROL_WINDOW_BYTES | 65536 | Initial credit window size in bytes per stream. The XIM node can send up to this many bytes before waiting for the router to replenish credits. |
| XEROTIER_AGENT_FLOW_CONTROL_REPLENISH_THRESHOLD | 0.5 | Fraction of the window that triggers a credit replenishment from the router (0.0-1.0). |
| XEROTIER_AGENT_FLOW_CONTROL_PAUSE_THRESHOLD_BYTES | 262144 | Total unacknowledged bytes across all streams before the XIM node pauses sending. Acts as a global safety valve. |
| XEROTIER_AGENT_FLOW_CONTROL_TIMEOUT_SECONDS | 30 | Seconds to wait for credit recovery before considering the stream stalled. |
When to adjust: Touch these only if the XIM node logs show streaming stalls or excessive backpressure pauses.
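To make the credit mechanics concrete, the following Python sketch models one stream's window and the router's replenishment trigger using the defaults from the table. It is a simplified model, not the wire protocol: the class and function names are hypothetical, and the exact grant size and the >= comparisons are assumptions.

```python
class StreamCredits:
    """Sender-side (XIM node) view of one stream's credit window."""

    def __init__(self, window_bytes=65536):
        self.window_bytes = window_bytes
        self.credits = window_bytes

    def try_send(self, chunk_len):
        """Spend credits for a chunk; False means the sender must wait for a grant."""
        if chunk_len > self.credits:
            return False
        self.credits -= chunk_len
        return True

    def grant(self, nbytes):
        """Apply a replenishment from the router, never exceeding the window."""
        self.credits = min(self.window_bytes, self.credits + nbytes)


class RouterSide:
    """Router-side bookkeeping: grant credits back once enough bytes were forwarded."""

    def __init__(self, window_bytes=65536, replenish_threshold=0.5):
        self.trigger = int(window_bytes * replenish_threshold)
        self.forwarded = 0

    def on_forwarded(self, nbytes):
        """Return a grant size once the threshold is crossed, else 0."""
        self.forwarded += nbytes
        if self.forwarded >= self.trigger:
            grant, self.forwarded = self.forwarded, 0
            return grant
        return 0


def should_pause(unacked_by_stream, pause_threshold_bytes=262144):
    """Global safety valve: pause when total unacknowledged bytes hit the threshold."""
    return sum(unacked_by_stream.values()) >= pause_threshold_bytes
```

With the defaults, a stream can have four 16 KiB chunks in flight before it blocks, and the router hands credits back after forwarding 32 KiB. Four fully loaded streams are enough to reach the 262144-byte global pause threshold.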
Lease Configuration
The XIM node maintains a lease with the router mesh via periodic heartbeats. If the router does not receive a renewal within the lease duration window, it marks the XIM node as expired and stops routing requests to it.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_LEASE_RENEWAL_INTERVAL_MS | 10000 | How often (in milliseconds) the XIM node sends a lease renewal heartbeat to the router. Default: 10 seconds. |
| XEROTIER_AGENT_LEASE_DURATION_MS | 30000 | Requested lease duration in milliseconds. If the router does not receive a renewal within this window, the XIM node is marked expired. Default: 30 seconds. |
Keep the ratio safe: The lease duration should be at least 2-3x the renewal interval. A tight margin increases the risk of false lease expirations during transient network issues.
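A quick way to sanity-check a planned pair of values is to compute the ratio and count how many heartbeats fit inside one lease window. The Python helper below is a hypothetical pre-deployment check, not part of the agent; it ignores network latency and uses the lower end of the 2-3x guidance as its default floor.

```python
import math


def check_lease_ratio(renewal_interval_ms=10_000, lease_duration_ms=30_000, min_ratio=2.0):
    """Return (ratio, renewals_in_window, ok) for a lease configuration.

    renewals_in_window counts heartbeats sent strictly inside one lease window
    after the last successful renewal, ignoring network latency.
    """
    if renewal_interval_ms <= 0 or lease_duration_ms <= 0:
        raise ValueError("intervals must be positive")
    ratio = lease_duration_ms / renewal_interval_ms
    renewals_in_window = math.ceil(ratio) - 1
    return ratio, renewals_in_window, ratio >= min_ratio


print(check_lease_ratio())  # (3.0, 2, True)
```

With the defaults, two heartbeats fall inside each lease window, so one can be lost and the next still arrives in time. A 15-second interval against a 20-second lease leaves only one, so a single dropped heartbeat expires the lease.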
Model Pull Retry
When the XIM node downloads a model from the router or a remote registry, it uses exponential backoff retries on failure. These settings control retry behavior.
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS | 5 | Maximum number of retry attempts for a failed model pull before giving up. |
| XEROTIER_AGENT_MODEL_PULL_BASE_DELAY_MS | 1000 | Base delay in milliseconds between retries. Each subsequent retry doubles the delay (exponential backoff). |
| XEROTIER_AGENT_MODEL_PULL_MAX_DELAY_MS | 15000 | Maximum delay in milliseconds between retries. The backoff delay is capped at this value. |
Retry Sequence: With defaults, retries occur at approximately 1s, 2s, 4s, 8s, 15s (capped). Increase XEROTIER_AGENT_MODEL_PULL_MAX_ATTEMPTS for unreliable networks.
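The sequence follows directly from doubling the base delay and capping it. A minimal Python sketch, assuming plain doubling with no jitter (the "approximately" in the note suggests the real schedule may vary slightly); the function name is hypothetical.

```python
def backoff_delays_ms(max_attempts=5, base_delay_ms=1000, max_delay_ms=15000):
    """Delay before each retry: base * 2^i, capped at max_delay_ms."""
    return [min(base_delay_ms * 2**i, max_delay_ms) for i in range(max_attempts)]


delays = backoff_delays_ms()
print(delays)       # [1000, 2000, 4000, 8000, 15000]
print(sum(delays))  # 30000
```

Under these assumptions the defaults spend about 30 seconds waiting in total before giving up. Because the delay is already capped by the fifth retry, each additional attempt adds another 15 seconds.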
Enrollment Options
Additional options for the enrollment and startup process.
Insecure Enrollment
By default, enrollment requires HTTPS endpoints. For development or trusted network environments, you can allow non-HTTPS enrollment:
# Via CLI flag
xerotier-xim-agent enroll --join-key xjk_abc123... --insecure
# Via environment variable
XEROTIER_AGENT_ALLOW_INSECURE=1 xerotier-xim-agent enroll --join-key xjk_abc123...
Security Risk: Insecure enrollment transmits your join key and agent credentials over an unencrypted connection. Only use this in isolated development environments. Never use --insecure in production.
Initial Model Path
Preload a local model on startup instead of waiting for the router to stream one. This is useful when you have already downloaded a model to disk:
# Load a model from a local path on startup
XEROTIER_AGENT_INITIAL_MODEL_PATH=/data/models/my-model
When set, the XIM node starts vLLM with this model immediately after registration. The model must already exist at the specified path in a format vLLM can load (safetensors or equivalent).
Tenant Cache Isolation
In multi-tenant deployments where a single XIM node serves requests from different tenants, set a server secret to generate per-tenant cache salts. This prevents cross-tenant data leakage in the KV cache. If this env var is unset, the XIM node falls back to a constant compiled-in salt, which provides no tenant isolation across deployments and should not be relied on in production:
# Server secret for tenant-isolated cache keys
XEROTIER_AGENT_VLLM_SALT_SECRET=your-secret-string-here
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_SALT_SECRET | built-in fallback (constant) | Server secret for generating per-tenant cache isolation salts. When unset, the XIM node falls back to a constant compiled-in value, which means every deployment that leaves it unset shares the same salt and gets no real tenant isolation. Production deployments must set this to a unique, high-entropy secret. |
| XEROTIER_AGENT_ALLOW_INSECURE | disabled | Set to 1 or true to allow non-HTTPS enrollment. Development only. |
| XEROTIER_AGENT_INITIAL_MODEL_PATH | - | Filesystem path to a local model to preload on startup. XIM node starts vLLM with this model immediately after registration. |
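To illustrate why the server secret matters for tenant cache isolation, the Python sketch below derives a per-tenant salt with HMAC-SHA256. The derivation scheme and the function name tenant_cache_salt are assumptions made for illustration; this page does not document how the XIM node actually computes its salts.

```python
import hashlib
import hmac


def tenant_cache_salt(server_secret: str, tenant_id: str) -> str:
    """Derive a stable per-tenant salt (illustrative HMAC-SHA256 scheme)."""
    return hmac.new(
        server_secret.encode("utf-8"),
        tenant_id.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


# Same tenant and secret: stable salt, so that tenant's cached prefixes still hit.
# Different tenant or different secret: different salt, so cache keys never collide.
a = tenant_cache_salt("deployment-secret", "tenant-a")
b = tenant_cache_salt("deployment-secret", "tenant-b")
print(a != b)  # True
```

Under a scheme like this, anyone who knows the secret can recompute any tenant's salt. That is why a constant compiled-in fallback offers no real isolation: every deployment that leaves the variable unset derives the same salts.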
vLLM Runtime Tuning
Knobs that control vLLM startup behavior, KV cache backend selection, and prefix caching policy. These exist for slow-disk, large-model, and multi-tenant deployments.
Startup Timeouts
Large models on slow storage can exceed the default vLLM startup window. The XIM node exposes two timeout knobs plus a degraded-state ceiling. Defaults track the value vLLM itself respects at the time the XIM node was built; run xerotier-xim-agent --help to print the current effective ceiling:
| Variable | Default | Description |
|---|---|---|
| XEROTIER_AGENT_VLLM_STARTUP_TIMEOUT_SECONDS | vLLM-defined | Hard ceiling (in seconds) on how long the XIM node waits for vLLM to finish loading a model before marking the start attempt as failed. Increase for large models or slow disks. |
| XEROTIER_AGENT_VLLM_STARTUP_INACTIVITY_GRACE_SECONDS | vLLM-defined | Inactivity grace window (in seconds) during vLLM startup. If vLLM produces no progress output for this long, the XIM node considers the start attempt stuck. |
| XEROTIER_AGENT_DEGRADED_TO_FAILED_TIMEOUT_SECONDS | vLLM-defined | How long (in seconds) an XIM node may remain in the degraded state before it is transitioned to failed. Operators on flaky networks may want a longer window. |
KV Cache Backend
The XIM node supports two KV cache backends. The default native backend uses vLLM's built-in paged attention with CPU offload (see KV Cache Offload Issues). The optional lmcache backend integrates the external LMCache library for cross-process and cross-node KV reuse. LMCache configuration is plumbed through the agent config but is opt-in; consult your deployment template for the exact env-var surface.
Prefix Caching
Always enabled: Prefix caching is enabled unconditionally on every XIM node. There is no env-var toggle to turn it off; the XIM node always emits --enable-prefix-caching when starting vLLM. This is intentional and matches the behavior validated for hybrid SSM and dense models alike.
Qwen3 MTP Defaults
Qwen3 family carve-out: Native Multi-Token Prediction defaults were removed for the Qwen3 family (including Qwen3.5 and Qwen3.6 MoE) because the upstream MTP layers were correlated with repetition-loop hangs. Operators who want speculative decoding on Qwen3-Next must opt in explicitly with XEROTIER_AGENT_SPECULATIVE_ENABLED=1 and XEROTIER_AGENT_SPECULATIVE_METHOD=qwen3_next_mtp. DeepSeek, GLM-4.5 MoE, and MIMO families continue to auto-select native MTP once speculative decoding is enabled.
ROCm Short-Sequence Workaround (Qwen3.5 GDN)
On AMD ROCm GPUs, Qwen3.5 models that use GDN (Gated DeltaNet) layers crash when the sequence length is below 64 tokens. The XIM node's bundled xerotier-vllm fork applies a zero-pad workaround to short sequences so these prompts no longer hit the crash path. No operator configuration is required; the fix is active whenever the XIM node detects an affected model on an AMD ROCm device.
Single-Node Queuing
When a project has exactly one compatible XIM node and that node is at capacity, the router automatically queues incoming requests instead of returning an immediate 503 error. This improves the experience for small deployments running a single XIM node.
How It Works
- A request arrives and the router finds exactly one compatible XIM node.
- That node is at its concurrent request limit.
- Instead of failing, the router parks the request in a waiting queue.
- When the node completes an in-flight request and frees a slot, the queued request is dispatched.
- If the queue timeout expires before capacity becomes available, the request fails with a 503.
Queue Limits
| Parameter | Default | Description |
|---|---|---|
| Max queued per node | 10 | Maximum number of requests that can wait for a single node to free capacity. Compile-time constant. |
| XEROTIER_SINGLE_AGENT_QUEUE_TIMEOUT_S | 30 | Maximum seconds a request waits in the queue before receiving a 503 response. Set this environment variable on the Xerotier router process to override the default. |
Router-Side Setting: The queue timeout is configured on the Xerotier router process, not on the XIM node. Self-hosted operators can set XEROTIER_SINGLE_AGENT_QUEUE_TIMEOUT_S in the router environment to adjust the wait window (in seconds).
This behavior is transparent to clients. During queuing, the router holds the HTTP connection open. If the request is dispatched successfully, the client receives a normal response. The X-Request-ID response header can be used to trace queued requests in logs.
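The queuing steps and limits can be modeled with a small Python class. This is a simplified, single-threaded sketch with an explicit clock, not the router's implementation: the class name is hypothetical, timeouts are checked lazily when a slot frees up (a real router would presumably fire them from a timer), and returning a 503 when the queue is already full is an assumption.

```python
from collections import deque

MAX_QUEUED_PER_NODE = 10   # compile-time constant per the table
QUEUE_TIMEOUT_S = 30.0     # XEROTIER_SINGLE_AGENT_QUEUE_TIMEOUT_S default


class SingleNodeQueue:
    def __init__(self, capacity):
        self.capacity = capacity   # the node's concurrent request limit
        self.in_flight = 0
        self.waiting = deque()     # (request_id, deadline) in arrival order

    def submit(self, request_id, now):
        """Dispatch immediately, park the request, or reject it."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return "dispatched"
        if len(self.waiting) >= MAX_QUEUED_PER_NODE:
            return "503"           # assumption: a full queue rejects immediately
        self.waiting.append((request_id, now + QUEUE_TIMEOUT_S))
        return "queued"

    def complete_one(self, now):
        """Free one slot, then dispatch the oldest waiter that has not timed out.

        Returns (dispatched_request_id_or_None, expired_request_ids).
        """
        self.in_flight -= 1
        expired = []
        while self.waiting:
            request_id, deadline = self.waiting.popleft()
            if now > deadline:
                expired.append(request_id)   # this request fails with a 503
                continue
            self.in_flight += 1
            return request_id, expired
        return None, expired
```

For example, with a capacity of 1, a request queued at t=0 that is still waiting when the slot frees at t=31 has timed out, while one queued at t=5 is dispatched.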
AMD CPU Deployment (ZenDNN)
Run inference on AMD EPYC CPUs without a GPU using vLLM with ZenDNN optimization. The ZenDNN CPU image is prebuilt and published for you; there is nothing to build locally.
Prebuilt image: The ZenDNN CPU image is published at ghcr.io/cloudnull/xerotier-public/xim-vllm-zendnn:latest and runs with compose/compose.agent-amd-cpu-zendnn.yaml from the cloudnull/xerotier-public repository.
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | AMD EPYC with AVX-512 | AMD EPYC 9454 (Genoa) or newer |
| System RAM | 64GB | 96GB+ (scales with model size) |
| Disk Space | 100GB SSD | 500GB+ NVMe SSD |
| CPU Cores | 16 cores | 24+ cores |
Memory Requirements by Model Size
CPU inference requires significantly more system RAM than GPU VRAM:
| Model Size | System RAM Required |
|---|---|
| Sub-1B parameters | ~32GB |
| 3-4B parameters | ~64GB |
| 7-8B parameters | ~96GB |
CPU-Specific Environment Variables
| Variable | Default | Description |
|---|---|---|
| VLLM_PLUGINS | zentorch | Enable ZenDNN optimization plugin |
| VLLM_CPU_KVCACHE_SPACE | auto | KV cache size in GB (auto: totalRAM - model - overhead) |
| VLLM_CPU_OMP_THREADS_BIND | auto | CPU core binding range (auto: 0-{nproc-1}) |
| VLLM_CPU_NUM_OF_RESERVED_CPU | auto | CPUs reserved for OS operations (auto: 1) |
Auto-Tuning: The XIM node automatically computes VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, and VLLM_CPU_NUM_OF_RESERVED_CPU at startup based on system resources. No manual calculation is required. Override via XEROTIER_AGENT_VLLM_ENV if the defaults are unsuitable.
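For operators who want to predict what auto-tuning will pick before overriding it, the formulas in the table can be sketched in Python. The function name cpu_auto_tune is hypothetical, and the overhead figure is an input the caller must estimate; this page does not state the value the XIM node uses.

```python
import os


def cpu_auto_tune(total_ram_gb, model_gb, overhead_gb, nproc=None):
    """Approximate the documented auto values for the vLLM CPU variables."""
    nproc = nproc or os.cpu_count() or 1
    # totalRAM - model - overhead, floored at zero and rounded down to whole GB.
    kv_space = max(int(total_ram_gb - model_gb - overhead_gb), 0)
    return {
        "VLLM_CPU_KVCACHE_SPACE": str(kv_space),
        "VLLM_CPU_OMP_THREADS_BIND": f"0-{nproc - 1}",
        "VLLM_CPU_NUM_OF_RESERVED_CPU": "1",
    }


# 96 GB host, 16 GB of weights, an assumed 8 GB of overhead, 24 cores.
print(cpu_auto_tune(96, 16, 8, nproc=24)["VLLM_CPU_KVCACHE_SPACE"])  # 72
```

If the predicted KV cache size looks too aggressive for a host that runs other workloads, pass a lower VLLM_CPU_KVCACHE_SPACE through XEROTIER_AGENT_VLLM_ENV.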
Docker Compose for CPU XIM Node
The CPU compose file is published in the cloudnull/xerotier-public repository under compose/:
git clone https://github.com/cloudnull/xerotier-public.git
cd xerotier-public/compose
export XEROTIER_AGENT_JOIN_KEY=xjk_your_key_here
docker compose -f compose.agent-amd-cpu-zendnn.yaml up -d
Memory Tuning: If the auto-computed VLLM_CPU_KVCACHE_SPACE causes out-of-memory errors, override it to a lower value via XEROTIER_AGENT_VLLM_ENV="VLLM_CPU_KVCACHE_SPACE=<GB>".
Performance Considerations
- Concurrency: CPU inference supports fewer concurrent requests than GPU. Start with XEROTIER_AGENT_MAX_CONCURRENT=5 and adjust based on model size.
- Data Type: Use --dtype=bfloat16 for optimal performance on AMD EPYC with AVX-512 VNNI.
- Model Selection: Smaller models (1-8B parameters) work best for CPU inference. Larger models will have significantly higher latency.
- Memory Bandwidth: Inference performance is often memory-bandwidth limited. Ensure your system has adequate memory channels populated.
Troubleshooting
Symptoms operators hit at deploy time, and the specific knob or command that resolves each one.
Agent Fails to Start
| Symptom | Solution |
|---|---|
| Join key expired | Generate a new join key from the Agents dashboard |
| Connection refused | Verify network connectivity to Xerotier router mesh |
| Invalid join key format | Ensure the complete key is provided without truncation |
GPU Not Detected
# Verify NVIDIA driver
nvidia-smi
# Verify Container Toolkit installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check Docker runtime configuration
docker info | grep -i runtime
Model Loading Fails
| Symptom | Solution |
|---|---|
| Out of disk space | Increase disk allocation or reduce cache size |
| Model not found | Verify the model name selected for the endpoint exists on HuggingFace, or that XEROTIER_AGENT_INITIAL_MODEL_PATH points to a valid local model directory. |
Permission Denied Errors
# Fix host directory permissions
sudo chown -R 5152:5152 /data/xerotier
# Verify permissions
ls -la /data/xerotier
Out of Memory (OOM)
- Reduce XEROTIER_AGENT_GPU_MEMORY_UTILIZATION to 0.85 or lower
- Reduce XEROTIER_AGENT_MAX_CONCURRENT to limit concurrent requests
- Reduce XEROTIER_AGENT_MAX_MODEL_LEN for shorter context windows
- Use a smaller model or add more GPUs
KV Cache Offload Issues
The agent ships with vLLM native CPU KV cache offloading enabled by default (25% of system RAM). See xerotier-xim-agent --help under --kv-offload-size-gb for tuning. Setting the environment variable XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB=0 disables offload.
| Symptom | Solution |
|---|---|
| Host memory pressure or OOM | Lower XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB below the 25% default, or set it to 0 to disable offload entirely |
| No TTFT improvement on repeated prefixes | Confirm offload is enabled (non-zero XEROTIER_AGENT_KV_OFFLOAD_SIZE_GB) and that requests actually share prefixes |
Common Commands
| Command | Description |
|---|---|
| docker-compose logs -f agent | View agent logs |
| docker-compose restart agent | Restart the agent |
| docker-compose down | Stop all services |
| nvidia-smi | Monitor GPU utilization |
| docker stats | Monitor container resource usage |
Frequently Asked Questions
How do I get a join key?
Navigate to the Agents page in your dashboard and click "Generate Join Key". Configure the region and expiration, then copy the generated key. The full key is only shown once.
Can I run multiple models on one GPU?
The agent loads one model at a time per vLLM instance. To serve multiple models, deploy multiple agents on separate GPUs or use time-sharing (not recommended for production).
How do I update the agent?
Pull the latest image and restart: docker-compose pull && docker-compose up -d. Your model cache and configuration persist through updates.
What models are supported?
Any model compatible with vLLM, including most HuggingFace Transformers models. Check the vLLM supported models list for compatibility.
How much VRAM do I need?
The XIM node uses a component-based estimator (weights + GQA-aware KV cache + an activation budget of roughly 20% of weights + a fixed 1.5GB CUDA overhead + a 5% safety margin), not a flat per-parameter multiplier. Rough fp16 ballpark figures including this overhead: 7B around 17-18GB, 13B around 30-32GB, 70B around 150GB+ (typically split across multiple GPUs). Models with strong grouped-query attention (GQA) and FP8 KV cache quantization run lower than this; long context windows and high concurrency run higher. Pre-quantized AWQ/GPTQ/bitsandbytes models reduce the weight component but do not eliminate the CUDA overhead or KV-cache budget. The XIM node logs its own estimate at startup; rely on that when sizing hardware.
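The components listed above can be combined into a rough calculator. The Python sketch below is an approximation, not the XIM node's estimator: the function names are hypothetical, sizes are treated as GiB, and applying the 5% margin multiplicatively to the sum of the other components is an assumption. The KV cache formula (one K and one V vector per layer, per KV head, per token) is the standard one and is what makes the estimate GQA-aware.

```python
GIB = 2**30


def kv_cache_gib(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache size for a given number of cached tokens (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / GIB


def estimate_vram_gib(params_billion, bytes_per_param=2.0, kv_gib=0.0):
    """weights + KV cache + ~20% activations + 1.5 GiB CUDA overhead, plus a 5% margin."""
    weights = params_billion * 1e9 * bytes_per_param / GIB
    activations = 0.20 * weights
    cuda_overhead = 1.5
    return (weights + kv_gib + activations + cuda_overhead) * 1.05


# 32 layers, 128-dim heads, 4096 cached tokens: full multi-head vs. 8 KV heads (GQA).
print(kv_cache_gib(32, 32, 128, 4096))  # 2.0
print(kv_cache_gib(32, 8, 128, 4096))   # 0.5

print(round(estimate_vram_gib(7), 1))   # 18.0
```

With no KV cache budget, this lands near the upper end of the ballpark figures above for 7B and 13B at fp16. Adding the KV term for your context length and concurrency shows how quickly long contexts push the total up. Treat the result as a planning aid and defer to the estimate the XIM node logs at startup.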
Can I use AMD GPUs?
Yes. AMD ROCm GPU support is available. You can also run inference on AMD EPYC CPUs using vLLM with ZenDNN optimization. See the AMD CPU Deployment section for details.
Is my data secure?
XIM nodes only receive requests scoped to your project, and all node-to-router connections use CURVE-encrypted ZMQ. In a fully self-hosted topology (your XIM node and your own Xerotier router), inference data stays within infrastructure you operate. When connecting a self-hosted XIM node to the shared Xerotier router mesh, prompts and responses transit the shared router process during dispatch; choose a topology that matches your data-handling requirements.
What happens if my XIM node goes offline?
Requests are automatically routed to other available XIM nodes. If you have fallback enabled, requests can be served by shared infrastructure. Otherwise, they queue until your node reconnects.
How does KV cache offload work?
Native CPU KV cache offload reduces Time-to-First-Token (TTFT) for repeated prompt prefixes. See the KV Cache Offload Issues section for the default size, the tuning knob, and how to disable it; the same guidance applies here.