//

Two ways to serve inference: a XIM node you run on your hardware, or a shared agent the platform runs on its own. Same router, same OpenAI-compatible surface, same dashboard. The choice is operational, not architectural.

Overview

When a request arrives at your endpoint, the router selects an agent based on availability, capacity, and ownership.

  • XIM Nodes: Owned and operated by you, running on your own infrastructure
  • Shared Agents: Platform-managed agents available to all authorized users

Tip: You can use both agent types together. XIM nodes handle your primary workloads while shared agents provide automatic fallback.

Quick Comparison

Aspect XIM (Private) Shared
Ownership User/Project Platform
Access Owner-only Any authorized user
Enrollment Join key required Pre-provisioned
Data isolation Single-tenant Multi-tenant
Control Full lifecycle management None
Cost User provides hardware Platform cost
Model caching Dedicated cache Shared pool
Fallback Can use shared as backup N/A

Tier Types

Every endpoint is assigned a service tier that determines the type of compute used, the scheduling priority, rate limits, and pricing. The tier slug is used when configuring endpoints and when specifying supported_tiers on join keys or agents.

Slug Name Accelerator Tenancy Priority
free Free Shared (any) Multi-tenant 10
cpu_amd_optimized CPU AMD Optimized AMD EPYC CPU (ZenDNN) Multi-tenant 20
cpu_intel_optimized CPU Intel Optimized Intel Xeon CPU (oneAPI) Multi-tenant 20
gpu_nvidia_shared GPU NVIDIA Shared NVIDIA CUDA GPU Multi-tenant 25
gpu_amd_shared GPU AMD Shared AMD ROCm GPU Multi-tenant 25
gpu_intel_shared GPU Intel Shared Intel oneAPI / Arc GPU Multi-tenant 25
self_hosted Self-Hosted User-supplied (CPU class in tier matching; see note) Single-tenant 30

Tier Descriptions

free

The entry-level tier for testing and evaluation. Rate limits are low (10,000 tokens/minute, 64 requests/minute). Streaming is enabled; batching is disabled. Projects on the free tier are limited to 2 endpoints and 2 models.

cpu_amd_optimized

CPU inference optimized for AMD EPYC processors using ZenDNN. Well-suited for batch workloads and cost-sensitive inference where GPU acceleration is not required. Rate limits: 100,000 tokens/minute, 128 requests/minute. Both streaming and batching are enabled.

cpu_intel_optimized

CPU inference optimized for Intel Xeon processors using the Intel oneAPI toolkit. Equivalent rate limits and capabilities to cpu_amd_optimized. Both streaming and batching are enabled.

gpu_nvidia_shared

Shared NVIDIA CUDA GPU inference. Best balance of price and performance for most workloads. Multiple projects share the same physical GPU; request isolation is enforced at the software layer. Rate limits: 500,000 tokens/minute, 256 requests/minute.

gpu_amd_shared

Shared AMD ROCm GPU inference. Cost-effective alternative to NVIDIA GPU tiers. Same rate limits as gpu_nvidia_shared.

gpu_intel_shared

Shared Intel oneAPI GPU inference on Arc or Gaudi accelerators. Same rate limits as gpu_nvidia_shared.

self_hosted

Bring-your-own-accelerator tier for XIM (private) agents. You supply the hardware; the platform provides routing, monitoring, and management. Very high concurrency limits (96 requests) and batch sizes (64). Request timeout is 1800 seconds with an idle timeout of 3600 seconds. Supports model files up to 512 GB. Pricing is hourly rather than per-token.

Note: GPU XIM agents enrolled against self_hosted still serve GPU workloads. To advertise GPU capability explicitly, set supported_tiers on the join key or agent to include the matching shared GPU tier (for example gpu_nvidia_shared).

SLA: Any uptime percentages shown in tier marketing copy or features_json are descriptive only. The platform does not currently measure or enforce per-tier SLAs in code; treat published numbers as targets, not contractual guarantees.

Deprecated tiers: Earlier releases exposed gpu_nvidia_dedicated and gpu_amd_dedicated as primary tier slugs. New endpoints should use the corresponding shared GPU tier (gpu_nvidia_shared, gpu_amd_shared); single-tenant pinning is now expressed via endpoint and agent binding rather than a separate tier slug.

XIM Node Features

XIM nodes give you full control over your inference infrastructure while leveraging Xerotier.ai routing and management capabilities.

Key Capabilities

  • Join key enrollment, Connect your hardware to the platform using secure, time-limited join keys created from the Agents dashboard or via the API. Join keys expire after at most 1 hour.
  • Full lifecycle management, Suspend, resume, and remove agents at any time. View logs and monitor health metrics in real time. Agents transition through pending, active, disconnected, suspended, and dead states.
  • Region assignment, Assign agents to regions (free-form strings up to 24 ASCII characters, e.g. "us-east-1" or "datacenter-3") for geographic or logical organization. The region is inherited from the join key used at enrollment.
  • Dedicated model cache, XIM nodes maintain a private model cache not shared with other users, providing faster loading and data isolation.
  • Storage limits, Optionally configure a storage limit in bytes per agent via storage_limit_enabled and storage_limit_bytes on the join key or agent record. When disabled, agents can use unlimited storage up to disk capacity.
  • Tier support, Each agent can declare which service tiers it supports via supported_tiers. When null, the platform selects appropriate defaults based on the agent's detected accelerator type.

For detailed setup instructions, join key management, enrollment procedures, monitoring, and API examples, see the Private Agents & Join Keys guide.

Shared Agent Features

Shared agents are platform-managed infrastructure available to all users.

No Enrollment Needed

Shared agents are pre-provisioned and ready to use. When you create an endpoint, it can immediately route to shared agents without any setup.

Platform-Managed

Xerotier.ai handles all operational aspects:

  • Provisioning and scaling
  • Health monitoring and recovery
  • Software updates and patches
  • Hardware maintenance

Multi-Tenant with Isolation

Shared agents serve requests from multiple projects. Request isolation is enforced through process-level sandboxing, memory isolation, network segmentation, and request authentication.

Security: Even on shared agents, your requests and data are isolated from other users. Each request runs in a sandboxed environment with memory cleared between requests.

Routing Behavior

The Router uses the following logic to select an agent for each request:

  1. User-owned agents take precedence: If you have active XIM agents enrolled in the requested tier, they are used first
  2. Automatic fallback: When all your agents are busy or offline, requests can fall back to shared agents
  3. Regional affinity: Requests prefer agents in the same or nearby regions
  4. Load balancing: Multiple agents of the same type are load-balanced using a fair-queue scheduler with latency prediction
  5. Heartbeat freshness: Agents whose last heartbeat is older than 30 seconds are treated as disconnected and excluded from routing

Note: Fallback to shared agents must be enabled in your endpoint settings. When disabled, requests will queue until an agent becomes available or the queue timeout (30 seconds) is reached.

Frequently Asked Questions

Can I use both agent types simultaneously?

Yes. You can have XIM nodes for your primary workloads while using shared agents as a fallback. Configure this in your endpoint settings under "Fallback Options."

What happens if my agent goes offline?

When an agent disconnects, requests are routed to other available agents. If no agents are available and fallback is enabled, requests go to shared agents. If fallback is disabled, requests are queued until an agent becomes available. See Agent Lifecycle for details.

How do I migrate from shared to XIM?

Generate a join key from the Agents dashboard or via the API, deploy an XIM agent on your infrastructure, and start it with the join key. See the Agent Enrollment guide for step-by-step instructions.

Are my requests visible to other users on shared agents?

No. Even on shared agents, requests are isolated. Each request runs in a sandboxed environment, memory is cleared between requests, network access is restricted to your project, and logs and metrics are project-scoped.

What hardware do I need for self-hosting?

Requirements depend on your model and workload. GPU agents need an NVIDIA GPU with CUDA support, an AMD GPU with ROCm support, or an Intel GPU with oneAPI support. CPU agents need a modern x86_64 CPU. You also need sufficient RAM and disk space to load your models, a stable network connection, and a Linux environment compatible with vLLM.

Can I run multiple agents?

Yes. Running multiple agents provides redundancy (requests continue if one agent fails), increased throughput (handle more concurrent requests), and geographic distribution (deploy agents closer to your users).

What is the difference between shared and dedicated GPU tiers?

On shared GPU tiers (gpu_nvidia_shared, gpu_amd_shared, gpu_intel_shared), the GPU is shared among multiple projects with software-level isolation. The legacy gpu_nvidia_dedicated and gpu_amd_dedicated slugs are deprecated; single-tenant pinning is now configured at the endpoint and agent level rather than via a separate tier slug. Use the shared GPU tier that matches your hardware and configure exclusive binding on the endpoint when required.

How does /v1/enroll/refresh authenticate?

The refresh endpoint validates the rotation token presented in the request body rather than running through the standard authentication middleware. See Authentication for the full token-rotation flow.

What tier slug should I use when configuring a self-hosted agent?

Use self_hosted as the tier slug for XIM (private) agents that you enroll yourself. You can also set supported_tiers on the join key or agent to allow the agent to serve requests from additional tiers that match your hardware.