Agent Types

Xerotier.ai supports two types of agents for serving inference requests: Xerotier Inference Microservice (XIM) nodes that you operate, and shared agents managed by the platform.

Overview

Agents are the compute workers that execute model inference. When a request arrives at your endpoint, the Router selects an appropriate agent based on availability, capacity, and ownership.

  • XIM Nodes: Owned and operated by you, running on your own infrastructure
  • Shared Agents: Platform-managed agents available to all authorized users

Tip: You can use both agent types together. XIM nodes handle your primary workloads while shared agents provide automatic fallback.

Quick Comparison

Aspect         | XIM (Private)              | Shared
---------------|----------------------------|---------------------
Ownership      | User/Project               | Platform
Access         | Owner-only                 | Any authorized user
Enrollment     | Join key required          | Pre-provisioned
Data isolation | Single-tenant              | Multi-tenant
Control        | Full lifecycle management  | None
Cost           | User provides hardware     | Platform cost
Model caching  | Dedicated cache            | Shared pool
Fallback       | Can use shared as backup   | N/A

Tier Types

Every endpoint is assigned a service tier that determines the type of compute used, the scheduling priority, rate limits, and pricing. The tier slug is used when configuring endpoints and when specifying supported_tiers on join keys or agents.

Slug                 | Name                  | Accelerator                 | Tenancy                | Priority | Preemptable
---------------------|-----------------------|-----------------------------|------------------------|----------|------------
free                 | Free                  | Shared (any)                | Multi-tenant           | 10       | Yes
cpu_amd_optimized    | CPU AMD Optimized     | AMD EPYC CPU (ZenDNN)       | Multi-tenant           | 20       | No
cpu_intel_optimized  | CPU Intel Optimized   | Intel Xeon CPU (oneAPI)     | Multi-tenant           | 20       | No
gpu_nvidia_shared    | GPU NVIDIA Shared     | NVIDIA CUDA GPU             | Multi-tenant           | 25       | No
gpu_amd_shared       | GPU AMD Shared        | AMD ROCm GPU                | Multi-tenant           | 25       | No
gpu_intel_shared     | GPU Intel Shared      | Intel oneAPI / Arc GPU      | Multi-tenant           | 25       | No
gpu_nvidia_dedicated | GPU NVIDIA Dedicated  | NVIDIA CUDA GPU (exclusive) | Single-tenant (pinned) | 35       | No
gpu_amd_dedicated    | GPU AMD Dedicated     | AMD ROCm GPU (exclusive)    | Single-tenant (pinned) | 35       | No
self_hosted          | Self-Hosted           | User-supplied (any)         | Single-tenant          | 30       | No
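Because the tier slug appears verbatim in endpoint configuration and in supported_tiers, a typo silently breaks routing. The following sketch collects the slugs from the table above into a set and checks a supported_tiers value before it is sent; the validation helper is illustrative, not part of any official SDK.

```python
# Tier slugs from the table above. The validate function is an
# illustrative client-side check, not an official API.
TIER_SLUGS = {
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
    "gpu_nvidia_dedicated",
    "gpu_amd_dedicated",
    "self_hosted",
}

def validate_supported_tiers(supported_tiers):
    """Return the tier list unchanged if every slug is known, else raise."""
    if supported_tiers is None:
        return None  # null: the platform picks defaults for the agent
    unknown = [t for t in supported_tiers if t not in TIER_SLUGS]
    if unknown:
        raise ValueError(f"unknown tier slug(s): {unknown}")
    return list(supported_tiers)
```

For example, `validate_supported_tiers(["self_hosted", "gpu_nvidia_shared"])` passes, while a misspelled `"gpu_nvidia"` raises before the request ever leaves your machine.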

Tier Descriptions

free

The entry-level tier for testing and evaluation. Requests are preemptable -- they may be interrupted to make room for higher-priority traffic. Rate limits are low (10,000 tokens/minute, 64 requests/minute). Streaming is enabled; batching is disabled. Projects on the free tier are limited to 2 endpoints and 2 models.

cpu_amd_optimized

CPU inference optimized for AMD EPYC processors using ZenDNN. Well-suited for batch workloads and cost-sensitive inference where GPU acceleration is not required. Rate limits: 100,000 tokens/minute, 128 requests/minute. Both streaming and batching are enabled.

cpu_intel_optimized

CPU inference optimized for Intel Xeon processors using the Intel oneAPI toolkit. Equivalent rate limits and capabilities to cpu_amd_optimized. Both streaming and batching are enabled.

gpu_nvidia_shared

Shared NVIDIA CUDA GPU inference. Best balance of price and performance for most workloads. Multiple projects share the same physical GPU; request isolation is enforced at the software layer. Rate limits: 500,000 tokens/minute, 256 requests/minute. 99.9% SLA.

gpu_amd_shared

Shared AMD ROCm GPU inference. Cost-effective alternative to NVIDIA GPU tiers. Same rate limits as gpu_nvidia_shared. 99.9% SLA.

gpu_intel_shared

Shared Intel oneAPI GPU inference on Arc or Gaudi accelerators. Same rate limits as gpu_nvidia_shared. 99.9% SLA.

gpu_nvidia_dedicated

Exclusive NVIDIA CUDA GPU with pinned worker binding. The entire GPU is reserved for your project -- no sharing with other tenants. Higher scheduling priority (35) than shared GPU tiers. Higher concurrency limits (96 requests) and larger batch sizes (32). Request timeout is 600 seconds with an idle timeout of 1800 seconds. 99.95% SLA.

gpu_amd_dedicated

Exclusive AMD ROCm GPU with pinned worker binding. Equivalent capabilities to gpu_nvidia_dedicated using AMD hardware. 99.95% SLA.

self_hosted

Bring-your-own-accelerator tier for XIM (private) agents. You supply the hardware; the platform provides routing, monitoring, and management. There is no GPU vendor restriction: any accelerator supported by vLLM can be used. Concurrency limits (96 requests) match the dedicated GPU tiers, with larger batch sizes (64). Request timeout is 1800 seconds with an idle timeout of 3600 seconds. Supports model files up to 512 GB. Pricing is hourly rather than per-token.
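The per-minute figures quoted in the descriptions above can be collected into a simple lookup table for client-side throttling. This is a sketch: the field names and the check function are assumptions for illustration, not the platform's configuration schema (dedicated and self_hosted tiers are omitted because their descriptions quote concurrency and batch limits rather than per-minute rates).

```python
# Per-minute rate limits quoted in the tier descriptions above.
# Field names are illustrative, not an official schema.
TIER_LIMITS = {
    "free":                {"tokens_per_min": 10_000,  "requests_per_min": 64},
    "cpu_amd_optimized":   {"tokens_per_min": 100_000, "requests_per_min": 128},
    "cpu_intel_optimized": {"tokens_per_min": 100_000, "requests_per_min": 128},
    "gpu_nvidia_shared":   {"tokens_per_min": 500_000, "requests_per_min": 256},
    "gpu_amd_shared":      {"tokens_per_min": 500_000, "requests_per_min": 256},
    "gpu_intel_shared":    {"tokens_per_min": 500_000, "requests_per_min": 256},
}

def within_rate_limit(tier, tokens_used, requests_made):
    """True if current-minute usage is at or under the tier's limits."""
    limits = TIER_LIMITS[tier]
    return (tokens_used <= limits["tokens_per_min"]
            and requests_made <= limits["requests_per_min"])
```

A client can consult this before dispatching a batch, e.g. `within_rate_limit("free", 9_000, 60)` is still within the free tier's budget.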

XIM Node Features

XIM nodes give you full control over your inference infrastructure while leveraging Xerotier.ai routing and management capabilities.

Key Capabilities

  • Join key enrollment -- Connect your hardware to the platform using secure, time-limited join keys created from the Agents dashboard or via the API. Join keys expire after at most 1 hour.
  • Full lifecycle management -- Suspend, resume, and remove agents at any time. View logs and monitor health metrics in real time. Agents transition through pending, active, disconnected, suspended, and dead states.
  • Region assignment -- Assign agents to regions (free-form strings up to 24 ASCII characters, e.g. "us-east-1" or "datacenter-3") for geographic or logical organization. The region is inherited from the join key used at enrollment.
  • Dedicated model cache -- XIM nodes maintain a private model cache not shared with other users, providing faster loading and data isolation.
  • Storage limits -- Optionally configure a storage limit in bytes per agent via storage_limit_enabled and storage_limit_bytes on the join key or agent record. When disabled, agents can use unlimited storage up to disk capacity.
  • Tier support -- Each agent can declare which service tiers it supports via supported_tiers. When null, the platform selects appropriate defaults based on the agent's detected accelerator type.
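The constraints listed above (region length, join key expiry, optional storage limit, nullable supported_tiers) can be sketched as a join-key request builder. The field names follow the prose, but the exact API schema and the function itself are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical join-key payload builder. Field names follow the prose
# above; the real API schema may differ.
def make_join_key_request(region, ttl_minutes=60,
                          storage_limit_bytes=None, supported_tiers=None):
    if len(region) > 24 or not region.isascii():
        raise ValueError("region must be at most 24 ASCII characters")
    if ttl_minutes > 60:
        raise ValueError("join keys expire after at most 1 hour")
    return {
        "region": region,  # inherited by agents enrolled with this key
        "expires_at": (datetime.now(timezone.utc)
                       + timedelta(minutes=ttl_minutes)).isoformat(),
        "storage_limit_enabled": storage_limit_bytes is not None,
        "storage_limit_bytes": storage_limit_bytes,
        # None lets the platform pick defaults for the detected accelerator
        "supported_tiers": supported_tiers,
    }
```

For instance, `make_join_key_request("us-east-1", storage_limit_bytes=512 * 2**30)` caps each enrolled agent at 512 GiB of cache storage, while omitting the limit leaves storage unbounded up to disk capacity.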

For detailed setup instructions, join key management, enrollment procedures, monitoring, and API examples, see the Private Agents & Join Keys guide.
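The lifecycle states named above (pending, active, disconnected, suspended, dead) suggest a small state machine. The transition table below is one plausible reading inferred from the prose, not the platform's authoritative lifecycle; treat it as a mental model.

```python
# Plausible agent lifecycle sketch. Allowed transitions are assumptions
# inferred from the prose, not an authoritative state machine.
TRANSITIONS = {
    "pending":      {"active", "dead"},               # enrollment completes, or key expires
    "active":       {"disconnected", "suspended", "dead"},
    "disconnected": {"active", "dead"},               # heartbeat resumes, or agent removed
    "suspended":    {"active", "dead"},               # resumed, or removed
    "dead":         set(),                            # terminal
}

def transition(state, new_state):
    """Return new_state if the move is allowed, else raise."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state
```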

Shared Agent Features

Shared agents are platform-managed infrastructure available to all users.

No Enrollment Needed

Shared agents are pre-provisioned and ready to use. When you create an endpoint, it can immediately route to shared agents without any setup.

Platform-Managed

Xerotier.ai handles all operational aspects:

  • Provisioning and scaling
  • Health monitoring and recovery
  • Software updates and patches
  • Hardware maintenance

Multi-Tenant with Isolation

Shared agents serve requests from multiple projects. Request isolation is enforced through process-level sandboxing, memory isolation, network segmentation, and request authentication.

Security: Even on shared agents, your requests and data are isolated from other users. Each request runs in a sandboxed environment with memory cleared between requests.

Routing Behavior

The Router uses the following logic to select an agent for each request:

  1. User-owned agents take precedence: If you have active XIM agents enrolled in the requested tier, they are used first
  2. Automatic fallback: When all your agents are busy or offline, requests can fall back to shared agents
  3. Regional affinity: Requests prefer agents in the same or nearby regions
  4. Load balancing: Multiple agents of the same type are load-balanced using a fair-queue scheduler with latency prediction
  5. Heartbeat freshness: Agents whose last heartbeat is older than 30 seconds are treated as disconnected and excluded from routing
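Steps 1, 2, and 5 above can be sketched as a selection function: drop agents with stale heartbeats, prefer user-owned XIM agents, and fall back to shared agents only when fallback is enabled. The agent record shape is illustrative, and the sketch ignores regional affinity and latency-predicting load balancing (steps 3 and 4).

```python
import time

# Minimal routing sketch: heartbeat filtering, owner preference, and
# shared fallback. Agent record fields are illustrative assumptions.
HEARTBEAT_TTL = 30  # seconds; older heartbeats count as disconnected

def select_agent(agents, tier, fallback_enabled, now=None):
    now = time.time() if now is None else now
    fresh = [a for a in agents
             if now - a["last_heartbeat"] <= HEARTBEAT_TTL
             and tier in a["supported_tiers"]
             and not a["busy"]]
    owned = [a for a in fresh if a["owned"]]
    if owned:
        return owned[0]            # user-owned agents take precedence
    if fallback_enabled:
        shared = [a for a in fresh if not a["owned"]]
        if shared:
            return shared[0]
    return None                    # request queues (30-second timeout)
```

With fallback disabled and only shared agents fresh, the function returns None and the request queues, matching the note above.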

Note: Fallback to shared agents must be enabled in your endpoint settings. When disabled, requests will queue until an agent becomes available or the queue timeout (30 seconds) is reached.

Frequently Asked Questions

Can I use both agent types simultaneously?

Yes. You can have XIM nodes for your primary workloads while using shared agents as a fallback. Configure this in your endpoint settings under "Fallback Options."

What happens if my agent goes offline?

When an agent disconnects, requests are routed to other available agents. If no agents are available and fallback is enabled, requests go to shared agents. If fallback is disabled, requests are queued until an agent becomes available. See Agent Lifecycle for details.

How do I migrate from shared to XIM?

Generate a join key from the Agents dashboard or via the API, deploy an XIM agent on your infrastructure, and start it with the join key. See the Agent Enrollment guide for step-by-step instructions.

Are my requests visible to other users on shared agents?

No. Even on shared agents, requests are isolated. Each request runs in a sandboxed environment, memory is cleared between requests, network access is restricted to your project, and logs and metrics are project-scoped.

What hardware do I need for self-hosting?

Requirements depend on your model and workload. GPU agents need an NVIDIA GPU with CUDA support, an AMD GPU with ROCm support, or an Intel GPU with oneAPI support. CPU agents need a modern x86_64 CPU. You also need sufficient RAM and disk space to load your models, a stable network connection, and a Linux environment compatible with vLLM.

Can I run multiple agents?

Yes. Running multiple agents provides redundancy (requests continue if one agent fails), increased throughput (handle more concurrent requests), and geographic distribution (deploy agents closer to your users).

What is the difference between shared and dedicated GPU tiers?

On shared GPU tiers (gpu_nvidia_shared, gpu_amd_shared, gpu_intel_shared), the GPU is shared among multiple projects with software-level isolation. On dedicated GPU tiers (gpu_nvidia_dedicated, gpu_amd_dedicated), the entire GPU is exclusively assigned to your project with pinned worker binding. Dedicated tiers have higher scheduling priority, higher concurrency limits, and a stronger SLA (99.95% vs 99.9%).

What tier slug should I use when configuring a self-hosted agent?

Use self_hosted as the tier slug for XIM (private) agents that you enroll yourself. You can also set supported_tiers on the join key or agent to allow the agent to serve requests from additional tiers that match your hardware.