Service Tiers

Xerotier offers 7 service tiers with different pricing, performance, and availability characteristics. The tier is configured per endpoint, not per request.

Overview

Each endpoint in Xerotier is assigned a service tier that determines its pricing, rate limits, timeouts, maximum model size, and which accelerator hardware can serve requests. The tier is set when creating or updating an endpoint in the dashboard -- it cannot be overridden per request.

Tiers are divided into three categories:

  • Free -- For evaluation and testing with shared CPU resources.
  • Compute (CPU and GPU) -- Pay-per-token tiers with dedicated accelerator types.
  • XIM -- Bring your own infrastructure with no token metering.

Tier Comparison

Pricing and Rate Limits

Tier                 Pricing             Tokens/Min  Requests/Min  Max Model Size
Free                 $0.00               10,000      60            4 GB
CPU AMD Optimized    $0.225 / 1M tokens  100,000     120           32 GB
CPU Intel Optimized  $0.225 / 1M tokens  100,000     120           32 GB
GPU NVIDIA Shared    $0.75 / 1M tokens   250,000     240           76 GB
GPU AMD Shared       $0.65 / 1M tokens   250,000     240           76 GB
GPU Intel Shared     $0.60 / 1M tokens   250,000     240           76 GB
XIM                  ~$10 / month        Unlimited   Unlimited     304 GB

Request rate limits include burst headroom of 50% above the base limit (a minimum of 3 extra requests on Free, 10 on paid tiers), allowing short-term traffic spikes.
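Read concretely, the burst rule works out as follows. This is a sketch of our reading of the rule; burst_limit is illustrative, not part of any SDK, and the exact rounding behavior is an assumption:

```python
# Sketch: effective short-term request ceiling under the burst rule.
# Headroom is 50% of the base requests/min limit, with a floor of
# 3 extra requests on Free and 10 on paid tiers (rounding assumed).
def burst_limit(base_rpm: int, paid: bool) -> int:
    floor = 10 if paid else 3
    return base_rpm + max(base_rpm // 2, floor)

print(burst_limit(60, paid=False))   # Free: 60 base -> 90 burst
print(burst_limit(240, paid=True))   # GPU shared: 240 base -> 360 burst
```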

Timeouts and Concurrency

Tier                      Request Timeout  Idle Stream Timeout  Max Concurrent  Batch Size
Free                      30s              120s                 5               1
CPU AMD / Intel           300s             600s                 20              8
GPU NVIDIA / AMD / Intel  300s             600s                 50              16
XIM                       1800s            3600s                100             64
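One practical consequence of the timeout table: set your client-side timeout slightly above the server-side cap for your endpoint's tier, so the server's timeout response reaches you rather than a local abort cutting the request short. A minimal sketch, where the tier keys and headroom value are illustrative:

```python
# Server-side request timeouts per tier, from the table above.
TIER_TIMEOUT_S = {"free": 30, "cpu": 300, "gpu": 300, "xim": 1800}

def client_timeout_s(tier: str, headroom_s: int = 5) -> int:
    """Client-side timeout a little above the server-side cap."""
    return TIER_TIMEOUT_S[tier] + headroom_s
```

With the OpenAI Python SDK, this value can be passed as the `timeout` argument when constructing the client.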

Tier Details

Free

Perfect for testing and evaluation. Uses shared CPU infrastructure with preemptable scheduling. Free accounts are limited to 2 endpoints and 2 models. Requests can be preempted by higher-priority traffic.

  • Accelerator: Shared CPU
  • Priority: 0 (lowest)
  • Preemptable: Yes
  • Monthly token allowance: 100,000 tokens

CPU AMD Optimized

ZenDNN-optimized inference on AMD EPYC processors. Good for latency-tolerant workloads with moderate throughput requirements. Requests are not preemptable.

  • Accelerator: AMD ZenDNN CPU
  • Priority: 20
  • Preemptable: No

CPU Intel Optimized

oneAPI-optimized inference on Intel Xeon processors. Similar to the AMD CPU tier, with slightly higher scheduling priority. Requests are not preemptable.

  • Accelerator: Intel oneAPI CPU
  • Priority: 25
  • Preemptable: No

GPU NVIDIA Shared

Best balance of price and performance with NVIDIA CUDA GPUs. Shared infrastructure with a 99.9% SLA. Shares the highest compute-tier scheduling priority (30) with the other GPU tiers. Requests are preemptable, meaning higher-priority internal operations can temporarily displace active requests during peak load.

  • Accelerator: NVIDIA CUDA
  • Priority: 30 (highest)
  • Preemptable: Yes

GPU AMD Shared

Cost-effective GPU inference with AMD ROCm. Same performance characteristics as the NVIDIA tier at a lower price point.

  • Accelerator: AMD ROCm
  • Priority: 30
  • Preemptable: Yes

GPU Intel Shared

GPU inference with Intel oneAPI and Arc/Max GPUs. Lowest per-token cost among GPU tiers.

  • Accelerator: Intel oneAPI GPU
  • Priority: 30
  • Preemptable: Yes

XIM

Bring your own hardware and infrastructure. No token metering or rate limiting. Flat monthly fee covers the Xerotier routing and management layer. Supports the largest models (up to 304 GB) with extended timeouts.

  • Accelerator: Your own hardware (CPU or GPU)
  • Priority: 0
  • Preemptable: No
  • Full data control

Tier Selection

The service tier is configured per endpoint in the Xerotier dashboard or API. When you create or update an endpoint, you select which tier it uses. All requests to that endpoint are routed to workers compatible with the configured tier.

The service_tier request parameter is accepted for OpenAI API compatibility but does not affect routing. The endpoint's configured tier always takes precedence. The actual tier used is returned in the service_tier field of every response (both streaming chunks and non-streaming completions).

The router selects backends within the configured tier using configurable routing strategies:

  • Least Loaded -- Prefer workers with the lowest queue depth.
  • Lowest Latency -- Prefer workers with the lowest predicted latency.
  • Model Affinity -- Prefer workers that already have the model loaded.
  • Round Robin -- Cycle through available workers evenly.
  • Composite -- Weighted combination of multiple strategies.

You can also pass optional X-SLO-TTFT-Ms and X-SLO-TPOT-Ms request headers to hint at latency targets. The router boosts preference for workers likely to meet these targets. See the API Reference for details.
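As a sketch, the SLO hint headers can be attached per request; with the OpenAI Python SDK, per-request headers go through the `extra_headers` parameter. The helper below is illustrative, not part of any SDK:

```python
# Hypothetical helper: build the optional SLO hint headers.
# Header names are from the docs above; values are milliseconds.
def slo_headers(ttft_ms: int, tpot_ms: int) -> dict:
    return {
        "X-SLO-TTFT-Ms": str(ttft_ms),  # target time-to-first-token
        "X-SLO-TPOT-Ms": str(tpot_ms),  # target time-per-output-token
    }

# Usage with the OpenAI SDK (endpoint URL and key are placeholders):
# client.chat.completions.create(
#     model="my-model",
#     messages=[{"role": "user", "content": "Hello"}],
#     extra_headers=slo_headers(ttft_ms=500, tpot_ms=50),
# )
```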

Custom Models and Tier Restrictions

Custom models -- models that you upload and manage privately within your project -- are restricted to the XIM tier. To use a model on shared compute tiers (Free, CPU, or GPU), the model must be published to the public catalog and promoted to the shared catalog role.

This restriction ensures that shared infrastructure only runs models that have been explicitly published and vetted. Private models never run on shared agents.

Model Type                        Allowed Tiers
Custom (private, not in catalog)  XIM only
Catalog model (deployable role)   XIM only
Catalog model (shared role)       All tiers

Catalog Roles and Tier Access

Models in the public catalog are assigned a catalog role that controls which tiers they can be deployed on:

  • Deployable: The default role for newly shared models. The model is visible in the catalog but can only be used with the XIM tier. This means other users can see the model but must deploy it on their own infrastructure.
  • Shared: The model is available on all tiers, including shared compute. This role is set automatically when the platform detects the model loaded on shared infrastructure, and reverts to deployable when no shared agents have it cached.

The catalog role is managed automatically by the platform. When an administrator provisions a catalog model onto shared infrastructure, the role is promoted to shared. When shared infrastructure no longer has the model cached, the role is demoted back to deployable.

For details on how model sharing works and how to publish models to the catalog, see the Model Sharing documentation.

Shared Tier Caveats

Important: Shared models are subject to change. Models available on shared agents may be evicted or replaced at any time. Endpoints using shared models on shared tiers may become unavailable if the backing model is removed from shared infrastructure. For guaranteed availability and permanence, deploy models on XIM nodes.

Key considerations when using shared tiers with catalog models:

  • Model availability is not guaranteed. A model that is available on shared tiers today may be demoted to deployable-only tomorrow if shared agents evict it due to capacity constraints.
  • Endpoints may become unavailable. If the backing model is removed from shared infrastructure, endpoints on shared tiers will return 503 Service Unavailable until the model is re-provisioned or the endpoint is moved to the XIM tier.
  • Use XIM for critical workloads. If your application requires guaranteed model availability, deploy on the XIM tier where you control the infrastructure and model lifecycle.
  • Monitor endpoint health. Use the dashboard or API to monitor endpoint provisioning status. If an endpoint shows as unprovisioned, the backing model may have been evicted from shared infrastructure.

Preemption

Some tiers allow requests to be preempted by higher-priority traffic. This means an active request may be interrupted if the system needs to serve a higher-priority request and no other workers are available.

Tier                 Preemptable  Priority
Free                 Yes          0
CPU AMD Optimized    No           20
CPU Intel Optimized  No           25
GPU NVIDIA Shared    Yes          30
GPU AMD Shared       Yes          30
GPU Intel Shared     Yes          30
XIM                  No           0

CPU tiers and the XIM tier are never preempted. Free and GPU shared tiers are preemptable. Higher priority values indicate higher scheduling priority when the system must choose which requests to serve.

When no compatible workers are available, the API returns a 503 Service Unavailable response with a Retry-After header indicating when to retry.
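A client can handle these 503 responses by honoring Retry-After when the header is present and otherwise backing off. A minimal sketch; the backoff parameters here are assumptions, not documented values:

```python
import random

def retry_delay_s(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a 503.

    Honors the Retry-After header value when present; otherwise
    falls back to capped exponential backoff with jitter.
    """
    if retry_after is not None:
        return float(retry_after)
    backoff = min(cap, base * (2 ** attempt))
    return backoff * (0.5 + random.random() / 2)  # jitter in [0.5x, 1.0x)
```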

Completion Storage

When store: true is set in a request, the completion is saved for later retrieval. Completions pass through two storage stages before expiration:

Tier                      Hot Storage (Redis)  Cold Storage (Object Store)  Total Retention
Free                      1 hour               1 day                        ~25 hours
CPU AMD Optimized         5 hours              7 days                       ~7.2 days
CPU Intel Optimized       6 hours              7 days                       ~7.25 days
GPU NVIDIA / AMD / Intel  7 hours              7 days                       ~7.3 days
XIM                       48 hours             14 days                      16 days

Hot storage provides fast retrieval from Redis. After the hot tier duration, completions are archived to object storage (cold tier) for the remaining retention period. After total retention expires, completions are automatically deleted.
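The two-stage retention schedule can be expressed directly from the table above; the tier keys here are illustrative shorthand, not API identifiers:

```python
from datetime import timedelta

# (hot, cold) retention per tier, transcribed from the table above.
RETENTION = {
    "free":      (timedelta(hours=1),  timedelta(days=1)),
    "cpu_amd":   (timedelta(hours=5),  timedelta(days=7)),
    "cpu_intel": (timedelta(hours=6),  timedelta(days=7)),
    "gpu":       (timedelta(hours=7),  timedelta(days=7)),
    "xim":       (timedelta(hours=48), timedelta(days=14)),
}

def total_retention(tier):
    """Total time a stored completion remains retrievable."""
    hot, cold = RETENTION[tier]
    return hot + cold
```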

Free Account Limits

Free accounts have the following restrictions:

  • Maximum 2 endpoints
  • Maximum 2 models (up to 4 GB each)
  • 100,000 tokens per month
  • 60 requests per minute
  • 30-second request timeout
  • Requests are preemptable

Choosing a Tier

Use Case                          Recommended Tier
Testing and prototyping           Free
Cost-sensitive batch processing   CPU AMD or CPU Intel
Low-latency production workloads  GPU NVIDIA Shared
Cost-optimized GPU inference      GPU AMD or GPU Intel Shared
Large models (>32 GB)             GPU tiers (up to 76 GB) or XIM (up to 304 GB)
Data sovereignty / compliance     XIM
Long-running requests (>5 min)    XIM (1800s timeout)

To change your endpoint's tier, go to the endpoint settings page in the Xerotier dashboard and select the new tier. The change takes effect immediately for new requests.

Code Examples

The service_tier is configured per endpoint, not per request. The following examples show how to make inference requests on an endpoint that has been assigned a specific tier. The actual tier used is returned in the service_tier response field.

Python (OpenAI SDK)

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_my-project_abc123"
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

# The service_tier field shows which tier processed the request
print(f"Service tier: {response.service_tier}")

Node.js (OpenAI SDK)

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_my-project_abc123'
});

const response = await client.chat.completions.create({
  model: 'my-model',
  messages: [{ role: 'user', content: 'Hello' }]
});

console.log(response.choices[0].message.content);

// The service_tier field shows which tier processed the request
console.log(`Service tier: ${response.service_tier}`);

curl

curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# The response includes "service_tier" indicating the endpoint's configured tier