
Service Tiers

Seven tiers, chosen per endpoint when you create or update it. Free for evals, pay-per-token on shared CPU and GPU pools, or flat-hourly on your own workers. The request-level service_tier hint only nudges priority and billing within the tier the endpoint is configured with.

Tiers are divided into three categories:

  • Free: For evaluation and testing with shared CPU resources.
  • Compute (CPU and GPU Shared): Pay-per-token tiers on shared CPU and GPU pools.
  • Self-Hosted: Bring your own infrastructure, with flat hourly billing and no token metering.

User-selectable tier slugs (identifiers used in the API and configuration) are: free, cpu_amd_optimized, cpu_intel_optimized, gpu_nvidia_shared, gpu_amd_shared, gpu_intel_shared, self_hosted.

Dedicated GPU tiers are deprecated. The legacy gpu_nvidia_dedicated and gpu_amd_dedicated slugs are retained for backwards compatibility only. New endpoints created against these slugs are transparently remapped to their shared equivalents (gpu_nvidia_shared and gpu_amd_shared). Do not select a dedicated tier when creating new endpoints; pick the shared variant directly. Pinned, single-tenant GPU capacity is not currently offered as a self-service tier.
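The remapping described above can be mirrored client-side, for example to normalize slugs stored in older configuration files. This is a minimal sketch: the helper name resolve_tier_slug is hypothetical and not part of the Xerotier API, and the platform performs the actual remapping on its side.

```python
# Legacy dedicated slugs and the shared slugs they are remapped to (from the text above).
DEPRECATED_TIER_REMAP = {
    "gpu_nvidia_dedicated": "gpu_nvidia_shared",
    "gpu_amd_dedicated": "gpu_amd_shared",
}

# The seven user-selectable slugs.
SELECTABLE_TIERS = (
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
    "self_hosted",
)


def resolve_tier_slug(slug: str) -> str:
    """Return the selectable slug a configured slug effectively maps to.

    Hypothetical helper for illustration; it only mirrors the documented mapping.
    """
    resolved = DEPRECATED_TIER_REMAP.get(slug, slug)
    if resolved not in SELECTABLE_TIERS:
        raise ValueError(f"unknown tier slug: {slug!r}")
    return resolved
```

Normalizing slugs before creating endpoints keeps stored configuration aligned with the tier the endpoint will actually run on.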

Tier Comparison

Pricing and Rate Limits

| Tier (slug) | Pricing | Tokens/Min | Requests/Min | Max Model Size |
|---|---|---|---|---|
| Free (free) | $0.00 | 10,000 | 64 | 48 GB |
| CPU AMD Optimized (cpu_amd_optimized) | $0.50 / 1M tokens | 100,000 | 128 | Unlimited |
| CPU Intel Optimized (cpu_intel_optimized) | $0.50 / 1M tokens | 100,000 | 128 | Unlimited |
| GPU NVIDIA Shared (gpu_nvidia_shared) | $1.25 / 1M tokens | 500,000 | 256 | Unlimited |
| GPU AMD Shared (gpu_amd_shared) | $1.00 / 1M tokens | 500,000 | 256 | Unlimited |
| GPU Intel Shared (gpu_intel_shared) | $1.00 / 1M tokens | 500,000 | 256 | Unlimited |
| Self-Hosted (self_hosted) | ~$20 / worker-month (flat hourly) | Unlimited | Unlimited | 512 GB |

"Unlimited" in the rate limit columns means no per-minute token or request cap is enforced. "Unlimited" in the Max Model Size column means no size check is applied; the practical limit is the VRAM available on the assigned workers.

Cached tokens are charged at 25% of the full per-token rate on all billable tiers. See Cached Token Pricing for details.

Timeouts and Concurrency

| Tier | Request Timeout | Idle Timeout | Max Concurrent | Batch Size |
|---|---|---|---|---|
| Free | 30s | 120s | 8 | 1 |
| CPU AMD / Intel | 300s | 600s | 24 | 8 |
| GPU NVIDIA / AMD / Intel Shared | 300s | 600s | 48 | 16 |
| Self-Hosted | 1800s | 3600s | 96 | 64 |

All tiers except Free support both streaming responses and request batching. The Free tier supports streaming but not batching.

Tier Details

Free (free)

Shared CPU pool, no card on file. Two endpoints, two models, 100,000 tokens per month. The right tier for evaluating the routing surface and proving an SDK call works.

Accelerator: Shared CPU pool
Routing priority: 10 (lowest)
Monthly allowance: 100,000 tokens
Endpoint cap: 2
Model cap: 2 (<= 48 GB each)

CPU AMD Optimized (cpu_amd_optimized)

ZenDNN on AMD EPYC. Pick this for batch workloads where accelerator cost matters more than tail latency.

Accelerator: AMD ZenDNN CPU
Routing priority: 20
Distinguishing trait: Same rate caps as Intel CPU; pick by available capacity.

CPU Intel Optimized (cpu_intel_optimized)

Intel oneAPI on Xeon. Same caps, concurrency, and priority as the AMD CPU tier; choose whichever silicon your workload benchmarks better on.

Accelerator: Intel oneAPI CPU
Routing priority: 20

GPU NVIDIA Shared (gpu_nvidia_shared)

NVIDIA CUDA on the shared GPU pool. The widest model compatibility of the shared GPU tiers because most published kernels target CUDA first.

Accelerator: NVIDIA CUDA
Routing priority: 25
Per-token rate: $1.25 / 1M (highest shared-GPU rate; widest kernel coverage)

GPU AMD Shared (gpu_amd_shared)

AMD ROCm on the shared GPU pool. Same rate caps and scheduling priority as the NVIDIA shared tier at $1.00 / 1M tokens.

Accelerator: AMD ROCm
Routing priority: 25
Per-token rate: $1.00 / 1M

GPU Intel Shared (gpu_intel_shared)

Intel oneAPI on Arc and Max GPUs. Tied with AMD ROCm for the lowest per-token rate among shared GPU tiers.

Accelerator: Intel oneAPI GPU
Routing priority: 25
Per-token rate: $1.00 / 1M

Self-Hosted (self_hosted)

Your workers, your infrastructure. No token metering, no request-rate cap, 1800s request timeout. Billed as a flat hourly rate for the routing and management layer (approximately $20 per worker-month at 24x7 enrollment). Workers are pinned to the project; the declared model size is capped at 512 GB.

Accelerator: Your own workers, any vendor or form factor
Routing priority: 30 (highest)
Max declared model size: 512 GB
Request timeout: 1800s
Billing: Flat hourly; no token meter

Note on accelerator matching. The Self-Hosted tier has no required GPU vendor set in its tier definition. As a result, the tier's compatibility set advertised to the router is [.cpu]. Routing to a self-hosted GPU worker still works because the router additionally filters candidate workers by the worker's own declared accelerator and by the model's VRAM footprint, and self-hosted endpoints are bound to specific worker agents. Operators sizing self-hosted clusters should plan capacity based on the workers they enroll, not on the tier's accelerator list.
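The filtering described in this note can be pictured roughly as follows. This is an illustrative sketch, not the router's implementation: the Worker fields, the accelerator labels, and the candidate_workers function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    # All field names are hypothetical; they stand in for what a worker declares.
    name: str
    accelerator: str                          # worker's own declared accelerator, e.g. "cuda"
    free_vram_gb: float                       # VRAM currently available on the worker
    bound_endpoints: frozenset = frozenset()  # endpoints this worker agent is bound to


def candidate_workers(workers, endpoint_id, required_accelerator, model_vram_gb):
    """Keep workers bound to the endpoint that match the accelerator and fit the model."""
    return [
        w
        for w in workers
        if endpoint_id in w.bound_endpoints
        and w.accelerator == required_accelerator
        and w.free_vram_gb >= model_vram_gb
    ]
```

Under these assumptions, a self-hosted GPU worker is selected because of what the worker declares, which is why capacity planning should start from the enrolled workers rather than from the tier's accelerator list.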

Tier Selection

The service tier is configured per endpoint in the Xerotier dashboard or API. When you create or update an endpoint, you select which tier it uses. All requests to that endpoint are routed to workers compatible with the configured tier.

The endpoint's configured tier always determines which worker pool is eligible to serve a request; it cannot be overridden per request. The OpenAI-compatible service_tier request parameter does, however, influence routing within the configured tier and billing:

| service_tier value | Routing score adjustment | Token billing |
|---|---|---|
| flex | -15 (de-prioritized against other in-flight work) | Standard per-token rate |
| default (or omitted) | No adjustment | Standard per-token rate |
| priority | +15 (preferred against other in-flight work) | 1.25x the standard per-token rate |

The actual tier used is returned in the service_tier field of every response (both streaming chunks and non-streaming completions). When a request uses priority, both prompt and completion tokens are billed at the 1.25x multiplier on top of the endpoint tier's per-token rate; cached tokens are still discounted by the tier's cached token multiplier before the priority multiplier is applied.

The router selects backends within the configured tier using configurable routing strategies:

  • Least Loaded: Prefer workers with the lowest queue depth.
  • Lowest Latency: Prefer workers with the lowest predicted latency.
  • Model Affinity: Prefer workers that already have the model loaded.
  • Round Robin: Cycle through available workers evenly.
  • Composite: Weighted combination of multiple strategies.
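To make the Composite strategy concrete, the sketch below combines three of the signals listed above into one weighted score. The scoring formulas, the weight names, and the WorkerStats fields are invented for illustration; the platform's actual scoring is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerStats:
    # Hypothetical per-worker signals, one per strategy.
    name: str
    queue_depth: int
    predicted_latency_ms: float
    model_loaded: bool


def composite_score(worker: WorkerStats, weights: dict) -> float:
    """Weighted sum of per-strategy scores; higher is better (assumed convention)."""
    least_loaded = 1.0 / (1 + worker.queue_depth)
    lowest_latency = 1.0 / (1.0 + worker.predicted_latency_ms / 1000.0)
    model_affinity = 1.0 if worker.model_loaded else 0.0
    return (
        weights.get("least_loaded", 0.0) * least_loaded
        + weights.get("lowest_latency", 0.0) * lowest_latency
        + weights.get("model_affinity", 0.0) * model_affinity
    )


def pick_worker(workers, weights):
    """Return the worker with the highest composite score."""
    return max(workers, key=lambda w: composite_score(w, weights))
```

Setting a single weight to 1.0 and the others to 0.0 reduces this sketch to the corresponding single strategy, which is one way to reason about how a weighted combination trades the signals off.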

You can also pass optional X-SLO-TTFT-Ms and X-SLO-TPOT-Ms request headers to hint at latency targets. The router boosts preference for workers likely to meet these targets. See the SLO Tracking documentation for details.
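A small helper can assemble these headers before a request is sent. This sketch assumes the header values are integer milliseconds, which the "Ms" suffix suggests but the text above does not state; the function name slo_headers is hypothetical.

```python
def slo_headers(ttft_ms=None, tpot_ms=None):
    """Build the optional SLO hint headers, omitting any target that is not set.

    Assumption: values are sent as integer milliseconds encoded as strings.
    """
    headers = {}
    if ttft_ms is not None:
        headers["X-SLO-TTFT-Ms"] = str(int(ttft_ms))
    if tpot_ms is not None:
        headers["X-SLO-TPOT-Ms"] = str(int(tpot_ms))
    return headers
```

With the OpenAI Python SDK, the resulting dictionary can be passed through the extra_headers argument of client.chat.completions.create, for example extra_headers=slo_headers(ttft_ms=500, tpot_ms=50).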

Cached Token Pricing

When prefix caching is active and input tokens are served from the KV cache (reported as prompt_tokens_details.cached_tokens in the usage object), those tokens are billed at a reduced rate. The cached_token_cost_multiplier for each tier determines the fraction of the full per-token price charged for cached tokens.

| Tier | Full Rate | Cached Token Multiplier | Effective Cached Rate |
|---|---|---|---|
| Free | $0.00 / 1M | 0.0 (not metered) | $0.00 / 1M |
| CPU AMD / Intel Optimized | $0.50 / 1M | 0.25 (75% discount) | $0.125 / 1M |
| GPU NVIDIA Shared | $1.25 / 1M | 0.25 (75% discount) | $0.3125 / 1M |
| GPU AMD / Intel Shared | $1.00 / 1M | 0.25 (75% discount) | $0.25 / 1M |
| Self-Hosted | Flat hourly rate | 0.0 (not metered) | N/A |

Maximizing your prefix cache hit rate directly reduces your token costs. See Prefix Caching for prompt structuring recommendations.
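As a worked example, the sketch below estimates the token charge for one request by combining the cached-token multiplier with the priority multiplier described under Tier Selection. It assumes cached_tokens is a subset of prompt_tokens (as in the OpenAI usage object) and ignores any rounding the billing system may apply; the function name is hypothetical.

```python
# Per-token rates in dollars per 1M tokens, taken from the tables above.
RATE_PER_1M = {
    "free": 0.0,
    "cpu_amd_optimized": 0.50,
    "cpu_intel_optimized": 0.50,
    "gpu_nvidia_shared": 1.25,
    "gpu_amd_shared": 1.00,
    "gpu_intel_shared": 1.00,
    "self_hosted": 0.0,  # flat hourly billing; no token meter
}
CACHED_MULTIPLIER = 0.25    # cached tokens cost 25% of the full rate on billable tiers
PRIORITY_MULTIPLIER = 1.25  # applied when service_tier is "priority"


def estimate_token_cost(tier, prompt_tokens, completion_tokens,
                        cached_tokens=0, service_tier="default"):
    """Estimate the dollar charge for one request (illustrative, no rounding)."""
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")
    rate = RATE_PER_1M[tier]
    uncached_prompt = prompt_tokens - cached_tokens
    full_price = (uncached_prompt + completion_tokens) * rate
    cached_price = cached_tokens * rate * CACHED_MULTIPLIER
    total = full_price + cached_price
    if service_tier == "priority":
        total *= PRIORITY_MULTIPLIER  # applied after the cached discount
    return total / 1_000_000
```

For a request on gpu_nvidia_shared with 1,000,000 prompt tokens (400,000 of them cached) and 200,000 completion tokens, this gives $1.125 at the default service tier and $1.40625 with priority.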

Custom Models and Tier Restrictions

Custom models (models that you upload and manage privately within your project) are restricted to the Self-Hosted tier. To use a model on shared compute tiers (Free, CPU, or GPU Shared), the model must be published to the public catalog and promoted to the shared catalog role.

This restriction ensures that shared infrastructure only runs models that have been explicitly published and vetted. Private models never run on shared agents.

| Model Type | Allowed Tiers |
|---|---|
| Custom (private, not in catalog) | Self-Hosted only |
| Catalog model (deployable role) | Self-Hosted only |
| Catalog model (shared role) | All tiers |

Catalog Roles and Tier Access

Models in the public catalog are assigned a catalog role that controls which tiers they can be deployed on:

  • Deployable: The default role for newly shared models. The model is visible in the catalog but can only be used with the Self-Hosted tier. Other users can see the model in the catalog but must deploy it on their own infrastructure.
  • Shared: The model is available on all tiers, including shared compute.

The catalog role is managed automatically by the platform based on the state of the shared-agent model cache. There is no manual promotion endpoint and no operator workflow to flip the role directly:

  • When at least one shared agent has the model cached and ready to serve, the platform promotes the catalog role to shared.
  • When no shared agent has the model cached, the platform demotes the catalog role back to deployable.

Because promotion follows shared-agent cache state, the set of catalog models available on shared tiers can change over time as agents load, evict, or rotate models.
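The promotion rule and the resulting tier access can be summarized in a few lines. This is a sketch of the documented behavior, not platform code: the function names and the shared_agents_ready parameter (the number of shared agents that have the model cached and ready to serve) are assumptions for the example.

```python
SHARED_COMPUTE_TIERS = (
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
)


def catalog_role(shared_agents_ready: int) -> str:
    """Role implied by shared-agent cache state: 'shared' if at least one agent is ready."""
    return "shared" if shared_agents_ready >= 1 else "deployable"


def allowed_tiers(in_catalog: bool, shared_agents_ready: int = 0) -> tuple:
    """Tiers a model may be deployed on, per the Model Type table."""
    if in_catalog and catalog_role(shared_agents_ready) == "shared":
        return SHARED_COMPUTE_TIERS + ("self_hosted",)
    # Custom (private) models and deployable-role catalog models.
    return ("self_hosted",)
```

Because the role is derived from cache state, the same catalog model can move between these two results over time without any change on the publisher's side.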

For details on how model sharing works and how to publish models to the catalog, see the Model Sharing documentation.

Shared Tier Caveats

Key considerations when using shared tiers with catalog models:

  • Model availability is not guaranteed. A model that is available on shared tiers today may be demoted to deployable-only tomorrow if shared agents evict it due to capacity constraints.
  • Endpoints may become unavailable. If the backing model is removed from shared infrastructure, endpoints on shared tiers will return 503 Service Unavailable until the model is re-provisioned or the endpoint is moved to the Self-Hosted tier.
  • Use Self-Hosted for critical workloads. If your application requires guaranteed model availability, deploy on the Self-Hosted tier, where you control the infrastructure and model lifecycle.
  • Monitor endpoint health. Use the dashboard or API to monitor endpoint provisioning status. If an endpoint shows as unprovisioned, the backing model may have been evicted from shared infrastructure.
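Clients on shared tiers may want to treat 503 Service Unavailable as a retryable condition with a bounded backoff. The sketch below assumes a caller-supplied send() function that returns a (status_code, body) pair; that interface, the helper name, and the backoff values are illustrative choices, not platform recommendations.

```python
import time


def call_with_503_retry(send, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send: zero-argument callable returning (status_code, body) (assumed interface).
    Waits base_delay_s * 2**attempt between attempts (exponential backoff).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)
    # Still 503 after all attempts: surface it so the caller can alert or fail over.
    return status, body
```

Retries only help with short gaps. If the model has been demoted to deployable-only, the 503 persists until the model is re-provisioned or the endpoint is moved to the Self-Hosted tier, so a final 503 is best treated as a signal to check endpoint provisioning status.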

Completion Storage

When store: true is set in a request, the completion is saved for later retrieval. Completions pass through two storage stages before expiration:

| Tier | Hot Storage (Redis) | Cold Storage (Object Store) | Total Retention |
|---|---|---|---|
| Free | 1 hour | 1 day | 25 hours |
| CPU AMD / Intel Optimized | 5 hours | 7 days | 7 days, 5 hours |
| GPU NVIDIA / AMD / Intel Shared | 7 hours | 7 days | 7 days, 7 hours |
| Self-Hosted | 48 hours | 14 days | 16 days |

Hot storage provides fast retrieval from Redis. After the hot tier duration, completions are archived to object storage (cold tier) for the remaining retention period. After total retention expires, completions are automatically deleted.
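The retention table translates directly into timestamps. The sketch below computes when a stored completion would move to cold storage and when it would be deleted; the tier group keys and the function name are hypothetical, and actual archival may not happen at the exact second.

```python
from datetime import datetime, timedelta, timezone

# (hot duration, cold duration) per tier group, from the table above.
RETENTION = {
    "free": (timedelta(hours=1), timedelta(days=1)),
    "cpu": (timedelta(hours=5), timedelta(days=7)),
    "gpu_shared": (timedelta(hours=7), timedelta(days=7)),
    "self_hosted": (timedelta(hours=48), timedelta(days=14)),
}


def storage_schedule(tier_group: str, stored_at: datetime):
    """Return (archived_at, deleted_at) for a completion stored at stored_at."""
    hot, cold = RETENTION[tier_group]
    archived_at = stored_at + hot    # end of hot (Redis) storage
    deleted_at = archived_at + cold  # end of total retention
    return archived_at, deleted_at


stored = datetime(2025, 1, 1, tzinfo=timezone.utc)
archived_at, deleted_at = storage_schedule("free", stored)
print(archived_at.isoformat(), deleted_at.isoformat())
```

Total retention is the sum of the two stages, which is why the Free tier's 1 hour plus 1 day appears as 25 hours in the table.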

Free Account Limits

Account-wide caps on the Free tier, on top of the per-minute rate limits in the comparison table:

  • 2 endpoints per account.
  • 2 models per account, 48 GB each.
  • 100,000 tokens per month across both endpoints.
  • 30s request timeout; batching is disabled.

Choosing a Tier

| Use Case | Recommended Tier |
|---|---|
| Testing and prototyping | Free |
| Batch jobs where cost matters more than tail latency | CPU AMD or CPU Intel Optimized |
| Production traffic on a CUDA-targeting model | GPU NVIDIA Shared |
| GPU traffic where $1.00 / 1M tokens beats $1.25 | GPU AMD or GPU Intel Shared |
| Models larger than 48 GB | Any paid Compute tier, or Self-Hosted up to 512 GB |
| Data sovereignty / compliance | Self-Hosted |
| Requests longer than 5 minutes | Self-Hosted (1800s timeout) |

Change a tier from the endpoint settings page; new requests pick up the change on the next dispatch.
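Two of the rows above (model size and request duration) are hard limits that can be checked mechanically. The sketch below filters tiers by the documented request timeout and maximum model size; the function name is hypothetical, and it deliberately ignores cost, accelerator compatibility, and catalog role restrictions.

```python
# (request timeout in seconds, max model size in GB or None for "Unlimited"),
# taken from the comparison tables.
TIER_LIMITS = {
    "free": (30, 48),
    "cpu_amd_optimized": (300, None),
    "cpu_intel_optimized": (300, None),
    "gpu_nvidia_shared": (300, None),
    "gpu_amd_shared": (300, None),
    "gpu_intel_shared": (300, None),
    "self_hosted": (1800, 512),
}


def tiers_that_fit(model_size_gb: float, expected_request_s: float) -> list:
    """Return tier slugs whose size cap and request timeout both accommodate the workload."""
    fits = []
    for slug, (timeout_s, max_gb) in TIER_LIMITS.items():
        size_ok = max_gb is None or model_size_gb <= max_gb
        time_ok = expected_request_s <= timeout_s
        if size_ok and time_ok:
            fits.append(slug)
    return fits
```

For example, a 100 GB model with 60-second requests fits every tier except Free, while any request expected to run for 600 seconds leaves only self_hosted.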

Code Examples

The endpoint's tier is set in the endpoint configuration; these requests do not select one. Pass an optional service_tier body field (flex, default, or priority) to nudge routing priority and billing inside the configured tier (see Tier Selection). The response echoes the actual tier used.

Python (OpenAI SDK)

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
        api_key="xero_my-project_abc123",
    )

    response = client.chat.completions.create(
        model="my-model",
        messages=[{"role": "user", "content": "Hello"}],
    )

    print(response.choices[0].message.content)

    # The service_tier field shows which tier processed the request
    print(f"Service tier: {response.service_tier}")

Node.js (OpenAI SDK)

    import OpenAI from 'openai';

    const client = new OpenAI({
      baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
      apiKey: 'xero_my-project_abc123'
    });

    const response = await client.chat.completions.create({
      model: 'my-model',
      messages: [{ role: 'user', content: 'Hello' }]
    });

    console.log(response.choices[0].message.content);

    // The service_tier field shows which tier processed the request
    console.log(`Service tier: ${response.service_tier}`);

curl

    curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
      -H "Authorization: Bearer xero_my-project_abc123" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # The response includes "service_tier" indicating the endpoint's configured tier
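Python with the service_tier hint (standard library only)

The examples above omit the optional service_tier field. The sketch below shows one way to include it using only the Python standard library; it reuses the placeholder URL and key from the examples, and build_chat_request is a hypothetical helper that prepares the request without sending it.

```python
import json
import urllib.request

ALLOWED_SERVICE_TIERS = ("flex", "default", "priority")


def build_chat_request(base_url, api_key, model, content, service_tier=None):
    """Prepare a chat completions POST request, optionally with a service_tier hint."""
    body = {"model": model, "messages": [{"role": "user", "content": content}]}
    if service_tier is not None:
        if service_tier not in ALLOWED_SERVICE_TIERS:
            raise ValueError(f"unsupported service_tier: {service_tier!r}")
        body["service_tier"] = service_tier
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    "xero_my-project_abc123",
    "my-model",
    "Hello",
    service_tier="priority",
)
```

Passing the prepared request to urllib.request.urlopen sends it. With priority, tokens are billed at 1.25x the endpoint tier's per-token rate, so the hint is best reserved for latency-sensitive calls.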