Service Tiers
Seven tiers, chosen per endpoint at creation. Free for evals, pay-per-token on shared CPU and GPU pools, or flat-hourly on your own workers. The request-level service_tier hint only nudges priority and billing within the tier you already picked.
Tiers are divided into three categories:
- Free: For evaluation and testing with shared CPU resources.
- Compute (CPU and GPU Shared): Pay-per-token tiers with shared accelerator pools.
- Self-Hosted: Bring your own infrastructure with flat hourly billing and no token metering.
User-selectable tier slugs (identifiers used in the API and configuration) are:
free, cpu_amd_optimized, cpu_intel_optimized,
gpu_nvidia_shared, gpu_amd_shared, gpu_intel_shared,
self_hosted.
Dedicated GPU tiers are deprecated. The legacy
gpu_nvidia_dedicated and gpu_amd_dedicated slugs are
retained for backwards compatibility only. New endpoints created against these
slugs are transparently remapped to their shared equivalents
(gpu_nvidia_shared and gpu_amd_shared). Do not select
a dedicated tier when creating new endpoints; pick the shared variant
directly. Pinned, single-tenant GPU capacity is not currently offered as a
self-service tier.
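The remap described above can be mirrored on the client side so that configuration files never carry a deprecated slug. The following is a minimal sketch: the `normalize_tier_slug` helper and its constants are hypothetical and not part of any Xerotier SDK; the platform performs the remap itself.

```python
# Hypothetical client-side helper; only the slug names come from this page.
LEGACY_TIER_REMAP = {
    "gpu_nvidia_dedicated": "gpu_nvidia_shared",
    "gpu_amd_dedicated": "gpu_amd_shared",
}

SELECTABLE_TIERS = frozenset({
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
    "self_hosted",
})


def normalize_tier_slug(slug: str) -> str:
    """Return a user-selectable slug, remapping deprecated dedicated tiers."""
    slug = LEGACY_TIER_REMAP.get(slug, slug)
    if slug not in SELECTABLE_TIERS:
        raise ValueError(f"unknown tier slug: {slug!r}")
    return slug
```

Running stored configuration through a helper like this before creating an endpoint makes the shared slug explicit instead of relying on the transparent remap.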
Tier Comparison
Pricing and Rate Limits
| Tier (slug) | Pricing | Tokens/Min | Requests/Min | Max Model Size |
|---|---|---|---|---|
| Free (free) | $0.00 | 10,000 | 64 | 48 GB |
| CPU AMD Optimized (cpu_amd_optimized) | $0.50 / 1M tokens | 100,000 | 128 | Unlimited |
| CPU Intel Optimized (cpu_intel_optimized) | $0.50 / 1M tokens | 100,000 | 128 | Unlimited |
| GPU NVIDIA Shared (gpu_nvidia_shared) | $1.25 / 1M tokens | 500,000 | 256 | Unlimited |
| GPU AMD Shared (gpu_amd_shared) | $1.00 / 1M tokens | 500,000 | 256 | Unlimited |
| GPU Intel Shared (gpu_intel_shared) | $1.00 / 1M tokens | 500,000 | 256 | Unlimited |
| Self-Hosted (self_hosted) | ~$20 / month (hourly) | Unlimited | Unlimited | 512 GB |
"Unlimited" in the rate limit columns means no per-minute token or request cap is enforced. "Unlimited" in the Max Model Size column means no size check is applied; the practical limit is the VRAM available on the assigned workers.
Cached tokens are charged at 25% of the full per-token rate on all billable tiers. See Cached Token Pricing for details.
Timeouts and Concurrency
| Tier | Request Timeout | Idle Timeout | Max Concurrent | Batch Size |
|---|---|---|---|---|
| Free | 30s | 120s | 8 | 1 |
| CPU AMD / Intel | 300s | 600s | 24 | 8 |
| GPU NVIDIA / AMD / Intel Shared | 300s | 600s | 48 | 16 |
| Self-Hosted | 1800s | 3600s | 96 | 64 |
All tiers except Free support both streaming responses and request batching. The Free tier supports streaming but not batching.
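A client that fans out many requests may want to stay under the Max Concurrent value of its endpoint's tier. The sketch below uses an `asyncio.Semaphore` for that. It assumes the Max Concurrent column is a cap on in-flight requests per endpoint, which this page does not state explicitly; `run_bounded` and `MAX_CONCURRENT` are hypothetical names.

```python
import asyncio

# Max Concurrent values copied from the table above.
MAX_CONCURRENT = {
    "free": 8,
    "cpu_amd_optimized": 24,
    "cpu_intel_optimized": 24,
    "gpu_nvidia_shared": 48,
    "gpu_amd_shared": 48,
    "gpu_intel_shared": 48,
    "self_hosted": 96,
}


async def run_bounded(tier, jobs):
    """Run zero-argument coroutine functions with a tier-sized in-flight cap.

    Results are returned in the same order as `jobs`.
    """
    gate = asyncio.Semaphore(MAX_CONCURRENT[tier])

    async def guarded(job):
        # Wait for a free slot, then run the job while holding it.
        async with gate:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job would typically wrap one request to the endpoint, for example a call made through an async HTTP or SDK client.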
Tier Details
Free (free)
Shared CPU pool, no card on file. Two endpoints, two models, 100,000 tokens per month. The right tier for evaluating the routing surface and proving an SDK call works.
- Accelerator: Shared CPU pool
- Routing priority: 10 (lowest)
- Monthly allowance: 100,000 tokens
- Endpoint cap: 2
- Model cap: 2 (<= 48 GB each)
CPU AMD Optimized (cpu_amd_optimized)
ZenDNN on AMD EPYC. Pick this for batch workloads where accelerator cost matters more than tail latency.
- Accelerator: AMD ZenDNN CPU
- Routing priority: 20
- Distinguishing trait: Same rate caps as Intel CPU; pick by available capacity.
CPU Intel Optimized (cpu_intel_optimized)
Intel oneAPI on Xeon. Same caps, concurrency, and priority as the AMD CPU tier; choose based on which silicon your workload benchmarks against.
- Accelerator: Intel oneAPI CPU
- Routing priority: 20
GPU NVIDIA Shared (gpu_nvidia_shared)
NVIDIA CUDA on the shared GPU pool. The widest model compatibility of the shared GPU tiers because most published kernels target CUDA first.
- Accelerator: NVIDIA CUDA
- Routing priority: 25
- Per-token rate: $1.25 / 1M (highest shared-GPU rate; widest kernel coverage)
GPU AMD Shared (gpu_amd_shared)
AMD ROCm on the shared GPU pool. Same rate caps and scheduling priority as the NVIDIA shared tier at $1.00 / 1M tokens.
- Accelerator: AMD ROCm
- Routing priority: 25
- Per-token rate: $1.00 / 1M
GPU Intel Shared (gpu_intel_shared)
Intel oneAPI on Arc and Max GPUs. Tied with AMD ROCm for the lowest per-token rate among shared GPU tiers.
- Accelerator: Intel oneAPI GPU
- Routing priority: 25
- Per-token rate: $1.00 / 1M
Self-Hosted (self_hosted)
Your workers, your infrastructure. No token metering, no request-rate cap, 1800s request timeout. Billed as a flat hourly rate for the routing and management layer (approximately $20 per worker-month at 24x7 enrollment). Workers are pinned to the project; model size is capped at a declared 512 GB.
- Accelerator: Your own workers, any vendor or form factor
- Routing priority: 30 (highest)
- Max declared model size: 512 GB
- Request timeout: 1800s
- Billing: Flat hourly; no token meter
Note on accelerator matching. The Self-Hosted tier has no
required GPU vendor set in its tier definition. As a result, the tier's
compatibility set advertised to the router is [.cpu]. Routing to
a self-hosted GPU worker still works because the router additionally filters
candidate workers by the worker's own declared accelerator and by the model's
VRAM footprint, and self-hosted endpoints are bound to specific worker
agents. Operators sizing self-hosted clusters should plan capacity based on
the workers they enroll, not on the tier's accelerator list.
Tier Selection
The service tier is configured per endpoint in the Xerotier dashboard or API. When you create or update an endpoint, you select which tier it uses. All requests to that endpoint are routed to workers compatible with the configured tier.
The endpoint's configured tier always determines which worker pool is eligible
to serve a request; it cannot be overridden per request. The OpenAI-compatible
service_tier request parameter does, however, affect both routing priority
within the configured tier and billing:
| service_tier value | Routing score adjustment | Token billing |
|---|---|---|
| flex | -15 (de-prioritized against other in-flight work) | Standard per-token rate |
| default (or omitted) | No adjustment | Standard per-token rate |
| priority | +15 (preferred against other in-flight work) | 1.25x the standard per-token rate |
The actual tier used is returned in the service_tier field of every
response (both streaming chunks and non-streaming completions). When a request
uses priority, both prompt and completion tokens are billed at the
1.25x multiplier on top of the endpoint tier's per-token rate; cached tokens are
still discounted by the tier's cached token multiplier before the priority
multiplier is applied.
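The billing order described above (cached discount first, then the priority multiplier) can be made concrete with a small calculation. This is a sketch, not the platform's billing code: `request_cost_usd` is a hypothetical helper, it assumes `cached_tokens` is a subset of `prompt_tokens` as in the OpenAI usage object, and it ignores any rounding the billing system may apply.

```python
from decimal import Decimal

# Per-token rates (USD per 1M tokens) from the pricing table on this page.
RATE_PER_MILLION = {
    "cpu_amd_optimized": Decimal("0.50"),
    "cpu_intel_optimized": Decimal("0.50"),
    "gpu_nvidia_shared": Decimal("1.25"),
    "gpu_amd_shared": Decimal("1.00"),
    "gpu_intel_shared": Decimal("1.00"),
}
CACHED_TOKEN_MULTIPLIER = Decimal("0.25")
SERVICE_TIER_MULTIPLIER = {
    "flex": Decimal("1"),
    "default": Decimal("1"),
    "priority": Decimal("1.25"),
}


def request_cost_usd(tier, prompt_tokens, cached_tokens, completion_tokens,
                     service_tier="default"):
    """Estimate the cost of one request on a pay-per-token tier."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    rate = RATE_PER_MILLION[tier] / Decimal(1_000_000)
    uncached_prompt = prompt_tokens - cached_tokens
    # Step 1: apply the cached-token discount to the cached part of the prompt.
    base = (uncached_prompt + completion_tokens) * rate
    base += cached_tokens * rate * CACHED_TOKEN_MULTIPLIER
    # Step 2: apply the service_tier multiplier to the whole request.
    return base * SERVICE_TIER_MULTIPLIER[service_tier]
```

For example, a `priority` request on `gpu_nvidia_shared` with 1,000 prompt tokens (800 of them cached) and 200 completion tokens comes to $0.0009375 under these assumptions, versus $0.00075 at `default`.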
The router selects backends within the configured tier using configurable routing strategies:
- Least Loaded: Prefer workers with the lowest queue depth.
- Lowest Latency: Prefer workers with the lowest predicted latency.
- Model Affinity: Prefer workers that already have the model loaded.
- Round Robin: Cycle through available workers evenly.
- Composite: Weighted combination of multiple strategies.
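To illustrate what a weighted combination of the strategies listed above could look like, here is a purely illustrative sketch. The router's actual scoring formula, signals, and weights are not documented on this page; `WorkerSnapshot`, `composite_score`, and `pick_worker` are hypothetical names, and the normalization used is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    name: str
    queue_depth: int
    predicted_latency_ms: float
    model_loaded: bool


def composite_score(worker, weights, max_queue, max_latency_ms):
    """Weighted sum of three per-strategy scores, each scaled to [0, 1]."""
    load = 1.0 - worker.queue_depth / max_queue if max_queue else 1.0
    latency = (1.0 - worker.predicted_latency_ms / max_latency_ms
               if max_latency_ms else 1.0)
    affinity = 1.0 if worker.model_loaded else 0.0
    return (weights["least_loaded"] * load
            + weights["lowest_latency"] * latency
            + weights["model_affinity"] * affinity)


def pick_worker(workers, weights):
    """Return the worker with the highest composite score."""
    if not workers:
        raise ValueError("no candidate workers")
    max_queue = max(w.queue_depth for w in workers)
    max_latency_ms = max(w.predicted_latency_ms for w in workers)
    return max(
        workers,
        key=lambda w: composite_score(w, weights, max_queue, max_latency_ms),
    )
```

With equal weights, a worker that already has the model loaded can outrank an idle worker that would need to load it; putting all the weight on `least_loaded` reduces the sketch to the Least Loaded strategy.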
You can also pass optional X-SLO-TTFT-Ms and X-SLO-TPOT-Ms
request headers to hint at latency targets. The router boosts preference for workers
likely to meet these targets. See the
SLO Tracking documentation for details.
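A small helper can build these headers so they are only sent when a target is set. This is a sketch under stated assumptions: `slo_headers` is a hypothetical name, and the header values are assumed to be positive integer milliseconds, which the header names suggest but this page does not confirm.

```python
from typing import Dict, Optional


def slo_headers(ttft_ms: Optional[int] = None,
                tpot_ms: Optional[int] = None) -> Dict[str, str]:
    """Build optional latency-target headers; omitted targets add no header."""
    headers: Dict[str, str] = {}
    targets = (("X-SLO-TTFT-Ms", ttft_ms), ("X-SLO-TPOT-Ms", tpot_ms))
    for name, value in targets:
        if value is None:
            continue
        if value <= 0:
            raise ValueError(f"{name} must be a positive number of milliseconds")
        headers[name] = str(int(value))
    return headers
```

With the OpenAI Python SDK, the result can be passed per request through the `extra_headers` argument, for example `client.chat.completions.create(..., extra_headers=slo_headers(ttft_ms=200, tpot_ms=50))`.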
Cached Token Pricing
When prefix caching is active and input tokens are served from the KV cache
(reported as prompt_tokens_details.cached_tokens in the usage
object), those tokens are billed at a reduced rate. The
cached_token_cost_multiplier for each tier determines the
fraction of the full per-token price charged for cached tokens.
| Tier | Full Rate | Cached Token Multiplier | Effective Cached Rate |
|---|---|---|---|
| Free | $0.00 / 1M | 0.0 (not metered) | $0.00 / 1M |
| CPU AMD / Intel Optimized | $0.50 / 1M | 0.25 (75% discount) | $0.125 / 1M |
| GPU NVIDIA Shared | $1.25 / 1M | 0.25 (75% discount) | $0.3125 / 1M |
| GPU AMD / Intel Shared | $1.00 / 1M | 0.25 (75% discount) | $0.25 / 1M |
| Self-Hosted | Flat hourly rate | 0.0 (not metered) | N/A |
Maximizing your prefix cache hit rate directly reduces your token costs. See Prefix Caching for prompt structuring recommendations.
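To see how much the hit rate matters, the blended price of input tokens can be computed from the full rate, the cached token multiplier, and the fraction of input tokens served from cache. The helper below is an illustrative sketch; `blended_input_rate` is a hypothetical name, and the result covers input tokens only.

```python
from decimal import Decimal

CACHED_TOKEN_MULTIPLIER = Decimal("0.25")  # billable tiers, per the table above


def blended_input_rate(full_rate_per_million: Decimal,
                       cache_hit_fraction: Decimal,
                       cached_multiplier: Decimal = CACHED_TOKEN_MULTIPLIER) -> Decimal:
    """Effective USD per 1M input tokens at a given cache hit fraction (0..1)."""
    if not Decimal(0) <= cache_hit_fraction <= Decimal(1):
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    uncached_share = Decimal(1) - cache_hit_fraction
    return full_rate_per_million * (
        uncached_share + cache_hit_fraction * cached_multiplier
    )
```

On GPU NVIDIA Shared ($1.25 / 1M), an 80% hit rate gives a blended input rate of $0.50 / 1M, and a 100% hit rate gives the table's effective cached rate of $0.3125 / 1M.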
Custom Models and Tier Restrictions
Custom models (models that you upload and manage privately within your project)
are restricted to the Self-Hosted tier. To use a model on shared
compute tiers (Free, CPU, or GPU Shared), the model must be published to the public catalog
and promoted to the shared catalog role.
This restriction ensures that shared infrastructure only runs models that have been explicitly published and vetted. Private models never run on shared agents.
| Model Type | Allowed Tiers |
|---|---|
| Custom (private, not in catalog) | Self-Hosted only |
| Catalog model (deployable role) | Self-Hosted only |
| Catalog model (shared role) | All tiers |
Catalog Roles and Tier Access
Models in the public catalog are assigned a catalog role that controls which tiers they can be deployed on:
- Deployable: The default role for newly shared models. The model is visible in the catalog but can only be used with the Self-Hosted tier. Other users can see the model in the catalog but must deploy it on their own infrastructure.
- Shared: The model is available on all tiers, including shared compute.
The catalog role is managed automatically by the platform based on the state of the shared-agent model cache. There is no manual promotion endpoint and no operator workflow to flip the role directly:
- When at least one shared agent has the model cached and ready to serve, the platform promotes the catalog role to shared.
- When no shared agent has the model cached, the platform demotes the catalog role back to deployable.
Because promotion follows shared-agent cache state, the set of catalog models available on shared tiers can change over time as agents load, evict, or rotate models.
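The promotion rule and the tier restrictions from the table above can be summarized in a few lines. This is a sketch of the documented behavior, not the platform's implementation; `catalog_role` and `allowed_tiers` are hypothetical names, and a custom (non-catalog) model is represented here by a role of `None`.

```python
from typing import Optional, Tuple

SHARED_COMPUTE_TIERS = (
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
)
SELF_HOSTED = "self_hosted"


def catalog_role(shared_agents_with_model_cached: int) -> str:
    """Role implied by how many shared agents have the model cached and ready."""
    return "shared" if shared_agents_with_model_cached >= 1 else "deployable"


def allowed_tiers(role: Optional[str]) -> Tuple[str, ...]:
    """Tiers a model may be deployed on; role=None means a custom model."""
    if role == "shared":
        return SHARED_COMPUTE_TIERS + (SELF_HOSTED,)
    if role in (None, "deployable"):
        return (SELF_HOSTED,)
    raise ValueError(f"unknown catalog role: {role!r}")
```

Because the role can change as agents load and evict models, a client that caches the result of a check like this should be prepared to re-check it.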
For details on how model sharing works and how to publish models to the catalog, see the Model Sharing documentation.
Completion Storage
When store: true is set in a request, the completion is saved for
later retrieval. Completions pass through two storage stages before expiration:
| Tier | Hot Storage (Redis) | Cold Storage (Object Store) | Total Retention |
|---|---|---|---|
| Free | 1 hour | 1 day | 25 hours |
| CPU AMD / Intel Optimized | 5 hours | 7 days | 7 days, 5 hours |
| GPU NVIDIA / AMD / Intel Shared | 7 hours | 7 days | 7 days, 7 hours |
| Self-Hosted | 48 hours | 14 days | 16 days |
Hot storage provides fast retrieval from Redis. After the hot tier duration, completions are archived to object storage (cold tier) for the remaining retention period. After total retention expires, completions are automatically deleted.
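The two-stage lifecycle can be expressed as a lookup on the age of a stored completion. The sketch below copies the durations from the table above; `storage_stage` and the tier group keys are hypothetical, and the exact boundary behavior (a completion moving stages at precisely the listed duration) is an assumption.

```python
from datetime import timedelta

# (hot duration, cold duration) per tier group, from the table above.
RETENTION = {
    "free": (timedelta(hours=1), timedelta(days=1)),
    "cpu": (timedelta(hours=5), timedelta(days=7)),
    "gpu_shared": (timedelta(hours=7), timedelta(days=7)),
    "self_hosted": (timedelta(hours=48), timedelta(days=14)),
}


def storage_stage(tier_group: str, age: timedelta) -> str:
    """Return 'hot', 'cold', or 'expired' for a completion of the given age."""
    hot, cold = RETENTION[tier_group]
    if age < hot:
        return "hot"
    if age < hot + cold:
        return "cold"
    return "expired"
```

Total retention is the sum of the two durations, which is why the Free tier's 1 hour plus 1 day gives 25 hours.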
Free Account Limits
Account-wide caps on the Free tier, on top of the per-minute rate limits in the comparison table:
- 2 endpoints per account.
- 2 models per account, up to 48 GB each.
- 100,000 tokens per month across both endpoints.
- 30s request timeout; batching is disabled.
Choosing a Tier
| Use Case | Recommended Tier |
|---|---|
| Testing and prototyping | Free |
| Batch jobs where cost matters more than tail latency | CPU AMD or CPU Intel Optimized |
| Production traffic on a CUDA-targeting model | GPU NVIDIA Shared |
| GPU traffic where $1.00 / 1M tokens beats $1.25 | GPU AMD or GPU Intel Shared |
| Models larger than 48 GB | Any paid Compute tier, or Self-Hosted up to 512 GB |
| Data sovereignty / compliance | Self-Hosted |
| Requests longer than 5 minutes | Self-Hosted (1800s timeout) |
Change a tier from the endpoint settings page; new requests pick up the change on the next dispatch.
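Two of the rows above (model size and request duration) are hard constraints that can be checked mechanically. The following sketch filters tiers by those two limits only; `candidate_tiers` is a hypothetical helper, and it does not account for cost, compliance, or catalog role restrictions.

```python
from typing import Dict, List, Optional, Tuple

# (max model size in GB, request timeout in seconds); None means no size check,
# in which case the practical limit is the memory of the assigned workers.
TIER_CAPS: Dict[str, Tuple[Optional[int], int]] = {
    "free": (48, 30),
    "cpu_amd_optimized": (None, 300),
    "cpu_intel_optimized": (None, 300),
    "gpu_nvidia_shared": (None, 300),
    "gpu_amd_shared": (None, 300),
    "gpu_intel_shared": (None, 300),
    "self_hosted": (512, 1800),
}


def candidate_tiers(model_size_gb: float, max_request_s: float) -> List[str]:
    """Tiers whose size cap and request timeout both accommodate the workload."""
    matches = []
    for tier, (max_gb, timeout_s) in TIER_CAPS.items():
        if max_gb is not None and model_size_gb > max_gb:
            continue
        if max_request_s > timeout_s:
            continue
        matches.append(tier)
    return matches
```

A 10 GB model with requests of up to 600 seconds leaves only `self_hosted`, while a 70 GB model with 60-second requests rules out only `free`.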
Code Examples
The endpoint's tier comes from its configuration; these requests do
not select one. Pass an optional service_tier
body field (flex, default, or
priority) to nudge routing priority and billing
inside the configured tier
(see Tier Selection). The
response echoes the actual tier used.
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_my-project_abc123",
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

# The service_tier field shows which tier processed the request
print(f"Service tier: {response.service_tier}")
Node.js (OpenAI SDK)
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_my-project_abc123',
});

const response = await client.chat.completions.create({
  model: 'my-model',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(response.choices[0].message.content);

// The service_tier field shows which tier processed the request
console.log(`Service tier: ${response.service_tier}`);
curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
# The response includes "service_tier", indicating the tier actually used