Service Tiers - Xerotier

Tiers are divided into three categories:

Free: For evaluation and testing with shared CPU resources.
Compute (CPU and GPU Shared): Pay-per-token tiers with shared accelerator pools.
Self-Hosted: Bring your own infrastructure with flat hourly billing and no token metering.

User-selectable tier slugs (identifiers used in the API and configuration) are: free, cpu_amd_optimized, cpu_intel_optimized, gpu_nvidia_shared, gpu_amd_shared, gpu_intel_shared, self_hosted.

Dedicated GPU tiers are deprecated. The legacy gpu_nvidia_dedicated and gpu_amd_dedicated slugs are retained for backwards compatibility only. New endpoints created against these slugs are transparently remapped to their shared equivalents (gpu_nvidia_shared and gpu_amd_shared). Do not select a dedicated tier when creating new endpoints; pick the shared variant directly. Pinned, single-tenant GPU capacity is not currently offered as a self-service tier.
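The remapping described above can be mirrored on the client so that configuration files never carry a deprecated slug. The sketch below is illustrative only: `normalize_tier_slug` is a hypothetical helper, not part of any Xerotier SDK, and the platform performs the remapping on its own regardless.

```python
# Legacy dedicated slugs and the shared slugs they are remapped to (from the text above).
LEGACY_TIER_REMAP = {
    "gpu_nvidia_dedicated": "gpu_nvidia_shared",
    "gpu_amd_dedicated": "gpu_amd_shared",
}

# User-selectable tier slugs listed above.
SELECTABLE_TIERS = frozenset({
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
    "self_hosted",
})


def normalize_tier_slug(slug: str) -> str:
    """Hypothetical helper: map a legacy slug to its shared equivalent."""
    resolved = LEGACY_TIER_REMAP.get(slug, slug)
    if resolved not in SELECTABLE_TIERS:
        raise ValueError(f"unknown tier slug: {slug!r}")
    return resolved
```

Calling `normalize_tier_slug("gpu_amd_dedicated")` returns `"gpu_amd_shared"`; an unrecognized slug raises `ValueError` instead of being passed through silently.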

Tier Comparison

Pricing and Rate Limits

Tier (slug)	Pricing	Tokens/Min	Requests/Min	Max Model Size
Free (`free`)	$0.00	10,000	64	48 GB
CPU AMD Optimized (`cpu_amd_optimized`)	$0.50 / 1M tokens	100,000	128	Unlimited
CPU Intel Optimized (`cpu_intel_optimized`)	$0.50 / 1M tokens	100,000	128	Unlimited
GPU NVIDIA Shared (`gpu_nvidia_shared`)	$1.25 / 1M tokens	500,000	256	Unlimited
GPU AMD Shared (`gpu_amd_shared`)	$1.00 / 1M tokens	500,000	256	Unlimited
GPU Intel Shared (`gpu_intel_shared`)	$1.00 / 1M tokens	500,000	256	Unlimited
Self-Hosted (`self_hosted`)	~$20 / month (hourly)	Unlimited	Unlimited	512 GB

"Unlimited" in the rate limit columns means no per-minute token or request cap is enforced. Model size limit of "Unlimited" in the database means no size check is applied -- the limit is determined by available VRAM on the assigned workers.

Cached tokens are charged at 25% of the full per-token rate on all billable tiers. See Cached Token Pricing for details.
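As a rough pre-flight check against the Max Model Size column, the values can be encoded with `None` standing in for "Unlimited". This is a sketch under assumptions: `model_fits_tier` is a hypothetical helper, and passing it only means the tier-level check would not reject the model; worker memory still bounds what can actually load.

```python
from typing import Optional

# Max model size per tier in GB, from the table above; None stands for "Unlimited".
MAX_MODEL_SIZE_GB: dict[str, Optional[int]] = {
    "free": 48,
    "cpu_amd_optimized": None,
    "cpu_intel_optimized": None,
    "gpu_nvidia_shared": None,
    "gpu_amd_shared": None,
    "gpu_intel_shared": None,
    "self_hosted": 512,
}


def model_fits_tier(tier: str, model_size_gb: float) -> bool:
    """Return True if the tier-level size limit would not reject the model."""
    limit = MAX_MODEL_SIZE_GB[tier]
    return limit is None or model_size_gb <= limit
```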

Timeouts and Concurrency

Tier	Request Timeout	Idle Timeout	Max Concurrent	Batch Size
Free	30s	120s	8	1
CPU AMD / Intel	300s	600s	24	8
GPU NVIDIA / AMD / Intel Shared	300s	600s	48	16
Self-Hosted	1800s	3600s	96	64

All tiers except Free support both streaming responses and request batching. The Free tier supports streaming but not batching.
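One way to stay under the Max Concurrent column from the client side is to gate in-flight requests with a semaphore. The sketch below assumes the column is the number of simultaneous in-flight requests you want to stay under; `run_bounded` and the job callables are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")

# Max Concurrent per tier slug, from the table above.
MAX_CONCURRENT = {
    "free": 8,
    "cpu_amd_optimized": 24,
    "cpu_intel_optimized": 24,
    "gpu_nvidia_shared": 48,
    "gpu_amd_shared": 48,
    "gpu_intel_shared": 48,
    "self_hosted": 96,
}


async def run_bounded(tier: str, jobs: Iterable[Callable[[], Awaitable[T]]]) -> list[T]:
    """Run async jobs with at most MAX_CONCURRENT[tier] in flight at once."""
    gate = asyncio.Semaphore(MAX_CONCURRENT[tier])

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        async with gate:  # wait for a free slot before sending
            return await job()

    # gather preserves the order of the input jobs in its result list
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job would typically wrap one request made with an async client; results come back in the same order as the jobs were supplied.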

Tier Details

Free (`free`)

Shared CPU pool, no card on file. Two endpoints, two models, 100,000 tokens per month. The right tier for evaluating the routing surface and proving an SDK call works.

Accelerator: Shared CPU pool
Routing priority: 10 (lowest)
Monthly allowance: 100,000 tokens
Endpoint cap: 2
Model cap: 2 (<= 48 GB each)

CPU AMD Optimized (`cpu_amd_optimized`)

ZenDNN on AMD EPYC. Pick this for batch workloads where accelerator cost matters more than tail latency.

Accelerator: AMD ZenDNN CPU
Routing priority: 20
Distinguishing trait: Same rate caps as Intel CPU; pick by available capacity.

CPU Intel Optimized (`cpu_intel_optimized`)

Intel oneAPI on Xeon. Same caps, concurrency, and priority as the AMD CPU tier; choose based on which silicon your workload performs better on in your own benchmarks.

Accelerator: Intel oneAPI CPU
Routing priority: 20

GPU NVIDIA Shared (`gpu_nvidia_shared`)

NVIDIA CUDA on the shared GPU pool. The widest model compatibility of the shared GPU tiers because most published kernels target CUDA first.

Accelerator: NVIDIA CUDA
Routing priority: 25
Per-token rate: $1.25 / 1M (highest shared-GPU rate; widest kernel coverage)

GPU AMD Shared (`gpu_amd_shared`)

AMD ROCm on the shared GPU pool. Same rate caps and scheduling priority as the NVIDIA shared tier at $1.00 / 1M tokens.

Accelerator: AMD ROCm
Routing priority: 25
Per-token rate: $1.00 / 1M

GPU Intel Shared (`gpu_intel_shared`)

Intel oneAPI on Arc and Max GPUs. Tied with AMD ROCm for the lowest per-token rate among shared GPU tiers.

Accelerator: Intel oneAPI GPU
Routing priority: 25
Per-token rate: $1.00 / 1M

Self-Hosted (`self_hosted`)

Your workers, your infrastructure. No token metering, no request-rate cap, 1800s request timeout. Billed as a flat hourly rate for the routing and management layer (approximately $20 per worker-month at 24x7 enrollment). Workers are pinned to the project; model size is capped at a declared 512 GB.

Accelerator: Your own workers, any vendor or form factor
Routing priority: 30 (highest)
Max declared model size: 512 GB
Request timeout: 1800s
Billing: Flat hourly; no token meter

Note on accelerator matching. The Self-Hosted tier has no required GPU vendor set in its tier definition. As a result, the tier's compatibility set advertised to the router is [.cpu]. Routing to a self-hosted GPU worker still works because the router additionally filters candidate workers by the worker's own declared accelerator and by the model's VRAM footprint, and self-hosted endpoints are bound to specific worker agents. Operators sizing self-hosted clusters should plan capacity based on the workers they enroll, not on the tier's accelerator list.

Tier Selection

The service tier is configured per endpoint in the Xerotier dashboard or API. When you create or update an endpoint, you select which tier it uses. All requests to that endpoint are routed to workers compatible with the configured tier.

The endpoint's configured tier always determines which worker pool is eligible to serve a request; it cannot be overridden per request. The OpenAI-compatible service_tier request parameter does, however, influence both billing and routing within the configured tier:

`service_tier` value	Routing score adjustment	Token billing
`flex`	-15 (de-prioritized against other in-flight work)	Standard per-token rate
`default` (or omitted)	No adjustment	Standard per-token rate
`priority`	+15 (preferred against other in-flight work)	1.25x the standard per-token rate

The actual tier used is returned in the service_tier field of every response (both streaming chunks and non-streaming completions). When a request uses priority, both prompt and completion tokens are billed at the 1.25x multiplier on top of the endpoint tier's per-token rate; cached tokens are still discounted by the tier's cached token multiplier before the priority multiplier is applied.
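The billing rules above can be worked through as a small calculation. This is a sketch, not the platform's billing code: it assumes cached tokens are a subset of prompt tokens (as in the OpenAI usage object) and applies the cached discount before the priority multiplier, as the paragraph describes. `request_cost_usd` is a hypothetical helper.

```python
from decimal import Decimal

# Per-token rates in USD per 1M tokens for the billable tiers (from the pricing table).
RATE_PER_MILLION = {
    "cpu_amd_optimized": Decimal("0.50"),
    "cpu_intel_optimized": Decimal("0.50"),
    "gpu_nvidia_shared": Decimal("1.25"),
    "gpu_amd_shared": Decimal("1.00"),
    "gpu_intel_shared": Decimal("1.00"),
}
CACHED_MULTIPLIER = Decimal("0.25")
SERVICE_TIER_MULTIPLIER = {
    "flex": Decimal("1"),
    "default": Decimal("1"),
    "priority": Decimal("1.25"),
}


def request_cost_usd(tier: str, prompt_tokens: int, completion_tokens: int,
                     cached_tokens: int = 0, service_tier: str = "default") -> Decimal:
    """Estimate the cost of one request under the assumptions stated above."""
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")
    rate = RATE_PER_MILLION[tier] / Decimal(1_000_000)
    uncached_prompt = prompt_tokens - cached_tokens
    # Cached tokens are discounted first ...
    base = (uncached_prompt + completion_tokens) * rate
    base += cached_tokens * rate * CACHED_MULTIPLIER
    # ... then the service_tier multiplier applies to the whole request.
    return base * SERVICE_TIER_MULTIPLIER[service_tier]
```

For example, a `priority` request on `gpu_nvidia_shared` with 1,000,000 prompt tokens (400,000 of them cached) and 200,000 completion tokens comes to (800,000 x $1.25/1M + 400,000 x $1.25/1M x 0.25) x 1.25 = $1.40625 under these assumptions.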

The router selects backends within the configured tier using configurable routing strategies:

Least Loaded: Prefer workers with the lowest queue depth.
Lowest Latency: Prefer workers with the lowest predicted latency.
Model Affinity: Prefer workers that already have the model loaded.
Round Robin: Cycle through available workers evenly.
Composite: Weighted combination of multiple strategies.
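To make the idea of a weighted combination concrete, here is a toy scoring function. It is purely illustrative: the `Worker` fields, the normalization of each signal, and the weights are all assumptions, and nothing here describes how the Xerotier router actually computes its scores.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical snapshot of the signals a router might look at."""
    name: str
    queue_depth: int
    predicted_latency_ms: float
    has_model_loaded: bool


def composite_score(worker: Worker, weights: dict[str, float]) -> float:
    """Combine per-strategy signals, each mapped to (0, 1], into one score."""
    load = 1.0 / (1.0 + worker.queue_depth)                       # Least Loaded
    latency = 1.0 / (1.0 + worker.predicted_latency_ms / 1000.0)  # Lowest Latency
    affinity = 1.0 if worker.has_model_loaded else 0.0            # Model Affinity
    return (weights["least_loaded"] * load
            + weights["lowest_latency"] * latency
            + weights["model_affinity"] * affinity)


def pick_worker(workers: list[Worker], weights: dict[str, float]) -> Worker:
    """Return the worker with the highest composite score."""
    return max(workers, key=lambda w: composite_score(w, weights))
```

With affinity-heavy weights, a busier worker that already holds the model can outrank an idle one that would need to load it; shifting weight toward `least_loaded` reverses that choice.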

You can also pass optional X-SLO-TTFT-Ms and X-SLO-TPOT-Ms request headers to hint at latency targets. The router boosts preference for workers likely to meet these targets. See the SLO Tracking documentation for details.
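The per-request knobs described in this section (the `service_tier` body field and the two SLO headers) can be assembled in one place. The sketch below assumes the header values are integer milliseconds sent as strings, which the header names suggest but this page does not confirm; `build_request_options` is a hypothetical helper.

```python
from typing import Any, Optional

ALLOWED_SERVICE_TIERS = {"flex", "default", "priority"}


def build_request_options(service_tier: str = "default",
                          ttft_ms: Optional[int] = None,
                          tpot_ms: Optional[int] = None) -> dict[str, Any]:
    """Build keyword arguments for the routing hints described above."""
    if service_tier not in ALLOWED_SERVICE_TIERS:
        raise ValueError(f"unsupported service_tier: {service_tier!r}")
    headers: dict[str, str] = {}
    if ttft_ms is not None:
        headers["X-SLO-TTFT-Ms"] = str(ttft_ms)  # assumed: integer milliseconds
    if tpot_ms is not None:
        headers["X-SLO-TPOT-Ms"] = str(tpot_ms)  # assumed: integer milliseconds
    return {"service_tier": service_tier, "extra_headers": headers}
```

With a recent OpenAI Python SDK, the result can be unpacked into the call, for example `client.chat.completions.create(model="my-model", messages=msgs, **build_request_options("priority", ttft_ms=500))`, since `service_tier` and `extra_headers` are both accepted keyword arguments there.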

Cached Token Pricing

When prefix caching is active and input tokens are served from the KV cache (reported as prompt_tokens_details.cached_tokens in the usage object), those tokens are billed at a reduced rate. The cached_token_cost_multiplier for each tier determines the fraction of the full per-token price charged for cached tokens.

Tier	Full Rate	Cached Token Multiplier	Effective Cached Rate
Free	$0.00 / 1M	0.0 (not metered)	$0.00 / 1M
CPU AMD / Intel Optimized	$0.50 / 1M	0.25 (75% discount)	$0.125 / 1M
GPU NVIDIA Shared	$1.25 / 1M	0.25 (75% discount)	$0.3125 / 1M
GPU AMD / Intel Shared	$1.00 / 1M	0.25 (75% discount)	$0.25 / 1M
Self-Hosted	Flat hourly rate	0.0 (not metered)	N/A

Maximizing your prefix cache hit rate directly reduces your token costs. See Prefix Caching for prompt structuring recommendations.
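To see how the hit rate translates into cost, the cached fraction can be read from the usage object and folded into a blended input rate. This is a minimal sketch that treats the usage object as a plain dictionary (for example, parsed from the JSON response); both function names are hypothetical.

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens served from the KV cache, per the usage object."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0


def effective_input_rate(full_rate_per_million: float, hit_fraction: float,
                         cached_multiplier: float = 0.25) -> float:
    """Blended USD per 1M input tokens at a given cache hit fraction."""
    uncached_share = 1.0 - hit_fraction
    return full_rate_per_million * (uncached_share + hit_fraction * cached_multiplier)
```

At an 80% hit fraction on a $1.00 / 1M tier, the blended input rate works out to about $0.40 / 1M: 20% of tokens at the full rate plus 80% at a quarter of it.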

Custom Models and Tier Restrictions

Custom models (models that you upload and manage privately within your project) are restricted to the Self-Hosted tier. To use a model on shared compute tiers (Free, CPU, or GPU Shared), the model must be published to the public catalog and promoted to the shared catalog role.

This restriction ensures that shared infrastructure only runs models that have been explicitly published and vetted. Private models never run on shared agents.

Model Type	Allowed Tiers
Custom (private, not in catalog)	Self-Hosted only
Catalog model (deployable role)	Self-Hosted only
Catalog model (shared role)	All tiers

Catalog Roles and Tier Access

Models in the public catalog are assigned a catalog role that controls which tiers they can be deployed on:

Deployable: The default role for newly shared models. The model is visible in the catalog but can only be used with the Self-Hosted tier. Other users can see the model in the catalog but must deploy it on their own infrastructure.
Shared: The model is available on all tiers, including shared compute.

The catalog role is managed automatically by the platform based on the state of the shared-agent model cache. There is no manual promotion endpoint and no operator workflow to flip the role directly:

When at least one shared agent has the model cached and ready to serve, the platform promotes the catalog role to shared.
When no shared agent has the model cached, the platform demotes the catalog role back to deployable.

Because promotion follows shared-agent cache state, the set of catalog models available on shared tiers can change over time as agents load, evict, or rotate models.
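The promotion rule and the tier restrictions it implies can be summarized in a few lines. This is a sketch of the documented behavior, not platform code: `catalog_role` and `allowed_tiers` are hypothetical names, and the count of ready shared agents is something only the platform knows.

```python
ALL_TIERS = (
    "free",
    "cpu_amd_optimized",
    "cpu_intel_optimized",
    "gpu_nvidia_shared",
    "gpu_amd_shared",
    "gpu_intel_shared",
    "self_hosted",
)


def catalog_role(shared_agents_ready: int) -> str:
    """Role of a catalog model given how many shared agents have it cached and ready."""
    return "shared" if shared_agents_ready >= 1 else "deployable"


def allowed_tiers(in_catalog: bool, shared_agents_ready: int = 0) -> tuple[str, ...]:
    """Tiers a model may be deployed on, following the Allowed Tiers table."""
    if in_catalog and catalog_role(shared_agents_ready) == "shared":
        return ALL_TIERS
    # Custom models and deployable-role catalog models are Self-Hosted only.
    return ("self_hosted",)
```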

For details on how model sharing works and how to publish models to the catalog, see the Model Sharing documentation.

Shared Tier Caveats

Important: Shared models are subject to change. Models available on shared agents may be evicted or replaced at any time. Endpoints using shared models on shared tiers may become unavailable if the backing model is removed from shared infrastructure. For guaranteed availability and permanence, deploy the model on Self-Hosted infrastructure that you control.

Key considerations when using shared tiers with catalog models:

Model availability is not guaranteed. A model that is available on shared tiers today may be demoted to deployable-only tomorrow if shared agents evict it due to capacity constraints.
Endpoints may become unavailable. If the backing model is removed from shared infrastructure, endpoints on shared tiers will return 503 Service Unavailable until the model is re-provisioned or the endpoint is moved to the Self-Hosted tier.
Use Self-Hosted for critical workloads. If your application requires guaranteed model availability, deploy on the Self-Hosted tier, where you control the infrastructure and model lifecycle.
Monitor endpoint health. Use the dashboard or API to monitor endpoint provisioning status. If an endpoint shows as unprovisioned, the backing model may have been evicted from shared infrastructure.
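Because a shared-tier endpoint can start returning 503 Service Unavailable, clients may want a bounded retry with backoff before failing over or alerting. The sketch below is client-agnostic: `ServiceUnavailable` is a hypothetical stand-in for whatever exception your HTTP client raises on a 503, and `send` is any callable that performs the request.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the error your HTTP client raises on a 503."""


def call_with_backoff(send: Callable[[], T], attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send(), retrying on 503 with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return send()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

Retries only help with short gaps. If the backing model has been evicted, the endpoint stays unavailable until it is re-provisioned, so a persistent 503 is a signal to check provisioning status or move the endpoint to Self-Hosted.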

Completion Storage

When store: true is set in a request, the completion is saved for later retrieval. Completions pass through two storage stages before expiration:

Tier	Hot Storage (Redis)	Cold Storage (Object Store)	Total Retention
Free	1 hour	1 day	25 hours
CPU AMD / Intel Optimized	5 hours	7 days	7 days, 5 hours
GPU NVIDIA / AMD / Intel Shared	7 hours	7 days	7 days, 7 hours
Self-Hosted	48 hours	14 days	16 days

Hot storage provides fast retrieval from Redis. After the hot tier duration, completions are archived to object storage (cold tier) for the remaining retention period. After total retention expires, completions are automatically deleted.
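The two-stage lifecycle can be expressed as a lookup on the age of a stored completion. This is a sketch under assumptions: the group keys are hypothetical labels for the table rows, and the exact behavior at the stage boundaries is not specified on this page.

```python
from datetime import timedelta

# (hot, cold) durations per tier group, from the table above.
RETENTION = {
    "free": (timedelta(hours=1), timedelta(days=1)),
    "cpu": (timedelta(hours=5), timedelta(days=7)),
    "gpu_shared": (timedelta(hours=7), timedelta(days=7)),
    "self_hosted": (timedelta(hours=48), timedelta(days=14)),
}


def storage_stage(tier_group: str, age: timedelta) -> str:
    """Return 'hot', 'cold', or 'expired' for a completion of the given age."""
    hot, cold = RETENTION[tier_group]
    if age < hot:
        return "hot"
    if age < hot + cold:  # total retention is the sum of both stages
        return "cold"
    return "expired"
```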

Free Account Limits

Account-wide caps on the Free tier, on top of the per-minute rate limits in the comparison table:

2 endpoints per account.
2 models per account, 48 GB each.
100,000 tokens per month across both endpoints.
30s request timeout; batching is disabled.

Choosing a Tier

Use Case	Recommended Tier
Testing and prototyping	Free
Batch jobs where cost matters more than tail latency	CPU AMD or CPU Intel Optimized
Production traffic on a CUDA-targeting model	GPU NVIDIA Shared
GPU traffic where $1.00 / 1M tokens beats $1.25	GPU AMD or GPU Intel Shared
Models larger than 48 GB	Any paid Compute tier, or Self-Hosted up to 512 GB
Data sovereignty / compliance	Self-Hosted
Requests longer than 5 minutes	Self-Hosted (1800s timeout)

Change a tier from the endpoint settings page; new requests pick up the change on the next dispatch.

Code Examples

The endpoint's tier is set in the endpoint configuration, not per request; these requests do not select one. Pass an optional service_tier body field (flex, default, or priority) to nudge routing priority and billing inside the configured tier (see Tier Selection). The response echoes the actual tier used.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_my-project_abc123"
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

# The service_tier field shows which tier processed the request
print(f"Service tier: {response.service_tier}")

Node.js (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
    apiKey: 'xero_my-project_abc123'
});

const response = await client.chat.completions.create({
    model: 'my-model',
    messages: [{ role: 'user', content: 'Hello' }]
});

console.log(response.choices[0].message.content);

// The service_tier field shows which tier processed the request
console.log(`Service tier: ${response.service_tier}`);

curl

curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# The response includes "service_tier", indicating which tier processed the request
