Service Tiers
Xerotier offers seven service tiers with different pricing, performance, and availability characteristics. The tier is configured per endpoint, not per request.
Overview
Each endpoint in Xerotier is assigned a service tier that determines its pricing, rate limits, timeouts, maximum model size, and which accelerator hardware can serve requests. The tier is set when creating or updating an endpoint in the dashboard -- it cannot be overridden per request.
Tiers are divided into three categories:
- Free -- For evaluation and testing with shared CPU resources.
- Compute (CPU and GPU) -- Pay-per-token tiers with dedicated accelerator types.
- XIM -- Bring your own infrastructure with no token metering.
Tier Comparison
Pricing and Rate Limits
| Tier | Pricing | Tokens/Min | Requests/Min | Max Model Size |
|---|---|---|---|---|
| Free | $0.00 | 10,000 | 60 | 4 GB |
| CPU AMD Optimized | $0.225 / 1M tokens | 100,000 | 120 | 32 GB |
| CPU Intel Optimized | $0.225 / 1M tokens | 100,000 | 120 | 32 GB |
| GPU NVIDIA Shared | $0.75 / 1M tokens | 250,000 | 240 | 76 GB |
| GPU AMD Shared | $0.65 / 1M tokens | 250,000 | 240 | 76 GB |
| GPU Intel Shared | $0.60 / 1M tokens | 250,000 | 240 | 76 GB |
| XIM | ~$10 / month | Unlimited | Unlimited | 304 GB |
Request rate limits include burst capacity of 50% above the base limit (minimum 3 for free, minimum 10 for paid tiers), allowing short-term traffic spikes.
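One plausible reading of this burst rule, sketched in Python. The rounding and the interpretation of "minimum" as a floor on the burst headroom are assumptions, not confirmed platform behavior:

```python
import math

def burst_capacity(base_limit: int, is_free_tier: bool) -> int:
    """Effective burst ceiling: base limit plus 50% headroom,
    with a minimum headroom of 3 (free) or 10 (paid)."""
    floor = 3 if is_free_tier else 10
    headroom = max(math.ceil(base_limit * 0.5), floor)
    return base_limit + headroom

# Free tier: 60 req/min base -> 60 + max(30, 3) = 90 during bursts
print(burst_capacity(60, is_free_tier=True))    # 90
# GPU shared: 240 req/min base -> 240 + max(120, 10) = 360
print(burst_capacity(240, is_free_tier=False))  # 360
```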
Timeouts and Concurrency
| Tier | Request Timeout | Idle Stream Timeout | Max Concurrent | Batch Size |
|---|---|---|---|---|
| Free | 30s | 120s | 5 | 1 |
| CPU AMD / Intel | 300s | 600s | 20 | 8 |
| GPU NVIDIA / AMD / Intel | 300s | 600s | 50 | 16 |
| XIM | 1800s | 3600s | 100 | 64 |
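When calling an endpoint, it helps to set the client-side timeout slightly above the server's tier timeout, so the server's timeout error surfaces instead of the client cutting the connection first. A minimal sketch using the values from the table (the tier key strings are illustrative, not API values):

```python
# Server-side request timeouts in seconds, per the table above.
TIER_TIMEOUT_S = {
    "free": 30,
    "cpu_amd_optimized": 300,
    "cpu_intel_optimized": 300,
    "gpu_nvidia_shared": 300,
    "gpu_amd_shared": 300,
    "gpu_intel_shared": 300,
    "xim": 1800,
}

def client_timeout_s(tier: str, margin_s: float = 5.0) -> float:
    """Client timeout = server tier timeout plus a small margin."""
    return TIER_TIMEOUT_S[tier] + margin_s

print(client_timeout_s("xim"))   # 1805.0
print(client_timeout_s("free"))  # 35.0
```

With the OpenAI Python SDK, this value could then be passed as `timeout=client_timeout_s("xim")` when constructing the client.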
Tier Details
Free
Perfect for testing and evaluation. Uses shared CPU infrastructure with preemptable scheduling. Free accounts are limited to 2 endpoints and 2 models. Requests can be preempted by higher-priority traffic.
- Accelerator: Shared CPU
- Priority: 0 (lowest)
- Preemptable: Yes
- Monthly token allowance: 100,000 tokens
CPU AMD Optimized
ZenDNN-optimized inference on AMD EPYC processors. Good for latency-tolerant workloads with moderate throughput requirements. Requests are not preemptable.
- Accelerator: AMD ZenDNN CPU
- Priority: 20
- Preemptable: No
CPU Intel Optimized
oneAPI-optimized inference on Intel Xeon processors. Similar to AMD CPU tier with slightly higher scheduling priority.
- Accelerator: Intel oneAPI CPU
- Priority: 25
- Preemptable: No
GPU NVIDIA Shared
Best balance of price and performance with NVIDIA CUDA GPUs. Shared infrastructure with 99.9% SLA. Highest scheduling priority among compute tiers. Requests are preemptable, meaning higher-priority internal operations can temporarily displace active requests during peak load.
- Accelerator: NVIDIA CUDA
- Priority: 30 (highest)
- Preemptable: Yes
GPU AMD Shared
Cost-effective GPU inference with AMD ROCm. Same performance characteristics as the NVIDIA tier at a lower price point.
- Accelerator: AMD ROCm
- Priority: 30
- Preemptable: Yes
GPU Intel Shared
GPU inference with Intel oneAPI and Arc/Max GPUs. Lowest per-token cost among GPU tiers.
- Accelerator: Intel oneAPI GPU
- Priority: 30
- Preemptable: Yes
XIM
Bring your own hardware and infrastructure. No token metering or rate limiting. Flat monthly fee covers the Xerotier routing and management layer. Supports the largest models (up to 304 GB) with extended timeouts.
- Accelerator: Your own hardware (CPU or GPU)
- Priority: 0
- Preemptable: No
- Full data control
Tier Selection
The service tier is configured per endpoint in the Xerotier dashboard or API. When you create or update an endpoint, you select which tier it uses. All requests to that endpoint are routed to workers compatible with the configured tier.
The `service_tier` request parameter is accepted for OpenAI API compatibility but does not affect routing. The endpoint's configured tier always takes precedence. The actual tier used is returned in the `service_tier` field of every response (both streaming chunks and non-streaming completions).
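Because the request parameter is ignored, it can be worth logging when the tier a caller asked for differs from the tier that actually served the request. A minimal sketch (the tier name strings are illustrative):

```python
def tier_mismatch(requested, actual):
    """True when a request's service_tier parameter (which Xerotier
    ignores) differs from the tier the endpoint actually used."""
    return bool(requested and actual and requested != actual)

# The endpoint's configured tier wins regardless of what the request asked for.
print(tier_mismatch("free", "gpu_nvidia_shared"))  # True
print(tier_mismatch(None, "gpu_nvidia_shared"))    # False
```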
The router selects backends within the configured tier using configurable routing strategies:
- Least Loaded -- Prefer workers with the lowest queue depth.
- Lowest Latency -- Prefer workers with the lowest predicted latency.
- Model Affinity -- Prefer workers that already have the model loaded.
- Round Robin -- Cycle through available workers evenly.
- Composite -- Weighted combination of multiple strategies.
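A composite strategy can be sketched as a weighted score over the other signals. The weights and scaling below are purely illustrative; Xerotier's actual scoring function is not documented here:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queue_depth: int              # signal for least-loaded
    predicted_latency_ms: float   # signal for lowest-latency
    has_model: bool               # signal for model-affinity

def composite_score(w: Worker, weights=(0.5, 0.3, 0.2)) -> float:
    """Lower is better: weighted blend of latency, queue depth,
    and a penalty for workers without the model loaded."""
    w_lat, w_queue, w_aff = weights
    affinity_penalty = 0.0 if w.has_model else 1.0
    return (w_lat * w.predicted_latency_ms
            + w_queue * w.queue_depth * 100
            + w_aff * affinity_penalty * 1000)

workers = [
    Worker("a", queue_depth=2, predicted_latency_ms=80.0, has_model=True),
    Worker("b", queue_depth=0, predicted_latency_ms=120.0, has_model=False),
]
best = min(workers, key=composite_score)
print(best.name)  # a
```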
You can also pass optional `X-SLO-TTFT-Ms` and `X-SLO-TPOT-Ms` request headers to hint at latency targets. The router boosts preference for workers likely to meet these targets. See the API Reference for details.
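A small helper for building these headers (header names as documented above; TTFT = time to first token, TPOT = time per output token):

```python
def slo_headers(ttft_ms=None, tpot_ms=None):
    """Build the optional SLO hint headers for a request."""
    headers = {}
    if ttft_ms is not None:
        headers["X-SLO-TTFT-Ms"] = str(ttft_ms)
    if tpot_ms is not None:
        headers["X-SLO-TPOT-Ms"] = str(tpot_ms)
    return headers

print(slo_headers(ttft_ms=500, tpot_ms=50))
# {'X-SLO-TTFT-Ms': '500', 'X-SLO-TPOT-Ms': '50'}
```

With the OpenAI Python SDK, these could be attached per request via `extra_headers=slo_headers(ttft_ms=500, tpot_ms=50)` on the `create` call.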
Custom Models and Tier Restrictions
Custom models -- models that you upload and manage privately within your project -- are restricted to the XIM tier. To use a model on shared compute tiers (Free, CPU, or GPU), the model must be published to the public catalog and promoted to the shared catalog role.
This restriction ensures that shared infrastructure only runs models that have been explicitly published and vetted. Private models never run on shared agents.
| Model Type | Allowed Tiers |
|---|---|
| Custom (private, not in catalog) | XIM only |
| Catalog model (deployable role) | XIM only |
| Catalog model (shared role) | All tiers |
Catalog Roles and Tier Access
Models in the public catalog are assigned a catalog role that controls which tiers they can be deployed on:
- Deployable: The default role for newly shared models. The model is visible in the catalog but can only be used with the XIM tier. This means other users can see the model but must deploy it on their own infrastructure.
- Shared: The model is available on all tiers, including shared compute. This role is set automatically when the platform detects the model loaded on shared infrastructure, and reverts to deployable when no shared agents have it cached.
The catalog role is managed automatically by the platform. When an administrator provisions a catalog model onto shared infrastructure, the role is promoted to shared. When shared infrastructure no longer has the model cached, the role is demoted back to deployable.
For details on how model sharing works and how to publish models to the catalog, see the Model Sharing documentation.
Preemption
Some tiers allow requests to be preempted by higher-priority traffic. This means an active request may be interrupted if the system needs to serve a higher-priority request and no other workers are available.
| Tier | Preemptable | Priority |
|---|---|---|
| Free | Yes | 0 |
| CPU AMD Optimized | No | 20 |
| CPU Intel Optimized | No | 25 |
| GPU NVIDIA Shared | Yes | 30 |
| GPU AMD Shared | Yes | 30 |
| GPU Intel Shared | Yes | 30 |
| XIM | No | 0 |
CPU tiers and the XIM tier are never preempted. Free and GPU shared tiers are preemptable. Higher priority values indicate higher scheduling priority when the system must choose which requests to serve.
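The selection logic implied by this table can be sketched as: preempt only a request that is itself preemptable and has strictly lower priority than the incoming work. This is a sketch of the general idea, not Xerotier's scheduler; the tier keys are illustrative:

```python
TIER_PRIORITY = {"free": 0, "cpu_amd": 20, "cpu_intel": 25,
                 "gpu_nvidia": 30, "gpu_amd": 30, "gpu_intel": 30, "xim": 0}
PREEMPTABLE = {"free", "gpu_nvidia", "gpu_amd", "gpu_intel"}

def pick_victim(active, incoming_tier):
    """Pick the lowest-priority preemptable active request strictly
    below the incoming priority; None if nothing can be preempted.
    `active` is a list of (request_id, tier) pairs."""
    candidates = [(rid, t) for rid, t in active
                  if t in PREEMPTABLE
                  and TIER_PRIORITY[t] < TIER_PRIORITY[incoming_tier]]
    return min(candidates, key=lambda p: TIER_PRIORITY[p[1]], default=None)

# Free request is displaced; CPU request is never preempted,
# and equal-priority GPU requests cannot preempt each other.
print(pick_victim([("r1", "free"), ("r2", "cpu_amd"), ("r3", "gpu_amd")],
                  "gpu_nvidia"))  # ('r1', 'free')
```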
When no compatible workers are available, the API returns a `503 Service Unavailable` response with a `Retry-After` header indicating when to retry.
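A client can honor this by sleeping for the advertised interval before retrying. A sketch with an injectable `send` callable, so the retry logic is testable without a live endpoint (in real use, `send` would wrap an HTTP POST to the endpoint):

```python
import time

def request_with_retry(send, max_attempts: int = 4):
    """Retry on 503, honoring the Retry-After header.
    `send` is a zero-argument callable returning (status, headers, body)."""
    status, headers, body = None, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 503:
            return status, body
        # Fall back to exponential backoff if Retry-After is missing.
        delay_s = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay_s)
    return status, body

# Simulated: one 503 (retry immediately), then success.
responses = iter([(503, {"Retry-After": "0"}, None), (200, {}, "ok")])
print(request_with_retry(lambda: next(responses)))  # (200, 'ok')
```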
Completion Storage
When `store: true` is set in a request, the completion is saved for later retrieval. Completions pass through two storage stages before expiration:
| Tier | Hot Storage (Redis) | Cold Storage (Object Store) | Total Retention |
|---|---|---|---|
| Free | 1 hour | 1 day | ~25 hours |
| CPU AMD Optimized | 5 hours | 7 days | ~7.2 days |
| CPU Intel Optimized | 6 hours | 7 days | ~7.25 days |
| GPU NVIDIA / AMD / Intel | 7 hours | 7 days | ~7.3 days |
| XIM | 48 hours | 14 days | 16 days |
Hot storage provides fast retrieval from Redis. After the hot tier duration, completions are archived to object storage (cold tier) for the remaining retention period. After total retention expires, completions are automatically deleted.
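The totals in the table follow directly from the hot hours plus the cold days. A quick check (tier keys illustrative):

```python
# (hot hours in Redis, cold days in object storage) per tier, from the table.
RETENTION = {
    "free": (1, 1),
    "cpu_amd_optimized": (5, 7),
    "cpu_intel_optimized": (6, 7),
    "gpu_shared": (7, 7),
    "xim": (48, 14),
}

def total_retention_hours(tier: str) -> int:
    hot_h, cold_days = RETENTION[tier]
    return hot_h + cold_days * 24

print(total_retention_hours("free"))      # 25  (the "~25 hours" row)
print(total_retention_hours("xim") / 24)  # 16.0 days
```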
Free Account Limits
Free accounts have the following restrictions:
- Maximum 2 endpoints
- Maximum 2 models (up to 4 GB each)
- 100,000 tokens per month
- 60 requests per minute
- 30-second request timeout
- Requests are preemptable
Choosing a Tier
| Use Case | Recommended Tier |
|---|---|
| Testing and prototyping | Free |
| Cost-sensitive batch processing | CPU AMD or CPU Intel |
| Low-latency production workloads | GPU NVIDIA Shared |
| Cost-optimized GPU inference | GPU AMD or GPU Intel Shared |
| Large models (>32 GB) | GPU tiers (up to 76 GB) or XIM (up to 304 GB) |
| Data sovereignty / compliance | XIM |
| Long-running requests (>5 min) | XIM (1800s timeout) |
To change your endpoint's tier, go to the endpoint settings page in the Xerotier dashboard and select the new tier. The change takes effect immediately for new requests.
Code Examples
The service tier is configured per endpoint, not per request. The following examples show how to make inference requests on an endpoint that has been assigned a specific tier. The actual tier used is returned in the `service_tier` response field.
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_my-project_abc123"
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

# The service_tier field shows which tier processed the request
print(f"Service tier: {response.service_tier}")
```
Node.js (OpenAI SDK)
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_my-project_abc123'
});

const response = await client.chat.completions.create({
  model: 'my-model',
  messages: [{ role: 'user', content: 'Hello' }]
});

console.log(response.choices[0].message.content);

// The service_tier field shows which tier processed the request
console.log(`Service tier: ${response.service_tier}`);
```
curl
```shell
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# The response includes "service_tier" indicating the endpoint's configured tier
```