// pricing

What each tier costs.

Pay per token on managed inference. Bring your own hardware and pay nothing per request. Tiers range from the free tier through shared GPU; only tiers with active infrastructure are shown.

// service matrix

Every tier, every limit


| Feature | Free | CPU AMD Optimized | GPU NVIDIA Shared | GPU AMD Shared | Self-Hosted |
|---|---|---|---|---|---|
| Pricing | | | | | |
| Price per 1M tokens | Free | $0.35 | $1.25 | $1.00 | Free |
| Hourly rate (XIM) | - | - | - | - | $0.03/hr |
| Rate Limits | | | | | |
| Tokens per minute | 10,000 | 100,000 | 500,000 | 500,000 | Unlimited |
| Requests per minute | 64 | 128 | 256 | 256 | Unlimited |
| Hardware | | | | | |
| Hardware type | CPU | CPU | GPU | GPU | Your Accelerators |
| Max model size | 64 GB | 20 GB | Unlimited | Unlimited | 512 GB |
| Endpoint Configuration | | | | | |
| Max replicas | 1 | 1 | 1 | 1 | 1 |
| Concurrent requests | 8 | 24 | 48 | 48 | 96 |
| Max batch size | 1 | 8 | 16 | 16 | 64 |
| Request timeout | 30s | 600s | 300s | 300s | 1800s |
| Features | | | | | |
| Streaming | Yes | Yes | Yes | Yes | Yes |
| Batching | - | Yes | Yes | Yes | Yes |
| CPU support | Yes | Yes | - | - | Yes |
| NVIDIA CUDA | - | - | Yes | - | - |
| Storage | | | | | |
| Storage pool (platform-wide) | Shared | Shared | Shared | Shared | Shared |
| Minimum billable (display floor) | 1 GB | 1 GB | 1 GB | 1 GB | 1 GB |
| Cold tier retention | Up to 14 days | Up to 14 days | Up to 14 days | Up to 14 days | Up to 14 days |
| Encryption at rest | Yes | Yes | Yes | Yes | Yes |

// storage

Storage that bills the way you expect

One pool per project. One platform-wide rate. No per-content-type meters; no charges for bytes you do not store.

Per GB, per month

Storage is billed at a single platform-wide rate, not per service tier: base rate * markup multiplier, rounded up to the nearest thousandth. The current default is $0.013/GB/mo.
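
As a rough illustration of the rounding rule, the sketch below multiplies a base rate by a markup and rounds up to the nearest thousandth of a dollar. The function name `storage_rate` and the inputs 0.010 and 1.25 are hypothetical, chosen only so the result matches the stated default; the actual base rate and markup are not given here.

```python
from decimal import Decimal, ROUND_CEILING

def storage_rate(base_rate: Decimal, markup: Decimal) -> Decimal:
    """Return base_rate * markup, rounded up to the nearest $0.001."""
    # Decimal avoids binary floating-point surprises near rounding boundaries.
    return (base_rate * markup).quantize(Decimal("0.001"), rounding=ROUND_CEILING)

# Hypothetical inputs (assumption): 0.010 * 1.25 = 0.0125, which rounds up to 0.013.
rate = storage_rate(Decimal("0.010"), Decimal("1.25"))
print(rate)  # 0.013
```

Rounding up rather than to nearest means a product of 0.0121 also yields 0.013, while an exact 0.013 is left unchanged.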

Display floor

The Usage dashboard reports billable storage as max(1 GB, actual usage). Per-cycle charges are prorated by the actual bytes you store, not the floor.
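
The difference between the displayed figure and the charged figure can be sketched as follows. This is a minimal illustration, assuming decimal gigabytes (1 GB = 10^9 bytes), constant usage over one full monthly cycle, and the default rate of $0.013/GB/mo; the function names are hypothetical.

```python
GB = 10**9  # assumption: decimal gigabytes; the platform may count differently

def displayed_gb(actual_bytes: int) -> float:
    """What the Usage dashboard would show: max(1 GB, actual usage)."""
    return max(1.0, actual_bytes / GB)

def cycle_charge(actual_bytes: int, rate_per_gb_month: float = 0.013) -> float:
    """Charge for one full cycle, based on actual bytes rather than the floor."""
    return actual_bytes / GB * rate_per_gb_month

usage = 250_000_000  # 0.25 GB stored
print(displayed_gb(usage))            # 1.0
print(round(cycle_charge(usage), 6))  # 0.00325
```

Under these assumptions a project storing 0.25 GB is shown as 1 GB on the dashboard but charged for 0.25 GB.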

Shared per project

Models, completions, responses, conversations, batch files, and uploads share a single pool per project. No separate meter per content type.

Cold tier retention

Cold tier content is retained for up to 14 days before automatic expiration, depending on the service tier. A hot tier cache fronts recently used content for faster reads.

Encryption at rest

Stored content is encrypted with AES-256-GCM across both hot and cold tiers. Keys are managed per project for tenant isolation.

Metered from the first byte

No tier includes free storage. Usage is billed from the first byte you store. The 1 GB floor affects only the figure shown on the Usage dashboard, not the charge.

// questions

Read this before you deploy

Token math, tier semantics, rate-limit behavior, storage billing. Stated, not implied.

Token usage includes both input (prompt) tokens and output (completion) tokens. We use the same tokenization as the model you deploy, so counts match what you would see locally. You are only charged for successful requests.
Cached tokens are discounted. On billable tiers, tokens served from the prefix cache are billed at 0.25x the standard per-token rate. The free and self-hosted tiers do not incur token charges, so the multiplier does not apply.
The service_tier request parameter accepts flex, standard, and priority. priority raises routing priority and is billed at 1.25x the standard token cost; flex and standard carry no surcharge.
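
The token math above can be sketched as a small estimator. This is not the platform's billing code: it assumes cached tokens are a subset of the input tokens, and that the cache discount and the priority surcharge multiply together, neither of which is stated here. The function name `token_cost` and its parameters are hypothetical.

```python
CACHE_MULTIPLIER = 0.25  # prefix-cache tokens, per the stated rate
TIER_MULTIPLIER = {"flex": 1.0, "standard": 1.0, "priority": 1.25}

def token_cost(input_tokens: int, output_tokens: int, cached_tokens: int,
               price_per_million: float, service_tier: str = "standard") -> float:
    """Estimate the cost of one successful request in dollars."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    # Uncached input and all output tokens count at the full rate.
    full_rate_tokens = (input_tokens - cached_tokens) + output_tokens
    # Cached input tokens count at a quarter of the full rate.
    billable = full_rate_tokens + cached_tokens * CACHE_MULTIPLIER
    # Assumption: the service_tier multiplier applies to the whole request.
    return billable / 1_000_000 * price_per_million * TIER_MULTIPLIER[service_tier]

# 800k input (600k of it cached), 200k output, at $1.25 per 1M tokens.
print(round(token_cost(800_000, 200_000, 600_000, 1.25), 6))              # 0.6875
print(round(token_cost(800_000, 200_000, 600_000, 1.25, "priority"), 6))  # 0.859375
```

On the free and self-hosted tiers the per-token price is zero, so the estimate is zero regardless of the multipliers.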
An endpoint's tier cannot be changed after deployment. You can deploy new endpoints on different tiers at any time.
Requests exceeding your tier's rate limits receive a 429 (Too Many Requests) response with a Retry-After header. We recommend exponential backoff in your client. Consider a higher tier if you consistently hit limits.
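
One way to handle 429 responses on the client side is sketched below. It honors Retry-After when the header carries a number of seconds and otherwise falls back to capped exponential backoff with jitter. The `send` callable, the `(status, headers, body)` response shape, and the default base and cap values are assumptions for illustration, not part of any documented client.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # hypothetical response shape

def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    # Exponential growth with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return status, headers, body
```

Passing `sleep` as a parameter keeps the retry loop easy to test without real waiting. If the final attempt still returns 429, the caller receives that response and can decide whether to give up or move to a higher tier.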
No minimum commitment on any tier. Pay-as-you-go means you only pay for what you use. Stop or delete endpoints any time, no penalty.
All content types share a single storage pool per project. The Usage dashboard displays a 1 GB floor (max(1 GB, actual usage)), while per-cycle charges are prorated by the actual bytes stored. Storage is billed at a per-GB monthly rate administered platform-wide (default $0.013/GB/mo), not per service tier. Cold tier content is retained for up to 14 days before auto-expiration and is encrypted at rest with AES-256-GCM. See the storage documentation for full details.

Deploy a shared endpoint at zero per-request cost.

Move to dedicated hardware when the math says so. The free tier shares GPU; tier upgrades take seconds.