No hidden fees, no surprises. Scale from free to enterprise with predictable costs. Only tiers with active infrastructure are shown.
Full Service Matrix
| Feature | Free | Self-Hosted |
|---|---|---|
| **Pricing** | | |
| Price per 1M tokens | Free | Free |
| Hourly rate (dedicated) | - | $0.01/hr |
| **Rate Limits** | | |
| Tokens per minute | 10,000 | Unlimited |
| Requests per minute | 20 | Unlimited |
| **Hardware** | | |
| Hardware type | CPU | Your Hardware |
| Max model size | 4 GB | 304 GB |
| **Endpoint Configuration** | | |
| Max replicas | 1 | Unlimited |
| Concurrent requests | 5 | Unlimited |
| Max batch size | 1 | Unlimited |
| Request timeout | 30s | 120s |
| **Features** | | |
| Streaming | Yes | Yes |
| Batching | - | Yes |
| CPU support | Yes | Yes |
| NVIDIA CUDA | - | Yes |
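For a rough sense of the dedicated hourly rate, here is a back-of-the-envelope sketch assuming one replica running non-stop through a 30-day month. The helper name is ours for illustration, not part of any billing API; the $0.01/hr figure comes from the matrix above.

```python
HOURLY_RATE = 0.01  # dedicated self-hosted rate in $/hr, from the matrix above

def monthly_cost(hours_per_day=24, days=30, rate=HOURLY_RATE):
    """Estimated cost of keeping one dedicated replica up for a month."""
    return hours_per_day * days * rate

# One replica running around the clock for a 30-day month:
# 24 * 30 * $0.01 = $7.20
```

Scale down `hours_per_day` if you stop the endpoint outside business hours; you are only billed for hours the endpoint is running.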
Frequently Asked Questions
**How is token usage calculated?**
Token usage includes both input (prompt) tokens and output (completion) tokens. We use the same tokenization as the model you deploy, so counts match what you'd see locally. You're only charged for successful requests.
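Assuming the billing rule above (input plus output tokens, successful requests only), the per-request charge can be sketched as follows. The function name and token counts are illustrative, not part of any API; real counts come from the deployed model's own tokenizer.

```python
def billed_tokens(prompt_tokens, completion_tokens, request_succeeded):
    """Tokens billed for one request: input + output, successful requests only.

    Illustrative helper; actual token counts come from the model's tokenizer.
    """
    if not request_succeeded:
        return 0  # failed requests are not charged
    return prompt_tokens + completion_tokens

# A successful call with a 120-token prompt and a 380-token completion
# is billed for 500 tokens; a failed call is billed for none.
```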
**What is the difference between shared and dedicated GPU?**
Shared GPU tiers run on multi-tenant infrastructure where resources are shared across users. This is cost-effective but latency can vary. Dedicated GPU gives you reserved hardware with guaranteed performance and isolation, making it ideal for production workloads that require consistent latency.
**Can I change an endpoint's tier after deployment?**
No. An endpoint's tier is fixed at creation, but you can always deploy new endpoints on a different tier.
**What happens if I exceed my rate limits?**
Requests exceeding your tier's rate limits receive a 429 (Too Many Requests) response with a Retry-After header. We recommend implementing exponential backoff in your client, and upgrading to a higher tier if you consistently hit the limits.
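The recommended retry behavior can be sketched in Python. Here `send_request` is a hypothetical callable standing in for your HTTP client call; the response object is assumed to expose `status_code` and `headers` as a `requests.Response` does.

```python
import random
import time

def post_with_backoff(send_request, max_retries=5):
    """Retry a request on HTTP 429, honoring Retry-After when present.

    `send_request` is a hypothetical zero-argument callable returning an
    object with `.status_code` and `.headers` (e.g. a requests.Response).
    """
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 429:
            return response
        # Prefer the server-suggested delay; otherwise back off
        # exponentially, with jitter to avoid synchronized retries.
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = (2 ** attempt) + random.random()
        time.sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")
```

Note that `Retry-After` may also arrive as an HTTP date rather than a number of seconds; a production client should handle both forms.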
**Is there a minimum commitment?**
No minimum commitment is required for any tier. Pay-as-you-go pricing means you only pay for what you use, and you can stop or delete endpoints at any time with no penalty.
Ready to get started?
Deploy your first model in minutes with our free tier. No credit card required.