Full Service Matrix

Feature                    Free       Self-Hosted
Pricing
  Price per 1M tokens      Free       Free
  Hourly rate (dedicated)  -          $0.01/hr
Rate Limits
  Tokens per minute        10,000     Unlimited
  Requests per minute      20         Unlimited
Hardware
  Hardware type            CPU        Your Hardware
  Max model size           4 GB       304 GB
Endpoint Configuration
  Max replicas             1          Unlimited
  Concurrent requests      5          Unlimited
  Max batch size           1          Unlimited
  Request timeout          30s        120s
Features
  Streaming                Yes        Yes
  Batching                 -          Yes
  CPU support              Yes        Yes
  NVIDIA CUDA              -          Yes

Frequently Asked Questions

Q: How is token usage calculated?
Token usage includes both input (prompt) tokens and output (completion) tokens. We use the same tokenization as the model you deploy, so counts match what you'd see locally. You're only charged for successful requests.
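As a sketch, the billing rule above (prompt plus completion tokens, successful requests only) amounts to something like the following. The field names here are hypothetical, assuming per-request token counts are available from your usage logs:

```python
def billable_tokens(requests):
    """Sum prompt + completion tokens across successful requests only.

    Each request is a dict with hypothetical keys: "status" (HTTP
    status code), "prompt_tokens", and "completion_tokens".
    """
    return sum(
        r["prompt_tokens"] + r["completion_tokens"]
        for r in requests
        if 200 <= r["status"] < 300  # failed requests are not billed
    )

usage = [
    {"status": 200, "prompt_tokens": 120, "completion_tokens": 380},
    {"status": 429, "prompt_tokens": 120, "completion_tokens": 0},
]
billable_tokens(usage)  # 500: the rate-limited request is not counted
```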
Q: What is the difference between shared and dedicated GPU tiers?
Shared GPU tiers run on multi-tenant infrastructure where resources are shared across users. This is cost-effective but may have variable latency. Dedicated GPU gives you reserved hardware with guaranteed performance and isolation, making it ideal for production workloads that require consistent latency.
Q: Can I change an endpoint's tier after deployment?
No, you cannot change an existing endpoint's tier. However, you can deploy new endpoints with different tiers.
Q: What happens if I exceed my rate limits?
Requests exceeding your tier's rate limits will receive a 429 (Too Many Requests) response with a Retry-After header. We recommend implementing exponential backoff in your client. Consider upgrading to a higher tier if you consistently hit limits.
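The retry strategy above can be sketched as a small wrapper. This is a minimal example, not an official client: it assumes `send()` returns a response object with `status_code` and `headers` attributes (as a `requests.Response` would), and it honors Retry-After when the server provides one:

```python
import random
import time

def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Call send() and retry on 429 responses with exponential backoff.

    The Retry-After header, when present, takes precedence over the
    computed delay if it asks for a longer wait.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break  # out of retries; return the last 429 response
        # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
        delay = min(base_delay * (2 ** attempt), max_delay)
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = max(delay, float(retry_after))
        # Small random jitter avoids synchronized retries across clients.
        time.sleep(delay + random.uniform(0, 0.1))
    return response
```

Pass any zero-argument callable as `send`, e.g. `call_with_backoff(lambda: requests.post(url, json=payload))`.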
Q: Is there a minimum commitment?
No minimum commitment is required for any tier. Pay-as-you-go pricing means you only pay for what you use. You can stop or delete endpoints at any time with no penalty.

Ready to get started?

Deploy your first model in minutes with our free tier. No credit card required.