# Usage Guides

Learn how to use streaming responses, understand rate limits, and handle errors effectively.
## Streaming

The Xerotier API supports streaming responses using Server-Sent Events (SSE). Set `stream: true` in your request to receive partial results as they are generated.
### Stream Response Format

Each streamed chunk is a JSON object prefixed with `data:`. The stream ends with a `data: [DONE]` sentinel:

```
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}
data: [DONE]
```
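If you are not using an SDK, you can decode these lines yourself. A minimal sketch of the parsing logic (the `delta_content` helper is illustrative, not part of any SDK; it runs here on the example lines above rather than a live stream):

```python
import json

def delta_content(line: str):
    """Extract the text delta from one `data:` SSE line.

    Returns None for non-data lines, the `[DONE]` sentinel, and
    chunks that carry no content delta.
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Reassemble the example stream shown above:
lines = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
text = "".join(c for line in lines if (c := delta_content(line)) is not None)
print(text)  # Hello world
```

In practice the SDKs below do this decoding for you; hand-rolling it is mainly useful when streaming through a plain HTTP client.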
### Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Node.js

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

const stream = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Write a poem about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
```
### curl

```bash
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
```
For streaming error handling and mid-stream error events, see the Streaming Errors section of the Error Handling page. For SDK-specific streaming setup and additional language examples, see SDK & Integration Guides.
## Rate Limits

Rate limits are applied per API key and vary by pricing tier.

### Rate Limit Headers

API responses include headers indicating your current rate limit status:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests per minute for your tier |
| X-RateLimit-Remaining | Remaining requests in the current window |
| X-RateLimit-Reset | Unix timestamp when the rate limit resets |
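You can use these headers proactively to pause before tripping the limit, rather than reacting to a 429. A minimal sketch of the arithmetic (the `seconds_until_reset` helper is illustrative; obtain the headers however your HTTP client exposes raw responses):

```python
import time

def seconds_until_reset(headers, now=None):
    """Seconds to pause once X-RateLimit-Remaining hits zero.

    `headers` is a plain dict of response header values. Returns 0.0
    while requests remain in the current window.
    """
    if int(headers.get("X-RateLimit-Remaining", 1)) > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(0.0, reset - now)

# With no requests left and a reset 15 seconds away:
print(seconds_until_reset(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000015"},
    now=1700000000.0,
))  # 15.0
```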
### Tier Limits
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 64 | 10,000 |
| CPU AMD Optimized | 128 | 100,000 |
| GPU NVIDIA Shared NVL | 256 | 500,000 |
| GPU NVIDIA Shared | 256 | 500,000 |
| GPU AMD Shared | 256 | 500,000 |
| Self-Hosted | Unlimited | Unlimited |
### Handling Rate Limits

When you exceed a rate limit, the API returns a `429 Too Many Requests` response. Honor the `Retry-After` header when present, and fall back to exponential backoff otherwise:
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="deepseek-r1-distill-llama-70b",
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Prefer the server's Retry-After header; otherwise back off exponentially.
            retry_after = int(e.response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after)
```
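Fixed exponential delays can synchronize many clients retrying in lockstep; adding jitter spreads them out. A standalone sketch of a full-jitter delay schedule (the `backoff_delay` helper and its defaults are illustrative choices, not part of the API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in range(5):
    ceiling = min(60.0, 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.0f}s "
          f"(e.g. {backoff_delay(attempt):.2f}s)")
```

Drop `backoff_delay(attempt)` in place of the fixed fallback in the retry loop above when the `Retry-After` header is absent.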
## Error Handling

The API uses standard HTTP status codes and returns detailed error messages in JSON format. For the complete error code reference, fault categories, and retry policies, see the Error Handling documentation.
### Handling Errors with the OpenAI SDK

Python:

```python
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

try:
    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limited. Retry after: {e.response.headers.get('Retry-After')}")
except APIError as e:
    print(f"API error: {e.message}")
```
Node.js:

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

try {
  const response = await client.chat.completions.create({
    model: 'deepseek-r1-distill-llama-70b',
    messages: [{ role: 'user', content: 'Hello!' }]
  });
} catch (error) {
  if (error.status === 429) {
    const retryAfter = error.headers?.['retry-after'] ?? 60;
    console.log(`Rate limited. Retry after ${retryAfter}s`);
  } else if (error.status === 401) {
    console.log('Invalid API key');
  } else {
    console.log(`API error: ${error.message}`);
  }
}
```
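When wiring these handlers into a retry loop, it helps to decide up front which statuses are worth retrying. The usual convention (shown here as a language-agnostic sketch in Python; the `is_retryable` helper is illustrative, not something the API mandates) is to retry rate limits and transient server errors and fail fast on client errors:

```python
def is_retryable(status: int) -> bool:
    """Retry on rate limits and transient server errors; fail fast otherwise."""
    return status == 429 or status in (500, 502, 503, 504)

for status in (400, 401, 429, 500, 503):
    print(status, "retry" if is_retryable(status) else "fail fast")
```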
## Log Probabilities

Log probabilities (logprobs) provide per-token confidence scores for model outputs. They are useful for classification confidence scoring, retrieval evaluation, autocomplete ranking, and perplexity measurement.

### Basic Request

Set `logprobs: true` and optionally `top_logprobs` (0-20) to receive per-token log probabilities in the response.
```bash
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'
```
### Python: Parsing Logprobs

```python
import math

import openai

client = openai.OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    logprobs=True,
    top_logprobs=3,
    max_tokens=5
)

# Print per-token probabilities
for token_info in response.choices[0].logprobs.content:
    probability = math.exp(token_info.logprob)
    print(f"Token: {token_info.token!r} "
          f"logprob: {token_info.logprob:.4f} "
          f"probability: {probability:.4%}")
    for alt in token_info.top_logprobs:
        alt_prob = math.exp(alt.logprob)
        print(f"  alt: {alt.token!r} probability: {alt_prob:.4%}")
```
### Confidence Scoring

Compute overall response confidence by summing log probabilities across all tokens (equivalent to multiplying the individual token probabilities).

```python
import math

def compute_confidence(logprobs_content):
    """Compute overall and per-token confidence from logprobs."""
    total_logprob = sum(t.logprob for t in logprobs_content)
    overall_confidence = math.exp(total_logprob)
    per_token = [
        {"token": t.token, "confidence": math.exp(t.logprob)}
        for t in logprobs_content
    ]
    return overall_confidence, per_token

confidence, tokens = compute_confidence(
    response.choices[0].logprobs.content
)
print(f"Overall confidence: {confidence:.4%}")
for t in tokens:
    print(f"  {t['token']!r}: {t['confidence']:.4%}")
```
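Perplexity, mentioned in the intro, falls out of the same data: it is the exponential of the negative mean token log probability, so a lower value means the model found its own output less surprising. A sketch using stand-in logprob values (in practice, pass `[t.logprob for t in response.choices[0].logprobs.content]`):

```python
import math

def perplexity(logprobs):
    """exp(-mean logprob) over the generated tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Stand-in values for three confidently generated tokens:
print(round(perplexity([-0.05, -0.10, -0.02]), 4))  # 1.0583
```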
### Streaming with Logprobs

Logprobs are included per chunk when streaming is enabled.

```python
stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=3,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:
        print(delta.content, end="", flush=True)
    if logprobs and logprobs.content:
        for token_info in logprobs.content:
            print(f"\n  [{token_info.token!r} logprob={token_info.logprob:.3f}]",
                  end="")
```
### Node.js Example

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key",
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [{ role: "user", content: "Is 2+2=4? Answer yes or no." }],
  logprobs: true,
  top_logprobs: 3,
  max_tokens: 5,
});

for (const token of response.choices[0].logprobs.content) {
  const probability = Math.exp(token.logprob);
  console.log(
    `Token: "${token.token}" probability: ${(probability * 100).toFixed(2)}%`
  );
}
```