The three things most operators hit on day one: streaming wire format, burst behavior under rate limits, and how to read an error envelope. Log probabilities live here in full; streaming and errors point you at the canonical reference pages.
Streaming
The Xerotier API supports streaming responses using Server-Sent Events (SSE).
Set stream: true in your request to receive partial results as
they are generated.
Stream Response Format
Each streamed chunk is a JSON object on a line prefixed with data:, and events are separated by a blank line. The stream ends with a literal data: [DONE] line:
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}

data: [DONE]
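If you are not using an SDK, the wire format above can be consumed with a few lines of standard-library Python. The following is a minimal sketch, not the official client: the function name `iter_sse_content` is hypothetical, and it assumes every `data:` payload other than `[DONE]` is a JSON chunk shaped like the sample above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# The sample stream from this section, one list item per wire line.
sample = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints "Hello world"
```

In practice `lines` would be the line iterator of whatever HTTP client you use; the parser only needs an iterable of decoded strings.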
Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    # Defensive: skip any chunk that carries no choices.
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

const stream = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Write a poem about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
For streaming error handling and mid-stream error events, see the Streaming Errors section of the Error Handling page. For SDK-specific streaming setup and additional language examples, see SDK & Integrations.
Rate Limits
Rate limits are applied per API key and vary by pricing tier.
Rate Limit Headers
API responses include headers indicating your current rate limit status:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests per minute for your tier |
| X-RateLimit-Remaining | Remaining requests in the current window |
| X-RateLimit-Reset | Seconds until the current rate-limit window resets (not a Unix timestamp) |
| RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset | Unprefixed duplicates of the X-RateLimit-* headers above, named after the IETF draft, emitted on every response for clients that prefer the non-prefixed form |
| X-RateLimit-Warning | Set to approaching_limit when remaining quota is low, before any 429 is returned |
| Retry-After | On 429 responses, the number of seconds the client should wait before retrying |
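To act on these headers in client code, it helps to fold them into one structure. The sketch below is illustrative only: `RateLimitStatus` and `read_rate_limit` are hypothetical names, and the parser assumes the numeric headers carry plain integer strings as the table describes.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitStatus:
    limit: Optional[int]          # requests per minute for the tier
    remaining: Optional[int]      # requests left in the current window
    reset_seconds: Optional[int]  # seconds until the window resets
    approaching_limit: bool       # True when X-RateLimit-Warning is set
    retry_after: Optional[int]    # only present on 429 responses


def _get_int(headers: Mapping[str, str], *names: str) -> Optional[int]:
    """Return the first header in `names` that parses as an integer."""
    for name in names:
        value = headers.get(name)
        if value is not None:
            try:
                return int(value)
            except ValueError:
                return None
    return None


def read_rate_limit(headers: Mapping[str, str]) -> RateLimitStatus:
    # Header names are case-insensitive; normalize so plain dicts work too.
    h = {k.lower(): v for k, v in headers.items()}
    return RateLimitStatus(
        limit=_get_int(h, "x-ratelimit-limit", "ratelimit-limit"),
        remaining=_get_int(h, "x-ratelimit-remaining", "ratelimit-remaining"),
        reset_seconds=_get_int(h, "x-ratelimit-reset", "ratelimit-reset"),
        approaching_limit=h.get("x-ratelimit-warning") == "approaching_limit",
        retry_after=_get_int(h, "retry-after"),
    )


status = read_rate_limit({
    "X-RateLimit-Limit": "64",
    "X-RateLimit-Remaining": "3",
    "X-RateLimit-Reset": "12",
    "X-RateLimit-Warning": "approaching_limit",
})
print(status)
```

With the OpenAI Python SDK, response headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` alongside `.parse()` for the usual response object.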
Tier Limits
| Tier | Requests/min | Tokens/min |
|---|---|---|
| Free | 64 | 10,000 |
| CPU AMD Optimized | 128 | 100,000 |
| GPU NVIDIA Shared NVL | 256 | 500,000 |
| GPU NVIDIA Shared | 256 | 500,000 |
| GPU AMD Shared | 256 | 500,000 |
| Self-Hosted | Unlimited | Unlimited |
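One way to avoid bursting into a 429 is to throttle on the client before sending. The following sliding-window limiter is a rough sketch under stated assumptions: the class name and the `TIER_RPM` dictionary are hypothetical, the request budgets are copied from the table above, and the server's actual windowing algorithm is not documented here, so treat this as a conservative approximation. It does not track the tokens/min budget.

```python
import time
from collections import deque
from typing import Callable, Deque

# Requests/min copied from the tier table; the dict itself is illustrative.
TIER_RPM = {"Free": 64, "CPU AMD Optimized": 128, "GPU NVIDIA Shared": 256}


class SlidingWindowLimiter:
    """Allow at most `max_requests` sends in any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._sent: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next request fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._sent.append(self._clock())


limiter = SlidingWindowLimiter(TIER_RPM["Free"])
delay = limiter.wait_time()  # 0.0 on a fresh limiter
if delay > 0:
    time.sleep(delay)
limiter.record()
```

Call `wait_time()` before each request, sleep for the returned duration, then call `record()` once the request is sent.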
Handling Rate Limits
When you exceed rate limits, the API returns a 429 Too Many Requests
response with a JSON error envelope using type: "rate_limit_error".
Wait for the number of seconds given in the Retry-After header, and fall back to exponential backoff when the header is absent.
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="deepseek-r1-distill-llama-70b",
                messages=messages
            )
        except RateLimitError as e:
            # Out of attempts: surface the error to the caller.
            if attempt == max_retries - 1:
                raise
            # Prefer the server's hint (seconds); otherwise back off 1s, 2s, 4s, ...
            retry_after = e.response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** attempt
            time.sleep(delay)
Error Handling
The API uses standard HTTP status codes and returns detailed error messages in JSON format. For the complete error code reference, fault categories, and retry policies, see the Error Handling documentation.
Handling Errors with OpenAI SDK
Python
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

try:
    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": "Hello!"}]
    )
# RateLimitError is a subclass of APIError, so it must be caught first.
except RateLimitError as e:
    print(f"Rate limited. Retry after: {e.response.headers.get('Retry-After')}")
except APIError as e:
    print(f"API error: {e.message}")
Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

try {
  const response = await client.chat.completions.create({
    model: 'deepseek-r1-distill-llama-70b',
    messages: [{ role: 'user', content: 'Hello!' }]
  });
} catch (error) {
  if (error.status === 429) {
    // error.headers is a Headers object in newer SDK releases and a plain object in older ones.
    const headers = error.headers;
    const retryAfter =
      (typeof headers?.get === 'function' ? headers.get('retry-after') : headers?.['retry-after']) || 60;
    console.log(`Rate limited. Retry after ${retryAfter}s`);
  } else if (error.status === 401) {
    console.log('Invalid API key');
  } else {
    console.log(`API error: ${error.message}`);
  }
}
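When you call the API without an SDK, you read the error envelope yourself. This page only confirms that the envelope is JSON and carries a type such as "rate_limit_error"; the sketch below additionally assumes the OpenAI-style shape, with the fields nested under an "error" key next to a "message". Treat that shape, and the helper name `read_error_envelope`, as assumptions and check the Error Handling documentation for the authoritative schema.

```python
import json
from typing import Any, Dict


def read_error_envelope(status: int, body: str) -> Dict[str, Any]:
    """Summarize an error response; tolerate non-JSON bodies."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        parsed = {}
    # Assumed shape: {"error": {"type": ..., "message": ...}}
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}
    error_type = error.get("type")
    return {
        "status": status,
        "type": error_type,
        "message": error.get("message", body),  # fall back to the raw body
        "rate_limited": status == 429 or error_type == "rate_limit_error",
    }


summary = read_error_envelope(
    429,
    '{"error": {"type": "rate_limit_error", "message": "Too many requests"}}',
)
print(summary["type"], summary["rate_limited"])  # prints "rate_limit_error True"
```

Falling back to the raw body keeps proxy or gateway errors, which are often HTML rather than JSON, visible in logs.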
Log Probabilities
Log probabilities (logprobs) provide per-token confidence scores for model outputs. They are useful for classification confidence scoring, retrieval evaluation, autocomplete ranking, and perplexity measurement.
Basic Request
Set logprobs: true and optionally top_logprobs (0-20)
to receive per-token log probabilities in the response.
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'
Python: Parsing Logprobs
import openai
import math

client = openai.OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    logprobs=True,
    top_logprobs=3,
    max_tokens=5
)

# Print per-token probabilities
for token_info in response.choices[0].logprobs.content:
    probability = math.exp(token_info.logprob)
    print(f"Token: {token_info.token!r} "
          f"logprob: {token_info.logprob:.4f} "
          f"probability: {probability:.4%}")
    for alt in token_info.top_logprobs:
        alt_prob = math.exp(alt.logprob)
        print(f"  alt: {alt.token!r} probability: {alt_prob:.4%}")
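For classification-style prompts such as the yes/no question above, the top_logprobs alternatives can be folded into a probability per label. This is a sketch rather than a prescribed method: `label_probabilities` is a hypothetical helper, it assumes the answer is the first generated token, and it renormalizes only over the labels that appear among the returned alternatives.

```python
import math
from types import SimpleNamespace
from typing import Dict, Iterable, Sequence


def label_probabilities(top_logprobs: Iterable, labels: Sequence[str]) -> Dict[str, float]:
    """Sum probability mass per label (ignoring case and whitespace), then renormalize."""
    mass = {label: 0.0 for label in labels}
    for alt in top_logprobs:
        key = alt.token.strip().lower()
        for label in labels:
            if key == label.lower():
                mass[label] += math.exp(alt.logprob)
    total = sum(mass.values())
    if total == 0.0:
        return mass  # none of the labels appeared among the alternatives
    return {label: p / total for label, p in mass.items()}


# Stand-in for response.choices[0].logprobs.content[0].top_logprobs
fake_alternatives = [
    SimpleNamespace(token="Yes", logprob=math.log(0.6)),
    SimpleNamespace(token=" yes", logprob=math.log(0.2)),
    SimpleNamespace(token="No", logprob=math.log(0.1)),
]
probs = label_probabilities(fake_alternatives, ["yes", "no"])
print(probs)  # yes is roughly 0.889, no roughly 0.111
```

Merging case and whitespace variants matters because tokenizers often emit "Yes", " yes", and "yes" as distinct tokens.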
Confidence Scoring
Compute overall response confidence by summing the log probabilities of all tokens and exponentiating the sum (equivalent to multiplying the individual probabilities).
import math

def compute_confidence(logprobs_content):
    """Compute overall and per-token confidence from logprobs."""
    total_logprob = sum(t.logprob for t in logprobs_content)
    overall_confidence = math.exp(total_logprob)
    per_token = [
        {"token": t.token, "confidence": math.exp(t.logprob)}
        for t in logprobs_content
    ]
    return overall_confidence, per_token

confidence, tokens = compute_confidence(
    response.choices[0].logprobs.content
)
print(f"Overall confidence: {confidence:.4%}")
for t in tokens:
    print(f"  {t['token']!r}: {t['confidence']:.4%}")
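Because the joint probability shrinks as a response gets longer, a length-normalized score is often easier to compare across responses. The sketch below computes the geometric-mean token probability and perplexity, the standard exp of the negative mean log probability. The function names are illustrative, and the input is a plain list of logprob floats, for example `[t.logprob for t in response.choices[0].logprobs.content]`.

```python
import math
from typing import Sequence


def mean_token_probability(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean logprob)."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(logprobs) / len(logprobs))


def perplexity(logprobs: Sequence[float]) -> float:
    """Perplexity: exp(-mean logprob). Lower means the model was less surprised."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))


# Four tokens, each generated with probability 0.5.
example = [math.log(0.5)] * 4
print(f"joint probability: {math.exp(sum(example)):.4f}")       # 0.0625
print(f"mean token prob:   {mean_token_probability(example):.4f}")  # 0.5000
print(f"perplexity:        {perplexity(example):.4f}")          # 2.0000
```

The example shows the effect of length: the joint probability is already 6.25% after four tokens, while the per-token measures stay at 0.5 and 2.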
Streaming with Logprobs
Logprobs are included per-chunk when streaming is enabled.
stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=3,
    stream=True
)

for chunk in stream:
    # Defensive: skip any chunk that carries no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:
        print(delta.content, end="", flush=True)
    if logprobs and logprobs.content:
        for token_info in logprobs.content:
            print(f"\n  [{token_info.token!r} logprob={token_info.logprob:.3f}]",
                  end="")
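If you want a score for the whole streamed response rather than per-chunk output, you can accumulate the text and logprobs as chunks arrive. A minimal sketch follows, assuming chunks shaped like the SDK objects used above. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, List[float]]:
    """Return the full text and the per-token logprobs seen in a stream."""
    text_parts: List[str] = []
    token_logprobs: List[float] = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        choice = chunk.choices[0]
        if choice.delta.content:
            text_parts.append(choice.delta.content)
        if choice.logprobs and choice.logprobs.content:
            token_logprobs.extend(t.logprob for t in choice.logprobs.content)
    return "".join(text_parts), token_logprobs


def fake_chunk(text: str, logprob: float) -> SimpleNamespace:
    """Hypothetical stand-in for one SDK chunk carrying a single token."""
    token = SimpleNamespace(token=text, logprob=logprob)
    choice = SimpleNamespace(
        delta=SimpleNamespace(content=text),
        logprobs=SimpleNamespace(content=[token]),
    )
    return SimpleNamespace(choices=[choice])


text, token_logprobs = collect_stream(
    [fake_chunk("Hello", -0.1), fake_chunk(" world", -0.3)]
)
print(repr(text), token_logprobs)
```

With a real request, pass the `stream` object directly in place of the fake chunks.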
Node.js Example
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key",
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [{ role: "user", content: "Is 2+2=4? Answer yes or no." }],
  logprobs: true,
  top_logprobs: 3,
  max_tokens: 5,
});

for (const token of response.choices[0].logprobs.content) {
  const probability = Math.exp(token.logprob);
  console.log(
    `Token: "${token.token}" probability: ${(probability * 100).toFixed(2)}%`
  );
}