Usage Guides

Learn how to use streaming responses, understand rate limits, and handle errors effectively.

Streaming

The Xerotier API supports streaming responses using Server-Sent Events (SSE). Set stream: true in your request to receive partial results as they are generated.

Stream Response Format

Each streamed chunk is a JSON object prefixed with `data: `. The stream ends with a `data: [DONE]` sentinel:

Response

```
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}

data: [DONE]
```
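If you consume the stream without an SDK, each `data:` line can be decoded by hand. A minimal sketch, where `parse_sse_line` is an illustrative helper (not part of the API):

```python
import json

def parse_sse_line(line: str):
    """Decode one SSE line: return the JSON payload as a dict,
    or None for blank lines and the [DONE] sentinel."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    return json.loads(payload)

# Lines in the format shown above
chunk = parse_sse_line('data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}')
print(chunk["choices"][0]["delta"]["content"])  # Hello
print(parse_sse_line("data: [DONE]"))           # None
```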

Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

Node.js

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

const stream = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Write a poem about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
```

curl

```shell
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
```

For streaming error handling and mid-stream error events, see the Streaming Errors section of the Error Handling page. For SDK-specific streaming setup and additional language examples, see SDK & Integration Guides.

Rate Limits

Rate limits are applied per API key and vary by pricing tier.

Rate Limit Headers

API responses include headers indicating your current rate limit status:

| Header | Description |
| --- | --- |
| `X-RateLimit-Limit` | Maximum requests per minute for your tier |
| `X-RateLimit-Remaining` | Remaining requests in the current window |
| `X-RateLimit-Reset` | Unix timestamp when the rate limit resets |
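To act on these headers programmatically, you can parse them into a small status record. A sketch assuming a plain dict of response headers (`rate_limit_status` is a hypothetical helper, not part of any SDK):

```python
import time

def rate_limit_status(headers: dict) -> dict:
    """Summarize rate-limit state from the X-RateLimit-* response headers."""
    reset = int(headers["X-RateLimit-Reset"])
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "seconds_until_reset": max(0, reset - int(time.time())),
    }

# Example: headers from a response near the limit
status = rate_limit_status({
    "X-RateLimit-Limit": "64",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": str(int(time.time()) + 30),
})
print(status["remaining"])  # 0
```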

Tier Limits

| Tier | Requests/min | Tokens/min |
| --- | --- | --- |
| Free | 64 | 10,000 |
| CPU AMD Optimized | 128 | 100,000 |
| GPU NVIDIA Shared NVL | 256 | 500,000 |
| GPU NVIDIA Shared | 256 | 500,000 |
| GPU AMD Shared | 256 | 500,000 |
| Self-Hosted | Unlimited | Unlimited |

Handling Rate Limits

When you exceed your rate limit, the API returns a 429 Too Many Requests response. Handle this by retrying with backoff, honoring the Retry-After header when it is present:

Python

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="deepseek-r1-distill-llama-70b",
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = int(e.response.headers.get('Retry-After', 60))
            time.sleep(retry_after)
    return None
```
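When a 429 arrives without a Retry-After header, exponential backoff with jitter is a common fallback that avoids synchronized retry storms. A minimal sketch (`backoff_delay` is illustrative, not an SDK function):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(4):
    # Upper bound doubles each attempt: 1s, 2s, 4s, 8s (capped at 60s)
    print(f"attempt {attempt}: sleeping up to {min(60.0, 2.0 ** attempt):.0f}s")
```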

Error Handling

The API uses standard HTTP status codes and returns detailed error messages in JSON format. For the complete error code reference, fault categories, and retry policies, see the Error Handling documentation.

Handling Errors with OpenAI SDK

Python

```python
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

try:
    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limited. Retry after: {e.response.headers.get('Retry-After')}")
except APIError as e:
    print(f"API error: {e.message}")
```
Node.js

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

try {
  const response = await client.chat.completions.create({
    model: 'deepseek-r1-distill-llama-70b',
    messages: [{ role: 'user', content: 'Hello!' }]
  });
} catch (error) {
  if (error.status === 429) {
    const retryAfter = error.headers?.get('retry-after') || 60;
    console.log(`Rate limited. Retry after ${retryAfter}s`);
  } else if (error.status === 401) {
    console.log('Invalid API key');
  } else {
    console.log(`API error: ${error.message}`);
  }
}
```

Log Probabilities

Log probabilities (logprobs) provide per-token confidence scores for model outputs. They are useful for classification confidence scoring, retrieval evaluation, autocomplete ranking, and perplexity measurement.

Basic Request

Set logprobs: true and optionally top_logprobs (0-20) to receive per-token log probabilities in the response.

curl

```shell
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'
```

Python -- Parsing Logprobs

Python

```python
import math

import openai

client = openai.OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    logprobs=True,
    top_logprobs=3,
    max_tokens=5
)

# Print per-token probabilities
for token_info in response.choices[0].logprobs.content:
    probability = math.exp(token_info.logprob)
    print(f"Token: {token_info.token!r} "
          f"logprob: {token_info.logprob:.4f} "
          f"probability: {probability:.4%}")
    for alt in token_info.top_logprobs:
        alt_prob = math.exp(alt.logprob)
        print(f"  alt: {alt.token!r} probability: {alt_prob:.4%}")
```

Confidence Scoring

Compute overall response confidence by summing log probabilities across all tokens (equivalent to multiplying individual probabilities).

Python

```python
import math

def compute_confidence(logprobs_content):
    """Compute overall and per-token confidence from logprobs."""
    total_logprob = sum(t.logprob for t in logprobs_content)
    overall_confidence = math.exp(total_logprob)
    per_token = [
        {"token": t.token, "confidence": math.exp(t.logprob)}
        for t in logprobs_content
    ]
    return overall_confidence, per_token

confidence, tokens = compute_confidence(
    response.choices[0].logprobs.content
)
print(f"Overall confidence: {confidence:.4%}")
for t in tokens:
    print(f"  {t['token']!r}: {t['confidence']:.4%}")
```
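Perplexity, mentioned earlier as a use case, comes from the same values: it is the exponential of the negative mean token logprob. A small sketch over a plain list of logprob floats (not tied to the SDK response objects):

```python
import math

def perplexity(logprobs):
    """exp(-mean logprob); lower values mean the model was more
    confident in the tokens it generated (1.0 is certainty)."""
    if not logprobs:
        raise ValueError("need at least one logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# Two tokens the model was fairly sure of
print(round(perplexity([-0.1, -0.3]), 4))  # 1.2214
```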

Streaming with Logprobs

Logprobs are included per-chunk when streaming is enabled.

Python

```python
stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=3,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:
        print(delta.content, end="", flush=True)
    if logprobs and logprobs.content:
        for token_info in logprobs.content:
            print(f"\n  [{token_info.token!r} logprob={token_info.logprob:.3f}]", end="")
```

Node.js Example

Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key",
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [{ role: "user", content: "Is 2+2=4? Answer yes or no." }],
  logprobs: true,
  top_logprobs: 3,
  max_tokens: 5,
});

for (const token of response.choices[0].logprobs.content) {
  const probability = Math.exp(token.logprob);
  console.log(
    `Token: "${token.token}" probability: ${(probability * 100).toFixed(2)}%`
  );
}
```