
The three things most operators hit on day one: streaming wire format, burst behavior under rate limits, and how to read an error envelope. Log probabilities live here in full; streaming and errors point you at the canonical reference pages.

Streaming

The Xerotier API supports streaming responses using Server-Sent Events (SSE). Set stream: true in your request to receive partial results as they are generated.

Stream Response Format

Each streamed chunk is a JSON object sent on its own line after a data: prefix, and the stream ends with a data: [DONE] sentinel.

Response
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}
data: [DONE]
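For clients that read the raw stream instead of going through an SDK, the format above can be decoded line by line. The following is a minimal sketch rather than an official client: the helper name collect_stream_text is hypothetical, and it assumes each event arrives as a single data: line, as in the sample above.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from SSE lines until the [DONE] sentinel."""
    parts = []
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and anything that is not a data line.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" world"}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello world
```

Anything after the sentinel is ignored, so a caller can stop reading from the socket as soon as the function returns.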

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

const stream = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Write a poem about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'

For streaming error handling and mid-stream error events, see the Streaming Errors section of the Error Handling page. For SDK-specific streaming setup and additional language examples, see SDK & Integrations.

Rate Limits

Rate limits are applied per API key and vary by pricing tier.

Rate Limit Headers

API responses include headers indicating your current rate limit status:

Header                  Description
X-RateLimit-Limit       Maximum requests per minute for your tier
X-RateLimit-Remaining   Remaining requests in the current window
X-RateLimit-Reset       Seconds until the current rate-limit window resets (not a Unix timestamp)
RateLimit-Limit,        Unprefixed duplicates of the X-RateLimit-* headers above, following the
RateLimit-Remaining,    IETF draft naming; emitted on every response for clients that prefer
RateLimit-Reset         the non-prefixed form
X-RateLimit-Warning     Set to approaching_limit when remaining quota is low, before any 429 is returned
Retry-After             On 429 responses, the number of seconds the client should wait before retrying
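To act on these headers before a 429 arrives, a client can parse them into one small structure. This is a sketch under stated assumptions: RateLimitStatus and read_rate_limit are hypothetical names, the header values in the example are made up, and the function only reads a plain mapping of header names to strings.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitStatus:
    limit: Optional[int]
    remaining: Optional[int]
    reset_seconds: Optional[int]
    approaching_limit: bool


def _first_int(headers: Mapping[str, str], *names: str) -> Optional[int]:
    """Return the first header in `names` that parses as an integer."""
    for name in names:
        value = headers.get(name)
        if value is not None:
            try:
                return int(value)
            except ValueError:
                continue
    return None


def read_rate_limit(headers: Mapping[str, str]) -> RateLimitStatus:
    # HTTP header names are case-insensitive, so normalize the keys first.
    h = {k.lower(): v for k, v in headers.items()}
    return RateLimitStatus(
        # Prefer the X- prefixed header, fall back to the unprefixed duplicate.
        limit=_first_int(h, "x-ratelimit-limit", "ratelimit-limit"),
        remaining=_first_int(h, "x-ratelimit-remaining", "ratelimit-remaining"),
        reset_seconds=_first_int(h, "x-ratelimit-reset", "ratelimit-reset"),
        approaching_limit=h.get("x-ratelimit-warning") == "approaching_limit",
    )


# Illustrative values only, not captured from a real response.
status = read_rate_limit({
    "X-RateLimit-Limit": "64",
    "X-RateLimit-Remaining": "3",
    "X-RateLimit-Reset": "12",
    "X-RateLimit-Warning": "approaching_limit",
})
print(status.remaining, status.reset_seconds, status.approaching_limit)  # 3 12 True
```

A caller might slow down or pause for reset_seconds once approaching_limit is true, rather than waiting for the first rejected request.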

Tier Limits

Tier                    Requests/min   Tokens/min
Free                    64             10,000
GPU NVIDIA Shared NVL   256            500,000
CPU AMD Optimized       128            100,000
GPU NVIDIA Shared       256            500,000
GPU AMD Shared          256            500,000
Self-Hosted             Unlimited      Unlimited
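A client that knows its tier's request limit can pace itself locally instead of discovering the limit through 429 responses. The sketch below is one possible approach, not a description of how the server counts requests: it assumes a rolling 60-second window, ignores the tokens-per-minute limit, and RequestPacer is a hypothetical helper. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from collections import deque
from typing import Callable, Deque


class RequestPacer:
    """Client-side sliding-window limiter (hypothetical helper, not part of any SDK)."""

    def __init__(self, requests_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = requests_per_minute
        self.clock = clock
        self.sent: Deque[float] = deque()  # send times within the last 60 s

    def wait_time(self) -> float:
        """Seconds to wait before the next request may be sent (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return 60.0 - (now - self.sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self.sent.append(self.clock())


# Simulate a Free-tier burst (64 requests/min) with a fake clock.
fake_now = [0.0]
pacer = RequestPacer(64, clock=lambda: fake_now[0])
for _ in range(64):
    pacer.record()
print(pacer.wait_time())  # 60.0
fake_now[0] = 45.0
print(pacer.wait_time())  # 15.0
```

In real use, call wait_time() before each request, sleep for the returned value, then call record() once the request has been sent.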

Handling Rate Limits

When you exceed rate limits, the API returns a 429 Too Many Requests response with a JSON error envelope using type: "rate_limit_error". Retry after the delay given in the Retry-After header; when the header is missing, back off exponentially between attempts.

Python
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="deepseek-r1-distill-llama-70b",
                messages=messages
            )
        except RateLimitError as e:
            # Out of attempts: surface the error to the caller.
            if attempt == max_retries - 1:
                raise
            # Prefer the server-supplied delay; otherwise back off exponentially (1s, 2s, 4s, ...).
            retry_after = e.response.headers.get('Retry-After')
            delay = int(retry_after) if retry_after else 2 ** attempt
            time.sleep(delay)
    # Only reached when max_retries is 0.
    return None
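The delay calculation can be separated from the request loop so it is easy to test. The sketch below assumes Retry-After carries a number of seconds, as described in the header table. The helper name backoff_delay, the 1-second base, the 60-second cap, and the use of random jitter are illustrative choices, not values documented by the API.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int,
                  retry_after: Optional[str] = None,
                  base: float = 1.0,
                  cap: float = 60.0,
                  jitter: Callable[[], float] = random.random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # A server-supplied delay always wins over the local schedule.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # unparseable header: fall through to exponential backoff
    ceiling = min(cap, base * (2 ** attempt))
    # Randomize within [0, ceiling) so many clients do not retry in lockstep.
    return ceiling * jitter()


print(backoff_delay(2, retry_after="7"))       # 7.0
print(backoff_delay(3, jitter=lambda: 1.0))    # 8.0
print(backoff_delay(10, jitter=lambda: 1.0))   # 60.0
```

Passing a fixed jitter function, as in the last two calls, makes the schedule deterministic for tests; production code would leave the default.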

Error Handling

The API uses standard HTTP status codes and returns detailed error messages in JSON format. For the complete error code reference, fault categories, and retry policies, see the Error Handling documentation.

Handling Errors with OpenAI SDK

Python
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

try:
    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limited. Retry after: {e.response.headers.get('Retry-After')}")
except APIError as e:
    print(f"API error: {e.message}")

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

try {
  const response = await client.chat.completions.create({
    model: 'deepseek-r1-distill-llama-70b',
    messages: [{ role: 'user', content: 'Hello!' }]
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error.status === 429) {
    // error.headers is a Headers object in some SDK versions and a plain object in others.
    const headers = error.headers;
    const retryAfter =
      (typeof headers?.get === 'function'
        ? headers.get('retry-after')
        : headers?.['retry-after']) || 60;
    console.log(`Rate limited. Retry after ${retryAfter}s`);
  } else if (error.status === 401) {
    console.log('Invalid API key');
  } else {
    console.log(`API error: ${error.message}`);
  }
}

Log Probabilities

Log probabilities (logprobs) provide per-token confidence scores for model outputs. They are useful for classification confidence scoring, retrieval evaluation, autocomplete ranking, and perplexity measurement.

Basic Request

Set logprobs: true and optionally top_logprobs (0-20) to receive per-token log probabilities in the response.

curl
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'

Python: Parsing Logprobs

Python
import openai
import math

client = openai.OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Is 2+2=4? Answer yes or no."}],
    logprobs=True,
    top_logprobs=3,
    max_tokens=5
)

# Print per-token probabilities
for token_info in response.choices[0].logprobs.content:
    probability = math.exp(token_info.logprob)
    print(f"Token: {token_info.token!r} "
          f"logprob: {token_info.logprob:.4f} "
          f"probability: {probability:.4%}")
    for alt in token_info.top_logprobs:
        alt_prob = math.exp(alt.logprob)
        print(f"  alt: {alt.token!r} probability: {alt_prob:.4%}")
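For the classification use case, the alternatives returned for the first generated token can be turned into a normalized score over the labels of interest. This is a sketch of one common approach, not a feature of the API: label_probabilities is a hypothetical helper, the logprob values below are made up, and folding case and whitespace variants into one label is an assumption that may not suit every tokenizer.

```python
import math
from typing import Dict, Iterable, Tuple


def label_probabilities(top_logprobs: Iterable[Tuple[str, float]],
                        labels: Iterable[str]) -> Dict[str, float]:
    """Renormalize the probability mass of `labels` found among the alternatives."""
    wanted = {label.lower() for label in labels}
    mass: Dict[str, float] = {}
    for token, logprob in top_logprobs:
        key = token.strip().lower()
        if key in wanted:
            # Variants such as "Yes" and " yes" are folded into one label.
            mass[key] = mass.get(key, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return {}  # none of the labels appeared among the alternatives
    return {label: p / total for label, p in mass.items()}


# Illustrative alternatives for the first output token (values are invented).
alternatives = [("Yes", math.log(0.90)), ("No", math.log(0.06)), (" yes", math.log(0.03))]
probs = label_probabilities(alternatives, ["yes", "no"])
print(round(probs["yes"], 3), round(probs["no"], 3))  # 0.939 0.061
```

With an SDK response, the input pairs could be built as [(alt.token, alt.logprob) for alt in response.choices[0].logprobs.content[0].top_logprobs].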

Confidence Scoring

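Compute overall response confidence by summing log probabilities across all tokens and exponentiating the sum, which is equivalent to multiplying the individual token probabilities. Because each token probability is at most 1, this joint probability shrinks as responses get longer.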

Python
import math

def compute_confidence(logprobs_content):
    """Compute overall and per-token confidence from logprobs."""
    total_logprob = sum(t.logprob for t in logprobs_content)
    overall_confidence = math.exp(total_logprob)
    per_token = [
        {"token": t.token, "confidence": math.exp(t.logprob)}
        for t in logprobs_content
    ]
    return overall_confidence, per_token

confidence, tokens = compute_confidence(
    response.choices[0].logprobs.content
)
print(f"Overall confidence: {confidence:.4%}")
for t in tokens:
    print(f"  {t['token']!r}: {t['confidence']:.4%}")
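When comparing responses of different lengths, or measuring perplexity, a per-token average is often more useful than the joint probability. The sketch below uses the standard definitions (geometric-mean token probability is exp of the mean logprob, perplexity is its reciprocal). The function name length_normalized_scores is hypothetical and the API does not compute these values for you.

```python
import math
from typing import Sequence, Tuple


def length_normalized_scores(logprobs: Sequence[float]) -> Tuple[float, float]:
    """Return (geometric-mean token probability, perplexity) for one response."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    mean_logprob = sum(logprobs) / len(logprobs)
    return math.exp(mean_logprob), math.exp(-mean_logprob)


# Four tokens, each with probability 0.5 (invented values).
sample_logprobs = [math.log(0.5)] * 4
joint = math.exp(sum(sample_logprobs))
avg, ppl = length_normalized_scores(sample_logprobs)
print(round(joint, 6), round(avg, 6), round(ppl, 6))  # 0.0625 0.5 2.0
```

With an SDK response, the input would be [t.logprob for t in response.choices[0].logprobs.content].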

Streaming with Logprobs

Logprobs are included per-chunk when streaming is enabled.

Python
stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=3,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:
        print(delta.content, end="", flush=True)
    if logprobs and logprobs.content:
        for token_info in logprobs.content:
            print(f"\n  [{token_info.token!r} logprob={token_info.logprob:.3f}]", end="")

Node.js Example

Node.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key",
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [{ role: "user", content: "Is 2+2=4? Answer yes or no." }],
  logprobs: true,
  top_logprobs: 3,
  max_tokens: 5,
});

for (const token of response.choices[0].logprobs.content) {
  const probability = Math.exp(token.logprob);
  console.log(
    `Token: "${token.token}" probability: ${(probability * 100).toFixed(2)}%`
  );
}
