# SDK & Integration Guides

Use any OpenAI-compatible SDK to connect to Xerotier.ai. Change the base URL and API key -- everything else works the same.
## Overview

The Xerotier API is fully compatible with the OpenAI Chat Completions API. Any SDK or library that supports the OpenAI API can connect to Xerotier by changing the base URL and API key. Xerotier also provides extensions (SLO headers, auto-clamping, cached token reporting) that are available via standard HTTP headers and response fields.

For practical how-to recipes covering streaming, rate limiting, error handling patterns, and log probabilities, see Usage Guides.
## OpenAI Compatibility Matrix

The following table shows which OpenAI Chat Completions parameters Xerotier.ai supports.

### Supported Request Parameters

| Parameter | Status | Notes |
|---|---|---|
| `model` | Supported | Model name as configured on your endpoint. |
| `messages` | Supported | System, user, assistant, tool, and developer message roles. |
| `max_tokens` | Supported | Subject to auto-clamping if value exceeds model capacity. |
| `max_completion_tokens` | Supported | Preferred over `max_tokens`. Same auto-clamping behavior. |
| `temperature` | Supported | Sampling temperature (0.0 to 2.0). |
| `top_p` | Supported | Nucleus sampling parameter. |
| `stream` | Supported | Server-Sent Events streaming. See Streaming API. |
| `stream_options` | Supported | Set `include_usage: true` to receive token usage in the final stream chunk. |
| `stop` | Supported | Stop sequences (up to 4). |
| `tools` | Supported | Function/tool calling. See Tool Calling. |
| `tool_choice` | Supported | `auto`, `none`, `required`, or a specific function. |
| `parallel_tool_calls` | Supported | Enable parallel tool calls in a single response. |
| `logprobs` | Supported | Return per-token log probabilities. See API Reference. |
| `top_logprobs` | Supported | Number of top alternative tokens per position (0-20). Requires `logprobs: true`. |
| `reasoning_effort` | Supported | Controls reasoning depth for reasoning models: `"low"`, `"medium"`, or `"high"`. |
| `prediction` | Supported | Predicted output for speculative decoding. See Predicted Outputs. |
| `service_tier` | Accepted | Accepted for compatibility but ignored. Tier is determined by endpoint configuration. |
| `seed` | Supported | Deterministic sampling. Use with `system_fingerprint` for reproducibility. |
| `n` | Supported | Number of completions. Only `n=1` supported in streaming mode. |
| `frequency_penalty` | Supported | Penalty for repeated tokens (-2.0 to 2.0). |
| `presence_penalty` | Supported | Penalty for tokens already in context (-2.0 to 2.0). |
| `logit_bias` | Supported | Map of token IDs to bias values (-100 to 100). |
| `response_format` | Supported | `text`, `json_object`, or `json_schema` (structured output). |
| `metadata` | Supported | Up to 16 key-value pairs for request metadata. |
| `user` | Supported | End-user identifier for abuse monitoring. |
| `store` | Supported | Store completion for later retrieval via `GET /v1/chat/completions/{id}`. |
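Some of these parameters interact: with `stream_options: {"include_usage": true}`, the usage object arrives only in the final streamed chunk, whose `choices` list is empty. A minimal sketch of collecting it, with chunks mocked as plain dicts rather than SDK objects for illustration:

```python
def final_usage(chunks):
    """Scan streamed chunks and return the usage object, which arrives
    only in the final chunk when stream_options.include_usage is set."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage

# Mock chunks standing in for a real SSE stream:
chunks = [
    {"choices": [{"delta": {"content": "Hi"}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6}},
]
print(final_usage(chunks)["total_tokens"])  # 6
```

The same loop works unchanged over the chunk objects yielded by an OpenAI-compatible SDK stream, reading `chunk.usage` instead of `chunk.get("usage")`.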
### Supported Response Fields

All OpenAI Chat Completions response fields are supported, including:

| Field | Status | Description |
|---|---|---|
| `service_tier` | Supported | Shows which service tier processed the request. Present in all responses and streaming chunks. |
| `system_fingerprint` | Supported | Backend configuration identifier for reproducibility tracking. Use with `seed` for deterministic outputs. |
| `message.refusal` | Supported | Refusal text when the model declines a request. Streamed via `delta.refusal` in SSE mode. |
| `message.annotations` | Supported | Message annotations (URL citations). Defaults to empty array. Reserved for future web search features. |
| `logprobs` | Supported | Per-token log probabilities with `content` and `refusal` arrays. Includes `top_logprobs`. |
| `usage.prompt_tokens_details` | Supported | Includes `cached_tokens` showing how many input tokens were served from prefix cache. |
| `usage.completion_tokens_details` | Supported | Includes `reasoning_tokens`, `accepted_prediction_tokens`, and `rejected_prediction_tokens`. |
## Behavioral Differences

- **`max_tokens` auto-clamping:** Instead of returning an error when `max_tokens` exceeds model capacity, Xerotier automatically clamps the value down. The `X-Xerotier-Max-Tokens-Clamped` response header indicates the original value was reduced.
- **Service tier:** Determined by your endpoint configuration, not by a request parameter. The `service_tier` request parameter is accepted for compatibility but has no effect on routing.
- **`stream_options.include_usage`:** Defaults to `false` per the OpenAI specification. When false or omitted, the final streaming chunk does not include a `usage` object. Token counts are always tracked internally for billing regardless.
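To detect auto-clamping programmatically, check the raw response for the header. A small sketch (the header name comes from this page; the headers mapping and values here are invented for illustration):

```python
def clamped_from(headers):
    """Return the originally requested max_tokens if the gateway clamped it,
    or None if no clamping occurred. Header lookup is case-insensitive."""
    value = {k.lower(): v for k, v in headers.items()}.get("x-xerotier-max-tokens-clamped")
    return int(value) if value is not None else None

print(clamped_from({"X-Xerotier-Max-Tokens-Clamped": "32768"}))  # 32768
print(clamped_from({"Content-Type": "application/json"}))        # None
```

With the Python SDK, response headers are reachable via `client.chat.completions.with_raw_response.create(...)`, which exposes `.headers` alongside `.parse()` for the body.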
## Xerotier Extensions

These features are available via standard HTTP headers and work with any SDK that allows custom headers.

### Request Headers

| Header | Description |
|---|---|
| `X-SLO-TTFT-Ms` | Target time-to-first-token in milliseconds. Influences routing to meet your latency target. |
| `X-SLO-TPOT-Ms` | Target time-per-output-token in milliseconds. Influences routing to meet your throughput target. |
### Response Headers

| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier for debugging and log correlation. |
| `X-Xerotier-Worker-ID` | Identifies which backend worker served the request. |
| `X-Xerotier-Max-Tokens-Clamped` | Present when `max_tokens` was automatically reduced. Value is the original requested amount. |
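For log correlation it can help to pull the identifiers into one record. A minimal sketch over a response-header mapping (the values here are invented; header names are from the table above):

```python
def correlation_info(headers):
    """Extract Xerotier debugging identifiers from a response-header mapping."""
    lower = {k.lower(): v for k, v in headers.items()}
    return {
        "request_id": lower.get("x-request-id"),
        "worker_id": lower.get("x-xerotier-worker-id"),
    }

info = correlation_info({"X-Request-ID": "req_123", "X-Xerotier-Worker-ID": "worker-7"})
print(info)  # {'request_id': 'req_123', 'worker_id': 'worker-7'}
```

Attaching this dict to every log line makes it straightforward to match client-side traces against Xerotier's own request logs.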
## Python

Use the official `openai` Python package:

```bash
pip install openai
```

### Basic Request

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

# Response metadata
print(f"Service tier: {response.service_tier}")
print(f"System fingerprint: {response.system_fingerprint}")

# Token usage details
if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
if response.usage.completion_tokens_details:
    print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")

# Check for refusal
if response.choices[0].message.refusal:
    print(f"Refusal: {response.choices[0].message.refusal}")
```
### SLO Headers

```python
response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-SLO-TTFT-Ms": "500",
        "X-SLO-TPOT-Ms": "50"
    }
)
```
## Node.js

Use the official `openai` npm package:

```bash
npm install openai
```

### Basic Request

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_myproject_your_api_key'
});

const response = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
console.log(`Service tier: ${response.service_tier}`);
console.log(`System fingerprint: ${response.system_fingerprint}`);
console.log(`Cached tokens: ${response.usage?.prompt_tokens_details?.cached_tokens}`);
console.log(`Reasoning tokens: ${response.usage?.completion_tokens_details?.reasoning_tokens}`);
```
### SLO Headers

```javascript
const response = await client.chat.completions.create({
  model: 'deepseek-r1-distill-llama-70b',
  messages: [{ role: 'user', content: 'Hello!' }]
}, {
  headers: {
    'X-SLO-TTFT-Ms': '500',
    'X-SLO-TPOT-Ms': '50'
  }
});
```
## Go

Use `net/http` directly or an OpenAI Go client library.

### Using net/http

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := map[string]interface{}{
		"model": "deepseek-r1-distill-llama-70b",
		"messages": []map[string]string{
			{"role": "user", "content": "Hello!"},
		},
	}
	jsonBody, _ := json.Marshal(body)

	req, _ := http.NewRequest("POST",
		"https://api.xerotier.ai/PROJECT_ID/my-endpoint/v1/chat/completions",
		bytes.NewReader(jsonBody))
	req.Header.Set("Authorization", "Bearer xero_myproject_your_api_key")
	req.Header.Set("Content-Type", "application/json")

	// Optional: SLO headers
	req.Header.Set("X-SLO-TTFT-Ms", "500")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	data, _ := io.ReadAll(resp.Body)
	fmt.Println(string(data))

	// Check response headers
	fmt.Println("Request ID:", resp.Header.Get("X-Request-ID"))
	fmt.Println("Worker ID:", resp.Header.Get("X-Xerotier-Worker-ID"))
}
```
## curl

### Basic Request

```bash
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### With SLO Headers

```bash
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -H "X-SLO-TTFT-Ms: 500" \
  -H "X-SLO-TPOT-Ms: 50" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Streaming

```bash
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
```

### Verbose Output (View Response Headers)

```bash
curl -v https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' 2>&1 | grep -i "x-request-id\|x-xerotier\|x-ratelimit"
```
## LangChain

LangChain supports OpenAI-compatible endpoints via the `ChatOpenAI` class.

```bash
pip install langchain-openai
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key",
    model="deepseek-r1-distill-llama-70b"
)

response = llm.invoke("What is the capital of France?")
print(response.content)
```

### With Streaming

```python
for chunk in llm.stream("Write a poem about AI"):
    print(chunk.content, end="")
```
## LlamaIndex

LlamaIndex supports OpenAI-compatible endpoints through its `OpenAI` LLM class.

```bash
pip install llama-index-llms-openai
```

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key",
    model="deepseek-r1-distill-llama-70b"
)

response = llm.complete("What is the capital of France?")
print(response.text)
```
## Migrating from OpenAI

Switching from OpenAI to Xerotier.ai requires only two changes:

- Change the base URL from `https://api.openai.com/v1` to your Xerotier endpoint URL.
- Change the API key from `sk-...` to `xero_projectslug_...`.

### Before (OpenAI)

```python
from openai import OpenAI

client = OpenAI(api_key="sk-your-openai-key")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### After (Xerotier.ai)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
### URL Format

| Format | URL Pattern |
|---|---|
| Path-based | `https://api.xerotier.ai/proj_ABC123/ENDPOINT_SLUG/v1` |

### API Key Format

Xerotier API keys use the format `xero_{project_slug}_{random_characters}`. Create keys in the project dashboard or via the `POST /v1/keys` API endpoint.
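As an illustration of the key format, a pattern check that splits a key into its parts. This assumes the project slug contains no underscores, which the format implies but the docs do not state explicitly:

```python
import re

# Assumption: slug has no underscores; everything after the second
# underscore is the random secret portion of the key.
KEY_RE = re.compile(r"^xero_(?P<slug>[^_]+)_(?P<secret>.+)$")

match = KEY_RE.match("xero_myproject_your_api_key")
print(match.group("slug"))  # myproject
```

A check like this is useful for catching an OpenAI `sk-...` key accidentally passed to a Xerotier client before any request is made.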
## Xerotier Bonus Features

After migrating, you can take advantage of features not available in the OpenAI API:

- **SLO headers:** Set per-request latency targets that influence routing decisions.
- **`max_tokens` auto-clamping:** Requests with excessive `max_tokens` are automatically adjusted instead of rejected.
- **Prefix cache metrics:** See how many tokens were served from cache in each response.
- **Worker identification:** Know which backend worker served each request via response headers.