SDK & Integrations
Point any OpenAI-compatible SDK at a Xerotier base URL and ship. Same request shape, same response shape, plus SLO headers, max-tokens auto-clamping, cached-token accounting, and a defined error envelope.
For practical how-to recipes covering streaming, rate limiting, error handling patterns, and log probabilities, see Usage Guides.
Examples on this page use YOUR_MODEL_NAME as a
placeholder. Xerotier has no global catalog of pre-defined model ids;
model names are defined per project on each endpoint. Discover the
models available on your endpoint with
GET /{project_id}/{endpoint_slug}/v1/models, then substitute
the returned id wherever YOUR_MODEL_NAME appears below.
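A minimal sketch of the discovery step, using only the Python standard library. It assumes the models route returns the OpenAI-style list shape ({"object": "list", "data": [{"id": ...}]}); the helper names models_url, extract_model_ids, and list_models are illustrative and not part of any SDK.

```python
import json
import urllib.request

API_HOST = "https://api.xerotier.ai"  # host used in this page's examples


def models_url(project_id: str, endpoint_slug: str) -> str:
    """Build the models-listing URL for one endpoint."""
    return f"{API_HOST}/{project_id}/{endpoint_slug}/v1/models"


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model ids out of a list response (assumed OpenAI-style shape)."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_models(project_id: str, endpoint_slug: str, api_key: str) -> list[str]:
    """Fetch the model ids configured on the endpoint."""
    req = urllib.request.Request(
        models_url(project_id, endpoint_slug),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return extract_model_ids(json.load(resp))
```

Calling list_models("proj_ABC123", "my-endpoint", api_key) once at startup and passing one of the returned ids as model avoids hard-coding a name that may not exist on the endpoint.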
The path segment immediately before /v1 in the base URL is the endpoint
slug. Unlike the OpenAI API, an instantiated SDK client is bound to a
single endpoint, so the "one client, switch models" pattern does not
apply. Customers running multiple endpoints must instantiate one client
per endpoint slug or rebuild the base URL per request.
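One way to manage the one-client-per-endpoint rule is a small cache keyed by endpoint slug. The sketch below is an illustration, not a Xerotier API: the EndpointClients class and its make_client factory argument are hypothetical names, and the factory is injected so the pattern works with any SDK.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class EndpointClients(Generic[T]):
    """Lazily builds and caches one client per endpoint slug."""

    def __init__(self, project_id: str, make_client: Callable[[str], T],
                 host: str = "https://api.xerotier.ai") -> None:
        self._project_id = project_id
        self._make_client = make_client  # receives the endpoint base URL
        self._host = host
        self._clients: Dict[str, T] = {}

    def base_url(self, endpoint_slug: str) -> str:
        """Base URL for one endpoint, ending in /v1."""
        return f"{self._host}/{self._project_id}/{endpoint_slug}/v1"

    def get(self, endpoint_slug: str) -> T:
        """Return the cached client for the slug, creating it on first use."""
        if endpoint_slug not in self._clients:
            self._clients[endpoint_slug] = self._make_client(self.base_url(endpoint_slug))
        return self._clients[endpoint_slug]
```

With the OpenAI Python SDK, the factory could be lambda url: OpenAI(base_url=url, api_key=KEY), after which clients.get("my-endpoint") returns a client bound to that endpoint.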
Migrate from OpenAI
Two changes. Base URL, API key. Same SDK, same response shape.
- Base URL: https://api.openai.com/v1 becomes https://api.xerotier.ai/proj_ABC123/ENDPOINT_SLUG/v1.
- API key: sk-... becomes xero_{project_slug}_{random}. Create one in the dashboard or via POST /{project_id}/v1/management/api-keys.
# Before
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
# After
from openai import OpenAI
client = OpenAI(
base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)
response = client.chat.completions.create(
model="YOUR_MODEL_NAME",
messages=[{"role": "user", "content": "Hello!"}]
)
Custom-domain endpoints substitute the host portion only; the
/ENDPOINT_SLUG/v1 suffix remains. The endpoint slug is
fixed at endpoint creation and surfaces in the dashboard URL list.
What Differs from OpenAI
Three behavioral deltas worth knowing before you ship.
- max_tokens auto-clamping: requests exceeding model capacity are clamped, not rejected. X-Xerotier-Max-Tokens-Clamped reports the original value.
- service_tier in responses is the Xerotier endpoint tier slug (e.g. gpu_nvidia_shared), not the OpenAI vocabulary (default/flex/scale). On requests it does not override the endpoint tier but does influence routing priority scoring and billing within that tier.
- stream_options.include_usage defaults to false per spec. When false or omitted, the final SSE chunk carries no usage object. Token counts are tracked internally for billing regardless.
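Because clamping is silent at the response-body level, a client that cares about truncated output may want to check the header explicitly. The helper below is a minimal sketch; clamped_from is a hypothetical name, and it assumes the header value is a plain integer.

```python
from typing import Mapping, Optional


def clamped_from(headers: Mapping[str, str]) -> Optional[int]:
    """Return the originally requested max_tokens if the request was clamped, else None."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-xerotier-max-tokens-clamped":
            try:
                return int(value)
            except ValueError:
                return None  # header present but not an integer (unexpected)
    return None
```

With the OpenAI Python SDK, response headers are reachable through client.chat.completions.with_raw_response.create(...), whose result exposes .headers and .parse(); passing those headers to clamped_from tells you whether the limit you asked for was reduced.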
Legacy /v1/completions and /v1/moderations
are not implemented; chat completions, embeddings, audio, images,
files, batch, and the responses API are. See
API Reference for the full surface.
Xerotier Extensions
Custom headers and envelope fields available to any SDK that allows raw header access.
Request Headers
| Header | Description |
|---|---|
| X-SLO-TTFT-Ms | Target time-to-first-token in milliseconds. Influences routing to meet your latency target. |
| X-SLO-TPOT-Ms | Target time-per-output-token in milliseconds. Influences routing to meet your throughput target. |
Response Headers
| Header | Description |
|---|---|
| X-Request-ID | Unique request identifier for debugging and log correlation. |
| X-Xerotier-Worker-ID | Identifies which backend worker served the request. |
| X-Xerotier-Max-Tokens-Clamped | Present when max_tokens was automatically reduced. Value is the original requested amount. |
| X-RateLimit-Limit | Configured request quota for the current window. |
| X-RateLimit-Remaining | Remaining requests in the current window. |
| X-RateLimit-Reset | Seconds until the current rate-limit window resets. |
| X-RateLimit-Warning | Set when the client is approaching the configured limit. |
| Retry-After | Standard HTTP header returned with 429 responses; seconds to wait before retrying. |
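The rate-limit headers above can drive client-side pacing. The following is a sketch under stated assumptions: numeric headers are parsed as plain numbers, the mere presence of X-RateLimit-Warning is treated as the signal (its value format is not specified here), and the names RateLimitState, read_rate_limit, and seconds_to_wait are illustrative.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitState:
    limit: Optional[int]
    remaining: Optional[int]
    reset_seconds: Optional[float]
    warning: bool
    retry_after: Optional[float]


def read_rate_limit(headers: Mapping[str, str]) -> RateLimitState:
    """Collect the rate-limit headers into one structure, tolerating absent values."""
    h = {k.lower(): v for k, v in headers.items()}

    def number(name, cast):
        try:
            return cast(h[name])
        except (KeyError, ValueError):
            return None  # missing or non-numeric header

    return RateLimitState(
        limit=number("x-ratelimit-limit", int),
        remaining=number("x-ratelimit-remaining", int),
        reset_seconds=number("x-ratelimit-reset", float),
        warning="x-ratelimit-warning" in h,
        retry_after=number("retry-after", float),
    )


def seconds_to_wait(state: RateLimitState) -> float:
    """Prefer Retry-After; otherwise wait for the reset only when the quota is exhausted."""
    if state.retry_after is not None:
        return state.retry_after
    if state.remaining == 0 and state.reset_seconds is not None:
        return state.reset_seconds
    return 0.0
```

A caller might sleep for seconds_to_wait(state) before the next request and reduce concurrency when state.warning is set.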
Response Fields Beyond OpenAI
| Field | Description |
|---|---|
| x_adjusted_reasoning_effort | Resolved reasoning effort after model-family clamping (e.g. requested high may resolve to medium on a smaller reasoning model). Present on chat completion responses. |
| usage.prompt_tokens_details.cached_tokens | Prefix-cache hits served for this request. Same field name as OpenAI; populated for every endpoint, not just specific model families. |
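Typed SDK models may not declare extension fields, so reading them from the raw JSON body is the most portable route. This sketch assumes x_adjusted_reasoning_effort sits at the top level of the response object, which this page does not state explicitly; cache_and_effort is a hypothetical helper name.

```python
from typing import Optional, Tuple


def cache_and_effort(response: dict) -> Tuple[int, float, Optional[str]]:
    """Return (cached_tokens, cached share of the prompt, resolved reasoning effort)."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    prompt = usage.get("prompt_tokens") or 0
    share = cached / prompt if prompt else 0.0  # avoid dividing by zero
    # Assumption: the extension field is a top-level key of the response body.
    return cached, share, response.get("x_adjusted_reasoning_effort")
```

For a response with 200 prompt tokens of which 50 were cached, the helper reports a cached share of 0.25, which is a convenient number to track when tuning prompts for prefix reuse.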
Error Envelopes
Xerotier error responses follow the OpenAI { "error": { ... } }
envelope but extend it in two ways that SDK clients switching on
error.type must handle:
- Non-spec type values are emitted, including authorization_error, internal_error, validation_error, service_error, stream_error, forbidden_error, and insufficient_quota. Treat unknown type values defensively rather than asserting against the OpenAI enum.
- Additional envelope keys retry_after (seconds) and retry_strategy (e.g. exponential) accompany retryable failures and should be preferred over a fixed backoff.
See Error Handling for the full taxonomy.
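The two extensions above can be handled with a tolerant parser. The sketch below makes two assumptions that should be checked against the Error Handling page: the retry keys may sit either inside the error object or beside it (both are checked), and a failure is treated as retryable only when at least one retry key is present. ParsedError, parse_error, and next_delay are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedError:
    type: str
    message: str
    retry_after: Optional[float]
    retry_strategy: Optional[str]

    @property
    def retryable(self) -> bool:
        # Retry hints accompany retryable failures; no hint means do not retry.
        return self.retry_after is not None or self.retry_strategy is not None


def parse_error(body: dict) -> ParsedError:
    """Parse an error body without asserting on the set of known type values."""
    err = body.get("error") or {}

    def lookup(key):
        # Assumption: the key may be nested in "error" or at the top level.
        return err.get(key, body.get(key))

    raw_after = lookup("retry_after")
    return ParsedError(
        type=str(err.get("type") or "unknown"),
        message=str(err.get("message") or ""),
        retry_after=float(raw_after) if isinstance(raw_after, (int, float)) else None,
        retry_strategy=lookup("retry_strategy"),
    )


def next_delay(error: ParsedError, attempt: int, base: float = 1.0) -> Optional[float]:
    """Seconds to sleep before retry number `attempt` (0-based), or None to give up."""
    if not error.retryable:
        return None
    if error.retry_after is not None:
        return error.retry_after  # server-provided delay wins over local backoff
    if error.retry_strategy == "exponential":
        return base * (2 ** attempt)
    return base
```

Unknown type strings pass through unchanged, so application code can log them and fall back to a generic failure path instead of raising on an unexpected enum value.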
Cancellation
Cancel an in-flight streaming completion by issuing
POST /{project_id}/{endpoint_slug}/v1/chat/completions/{id}
using the completion id returned in the first SSE chunk.
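A sketch of the two steps involved: reading the completion id from the first SSE line, then building the cancel request. It uses only the standard library; completion_id_from_sse and cancel_request are hypothetical helper names, and whether the POST needs a body is not stated here, so none is sent.

```python
import json
import urllib.request
from typing import Optional


def completion_id_from_sse(line: str) -> Optional[str]:
    """Extract the completion id from one `data: {...}` SSE line, if present."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return chunk.get("id") if isinstance(chunk, dict) else None


def cancel_request(base_url: str, completion_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that cancels an in-flight completion."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions/{completion_id}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Sending the request is then a matter of urllib.request.urlopen(cancel_request(base_url, cid, api_key)), typically from a different thread or task than the one consuming the stream.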
Score / Rerank
The reranking endpoint (POST /v1/score) is not exposed
through the OpenAI SDK surface. See
Rerank API for the raw HTTP shape and
examples.
SDK Quick Start
Basic Request
from openai import OpenAI
client = OpenAI(
base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)
response = client.chat.completions.create(
model="YOUR_MODEL_NAME",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
print(f"Service tier: {response.service_tier}")
print(f"System fingerprint: {response.system_fingerprint}")
if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
if response.usage.completion_tokens_details:
    print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
if response.choices[0].message.refusal:
    print(f"Refusal: {response.choices[0].message.refusal}")
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
apiKey: 'xero_YOUR_PROJECT_SLUG_YOUR_API_KEY'
});
const response = await client.chat.completions.create({
model: 'YOUR_MODEL_NAME',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
console.log(`Service tier: ${response.service_tier}`);
console.log(`System fingerprint: ${response.system_fingerprint}`);
console.log(`Cached tokens: ${response.usage?.prompt_tokens_details?.cached_tokens}`);
console.log(`Reasoning tokens: ${response.usage?.completion_tokens_details?.reasoning_tokens}`);
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
)
func main() {
body := map[string]interface{}{
"model": "YOUR_MODEL_NAME",
"messages": []map[string]string{
{"role": "user", "content": "Hello!"},
},
}
jsonBody, _ := json.Marshal(body)
req, _ := http.NewRequest("POST",
"https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
bytes.NewReader(jsonBody))
req.Header.Set("Authorization", "Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY")
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
panic(err)
}
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
fmt.Println(string(data))
fmt.Println("Request ID:", resp.Header.Get("X-Request-ID"))
fmt.Println("Worker ID:", resp.Header.Get("X-Xerotier-Worker-ID"))
}
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [{"role": "user", "content": "Hello!"}]
}'
SLO Headers
Set per-request latency targets that influence routing. Optional on every call.
response = client.chat.completions.create(
model="YOUR_MODEL_NAME",
messages=[{"role": "user", "content": "Hello!"}],
extra_headers={
"X-SLO-TTFT-Ms": "500",
"X-SLO-TPOT-Ms": "50"
}
)
const response = await client.chat.completions.create({
model: 'YOUR_MODEL_NAME',
messages: [{ role: 'user', content: 'Hello!' }]
}, {
headers: {
'X-SLO-TTFT-Ms': '500',
'X-SLO-TPOT-Ms': '50'
}
});
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "X-SLO-TTFT-Ms: 500" \
-H "X-SLO-TPOT-Ms: 50" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Streaming
Set stream: true. Each event is a line prefixed with
data: carrying a JSON chunk; the stream terminates
with a literal data: [DONE]. When
stream_options.include_usage is true, the
final pre-[DONE] chunk carries a populated
usage object. See Streaming API
for the parsing patterns and the two supported wire shapes.
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-N \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [{"role": "user", "content": "Write a poem about AI"}],
"stream": true
}'
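For clients that read the stream without an SDK, the framing described above can be parsed in a few lines. This sketch covers only the data: line shape described in this section, not the second wire shape mentioned on the Streaming API page; iter_chunks and collect are illustrative names.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream; ignore anything after the sentinel
        yield json.loads(payload)


def collect(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Concatenate delta.content for choice 0 and capture usage if it was sent."""
    text, usage = [], None
    for chunk in iter_chunks(lines):
        if chunk.get("usage"):
            usage = chunk["usage"]  # only populated when include_usage is true
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) == 0:
                text.append((choice.get("delta") or {}).get("content") or "")
    return "".join(text), usage
```

The lines argument can be any iterable of decoded text lines, for example the lines of an HTTP response body read incrementally.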
Inspect Response Headers
curl -v https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [{"role": "user", "content": "Hello!"}]
}' 2>&1 | grep -i "x-request-id\|x-xerotier\|x-ratelimit"
Typed Response Parsing (Go)
For typed access to service_tier, system_fingerprint, and usage:
type Usage struct {
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
TotalTokens int `json:"total_tokens"`
}
type ChatResponse struct {
ID string `json:"id"`
Model string `json:"model"`
ServiceTier string `json:"service_tier"`
SystemFingerprint string `json:"system_fingerprint"`
Usage Usage `json:"usage"`
Choices []struct {
Message struct {
Role string `json:"role"`
Content string `json:"content"`
} `json:"message"`
FinishReason string `json:"finish_reason"`
} `json:"choices"`
}
var parsed ChatResponse
if err := json.Unmarshal(data, &parsed); err != nil {
panic(err)
}
fmt.Println("Tier:", parsed.ServiceTier, "Tokens:", parsed.Usage.TotalTokens)
LangChain
LangChain reaches an OpenAI-compatible endpoint through ChatOpenAI. Install langchain-openai, point base_url at the endpoint slug.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
model="YOUR_MODEL_NAME"
)
response = llm.invoke("What is the capital of France?")
print(response.content)
# Streaming
for chunk in llm.stream("Write a poem about AI"):
    print(chunk.content, end="")
LlamaIndex
LlamaIndex routes through its OpenAI LLM class. Install llama-index-llms-openai.
from llama_index.llms.openai import OpenAI
llm = OpenAI(
api_base="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
model="YOUR_MODEL_NAME"
)
response = llm.complete("What is the capital of France?")
print(response.text)
Full Parameter Parity
Every OpenAI Chat Completions parameter listed below is accepted with the same semantics as the upstream spec. The Notes column flags the small set with Xerotier-specific behavior; everything else passes through unchanged.
Supported request parameters (27)
| Parameter | Notes |
|---|---|
| model | Model name as configured on your endpoint. |
| messages | System, user, assistant, tool, and developer roles. |
| max_tokens | Auto-clamped if it exceeds model capacity. |
| max_completion_tokens | Preferred over max_tokens. Same auto-clamping. |
| temperature | 0.0 to 2.0. |
| top_p | Nucleus sampling. |
| stream | SSE streaming. See Streaming API. |
| stream_options | Set include_usage: true for token usage in the final chunk. |
| stop | String or array of strings. |
| tools | See Tool Calling. |
| tool_choice | auto, none, required, or specific function. |
| parallel_tool_calls | Parallel tool calls in a single response. |
| logprobs | See API Reference. |
| top_logprobs | 0-20, engine-enforced cap. Requires logprobs: true. |
| reasoning_effort | "low", "medium", or "high". May be clamped per model; resolved value surfaces as x_adjusted_reasoning_effort. |
| prediction | Speculative decoding. See Predicted Outputs. |
| service_tier | Influences routing priority and billing within the endpoint tier. Does not override the endpoint tier itself. See Service Tiers. |
| seed | Use with system_fingerprint for reproducibility. |
| n | 1-128. Router-side fan-out emits multiple choices even in streaming mode. |
| frequency_penalty | -2.0 to 2.0. |
| presence_penalty | -2.0 to 2.0. |
| logit_bias | Token-id map, -100 to 100. |
| response_format | text, json_object, or json_schema. |
| metadata | Up to 16 key-value pairs. |
| user | End-user identifier for abuse monitoring. |
| web_search_options | Enable in-line web search. Populates message.annotations with URL citations. |
| store | Retrieve later via GET /{project_id}/{endpoint_slug}/v1/chat/completions/{id}. |
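The documented ranges can be checked client-side before a request is sent, which turns some server-side validation errors into local ones. This is an optional sketch covering only the limits listed in the table; check_params is a hypothetical helper, and the server remains the authority on what it accepts.

```python
def check_params(params: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []

    def in_range(name, low, high):
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    in_range("temperature", 0.0, 2.0)
    in_range("frequency_penalty", -2.0, 2.0)
    in_range("presence_penalty", -2.0, 2.0)
    in_range("n", 1, 128)
    in_range("top_logprobs", 0, 20)

    if params.get("top_logprobs") is not None and not params.get("logprobs"):
        problems.append("top_logprobs requires logprobs: true")
    if any(not (-100 <= bias <= 100) for bias in (params.get("logit_bias") or {}).values()):
        problems.append("logit_bias values must be between -100 and 100")
    if len(params.get("metadata") or {}) > 16:
        problems.append("metadata accepts at most 16 key-value pairs")
    if params.get("reasoning_effort") not in (None, "low", "medium", "high"):
        problems.append('reasoning_effort must be "low", "medium", or "high"')
    return problems
```

Running check_params on the keyword arguments before calling client.chat.completions.create gives a fast local failure for out-of-range values.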
Supported response fields
| Field | Description |
|---|---|
| service_tier | Present in every response and SSE chunk. Value is the Xerotier endpoint tier slug, not the OpenAI vocabulary. |
| system_fingerprint | Backend configuration identifier for reproducibility tracking. |
| message.refusal | Refusal text when the model declines. SSE delta.refusal coverage on the chat-completions path is sparse; prefer the non-streamed message.refusal. |
| message.annotations | URL citations when web_search_options is set. Defaults to an empty array. Also streams via delta.annotations. |
| logprobs | Per-token log probabilities with content and refusal arrays, including top_logprobs. |
| usage.prompt_tokens_details | Includes cached_tokens served from prefix cache. |
| usage.completion_tokens_details | Includes reasoning_tokens, accepted_prediction_tokens, rejected_prediction_tokens. |
| x_adjusted_reasoning_effort | Xerotier extension. Resolved reasoning effort after model-family clamping. |