SDK & Integrations

Point any OpenAI-compatible SDK at a Xerotier base URL and ship. Same request shape, same response shape, plus SLO headers, max-tokens auto-clamping, cached-token accounting, and a defined error envelope.

For practical how-to recipes covering streaming, rate limiting, error handling patterns, and log probabilities, see Usage Guides.

Model names are illustrative. Every code example on this page uses YOUR_MODEL_NAME as a placeholder. Xerotier has no global catalog of pre-defined model ids; model names are defined per project on each endpoint. Discover the models available on your endpoint with GET /{project_id}/{endpoint_slug}/v1/models, then substitute the returned id wherever YOUR_MODEL_NAME appears below.
One client, one endpoint. The path segment before /v1 in the base URL is the endpoint slug. Unlike with the OpenAI API, an instantiated SDK client is bound to a single endpoint, so the "one client, switch models" pattern does not apply. Customers running multiple endpoints must instantiate one client per endpoint slug or rebuild the base URL per request.
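
One way to manage several endpoints is to derive a base URL per endpoint slug and keep one client per URL. The sketch below uses only the standard library; build_base_url, models_url, and the two slugs are hypothetical names chosen for illustration, and the host and project id are the ones used in this page's examples.

```python
# Sketch: derive one base URL per endpoint slug (helper names are hypothetical).
API_HOST = "https://api.xerotier.ai"  # host used in this page's examples


def build_base_url(project_id: str, endpoint_slug: str, host: str = API_HOST) -> str:
    """Return the SDK base URL for a single endpoint."""
    return f"{host.rstrip('/')}/{project_id}/{endpoint_slug}/v1"


def models_url(project_id: str, endpoint_slug: str, host: str = API_HOST) -> str:
    """Return the model-discovery URL for a single endpoint."""
    return build_base_url(project_id, endpoint_slug, host) + "/models"


# One base URL (and therefore one SDK client) per endpoint slug.
# "chat-endpoint" and "embed-endpoint" are made-up slugs.
base_urls = {
    slug: build_base_url("proj_ABC123", slug)
    for slug in ("chat-endpoint", "embed-endpoint")
}
```

Each value in base_urls can then be passed as base_url when constructing an SDK client for that endpoint.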

Migrate from OpenAI

Two changes. Base URL, API key. Same SDK, same response shape.

  1. Base URL: https://api.openai.com/v1 becomes https://api.xerotier.ai/proj_ABC123/ENDPOINT_SLUG/v1.
  2. API key: sk-... becomes xero_{project_slug}_{random}. Create one in the dashboard or via POST /{project_id}/v1/management/api-keys.

Python (OpenAI SDK)
# Before
from openai import OpenAI

client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# After
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)
response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}]
)

Custom-domain endpoints substitute the host portion only; the /ENDPOINT_SLUG/v1 suffix remains. The endpoint slug is fixed at endpoint creation and surfaces in the dashboard URL list.

What Differs from OpenAI

Three behavioral deltas worth knowing before you ship.

  • max_tokens auto-clamping: requests exceeding model capacity are clamped, not rejected. X-Xerotier-Max-Tokens-Clamped reports the original value.
  • service_tier in responses is the Xerotier endpoint tier slug (e.g. gpu_nvidia_shared), not the OpenAI vocabulary (default / flex / scale). On requests it does not override endpoint tier but DOES influence routing priority scoring and billing within that tier.
  • stream_options.include_usage defaults to false per spec. When false or omitted, the final SSE chunk carries no usage object. Token counts are tracked internally for billing regardless.

Legacy /v1/completions and /v1/moderations are not implemented; chat completions, embeddings, audio, images, files, batch, and the responses API are. See API Reference for the full surface.

Xerotier Extensions

Custom headers and envelope fields available to any SDK that allows raw header access.

Request Headers

Header         Description
X-SLO-TTFT-Ms  Target time-to-first-token in milliseconds. Influences routing to meet your latency target.
X-SLO-TPOT-Ms  Target time-per-output-token in milliseconds. Influences routing to meet your throughput target.

Response Headers

Header                         Description
X-Request-ID                   Unique request identifier for debugging and log correlation.
X-Xerotier-Worker-ID           Identifies which backend worker served the request.
X-Xerotier-Max-Tokens-Clamped  Present when max_tokens was automatically reduced. Value is the original requested amount.
X-RateLimit-Limit              Configured request quota for the current window.
X-RateLimit-Remaining          Remaining requests in the current window.
X-RateLimit-Reset              Seconds until the current rate-limit window resets.
X-RateLimit-Warning            Set when the client is approaching the configured limit.
Retry-After                    Standard HTTP header returned with 429 responses; seconds to wait before retrying.
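
As a rough illustration of consuming these headers, the sketch below condenses them into a small dictionary. summarize_headers is a hypothetical helper, and it assumes the numeric headers arrive as plain decimal strings, which this page does not state explicitly.

```python
# Sketch: condense the response headers listed above (helper names are hypothetical).
def _to_number(value, cast):
    """Convert a header string to a number; None when absent or malformed."""
    if value is None:
        return None
    try:
        return cast(value)
    except (ValueError, TypeError):
        return None


def summarize_headers(headers) -> dict:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {str(name).lower(): value for name, value in dict(headers).items()}
    return {
        "request_id": h.get("x-request-id"),
        "worker_id": h.get("x-xerotier-worker-id"),
        # Only present when max_tokens was reduced; the value is the ORIGINAL request.
        "clamped": "x-xerotier-max-tokens-clamped" in h,
        "max_tokens_requested": _to_number(h.get("x-xerotier-max-tokens-clamped"), int),
        "rate_limit_remaining": _to_number(h.get("x-ratelimit-remaining"), int),
        "rate_limit_reset_s": _to_number(h.get("x-ratelimit-reset"), float),
        "near_limit": "x-ratelimit-warning" in h,
    }
```

With the OpenAI Python SDK, headers are reachable through client.chat.completions.with_raw_response.create(...), whose result exposes .headers alongside .parse() for the usual typed response.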

Response Fields Beyond OpenAI

Field                                      Description
x_adjusted_reasoning_effort                Resolved reasoning effort after model-family clamping (e.g. requested high may resolve to medium on a smaller reasoning model). Present on chat completion responses.
usage.prompt_tokens_details.cached_tokens  Prefix-cache hits served for this request. Same field name as OpenAI; populated for every endpoint, not just specific model families.
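
A minimal sketch of reading both fields from a response that has already been decoded into a dictionary. extract_extensions is a hypothetical helper, and it assumes x_adjusted_reasoning_effort sits at the top level of the response body; the cache-hit ratio is a derived convenience value, not a field the API returns.

```python
# Sketch: pull the extension fields out of a decoded chat completion (hypothetical helper).
def extract_extensions(response: dict) -> dict:
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    prompt = usage.get("prompt_tokens") or 0
    return {
        # Assumed to be a top-level key; None when the response omits it.
        "adjusted_reasoning_effort": response.get("x_adjusted_reasoning_effort"),
        "cached_tokens": cached,
        # Share of prompt tokens served from the prefix cache (derived locally).
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
    }
```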

Error Envelopes

Xerotier error responses follow the OpenAI { "error": { ... } } envelope but extend it in two ways that SDK clients switching on error.type must handle:

  • Non-spec type values are emitted, including authorization_error, internal_error, validation_error, service_error, stream_error, forbidden_error, and insufficient_quota. Treat unknown type values defensively rather than asserting against the OpenAI enum.
  • Additional envelope keys retry_after (seconds) and retry_strategy (e.g. exponential) accompany retryable failures and should be preferred over a fixed backoff.

See Error Handling for the full taxonomy.
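
The two points above might translate into a retry planner along these lines. This is a sketch under assumptions: plan_retry is a hypothetical helper, it looks for retry_after and retry_strategy both inside the error object and beside it because this page does not pin down their position, and it treats retry_after as the first delay that doubles per attempt when the strategy is exponential.

```python
# Sketch: plan a retry from an error envelope (hypothetical helper, assumptions noted inline).
def plan_retry(body: dict, attempt: int, cap: float = 60.0) -> tuple:
    """Return (error_type, delay_seconds); delay is None when no retry hint is present."""
    err = body.get("error")
    if not isinstance(err, dict):
        err = {}
    # Pass unknown type values through instead of asserting against a fixed enum.
    err_type = err.get("type") or "unknown"
    # Assumption: the hint keys may sit inside "error" or next to it.
    retry_after = err.get("retry_after", body.get("retry_after"))
    strategy = err.get("retry_strategy", body.get("retry_strategy"))
    if retry_after is None and strategy is None:
        return (err_type, None)  # not flagged as retryable
    delay = float(retry_after) if retry_after is not None else 1.0
    if strategy == "exponential":
        # Assumption: retry_after is the base delay, doubled on each attempt (0-based).
        delay *= 2 ** attempt
    return (err_type, min(delay, cap))
```

A caller would sleep for the returned delay before retrying and surface the error immediately when the delay is None.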

Cancellation

Cancel an in-flight streaming completion by issuing POST /{project_id}/{endpoint_slug}/v1/chat/completions/{id} using the completion id returned in the first SSE chunk.
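
A sketch of preparing that call with the standard library. Both helper names are hypothetical, and the sketch assumes the cancel request needs no body and uses the same Bearer authorization as other calls; the request is only constructed here, not sent.

```python
# Sketch: build the cancel request from the first SSE chunk (helper names are hypothetical).
import json
import urllib.request


def completion_id_from_chunk(sse_line: str) -> str:
    """Extract the completion id from a 'data: {...}' SSE line."""
    payload = sse_line.strip()
    if payload.startswith("data:"):
        payload = payload[len("data:"):].strip()
    return json.loads(payload)["id"]


def build_cancel_request(base_url: str, api_key: str, completion_id: str) -> urllib.request.Request:
    """Prepare POST {base_url}/chat/completions/{id}; assumes an empty body is accepted."""
    url = f"{base_url.rstrip('/')}/chat/completions/{completion_id}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Passing the returned object to urllib.request.urlopen would issue the cancellation.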

Score / Rerank

The reranking endpoint (POST /v1/score) is not exposed through the OpenAI SDK surface. See Rerank API for the raw HTTP shape and examples.

SDK Quick Start

Basic Request

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
print(f"Service tier: {response.service_tier}")
print(f"System fingerprint: {response.system_fingerprint}")

if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
if response.usage.completion_tokens_details:
    print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
if response.choices[0].message.refusal:
    print(f"Refusal: {response.choices[0].message.refusal}")

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
  apiKey: 'xero_YOUR_PROJECT_SLUG_YOUR_API_KEY'
});

const response = await client.chat.completions.create({
  model: 'YOUR_MODEL_NAME',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
console.log(`Service tier: ${response.service_tier}`);
console.log(`System fingerprint: ${response.system_fingerprint}`);
console.log(`Cached tokens: ${response.usage?.prompt_tokens_details?.cached_tokens}`);
console.log(`Reasoning tokens: ${response.usage?.completion_tokens_details?.reasoning_tokens}`);

Go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

func main() {
    body := map[string]interface{}{
        "model": "YOUR_MODEL_NAME",
        "messages": []map[string]string{
            {"role": "user", "content": "Hello!"},
        },
    }
    jsonBody, _ := json.Marshal(body)

    req, _ := http.NewRequest("POST",
        "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
        bytes.NewReader(jsonBody))
    req.Header.Set("Authorization", "Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY")
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    data, _ := io.ReadAll(resp.Body)
    fmt.Println(string(data))
    fmt.Println("Request ID:", resp.Header.Get("X-Request-ID"))
    fmt.Println("Worker ID:", resp.Header.Get("X-Xerotier-Worker-ID"))
}

curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

SLO Headers

Set per-request latency targets that influence routing. Optional on every call.

Python
response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-SLO-TTFT-Ms": "500",
        "X-SLO-TPOT-Ms": "50"
    }
)

Node.js
const response = await client.chat.completions.create({
  model: 'YOUR_MODEL_NAME',
  messages: [{ role: 'user', content: 'Hello!' }]
}, {
  headers: {
    'X-SLO-TTFT-Ms': '500',
    'X-SLO-TPOT-Ms': '50'
  }
});

curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-SLO-TTFT-Ms: 500" \
  -H "X-SLO-TPOT-Ms: 50" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming

Set stream: true. Each event is a line prefixed with data: carrying a JSON chunk; the stream terminates with a literal data: [DONE]. When stream_options.include_usage is true, the final pre-[DONE] chunk carries a populated usage object. See Streaming API for the parsing patterns and the two supported wire shapes.

curl
curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
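
For clients that read the raw stream instead of using an SDK, the framing described above can be parsed roughly as follows. parse_sse is a hypothetical helper, and it assumes the OpenAI-style chunk shape with choices[].delta.content; it does not attempt to cover both wire shapes mentioned in Streaming API.

```python
# Sketch: parse "data:" lines up to the [DONE] sentinel (hypothetical helper).
import json


def parse_sse(lines):
    """Return (full_text, usage) from an iterable of SSE lines; usage may be None."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
        # Populated only when stream_options.include_usage is true.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text_parts), usage
```

The function accepts any iterable of decoded lines, so it can be fed from an HTTP response iterator or from a recorded transcript in tests.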

Inspect Response Headers

curl
curl -v https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' 2>&1 | grep -i "x-request-id\|x-xerotier\|x-ratelimit"

Typed Response Parsing (Go)

For typed access to service_tier, system_fingerprint, and usage:

Go
type Usage struct {
    PromptTokens     int `json:"prompt_tokens"`
    CompletionTokens int `json:"completion_tokens"`
    TotalTokens      int `json:"total_tokens"`
}

type ChatResponse struct {
    ID                string `json:"id"`
    Model             string `json:"model"`
    ServiceTier       string `json:"service_tier"`
    SystemFingerprint string `json:"system_fingerprint"`
    Usage             Usage  `json:"usage"`
    Choices           []struct {
        Message struct {
            Role    string `json:"role"`
            Content string `json:"content"`
        } `json:"message"`
        FinishReason string `json:"finish_reason"`
    } `json:"choices"`
}

// Inside main, after data has been read from the response body
// (see the Go example under Basic Request):
var parsed ChatResponse
if err := json.Unmarshal(data, &parsed); err != nil {
    panic(err)
}
fmt.Println("Tier:", parsed.ServiceTier, "Tokens:", parsed.Usage.TotalTokens)

LangChain

LangChain reaches an OpenAI-compatible endpoint through ChatOpenAI. Install langchain-openai, point base_url at the endpoint slug.

Python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
    model="YOUR_MODEL_NAME"
)

response = llm.invoke("What is the capital of France?")
print(response.content)

# Streaming
for chunk in llm.stream("Write a poem about AI"):
    print(chunk.content, end="")

LlamaIndex

LlamaIndex routes through its OpenAI LLM class. Install llama-index-llms-openai.

Python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
    model="YOUR_MODEL_NAME"
)

response = llm.complete("What is the capital of France?")
print(response.text)

Full Parameter Parity

Every OpenAI Chat Completions parameter listed below is accepted with the same semantics as the upstream spec. The Notes column flags the small set with Xerotier-specific behavior; everything else passes through unchanged.

Supported request parameters (27)

Parameter              Notes
model                  Model name as configured on your endpoint.
messages               System, user, assistant, tool, and developer roles.
max_tokens             Auto-clamped if it exceeds model capacity.
max_completion_tokens  Preferred over max_tokens. Same auto-clamping.
temperature            0.0 to 2.0.
top_p                  Nucleus sampling.
stream                 SSE streaming. See Streaming API.
stream_options         Set include_usage: true for token usage in the final chunk.
stop                   String or array of strings.
tools                  See Tool Calling.
tool_choice            auto, none, required, or specific function.
parallel_tool_calls    Parallel tool calls in a single response.
logprobs               See API Reference.
top_logprobs           0-20, engine-enforced cap. Requires logprobs: true.
reasoning_effort       "low", "medium", or "high". May be clamped per model; resolved value surfaces as x_adjusted_reasoning_effort.
prediction             Speculative decoding. See Predicted Outputs.
service_tier           Influences routing priority and billing within the endpoint tier. Does not override the endpoint tier itself. See Service Tiers.
seed                   Use with system_fingerprint for reproducibility.
n                      1-128. Router-side fan-out emits multiple choices even in streaming mode.
frequency_penalty      -2.0 to 2.0.
presence_penalty       -2.0 to 2.0.
logit_bias             Token-id map, -100 to 100.
response_format        text, json_object, or json_schema.
metadata               Up to 16 key-value pairs.
user                   End-user identifier for abuse monitoring.
web_search_options     Enable in-line web search. Populates message.annotations with URL citations.
store                  Retrieve later via GET /{project_id}/{endpoint_slug}/v1/chat/completions/{id}.
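
The documented ranges lend themselves to a client-side pre-flight check before a request is sent. The sketch below is illustrative only: check_request is a hypothetical helper that covers just the limits listed in the table, and max_tokens is deliberately left unchecked because the server clamps it.

```python
# Sketch: pre-flight check against the ranges listed above (hypothetical helper).
def check_request(params: dict) -> list:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []

    def in_range(name, low, high):
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    in_range("temperature", 0.0, 2.0)
    in_range("frequency_penalty", -2.0, 2.0)
    in_range("presence_penalty", -2.0, 2.0)
    in_range("n", 1, 128)
    in_range("top_logprobs", 0, 20)

    if params.get("top_logprobs") is not None and params.get("logprobs") is not True:
        problems.append("top_logprobs requires logprobs: true")
    if len(params.get("metadata") or {}) > 16:
        problems.append("metadata allows at most 16 key-value pairs")
    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}] must be between -100 and 100")
    effort = params.get("reasoning_effort")
    if effort is not None and effort not in ("low", "medium", "high"):
        problems.append('reasoning_effort must be "low", "medium", or "high"')
    return problems
```

Running the check locally avoids a round trip for payloads that would be rejected anyway; the server remains the source of truth for validation.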

Supported response fields

Field                            Description
service_tier                     Present in every response and SSE chunk. Value is the Xerotier endpoint tier slug, not the OpenAI vocabulary.
system_fingerprint               Backend configuration identifier for reproducibility tracking.
message.refusal                  Refusal text when the model declines. SSE delta.refusal coverage on the chat-completions path is sparse; prefer the non-streamed message.refusal.
message.annotations              URL citations when web_search_options is set. Defaults to an empty array. Also streams via delta.annotations.
logprobs                         Per-token log probabilities with content and refusal arrays, including top_logprobs.
usage.prompt_tokens_details      Includes cached_tokens served from prefix cache.
usage.completion_tokens_details  Includes reasoning_tokens, accepted_prediction_tokens, rejected_prediction_tokens.
x_adjusted_reasoning_effort      Xerotier extension. Resolved reasoning effort after model-family clamping.
