
SDK & Integrations

Point any OpenAI-compatible SDK at a Xerotier base URL and ship. Same request shape, same response shape, plus SLO headers, max_tokens auto-clamping, cached-token accounting, and a defined error envelope.

For practical how-to recipes covering streaming, rate limiting, error handling patterns, and log probabilities, see Usage Guides.

Model names are illustrative. Every code example on this page uses YOUR_MODEL_NAME as a placeholder. Xerotier has no global catalog of pre-defined model ids; model names are defined per project on each endpoint. Discover the models available on your endpoint with GET /{project_id}/{endpoint_slug}/v1/models, then substitute the returned id wherever YOUR_MODEL_NAME appears below.
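
A minimal discovery sketch using only the Python standard library. It assumes the models response uses the OpenAI list shape ({"object": "list", "data": [{"id": ...}, ...]}) and the same Bearer authentication as the curl examples on this page. `models_url`, `model_ids`, and `list_model_ids` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request

API_HOST = "https://api.xerotier.ai"


def models_url(project_id: str, endpoint_slug: str) -> str:
    """Build the GET /{project_id}/{endpoint_slug}/v1/models URL."""
    return f"{API_HOST}/{project_id}/{endpoint_slug}/v1/models"


def model_ids(payload: dict) -> list:
    """Extract model ids, assuming an OpenAI-style {"data": [{"id": ...}]} list."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_model_ids(project_id: str, endpoint_slug: str, api_key: str) -> list:
    """Fetch the endpoint's models and return their ids (performs a network call)."""
    req = urllib.request.Request(
        models_url(project_id, endpoint_slug),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return model_ids(json.load(resp))
```

With the OpenAI SDK, `client.models.list()` on a client pointed at the same base URL should surface the same ids.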

One client, one endpoint. The path segment before /v1 in the base URL is the endpoint slug. Unlike with the OpenAI API, an instantiated SDK client is bound to a single endpoint, so the "one client, switch models" pattern does not apply across endpoints. Customers running multiple endpoints must instantiate one client per endpoint slug or rebuild the base URL per request.
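
One way to manage several endpoints is to cache one client per endpoint slug. The sketch below is a hypothetical helper (`EndpointClients` is not part of any SDK). It takes the client constructor as a parameter, so `openai.OpenAI` can be passed in production and a stub in tests.

```python
from typing import Any, Callable, Dict

XEROTIER_HOST = "https://api.xerotier.ai"


class EndpointClients:
    """Lazily create and cache one SDK client per Xerotier endpoint slug."""

    def __init__(self, project_id: str, api_key: str,
                 client_factory: Callable[..., Any]) -> None:
        self._project_id = project_id
        self._api_key = api_key
        self._factory = client_factory  # e.g. openai.OpenAI
        self._clients: Dict[str, Any] = {}

    def base_url(self, endpoint_slug: str) -> str:
        # The endpoint slug is the path segment immediately before /v1.
        return f"{XEROTIER_HOST}/{self._project_id}/{endpoint_slug}/v1"

    def get(self, endpoint_slug: str) -> Any:
        # A client is bound to one base URL, so each slug gets its own instance.
        if endpoint_slug not in self._clients:
            self._clients[endpoint_slug] = self._factory(
                base_url=self.base_url(endpoint_slug),
                api_key=self._api_key,
            )
        return self._clients[endpoint_slug]
```

Usage: `clients = EndpointClients("proj_ABC123", "xero_YOUR_PROJECT_SLUG_YOUR_API_KEY", OpenAI)`, then `clients.get("my-endpoint").chat.completions.create(...)`. Caching avoids rebuilding a client (and its connection pool) on every request.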

Migrate from OpenAI

Two changes. Base URL, API key. Same SDK, same response shape.

Base URL: https://api.openai.com/v1 becomes https://api.xerotier.ai/proj_ABC123/ENDPOINT_SLUG/v1.
API key: sk-... becomes xero_{project_slug}_{random}. Create one in the dashboard or via POST /{project_id}/v1/management/api-keys.

Python (OpenAI SDK)

# Before
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# After
from openai import OpenAI
client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)
response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}]
)
                

Custom-domain endpoints substitute the host portion only; the /ENDPOINT_SLUG/v1 suffix remains. The endpoint slug is fixed at endpoint creation and surfaces in the dashboard URL list.

What Differs from OpenAI

Three behavioral deltas worth knowing before you ship.

max_tokens auto-clamping: requests exceeding model capacity are clamped, not rejected. X-Xerotier-Max-Tokens-Clamped reports the original value.
service_tier in responses is the Xerotier endpoint tier slug (e.g. gpu_nvidia_shared), not the OpenAI vocabulary (default / flex / scale). On requests, it does not override the endpoint tier but DOES influence routing priority scoring and billing within that tier.
stream_options.include_usage defaults to false per spec. When false or omitted, the final SSE chunk carries no usage object. Token counts are tracked internally for billing regardless.
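
Because clamping is silent in the response body, the header is the only signal. A minimal sketch for detecting it, assuming the response headers are available as a mapping; `original_max_tokens` is a hypothetical helper.

```python
from typing import Mapping, Optional

CLAMP_HEADER = "x-xerotier-max-tokens-clamped"


def original_max_tokens(headers: Mapping[str, str]) -> Optional[int]:
    """Return the originally requested max_tokens if Xerotier clamped it, else None.

    HTTP header names are case-insensitive, so keys are compared lowercased.
    """
    for name, value in headers.items():
        if name.lower() == CLAMP_HEADER:
            try:
                return int(value)
            except ValueError:
                return None  # unexpected format: treat as unknown
    return None
```

With openai-python v1, `raw = client.chat.completions.with_raw_response.create(...)` exposes `raw.headers`, which can be passed to this helper, while `raw.parse()` returns the usual completion object.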

Legacy /v1/completions and /v1/moderations are not implemented; chat completions, embeddings, audio, images, files, batch, and the responses API are. See API Reference for the full surface.

Xerotier Extensions

Custom request and response headers (usable from any SDK that allows raw header access) and response fields beyond the OpenAI schema.

Request Headers

Header	Description
`X-SLO-TTFT-Ms`	Target time-to-first-token in milliseconds. Influences routing to meet your latency target.
`X-SLO-TPOT-Ms`	Target time-per-output-token in milliseconds. Influences routing to meet your throughput target.

Response Headers

Header	Description
`X-Request-ID`	Unique request identifier for debugging and log correlation.
`X-Xerotier-Worker-ID`	Identifies which backend worker served the request.
`X-Xerotier-Max-Tokens-Clamped`	Present when max_tokens was automatically reduced. Value is the original requested amount.
`X-RateLimit-Limit`	Configured request quota for the current window.
`X-RateLimit-Remaining`	Remaining requests in the current window.
`X-RateLimit-Reset`	Seconds until the current rate-limit window resets.
`X-RateLimit-Warning`	Set when the client is approaching the configured limit.
`Retry-After`	Standard HTTP header returned with 429 responses; seconds to wait before retrying.
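
A minimal sketch that collects these headers into one object, assuming integer values as described in the table; a missing or non-integer value becomes None. `RateLimitInfo` and its helpers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


def _header(headers: Mapping[str, str], name: str) -> Optional[str]:
    # Header names are case-insensitive; normalize before comparing.
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def _int_or_none(value: Optional[str]) -> Optional[int]:
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


@dataclass
class RateLimitInfo:
    limit: Optional[int]
    remaining: Optional[int]
    reset_seconds: Optional[int]
    warning: Optional[str]
    retry_after: Optional[int]

    @classmethod
    def from_headers(cls, headers: Mapping[str, str]) -> "RateLimitInfo":
        return cls(
            limit=_int_or_none(_header(headers, "X-RateLimit-Limit")),
            remaining=_int_or_none(_header(headers, "X-RateLimit-Remaining")),
            reset_seconds=_int_or_none(_header(headers, "X-RateLimit-Reset")),
            warning=_header(headers, "X-RateLimit-Warning"),
            retry_after=_int_or_none(_header(headers, "Retry-After")),
        )

    def should_slow_down(self) -> bool:
        # Throttle when the server warns or the current window is exhausted.
        return self.warning is not None or self.remaining == 0
```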

Response Fields Beyond OpenAI

Field	Description
`x_adjusted_reasoning_effort`	Resolved reasoning effort after model-family clamping (e.g. requested `high` may resolve to `medium` on a smaller reasoning model). Present on chat completion responses.
`usage.prompt_tokens_details.cached_tokens`	Prefix-cache hits served for this request. Same field name as OpenAI; populated for every endpoint, not just specific model families.
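
A minimal sketch for reading both fields from a decoded JSON response body; `XerotierExtras` and `read_extras` are hypothetical names. Missing fields come back as None rather than raising.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class XerotierExtras:
    cached_tokens: Optional[int]
    adjusted_reasoning_effort: Optional[str]


def read_extras(body: Mapping[str, Any]) -> XerotierExtras:
    """Pull the fields above out of a decoded chat completion body."""
    usage = body.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return XerotierExtras(
        cached_tokens=details.get("cached_tokens"),
        adjusted_reasoning_effort=body.get("x_adjusted_reasoning_effort"),
    )
```

With openai-python v1, `response.model_dump()` should produce such a dict; the SDK's response models accept unknown fields, so `x_adjusted_reasoning_effort` is expected to survive, but confirm this against the SDK version you pin.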

Error Envelopes

Xerotier error responses follow the OpenAI { "error": { ... } } envelope but extend it in two ways that SDK clients switching on error.type must handle:

Non-spec type values are emitted, including authorization_error, internal_error, validation_error, service_error, stream_error, forbidden_error, and insufficient_quota. Treat unknown type values defensively rather than asserting against the OpenAI enum.
Additional envelope keys retry_after (seconds) and retry_strategy (e.g. exponential) accompany retryable failures and should be preferred over a fixed backoff.
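
A minimal sketch of defensive envelope handling, assuming the JSON body has already been decoded. Since this page does not pin down whether retry_after and retry_strategy sit inside the error object or beside it, the sketch checks both. `parse_error`, `retry_delay`, and the jittered backoff policy are illustrative choices, not Xerotier-defined behavior.

```python
import random
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class XerotierError:
    type: str                      # may be a non-OpenAI value; never assert on an enum
    message: str
    retry_after: Optional[float]   # seconds, when the server supplies it
    retry_strategy: Optional[str]  # e.g. "exponential"


def parse_error(body: Mapping[str, Any]) -> XerotierError:
    """Parse an {"error": {...}} envelope without assuming a fixed type enum."""
    err = body.get("error") or {}
    retry_after = err.get("retry_after", body.get("retry_after"))
    return XerotierError(
        type=str(err.get("type") or "unknown"),
        message=str(err.get("message") or ""),
        retry_after=float(retry_after) if retry_after is not None else None,
        retry_strategy=err.get("retry_strategy", body.get("retry_strategy")),
    )


def retry_delay(error: XerotierError, attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if error.retry_after is not None:
        return error.retry_after  # server hint wins over any local policy
    if error.retry_strategy == "exponential":
        # Exponential backoff with full jitter, capped at `cap`.
        return random.uniform(0, min(cap, base * 2 ** attempt))
    return base  # no hint at all: fall back to a fixed short delay
```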

See Error Handling for the full taxonomy.

Cancellation

Cancel an in-flight streaming completion by issuing POST /{project_id}/{endpoint_slug}/v1/chat/completions/{id} using the completion id returned in the first SSE chunk.
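
A minimal sketch that builds the cancellation request with the standard library. Whether the endpoint expects a request body is not specified here, so this sketch sends none; `build_cancel_request` is a hypothetical helper.

```python
import urllib.request


def build_cancel_request(base_url: str, completion_id: str,
                         api_key: str) -> urllib.request.Request:
    """Build (but do not send) POST {base_url}/chat/completions/{completion_id}.

    base_url is the endpoint base, e.g.
    https://api.xerotier.ai/proj_ABC123/my-endpoint/v1
    """
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions/{completion_id}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Send it with `urllib.request.urlopen(build_cancel_request(...), timeout=10)`, using the `id` field from the first SSE chunk of the stream you want to stop.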

Score / Rerank

The reranking endpoint (POST /v1/score) is not exposed through the OpenAI SDK surface. See Rerank API for the raw HTTP shape and examples.

SDK Quick Start


Basic Request

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
print(f"Service tier: {response.service_tier}")
print(f"System fingerprint: {response.system_fingerprint}")

if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
if response.usage.completion_tokens_details:
    print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")

if response.choices[0].message.refusal:
    print(f"Refusal: {response.choices[0].message.refusal}")
                

Node.js

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'https://api.xerotier.ai/proj_ABC123/my-endpoint/v1',
    apiKey: 'xero_YOUR_PROJECT_SLUG_YOUR_API_KEY'
});

const response = await client.chat.completions.create({
    model: 'YOUR_MODEL_NAME',
    messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
console.log(`Service tier: ${response.service_tier}`);
console.log(`System fingerprint: ${response.system_fingerprint}`);
console.log(`Cached tokens: ${response.usage?.prompt_tokens_details?.cached_tokens}`);
console.log(`Reasoning tokens: ${response.usage?.completion_tokens_details?.reasoning_tokens}`);
                

Go

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    body := map[string]interface{}{
        "model": "YOUR_MODEL_NAME",
        "messages": []map[string]string{
            {"role": "user", "content": "Hello!"},
        },
    }
    jsonBody, err := json.Marshal(body)
    if err != nil {
        log.Fatal(err)
    }

    req, err := http.NewRequest(http.MethodPost,
        "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
        bytes.NewReader(jsonBody))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY")
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    if resp.StatusCode != http.StatusOK {
        // Error bodies use the {"error": {...}} envelope; see Error Envelopes.
        log.Fatalf("status %d: %s", resp.StatusCode, data)
    }

    fmt.Println(string(data))
    fmt.Println("Request ID:", resp.Header.Get("X-Request-ID"))
    fmt.Println("Worker ID:", resp.Header.Get("X-Xerotier-Worker-ID"))
}
                

curl

curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
                

SLO Headers

Set per-request latency targets that influence routing. Optional on every call.

Python

response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-SLO-TTFT-Ms": "500",
        "X-SLO-TPOT-Ms": "50"
    }
)
                

Node.js

const response = await client.chat.completions.create({
    model: 'YOUR_MODEL_NAME',
    messages: [{ role: 'user', content: 'Hello!' }]
}, {
    headers: {
        'X-SLO-TTFT-Ms': '500',
        'X-SLO-TPOT-Ms': '50'
    }
});
                

curl

curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-SLO-TTFT-Ms: 500" \
  -H "X-SLO-TPOT-Ms: 50" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
                

Streaming

Set stream: true. Each event is a line prefixed with data: carrying a JSON chunk; the stream terminates with a literal data: [DONE]. When stream_options.include_usage is true, the final pre-[DONE] chunk carries a populated usage object. See Streaming API for the parsing patterns and the two supported wire shapes.

curl

curl https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Write a poem about AI"}],
    "stream": true
  }'
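
To consume the stream without an SDK, a minimal Python sketch of the data: / data: [DONE] shape described above. It assumes n=1 (with n > 1, group deltas by choice["index"]) and does not cover the second wire shape documented in Streaming API; `collect_stream` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Join streamed delta.content and capture usage from an SSE body."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines, comments, other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]  # populated only when include_usage is true
    return "".join(text_parts), usage
```

Usage with the standard library: `with urllib.request.urlopen(req) as resp: text, usage = collect_stream(l.decode("utf-8") for l in resp)`, where `req` is a POST carrying `"stream": true`.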
                

Inspect Response Headers

curl

curl -v https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_YOUR_PROJECT_SLUG_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' 2>&1 | grep -iE "x-request-id|x-xerotier|x-ratelimit"
                

Typed Response Parsing (Go)

For typed access to service_tier, system_fingerprint, and usage:

// Add to the Go quick-start program above (it already imports encoding/json
// and fmt), then call printSummary(data) after reading the response body.
type Usage struct {
    PromptTokens     int `json:"prompt_tokens"`
    CompletionTokens int `json:"completion_tokens"`
    TotalTokens      int `json:"total_tokens"`
}

type ChatResponse struct {
    ID                string `json:"id"`
    Model             string `json:"model"`
    ServiceTier       string `json:"service_tier"`
    SystemFingerprint string `json:"system_fingerprint"`
    Usage             Usage  `json:"usage"`
    Choices           []struct {
        Message struct {
            Role    string `json:"role"`
            Content string `json:"content"`
        } `json:"message"`
        FinishReason string `json:"finish_reason"`
    } `json:"choices"`
}

// printSummary decodes a chat completion body into ChatResponse.
func printSummary(data []byte) error {
    var parsed ChatResponse
    if err := json.Unmarshal(data, &parsed); err != nil {
        return fmt.Errorf("decode chat response: %w", err)
    }
    fmt.Println("Tier:", parsed.ServiceTier, "Tokens:", parsed.Usage.TotalTokens)
    return nil
}
                

LangChain

LangChain reaches an OpenAI-compatible endpoint through ChatOpenAI. Install langchain-openai and set base_url to the full endpoint base URL, including the endpoint slug.

Python

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
    model="YOUR_MODEL_NAME"
)

response = llm.invoke("What is the capital of France?")
print(response.content)

# Streaming
for chunk in llm.stream("Write a poem about AI"):
    print(chunk.content, end="")
                

LlamaIndex

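LlamaIndex's OpenAI LLM class can reject model names outside OpenAI's own catalog or route them to the legacy completions API, which Xerotier does not implement. Use OpenAILike instead: install llama-index-llms-openai-like and set is_chat_model=True so requests go to chat completions.

Python

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_YOUR_PROJECT_SLUG_YOUR_API_KEY",
    model="YOUR_MODEL_NAME",
    is_chat_model=True  # route complete() through /chat/completions
)

response = llm.complete("What is the capital of France?")
print(response.text)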
                

Full Parameter Parity

Every OpenAI Chat Completions parameter listed below is accepted with the same semantics as the upstream spec, except where its Notes entry describes Xerotier-specific behavior (max_tokens auto-clamping, reasoning_effort resolution, service_tier routing, router-side n fan-out). Everything else passes through unchanged.

Supported request parameters (27)

Parameter	Notes
`model`	Model name as configured on your endpoint.
`messages`	System, user, assistant, tool, and developer roles.
`max_tokens`	Auto-clamped if it exceeds model capacity.
`max_completion_tokens`	Preferred over `max_tokens`. Same auto-clamping.
`temperature`	0.0 to 2.0.
`top_p`	Nucleus sampling.
`stream`	SSE streaming. See Streaming API.
`stream_options`	Set `include_usage: true` for token usage in the final chunk.
`stop`	String or array of strings.
`tools`	See Tool Calling.
`tool_choice`	`auto`, `none`, `required`, or a specific function.
`parallel_tool_calls`	Allows multiple tool calls in a single response.
`logprobs`	See API Reference.
`top_logprobs`	0-20, engine-enforced cap. Requires `logprobs: true`.
`reasoning_effort`	`"low"`, `"medium"`, or `"high"`. May be clamped per model; resolved value surfaces as `x_adjusted_reasoning_effort`.
`prediction`	Speculative decoding. See Predicted Outputs.
`service_tier`	Influences routing priority and billing within the endpoint tier. Does not override the endpoint tier itself. See Service Tiers.
`seed`	Use with `system_fingerprint` for reproducibility.
`n`	1-128. Router-side fan-out emits multiple choices even in streaming mode.
`frequency_penalty`	-2.0 to 2.0.
`presence_penalty`	-2.0 to 2.0.
`logit_bias`	Token-id map, -100 to 100.
`response_format`	`text`, `json_object`, or `json_schema`.
`metadata`	Up to 16 key-value pairs.
`user`	End-user identifier for abuse monitoring.
`web_search_options`	Enable in-line web search. Populates `message.annotations` with URL citations.
`store`	Retrieve later via `GET /{project_id}/{endpoint_slug}/v1/chat/completions/{id}`.

Supported response fields

Field	Description
`service_tier`	Present in every response and SSE chunk. Value is the Xerotier endpoint tier slug, not the OpenAI vocabulary.
`system_fingerprint`	Backend configuration identifier for reproducibility tracking.
`message.refusal`	Refusal text when the model declines. SSE `delta.refusal` coverage on the chat-completions path is sparse; prefer the non-streamed `message.refusal`.
`message.annotations`	URL citations when `web_search_options` is set. Defaults to an empty array. Also streams via `delta.annotations`.
`logprobs`	Per-token log probabilities with `content` and `refusal` arrays, including `top_logprobs`.
`usage.prompt_tokens_details`	Includes `cached_tokens` served from prefix cache.
`usage.completion_tokens_details`	Includes `reasoning_tokens`, `accepted_prediction_tokens`, `rejected_prediction_tokens`.
`x_adjusted_reasoning_effort`	Xerotier extension. Resolved reasoning effort after model-family clamping.
