API Reference

Core API endpoints for chat completions and model information. The Xerotier.ai API is fully compatible with the OpenAI API specification.

Chat Completions

Creates a model response for the given chat conversation. This is the primary endpoint for interacting with language models.

Create Chat Completion

POST /:project_id/:endpoint_slug/v1/chat/completions

URL path: :project_id is your project's external ID (e.g., proj_ABC123) and :endpoint_slug is the slug of the endpoint to route through (e.g., my-endpoint). Your full base URL is https://api.xerotier.ai/proj_ABC123/<endpoint-slug>/v1.

Note on service tier: The service tier is determined by your endpoint configuration, not by the request body. There is no client-side tier override. The tier controls which accelerator types and workers can serve requests. See the Service Tiers documentation for full details.

Request Body

Parameter Type Description
model (required) string ID of the model to use (e.g., "deepseek-r1-distill-llama-70b"). This field is informational; the actual model used is determined by the endpoint configuration.
messages (required) array A list of messages comprising the conversation so far. Supported roles: system, user, assistant, tool, developer.
max_tokens (optional) integer Maximum number of tokens to generate. When omitted, the backend uses the remaining context window as the limit. If this value exceeds the available context window, it is automatically clamped. See Auto-Clamping below.
max_completion_tokens (optional) integer Upper bound on the number of tokens to generate. Preferred over max_tokens. The same auto-clamping behavior applies. When both are set, this takes precedence.
temperature (optional) number Sampling temperature (0.0-2.0). Higher values make output more random. Default: 0.7
top_p (optional) number Nucleus sampling parameter (0.0-1.0). Default: 0.9
stream (optional) boolean If true, partial message deltas are sent as Server-Sent Events. Default: true
stream_options (optional) object Options for streaming mode. Set {"include_usage": true} to include token usage in the final stream chunk. When the option is omitted, or when include_usage is false, the final chunk does not contain a usage field. Token usage is always tracked internally for billing regardless of this setting.
stop (optional) string | array Up to 4 sequences where the API will stop generating.
frequency_penalty (optional) number Penalty for repeated tokens (-2.0 to 2.0). Positive values decrease the likelihood of repeating the same tokens. Default: 0
presence_penalty (optional) number Penalty for tokens already in the context (-2.0 to 2.0). Positive values increase the likelihood of new topics. Default: 0
n (optional) integer Number of completions to generate. Default: 1. Only n=1 is supported in streaming mode.
seed (optional) integer Seed for deterministic sampling. When set, repeated requests with the same seed and parameters should return the same result.
logprobs (optional) boolean Return log probabilities of output tokens. Default: false
top_logprobs (optional) integer Number of most likely tokens to return log probabilities for (0-20). Requires logprobs: true.
logit_bias (optional) object Map of token IDs to bias values (-100 to 100). Modifies the likelihood of specified tokens appearing in the output.
tools (optional) array A list of tool definitions the model may call. See Tool Calling below.
tool_choice (optional) string | object Controls tool selection: "auto", "none", "required", or {"type": "function", "function": {"name": "fn_name"}}
parallel_tool_calls (optional) boolean Enable parallel tool calls. When true, the model may generate multiple tool calls in a single response.
response_format (optional) object Response format: {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}, "strict": true}}
metadata (optional) object Up to 16 key-value pairs for request metadata. Keys are limited to 64 characters, values to 512 characters.
user (optional) string A unique identifier for the end user. Used for abuse monitoring and usage tracking.
store (optional) boolean Store the completion for later retrieval. Default: false
reasoning_effort (optional) string Reasoning effort level for reasoning models (e.g., o1). Valid values: "low", "medium", "high". Controls how much reasoning the model applies before generating output. Invalid values return a 400 error. When set, the idle stream timeout uses the full request deadline to accommodate long reasoning phases.
service_tier (optional) string Requested service tier. Accepted for OpenAI API compatibility but has no effect on routing; the endpoint's configured tier is always used. The actual tier used is returned in the service_tier response field.
prediction (optional) object Predicted output content for speculative decoding. When the model can verify the prediction, generation is faster because tokens are validated in parallel rather than generated sequentially. The object must have "type": "content" and a "content" field (string or array of content parts). Token counts for accepted and rejected predictions appear in completion_tokens_details. See Predicted Outputs below.
modalities (optional) array Output modalities to generate. Currently only ["text"] is supported. Requests with unsupported modalities (e.g., "audio", "image") return a 400 error.

Optional Request Headers

Header Type Description
X-SLO-TTFT-Ms (optional) number Target time-to-first-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored.
X-SLO-TPOT-Ms (optional) number Target time-per-output-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored.

Response Headers

Header Description
X-Request-ID Unique identifier for the request (matches the response body id field). Present on both streaming and non-streaming responses. Include this in support tickets for request tracing.
X-Xerotier-Worker-ID Identifier of the worker that handled the request. Useful for correlating latency with routing decisions.
X-Xerotier-Max-Tokens-Clamped Present only when max_tokens was automatically reduced to fit the available context window. Format: <original> -> <clamped> (e.g., 32000 -> 18943). See Auto-Clamping below.

Message Object

Parameter Type Description
role (required) string The role of the message author: system, user, assistant, tool, or developer
content (required) string | array The content of the message. Can be a string or an array of content parts.
name (optional) string An optional name for the participant. Useful for distinguishing between multiple users or assistants in the same conversation.
tool_call_id (optional) string Required when role is tool. The ID of the tool call this message responds to.
tool_calls (optional) array Tool calls generated by the model (present in assistant messages).

Example Request

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
Node.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 100,
  temperature: 0.7
});

console.log(response.choices[0].message.content);
curl
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
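When stream is enabled, the client receives partial message deltas that must be concatenated to recover the full message, and the final chunk may carry a usage object if stream_options.include_usage was set. A minimal sketch of that accumulation logic, assuming the SSE chunks have already been parsed into dicts; the chunk contents below are illustrative, not captured API output:

```python
# Sketch: reassemble assistant text from parsed stream chunks.
# Assumes each chunk is a dict shaped like an OpenAI-style
# chat.completion.chunk object.

def accumulate_stream(chunks):
    """Concatenate delta.content across chunks and pick up the
    final usage object when stream_options.include_usage is set."""
    text_parts = []
    usage = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):  # only the final chunk carries usage
            usage = chunk["usage"]
    return "".join(text_parts), usage

# Illustrative chunk sequence:
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "The capital "}}]},
    {"choices": [{"delta": {"content": "is Paris."}}]},
    {"choices": [], "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}},
]
text, usage = accumulate_stream(chunks)
print(text)                   # The capital is Paris.
print(usage["total_tokens"])  # 33
```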

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706123456,
  "model": "deepseek-r1-distill-llama-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "refusal": null,
        "annotations": []
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_44709d6fcb"
}

Log Probabilities Response

When logprobs: true is set in the request, each choice includes a logprobs object with per-token log probabilities. The top_logprobs parameter controls how many alternative tokens are returned (0-20).

Field Type Description
logprobs.content array | null Array of token log probability objects for each content token. Null when the model produces no content tokens (e.g., a pure refusal or tool call).
logprobs.content[].token string The token string.
logprobs.content[].logprob float Log probability of this token. 0.0 means 100% confidence; more negative values indicate lower confidence.
logprobs.content[].bytes array | null UTF-8 byte representation of the token.
logprobs.content[].top_logprobs array Top alternative tokens at this position, each with token, logprob, and bytes fields. Array length matches the top_logprobs request parameter.
logprobs.refusal array | null Array of token log probability objects for refusal tokens. Present when the model refuses to comply with a request. Each entry has the same structure as logprobs.content[] entries (token, logprob, bytes, top_logprobs). Null when the model does not refuse.

Log Probabilities Request and Response Example

# Request with logprobs enabled
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is Paris the capital of France? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'

# Response (truncated)
{
  "id": "chatcmpl-abc456",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Yes"},
      "finish_reason": "stop",
      "logprobs": {
        "content": [
          {
            "token": "Yes",
            "logprob": -0.00012,
            "bytes": [89, 101, 115],
            "top_logprobs": [
              {"token": "Yes", "logprob": -0.00012, "bytes": [89, 101, 115]},
              {"token": "yes", "logprob": -9.08, "bytes": [121, 101, 115]},
              {"token": "YES", "logprob": -12.31, "bytes": [89, 69, 83]}
            ]
          }
        ],
        "refusal": null
      }
    }
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 1, "total_tokens": 19}
}
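A logprob converts to a plain probability by exponentiation (p = e^logprob), which is often more readable when comparing alternatives. A quick sketch using the values from the truncated response above:

```python
import math

# Convert the log probabilities from the example response into
# plain probabilities. Values are taken from the response above.
top = [
    {"token": "Yes", "logprob": -0.00012},
    {"token": "yes", "logprob": -9.08},
    {"token": "YES", "logprob": -12.31},
]

probs = {t["token"]: math.exp(t["logprob"]) for t in top}
print(f"{probs['Yes']:.5f}")  # 0.99988 -- the chosen token is near-certain
```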

Usage Object

Field Type Description
prompt_tokens integer Number of tokens in the input prompt.
completion_tokens integer Number of tokens in the generated output.
total_tokens integer Sum of prompt_tokens and completion_tokens.
prompt_tokens_details.cached_tokens integer Number of prompt tokens served from the prefix cache (KV cache reuse). A higher value indicates more cache hits, reducing time-to-first-token. 0 if no tokens were cached.
prompt_tokens_details.audio_tokens integer | null Audio input tokens. Null for text-only models.
completion_tokens_details.reasoning_tokens integer | null Tokens used for internal reasoning. Null for non-reasoning models.
completion_tokens_details.audio_tokens integer | null Audio output tokens. Null for text-only models.
completion_tokens_details.accepted_prediction_tokens integer | null Predicted tokens that appeared in the output.
completion_tokens_details.rejected_prediction_tokens integer | null Predicted tokens that did not appear in the output.
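Because cached_tokens counts the prompt tokens served from the prefix cache, a per-request cache hit rate is simply cached_tokens divided by prompt_tokens. A small sketch over an illustrative usage object shaped like the schema above:

```python
# Sketch: derive a prompt-cache hit rate from the usage object.
# The usage dict below is illustrative, mirroring the documented schema.

def cache_hit_rate(usage):
    """Fraction of prompt tokens served from the prefix cache."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt if prompt else 0.0

usage = {
    "prompt_tokens": 200,
    "completion_tokens": 50,
    "total_tokens": 250,
    "prompt_tokens_details": {"cached_tokens": 150, "audio_tokens": None},
}
print(cache_hit_rate(usage))  # 0.75
```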

Response Message Fields

The message object in each response choice contains the assistant's output. In addition to role and content, the following fields may be present:

Field Type Description
role string Always "assistant" in completion responses.
content string | null The generated text content. Null when the model produces only tool calls or a refusal.
refusal string | null A refusal message when the model declines to respond (content policy, safety filters). Null when the model does not refuse. When present, content is typically null. In streaming mode, refusal text is delivered incrementally via delta.refusal.
tool_calls array | null Tool calls generated by the model. Present when finish_reason is "tool_calls". See Tool Calling.
annotations array Message annotations such as URL citations. Defaults to an empty array. Reserved for future use with web search and citation features.

Additional Response Fields

Field Type Description
service_tier string | null The service tier used to process the request, as determined by the endpoint configuration (e.g., "free", "gpu_nvidia_shared"). Always present in both streaming and non-streaming responses.
system_fingerprint string | null Identifies the backend system configuration (model weights, quantization, GPU type) used for the request. Can be used alongside the seed parameter for reproducibility debugging. Present in both streaming chunks and non-streaming responses. Returns null when the backend does not provide a fingerprint.

Predicted Outputs

The prediction parameter enables speculative decoding: you supply a predicted output and the model verifies it in parallel rather than generating each token sequentially. When the prediction matches, generation is significantly faster. When it does not match, the model falls back to normal generation.

Prediction Object

Field Type Description
type (required) string Must be "content".
content (required) string | array The predicted output text. Can be a plain string or an array of content parts (each with type and text fields). Arrays are normalized to a single concatenated string internally.

Example Request

JSON
{
  "model": "deepseek-r1-distill-llama-70b",
  "messages": [
    {"role": "user", "content": "Replace 'hello' with 'goodbye' in: hello world, hello there"}
  ],
  "prediction": {
    "type": "content",
    "content": "goodbye world, goodbye there"
  }
}

Response Token Details

When a prediction is provided, the response usage.completion_tokens_details includes prediction-specific token counts:

JSON
{
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 6,
    "total_tokens": 38,
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "accepted_prediction_tokens": 4,
      "rejected_prediction_tokens": 2
    }
  }
}
  • accepted_prediction_tokens -- Tokens from your prediction that the model verified and used. Higher values indicate a better prediction.
  • rejected_prediction_tokens -- Tokens from your prediction that the model discarded and regenerated. These still count toward billing.

Best Practices

  • Use predictions for code editing, document reformatting, and template-based generation where you can anticipate the output structure.
  • Accurate predictions reduce latency via parallel verification. Inaccurate predictions may be slower than no prediction at all.
  • Monitor accepted_prediction_tokens vs rejected_prediction_tokens to evaluate prediction quality.
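One way to monitor prediction quality as suggested above is an acceptance rate computed from the two counters. A minimal sketch over an illustrative completion_tokens_details object:

```python
# Sketch: evaluate prediction quality from completion_tokens_details.
# The details dict below is illustrative, matching the documented schema.

def prediction_acceptance(details):
    """Fraction of predicted tokens the model accepted.
    Returns None when no prediction tokens were counted."""
    accepted = details.get("accepted_prediction_tokens") or 0
    rejected = details.get("rejected_prediction_tokens") or 0
    total = accepted + rejected
    return accepted / total if total else None

details = {
    "reasoning_tokens": None,
    "accepted_prediction_tokens": 4,
    "rejected_prediction_tokens": 2,
}
rate = prediction_acceptance(details)
print(f"{rate:.2f}")  # 0.67
```

A rate that stays low over many requests suggests the prediction source (e.g., the pre-edit document) diverges too much from actual outputs to be worth sending.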

Tool Calling

Xerotier supports OpenAI-compatible function calling. Define tools in your request and the model may generate tool calls in its response.

Tool Definition

JSON
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        },
        "strict": false
      }
    }
  ],
  "tool_choice": "auto"
}

Tool Call Response

When the model decides to call a tool, the response includes tool_calls instead of text content:

JSON
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"Paris\"}"
          }
        }
      ]
    },
    "finish_reason": "tool_calls"
  }]
}

Tool Result Message

Send the tool result back using role tool with the matching tool_call_id:

JSON
{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_abc123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}
  ]
}
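The round trip above can be automated client-side: parse tool_calls from the assistant message, execute each function locally, and echo the results back as role tool messages with matching tool_call_id values. A sketch, assuming the response has already been parsed into a dict; get_weather here is a local stub standing in for a real lookup, not part of the API:

```python
import json

# Hypothetical local implementation of the tool; in a real client
# this would call an actual weather service.
def get_weather(location):
    return {"temperature": 18, "condition": "sunny"}

TOOLS = {"get_weather": get_weather}

def run_tool_calls(assistant_message):
    """Execute each tool call and build the role:"tool" follow-up
    messages, echoing the matching tool_call_id."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Assistant message from the tool call response example:
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
    }],
}
tool_messages = run_tool_calls(assistant_message)
print(tool_messages[0]["tool_call_id"])  # call_abc123
```

The original user message, the assistant message (with tool_calls intact), and these tool messages are then appended to messages for the follow-up request.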

Rate Limits

Every API response includes rate limit headers. Rate limits are enforced per service tier using a sliding window algorithm.

Rate Limit Response Headers

Both standard (draft IETF) and X- prefixed headers are returned for broad client compatibility:

Header Description
RateLimit-Limit / X-RateLimit-Limit Maximum requests allowed per window.
RateLimit-Remaining / X-RateLimit-Remaining Remaining requests in the current window.
RateLimit-Reset / X-RateLimit-Reset Seconds until the current window resets.
X-RateLimit-Warning Set to approaching_limit when remaining requests are below 20% of the limit. Not present otherwise.
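A client can reproduce the 20% warning threshold locally from the standard headers rather than waiting for X-RateLimit-Warning. A minimal sketch; the header values below are illustrative:

```python
# Sketch: client-side check mirroring the documented 20% warning
# threshold, using the X- prefixed headers.

def approaching_limit(headers):
    """True when remaining requests drop below 20% of the limit."""
    limit = int(headers["X-RateLimit-Limit"])
    remaining = int(headers["X-RateLimit-Remaining"])
    return remaining < 0.2 * limit

print(approaching_limit({"X-RateLimit-Limit": "60", "X-RateLimit-Remaining": "10"}))  # True
print(approaching_limit({"X-RateLimit-Limit": "60", "X-RateLimit-Remaining": "30"}))  # False
```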

Rate Limits by Tier

Tier Requests/Min Burst Capacity
Free 60 +30 (50%, min 3)
CPU AMD / CPU Intel 120 +60 (50%, min 10)
GPU NVIDIA / GPU AMD / GPU Intel 240 +120 (50%, min 10)
XIM Unlimited N/A

Burst capacity allows short-term traffic spikes above the base limit without immediately blocking requests.

429 Rate Limit Exceeded

When the rate limit is exceeded, the API returns a 429 Too Many Requests response with retry guidance:

JSON
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 15,
    "retry_strategy": {
      "type": "exponential_backoff",
      "initial_delay_ms": 15000,
      "max_delay_ms": 60000,
      "multiplier": 2,
      "jitter": true
    }
  }
}

The Retry-After header is also set. Clients should implement exponential backoff with jitter as described in the retry_strategy object.
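The retry_strategy object maps directly onto a delay schedule. A sketch that interprets jitter as full jitter (a random fraction of each computed delay), which is one common choice; the error body does not prescribe a specific jitter scheme:

```python
import random

# Sketch: build a retry delay schedule (in milliseconds) from the
# retry_strategy object in a 429 error body.

def backoff_delays(strategy, attempts):
    """Exponential backoff capped at max_delay_ms, with optional
    full jitter (random fraction of the computed delay)."""
    delay = strategy["initial_delay_ms"]
    out = []
    for _ in range(attempts):
        d = min(delay, strategy["max_delay_ms"])
        if strategy.get("jitter"):
            d = random.uniform(0, d)
        out.append(d)
        delay *= strategy["multiplier"]
    return out

# With jitter disabled the schedule is deterministic:
strategy = {"type": "exponential_backoff", "initial_delay_ms": 15000,
            "max_delay_ms": 60000, "multiplier": 2, "jitter": False}
print(backoff_delays(strategy, 4))  # [15000, 30000, 60000, 60000]
```

In practice a client would sleep for each delay between attempts and honor the Retry-After header as a floor for the first retry.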

Management API Rate Limits

Management API endpoints (batch, conversations, files, uploads, webhooks, and SLOs) are subject to a separate per-project rate limit. This limit applies uniformly across all management endpoints for a given project.

Setting Value
Default limit 60 requests per minute per project
Configuration Configured per-instance by the platform operator
Window Sliding 60-second window

Response headers on management API endpoints:

Header Description
X-RateLimit-Limit Maximum requests allowed per window.
X-RateLimit-Remaining Remaining requests in the current window.
X-RateLimit-Reset Seconds until the window resets.
Retry-After Seconds to wait before retrying (only on 429 responses).

When the limit is exceeded, the API returns HTTP 429 Too Many Requests. Inference endpoints (chat completions, embeddings, responses) have separate per-endpoint rate limits as described above and are not affected by the management API rate limit.

Request Timeouts

Request timeouts are determined by your endpoint's service tier. These are not client-configurable.

Request Deadline Timeout

The maximum time a request can wait for a response before being cancelled:

Tier Timeout
Free 30 seconds
CPU AMD / CPU Intel 300 seconds
GPU NVIDIA / GPU AMD / GPU Intel 300 seconds
XIM 1800 seconds

Idle Stream Timeout

For streaming requests, the maximum time between chunks before the stream is terminated:

Tier Idle Timeout
Free 120 seconds
CPU AMD / CPU Intel 600 seconds
GPU NVIDIA / GPU AMD / GPU Intel 600 seconds
XIM 3600 seconds

For requests with reasoning_effort set, the idle timeout uses the full request deadline timeout to accommodate long reasoning phases with no output.

Timeout Error Responses

For non-streaming requests, timeout returns HTTP 408. For streaming requests, an SSE error event is sent:

SSE
event: error
data: {"error": {"type": "timeout_error", "message": "Request timed out after 30s. Your Free tier has a 30-second timeout limit."}}

max_tokens Auto-Clamping

When max_tokens (or max_completion_tokens) exceeds the available context window for a request, Xerotier automatically clamps the value to the maximum available instead of rejecting the request with an error.

This is useful when clients hardcode a default max_tokens value (e.g., 32000) that may exceed the remaining context as conversations grow. Instead of returning a 400 error, the request succeeds with a reduced output limit.

How It Works

Clamping occurs at two levels for reliability:

  1. Router-level (heuristic): The router estimates input token count from message character length and clamps max_tokens if it would exceed the model's context window. This is a fast-path optimization that avoids an extra round trip to the inference engine.
  2. Agent-level (exact): If a request passes the router estimate but the inference engine rejects it with an exact token count error, the agent automatically retries once with the corrected value. This safety net uses precise tokenizer counts.

Detection

When clamping occurs, the response includes the X-Xerotier-Max-Tokens-Clamped header showing the original and clamped values:

HTTP Header
X-Xerotier-Max-Tokens-Clamped: 32000 -> 18943

Clients can inspect this header to detect that clamping occurred and adjust their max_tokens settings if desired. When no clamping is needed, the header is absent.
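Client-side detection can be a simple header parse. A minimal sketch, assuming the response headers are available as a dict:

```python
# Sketch: detect auto-clamping from the response header.
# Header format per the docs: "<original> -> <clamped>".

def parse_clamp_header(headers):
    """Return {"original": ..., "clamped": ...} when clamping
    occurred, or None when the header is absent."""
    value = headers.get("X-Xerotier-Max-Tokens-Clamped")
    if value is None:
        return None  # no clamping occurred
    original, clamped = (int(part.strip()) for part in value.split("->"))
    return {"original": original, "clamped": clamped}

print(parse_clamp_header({"X-Xerotier-Max-Tokens-Clamped": "32000 -> 18943"}))
# {'original': 32000, 'clamped': 18943}
print(parse_clamp_header({}))  # None
```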

Models

List and describe the models available through your Xerotier.ai endpoint.

List Models

GET /proj_ABC123/v1/models

Lists the currently available models and their metadata.

curl
curl https://api.xerotier.ai/proj_ABC123/v1/models \
  -H "Authorization: Bearer xero_myproject_your_api_key"

Response

{
  "object": "list",
  "data": [
    {
      "id": "deepseek-r1-distill-llama-70b",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    },
    {
      "id": "llama-3.1-8b-instruct",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    }
  ]
}

Retrieve Model

GET /proj_ABC123/v1/models/{model}

Retrieves a single model instance by ID, providing its metadata.

Endpoints

List inference endpoints configured for your project.

List Endpoints

GET /proj_ABC123/v1/endpoints

Returns all non-deleted endpoints for your project, including those in provisioning, suspended, or error states.

curl
curl https://api.xerotier.ai/proj_ABC123/v1/endpoints \
  -H "Authorization: Bearer xero_myproject_your_api_key"
Python
import requests

headers = {"Authorization": "Bearer xero_myproject_your_api_key"}
response = requests.get(
    "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
    headers=headers
)
for endpoint in response.json()["data"]:
    print(f"{endpoint['name']} ({endpoint['status']})")
Node.js
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
  { headers: { "Authorization": "Bearer xero_myproject_your_api_key" } }
);
const data = await response.json();
for (const endpoint of data.data) {
  console.log(`${endpoint.name} (${endpoint.status})`);
}

Response

{
  "object": "list",
  "data": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "object": "endpoint",
      "slug": "my-endpoint",
      "name": "My Endpoint",
      "model_id": "00000000-1111-0000-1111-000000000000",
      "model_name": "llama-3.1-8b-instruct",
      "tier_id": "free",
      "status": "active",
      "custom_domain": null,
      "max_requests_per_minute": 60,
      "max_tokens_per_minute": 100000,
      "provisioning_state": null,
      "provisioned_worker_id": null,
      "created": 1706123456
    }
  ]
}