Chat Completions
The full OpenAI surface area on a different base URL. Chat completions, embeddings, batch, files, responses, models, endpoints, all with the request and response shapes your SDK already speaks. Swap the URL and key; ship the same code.
Chat Completions
Creates a model response for the given chat conversation. This is the primary endpoint for interacting with language models.
Create Chat Completion
POST /:project_id/:endpoint_slug/v1/chat/completions
URL path: :project_id is your project's external ID (e.g., proj_ABC123) and
:endpoint_slug is the slug of the endpoint to route through (e.g., my-endpoint).
Your full base URL is https://api.xerotier.ai/proj_ABC123/<endpoint-slug>/v1.
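As a minimal sketch of how the path pieces above fit together, the helper below assembles the base URL and the chat completions URL from a project ID and endpoint slug. The function names are hypothetical and only illustrate the URL layout described here.

```python
API_HOST = "https://api.xerotier.ai"


def build_base_url(project_id: str, endpoint_slug: str, host: str = API_HOST) -> str:
    """Return the OpenAI-compatible base URL for one endpoint (hypothetical helper)."""
    if not project_id or not endpoint_slug:
        raise ValueError("project_id and endpoint_slug must be non-empty")
    return f"{host}/{project_id}/{endpoint_slug}/v1"


def chat_completions_url(project_id: str, endpoint_slug: str) -> str:
    """Return the full URL that POST requests for chat completions are sent to."""
    return build_base_url(project_id, endpoint_slug) + "/chat/completions"
```

The value returned by build_base_url is what you would pass as base_url to an OpenAI SDK client.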
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | ID of the model to use. The field is informational; the actual model served is determined by the endpoint configuration, so any non-empty string is accepted (the canonical form is the endpoint's configured model name). |
| messages (required) | array | A list of messages comprising the conversation so far. Supports roles: system, user, assistant, tool, developer. |
| max_tokens (optional) | integer | Maximum number of tokens to generate. When omitted, the backend uses the remaining context window as the limit. If this value exceeds the available context window, it is automatically clamped. See Auto-Clamping below. |
| max_completion_tokens (optional) | integer | Upper bound on the number of tokens to generate. Preferred over max_tokens. Same auto-clamping behavior applies. When both are set, this takes precedence. |
| temperature (optional) | number | Sampling temperature (0.0-2.0). Higher values make output more random. Default: 0.7 |
| top_p (optional) | number | Nucleus sampling parameter (0.0-1.0). Default: 0.9 |
| stream (optional) | boolean | If true, partial message deltas will be sent as Server-Sent Events. Default: false (omitted or null is treated as non-streaming). |
| stream_options (optional) | object | Options for streaming mode. Set {"include_usage": true} to include token usage in the final stream chunk. When omitted or set to false, the final chunk will not contain a usage field. Token usage is always tracked internally for billing regardless of this setting. |
| stop (optional) | string or array | Up to 4 sequences where the API will stop generating. |
| frequency_penalty (optional) | number | Penalty for repeated tokens (-2.0 to 2.0). Positive values decrease likelihood of repeating the same tokens. Default: 0 |
| presence_penalty (optional) | number | Penalty for tokens already in the context (-2.0 to 2.0). Positive values increase likelihood of new topics. Default: 0 |
| n (optional) | integer | Number of completions to generate. Default: 1. Only n=1 is supported in streaming mode. |
| seed (optional) | integer | Seed for deterministic sampling. When set, repeated requests with the same seed and parameters should return the same result. |
| logprobs (optional) | boolean | Return log probabilities of output tokens. Default: false |
| top_logprobs (optional) | integer | Number of most likely tokens to return log probabilities for (0-20). Requires logprobs: true. |
| logit_bias (optional) | object | Map of token IDs to bias values (-100 to 100). Modifies the likelihood of specified tokens appearing in the output. |
| tools (optional) | array | A list of tool definitions the model may call. See Tool Calling below. |
| tool_choice (optional) | string or object | Controls tool selection: "auto", "none", "required", or {"type": "function", "function": {"name": "fn_name"}} |
| parallel_tool_calls (optional) | boolean | Enable parallel tool calls. When true, the model may generate multiple tool calls in a single response. |
| response_format (optional) | object | Response format: {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}, "strict": true}} |
| metadata (optional) | object | Up to 16 key-value pairs for request metadata. Keys max 64 characters, values max 512 characters. |
| user (optional) | string | A unique identifier for the end user. Used for abuse monitoring and usage tracking. |
| store (optional) | boolean | Store the completion for later retrieval. Default: false |
| reasoning_effort (optional) | string | Reasoning effort level for reasoning models. Valid values: "low", "medium", "high". Controls how much reasoning the model applies before generating output. Invalid values return a 400 error. When set, the idle stream timeout uses the full request deadline to accommodate long reasoning phases. Note: On the chat-completions stream, intermediate reasoning text is not emitted in delta chunks (the SSE delta carries content, refusal, and tool_calls only); to receive incremental reasoning content use the Responses API. |
| service_tier (optional) | string | Requested service-tier hint. Routing is still bound to the endpoint's configured tier, but the value influences worker selection and billing: "flex" applies a -15 score adjustment (lower priority, no billing change), "priority" applies a +15 score adjustment and a 1.25x billing multiplier. "auto" and "default" are no-ops. The actual tier used is returned in the service_tier response field. See Service Tiers. |
| prediction (optional) | object | Predicted output content for speculative decoding. When the model can verify the prediction, generation is faster because tokens are validated in parallel rather than generated sequentially. The object must have "type": "content" and a "content" field (string or array of content parts). Token counts for accepted and rejected predictions appear in completion_tokens_details. See Predicted Outputs below. |
| modalities (optional) | array | Output modalities to generate. Currently only ["text"] is supported. Requests with unsupported modalities (e.g., "audio", "image") return a 400 error. |
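The documented ranges and limits can be checked on the client before a request is sent. The sketch below is one possible pre-flight check, not the server's validation logic: check_request is a hypothetical helper, it covers only the limits stated in the table above, and the API remains the authority on what is accepted.

```python
def check_request(body: dict) -> list:
    """Return a list of problems found in a request body; empty means none found."""
    problems = []

    def in_range(name, low, high):
        value = body.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    if not body.get("model"):
        problems.append("model must be a non-empty string")
    if not isinstance(body.get("messages"), list):
        problems.append("messages must be an array")

    in_range("temperature", 0.0, 2.0)
    in_range("top_p", 0.0, 1.0)
    in_range("frequency_penalty", -2.0, 2.0)
    in_range("presence_penalty", -2.0, 2.0)
    in_range("top_logprobs", 0, 20)

    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")
    if body.get("top_logprobs") is not None and not body.get("logprobs"):
        problems.append("top_logprobs requires logprobs: true")
    if body.get("stream") and (body.get("n") or 1) != 1:
        problems.append("only n=1 is supported in streaming mode")

    effort = body.get("reasoning_effort")
    if effort is not None and effort not in ("low", "medium", "high"):
        problems.append("reasoning_effort must be low, medium, or high")

    metadata = body.get("metadata") or {}
    if len(metadata) > 16:
        problems.append("metadata accepts at most 16 key-value pairs")
    for key, value in metadata.items():
        if len(key) > 64 or len(str(value)) > 512:
            problems.append(f"metadata entry {key!r} exceeds the 64/512 character limits")
    return problems
```

Running this before the call turns a 400 response into a local error message, which can be convenient in tests and CI.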
Optional Request Headers
| Header | Type | Description |
|---|---|---|
| X-SLO-TTFT-Ms (optional) | number | Target time-to-first-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
| X-SLO-TPOT-Ms (optional) | number | Target time-per-output-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
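A small sketch of how these headers might be assembled: slo_headers is a hypothetical helper that emits a header only for positive targets, mirroring the rule that invalid values are ignored.

```python
def slo_headers(ttft_ms=None, tpot_ms=None) -> dict:
    """Build the optional SLO headers, skipping targets that are missing or not positive."""
    headers = {}
    for name, value in (("X-SLO-TTFT-Ms", ttft_ms), ("X-SLO-TPOT-Ms", tpot_ms)):
        if value is not None and value > 0:
            headers[name] = str(value)
    return headers
```

With the OpenAI Python SDK, the resulting dict can be passed per request through the extra_headers argument of client.chat.completions.create.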
Response Headers
| Header | Description |
|---|---|
| X-Request-ID | Unique identifier for the request (matches the response body id field). Present on both streaming and non-streaming responses. Include this in support tickets for request tracing. |
| X-Xerotier-Worker-ID | Identifier of the worker that handled the request. Useful for correlating latency with routing decisions. |
| X-Xerotier-Max-Tokens-Clamped | Present only when max_tokens was automatically reduced to fit the available context window. Format: <original> -> <clamped> (e.g., 32000 -> 18943). See Auto-Clamping below. |
Message Object
| Parameter | Type | Description |
|---|---|---|
| role (required) | string | The role of the message author: system, user, assistant, tool, or developer |
| content (required) | string or array | The content of the message. Can be a string or an array of content parts. |
| name (optional) | string | An optional name for the participant. Useful for distinguishing between multiple users or assistants in the same conversation. |
| tool_call_id (optional) | string | Required when role is tool. The ID of the tool call this message responds to. |
| tool_calls (optional) | array | Tool calls generated by the model (present in assistant messages). |
Example Request
from openai import OpenAI
client = OpenAI(
base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
api_key="xero_myproject_your_api_key"
)
response = client.chat.completions.create(
model="deepseek-r1-distill-llama-70b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
max_tokens=100,
temperature=0.7
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
apiKey: "xero_myproject_your_api_key"
});
const response = await client.chat.completions.create({
model: "deepseek-r1-distill-llama-70b",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is the capital of France?" }
],
max_tokens: 100,
temperature: 0.7
});
console.log(response.choices[0].message.content);
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_myproject_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-70b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1706123456,
"model": "deepseek-r1-distill-llama-70b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris.",
"refusal": null,
"annotations": []
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": null
},
"completion_tokens_details": {
"reasoning_tokens": null,
"audio_tokens": null,
"accepted_prediction_tokens": null,
"rejected_prediction_tokens": null
}
},
"service_tier": "default",
"system_fingerprint": "fp_44709d6fcb"
}
Log Probabilities Response
When logprobs: true is set in the request, each choice includes a
logprobs object with per-token log probabilities. The top_logprobs
parameter controls how many alternative tokens are returned (0-20).
| Field | Type | Description |
|---|---|---|
| logprobs.content | array or null | Array of token log probability objects for each content token. Null when the model produces no content tokens (e.g., a pure refusal or tool call). |
| logprobs.content[].token | string | The token string. |
| logprobs.content[].logprob | float | Log probability of this token. 0.0 means 100% confidence; more negative values indicate lower confidence. |
| logprobs.content[].bytes | array or null | UTF-8 byte representation of the token. |
| logprobs.content[].top_logprobs | array | Top alternative tokens at this position, each with token, logprob, and bytes fields. Array length matches the top_logprobs request parameter. |
| logprobs.refusal | array or null | Array of token log probability objects for refusal tokens. Present when the model refuses to comply with a request. Each entry has the same structure as logprobs.content[] entries (token, logprob, bytes, top_logprobs). Null when the model does not refuse. |
Log Probabilities Request and Response Example
# Request with logprobs enabled
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_myproject_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-70b",
"messages": [{"role": "user", "content": "Is Paris the capital of France? Answer yes or no."}],
"logprobs": true,
"top_logprobs": 3,
"max_tokens": 5
}'
# Response (truncated)
{
"id": "chatcmpl-abc456",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Yes"
},
"finish_reason": "stop",
"logprobs": {
"content": [
{
"token": "Yes",
"logprob": -0.00012,
"bytes": [89, 101, 115],
"top_logprobs": [
{"token": "Yes", "logprob": -0.00012, "bytes": [89, 101, 115]},
{"token": "yes", "logprob": -9.08, "bytes": [121, 101, 115]},
{"token": "YES", "logprob": -12.31, "bytes": [89, 69, 83]}
]
}
],
"refusal": null
}
}
],
"usage": {"prompt_tokens": 18, "completion_tokens": 1, "total_tokens": 19}
}
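Log probabilities are natural logarithms, so exponentiating a logprob recovers the probability. The sketch below applies that to one choice from a response like the one above; token_probabilities is a hypothetical helper name.

```python
import math


def token_probabilities(choice: dict) -> list:
    """Return (token, probability) pairs for the content tokens of one choice."""
    logprobs = choice.get("logprobs") or {}
    entries = logprobs.get("content") or []  # content is null for pure refusals or tool calls
    return [(entry["token"], math.exp(entry["logprob"])) for entry in entries]
```

For the example response, the logprob of -0.00012 for "Yes" corresponds to a probability of roughly 0.9999.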
Usage Object
| Field | Type | Description |
|---|---|---|
| prompt_tokens | integer | Number of tokens in the input prompt. |
| completion_tokens | integer | Number of tokens in the generated output. |
| total_tokens | integer | Sum of prompt_tokens and completion_tokens. |
| prompt_tokens_details.cached_tokens | integer | Number of prompt tokens served from the prefix cache (KV cache reuse). A higher value indicates more cache hits, reducing time-to-first-token. 0 if no tokens were cached. |
| prompt_tokens_details.audio_tokens | integer or null | Audio input tokens. Null for text-only models. |
| completion_tokens_details.reasoning_tokens | integer or null | Tokens used for internal reasoning. Null for non-reasoning models. |
| completion_tokens_details.audio_tokens | integer or null | Audio output tokens. Null for text-only models. |
| completion_tokens_details.accepted_prediction_tokens | integer or null | Predicted tokens that appeared in the output. |
| completion_tokens_details.rejected_prediction_tokens | integer or null | Predicted tokens that did not appear in the output. |
Response Message Fields
The message object in each response choice contains the assistant's output.
In addition to role and content, the following fields may be present:
| Field | Type | Description |
|---|---|---|
| role | string | Always "assistant" in completion responses. |
| content | string or null | The generated text content. Null when the model produces only tool calls or a refusal. |
| refusal | string or null | A refusal message when the model declines to respond (content policy, safety filters). Null when the model does not refuse. When present, content is typically null. In streaming mode, refusal text is delivered incrementally via delta.refusal. |
| tool_calls | array or null | Tool calls generated by the model. Present when finish_reason is "tool_calls". See Tool Calling. |
| annotations | array | Message annotations such as URL citations. Defaults to an empty array. Reserved for future use with web search and citation features. |
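In streaming mode these fields arrive as deltas that the client has to reassemble. The sketch below assumes each chunk follows the OpenAI chat.completion.chunk shape (choices[].delta with content and refusal, plus an optional top-level usage on the final chunk); this page does not show a chunk, so treat that shape as an assumption. accumulate_chunks is a hypothetical helper that handles only the first choice.

```python
def accumulate_chunks(chunks) -> dict:
    """Merge already-parsed stream chunks into a single message-like dict."""
    content_parts, refusal_parts = [], []
    finish_reason, usage = None, None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]  # only sent when stream_options.include_usage is true
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue  # this sketch ignores additional choices
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            if delta.get("refusal"):
                refusal_parts.append(delta["refusal"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return {
        "role": "assistant",
        "content": "".join(content_parts) or None,
        "refusal": "".join(refusal_parts) or None,
        "finish_reason": finish_reason,
        "usage": usage,
    }
```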
Additional Response Fields
| Field | Type | Description |
|---|---|---|
| service_tier | string or null | The service tier used to process the request, as determined by the endpoint configuration. Values match the platform tier slugs: "free", "cpu", "gpu", or "self_hosted". Always present in both streaming and non-streaming responses. |
| system_fingerprint | string or null | Identifies the backend system configuration (model weights, quantization, GPU type) used for the request. Can be used alongside the seed parameter for reproducibility debugging. Present in both streaming chunks and non-streaming responses. Returns null when the backend does not provide a fingerprint. |
Predicted Outputs
The prediction parameter enables speculative decoding: you supply a
predicted output and the model verifies it in parallel rather than generating
each token sequentially. When the prediction matches, generation is significantly
faster. When it does not match, the model falls back to normal generation.
Prediction Object
| Field | Type | Description |
|---|---|---|
| type (required) | string | Must be "content". |
| content (required) | string or array | The predicted output text. Can be a plain string or an array of content parts (each with type and text fields). Arrays are normalized to a single concatenated string internally. |
Example Request
{
"model": "deepseek-r1-distill-llama-70b",
"messages": [
{"role": "user", "content": "Replace 'hello' with 'goodbye' in: hello world, hello there"}
],
"prediction": {
"type": "content",
"content": "goodbye world, goodbye there"
}
}
Response Token Details
When a prediction is provided, the response usage.completion_tokens_details
includes prediction-specific token counts:
{
"usage": {
"prompt_tokens": 32,
"completion_tokens": 6,
"total_tokens": 38,
"completion_tokens_details": {
"reasoning_tokens": null,
"accepted_prediction_tokens": 4,
"rejected_prediction_tokens": 2
}
}
}
- accepted_prediction_tokens: Tokens from your prediction that the model verified and used. Higher values indicate a better prediction.
- rejected_prediction_tokens: Tokens from your prediction that the model discarded and regenerated. These still count toward billing.
Best Practices
- Use predictions for code editing, document reformatting, and template-based generation where you can anticipate the output structure.
- Accurate predictions reduce latency via parallel verification. Inaccurate predictions may be slower than no prediction at all.
- Monitor accepted_prediction_tokens vs rejected_prediction_tokens to evaluate prediction quality.
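One way to monitor prediction quality is to compute the share of predicted tokens that were accepted. The sketch below reads the two counters from a usage object; prediction_acceptance_rate is a hypothetical helper, and the ratio itself is an illustrative metric rather than something the API returns.

```python
def prediction_acceptance_rate(usage: dict):
    """Return accepted / (accepted + rejected), or None when no prediction data is present."""
    details = usage.get("completion_tokens_details") or {}
    accepted = details.get("accepted_prediction_tokens")
    rejected = details.get("rejected_prediction_tokens")
    if accepted is None or rejected is None:
        return None  # both counters are null when no prediction was supplied
    total = accepted + rejected
    return accepted / total if total else None
```

For the usage example above (4 accepted, 2 rejected) the rate is about 0.67. A persistently low rate suggests the prediction is costing more than it saves.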
Tool Calling
Xerotier supports OpenAI-compatible function calling. Define tools in your request and the model may generate tool calls in its response.
Tool Definition
{
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
}
},
"required": ["location"]
},
"strict": false
}
}
],
"tool_choice": "auto"
}
Tool Call Response
When the model decides to call a tool, the response includes tool_calls instead of text content:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"Paris\"}"
}
}
]
},
"finish_reason": "tool_calls"
}]
}
Tool Result Message
Send the tool result back using role tool with the matching tool_call_id:
{
"messages": [
{"role": "user", "content": "What is the weather in Paris?"},
{"role": "assistant", "content": null, "tool_calls": [
{"id": "call_abc123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}
]},
{"role": "tool", "tool_call_id": "call_abc123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}
]
}
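The round trip above can be sketched as a small dispatcher that runs each requested tool locally and produces the matching role tool messages. Everything here is illustrative: get_weather is a stub with hard-coded data, and TOOL_REGISTRY and run_tool_calls are hypothetical names.

```python
import json


def get_weather(location: str) -> dict:
    """Stub tool; a real implementation would query a weather service."""
    return {"temperature": 18, "condition": "sunny"}


# Maps the function name declared in the tools array to a local callable.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each tool call and return the tool messages to append to the conversation."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        arguments = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        tool = TOOL_REGISTRY.get(name)
        output = tool(**arguments) if tool else {"error": f"unknown tool: {name}"}
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the id from the assistant message
            "content": json.dumps(output),
        })
    return results
```

Append the assistant message and the returned tool messages to messages, then send the conversation again to get the final answer.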
Rate Limits
Every API response includes rate limit headers. Rate limits are enforced per service tier using a sliding window algorithm.
Rate Limit Response Headers
Both standard (draft IETF) and X- prefixed headers are returned for broad client compatibility:
| Header | Description |
|---|---|
| RateLimit-Limit, X-RateLimit-Limit | Maximum requests allowed per window. |
| RateLimit-Remaining, X-RateLimit-Remaining | Remaining requests in the current window. |
| RateLimit-Reset, X-RateLimit-Reset | Seconds until the current window resets. |
| X-RateLimit-Warning | Set to approaching_limit when remaining requests are below 20% of the limit. Not present otherwise. |
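A client can read these headers to slow down before it hits a 429. The sketch below is a hypothetical helper that accepts either header spelling; it takes a plain dict, so header names must match the casing shown here (HTTP libraries usually provide a case-insensitive mapping instead).

```python
def rate_limit_status(headers: dict) -> dict:
    """Extract rate limit information from response headers."""

    def read_int(suffix):
        # Prefer the draft IETF name, fall back to the X- prefixed variant.
        value = headers.get(f"RateLimit-{suffix}", headers.get(f"X-RateLimit-{suffix}"))
        return int(value) if value is not None else None

    return {
        "limit": read_int("Limit"),
        "remaining": read_int("Remaining"),
        "reset_seconds": read_int("Reset"),
        "approaching_limit": headers.get("X-RateLimit-Warning") == "approaching_limit",
    }
```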
Rate Limits by Tier
| Tier | Requests/Min | Burst Capacity |
|---|---|---|
| free | 64 | +32 (50%, min 3) |
| cpu | 128 | +64 (50%, min 10) |
| gpu | 256 | +128 (50%, min 10) |
| self_hosted | Unlimited | N/A |
Burst capacity allows short-term traffic spikes above the base limit without immediately blocking requests.
429 Rate Limit Exceeded
When the rate limit is exceeded, the API returns a 429 Too Many Requests response with retry guidance:
{
"error": {
"message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
"type": "rate_limit_error",
"code": "rate_limit_exceeded",
"retry_after": 15,
"retry_strategy": {
"type": "exponential_backoff",
"initial_delay_ms": 15000,
"max_delay_ms": 60000,
"multiplier": 2,
"jitter": true
}
}
}
The Retry-After header is also set. Clients should implement exponential backoff with jitter as described in the retry_strategy object.
Error envelope type and code values shown here illustrate
intent; the canonical list of error types and their HTTP status mapping is
documented on the Errors page. Treat that page as the
source of truth if a value differs.
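The retry_strategy object in the 429 response can drive a retry schedule directly. The sketch below computes the wait before each attempt from its fields. How jitter is applied is not specified on this page, so adding up to 10% on top of each delay is an assumption, chosen so a retry never fires earlier than the advertised delay.

```python
import random


def backoff_delays_ms(strategy: dict, attempts: int,
                      jitter_fraction: float = 0.1, rng=random.random) -> list:
    """Return the delay in milliseconds to wait before each retry attempt."""
    delays = []
    base = float(strategy["initial_delay_ms"])
    for _ in range(attempts):
        delay = min(base, float(strategy["max_delay_ms"]))  # cap at max_delay_ms
        if strategy.get("jitter"):
            # Assumed jitter policy: add 0 to jitter_fraction of the delay.
            delay += delay * jitter_fraction * rng()
        delays.append(delay)
        base *= strategy["multiplier"]
    return delays
```

With the example strategy and jitter disabled, four attempts wait 15, 30, 60, and 60 seconds. Between attempts, call time.sleep(delay / 1000).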
Management API Rate Limits
Management API endpoints (batch, conversations, files, uploads, webhooks, and SLOs) are subject to a separate per-project rate limit. This limit applies uniformly across all management endpoints for a given project.
| Setting | Value |
|---|---|
| Default limit | 60 requests per minute per project |
| Configuration | Fixed limit; not configurable per project |
| Window | Sliding 60-second window |
Response headers on management API endpoints:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per window. |
| X-RateLimit-Remaining | Remaining requests in the current window. |
| X-RateLimit-Reset | Seconds until the window resets. |
| Retry-After | Seconds to wait before retrying (only on 429 responses). |
When the limit is exceeded, the API returns HTTP 429 Too Many Requests. Inference endpoints (chat completions, embeddings, responses) have separate per-endpoint rate limits as described above and are not affected by the management API rate limit.
Request Timeouts
Request timeouts are determined by your endpoint's service tier. These are not client-configurable.
Request Deadline Timeout
The maximum time a request can wait for a response before being cancelled:
| Tier | Timeout |
|---|---|
| free | 30 seconds |
| cpu | 300 seconds |
| gpu | 300 seconds |
| self_hosted | 1800 seconds |
Idle Stream Timeout
For streaming requests, the maximum time between chunks before the stream is terminated:
| Tier | Idle Timeout |
|---|---|
| free | 120 seconds |
| cpu | 600 seconds |
| gpu | 600 seconds |
| self_hosted | 3600 seconds |
For requests with reasoning_effort set, the idle timeout uses the full request deadline timeout to accommodate long reasoning phases with no output.
Timeout Error Responses
For non-streaming requests, a timeout returns HTTP 408. For streaming requests, an SSE error event is sent instead:
event: error
data: {"error": {"type": "timeout_error", "message": "Request timed out after 30s. Your free tier has a 30-second timeout limit."}}
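A streaming client therefore has to watch for event: error between data chunks. The sketch below is a minimal Server-Sent Events reader over text lines that raises when it sees the timeout event shown above. StreamTimeout and both function names are hypothetical, and the [DONE] sentinel is the OpenAI streaming convention, assumed here because this page does not mention it.

```python
import json


class StreamTimeout(Exception):
    """Raised when the stream reports a timeout_error event."""


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # SSE comment or keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:
        yield event, "\n".join(data_lines)


def read_stream(lines) -> list:
    """Collect parsed chunks, raising on error events."""
    chunks = []
    for event, data in parse_sse(lines):
        if event == "error":
            error = json.loads(data)["error"]
            if error.get("type") == "timeout_error":
                raise StreamTimeout(error.get("message", "request timed out"))
            raise RuntimeError(error.get("message", "stream error"))
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunks.append(json.loads(data))
    return chunks
```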
max_tokens Auto-Clamping
When max_tokens (or max_completion_tokens) exceeds the
available context window for a request, Xerotier automatically clamps the value
to the maximum available instead of rejecting the request with an error.
This is useful when clients hardcode a default max_tokens value
(e.g., 32000) that may exceed the remaining context as conversations grow.
Instead of returning a 400 error, the request succeeds with a reduced output
limit.
How It Works
Clamping occurs at two levels for reliability:
- Router-level (heuristic): The router estimates the input token count from message character length and clamps max_tokens if it would exceed the model's context window. This is a fast-path optimization that avoids an extra round trip to the inference engine.
- Agent-level (exact): If a request passes the router estimate but the inference engine rejects it with an exact token count error, the agent automatically retries once with the corrected value. This safety net uses precise tokenizer counts.
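The router-level step can be pictured roughly as follows. This is an illustrative sketch, not Xerotier's implementation: the ratio of 4 characters per token is an assumed heuristic, and both function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: a common rough ratio, not a documented value


def estimate_prompt_tokens(messages: list) -> int:
    """Estimate input tokens from the character length of string message content."""
    chars = sum(len(m["content"]) for m in messages if isinstance(m.get("content"), str))
    return -(-chars // CHARS_PER_TOKEN)  # ceiling division


def clamp_max_tokens(requested: int, messages: list, context_window: int):
    """Return (effective_max_tokens, header_value); header_value is None when no clamp occurred."""
    # An over-full context is not handled specially here; the result is simply 0.
    available = max(context_window - estimate_prompt_tokens(messages), 0)
    clamped = min(requested, available)
    header = f"{requested} -> {clamped}" if clamped < requested else None
    return clamped, header
```

Because the estimate is approximate, it can pass a request that the inference engine then rejects, which is the case the agent-level retry covers.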
Detection
When clamping occurs, the response includes the
X-Xerotier-Max-Tokens-Clamped header showing the original and
clamped values:
X-Xerotier-Max-Tokens-Clamped: 32000 -> 18943
Clients can inspect this header to detect that clamping occurred and adjust
their max_tokens settings if desired. When no clamping is needed,
the header is absent.
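Reading the header on the client is a matter of parsing the <original> -> <clamped> format. A minimal sketch, with parse_clamped_header as a hypothetical helper name:

```python
import re

_CLAMP_PATTERN = re.compile(r"^\s*(\d+)\s*->\s*(\d+)\s*$")


def parse_clamped_header(value):
    """Return (original, clamped) as integers, or None if the header is absent or malformed."""
    if value is None:
        return None
    match = _CLAMP_PATTERN.match(value)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For example, pass in the value of X-Xerotier-Max-Tokens-Clamped from the response headers and log a warning whenever the result is not None.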
Models
List and describe the models available through your Xerotier.ai endpoint.
List Models
GET /proj_ABC123/v1/models
Lists the currently available models and their metadata.
curl https://api.xerotier.ai/proj_ABC123/v1/models \
-H "Authorization: Bearer xero_myproject_your_api_key"
Response
{
"object": "list",
"data": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"object": "model",
"created": 1706000000,
"owned_by": "my-project"
},
{
"id": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
"object": "model",
"created": 1706000000,
"owned_by": "my-project"
}
]
}
Retrieve Model
GET /proj_ABC123/v1/models/{model}
Retrieves a model instance, providing information about the model.
Reranking and Scoring
Reranking endpoints (/v1/rerank, /v1/score) are documented on the
Reranking page. The /v1/score endpoint scores
a query against one or more candidate documents using the endpoint's configured
reranker model.
Endpoints
List inference endpoints configured for your project.
List Endpoints
GET /proj_ABC123/v1/endpoints
Returns all non-deleted endpoints for your project, including those in provisioning, suspended, or error states.
curl https://api.xerotier.ai/proj_ABC123/v1/endpoints \
-H "Authorization: Bearer xero_myproject_your_api_key"
import requests
headers = {"Authorization": "Bearer xero_myproject_your_api_key"}
response = requests.get(
"https://api.xerotier.ai/proj_ABC123/v1/endpoints",
headers=headers
)
for endpoint in response.json()["data"]:
    print(f"{endpoint['name']} ({endpoint['status']})")
const response = await fetch(
"https://api.xerotier.ai/proj_ABC123/v1/endpoints",
{
headers: {
"Authorization": "Bearer xero_myproject_your_api_key"
}
}
);
const data = await response.json();
for (const endpoint of data.data) {
console.log(`${endpoint.name} (${endpoint.status})`);
}
Response
{
"object": "list",
"data": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"object": "endpoint",
"slug": "my-endpoint",
"name": "My Endpoint",
"model_id": "00000000-1111-0000-1111-000000000000",
"model_name": "llama-3.1-8b-instruct",
"tier_id": "free",
"status": "active",
"custom_domain": null,
"max_requests_per_minute": 60,
"max_tokens_per_minute": 100000,
"provisioning_state": null,
"provisioned_worker_id": null,
"created": 1706123456
}
]
}
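Since the listing includes endpoints in provisioning, suspended, or error states, a client will usually filter on status before routing traffic. The sketch below maps each active endpoint's slug to its base URL; active_endpoint_urls is a hypothetical helper that operates on an already-parsed response like the one above.

```python
def active_endpoint_urls(listing: dict, project_id: str,
                         host: str = "https://api.xerotier.ai") -> dict:
    """Return {slug: base_url} for every endpoint whose status is "active"."""
    return {
        endpoint["slug"]: f"{host}/{project_id}/{endpoint['slug']}/v1"
        for endpoint in listing.get("data", [])
        if endpoint.get("status") == "active"
    }
```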