API Reference
Core API endpoints for chat completions and model information. The Xerotier.ai API is fully compatible with the OpenAI API specification.
Chat Completions
Creates a model response for the given chat conversation. This is the primary endpoint for interacting with language models.
Create Chat Completion
POST /:project_id/:endpoint_slug/v1/chat/completions
URL path: :project_id is your project's external ID (e.g., proj_ABC123) and
:endpoint_slug is the slug of the endpoint to route through (e.g., my-endpoint).
Your full base URL is https://api.xerotier.ai/proj_ABC123/<endpoint-slug>/v1.
Note on service tier: The service tier is determined by your endpoint configuration, not by the request body. There is no client-side tier override. The tier controls which accelerator types and workers can serve requests. See the Service Tiers documentation for full details.
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | ID of the model to use (e.g., "deepseek-r1-distill-llama-70b"). This field is informational; the actual model used is determined by the endpoint configuration. |
| messages (required) | array | A list of messages comprising the conversation so far. Supported roles: system, user, assistant, tool, developer. |
| max_tokens (optional) | integer | Maximum number of tokens to generate. When omitted, the backend uses the remaining context window as the limit. If this value exceeds the available context window, it is automatically clamped. See Auto-Clamping below. |
| max_completion_tokens (optional) | integer | Upper bound on the number of tokens to generate. Preferred over max_tokens. The same auto-clamping behavior applies. When both are set, this field takes precedence. |
| temperature (optional) | number | Sampling temperature (0.0-2.0). Higher values make output more random. Default: 0.7 |
| top_p (optional) | number | Nucleus sampling parameter (0.0-1.0). Default: 0.9 |
| stream (optional) | boolean | If true, partial message deltas are sent as Server-Sent Events. Default: true |
| stream_options (optional) | object | Options for streaming mode. Set {"include_usage": true} to include token usage in the final stream chunk. When omitted, or when include_usage is false, the final chunk does not contain a usage field. Token usage is always tracked internally for billing regardless of this setting. |
| stop (optional) | string \| array | Up to 4 sequences where the API will stop generating. |
| frequency_penalty (optional) | number | Penalty for repeated tokens (-2.0 to 2.0). Positive values decrease the likelihood of repeating the same tokens. Default: 0 |
| presence_penalty (optional) | number | Penalty for tokens already in the context (-2.0 to 2.0). Positive values increase the likelihood of new topics. Default: 0 |
| n (optional) | integer | Number of completions to generate. Default: 1. Only n=1 is supported in streaming mode. |
| seed (optional) | integer | Seed for deterministic sampling. When set, repeated requests with the same seed and parameters should return the same result. |
| logprobs (optional) | boolean | Return log probabilities of output tokens. Default: false |
| top_logprobs (optional) | integer | Number of most likely tokens to return log probabilities for (0-20). Requires logprobs: true. |
| logit_bias (optional) | object | Map of token IDs to bias values (-100 to 100). Modifies the likelihood of the specified tokens appearing in the output. |
| tools (optional) | array | A list of tool definitions the model may call. See Tool Calling below. |
| tool_choice (optional) | string \| object | Controls tool selection: "auto", "none", "required", or {"type": "function", "function": {"name": "fn_name"}} |
| parallel_tool_calls (optional) | boolean | Enable parallel tool calls. When true, the model may generate multiple tool calls in a single response. |
| response_format (optional) | object | Response format: {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}, "strict": true}} |
| metadata (optional) | object | Up to 16 key-value pairs of request metadata. Keys are limited to 64 characters, values to 512 characters. |
| user (optional) | string | A unique identifier for the end user. Used for abuse monitoring and usage tracking. |
| store (optional) | boolean | Store the completion for later retrieval. Default: false |
| reasoning_effort (optional) | string | Reasoning effort level for reasoning models (e.g., o1). Valid values: "low", "medium", "high". Controls how much reasoning the model applies before generating output. Invalid values return a 400 error. When set, the idle stream timeout uses the full request deadline to accommodate long reasoning phases. |
| service_tier (optional) | string | Requested service tier. Accepted for OpenAI API compatibility but has no effect on routing; the endpoint's configured tier is always used. The actual tier is returned in the service_tier response field. |
| prediction (optional) | object | Predicted output content for speculative decoding. When the model can verify the prediction, generation is faster because tokens are validated in parallel rather than generated sequentially. The object must have "type": "content" and a "content" field (string or array of content parts). Token counts for accepted and rejected predictions appear in completion_tokens_details. See Predicted Outputs below. |
| modalities (optional) | array | Output modalities to generate. Currently only ["text"] is supported. Requests with unsupported modalities (e.g., "audio", "image") return a 400 error. |
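Streaming responses interleave content deltas with, optionally, a final usage chunk (when stream_options sets "include_usage": true). A minimal sketch of the client-side accumulation logic, operating on plain dicts shaped like the streamed chunks; the sample chunks below are illustrative, not captured API output:

```python
# Sketch: fold streamed chat-completion chunks into final text plus usage.
# Chunks are plain dicts shaped like the SSE payloads; the sample data
# below is illustrative, not real API output.

def fold_stream(chunks):
    """Accumulate delta content and capture the usage object, if any."""
    text_parts, usage = [], None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):  # present only when include_usage is set
            usage = chunk["usage"]
    return "".join(text_parts), usage

chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Paris"}}]},
    {"choices": [{"delta": {"content": " is the capital."}}]},
    {"choices": [], "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}},
]
text, usage = fold_stream(chunks)
print(text)                    # Paris is the capital.
print(usage["total_tokens"])   # 33
```

With the OpenAI SDKs the same folding applies to the chunk objects yielded when stream=True.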
Optional Request Headers
| Header | Type | Description |
|---|---|---|
| X-SLO-TTFT-Ms (optional) | number | Target time-to-first-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
| X-SLO-TPOT-Ms (optional) | number | Target time-per-output-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
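Because the router silently ignores non-positive or non-numeric values, a client can mirror that rule before sending. A hedged sketch; the slo_headers helper is ours, and with the OpenAI Python SDK its result can be passed through the create() call's extra_headers argument:

```python
# Sketch: build the optional SLO routing headers, dropping invalid values
# the same way the router would ignore them (non-positive or non-numeric).

def slo_headers(ttft_ms=None, tpot_ms=None):
    headers = {}
    if isinstance(ttft_ms, (int, float)) and ttft_ms > 0:
        headers["X-SLO-TTFT-Ms"] = str(ttft_ms)
    if isinstance(tpot_ms, (int, float)) and tpot_ms > 0:
        headers["X-SLO-TPOT-Ms"] = str(tpot_ms)
    return headers

print(slo_headers(ttft_ms=500, tpot_ms=50))
# {'X-SLO-TTFT-Ms': '500', 'X-SLO-TPOT-Ms': '50'}
print(slo_headers(ttft_ms=-1))  # {} (invalid value dropped)

# Usage with the OpenAI Python SDK (sketch):
# client.chat.completions.create(..., extra_headers=slo_headers(ttft_ms=500))
```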
Response Headers
| Header | Description |
|---|---|
| X-Request-ID | Unique identifier for the request (matches the response body id field). Present on both streaming and non-streaming responses. Include this in support tickets for request tracing. |
| X-Xerotier-Worker-ID | Identifier of the worker that handled the request. Useful for correlating latency with routing decisions. |
| X-Xerotier-Max-Tokens-Clamped | Present only when max_tokens was automatically reduced to fit the available context window. Format: <original> -> <clamped> (e.g., 32000 -> 18943). See Auto-Clamping below. |
Message Object
| Parameter | Type | Description |
|---|---|---|
| role (required) | string | The role of the message author: system, user, assistant, tool, or developer. |
| content (required) | string \| array | The content of the message. Can be a string or an array of content parts. |
| name (optional) | string | An optional name for the participant. Useful for distinguishing between multiple users or assistants in the same conversation. |
| tool_call_id (optional) | string | Required when role is tool. The ID of the tool call this message responds to. |
| tool_calls (optional) | array | Tool calls generated by the model (present in assistant messages). |
Example Request
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 100,
  temperature: 0.7
});

console.log(response.choices[0].message.content);
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706123456,
  "model": "deepseek-r1-distill-llama-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "refusal": null,
        "annotations": []
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_44709d6fcb"
}
Log Probabilities Response
When logprobs: true is set in the request, each choice includes a
logprobs object with per-token log probabilities. The top_logprobs
parameter controls how many alternative tokens are returned (0-20).
| Field | Type | Description |
|---|---|---|
| logprobs.content | array \| null | Array of token log probability objects for each content token. Null when the model produces no content tokens (e.g., a pure refusal or tool call). |
| logprobs.content[].token | string | The token string. |
| logprobs.content[].logprob | float | Log probability of this token. 0.0 means 100% confidence; more negative values indicate lower confidence. |
| logprobs.content[].bytes | array \| null | UTF-8 byte representation of the token. |
| logprobs.content[].top_logprobs | array | Top alternative tokens at this position, each with token, logprob, and bytes fields. Array length matches the top_logprobs request parameter. |
| logprobs.refusal | array \| null | Array of token log probability objects for refusal tokens. Present when the model refuses to comply with a request. Each entry has the same structure as logprobs.content[] entries (token, logprob, bytes, top_logprobs). Null when the model does not refuse. |
Log Probabilities Request and Response Example
# Request with logprobs enabled
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is Paris the capital of France? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'
# Response (truncated)
{
  "id": "chatcmpl-abc456",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes"
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [
          {
            "token": "Yes",
            "logprob": -0.00012,
            "bytes": [89, 101, 115],
            "top_logprobs": [
              {"token": "Yes", "logprob": -0.00012, "bytes": [89, 101, 115]},
              {"token": "yes", "logprob": -9.08, "bytes": [121, 101, 115]},
              {"token": "YES", "logprob": -12.31, "bytes": [89, 69, 83]}
            ]
          }
        ],
        "refusal": null
      }
    }
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 1, "total_tokens": 19}
}
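Logprobs are natural logarithms, so exp() recovers the plain probability. For the response above, the chosen token is effectively certain:

```python
import math

# Convert a token logprob (natural log) back to a probability.
def to_probability(logprob):
    return math.exp(logprob)

print(round(to_probability(-0.00012), 5))  # 0.99988 for "Yes"
print(round(to_probability(-9.08), 6))     # 0.000114 for the "yes" alternative
```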
Usage Object
| Field | Type | Description |
|---|---|---|
| prompt_tokens | integer | Number of tokens in the input prompt. |
| completion_tokens | integer | Number of tokens in the generated output. |
| total_tokens | integer | Sum of prompt_tokens and completion_tokens. |
| prompt_tokens_details.cached_tokens | integer | Number of prompt tokens served from the prefix cache (KV cache reuse). A higher value indicates more cache hits, reducing time-to-first-token. 0 if no tokens were cached. |
| prompt_tokens_details.audio_tokens | integer \| null | Audio input tokens. Null for text-only models. |
| completion_tokens_details.reasoning_tokens | integer \| null | Tokens used for internal reasoning. Null for non-reasoning models. |
| completion_tokens_details.audio_tokens | integer \| null | Audio output tokens. Null for text-only models. |
| completion_tokens_details.accepted_prediction_tokens | integer \| null | Predicted tokens that appeared in the output. |
| completion_tokens_details.rejected_prediction_tokens | integer \| null | Predicted tokens that did not appear in the output. |
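A practical use of prompt_tokens_details.cached_tokens is monitoring prefix-cache effectiveness. A small sketch over a usage dict shaped like the table above (the helper name is ours):

```python
# Sketch: compute the prefix-cache hit rate from a usage object
# (a dict shaped like the Usage Object table).

def cache_hit_rate(usage):
    prompt = usage.get("prompt_tokens", 0)
    if prompt == 0:
        return 0.0
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0) or 0
    return cached / prompt

usage = {
    "prompt_tokens": 400,
    "completion_tokens": 50,
    "total_tokens": 450,
    "prompt_tokens_details": {"cached_tokens": 300, "audio_tokens": None},
}
print(cache_hit_rate(usage))  # 0.75
```

Higher values mean more KV-cache reuse and, per the table above, a lower time-to-first-token.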
Response Message Fields
The message object in each response choice contains the assistant's output.
In addition to role and content, the following fields may be present:
| Field | Type | Description |
|---|---|---|
| role | string | Always "assistant" in completion responses. |
| content | string \| null | The generated text content. Null when the model produces only tool calls or a refusal. |
| refusal | string \| null | A refusal message when the model declines to respond (content policy, safety filters). Null when the model does not refuse. When present, content is typically null. In streaming mode, refusal text is delivered incrementally via delta.refusal. |
| tool_calls | array \| null | Tool calls generated by the model. Present when finish_reason is "tool_calls". See Tool Calling. |
| annotations | array | Message annotations such as URL citations. Defaults to an empty array. Reserved for future use with web search and citation features. |
Additional Response Fields
| Field | Type | Description |
|---|---|---|
| service_tier | string \| null | The service tier used to process the request, as determined by the endpoint configuration (e.g., "free", "gpu_nvidia_shared"). Always present in both streaming and non-streaming responses. |
| system_fingerprint | string \| null | Identifies the backend system configuration (model weights, quantization, GPU type) used for the request. Can be used alongside the seed parameter for reproducibility debugging. Present in both streaming chunks and non-streaming responses. Returns null when the backend does not provide a fingerprint. |
Predicted Outputs
The prediction parameter enables speculative decoding: you supply a
predicted output and the model verifies it in parallel rather than generating
each token sequentially. When the prediction matches, generation is significantly
faster. When it does not match, the model falls back to normal generation.
Prediction Object
| Field | Type | Description |
|---|---|---|
| type (required) | string | Must be "content". |
| content (required) | string \| array | The predicted output text. Can be a plain string or an array of content parts (each with type and text fields). Arrays are normalized to a single concatenated string internally. |
Example Request
{
  "model": "deepseek-r1-distill-llama-70b",
  "messages": [
    {"role": "user", "content": "Replace 'hello' with 'goodbye' in: hello world, hello there"}
  ],
  "prediction": {
    "type": "content",
    "content": "goodbye world, goodbye there"
  }
}
Response Token Details
When a prediction is provided, the response usage.completion_tokens_details
includes prediction-specific token counts:
{
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 6,
    "total_tokens": 38,
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "accepted_prediction_tokens": 4,
      "rejected_prediction_tokens": 2
    }
  }
}
- accepted_prediction_tokens: Tokens from your prediction that the model verified and used. Higher values indicate a better prediction.
- rejected_prediction_tokens: Tokens from your prediction that the model discarded and regenerated. These still count toward billing.
Best Practices
- Use predictions for code editing, document reformatting, and template-based generation where you can anticipate the output structure.
- Accurate predictions reduce latency via parallel verification. Inaccurate predictions may be slower than no prediction at all.
- Monitor accepted_prediction_tokens vs. rejected_prediction_tokens to evaluate prediction quality.
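That monitoring reduces to a small calculation over completion_tokens_details. A sketch (the helper name is ours):

```python
# Sketch: evaluate prediction quality from completion_tokens_details.

def prediction_acceptance_rate(details):
    accepted = details.get("accepted_prediction_tokens") or 0
    rejected = details.get("rejected_prediction_tokens") or 0
    total = accepted + rejected
    return accepted / total if total else None  # None: no prediction was sent

details = {
    "reasoning_tokens": None,
    "accepted_prediction_tokens": 4,
    "rejected_prediction_tokens": 2,
}
print(prediction_acceptance_rate(details))  # 0.6666666666666666
```

A rate near 1.0 means the prediction is paying off; a low rate suggests the prediction may be slowing requests down and costing rejected tokens.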
Tool Calling
Xerotier supports OpenAI-compatible function calling. Define tools in your request and the model may generate tool calls in its response.
Tool Definition
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City name"
            }
          },
          "required": ["location"]
        },
        "strict": false
      }
    }
  ],
  "tool_choice": "auto"
}
Tool Call Response
When the model decides to call a tool, the response includes tool_calls instead of text content:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"Paris\"}"
          }
        }
      ]
    },
    "finish_reason": "tool_calls"
  }]
}
Tool Result Message
Send the tool result back using role tool with the matching tool_call_id:
{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_abc123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}
  ]
}
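The full round trip can be driven by a small dispatch loop: parse each call's JSON-encoded arguments, invoke the matching local function, and append a role: tool message carrying the same tool_call_id. A sketch with a stubbed get_weather; the stub's return value is illustrative:

```python
import json

# Local implementations of the tools offered to the model (stubbed here).
def get_weather(location):
    return {"temperature": 18, "condition": "sunny"}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Turn assistant tool_calls into role-"tool" result messages."""
    messages = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])          # arguments arrive as a JSON string
        result = TOOL_REGISTRY[fn["name"]](**args)  # dispatch to the local function
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],             # must match the call id
            "content": json.dumps(result),
        })
    return messages

tool_calls = [{
    "id": "call_abc123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
}]
print(run_tool_calls(tool_calls))
```

Append the returned messages to the conversation and call the completions endpoint again to let the model produce its final answer.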
Rate Limits
Every API response includes rate limit headers. Rate limits are enforced per service tier using a sliding window algorithm.
Rate Limit Response Headers
Both standard (draft IETF) and X- prefixed headers are returned for broad client compatibility:
| Header | Description |
|---|---|
| RateLimit-Limit / X-RateLimit-Limit | Maximum requests allowed per window. |
| RateLimit-Remaining / X-RateLimit-Remaining | Remaining requests in the current window. |
| RateLimit-Reset / X-RateLimit-Reset | Seconds until the current window resets. |
| X-RateLimit-Warning | Set to approaching_limit when remaining requests are below 20% of the limit. Not present otherwise. |
Rate Limits by Tier
| Tier | Requests/Min | Burst Capacity |
|---|---|---|
| Free | 60 | +30 (50%, min 3) |
| CPU AMD / CPU Intel | 120 | +60 (50%, min 10) |
| GPU NVIDIA / GPU AMD / GPU Intel | 240 | +120 (50%, min 10) |
| XIM | Unlimited | N/A |
Burst capacity allows short-term traffic spikes above the base limit without immediately blocking requests.
429 Rate Limit Exceeded
When the rate limit is exceeded, the API returns a 429 Too Many Requests response with retry guidance:
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 15,
    "retry_strategy": {
      "type": "exponential_backoff",
      "initial_delay_ms": 15000,
      "max_delay_ms": 60000,
      "multiplier": 2,
      "jitter": true
    }
  }
}
The Retry-After header is also set. Clients should implement exponential backoff with jitter as described in the retry_strategy object.
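The retry_strategy object maps directly to a delay schedule. A sketch of the computation; the server does not specify a jitter distribution, so full jitter (a uniform draw up to the computed delay) is assumed here:

```python
import random

# Sketch: compute retry delays from the retry_strategy object in a 429 body.
def backoff_delay_ms(strategy, attempt, rng=random.random):
    """Delay before retry number `attempt` (0-based), in milliseconds."""
    delay = strategy["initial_delay_ms"] * strategy["multiplier"] ** attempt
    delay = min(delay, strategy["max_delay_ms"])
    if strategy.get("jitter"):
        delay *= rng()  # full jitter: uniform in [0, delay); assumed distribution
    return delay

strategy = {"type": "exponential_backoff", "initial_delay_ms": 15000,
            "max_delay_ms": 60000, "multiplier": 2, "jitter": False}
print([backoff_delay_ms(strategy, a) for a in range(4)])
# [15000, 30000, 60000, 60000]
```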
Management API Rate Limits
Management API endpoints (batch, conversations, files, uploads, webhooks, and SLOs) are subject to a separate per-project rate limit. This limit applies uniformly across all management endpoints for a given project.
| Setting | Value |
|---|---|
| Default limit | 60 requests per minute per project |
| Configuration | Configured per-instance by the platform operator |
| Window | Sliding 60-second window |
Response headers on management API endpoints:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per window. |
| X-RateLimit-Remaining | Remaining requests in the current window. |
| X-RateLimit-Reset | Seconds until the window resets. |
| Retry-After | Seconds to wait before retrying (only on 429 responses). |
When the limit is exceeded, the API returns HTTP 429 Too Many Requests. Inference endpoints (chat completions, embeddings, responses) have separate per-endpoint rate limits as described above and are not affected by the management API rate limit.
Request Timeouts
Request timeouts are determined by your endpoint's service tier. These are not client-configurable.
Request Deadline Timeout
The maximum time a request can wait for a response before being cancelled:
| Tier | Timeout |
|---|---|
| Free | 30 seconds |
| CPU AMD / CPU Intel | 300 seconds |
| GPU NVIDIA / GPU AMD / GPU Intel | 300 seconds |
| XIM | 1800 seconds |
Idle Stream Timeout
For streaming requests, the maximum time between chunks before the stream is terminated:
| Tier | Idle Timeout |
|---|---|
| Free | 120 seconds |
| CPU AMD / CPU Intel | 600 seconds |
| GPU NVIDIA / GPU AMD / GPU Intel | 600 seconds |
| XIM | 3600 seconds |
For requests with reasoning_effort set, the idle timeout uses the full request deadline timeout to accommodate long reasoning phases with no output.
Timeout Error Responses
For non-streaming requests, timeout returns HTTP 408. For streaming requests, an SSE error event is sent:
event: error
data: {"error": {"type": "timeout_error", "message": "Request timed out after 30s. Your Free tier has a 30-second timeout limit."}}
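Streaming clients should treat this event as terminal. A sketch of parsing such an SSE error frame; the parser below is ours, and the frame text mirrors the example above:

```python
import json

# Sketch: parse a single SSE frame and pull out the error payload, if any.
def parse_sse_event(frame):
    event, data = "message", None   # SSE default event type is "message"
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
    return event, data

frame = (
    'event: error\n'
    'data: {"error": {"type": "timeout_error", '
    '"message": "Request timed out after 30s."}}'
)
event, data = parse_sse_event(frame)
print(event, data["error"]["type"])  # error timeout_error
```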
max_tokens Auto-Clamping
When max_tokens (or max_completion_tokens) exceeds the
available context window for a request, Xerotier automatically clamps the value
to the maximum available instead of rejecting the request with an error.
This is useful when clients hardcode a default max_tokens value
(e.g., 32000) that may exceed the remaining context as conversations grow.
Instead of returning a 400 error, the request succeeds with a reduced output
limit.
How It Works
Clamping occurs at two levels for reliability:
- Router-level (heuristic): The router estimates the input token count from message character length and clamps max_tokens if it would exceed the model's context window. This is a fast-path optimization that avoids an extra round trip to the inference engine.
- Agent-level (exact): If a request passes the router estimate but the inference engine rejects it with an exact token count error, the agent automatically retries once with the corrected value. This safety net uses precise tokenizer counts.
Detection
When clamping occurs, the response includes the
X-Xerotier-Max-Tokens-Clamped header showing the original and
clamped values:
X-Xerotier-Max-Tokens-Clamped: 32000 -> 18943
Clients can inspect this header to detect that clamping occurred and adjust
their max_tokens settings if desired. When no clamping is needed,
the header is absent.
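Parsing the header value is a simple split on the arrow. A sketch; the parse_clamp_header helper is ours, and the commented SDK call shows one way (the OpenAI Python SDK's with_raw_response) to reach raw response headers:

```python
# Sketch: parse the X-Xerotier-Max-Tokens-Clamped header value.
# With the openai Python SDK (v1+), raw headers are reachable via:
#   raw = client.chat.completions.with_raw_response.create(...)
#   value = raw.headers.get("X-Xerotier-Max-Tokens-Clamped")

def parse_clamp_header(value):
    """Return (original, clamped) ints, or None when the header is absent."""
    if not value:
        return None
    original, clamped = (int(part.strip()) for part in value.split("->"))
    return original, clamped

print(parse_clamp_header("32000 -> 18943"))  # (32000, 18943)
print(parse_clamp_header(None))              # None
```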
Models
List and describe the models available through your Xerotier.ai endpoint.
List Models
GET /:project_id/v1/models
Lists the currently available models and their metadata.
curl https://api.xerotier.ai/proj_ABC123/v1/models \
-H "Authorization: Bearer xero_myproject_your_api_key"
Response
{
  "object": "list",
  "data": [
    {
      "id": "deepseek-r1-distill-llama-70b",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    },
    {
      "id": "llama-3.1-8b-instruct",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    }
  ]
}
Retrieve Model
GET /:project_id/v1/models/{model}
Retrieves a model instance, providing information about the model.
Endpoints
List inference endpoints configured for your project.
List Endpoints
GET /:project_id/v1/endpoints
Returns all non-deleted endpoints for your project, including those in provisioning, suspended, or error states.
curl https://api.xerotier.ai/proj_ABC123/v1/endpoints \
-H "Authorization: Bearer xero_myproject_your_api_key"
import requests

headers = {"Authorization": "Bearer xero_myproject_your_api_key"}
response = requests.get(
    "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
    headers=headers
)

for endpoint in response.json()["data"]:
    print(f"{endpoint['name']} ({endpoint['status']})")
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
  {
    headers: {
      "Authorization": "Bearer xero_myproject_your_api_key"
    }
  }
);

const data = await response.json();
for (const endpoint of data.data) {
  console.log(`${endpoint.name} (${endpoint.status})`);
}
Response
{
"object": "list",
"data": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"object": "endpoint",
"slug": "my-endpoint",
"name": "My Endpoint",
"model_id": "00000000-1111-0000-1111-000000000000",
"model_name": "llama-3.1-8b-instruct",
"tier_id": "free",
"status": "active",
"custom_domain": null,
"max_requests_per_minute": 60,
"max_tokens_per_minute": 100000,
"provisioning_state": null,
"provisioned_worker_id": null,
"created": 1706123456
}
]
}