API Reference
Core API endpoints for chat completions and model information. The Xerotier.ai API is fully compatible with the OpenAI API specification.
Chat Completions
Creates a model response for the given chat conversation. This is the primary endpoint for interacting with language models.
Create Chat Completion
POST /:project_id/:endpoint_slug/v1/chat/completions
URL path: :project_id is your project's external ID (e.g., proj_ABC123) and
:endpoint_slug is the slug of the endpoint to route through (e.g., my-endpoint).
Your full base URL is https://api.xerotier.ai/proj_ABC123/<endpoint-slug>/v1.
Note on service tier: The service tier is determined by your endpoint configuration, not by the request body. There is no client-side tier override. The tier controls which accelerator types and workers can serve requests. See the Service Tiers documentation for full details.
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | ID of the model to use (e.g., "deepseek-r1-distill-llama-70b"). This field is informational; the actual model used is determined by the endpoint configuration. |
| messages (required) | array | A list of messages comprising the conversation so far. Supported roles: system, user, assistant, tool, developer. |
| max_tokens (optional) | integer | Maximum number of tokens to generate. When omitted, the backend uses the remaining context window as the limit. If this value exceeds the available context window, it is automatically clamped. See Auto-Clamping below. |
| max_completion_tokens (optional) | integer | Upper bound on the number of tokens to generate. Preferred over max_tokens. The same auto-clamping behavior applies. When both are set, this field takes precedence. |
| temperature (optional) | number | Sampling temperature (0.0-2.0). Higher values make output more random. Default: 0.7 |
| top_p (optional) | number | Nucleus sampling parameter (0.0-1.0). Default: 0.9 |
| stream (optional) | boolean | If true, partial message deltas are sent as Server-Sent Events. Default: true |
| stream_options (optional) | object | Options for streaming mode. Set {"include_usage": true} to include token usage in the final stream chunk. When omitted, or when include_usage is false, the final chunk does not contain a usage field. Token usage is always tracked internally for billing regardless of this setting. |
| stop (optional) | string \| array | Up to 4 sequences where the API will stop generating. |
| frequency_penalty (optional) | number | Penalty for repeated tokens (-2.0 to 2.0). Positive values decrease the likelihood of repeating the same tokens. Default: 0 |
| presence_penalty (optional) | number | Penalty for tokens already in the context (-2.0 to 2.0). Positive values increase the likelihood of new topics. Default: 0 |
| n (optional) | integer | Number of completions to generate. Default: 1. Only n=1 is supported in streaming mode. |
| seed (optional) | integer | Seed for deterministic sampling. When set, repeated requests with the same seed and parameters should return the same result. |
| logprobs (optional) | boolean | Return log probabilities of output tokens. Default: false |
| top_logprobs (optional) | integer | Number of most likely tokens to return log probabilities for (0-20). Requires logprobs: true. |
| logit_bias (optional) | object | Map of token IDs to bias values (-100 to 100). Modifies the likelihood of the specified tokens appearing in the output. |
| tools (optional) | array | A list of tool definitions the model may call. See Tool Calling below. |
| tool_choice (optional) | string \| object | Controls tool selection: "auto", "none", "required", or {"type": "function", "function": {"name": "fn_name"}} |
| parallel_tool_calls (optional) | boolean | Enable parallel tool calls. When true, the model may generate multiple tool calls in a single response. |
| response_format (optional) | object | Response format: {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}, "strict": true}} |
| metadata (optional) | object | Up to 16 key-value pairs of request metadata. Keys are limited to 64 characters, values to 512 characters. |
| user (optional) | string | A unique identifier for the end user. Used for abuse monitoring and usage tracking. |
| store (optional) | boolean | Store the completion for later retrieval. Default: false |
| reasoning_effort (optional) | string | Reasoning effort level for reasoning models (e.g., o1). Valid values: "low", "medium", "high". Controls how much reasoning the model applies before generating output. Invalid values return a 400 error. When set, the idle stream timeout uses the full request deadline to accommodate long reasoning phases. |
| service_tier (optional) | string | Requested service tier. Accepted for OpenAI API compatibility but has no effect on routing; the endpoint's configured tier is always used. The actual tier is returned in the service_tier response field. |
| prediction (optional) | object | Predicted output content for speculative decoding. When the model can verify the prediction, generation is faster because tokens are validated in parallel rather than generated sequentially. The object must have "type": "content" and a "content" field (string or array of content parts). Token counts for accepted and rejected predictions appear in completion_tokens_details. See Predicted Outputs below. |
| modalities (optional) | array | Output modalities to generate. Currently only ["text"] is supported. Requests with unsupported modalities (e.g., "audio", "image") return a 400 error. |
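Streaming responses interleave content deltas with, optionally, a final usage chunk (when stream_options sets "include_usage": true). A minimal sketch of the client-side accumulation logic, operating on plain dicts shaped like the streamed chunks; the sample chunks below are illustrative, not captured API output:

```python
# Sketch: fold streamed chat-completion chunks into final text plus usage.
# Chunks are plain dicts shaped like the SSE payloads; the sample data
# below is illustrative, not real API output.

def fold_stream(chunks):
    """Accumulate delta content and capture the usage object, if any."""
    text_parts, usage = [], None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):  # present only when include_usage is set
            usage = chunk["usage"]
    return "".join(text_parts), usage

chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Paris"}}]},
    {"choices": [{"delta": {"content": " is the capital."}}]},
    {"choices": [], "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}},
]
text, usage = fold_stream(chunks)
print(text)                    # Paris is the capital.
print(usage["total_tokens"])   # 33
```

With the OpenAI SDKs the same folding applies to the chunk objects yielded when stream=True.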
Optional Request Headers
| Header | Type | Description |
|---|---|---|
| X-SLO-TTFT-Ms (optional) | number | Target time-to-first-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
| X-SLO-TPOT-Ms (optional) | number | Target time-per-output-token in milliseconds. The router prefers workers likely to meet this target. Must be a positive number; invalid values are ignored. |
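Because the router silently ignores non-positive or non-numeric values, a client can mirror that rule before sending. A hedged sketch; the slo_headers helper is ours, and with the OpenAI Python SDK its result can be passed through the create() call's extra_headers argument:

```python
# Sketch: build the optional SLO routing headers, dropping invalid values
# the same way the router would ignore them (non-positive or non-numeric).

def slo_headers(ttft_ms=None, tpot_ms=None):
    headers = {}
    if isinstance(ttft_ms, (int, float)) and ttft_ms > 0:
        headers["X-SLO-TTFT-Ms"] = str(ttft_ms)
    if isinstance(tpot_ms, (int, float)) and tpot_ms > 0:
        headers["X-SLO-TPOT-Ms"] = str(tpot_ms)
    return headers

print(slo_headers(ttft_ms=500, tpot_ms=50))
# {'X-SLO-TTFT-Ms': '500', 'X-SLO-TPOT-Ms': '50'}
print(slo_headers(ttft_ms=-1))  # {} (invalid value dropped)

# Usage with the OpenAI Python SDK (sketch):
# client.chat.completions.create(..., extra_headers=slo_headers(ttft_ms=500))
```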
Response Headers
| Header | Description |
|---|---|
| X-Request-ID | Unique identifier for the request (matches the response body id field). Present on both streaming and non-streaming responses. Include this in support tickets for request tracing. |
| X-Xerotier-Worker-ID | Identifier of the worker that handled the request. Useful for correlating latency with routing decisions. |
| X-Xerotier-Max-Tokens-Clamped | Present only when max_tokens was automatically reduced to fit the available context window. Format: <original> -> <clamped> (e.g., 32000 -> 18943). See Auto-Clamping below. |
Message Object
| Parameter | Type | Description |
|---|---|---|
| role (required) | string | The role of the message author: system, user, assistant, tool, or developer. |
| content (required) | string \| array | The content of the message. Can be a string or an array of content parts. |
| name (optional) | string | An optional name for the participant. Useful for distinguishing between multiple users or assistants in the same conversation. |
| tool_call_id (optional) | string | Required when role is tool. The ID of the tool call this message responds to. |
| tool_calls (optional) | array | Tool calls generated by the model (present in assistant messages). |
Example Request
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const response = await client.chat.completions.create({
  model: "deepseek-r1-distill-llama-70b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 100,
  temperature: 0.7
});

console.log(response.choices[0].message.content);
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706123456,
  "model": "deepseek-r1-distill-llama-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "refusal": null,
        "annotations": []
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_44709d6fcb"
}
Log Probabilities Response
When logprobs: true is set in the request, each choice includes a
logprobs object with per-token log probabilities. The top_logprobs
parameter controls how many alternative tokens are returned (0-20).
| Field | Type | Description |
|---|---|---|
| logprobs.content | array \| null | Array of token log probability objects for each content token. Null when the model produces no content tokens (e.g., a pure refusal or tool call). |
| logprobs.content[].token | string | The token string. |
| logprobs.content[].logprob | float | Log probability of this token. 0.0 means 100% confidence; more negative values indicate lower confidence. |
| logprobs.content[].bytes | array \| null | UTF-8 byte representation of the token. |
| logprobs.content[].top_logprobs | array | Top alternative tokens at this position, each with token, logprob, and bytes fields. Array length matches the top_logprobs request parameter. |
| logprobs.refusal | array \| null | Array of token log probability objects for refusal tokens. Present when the model refuses to comply with a request. Each entry has the same structure as logprobs.content[] entries (token, logprob, bytes, top_logprobs). Null when the model does not refuse. |
Log Probabilities Request and Response Example
# Request with logprobs enabled
curl -X POST https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [{"role": "user", "content": "Is Paris the capital of France? Answer yes or no."}],
    "logprobs": true,
    "top_logprobs": 3,
    "max_tokens": 5
  }'
# Response (truncated)
{
  "id": "chatcmpl-abc456",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes"
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [
          {
            "token": "Yes",
            "logprob": -0.00012,
            "bytes": [89, 101, 115],
            "top_logprobs": [
              {"token": "Yes", "logprob": -0.00012, "bytes": [89, 101, 115]},
              {"token": "yes", "logprob": -9.08, "bytes": [121, 101, 115]},
              {"token": "YES", "logprob": -12.31, "bytes": [89, 69, 83]}
            ]
          }
        ],
        "refusal": null
      }
    }
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 1, "total_tokens": 19}
}
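Logprobs are natural logarithms, so exp() recovers the plain probability. For the response above, the chosen token is effectively certain:

```python
import math

# Convert a token logprob (natural log) back to a probability.
def to_probability(logprob):
    return math.exp(logprob)

print(round(to_probability(-0.00012), 5))  # 0.99988 for "Yes"
print(round(to_probability(-9.08), 6))     # 0.000114 for the "yes" alternative
```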
Usage Object
| Field | Type | Description |
|---|---|---|
| prompt_tokens | integer | Number of tokens in the input prompt. |
| completion_tokens | integer | Number of tokens in the generated output. |
| total_tokens | integer | Sum of prompt_tokens and completion_tokens. |
| prompt_tokens_details.cached_tokens | integer | Number of prompt tokens served from the prefix cache (KV cache reuse). A higher value indicates more cache hits, reducing time-to-first-token. 0 if no tokens were cached. |
| prompt_tokens_details.audio_tokens | integer \| null | Audio input tokens. Null for text-only models. |
| completion_tokens_details.reasoning_tokens | integer \| null | Tokens used for internal reasoning. Null for non-reasoning models. |
| completion_tokens_details.audio_tokens | integer \| null | Audio output tokens. Null for text-only models. |
| completion_tokens_details.accepted_prediction_tokens | integer \| null | Predicted tokens that appeared in the output. |
| completion_tokens_details.rejected_prediction_tokens | integer \| null | Predicted tokens that did not appear in the output. |
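A practical use of prompt_tokens_details.cached_tokens is monitoring prefix-cache effectiveness. A small sketch over a usage dict shaped like the table above (the helper name is ours):

```python
# Sketch: compute the prefix-cache hit rate from a usage object
# (a dict shaped like the Usage Object table).

def cache_hit_rate(usage):
    prompt = usage.get("prompt_tokens", 0)
    if prompt == 0:
        return 0.0
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0) or 0
    return cached / prompt

usage = {
    "prompt_tokens": 400,
    "completion_tokens": 50,
    "total_tokens": 450,
    "prompt_tokens_details": {"cached_tokens": 300, "audio_tokens": None},
}
print(cache_hit_rate(usage))  # 0.75
```

Higher values mean more KV-cache reuse and, per the table above, a lower time-to-first-token.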
Response Message Fields
The message object in each response choice contains the assistant's output.
In addition to role and content, the following fields may be present:
| Field | Type | Description |
|---|---|---|
| role | string | Always "assistant" in completion responses. |
| content | string \| null | The generated text content. Null when the model produces only tool calls or a refusal. |
| refusal | string \| null | A refusal message when the model declines to respond (content policy, safety filters). Null when the model does not refuse. When present, content is typically null. In streaming mode, refusal text is delivered incrementally via delta.refusal. |
| tool_calls | array \| null | Tool calls generated by the model. Present when finish_reason is "tool_calls". See Tool Calling. |
| annotations | array | Message annotations such as URL citations. Defaults to an empty array. Reserved for future use with web search and citation features. |
Additional Response Fields
| Field | Type | Description |
|---|---|---|
| service_tier | string \| null | The service tier used to process the request, as determined by the endpoint configuration (e.g., "free", "gpu_nvidia_shared"). Always present in both streaming and non-streaming responses. |
| system_fingerprint | string \| null | Identifies the backend system configuration (model weights, quantization, GPU type) used for the request. Can be used alongside the seed parameter for reproducibility debugging. Present in both streaming chunks and non-streaming responses. Returns null when the backend does not provide a fingerprint. |
Predicted Outputs
The prediction parameter enables speculative decoding: you supply a
predicted output and the model verifies it in parallel rather than generating
each token sequentially. When the prediction matches, generation is significantly
faster. When it does not match, the model falls back to normal generation.
Prediction Object
| Field | Type | Description |
|---|---|---|
| type (required) | string | Must be "content". |
| content (required) | string \| array | The predicted output text. Can be a plain string or an array of content parts (each with type and text fields). Arrays are normalized to a single concatenated string internally. |
Example Request
{
  "model": "deepseek-r1-distill-llama-70b",
  "messages": [
    {"role": "user", "content": "Replace 'hello' with 'goodbye' in: hello world, hello there"}
  ],
  "prediction": {
    "type": "content",
    "content": "goodbye world, goodbye there"
  }
}
Response Token Details
When a prediction is provided, the response usage.completion_tokens_details
includes prediction-specific token counts:
{
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 6,
    "total_tokens": 38,
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "accepted_prediction_tokens": 4,
      "rejected_prediction_tokens": 2
    }
  }
}
- accepted_prediction_tokens: Tokens from your prediction that the model verified and used. Higher values indicate a better prediction.
- rejected_prediction_tokens: Tokens from your prediction that the model discarded and regenerated. These still count toward billing.
Best Practices
- Use predictions for code editing, document reformatting, and template-based generation where you can anticipate the output structure.
- Accurate predictions reduce latency via parallel verification. Inaccurate predictions may be slower than no prediction at all.
- Monitor accepted_prediction_tokens vs. rejected_prediction_tokens to evaluate prediction quality.
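That monitoring reduces to a small calculation over completion_tokens_details. A sketch (the helper name is ours):

```python
# Sketch: evaluate prediction quality from completion_tokens_details.

def prediction_acceptance_rate(details):
    accepted = details.get("accepted_prediction_tokens") or 0
    rejected = details.get("rejected_prediction_tokens") or 0
    total = accepted + rejected
    return accepted / total if total else None  # None: no prediction was sent

details = {
    "reasoning_tokens": None,
    "accepted_prediction_tokens": 4,
    "rejected_prediction_tokens": 2,
}
print(prediction_acceptance_rate(details))  # 0.6666666666666666
```

A rate near 1.0 means the prediction is paying off; a low rate suggests the prediction may be slowing requests down and costing rejected tokens.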
Tool Calling
Xerotier supports OpenAI-compatible function calling. Define tools in your request and the model may generate tool calls in its response.
Tool Definition
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City name"
            }
          },
          "required": ["location"]
        },
        "strict": false
      }
    }
  ],
  "tool_choice": "auto"
}
Tool Call Response
When the model decides to call a tool, the response includes tool_calls instead of text content:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"Paris\"}"
          }
        }
      ]
    },
    "finish_reason": "tool_calls"
  }]
}
Tool Result Message
Send the tool result back using role tool with the matching tool_call_id:
{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_abc123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}
  ]
}
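The full round trip can be driven by a small dispatch loop: parse each call's JSON-encoded arguments, invoke the matching local function, and append a role: tool message carrying the same tool_call_id. A sketch with a stubbed get_weather; the stub's return value is illustrative:

```python
import json

# Local implementations of the tools offered to the model (stubbed here).
def get_weather(location):
    return {"temperature": 18, "condition": "sunny"}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Turn assistant tool_calls into role-"tool" result messages."""
    messages = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])          # arguments arrive as a JSON string
        result = TOOL_REGISTRY[fn["name"]](**args)  # dispatch to the local function
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],             # must match the call id
            "content": json.dumps(result),
        })
    return messages

tool_calls = [{
    "id": "call_abc123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"},
}]
print(run_tool_calls(tool_calls))
```

Append the returned messages to the conversation and call the completions endpoint again to let the model produce its final answer.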
Rate Limits
Every API response includes rate limit headers. Rate limits are enforced per service tier using a sliding window algorithm.
Rate Limit Response Headers
Both standard (draft IETF) and X- prefixed headers are returned for broad client compatibility:
| Header | Description |
|---|---|
| RateLimit-Limit / X-RateLimit-Limit | Maximum requests allowed per window. |
| RateLimit-Remaining / X-RateLimit-Remaining | Remaining requests in the current window. |
| RateLimit-Reset / X-RateLimit-Reset | Seconds until the current window resets. |
| X-RateLimit-Warning | Set to approaching_limit when remaining requests are below 20% of the limit. Not present otherwise. |
Rate Limits by Tier
| Tier | Requests/Min | Burst Capacity |
|---|---|---|
| Free | 60 | +30 (50%, min 3) |
| CPU AMD / CPU Intel | 120 | +60 (50%, min 10) |
| GPU NVIDIA / GPU AMD / GPU Intel | 240 | +120 (50%, min 10) |
| XIM | Unlimited | N/A |
Burst capacity allows short-term traffic spikes above the base limit without immediately blocking requests.
429 Rate Limit Exceeded
When the rate limit is exceeded, the API returns a 429 Too Many Requests response with retry guidance:
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 15,
    "retry_strategy": {
      "type": "exponential_backoff",
      "initial_delay_ms": 15000,
      "max_delay_ms": 60000,
      "multiplier": 2,
      "jitter": true
    }
  }
}
The Retry-After header is also set. Clients should implement exponential backoff with jitter as described in the retry_strategy object.
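The retry_strategy object maps directly to a delay schedule. A sketch of the computation; the server does not specify a jitter distribution, so full jitter (a uniform draw up to the computed delay) is assumed here:

```python
import random

# Sketch: compute retry delays from the retry_strategy object in a 429 body.
def backoff_delay_ms(strategy, attempt, rng=random.random):
    """Delay before retry number `attempt` (0-based), in milliseconds."""
    delay = strategy["initial_delay_ms"] * strategy["multiplier"] ** attempt
    delay = min(delay, strategy["max_delay_ms"])
    if strategy.get("jitter"):
        delay *= rng()  # full jitter: uniform in [0, delay); assumed distribution
    return delay

strategy = {"type": "exponential_backoff", "initial_delay_ms": 15000,
            "max_delay_ms": 60000, "multiplier": 2, "jitter": False}
print([backoff_delay_ms(strategy, a) for a in range(4)])
# [15000, 30000, 60000, 60000]
```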
Management API Rate Limits
Management API endpoints (batch, conversations, files, uploads, webhooks, and SLOs) are subject to a separate per-project rate limit. This limit applies uniformly across all management endpoints for a given project.
| Setting | Value |
|---|---|
| Default limit | 60 requests per minute per project |
| Configuration | Configured per-instance by the platform operator |
| Window | Sliding 60-second window |
Response headers on management API endpoints:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per window. |
| X-RateLimit-Remaining | Remaining requests in the current window. |
| X-RateLimit-Reset | Seconds until the window resets. |
| Retry-After | Seconds to wait before retrying (only on 429 responses). |
When the limit is exceeded, the API returns HTTP 429 Too Many Requests. Inference endpoints (chat completions, embeddings, responses) have separate per-endpoint rate limits as described above and are not affected by the management API rate limit.
Request Timeouts
Request timeouts are determined by your endpoint's service tier. These are not client-configurable.
Request Deadline Timeout
The maximum time a request can wait for a response before being cancelled:
| Tier | Timeout |
|---|---|
| Free | 30 seconds |
| CPU AMD / CPU Intel | 300 seconds |
| GPU NVIDIA / GPU AMD / GPU Intel | 300 seconds |
| XIM | 1800 seconds |
Idle Stream Timeout
For streaming requests, the maximum time between chunks before the stream is terminated:
| Tier | Idle Timeout |
|---|---|
| Free | 120 seconds |
| CPU AMD / CPU Intel | 600 seconds |
| GPU NVIDIA / GPU AMD / GPU Intel | 600 seconds |
| XIM | 3600 seconds |
For requests with reasoning_effort set, the idle timeout uses the full request deadline timeout to accommodate long reasoning phases with no output.
Timeout Error Responses
For non-streaming requests, timeout returns HTTP 408. For streaming requests, an SSE error event is sent:
event: error
data: {"error": {"type": "timeout_error", "message": "Request timed out after 30s. Your Free tier has a 30-second timeout limit."}}
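Streaming clients should treat this event as terminal. A sketch of parsing such an SSE error frame; the parser below is ours, and the frame text mirrors the example above:

```python
import json

# Sketch: parse a single SSE frame and pull out the error payload, if any.
def parse_sse_event(frame):
    event, data = "message", None   # SSE default event type is "message"
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
    return event, data

frame = (
    'event: error\n'
    'data: {"error": {"type": "timeout_error", '
    '"message": "Request timed out after 30s."}}'
)
event, data = parse_sse_event(frame)
print(event, data["error"]["type"])  # error timeout_error
```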
max_tokens Auto-Clamping
When max_tokens (or max_completion_tokens) exceeds the
available context window for a request, Xerotier automatically clamps the value
to the maximum available instead of rejecting the request with an error.
This is useful when clients hardcode a default max_tokens value
(e.g., 32000) that may exceed the remaining context as conversations grow.
Instead of returning a 400 error, the request succeeds with a reduced output
limit.
How It Works
Clamping occurs at two levels for reliability:
- Router-level (heuristic): The router estimates the input token count from message character length and clamps max_tokens if it would exceed the model's context window. This is a fast-path optimization that avoids an extra round trip to the inference engine.
- Agent-level (exact): If a request passes the router estimate but the inference engine rejects it with an exact token count error, the agent automatically retries once with the corrected value. This safety net uses precise tokenizer counts.
Detection
When clamping occurs, the response includes the
X-Xerotier-Max-Tokens-Clamped header showing the original and
clamped values:
X-Xerotier-Max-Tokens-Clamped: 32000 -> 18943
Clients can inspect this header to detect that clamping occurred and adjust
their max_tokens settings if desired. When no clamping is needed,
the header is absent.
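Parsing the header value is a simple split on the arrow. A sketch; the parse_clamp_header helper is ours, and the commented SDK call shows one way (the OpenAI Python SDK's with_raw_response) to reach raw response headers:

```python
# Sketch: parse the X-Xerotier-Max-Tokens-Clamped header value.
# With the openai Python SDK (v1+), raw headers are reachable via:
#   raw = client.chat.completions.with_raw_response.create(...)
#   value = raw.headers.get("X-Xerotier-Max-Tokens-Clamped")

def parse_clamp_header(value):
    """Return (original, clamped) ints, or None when the header is absent."""
    if not value:
        return None
    original, clamped = (int(part.strip()) for part in value.split("->"))
    return original, clamped

print(parse_clamp_header("32000 -> 18943"))  # (32000, 18943)
print(parse_clamp_header(None))              # None
```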
Models
List and describe the models available through your Xerotier.ai endpoint.
List Models
GET /:project_id/v1/models
Lists the currently available models and their metadata.
curl https://api.xerotier.ai/proj_ABC123/v1/models \
-H "Authorization: Bearer xero_myproject_your_api_key"
Response
{
  "object": "list",
  "data": [
    {
      "id": "deepseek-r1-distill-llama-70b",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    },
    {
      "id": "llama-3.1-8b-instruct",
      "object": "model",
      "created": 1706000000,
      "owned_by": "Xerotier.ai"
    }
  ]
}
Retrieve Model
GET /:project_id/v1/models/{model}
Retrieves a model instance, providing information about the model.
Endpoints
List inference endpoints configured for your project.
List Endpoints
GET /:project_id/v1/endpoints
Returns all non-deleted endpoints for your project, including those in provisioning, suspended, or error states.
curl https://api.xerotier.ai/proj_ABC123/v1/endpoints \
-H "Authorization: Bearer xero_myproject_your_api_key"
import requests

headers = {"Authorization": "Bearer xero_myproject_your_api_key"}
response = requests.get(
    "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
    headers=headers
)

for endpoint in response.json()["data"]:
    print(f"{endpoint['name']} ({endpoint['status']})")
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/v1/endpoints",
  {
    headers: {
      "Authorization": "Bearer xero_myproject_your_api_key"
    }
  }
);

const data = await response.json();
for (const endpoint of data.data) {
  console.log(`${endpoint.name} (${endpoint.status})`);
}
Response
{
"object": "list",
"data": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"object": "endpoint",
"slug": "my-endpoint",
"name": "My Endpoint",
"model_id": "00000000-1111-0000-1111-000000000000",
"model_name": "llama-3.1-8b-instruct",
"tier_id": "free",
"status": "active",
"custom_domain": null,
"max_requests_per_minute": 60,
"max_tokens_per_minute": 100000,
"provisioning_state": null,
"provisioned_worker_id": null,
"created": 1706123456
}
]
}