Errors round-trip as OpenAI envelopes, no exceptions. Catch the same shapes you already catch, branch on error.code, and let the router pick which backend takes the next attempt under the documented retry policy.
All error responses follow the OpenAI-compatible JSON format:
{
  "error": {
    "message": "Missing required field 'model'.",
    "type": "invalid_request_error",
    "code": "invalid_request",
    "param": "model"
  }
}
SDK integrators: the router currently emits
a small number of envelope type strings that
are outside the OpenAI vocabulary
(authorization_error, internal_error,
validation_error, service_error,
stream_error, forbidden_error,
cancelled). Switch on
error.code, not error.type, when
branching retry or user-facing logic. The code
field is stable and machine-readable; the type
field will be brought back to OpenAI parity in a future
release.
| Field | Type | Description |
|---|---|---|
| message | string | Human-readable error description. Sanitized on most paths to strip Swift type names, database errors, and stack traces; do not rely on sensitive data (file paths, IPs, tokens, UUIDs) being stripped in every response. |
| type | string | Error category (e.g., invalid_request_error, authentication_error, rate_limit_error, server_error). |
| code | string or null | Machine-readable error code. See Error Codes below. |
| param | string or null | The request parameter that caused the error, if applicable. |
| retry_after | integer | Vendor extension. Present only on HTTP 429 rate-limit responses. Suggested seconds to wait before retrying. Mirrors the Retry-After response header. |
| retry_strategy | object | Vendor extension. Present only on HTTP 429 rate-limit responses. Contains type, initial_delay_ms, max_delay_ms, multiplier, and jitter fields describing the recommended client-side backoff policy. |
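A minimal sketch of reading this envelope in Python. The `RouterError` dataclass and `parse_error_envelope` helper are hypothetical names introduced for illustration, not part of any SDK; the fields simply mirror the table above.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RouterError:
    """Fields of the error envelope, as listed in the table above."""
    message: str
    type: Optional[str]
    code: Optional[str]
    param: Optional[str] = None
    retry_after: Optional[int] = None       # vendor extension, 429 only
    retry_strategy: Optional[dict] = None   # vendor extension, 429 only


def parse_error_envelope(body: str) -> Optional[RouterError]:
    """Return a RouterError, or None if the body is not an error envelope."""
    try:
        payload: Any = json.loads(body)
    except (TypeError, ValueError):
        return None
    err = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(err, dict):
        return None
    return RouterError(
        message=str(err.get("message", "")),
        type=err.get("type"),
        code=err.get("code"),
        param=err.get("param"),
        retry_after=err.get("retry_after"),
        retry_strategy=err.get("retry_strategy"),
    )
```

Every field except `message` is treated as optional here, so envelopes that omit `type` or `param` still parse.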
The following error codes can appear in inference API responses. Each code maps to a specific HTTP status, fault category, and retry policy.
| Code | HTTP | Fault | Retryable | Description |
|---|---|---|---|---|
| invalid_request | 400 | Client | No | Request parameters are invalid (bad JSON, invalid max_tokens, unsupported modalities). |
| context_length_exceeded | 400 | Client | No | The input exceeds the model's context window or leaves too few tokens for output. Reduce prompt length, lower max_tokens, or use a model with a larger context window. |
| json_parse_error | 400 | Client | No | The request body could not be parsed as valid JSON. |
| authentication_error | 401 | Client | No | Invalid or missing API key. Check your Authorization header. |
| insufficient_quota | 402 | Client | No | Project credit balance is exhausted. Top up credits or upgrade your subscription to resume requests. |
| billing_delinquent | 402 | Client | No | Payment is overdue on the project's subscription. Resolve the outstanding invoice in the billing dashboard. |
| endpoint_restricted | 403 | Client | No | API key is restricted to a specific endpoint and cannot access this resource, or the client IP is blocked by the endpoint's IP filter. |
| model_not_found | 404 | Client | No | The requested model is not found or not loaded on any backend. |
| project_not_found | 404 | Client | No | The specified project does not exist or you do not have access. |
| endpoint_not_found | 404 | Client | No | The specified endpoint does not exist within this project. |
| completion_not_found | 404 | Client | No | The specified stored completion does not exist. |
| response_not_found | 404 | Client | No | The specified response does not exist. |
| rate_limit_exceeded | 429 | Client | Yes | Per-key or per-endpoint request rate limit exceeded. Check the Retry-After and X-RateLimit-Reset headers and retry with exponential backoff. |
| capacity_exceeded | 429 | Agent | Yes | Backend worker is at capacity. Retry after the suggested delay. |
| quota_exceeded | 429 | Client | No | Your account or endpoint has exceeded its usage quota. Upgrade your tier or wait for the quota to reset. |
| endpoint_inactive | 503 | Agent | Yes | The endpoint is not currently active. It may be provisioning or disabled. |
| backend_unavailable | 503 | Agent | Yes | Backend inference engine is unavailable (down, unreachable, or returning 5xx errors). |
| timeout | 408 | Network | Yes | Request timed out before completion. May be a connect timeout, read timeout, or deadline exceeded. |
| invalid_state | 400 | Client | No | The resource is not in a valid state for the requested operation (e.g., cancelling an already completed response). |
| cancelled | 499 | Client | No | Request was cancelled by the client (connection closed) or by the router. |
| internal_error | 500 | Agent | Yes | An unexpected internal error occurred. The router will attempt to retry on a different backend. |
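The Retryable column above can be turned into a small lookup. This is a sketch only: `RETRYABLE_CODES` and `should_retry` are illustrative names, and the status-code fallback for envelopes that carry no code is an assumption rather than documented behavior.

```python
# Codes marked "Yes" in the Retryable column of the table above.
RETRYABLE_CODES = frozenset({
    "rate_limit_exceeded",
    "capacity_exceeded",
    "endpoint_inactive",
    "backend_unavailable",
    "timeout",
    "internal_error",
})


def should_retry(code, http_status=None):
    """Decide whether a failed request is worth retrying.

    Branches on error.code first. The status-code fallback is an
    assumption of this sketch for responses that carry no code.
    """
    if code is not None:
        return code in RETRYABLE_CODES
    # 429 is deliberately excluded: quota_exceeded shares that status
    # and is not retryable, so the status alone is ambiguous.
    return http_status in (408, 500, 503)
```

For example, `should_retry("quota_exceeded", 429)` is false while `should_retry("capacity_exceeded", 429)` is true, which is why the status code alone is not enough.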
The following additional code values can appear
on specific routes (project management, billing, tool
invocations, exec, embeddings). They use the same envelope
shape and fault-category retry rules as the inference codes
above. Operators integrating against the billed router are
most likely to encounter scope_insufficient,
cross_project_access,
insufficient_quota, and
billing_delinquent.
| Code | HTTP | Description |
|---|---|---|
| scope_insufficient | 403 | The API key does not carry the scope required by this route (one of inference, management, execution, research). |
| cross_project_access | 403 | The API key belongs to a different project than the targeted resource. |
| tool_not_mcp_visible | 403 | The tool exists but is not exposed to MCP clients on this endpoint. |
| signing_public_key_required | 400 | Endpoint requires a registered signing public key before requests are accepted. |
| invalid_model_id | 400 | The model field is malformed or refers to an unknown model. |
| invalid_tier | 400 | The requested tier does not exist or is not available to the project. |
| unsupported_modality | 400 | The request includes a modality (audio, vision) that this endpoint does not support. |
| model_not_embedding | 400 | The model targeted by /v1/embeddings is not an embeddings model. |
| model_capability_missing | 400 | The model is loaded but does not advertise the capability required by the request (tools, reranking, etc.). |
| endpoint_task_mode_mismatch | 400 | The endpoint is configured for a different task mode than the request implies. |
| model_not_scoring | 400 | The model targeted by a rerank / score request is not a scoring model. |
| model_provisioning | 503 | The model is being provisioned on a backend. Retry after a short delay. |
| tool_executor_unavailable | 503 | The exec backend is offline or unreachable. |
| invocation_terminal | 409 | The tool invocation has already reached a terminal state and cannot be modified. |
| invocation_not_found | 404 | The referenced tool invocation does not exist within this project. |
| execution_not_found | 404 | The referenced exec execution does not exist within this project. |
| approval_not_found | 404 | The referenced approval request does not exist. |
| approval_not_pending | 409 | The approval request has already been resolved (approved, denied, or expired). |
| candidate_not_found | 404 | The referenced rerank candidate does not exist. |
| exec_tool_not_found | 404 | The referenced exec tool name is not registered. |
| agent_not_found | 404 | The referenced agent does not exist within this project. |
A complete enumeration of operational codes appears in the operator handbook; this section lists the codes most likely to surface in production SDK integrations.
Three error codes map to HTTP 429, each with different meanings and retry behavior:
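- rate_limit_exceeded: the per-key or per-endpoint request rate limit was hit. Wait as long as the Retry-After / X-RateLimit-Reset headers indicate. Use exponential backoff.
- capacity_exceeded: the backend worker is at capacity. Retry after the delay given in the Retry-After header.
- quota_exceeded: the account or endpoint usage quota is exhausted. This is not retryable; upgrade your tier or wait for the quota to reset.
Check the code field in the error response to distinguish them.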
Every inference response includes rate limit headers so clients can
track their current usage window and anticipate throttling before it
occurs. Both the IETF draft form (RateLimit-*,
per draft-ietf-httpapi-ratelimit-headers)
and the widely supported vendor form (X-RateLimit-*) are
always present, so a client behind an HTTP proxy that strips
X- prefixed headers can still read the unprefixed ones (and
vice versa).
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum number of requests allowed per rate limit window for the authenticated API key or endpoint. |
| X-RateLimit-Remaining | Number of requests remaining in the current window. |
| X-RateLimit-Reset | Seconds until the current window resets and the limit is restored. |
| RateLimit-Limit | Same as X-RateLimit-Limit (IETF draft form). |
| RateLimit-Remaining | Same as X-RateLimit-Remaining (IETF draft form). |
| RateLimit-Reset | Same as X-RateLimit-Reset (IETF draft form). |
| X-RateLimit-Warning | Present only when remaining requests fall below 20% of the limit. Value is approaching_limit. Use this as an early warning to reduce request rate. |
| Retry-After | Present on 429 responses. Number of seconds to wait before retrying. Also included in the JSON error body as retry_after. |
| X-Request-ID | Opaque request identifier set by the router on every response. Include this value when contacting support so the request can be located in server logs. |
Rate limits use a sliding window algorithm. The window width and maximum request count depend on your service tier and any custom limits configured for your API key or endpoint. When a custom limit is configured, it takes precedence over the tier default.
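One way a client might read these headers, sketched under a few assumptions: `headers` is a plain mapping of header names to string values, and the helper names `read_rate_limit` and `should_slow_down` are made up for this example. The 20% threshold mirrors the X-RateLimit-Warning rule in the table above.

```python
def _lowered(headers):
    """Normalize header names so lookups are case-insensitive."""
    return {str(k).lower(): v for k, v in headers.items()}


def read_rate_limit(headers):
    """Return (limit, remaining, reset_seconds); each may be None.

    Tries the X-RateLimit-* form first, then the unprefixed RateLimit-* form.
    """
    lowered = _lowered(headers)

    def pick(name):
        for key in ("x-ratelimit-" + name, "ratelimit-" + name):
            raw = lowered.get(key)
            if raw is None:
                continue
            try:
                return int(raw)
            except (TypeError, ValueError):
                continue
        return None

    return pick("limit"), pick("remaining"), pick("reset")


def should_slow_down(headers, threshold=0.2):
    """True when the warning header is set or remaining/limit is below threshold."""
    if _lowered(headers).get("x-ratelimit-warning") == "approaching_limit":
        return True
    limit, remaining, _ = read_rate_limit(headers)
    if not limit or remaining is None:
        return False
    return remaining / limit < threshold
```

Checking both header forms keeps the client working when an intermediary drops one of them.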
When a request is rate limited, the response body includes backoff guidance in addition to the standard error fields:
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 15,
    "retry_strategy": {
      "type": "exponential_backoff",
      "initial_delay_ms": 15000,
      "max_delay_ms": 60000,
      "multiplier": 2,
      "jitter": true
    }
  }
}
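A sketch of how a client could turn the retry_strategy object into concrete delays. The function name `backoff_delay_ms` and the fallback defaults are assumptions, and so is the jitter amount (up to 10% added on top): the response only carries a boolean jitter flag and does not say how much to apply.

```python
import random


def backoff_delay_ms(strategy, attempt, rng=random.random):
    """Delay before retry number `attempt` (0-based), in milliseconds.

    Grows exponentially from initial_delay_ms by `multiplier` and is capped
    at max_delay_ms. When `jitter` is true, up to 10% is added after the
    cap is applied (an assumption of this sketch).
    """
    initial = strategy.get("initial_delay_ms", 1000)   # assumed default
    cap = strategy.get("max_delay_ms", 60000)          # assumed default
    multiplier = strategy.get("multiplier", 2)         # assumed default
    delay = min(initial * multiplier ** attempt, cap)
    if strategy.get("jitter"):
        delay += delay * 0.1 * rng()
    return delay
```

With the example body above and jitter ignored, the delays for attempts 0, 1, 2, and 3 are 15000, 30000, 60000, and 60000 ms: the third retry already reaches max_delay_ms.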
Errors are classified into three fault categories, each with a different retry policy:
// Client fault
0 retries -- backoff -- multiplier
The request is invalid. The router does not retry; the caller fixes parameters, prompt length, or model name and resubmits.
// Agent fault
3 retries 1s -> 30s backoff 2x multiplier
Backend issue. The router retries on a different backend automatically before returning an error to the client.
// Network fault
5 retries 0.5s -> 60s backoff 2x multiplier
Network or timeout issue. The router retries with aggressive backoff before surfacing the failure.
| Category | Max Retries | Initial Backoff | Max Backoff | Multiplier | Description |
|---|---|---|---|---|---|
| Client Fault | 0 | -- | -- | -- | The request is invalid. Fix the request before retrying. |
| Agent Fault | 3 | 1s | 30s | 2x | Backend issue. The router retries on a different backend automatically. |
| Network Fault | 5 | 0.5s | 60s | 2x | Network or timeout issue. The router retries with aggressive backoff. |
The router performs internal retries for agent and network faults before returning an error to the client. If all retries are exhausted, the final error is returned with the appropriate HTTP status code.
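To make the table concrete, the sketch below lists the wait before each retry for a category. It assumes plain exponential growth with no jitter, which the table does not confirm; `RETRY_POLICY` and `backoff_schedule` are illustrative names, not router internals.

```python
# (max_retries, initial_backoff_s, max_backoff_s, multiplier), from the table above.
RETRY_POLICY = {
    "client": (0, 0.0, 0.0, 1.0),
    "agent": (3, 1.0, 30.0, 2.0),
    "network": (5, 0.5, 60.0, 2.0),
}


def backoff_schedule(category):
    """List the wait in seconds before each retry for a fault category."""
    retries, initial, cap, multiplier = RETRY_POLICY[category]
    return [min(initial * multiplier ** n, cap) for n in range(retries)]
```

Under these assumptions an agent fault waits 1, 2, and 4 seconds (7 seconds in total) and a network fault waits 0.5, 1, 2, 4, and 8 seconds (15.5 seconds in total), so neither schedule reaches its maximum backoff within the allowed retries.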
The following scenarios are common during inference and have specific handling guidance:
Your prompt exceeds the model's maximum context length. The error message includes details about the limit. Reduce your prompt length or use a model with a larger context window.
The max_tokens or max_completion_tokens value is
invalid (negative, zero, or exceeds the model's limit). Adjust the value
to be within the model's supported range.
The reasoning_effort parameter must be one of
"low", "medium", or "high".
Any other value (including uppercase variants like "LOW"
or numeric strings like "1") returns a 400 error with
type: "invalid_request_error" and
param: "reasoning_effort". Omit the field entirely
if you do not need reasoning effort control.
Two logprobs validation rules are enforced:
- top_logprobs requires logprobs: true. If logprobs is omitted or set to false while top_logprobs is present, the request returns a 400 error with type: "invalid_request_error" and param: "top_logprobs".
- The top_logprobs value must be between 0 and 20 (inclusive). Values outside this range return a 400 error with type: "invalid_request_error" and param: "top_logprobs".
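The reasoning_effort and logprobs rules above can be checked before a request is sent. This is a hedged client-side sketch: `validate_sampling_params` is a hypothetical helper, and treating top_logprobs as an integer is an assumption, since the text only gives the numeric range.

```python
def validate_sampling_params(params):
    """Return a list of (param, problem) tuples; an empty list means no problems."""
    problems = []

    effort = params.get("reasoning_effort")
    if effort is not None and effort not in ("low", "medium", "high"):
        problems.append(("reasoning_effort", 'must be "low", "medium", or "high"'))

    top = params.get("top_logprobs")
    if top is not None:
        # Rule 1: top_logprobs is only accepted together with logprobs: true.
        if params.get("logprobs") is not True:
            problems.append(("top_logprobs", "requires logprobs: true"))
        # Rule 2: the value must lie in the inclusive range 0..20.
        if isinstance(top, bool) or not isinstance(top, int) or not 0 <= top <= 20:
            problems.append(("top_logprobs", "must be between 0 and 20"))

    return problems
```

The first element of each tuple matches the param value the router reports, so local and server-side failures can share one handling path.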
The model specified in your endpoint configuration is not currently loaded on any backend. This can happen if no backends are available for your tier or if the model has been removed. Check your endpoint configuration and backend status.
The request did not complete within the tier's timeout limit.
Timeouts vary by tier: the foundational free tier
is 30 seconds and the self_hosted tier is 1800
seconds. Custom tiers may set any timeout; the active value
is visible on the tier configuration in the dashboard.
Consider using streaming to avoid timeouts on long-running
generation, or reduce the max_tokens value.
Errors during streaming behave differently depending on when they occur:
If an error occurs before any tokens are generated (e.g., invalid request, model not found), you receive a standard HTTP error response with the appropriate status code. No SSE events are sent.
If an error occurs after streaming has started (HTTP 200 has already been sent), the error is delivered as an SSE event:
event: error
data: {"error":{"message":"Backend connection lost","type":"server_error","code":"backend_unavailable"}}
Branch on code, not type.
After an error event, the stream is terminated with a
data: [DONE] sentinel. Mid-stream error envelopes
populate the type and code fields
inconsistently across emit sites: some events carry only
code, some carry both. The code
field is the stable handle.
The combinations that can appear mid-stream are:
| Error type | Error code | Cause |
|---|---|---|
| server_error | internal_error, backend_unavailable | Internal server error during generation. The code field identifies the specific cause. |
| stream_error | (varies) | Generic streaming-pipeline failure. Inspect code and message for the specific cause. |
| timeout_error | timeout | Request deadline exceeded during generation. |
| (not set) | stream_idle_timeout | No data received from backend within the idle timeout period. The envelope has only code; type is omitted. |
| (not set) | cancelled | Request was cancelled by the client or router. The envelope has only code; type is omitted. |
When a mid-stream error occurs, you may have received partial content.
Concatenate all delta.content values received before the error
to get the partial response. Decide whether to use the partial content or
retry the full request based on your application requirements.
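A sketch of collecting partial content while watching for a mid-stream error. It assumes the lines have already been decoded to text and that content chunks follow the OpenAI chat-completion shape (`choices[].delta.content`); `consume_stream` is a name invented for this example.

```python
import json


def consume_stream(lines):
    """Collect streamed text and the mid-stream error, if any.

    `lines` is any iterable of decoded SSE lines. Returns (text, error),
    where error is the inner "error" object or None.
    """
    parts = []
    error = None
    event = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            event = None  # a blank line ends the current SSE event
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            payload = json.loads(data)
        except ValueError:
            continue
        if not isinstance(payload, dict):
            continue
        if event == "error" or "error" in payload:
            error = payload.get("error")
            continue
        for choice in payload.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), error
```

The caller can then inspect `error["code"]` when an error is returned and decide whether the partial text is usable or the whole request should be retried.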
The router handles most retries internally, but if the final response is an error, use these guidelines for client-side retries:
- rate_limit_exceeded (429): respect the Retry-After header and use exponential backoff. The error body includes a retry_strategy object with recommended parameters.
- capacity_exceeded (429): wait for the duration given in the Retry-After header, then retry.
- Other retryable errors (408, 500, 503): use exponential backoff unless a Retry-After header provides a specific delay.

import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value):
    """Parse a Retry-After header value.

    RFC 7231 allows either a non-negative integer (delta-seconds) or an
    HTTP-date. Returns the number of seconds to wait, or None if the
    value cannot be parsed.
    """
    if value is None:
        return None
    # First form: a plain number of seconds.
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        pass
    # Second form: an HTTP-date; wait until that moment.
    try:
        when = parsedate_to_datetime(value)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return None


def request_with_retry(make_request, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries + 1):
        response = make_request()
        if 200 <= response.status_code < 300:
            return response
        if response.status_code not in (408, 429, 500, 503):
            raise Exception(f"Non-retryable error: {response.status_code}")
        if attempt == max_retries:
            break  # no attempts left, so do not sleep again
        # A server-provided Retry-After overrides the local backoff value.
        parsed = parse_retry_after(response.headers.get("Retry-After"))
        if parsed is not None:
            delay = parsed
        jitter = random.uniform(0, delay * 0.1)
        time.sleep(delay + jitter)
        delay = min(delay * 2, 60)
    raise Exception("Max retries exhausted")
// Parse a Retry-After header value per RFC 7231: either a non-negative
// number of seconds or an HTTP-date. Returns milliseconds to wait, or
// null if the value cannot be parsed.
function parseRetryAfterMs(value) {
  if (value == null) {
    return null;
  }
  const asNumber = Number(value);
  if (Number.isFinite(asNumber) && asNumber >= 0) {
    return asNumber * 1000;
  }
  const asDate = Date.parse(value);
  if (Number.isFinite(asDate)) {
    return Math.max(0, asDate - Date.now());
  }
  return null;
}

async function requestWithRetry(makeRequest, maxRetries = 3) {
  let delay = 1000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await makeRequest();
    if (response.ok) {
      return response;
    }
    if (![408, 429, 500, 503].includes(response.status)) {
      throw new Error(`Non-retryable error: ${response.status}`);
    }
    if (attempt === maxRetries) {
      break; // no attempts left, so do not wait again
    }
    // A server-provided Retry-After overrides the local backoff value.
    const parsed = parseRetryAfterMs(response.headers.get("Retry-After"));
    if (parsed !== null) {
      delay = parsed;
    }
    const jitter = Math.random() * delay * 0.1;
    await new Promise(r => setTimeout(r, delay + jitter));
    delay = Math.min(delay * 2, 60000);
  }
  throw new Error("Max retries exhausted");
}
- Verify that the messages array is present and non-empty.
- Check that max_tokens / max_completion_tokens are positive integers within the model's limit.
- Check that temperature is between 0.0 and 2.0.
- If you set reasoning_effort, verify the value is one of "low", "medium", or "high".
- On a 401, check the Authorization header format: Bearer xero_...
- On a 402, check the code field. insufficient_quota means the project's credit balance is exhausted; top up credits in the billing dashboard or upgrade your subscription.
- billing_delinquent means payment is overdue; resolve the outstanding invoice in the billing dashboard. Requests will not resume automatically until the balance is settled.
- On a 403, check the code field. endpoint_restricted means your API key is scoped to a specific endpoint and you are trying to access a different one.
- On a 429, check the code field: rate_limit_exceeded means the per-key/endpoint request rate limit was hit (retry with backoff); capacity_exceeded is transient backend overload (retry); quota_exceeded is a persistent billing quota (upgrade or wait).
- Respect the Retry-After header. For rate_limit_exceeded, also check X-RateLimit-Reset.
- Monitor X-RateLimit-Remaining on each response to proactively slow down before hitting the limit.
- Watch for the X-RateLimit-Warning: approaching_limit header; it appears when fewer than 20% of the window's requests remain.
- A persistent failure can stem from a server-side database permission error (sqlState 42501) that requires an operator-side refresh migration. Capture the X-Request-ID header and contact support; the issue cannot be self-diagnosed from the error envelope alone because the underlying sqlState is logged server-side, not surfaced in the response.
- Reduce max_tokens if you do not need long responses.
- The stream idle timeout varies by tier, with separate values for free and self_hosted (3600 seconds for self_hosted); custom tiers may set any value. The active idle timeout is visible on the tier configuration in the dashboard.
- If you receive event: error, the stream has been terminated by the server. Check the error code for the cause.
Contact support if:
- You receive internal_error (500) responses that do not resolve with retries.
- You see backend_unavailable (503) errors consistently for more than 5 minutes.
When contacting support, include: your project ID, the endpoint name,
the request ID from the X-Request-ID response header,
the error response body, and the approximate time of the issue.