// API Reference

Error Handling

Errors round-trip as OpenAI envelopes, no exceptions. Catch the same shapes you already catch, branch on error.code, and let the router pick which backend takes the next attempt under the documented retry policy.

Error Response Format

All error responses follow the OpenAI-compatible JSON format:

JSON
{
  "error": {
    "message": "Missing required field 'model'.",
    "type": "invalid_request_error",
    "code": "invalid_request",
    "param": "model"
  }
}

SDK integrators: the router currently emits a small number of envelope type strings that are outside the OpenAI vocabulary (authorization_error, internal_error, validation_error, service_error, stream_error, forbidden_error, cancelled). Switch on error.code, not error.type, when branching retry or user-facing logic. The code field is stable and machine-readable; the type field will be brought back to OpenAI parity in a future release.

Error Object Fields

Field Type Description
message string Human-readable error description. Sanitized on most paths to strip Swift type names, database errors, and stack traces; do not rely on sensitive data (file paths, IPs, tokens, UUIDs) being stripped in every response.
type string Error category (e.g., invalid_request_error, authentication_error, rate_limit_error, server_error).
code string | null Machine-readable error code. See Error Codes below.
param string | null The request parameter that caused the error, if applicable.
retry_after integer Vendor extension. Present only on HTTP 429 rate-limit responses. Suggested seconds to wait before retrying. Mirrors the Retry-After response header.
retry_strategy object Vendor extension. Present only on HTTP 429 rate-limit responses. Contains type, initial_delay_ms, max_delay_ms, multiplier, and jitter fields describing the recommended client-side backoff policy.
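
The fields above can be read into a small typed structure before any branching happens. The following is a minimal sketch, not part of any official SDK: the RouterError class and parse_error function are hypothetical names, and the sketch assumes the response body has already been decoded from JSON into a dict.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RouterError:
    """Parsed view of the documented error object (hypothetical helper)."""
    message: str
    code: Optional[str]
    type: Optional[str] = None
    param: Optional[str] = None
    retry_after: Optional[int] = None       # vendor extension, 429 only
    retry_strategy: Optional[dict] = None   # vendor extension, 429 only


def parse_error(body: dict[str, Any]) -> RouterError:
    """Build a RouterError from a decoded JSON response body."""
    err = body.get("error") or {}
    return RouterError(
        message=err.get("message", ""),
        code=err.get("code"),
        type=err.get("type"),
        param=err.get("param"),
        retry_after=err.get("retry_after"),
        retry_strategy=err.get("retry_strategy"),
    )


body = {"error": {"message": "Missing required field 'model'.",
                  "type": "invalid_request_error",
                  "code": "invalid_request", "param": "model"}}
err = parse_error(body)
# Branch on err.code; err.type is informational only.
print(err.code, err.param)
```

Keeping type optional reflects the guidance to switch on code: callers never need type to decide what to do next.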

Error Codes

The following error codes can appear in inference API responses. Each code maps to a specific HTTP status, fault category, and retry policy.

Code HTTP Fault Retryable Description
invalid_request 400 Client No Request parameters are invalid (bad JSON, invalid max_tokens, unsupported modalities).
context_length_exceeded 400 Client No The input exceeds the model's context window or leaves too few tokens for output. Reduce prompt length, lower max_tokens, or use a model with a larger context window.
json_parse_error 400 Client No The request body could not be parsed as valid JSON.
authentication_error 401 Client No Invalid or missing API key. Check your Authorization header.
insufficient_quota 402 Client No Project credit balance is exhausted. Top up credits or upgrade your subscription to resume requests.
billing_delinquent 402 Client No Payment is overdue on the project's subscription. Resolve the outstanding invoice in the billing dashboard.
endpoint_restricted 403 Client No API key is restricted to a specific endpoint and cannot access this resource, or the client IP is blocked by the endpoint's IP filter.
model_not_found 404 Client No The requested model is not found or not loaded on any backend.
project_not_found 404 Client No The specified project does not exist or you do not have access.
endpoint_not_found 404 Client No The specified endpoint does not exist within this project.
completion_not_found 404 Client No The specified stored completion does not exist.
response_not_found 404 Client No The specified response does not exist.
rate_limit_exceeded 429 Client Yes Per-key or per-endpoint request rate limit exceeded. Check the Retry-After and X-RateLimit-Reset headers and retry with exponential backoff.
capacity_exceeded 429 Agent Yes Backend worker is at capacity. Retry after the suggested delay.
quota_exceeded 429 Client No Your account or endpoint has exceeded its usage quota. Upgrade your tier or wait for the quota to reset.
endpoint_inactive 503 Agent Yes The endpoint is not currently active. It may be provisioning or disabled.
backend_unavailable 503 Agent Yes Backend inference engine is unavailable (down, unreachable, or returning 5xx errors).
timeout 408 Network Yes Request timed out before completion. May be a connect timeout, read timeout, or deadline exceeded.
invalid_state 400 Client No The resource is not in a valid state for the requested operation (e.g., cancelling an already completed response).
cancelled 499 Client No Request was cancelled by the client (connection closed) or by the router.
internal_error 500 Agent Yes An unexpected internal error occurred. The router will attempt to retry on a different backend.

Operational Error Codes

The following additional code values can appear on specific routes (project management, billing, tool invocations, exec, embeddings). They use the same envelope shape and fault-category retry rules as the inference codes above. Operators integrating against the billed router are most likely to encounter scope_insufficient, cross_project_access, insufficient_quota, and billing_delinquent.

Code HTTP Description
scope_insufficient 403 The API key does not carry the scope required by this route (one of inference, management, execution, research).
cross_project_access 403 The API key belongs to a different project than the targeted resource.
tool_not_mcp_visible 403 The tool exists but is not exposed to MCP clients on this endpoint.
signing_public_key_required 400 Endpoint requires a registered signing public key before requests are accepted.
invalid_model_id 400 The model field is malformed or refers to an unknown model.
invalid_tier 400 The requested tier does not exist or is not available to the project.
unsupported_modality 400 The request includes a modality (audio, vision) that this endpoint does not support.
model_not_embedding 400 The model targeted by /v1/embeddings is not an embeddings model.
model_capability_missing 400 The model is loaded but does not advertise the capability required by the request (tools, reranking, etc.).
endpoint_task_mode_mismatch 400 The endpoint is configured for a different task mode than the request implies.
model_not_scoring 400 The model targeted by a rerank / score request is not a scoring model.
model_provisioning 503 The model is being provisioned on a backend. Retry after a short delay.
tool_executor_unavailable 503 The exec backend is offline or unreachable.
invocation_terminal 409 The tool invocation has already reached a terminal state and cannot be modified.
invocation_not_found 404 The referenced tool invocation does not exist within this project.
execution_not_found 404 The referenced exec execution does not exist within this project.
approval_not_found 404 The referenced approval request does not exist.
approval_not_pending 409 The approval request has already been resolved (approved, denied, or expired).
candidate_not_found 404 The referenced rerank candidate does not exist.
exec_tool_not_found 404 The referenced exec tool name is not registered.
agent_not_found 404 The referenced agent does not exist within this project.

A complete enumeration of operational codes appears in the operator handbook; this section lists the codes most likely to surface in production SDK integrations.

Distinguishing 429 Errors

Three error codes map to HTTP 429, each with different meanings and retry behavior:

  • rate_limit_exceeded -- Per-key or per-endpoint request rate limit hit. The rate limiter's sliding window is full. Wait for the delay given by the Retry-After or X-RateLimit-Reset header, then retry with exponential backoff.
  • capacity_exceeded -- Transient backend overload. The backend worker has no available slots. Retry after the delay indicated in the response or the Retry-After header.
  • quota_exceeded -- Persistent billing quota exhaustion. Your account has hit its token or request usage limit. Do not retry; upgrade your plan or wait for the quota period to reset.

Check the code field in the error response to distinguish them.
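
One possible way to encode this distinction on the client is sketched below. The function name should_retry_429 is hypothetical, and the sketch assumes it receives the inner error object of a 429 response as a dict. Treating unrecognised codes as non-retryable is a conservative assumption, not documented router behaviour.

```python
from typing import Optional

# The two 429 codes this page documents as retryable.
RETRYABLE_429 = {"rate_limit_exceeded", "capacity_exceeded"}


def should_retry_429(error: dict) -> tuple[bool, Optional[float]]:
    """Return (retry?, suggested wait in seconds) for a 429 error object.

    A wait of None means the body carried no retry_after value and the
    caller should fall back to the Retry-After response header.
    """
    code = error.get("code")
    if code in RETRYABLE_429:
        return True, error.get("retry_after")
    # quota_exceeded, plus any code this sketch does not recognise:
    # do not retry automatically.
    return False, None


print(should_retry_429({"code": "rate_limit_exceeded", "retry_after": 15}))
print(should_retry_429({"code": "quota_exceeded"}))
```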

Rate Limit Headers

Every inference response includes rate limit headers so clients can track their current usage window and anticipate throttling before it occurs. Both the IETF draft form (RateLimit-*, per draft-ietf-httpapi-ratelimit-headers) and the widely supported vendor form (X-RateLimit-*) are always present, so a client behind an HTTP proxy that strips X- prefixed headers can still read the unprefixed ones (and vice versa).

Header Description
X-RateLimit-Limit Maximum number of requests allowed per rate limit window for the authenticated API key or endpoint.
X-RateLimit-Remaining Number of requests remaining in the current window.
X-RateLimit-Reset Seconds until the current window resets and the limit is restored.
RateLimit-Limit Same as X-RateLimit-Limit (IETF draft form).
RateLimit-Remaining Same as X-RateLimit-Remaining (IETF draft form).
RateLimit-Reset Same as X-RateLimit-Reset (IETF draft form).
X-RateLimit-Warning Present only when remaining requests fall below 20% of the limit. Value is approaching_limit. Use this as an early warning to reduce request rate.
Retry-After Present on 429 responses. Number of seconds to wait before retrying. Also included in the JSON error body as retry_after.
X-Request-ID Opaque request identifier set by the router on every response. Include this value when contacting support so the request can be located in server logs.
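
A client can read whichever header form survives its network path. The sketch below is illustrative only: read_rate_limit is a hypothetical helper, it assumes the header values are plain integers as the descriptions above suggest, and it takes any mapping of header names to string values.

```python
from typing import Mapping, Optional


def read_rate_limit(headers: Mapping[str, str]) -> dict:
    """Read limit/remaining/reset from either header form (case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}

    def pick(name: str) -> Optional[int]:
        # Prefer the unprefixed IETF draft form, fall back to X-RateLimit-*.
        raw = lowered.get(f"ratelimit-{name}", lowered.get(f"x-ratelimit-{name}"))
        try:
            return int(raw) if raw is not None else None
        except ValueError:
            return None  # assumption: non-integer values are ignored

    return {
        "limit": pick("limit"),
        "remaining": pick("remaining"),
        "reset": pick("reset"),
        "approaching_limit": lowered.get("x-ratelimit-warning") == "approaching_limit",
    }


info = read_rate_limit({
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "12",
    "RateLimit-Reset": "30",
    "X-RateLimit-Warning": "approaching_limit",
})
print(info)
```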

Rate Limit Window

Rate limits use a sliding window algorithm. The window width and maximum request count depend on your service tier and any custom limits configured for your API key or endpoint. When a custom limit is configured, it takes precedence over the tier default.

Rate Limit Error Body

When a request is rate limited, the response body includes backoff guidance in addition to the standard error fields:

JSON
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 15 seconds using exponential backoff.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 15,
    "retry_strategy": {
      "type": "exponential_backoff",
      "initial_delay_ms": 15000,
      "max_delay_ms": 60000,
      "multiplier": 2,
      "jitter": true
    }
  }
}
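
To show how retry_strategy might translate into concrete wait times, here is a minimal sketch. The name backoff_delays is hypothetical. This page does not specify how jitter should be applied, so the rule used here (up to 10% added on top of each delay) is an assumption.

```python
import random


def backoff_delays(strategy: dict, attempts: int) -> list[float]:
    """Return the wait in seconds before each of `attempts` retries."""
    delay_ms = float(strategy["initial_delay_ms"])
    max_ms = float(strategy["max_delay_ms"])
    delays = []
    for _ in range(attempts):
        base = min(delay_ms, max_ms)          # never exceed max_delay_ms
        # Assumption: jitter adds up to 10% on top of the base delay.
        extra = random.uniform(0, base * 0.1) if strategy.get("jitter") else 0.0
        delays.append((base + extra) / 1000.0)
        delay_ms *= strategy["multiplier"]
    return delays


strategy = {"type": "exponential_backoff", "initial_delay_ms": 15000,
            "max_delay_ms": 60000, "multiplier": 2, "jitter": False}
print(backoff_delays(strategy, 4))  # [15.0, 30.0, 60.0, 60.0]
```

With jitter disabled, the values from the example body give waits of 15, 30, 60, and 60 seconds: the third retry reaches max_delay_ms and later retries stay capped there.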

Fault Categories

Errors are classified into three fault categories, each with a different retry policy:

// Client fault

Fix the request

Retries: 0. Backoff: none. Multiplier: none.

The request is invalid. The router does not retry; the caller fixes parameters, prompt length, or model name and resubmits.

// Agent fault

Router retries

Retries: 3. Backoff: 1s to 30s. Multiplier: 2x.

Backend issue. The router retries on a different backend automatically before returning an error to the client.

// Network fault

Aggressive backoff

Retries: 5. Backoff: 0.5s to 60s. Multiplier: 2x.

Network or timeout issue. The router retries with aggressive backoff before surfacing the failure.

Category | Max Retries | Initial Backoff | Max Backoff | Multiplier | Description
Client Fault | 0 | -- | -- | -- | The request is invalid. Fix the request before retrying.
Agent Fault | 3 | 1s | 30s | 2x | Backend issue. The router retries on a different backend automatically.
Network Fault | 5 | 0.5s | 60s | 2x | Network or timeout issue. The router retries with aggressive backoff.

The router performs internal retries for agent and network faults before returning an error to the client. If all retries are exhausted, the final error is returned with the appropriate HTTP status code.
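
For capacity planning, it can help to estimate how much waiting the router's internal retries may add before an error reaches the client. The sketch below works this out from the table values. It assumes plain exponential growth without jitter, which this page does not state, and RetryPolicy and POLICIES are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int
    initial_s: float
    max_s: float
    multiplier: float

    def delays(self) -> list[float]:
        """Backoff before each retry, capped at max_s."""
        return [min(self.initial_s * self.multiplier ** i, self.max_s)
                for i in range(self.max_retries)]


# Values copied from the fault-category table.
POLICIES = {
    "client": RetryPolicy(0, 0.0, 0.0, 1.0),
    "agent": RetryPolicy(3, 1.0, 30.0, 2.0),
    "network": RetryPolicy(5, 0.5, 60.0, 2.0),
}

for name, policy in POLICIES.items():
    print(name, policy.delays(), sum(policy.delays()))
```

Under these assumptions, agent faults add at most 7 seconds of backoff (1 + 2 + 4) and network faults at most 15.5 seconds (0.5 + 1 + 2 + 4 + 8), not counting the time each attempt itself takes. Neither schedule reaches its maximum backoff within the allowed retry count.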

Inference-Specific Errors

The following scenarios are common during inference and have specific handling guidance:

Context Length Exceeded (400)

Your prompt exceeds the model's maximum context length. The error message includes details about the limit. Reduce your prompt length or use a model with a larger context window.

Max Tokens Invalid (400)

The max_tokens or max_completion_tokens value is invalid (negative, zero, or exceeds the model's limit). Adjust the value to be within the model's supported range.

Invalid Reasoning Effort (400)

The reasoning_effort parameter must be one of "low", "medium", or "high". Any other value (including uppercase variants like "LOW" or numeric strings like "1") returns a 400 error with type: "invalid_request_error" and param: "reasoning_effort". Omit the field entirely if you do not need reasoning effort control.

Invalid Logprobs Configuration (400)

Two logprobs validation rules are enforced:

  • top_logprobs without logprobs -- Setting top_logprobs requires logprobs: true. If logprobs is omitted or set to false while top_logprobs is present, the request returns a 400 error with type: "invalid_request_error" and param: "top_logprobs".
  • top_logprobs out of range -- The top_logprobs value must be between 0 and 20 (inclusive). Values outside this range return a 400 error with type: "invalid_request_error" and param: "top_logprobs".
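
These rules are simple enough to check on the client before a request is sent, which saves a round trip. The following is a sketch under stated assumptions: validate_request is a hypothetical helper, it covers only the reasoning_effort and logprobs rules described above, and it treats an explicit null reasoning_effort as omitted (an assumption; server behaviour for null is not described here).

```python
def validate_request(payload: dict) -> list[tuple[str, str]]:
    """Return (param, problem) pairs mirroring the documented 400 rules."""
    problems = []

    effort = payload.get("reasoning_effort")
    # Assumption: None is treated the same as omitting the field.
    if effort is not None and effort not in ("low", "medium", "high"):
        problems.append(("reasoning_effort", 'must be "low", "medium", or "high"'))

    if "top_logprobs" in payload:
        top = payload["top_logprobs"]
        if payload.get("logprobs") is not True:
            problems.append(("top_logprobs", "requires logprobs: true"))
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(top, int) or isinstance(top, bool) or not 0 <= top <= 20:
            problems.append(("top_logprobs", "must be between 0 and 20 inclusive"))

    return problems


print(validate_request({"reasoning_effort": "LOW", "top_logprobs": 5}))
```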

Model Not Loaded (404)

The model specified in your endpoint configuration is not currently loaded on any backend. This can happen if no backends are available for your tier or if the model has been removed. Check your endpoint configuration and backend status.

Request Timeout (408)

The request did not complete within the tier's timeout limit. Timeouts vary by tier: the foundational free tier is 30 seconds and the self_hosted tier is 1800 seconds. Custom tiers may set any timeout; the active value is visible on the tier configuration in the dashboard. Consider using streaming to avoid timeouts on long-running generation, or reduce the max_tokens value.

Streaming Errors

Errors during streaming behave differently depending on when they occur:

Pre-Stream Errors

If an error occurs before any tokens are generated (e.g., invalid request, model not found), you receive a standard HTTP error response with the appropriate status code. No SSE events are sent.

Mid-Stream Errors

If an error occurs after streaming has started (HTTP 200 has already been sent), the error is delivered as an SSE event:

SSE Error Event
event: error
data: {"error":{"message":"Backend connection lost","type":"server_error","code":"backend_unavailable"}}

Branch on code, not type. After an error event, the stream is terminated with a data: [DONE] sentinel. Mid-stream error envelopes populate the type and code fields inconsistently across emit sites: some events carry only code, some carry both. The code field is the stable handle.

The combinations that can appear mid-stream are:

Error type | Error code | Cause
server_error | internal_error, backend_unavailable | Internal server error during generation. The code field identifies the specific cause.
stream_error | (varies) | Generic streaming-pipeline failure. Inspect code and message for the specific cause.
timeout_error | timeout | Request deadline exceeded during generation.
(not set) | stream_idle_timeout | No data received from backend within the idle timeout period. The envelope has only code; type is omitted.
(not set) | cancelled | Request was cancelled by the client or router. The envelope has only code; type is omitted.

Handling Partial Responses

When a mid-stream error occurs, you may have received partial content. Concatenate all delta.content values received before the error to get the partial response. Decide whether to use the partial content or retry the full request based on your application requirements.
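
The following sketch shows one way to collect partial content and capture a mid-stream error from already-decoded SSE lines. It assumes chunks follow the OpenAI chat-completion chunk shape (choices[].delta.content); collect_stream is a hypothetical name, and a production client would normally use an SSE library rather than hand-rolled parsing.

```python
import json
from typing import Iterable, Optional


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Return (partial_text, error_object_or_None) from decoded SSE lines."""
    parts: list[str] = []
    error: Optional[dict] = None
    event = "message"
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                      # blank line ends one SSE event
            event = "message"
            continue
        if line.startswith(":"):          # SSE comment, e.g. a heartbeat
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":              # sentinel that terminates the stream
            break
        payload = json.loads(data)
        if event == "error" or "error" in payload:
            error = payload.get("error", payload)
            continue
        for choice in payload.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts), error


sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}', "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}', "",
    ": heartbeat",
    "event: error",
    'data: {"error":{"message":"Backend connection lost","type":"server_error","code":"backend_unavailable"}}', "",
    "data: [DONE]", "",
]
text, err = collect_stream(sample)
print(repr(text), err["code"] if err else None)
```

The caller gets both the text received so far and the error object, and can then decide on error["code"] whether to keep the partial output or retry the full request.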

Retry Guidance

The router handles most retries internally, but if the final response is an error, use these guidelines for client-side retries:

Retryable Errors

  • rate_limit_exceeded (429) -- Respect the Retry-After header and use exponential backoff. The error body includes a retry_strategy object with recommended parameters.
  • capacity_exceeded (429) -- Wait for the duration in the Retry-After header, then retry.
  • backend_unavailable (503) -- Wait 10-30 seconds, then retry. If the response carries a Retry-After header, use that delay instead.
  • timeout (408) -- Retry after 5 seconds. Consider reducing prompt length or max_tokens.
  • internal_error (500) -- Retry after 10 seconds. If persistent, contact support.

Non-Retryable Errors

  • invalid_request (400) -- Fix the request. Check parameters, prompt length, and model name.
  • authentication_error (401) -- Verify your API key is correct and active.
  • endpoint_restricted (403) -- The API key is restricted to a different endpoint, or the client IP is blocked. Update the key or IP filter configuration.
  • model_not_found (404) -- Verify the model is configured for your endpoint.
  • quota_exceeded (429) -- Upgrade your plan or wait for the quota period to reset.
  • cancelled (499) -- The request was cancelled by the client (connection closed) or by the router. No retry needed unless the cancellation was unintentional.
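
The two lists above can be folded into a single lookup keyed on error.code. This is a rough sketch: client_retry_delay and the table names are hypothetical, and the 1-second default for header-driven codes without a Retry-After value is an assumption rather than documented behaviour.

```python
import random
from typing import Optional

# (low, high) seconds to wait when no Retry-After value is available.
FALLBACK_DELAY_S = {
    "backend_unavailable": (10.0, 30.0),
    "timeout": (5.0, 5.0),
    "internal_error": (10.0, 10.0),
}
# Codes whose delay should come from the Retry-After header.
HEADER_DRIVEN = {"rate_limit_exceeded", "capacity_exceeded"}


def client_retry_delay(code: Optional[str],
                       retry_after_s: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before a client-side retry, or None if not retryable."""
    if code in HEADER_DRIVEN:
        # Assumption: fall back to 1 second if the header is missing.
        return retry_after_s if retry_after_s is not None else 1.0
    if code in FALLBACK_DELAY_S:
        if retry_after_s is not None:
            return retry_after_s
        low, high = FALLBACK_DELAY_S[code]
        return random.uniform(low, high)
    return None  # invalid_request, quota_exceeded, cancelled, and so on


print(client_retry_delay("timeout"))
print(client_retry_delay("quota_exceeded"))
```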

Exponential Backoff Example

Python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value):
    """Parse a Retry-After header value.

    RFC 7231 allows either a non-negative integer (delta-seconds) or an
    HTTP-date. Returns the number of seconds to wait, or None if the value
    cannot be parsed.
    """
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        pass
    try:
        when = parsedate_to_datetime(value)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return None


def request_with_retry(make_request, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries + 1):
        response = make_request()
        if 200 <= response.status_code < 300:
            return response
        # Note: 429 also covers quota_exceeded, which is not retryable.
        # Check error.code in the body before retrying a 429.
        if response.status_code not in (408, 429, 500, 503):
            raise Exception(f"Non-retryable error: {response.status_code}")
        if attempt == max_retries:
            break  # out of retries; do not sleep before giving up
        # Prefer the server-provided delay when Retry-After is present.
        parsed = parse_retry_after(response.headers.get("Retry-After"))
        if parsed is not None:
            delay = parsed
        jitter = random.uniform(0, delay * 0.1)
        time.sleep(delay + jitter)
        delay = min(delay * 2, 60)
    raise Exception("Max retries exhausted")

Node.js Exponential Backoff Example

Node.js
// Parse a Retry-After header value per RFC 7231: either a non-negative
// number of seconds or an HTTP-date. Returns milliseconds to wait, or
// null if the value cannot be parsed.
function parseRetryAfterMs(value) {
  // Number("") is 0, so treat empty values as unparseable up front.
  if (value == null || String(value).trim() === "") {
    return null;
  }
  const asNumber = Number(value);
  if (Number.isFinite(asNumber) && asNumber >= 0) {
    return asNumber * 1000;
  }
  const asDate = Date.parse(value);
  if (Number.isFinite(asDate)) {
    return Math.max(0, asDate - Date.now());
  }
  return null;
}

async function requestWithRetry(makeRequest, maxRetries = 3) {
  let delay = 1000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await makeRequest();
    if (response.ok) {
      return response;
    }
    // Note: 429 also covers quota_exceeded, which is not retryable.
    // Check error.code in the body before retrying a 429.
    if (![408, 429, 500, 503].includes(response.status)) {
      throw new Error(`Non-retryable error: ${response.status}`);
    }
    if (attempt === maxRetries) {
      break; // out of retries; do not sleep before giving up
    }
    // Prefer the server-provided delay when Retry-After is present.
    const parsed = parseRetryAfterMs(response.headers.get("Retry-After"));
    if (parsed !== null) {
      delay = parsed;
    }
    const jitter = Math.random() * delay * 0.1;
    await new Promise((resolve) => setTimeout(resolve, delay + jitter));
    delay = Math.min(delay * 2, 60000);
  }
  throw new Error("Max retries exhausted");
}

Troubleshooting

My request returns 400 Bad Request

  • Check that your request body is valid JSON.
  • Verify the messages array is present and non-empty.
  • Check that your prompt does not exceed the model's context length.
  • Verify that max_tokens / max_completion_tokens are positive integers within the model's limit.
  • Check that temperature is between 0.0 and 2.0.
  • If using reasoning_effort, verify the value is one of "low", "medium", or "high".

My request returns 401 Unauthorized

  • Verify your API key is correct and has not been revoked.
  • Check the Authorization header format: Bearer xero_...
  • Ensure the API key belongs to the correct project.

My request returns 402 Payment Required

  • Check the code field. insufficient_quota means the project's credit balance is exhausted; top up credits in the billing dashboard or upgrade your subscription.
  • billing_delinquent means payment is overdue; resolve the outstanding invoice in the billing dashboard. Requests will not resume automatically until the balance is settled.

My request returns 403 Forbidden

  • Check the code field. endpoint_restricted means your API key is scoped to a specific endpoint and you are trying to access a different one.
  • If your IP is being blocked, verify your client IP is in the endpoint's IP allowlist, or that it is not in the blocklist. Check your endpoint's IP filter configuration in the dashboard.
  • Ensure you are not using a key issued for one project to access an endpoint belonging to another project.

I am getting 429 Too Many Requests

  • Check the code field: rate_limit_exceeded means per-key/endpoint request rate limit hit (retry with backoff); capacity_exceeded is transient backend overload (retry); quota_exceeded is a persistent billing quota (upgrade or wait).
  • Respect the Retry-After header. For rate_limit_exceeded, also check X-RateLimit-Reset.
  • Monitor X-RateLimit-Remaining on each response to proactively slow down before hitting the limit.
  • Watch for the X-RateLimit-Warning: approaching_limit header; it appears when fewer than 20% of the window's requests remain.
  • Reduce your request rate or implement client-side rate limiting.
  • Consider upgrading to a higher tier or requesting a custom rate limit for increased throughput.

My request returns 499 Client Closed Request

  • 499 is not a server-side error: it indicates the client disconnected (or the router-side timeout closed the connection) before the response completed. SDKs and proxies sometimes log it as a failure, but no retry is required unless the disconnect was unintentional.
  • If you are seeing repeated 499s with no client-side cancellation, check intermediate proxies for connection-idle timeouts shorter than your tier's request timeout.

My request returns 500 Internal Server Error

  • Retry once after 10 seconds. The router handles most transient backend failures internally, so a 500 that reaches the client typically indicates an unhandled condition.
  • If 500s appear in a burst shortly after a router release, the root cause is often database role-grant drift (PostgreSQL 42501) that requires an operator-side refresh migration. Capture the X-Request-ID header and contact support; the issue cannot be self-diagnosed from the error envelope alone because the underlying sqlState is logged server-side, not surfaced in the response.
  • Persistent 500s on a single route after a release should be reported with the request ID and approximate timestamp.

My responses are slow

  • Use streaming to reduce perceived latency.
  • Check if your prompts are optimized for prefix caching. See Prefix Caching.
  • Consider a GPU tier for latency-sensitive workloads.
  • Reduce max_tokens if you do not need long responses.

My streaming connection drops

  • Check your client's read timeout; it must exceed the tier's idle stream timeout. The foundational tiers use 120 seconds (free) and 3600 seconds (self_hosted); custom tiers may set any value. The active idle timeout is visible on the tier configuration in the dashboard.
  • The server sends heartbeat comments every 15 seconds to keep the connection alive. If you are behind a proxy, ensure it does not strip SSE events or impose its own timeout shorter than the idle timeout.
  • If you receive an event: error, the stream has been terminated by the server. Check the error code for the cause; type may be omitted on mid-stream errors.

When to Contact Support

Contact support if:

  • You receive persistent internal_error (500) responses that do not resolve with retries.
  • You see backend_unavailable (503) errors consistently for more than 5 minutes.
  • Your usage metrics do not match your billing.
  • You suspect unauthorized access to your account or API keys.

When contacting support, include: your project ID, the endpoint name, the request ID from the X-Request-ID response header, the error response body, and the approximate time of the issue.