Error Handling

Error codes, fault categories, retry policies, and troubleshooting guidance.

Error Response Format

All error responses follow the OpenAI-compatible JSON format:

JSON
{
  "error": {
    "message": "The model 'nonexistent-model' does not exist.",
    "type": "not_found_error",
    "code": "model_not_found",
    "param": null
  }
}

Error Object Fields

  • message (string) -- Human-readable error description. Sensitive information (file paths, IPs, tokens, UUIDs) is automatically sanitized.
  • type (string) -- Error category (e.g., invalid_request_error, not_found_error, server_error).
  • code (string | null) -- Machine-readable error code. See Error Codes below.
  • param (string | null) -- The request parameter that caused the error, if applicable.
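As a sketch, a client can pull these fields out of a parsed error body (the sample payload here mirrors the example above):

```python
import json

def parse_error(body: str):
    """Extract the standard error fields from an error response body."""
    error = json.loads(body).get("error", {})
    return (
        error.get("message"),
        error.get("type"),
        error.get("code"),
        error.get("param"),
    )

body = '{"error": {"message": "The model \'nonexistent-model\' does not exist.", "type": "not_found_error", "code": "model_not_found", "param": null}}'
message, err_type, code, param = parse_error(body)
```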

Error Codes

The following error codes can appear in inference API responses. Each code maps to a specific HTTP status, fault category, and retry policy.

  • invalid_request -- 400, client fault, not retryable. Request parameters are invalid (bad JSON, context length exceeded, invalid max_tokens).
  • json_parse_error -- 400, client fault, not retryable. The request body could not be parsed as valid JSON.
  • authentication_error -- 401, client fault, not retryable. Invalid or missing API key. Check your Authorization header.
  • model_not_found -- 404, client fault, not retryable. The requested model is not found or not loaded on any backend.
  • project_not_found -- 404, client fault, not retryable. The specified project does not exist or you do not have access.
  • endpoint_not_found -- 404, client fault, not retryable. The specified endpoint does not exist within this project.
  • completion_not_found -- 404, client fault, not retryable. The specified stored completion does not exist.
  • response_not_found -- 404, client fault, not retryable. The specified response does not exist.
  • capacity_exceeded -- 429, agent fault, retryable. Backend worker is at capacity. Retry after the suggested delay.
  • quota_exceeded -- 429, client fault, not retryable. Your account or endpoint has exceeded its usage quota. Upgrade your tier or wait for the quota to reset.
  • endpoint_inactive -- 503, agent fault, retryable. The endpoint is not currently active. It may be provisioning or disabled.
  • preempted -- 503, agent fault, retryable. Request was preempted by a higher-priority request. Retry immediately or after a short delay.
  • backend_unavailable -- 503, agent fault, retryable. Backend inference engine is unavailable (down, unreachable, or returning 5xx errors).
  • timeout -- 408, network fault, retryable. Request timed out before completion. May be a connect timeout, read timeout, or deadline exceeded.
  • invalid_state -- 409, client fault, not retryable. The resource is not in a valid state for the requested operation (e.g., cancelling an already completed response).
  • cancelled -- 499, client fault, not retryable. Request was cancelled by the client (connection closed) or by the router.
  • internal_error -- 500, agent fault, retryable. An unexpected internal error occurred. The router will attempt to retry on a different backend.

Distinguishing 429 Errors

Both capacity_exceeded and quota_exceeded return HTTP 429, but they have different meanings and retry behavior:

  • capacity_exceeded -- Transient. The backend is temporarily overloaded. Retry after the delay indicated in the response or the Retry-After header.
  • quota_exceeded -- Persistent. Your account has hit its usage limit. Do not retry -- you need to upgrade your plan or wait for the quota period to reset.

Check the code field in the error response to distinguish them.
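A minimal client-side check for this distinction might look like the following sketch (the error body shape follows the format above):

```python
def should_retry_429(error_body: dict) -> bool:
    """True only for the transient 429 variant (capacity_exceeded)."""
    return error_body.get("error", {}).get("code") == "capacity_exceeded"

should_retry_429({"error": {"code": "capacity_exceeded"}})  # transient: retry
should_retry_429({"error": {"code": "quota_exceeded"}})     # persistent: do not retry
```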

Fault Categories

Errors are classified into three fault categories, each with a different retry policy:

  • Client fault -- no retries. The request is invalid; fix the request before retrying.
  • Agent fault -- up to 3 retries, 1s initial backoff, 30s max backoff, 2x multiplier. Backend issue; the router retries on a different backend automatically.
  • Network fault -- up to 5 retries, 0.5s initial backoff, 60s max backoff, 2x multiplier. Network or timeout issue; the router retries with aggressive backoff.

The router performs internal retries for agent and network faults before returning an error to the client. If all retries are exhausted, the final error is returned to the client with the appropriate HTTP status code.
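The delay sequences implied by these policies can be reproduced with a small helper (a sketch; the router's actual timing and any jitter are internal):

```python
def backoff_schedule(retries, initial, cap, multiplier=2.0):
    """Delays (in seconds) the retry policy would use before each retry."""
    delays, d = [], initial
    for _ in range(retries):
        delays.append(min(d, cap))
        d *= multiplier
    return delays

# Agent faults: 3 retries starting at 1s, capped at 30s.
agent = backoff_schedule(3, 1.0, 30.0)
# Network faults: 5 retries starting at 0.5s, capped at 60s.
network = backoff_schedule(5, 0.5, 60.0)
```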

Inference-Specific Errors

The following scenarios are common during inference and have specific handling guidance:

Context Length Exceeded (400)

Your prompt exceeds the model's maximum context length. The error message includes details about the limit. Reduce your prompt length or use a model with a larger context window.

Max Tokens Invalid (400)

The max_tokens or max_completion_tokens value is invalid (negative, zero, or exceeds the model's limit). Adjust the value to be within the model's supported range.

Invalid Reasoning Effort (400)

The reasoning_effort parameter must be one of "low", "medium", or "high". Any other value (including uppercase variants like "LOW" or numeric strings like "1") returns a 400 error with type: "invalid_request_error" and param: "reasoning_effort". Omit the field entirely if you do not need reasoning effort control.
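A client-side pre-check can catch this before the request is sent. A sketch mirroring the rule (the server performs the authoritative validation):

```python
VALID_EFFORTS = {"low", "medium", "high"}

def check_reasoning_effort(value):
    """Mirror the server-side rule: exact lowercase strings only."""
    if value is not None and value not in VALID_EFFORTS:
        raise ValueError(f"invalid reasoning_effort: {value!r}")

check_reasoning_effort("medium")  # accepted
check_reasoning_effort(None)      # omitted field is fine
```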

Invalid Logprobs Configuration (400)

Two logprobs validation rules are enforced:

  • top_logprobs without logprobs -- Setting top_logprobs requires logprobs: true. If logprobs is omitted or set to false while top_logprobs is present, the request returns a 400 error with type: "invalid_request_error" and param: "top_logprobs".
  • top_logprobs out of range -- The top_logprobs value must be between 0 and 20 (inclusive). Values outside this range return a 400 error with type: "invalid_request_error" and param: "top_logprobs".
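Both rules can be mirrored client-side before sending the request. A sketch over the request parameters dict:

```python
def validate_logprobs(params: dict):
    """Mirror the two server-side logprobs rules (client-side sketch)."""
    top = params.get("top_logprobs")
    if top is None:
        return
    if not params.get("logprobs"):
        raise ValueError("top_logprobs requires logprobs: true")
    if not 0 <= top <= 20:
        raise ValueError("top_logprobs must be between 0 and 20")

validate_logprobs({"logprobs": True, "top_logprobs": 5})  # valid
```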

Model Not Loaded (404)

The model specified in your endpoint configuration is not currently loaded on any backend. This can happen if no backends are available for your tier or if the model has been removed. Check your endpoint configuration and backend status.

Preempted (503)

Your request was interrupted by a higher-priority request on the same backend. This only affects preemptable tiers (Free, GPU Shared). The request was not completed -- retry it. The router may include partial token counts in the error response via partialInputTokens and partialOutputTokens fields.
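When retrying a preempted request, the partial token counts (if present) can be logged for accounting. A sketch, assuming the fields sit inside the error object; adjust if your responses place them elsewhere:

```python
def partial_token_counts(error_body: dict):
    """Read partial token counts from a preemption error, defaulting to 0.

    Assumes partialInputTokens/partialOutputTokens appear inside the
    "error" object (an assumption; check your actual response shape).
    """
    err = error_body.get("error", {})
    return (
        err.get("partialInputTokens", 0),
        err.get("partialOutputTokens", 0),
    )

body = {"error": {"code": "preempted",
                  "partialInputTokens": 120,
                  "partialOutputTokens": 34}}
```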

Request Timeout (408)

The request did not complete within the tier's timeout limit. Timeouts vary by tier (30 seconds for Free, 300 seconds for CPU/GPU, 1800 seconds for XIM). Consider using streaming to avoid timeouts on long-running generation, or reduce the max_tokens value.

Streaming Errors

Errors during streaming behave differently depending on when they occur:

Pre-Stream Errors

If an error occurs before any tokens are generated (e.g., invalid request, model not found), you receive a standard HTTP error response with the appropriate status code. No SSE events are sent.

Mid-Stream Errors

If an error occurs after streaming has started (HTTP 200 has already been sent), the error is delivered as an SSE event:

SSE Error Event
event: error
data: {"error":{"message":"Backend connection lost","type":"server_error","code":"backend_unavailable"}}

After an error event, the stream is terminated with a data: [DONE] sentinel. The error types that can appear mid-stream are:

  • api_error -- Internal server error during generation.
  • timeout_error -- Request deadline exceeded during generation.
  • stream_idle_timeout -- No data received from backend within the idle timeout period.
  • cancelled -- Request was cancelled by the client or router.
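A stream consumer therefore has to watch for the error event as well as the [DONE] sentinel. A minimal sketch of the dispatch logic over already-split SSE lines (real clients should use a proper SSE parser):

```python
import json

def process_sse_lines(lines):
    """Walk SSE lines, collecting data payloads and surfacing mid-stream errors."""
    events, is_error = [], False
    for line in lines:
        if line.startswith("event: "):
            # An "event: error" line marks the next data payload as an error.
            is_error = line[len("event: "):].strip() == "error"
        elif line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            data = json.loads(payload)
            if is_error:
                raise RuntimeError(data["error"]["message"])
            events.append(data)
    return events
```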

Handling Partial Responses

When a mid-stream error occurs, you may have received partial content. Concatenate all delta.content values received before the error to get the partial response. Decide whether to use the partial content or retry the full request based on your application requirements.
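Concretely, accumulating the deltas as they arrive leaves you with whatever partial text was produced before the error. A sketch over already-parsed chat-completion chunk objects:

```python
def accumulate_content(chunks):
    """Join delta.content across chunks, skipping empty deltas."""
    parts = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)

chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [{"delta": {}}]},  # e.g. a role-only or final chunk
]
```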

Retry Guidance

The router handles most retries internally, but if the final response is an error, use these guidelines for client-side retries:

Retryable Errors

  • capacity_exceeded (429) -- Wait for the duration in the Retry-After header, then retry.
  • preempted (503) -- Retry immediately or after 1-2 seconds.
  • backend_unavailable (503) -- Wait 10-30 seconds, then retry. The Retry-After header provides a specific delay.
  • timeout (408) -- Retry after 5 seconds. Consider reducing prompt length or max_tokens.
  • internal_error (500) -- Retry after 10 seconds. If persistent, contact support.

Non-Retryable Errors

  • invalid_request (400) -- Fix the request. Check parameters, prompt length, and model name.
  • model_not_found (404) -- Verify the model is configured for your endpoint.
  • quota_exceeded (429) -- Upgrade your plan or wait for the quota period to reset.
  • cancelled (499) -- The client disconnected. No retry needed unless the disconnection was unintentional.

Exponential Backoff Example

Python
import time, random

def request_with_retry(make_request, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries + 1):
        response = make_request()
        if response.status_code == 200:
            return response
        if response.status_code not in (408, 429, 500, 503):
            raise Exception(f"Non-retryable error: {response.status_code}")
        if attempt == max_retries:
            break  # out of retries; do not sleep again
        # Honor Retry-After when the server provides it.
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            delay = float(retry_after)
        # Add up to 10% jitter to avoid synchronized retries.
        jitter = random.uniform(0, delay * 0.1)
        time.sleep(delay + jitter)
        delay = min(delay * 2, 60)  # exponential backoff, capped at 60s
    raise Exception("Max retries exhausted")

Node.js Exponential Backoff Example

Node.js
async function requestWithRetry(makeRequest, maxRetries = 3) {
  let delay = 1000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await makeRequest();
    if (response.ok) {
      return response;
    }
    if (![408, 429, 500, 503].includes(response.status)) {
      throw new Error(`Non-retryable error: ${response.status}`);
    }
    if (attempt === maxRetries) {
      break; // out of retries; do not sleep again
    }
    // Honor Retry-After when the server provides it.
    const retryAfter = response.headers.get("Retry-After");
    if (retryAfter) {
      delay = parseFloat(retryAfter) * 1000;
    }
    // Add up to 10% jitter to avoid synchronized retries.
    const jitter = Math.random() * delay * 0.1;
    await new Promise((r) => setTimeout(r, delay + jitter));
    delay = Math.min(delay * 2, 60000); // exponential backoff, capped at 60s
  }
  throw new Error("Max retries exhausted");
}

Troubleshooting

My request returns 400 Bad Request

  • Check that your request body is valid JSON.
  • Verify the messages array is present and non-empty.
  • Check that your prompt does not exceed the model's context length.
  • Verify that max_tokens / max_completion_tokens are positive integers within the model's limit.
  • Check that temperature is between 0.0 and 2.0.
  • If using reasoning_effort, verify the value is one of "low", "medium", or "high".

My request returns 401 Unauthorized

  • Verify your API key is correct and has not been revoked.
  • Check the Authorization header format: Bearer xero_...
  • Ensure the API key belongs to the correct project.

I am getting 429 Too Many Requests

  • Check the code field: capacity_exceeded is transient (retry), quota_exceeded is persistent (upgrade or wait).
  • Respect the Retry-After header.
  • Reduce your request rate or implement client-side rate limiting.
  • Consider upgrading to a higher tier for increased rate limits.

My responses are slow

  • Use streaming to reduce perceived latency.
  • Check if your prompts are optimized for prefix caching. See Prefix Caching.
  • Consider a GPU tier for latency-sensitive workloads.
  • Reduce max_tokens if you do not need long responses.

My streaming connection drops

  • Check your client's read timeout -- it must exceed the tier's idle stream timeout (120s for Free, 600s for CPU/GPU, 3600s for XIM).
  • The server sends heartbeat comments every 15 seconds to keep the connection alive. If you are behind a proxy, ensure it does not strip SSE events or impose its own timeout shorter than the idle timeout.
  • If you receive an event: error, the stream has been terminated by the server. Check the error type for the cause.
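One way to apply the first point is to derive the client read timeout from the tier's idle stream timeout plus a safety margin. A sketch (the tier keys here are illustrative labels, not API values):

```python
# Idle stream timeouts by tier, in seconds, from the guidance above.
IDLE_TIMEOUTS = {"free": 120, "cpu": 600, "gpu": 600, "xim": 3600}

def client_read_timeout(tier: str, margin: float = 30.0) -> float:
    """Pick a client read timeout safely above the tier's idle timeout."""
    return IDLE_TIMEOUTS[tier] + margin

client_read_timeout("free")  # pass this as your HTTP client's read timeout
```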

When to Contact Support

Contact support if:

  • You receive persistent internal_error (500) responses that do not resolve with retries.
  • You see backend_unavailable (503) errors consistently for more than 5 minutes.
  • Your usage metrics do not match your billing.
  • You suspect unauthorized access to your account or API keys.

When contacting support, include: your project ID, the endpoint name, the request ID from the X-Request-ID response header, the error response body, and the approximate time of the issue.