# Error Handling
Error codes, fault categories, retry policies, and troubleshooting guidance.
## Error Response Format
All error responses follow the OpenAI-compatible JSON format:
```json
{
  "error": {
    "message": "The model 'nonexistent-model' does not exist.",
    "type": "not_found_error",
    "code": "model_not_found",
    "param": null
  }
}
```
### Error Object Fields

| Field | Type | Description |
|---|---|---|
| `message` | string | Human-readable error description. Sensitive information (file paths, IPs, tokens, UUIDs) is automatically sanitized. |
| `type` | string | Error category (e.g., `invalid_request_error`, `not_found_error`, `server_error`). |
| `code` | string \| null | Machine-readable error code. See Error Codes below. |
| `param` | string \| null | The request parameter that caused the error, if applicable. |
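For clients that branch on these fields, a minimal sketch of unpacking them from a raw response body (the `parse_error` helper name is illustrative, not part of any SDK):

```python
import json

def parse_error(body: str) -> dict:
    """Unpack the error object fields from an API error response body."""
    err = json.loads(body).get("error", {})
    return {k: err.get(k) for k in ("message", "type", "code", "param")}

body = (
    '{"error": {"message": "The model \'nonexistent-model\' does not exist.",'
    ' "type": "not_found_error", "code": "model_not_found", "param": null}}'
)
parsed = parse_error(body)
print(parsed["code"])   # model_not_found
print(parsed["param"])  # None
```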
## Error Codes
The following error codes can appear in inference API responses. Each code maps to a specific HTTP status, fault category, and retry policy.
| Code | HTTP | Fault | Retryable | Description |
|---|---|---|---|---|
| `invalid_request` | 400 | Client | No | Request parameters are invalid (bad JSON, context length exceeded, invalid `max_tokens`). |
| `json_parse_error` | 400 | Client | No | The request body could not be parsed as valid JSON. |
| `authentication_error` | 401 | Client | No | Invalid or missing API key. Check your `Authorization` header. |
| `model_not_found` | 404 | Client | No | The requested model is not found or not loaded on any backend. |
| `project_not_found` | 404 | Client | No | The specified project does not exist or you do not have access. |
| `endpoint_not_found` | 404 | Client | No | The specified endpoint does not exist within this project. |
| `completion_not_found` | 404 | Client | No | The specified stored completion does not exist. |
| `response_not_found` | 404 | Client | No | The specified response does not exist. |
| `capacity_exceeded` | 429 | Agent | Yes | Backend worker is at capacity. Retry after the suggested delay. |
| `quota_exceeded` | 429 | Client | No | Your account or endpoint has exceeded its usage quota. Upgrade your tier or wait for the quota to reset. |
| `endpoint_inactive` | 503 | Agent | Yes | The endpoint is not currently active. It may be provisioning or disabled. |
| `preempted` | 503 | Agent | Yes | Request was preempted by a higher-priority request. Retry immediately or after a short delay. |
| `backend_unavailable` | 503 | Agent | Yes | Backend inference engine is unavailable (down, unreachable, or returning 5xx errors). |
| `timeout` | 408 | Network | Yes | Request timed out before completion. May be a connect timeout, read timeout, or deadline exceeded. |
| `invalid_state` | 409 | Client | No | The resource is not in a valid state for the requested operation (e.g., cancelling an already completed response). |
| `cancelled` | 499 | Client | No | Request was cancelled by the client (connection closed) or by the router. |
| `internal_error` | 500 | Agent | Yes | An unexpected internal error occurred. The router will attempt to retry on a different backend. |
### Distinguishing 429 Errors

Both `capacity_exceeded` and `quota_exceeded` return HTTP 429, but they have different meanings and retry behavior:

- `capacity_exceeded` -- Transient. The backend is temporarily overloaded. Retry after the delay indicated in the response or the `Retry-After` header.
- `quota_exceeded` -- Persistent. Your account has hit its usage limit. Do not retry -- you need to upgrade your plan or wait for the quota period to reset.

Check the `code` field in the error response to distinguish them.
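In client code, that check can be as simple as the following sketch (the error dict shape follows the response format above; the helper name is illustrative):

```python
def should_retry_429(error: dict) -> bool:
    """Retry only the transient 429 (capacity_exceeded), never quota_exceeded."""
    return error.get("code") == "capacity_exceeded"

print(should_retry_429({"code": "capacity_exceeded"}))  # True
print(should_retry_429({"code": "quota_exceeded"}))     # False
```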
## Fault Categories
Errors are classified into three fault categories, each with a different retry policy:
| Category | Max Retries | Initial Backoff | Max Backoff | Multiplier | Description |
|---|---|---|---|---|---|
| Client Fault | 0 | -- | -- | -- | The request is invalid. Fix the request before retrying. |
| Agent Fault | 3 | 1s | 30s | 2x | Backend issue. The router retries on a different backend automatically. |
| Network Fault | 5 | 0.5s | 60s | 2x | Network or timeout issue. The router retries with aggressive backoff. |
The router performs internal retries for agent and network faults before returning an error to the client. If all retries are exhausted, the final error is returned to the client with the appropriate HTTP status code.
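With the table's parameters, the router's internal delay sequence before each retry works out as follows (a sketch of the stated policy, not the router's actual implementation):

```python
def backoff_delays(initial: float, maximum: float, multiplier: float, max_retries: int) -> list:
    """Delay before each retry: exponential growth capped at the maximum."""
    delays, delay = [], initial
    for _ in range(max_retries):
        delays.append(min(delay, maximum))
        delay *= multiplier
    return delays

print(backoff_delays(1.0, 30.0, 2.0, 3))  # agent faults: [1.0, 2.0, 4.0]
print(backoff_delays(0.5, 60.0, 2.0, 5))  # network faults: [0.5, 1.0, 2.0, 4.0, 8.0]
```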
## Inference-Specific Errors
The following scenarios are common during inference and have specific handling guidance:
### Context Length Exceeded (400)
Your prompt exceeds the model's maximum context length. The error message includes details about the limit. Reduce your prompt length or use a model with a larger context window.
### Max Tokens Invalid (400)

The `max_tokens` or `max_completion_tokens` value is invalid (negative, zero, or exceeds the model's limit). Adjust the value to be within the model's supported range.
### Invalid Reasoning Effort (400)

The `reasoning_effort` parameter must be one of `"low"`, `"medium"`, or `"high"`. Any other value (including uppercase variants like `"LOW"` or numeric strings like `"1"`) returns a 400 error with `type: "invalid_request_error"` and `param: "reasoning_effort"`. Omit the field entirely if you do not need reasoning effort control.
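A client-side precheck that mirrors this rule (the helper is hypothetical, not part of any SDK):

```python
VALID_REASONING_EFFORT = {"low", "medium", "high"}

def check_reasoning_effort(value) -> None:
    """Raise before sending a request the server would reject with a 400."""
    if value is not None and value not in VALID_REASONING_EFFORT:
        raise ValueError(
            f'reasoning_effort must be "low", "medium", or "high"; got {value!r}'
        )

check_reasoning_effort("medium")  # passes
check_reasoning_effort(None)     # passes: field omitted
```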
### Invalid Logprobs Configuration (400)

Two logprobs validation rules are enforced:

- `top_logprobs` without `logprobs` -- Setting `top_logprobs` requires `logprobs: true`. If `logprobs` is omitted or set to `false` while `top_logprobs` is present, the request returns a 400 error with `type: "invalid_request_error"` and `param: "top_logprobs"`.
- `top_logprobs` out of range -- The `top_logprobs` value must be between 0 and 20 (inclusive). Values outside this range return a 400 error with `type: "invalid_request_error"` and `param: "top_logprobs"`.
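Both rules can be checked before sending the request (a hypothetical helper mirroring the server's validation):

```python
def check_logprobs(logprobs=None, top_logprobs=None) -> None:
    """Raise on requests the server would reject with param: "top_logprobs"."""
    if top_logprobs is None:
        return
    if not logprobs:
        raise ValueError("top_logprobs requires logprobs: true")
    if not 0 <= top_logprobs <= 20:
        raise ValueError("top_logprobs must be between 0 and 20 (inclusive)")

check_logprobs(logprobs=True, top_logprobs=5)  # passes
check_logprobs()                               # passes: neither field set
```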
### Model Not Loaded (404)
The model specified in your endpoint configuration is not currently loaded on any backend. This can happen if no backends are available for your tier or if the model has been removed. Check your endpoint configuration and backend status.
### Preempted (503)

Your request was interrupted by a higher-priority request on the same backend. This only affects preemptable tiers (Free, GPU Shared). The request was not completed -- retry it. The router may include partial token counts in the error response via `partialInputTokens` and `partialOutputTokens` fields.
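To account for tokens consumed by a preempted request, those counts can be read defensively (a sketch; it assumes the fields sit inside the error object and falls back to zero when the router omits them):

```python
def partial_token_counts(error_body: dict) -> tuple:
    """Return (input, output) partial token counts from a preemption error, defaulting to 0."""
    err = error_body.get("error", {})
    return (err.get("partialInputTokens", 0), err.get("partialOutputTokens", 0))

body = {"error": {"code": "preempted", "partialInputTokens": 128, "partialOutputTokens": 17}}
print(partial_token_counts(body))  # (128, 17)
print(partial_token_counts({}))   # (0, 0)
```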
### Request Timeout (408)

The request did not complete within the tier's timeout limit. Timeouts vary by tier (30 seconds for Free, 300 seconds for CPU/GPU, 1800 seconds for XIM). Consider using streaming to avoid timeouts on long-running generation, or reduce the `max_tokens` value.
## Streaming Errors
Errors during streaming behave differently depending on when they occur:
### Pre-Stream Errors
If an error occurs before any tokens are generated (e.g., invalid request, model not found), you receive a standard HTTP error response with the appropriate status code. No SSE events are sent.
### Mid-Stream Errors

If an error occurs after streaming has started (HTTP 200 has already been sent), the error is delivered as an SSE event:

```
event: error
data: {"error":{"message":"Backend connection lost","type":"server_error","code":"backend_unavailable"}}
```
After an error event, the stream is terminated with a `data: [DONE]` sentinel. The error types that can appear mid-stream are:
| Error Type | Cause |
|---|---|
| `api_error` | Internal server error during generation. |
| `timeout_error` | Request deadline exceeded during generation. |
| `stream_idle_timeout` | No data received from backend within the idle timeout period. |
| `cancelled` | Request was cancelled by the client or router. |
### Handling Partial Responses

When a mid-stream error occurs, you may have received partial content. Concatenate all `delta.content` values received before the error to get the partial response. Decide whether to use the partial content or retry the full request based on your application requirements.
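The accumulation step can be sketched as follows, assuming the SSE stream has already been parsed into `(event, data)` pairs and the chunks follow the chat-completion delta format:

```python
import json

def accumulate_stream(events):
    """Join delta.content from streamed chunks; stop at [DONE], capture any error event."""
    parts, error = [], None
    for name, data in events:
        if data == "[DONE]":
            break
        if name == "error":
            error = json.loads(data).get("error")
            continue  # the [DONE] sentinel still follows an error event
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), error

events = [
    ("message", '{"choices":[{"delta":{"content":"Hel"}}]}'),
    ("message", '{"choices":[{"delta":{"content":"lo"}}]}'),
    ("error", '{"error":{"code":"backend_unavailable"}}'),
    ("message", "[DONE]"),
]
print(accumulate_stream(events))  # ('Hello', {'code': 'backend_unavailable'})
```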
## Retry Guidance
The router handles most retries internally, but if the final response is an error, use these guidelines for client-side retries:
### Retryable Errors

- `capacity_exceeded` (429) -- Wait for the duration in the `Retry-After` header, then retry.
- `preempted` (503) -- Retry immediately or after 1-2 seconds.
- `backend_unavailable` (503) -- Wait 10-30 seconds, then retry. The `Retry-After` header provides a specific delay.
- `timeout` (408) -- Retry after 5 seconds. Consider reducing prompt length or `max_tokens`.
- `internal_error` (500) -- Retry after 10 seconds. If persistent, contact support.
### Non-Retryable Errors

- `invalid_request` (400) -- Fix the request. Check parameters, prompt length, and model name.
- `model_not_found` (404) -- Verify the model is configured for your endpoint.
- `quota_exceeded` (429) -- Upgrade your plan or wait for the quota period to reset.
- `cancelled` (499) -- The client disconnected. No retry needed unless the disconnection was unintentional.
### Exponential Backoff Example

```python
import random
import time

def request_with_retry(make_request, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries + 1):
        response = make_request()
        if response.status_code == 200:
            return response
        if response.status_code in (408, 429, 500, 503):
            # Honor the server-suggested delay when present.
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                delay = float(retry_after)
            # Add up to 10% jitter to avoid synchronized retries.
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
            delay = min(delay * 2, 60)
        else:
            raise Exception(f"Non-retryable error: {response.status_code}")
    raise Exception("Max retries exhausted")
```
### Node.js Exponential Backoff Example

```javascript
async function requestWithRetry(makeRequest, maxRetries = 3) {
  let delay = 1000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await makeRequest();
    if (response.ok) {
      return response;
    }
    if ([408, 429, 500, 503].includes(response.status)) {
      // Honor the server-suggested delay when present.
      const retryAfter = response.headers.get("Retry-After");
      if (retryAfter) {
        delay = parseFloat(retryAfter) * 1000;
      }
      // Add up to 10% jitter to avoid synchronized retries.
      const jitter = Math.random() * delay * 0.1;
      await new Promise(r => setTimeout(r, delay + jitter));
      delay = Math.min(delay * 2, 60000);
    } else {
      throw new Error(`Non-retryable error: ${response.status}`);
    }
  }
  throw new Error("Max retries exhausted");
}
```
## Troubleshooting
### My request returns 400 Bad Request

- Check that your request body is valid JSON.
- Verify the `messages` array is present and non-empty.
- Check that your prompt does not exceed the model's context length.
- Verify that `max_tokens` / `max_completion_tokens` are positive integers within the model's limit.
- Check that `temperature` is between 0.0 and 2.0.
- If using `reasoning_effort`, verify the value is one of `"low"`, `"medium"`, or `"high"`.
### My request returns 401 Unauthorized

- Verify your API key is correct and has not been revoked.
- Check the `Authorization` header format: `Bearer xero_...`
- Ensure the API key belongs to the correct project.
### I am getting 429 Too Many Requests

- Check the `code` field: `capacity_exceeded` is transient (retry), `quota_exceeded` is persistent (upgrade or wait).
- Respect the `Retry-After` header.
- Reduce your request rate or implement client-side rate limiting.
- Consider upgrading to a higher tier for increased rate limits.
### My responses are slow

- Use streaming to reduce perceived latency.
- Check if your prompts are optimized for prefix caching. See Prefix Caching.
- Consider a GPU tier for latency-sensitive workloads.
- Reduce `max_tokens` if you do not need long responses.
### My streaming connection drops

- Check your client's read timeout -- it must exceed the tier's idle stream timeout (120s for Free, 600s for CPU/GPU, 3600s for XIM).
- The server sends heartbeat comments every 15 seconds to keep the connection alive. If you are behind a proxy, ensure it does not strip SSE events or impose its own timeout shorter than the idle timeout.
- If you receive an `event: error`, the stream has been terminated by the server. Check the error type for the cause.
## When to Contact Support

Contact support if:

- You receive persistent `internal_error` (500) responses that do not resolve with retries.
- You see `backend_unavailable` (503) errors consistently for more than 5 minutes.
- Your usage metrics do not match your billing.
- You suspect unauthorized access to your account or API keys.
When contacting support, include: your project ID, the endpoint name, the request ID from the `X-Request-ID` response header, the error response body, and the approximate time of the issue.