// API Reference

Streaming API

Server-Sent Events from token zero. Stream completions, tool calls, and reasoning channels to a browser or an SDK without polling, without a websocket library, and without holding a connection open in your worker pool.

SSE Format

When stream: true is set in the request, the response is delivered as a stream of Server-Sent Events. The connection uses the following HTTP headers:

HTTP Headers
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

Wire Format

Each event is a line prefixed with data: followed by a JSON object and terminated by two newlines:

SSE Event
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...}\n\n

The stream ends with a [DONE] sentinel:

SSE Terminator
data: [DONE]\n\n

Clients should parse each data: line, check for [DONE], and decode the JSON for all other lines.
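The parsing loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the platform's SDK: the helper name iter_sse_payloads is hypothetical, and it assumes the caller already has an iterable of decoded text lines (for example from an HTTP client's line iterator).

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the decoded JSON object from each data: line until [DONE].

    Hypothetical helper: `lines` is assumed to be an iterable of already
    decoded text lines, with or without trailing newlines.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators and any non-data lines are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # sentinel: the stream is finished
        yield json.loads(payload)
```

Stopping at the sentinel instead of waiting for the socket to close lets the caller release the connection as soon as the server signals completion.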

Chunk Structure

Each streaming chunk follows this schema:

Field Type Description
id string Request identifier (same across all chunks).
object string Always "chat.completion.chunk".
created integer Unix timestamp.
model string Model identifier.
service_tier string | null Service tier. Always present, may be null.
system_fingerprint string | null System fingerprint. Always present, may be null.
choices array Array of choice objects containing delta content, finish_reason, and optional logprobs (when logprobs: true is set in the request).
usage object | null Token usage. Present only in the final chunk when stream_options.include_usage is true.

Delta Object

The delta field in each choice contains the incremental content:

Field Type Description
role string | null Present only in the first chunk ("assistant").
content string | null New text tokens. Null when no text is generated (e.g., tool calls).
tool_calls array | null Tool call deltas. See Tool Call Streaming.
refusal string | null Content filter refusal message (streamed incrementally).

Reasoning content: the chat-completions delta object does not carry a reasoning_content field. Internal reasoning produced by thinking models is dropped from this stream. To observe incremental reasoning, use the Responses API streaming surface (response.reasoning_summary_text.delta events) instead.

finish_reason Values

Value Description
null Generation still in progress.
"stop" Model completed naturally or hit a stop sequence.
"length" Maximum token limit reached.
"content_filter" Content was filtered.
"tool_calls" Model generated tool calls.

Per-Chunk Log Probabilities

When logprobs: true is set in the request, each streaming choice includes a logprobs object with per-token log probabilities for the tokens in that chunk. The structure mirrors the non-streaming logprobs format:

Field Type Description
logprobs.content array | null Token log probabilities for content tokens in this chunk.
logprobs.refusal array | null Token log probabilities for refusal tokens in this chunk. Present when the model refuses to comply with a request.

Each entry in the content and refusal arrays contains token, logprob, bytes, and top_logprobs fields, identical to the non-streaming logprobs format. When logprobs is not requested, the field is null in all chunks.
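As a rough sketch of how a client might gather these per-chunk entries, the function below walks decoded chunks and collects (token, logprob) pairs for the first choice. The function name collect_token_logprobs is hypothetical, and the chunks are assumed to be plain dicts already decoded from the data: lines.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_token_logprobs(
    chunks: Iterable[Dict[str, Any]],
) -> List[Tuple[str, float]]:
    """Collect (token, logprob) pairs from logprobs.content of choice 0.

    Hypothetical helper; chunks without a logprobs object (null) are skipped.
    """
    pairs: List[Tuple[str, float]] = []
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue
            logprobs = choice.get("logprobs") or {}
            for entry in logprobs.get("content") or []:
                pairs.append((entry["token"], entry["logprob"]))
    return pairs
```

Summing the second element of each pair gives the log probability of the streamed content as a whole; the same loop over logprobs.refusal would cover refusal tokens.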

Refusal Streaming

When the model refuses a request due to content policy or safety filters, the refusal text is delivered incrementally via delta.refusal instead of delta.content. The finish_reason is typically "stop" and content will be null in the final response.

SSE Stream (Refusal)
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Refusal chunks (delta.refusal instead of delta.content)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":"I'm sorry, but I"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":" cannot help with that request."},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Clients should concatenate delta.refusal strings across chunks just like delta.content. When logprobs: true is set, refusal token probabilities appear in logprobs.refusal on each chunk.
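The concatenation rule can be illustrated with a small accumulator that keeps content and refusal text apart. This is a sketch under stated assumptions: collect_text is a hypothetical name, chunks are assumed to be dicts decoded from the stream, and only the choice with index 0 is tracked.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def collect_text(
    chunks: Iterable[Dict[str, Any]],
) -> Tuple[str, str, Optional[str]]:
    """Return (content, refusal, finish_reason) for choice index 0.

    Hypothetical helper: delta.content and delta.refusal are concatenated
    separately so the caller can tell an answer from a refusal.
    """
    content_parts: List[str] = []
    refusal_parts: List[str] = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            if delta.get("refusal"):
                refusal_parts.append(delta["refusal"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(content_parts), "".join(refusal_parts), finish_reason
```

A non-empty second element tells the caller the model refused, even though finish_reason is typically "stop" in both cases.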

Annotated Stream Example

SSE Stream
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Content chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" of France is Paris."},"finish_reason":null}]}

# Final chunk: finish_reason set, usage included (requires stream_options.include_usage: true)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":null},"completion_tokens_details":{"reasoning_tokens":null,"audio_tokens":null,"accepted_prediction_tokens":null,"rejected_prediction_tokens":null}}}

# Stream terminator
data: [DONE]

Usage Reporting

To receive token usage data in a streaming response, set stream_options in your request:

JSON
{
  "stream": true,
  "stream_options": {"include_usage": true}
}

When enabled, the final chunk (the one with finish_reason set) includes a usage object:

JSON
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 12,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  }
}

The prompt_tokens_details.cached_tokens field shows how many prompt tokens were served from the prefix cache, reducing time-to-first-token.

The completion_tokens_details object provides a breakdown of output tokens. For reasoning models (when reasoning_effort is set), reasoning_tokens shows how many tokens were used for internal reasoning. When prediction is used, accepted_prediction_tokens and rejected_prediction_tokens indicate how effective the prediction was.

Without stream_options.include_usage, the usage field is null in all chunks. Token usage is still tracked internally for billing.
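One way a client might pull the usage object out of a stream is sketched below. The function name summarize_usage and the derived uncached_prompt_tokens key are hypothetical conveniences, not fields of the API; chunks are assumed to be decoded dicts.

```python
from typing import Any, Dict, Iterable, Optional


def summarize_usage(chunks: Iterable[Dict[str, Any]]) -> Optional[Dict[str, int]]:
    """Return a small summary from the last chunk that carries usage.

    Hypothetical helper: returns None when no chunk has a usage object,
    which is the case when stream_options.include_usage was not set.
    """
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):  # missing or null on chunks without usage
            usage = chunk["usage"]
    if usage is None:
        return None
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return {
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
        "cached_prompt_tokens": cached,
        # derived value, not an API field: prompt tokens not served from cache
        "uncached_prompt_tokens": usage["prompt_tokens"] - cached,
    }
```

With the sample usage object shown earlier in this section (25 prompt tokens, 12 of them cached), the summary reports 13 uncached prompt tokens.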

Tool Call Streaming

When the model generates tool calls, the delta.tool_calls field contains incremental data. Tool calls are streamed across multiple chunks:

  • The first chunk for a tool call includes id, type, and the function name.
  • Subsequent chunks carry only a fragment of the arguments string, which the client appends to what it has received so far.
  • The index field identifies which tool call the delta belongs to (for parallel tool calls).

Tool Call Stream Example

SSE Stream
# First chunk: tool call start
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

# Arguments streamed incrementally
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Paris\"}"}}]},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}

data: [DONE]

Clients should concatenate the arguments strings across chunks and parse the complete JSON after finish_reason: "tool_calls".
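The accumulation described above might look like the following sketch. The name assemble_tool_calls is hypothetical, chunks are assumed to be decoded dicts, only choice index 0 is handled, and treating an empty arguments string as an empty JSON object is an assumption made for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def assemble_tool_calls(chunks: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge delta.tool_calls fragments for choice 0 into complete calls.

    Hypothetical helper: fragments are grouped by their index field and the
    arguments JSON is parsed only after finish_reason "tool_calls" is seen.
    """
    calls: Dict[int, Dict[str, Any]] = {}
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue
            delta = choice.get("delta") or {}
            for part in delta.get("tool_calls") or []:
                call = calls.setdefault(
                    part["index"], {"id": None, "name": None, "arguments": ""}
                )
                if part.get("id"):
                    call["id"] = part["id"]
                function = part.get("function") or {}
                if function.get("name"):
                    call["name"] = function["name"]
                call["arguments"] += function.get("arguments") or ""
            if choice.get("finish_reason") == "tool_calls":
                finished = True
    if not finished:
        raise ValueError("stream ended without finish_reason 'tool_calls'")
    result = []
    for index in sorted(calls):
        call = calls[index]
        # Assumption: an empty arguments string means "no arguments".
        parsed = json.loads(call["arguments"] or "{}")
        result.append({"id": call["id"], "name": call["name"], "arguments": parsed})
    return result
```

Keying the partial calls by index rather than by id matters because only the first chunk of each call carries the id; later fragments identify their call by index alone.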

Vendor Events

The platform emits vendor-prefixed events (x_*) as inline data: lines within the Chat Completions stream. These events are not standard OpenAI SSE events; they carry a JSON object with a type field that starts with x_. Standard OpenAI SDK clients silently discard unrecognized data lines, so these events are backward compatible.

Vendor events are emitted by the platform when platform-specific features are active (research mode, deep think, analyst mode, artifacts, and interactive user prompts). They flow inline between regular chat.completion.chunk data lines. Custom consumers should inspect the type field of each data line and handle both OpenAI-standard chunk objects and x_* vendor payloads.

Wire Format

Vendor Event Wire Format
data: {"type":"x_research.searching","name":"web_search","arguments":"{\"query\":\"...\"}"}
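A custom consumer could separate vendor payloads from standard chunks along these lines. Both function names are hypothetical, and grouping events by the text before the first dot is an assumption based on the naming pattern of the event tables in this section.

```python
from typing import Any, Dict, Tuple


def classify_payload(payload: Dict[str, Any]) -> Tuple[str, str]:
    """Return (kind, name) for one decoded data: payload.

    kind is "vendor", "chunk", or "unknown". Hypothetical helper.
    """
    event_type = payload.get("type")
    if isinstance(event_type, str) and event_type.startswith("x_"):
        return "vendor", event_type
    if payload.get("object") == "chat.completion.chunk":
        return "chunk", "chat.completion.chunk"
    return "unknown", str(payload.get("object") or event_type or "")


def vendor_family(event_type: str) -> str:
    """Return the prefix before the first dot, e.g. "x_research".

    Events without a dot (such as x_context_fork) are their own family.
    """
    return event_type.split(".", 1)[0]
```

Routing on the family first keeps a consumer tolerant of event types added later: an unrecognized x_research.* event can still be logged or shown as generic research activity instead of raising.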

Research Events (x_research.*)

Emitted during agentic research loops (research mode).

Note: x_research.complete is a dashboard-only consumer signal and is not emitted by the public API surface. Do not rely on it from SDK clients; use the final chat.completion.chunk with finish_reason set instead.

Type Description
x_research.searching Web search tool invoked. Fields: name, arguments (JSON string with query).
x_research.reading URL fetch tool invoked. Fields: name, arguments (JSON string with url).
x_research.code_searching Code search tool invoked. Fields: name, arguments.
x_research.calculating Calculator tool invoked. Fields: name, arguments.
x_research.result Tool returned a result. Fields: name, arguments, metadata (object with summary).
x_research.gap_analysis Loop is identifying gaps before a deepening pass.
x_research.deepening_round A deepening iteration has begun.
x_research.context_compacted Conversation context was compacted to free input budget.
x_research.tracking_decision Loop is recording a decision into intelligence storage.
x_research.querying_decisions Loop is querying recorded decisions.
x_research.tracking_milestone Loop is recording a milestone.
x_research.querying_timeline Loop is reading the timeline of tracked events.
x_research.briefing Intelligence briefing is being prepared.
x_research.relating Loop is computing entity/document relationships.
x_research.creating_mockup Loop is dispatching create_mockup.
x_research.updating_mockup Loop is dispatching update_mockup.
x_research.tool_call Generic tool invocation notification (covers tools not modeled above).

Deep Think Events (x_deep_think.*)

Emitted during deep think (multi-step research with planning and synthesis).

Note: x_deep_think.completed, x_deep_think.error, x_deep_think.artifact_created, and x_deep_think.memories_created are declared in the schema but are not currently emitted by the router. x_deep_think.artifact_saved and x_deep_think.subtask_artifact_saved are emitted only by the dashboard frontend and are not visible to public API consumers.

Type Description
x_deep_think.planning_started Planner has begun building the deep-think plan.
x_deep_think.plan_created Planning phase complete. Fields: title, total_subtasks.
x_deep_think.plan_critiqued Planner critique pass finished.
x_deep_think.discovery_started Target-focused discovery phase begun. Fields: message.
x_deep_think.discovery_completed Discovery phase complete. Fields: message.
x_deep_think.subtask_planned A sub-task plan has been generated.
x_deep_think.subtask_started Sub-task begun. Fields: subtask_id, subtask_index, subtask_query, total_subtasks.
x_deep_think.subtask_retried Sub-task retried after a transient failure.
x_deep_think.subtask_completed Sub-task finished. Fields: subtask_index, input_tokens, output_tokens.
x_deep_think.deepening_round_completed A deepening round across sub-tasks finished.
x_deep_think.cross_reference Cross-referencing between sub-task findings.
x_deep_think.synthesizing Synthesis phase begun.
x_deep_think.structured_synthesis Structured synthesis output produced.
x_deep_think.claim_emitted A discrete synthesis claim was emitted.
x_deep_think.synthesis_failed Synthesis pass failed.
x_deep_think.memory_extracted A workspace memory candidate was extracted from synthesis.

Artifact Events (x_artifact.*)

Emitted when code artifacts are created or updated during generation.

Type Description
x_artifact.created New artifact created. Fields: artifact_id, identifier, title, language, content_type, content.
x_artifact.updated Existing artifact updated. Same fields as x_artifact.created.

Mockup Events (x_mockup.*)

Emitted when an agentic-mode response creates or updates a multi-file mockup bundle (see create_mockup and update_mockup). Bundle files are reachable from the preview iframe at GET /v1/mockups/{bundleId}/{path}.

Type Description
x_mockup.calling Placeholder fired before parse/persist begins so the UI can show a card immediately. Fields: identifier, title (when known).
x_mockup.created Bundle creation succeeded. Fired once after create_mockup. Fields: bundleId, identifier, title, entry, files (array of {path, contentType, size}).
x_mockup.updated Bundle update succeeded. Fired once after update_mockup. Fields: bundleId, identifier, title, entry, changed (array of paths added or replaced), deleted (array of paths removed).
x_mockup.error Validation, storage, or partial-write failure. Fields: bundleId (optional, present when the bundle exists), code, message, successfulPaths (optional, paths persisted before the failure), failedPath (optional, the path that triggered the failure).

Ask User Events (x_ask_user.*)

Emitted when the model needs clarification before it can continue.

Type Description
x_ask_user.question Model needs user input. Fields: askUserId (camelCase correlation ID to pass when resuming), question (text to show the user), options (optional array of selectable choices), allowFreeText (boolean), multiSelect (boolean), style (string presentation hint), fields (optional structured form fields), toolCallId (originating tool call identifier).
x_ask_user.pending_state Captures the assistant content and tool calls accumulated before the pause, for resumption after the user responds. Fields: assistantContent, toolCalls.

Context Fork Event (x_context_fork)

Emitted when the user's message triggered creation of a new conversation branch. Fields: branch_id, branch_name, message_count.

Chat Metadata Event (x_chat.metadata)

Dashboard-only event. x_chat.metadata is emitted by the dashboard frontend controller, not by the public API surface. SDK clients calling POST /:project_id/:endpoint_slug/v1/chat/completions directly will not observe this event.

Emitted as the final data event before [DONE] on dashboard chat streams. Contains server-side persistence identifiers, context budget breakdown, and combined token usage (including any research or deep think overhead).

x_chat.metadata example
data: {
  "type": "x_chat.metadata",
  "messageId": "msg_ext_abc123",
  "userMessageId": "msg_ext_xyz789",
  "sequence": 4,
  "context": {
    "systemTokens": 512,
    "summaryTokens": 0,
    "retrievedTokens": 1024,
    "recentTokens": 2048,
    "fileTokens": 0,
    "currentMessageTokens": 64,
    "totalTokens": 3648,
    "inputBudget": 8192,
    "retrievedCount": 6,
    "recentCount": 12,
    "usedSemanticRetrieval": true,
    "semanticRetrievalActive": true,
    "chunkSelectionMethod": "semantic"
  },
  "usage": {
    "input_tokens": 3648,
    "output_tokens": 256,
    "total_tokens": 3904
  }
}

Analyst Events (x_analyst.*)

Emitted when analyst mode builds or refreshes the workspace context brief.

Type Description
x_analyst.context_gathering Workspace context gathering begun.
x_analyst.context_completed Gathering finished. Contains item counts.
x_analyst.context_brief_created LLM-generated context brief is ready.
x_analyst.context_refreshed Cached context brief refreshed due to workspace changes.

Error Handling

Pre-Stream Errors

Errors that occur before the SSE connection is established return standard HTTP status codes (400, 401, 404, 429, 503). The response body is a JSON error object, not an SSE stream.

Mid-Stream Errors

Errors that occur during an active stream are delivered as a named SSE event: an event: error line followed by a data: line that carries the JSON error object. Regular chunks have no event: line. The [DONE] sentinel is always sent after the error:

SSE Error Event
event: error
data: {"error":{"message":"Request timed out after 30s. Your Free tier has a 30-second timeout limit.","type":"timeout_error","code":"timeout"}}

data: [DONE]

Error Types

Type Code Description
server_error varies Backend agent reported an error during generation. The code field contains the specific error code (e.g., internal_error, backend_unavailable, preempted).
timeout_error timeout Request exceeded the tier's deadline timeout.
stream_idle_timeout stream_idle_timeout No chunks received for the idle timeout period. See tier timeouts.
(none) cancelled Request was cancelled (client disconnect or server cancellation).
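A client that reads raw lines needs to remember the event name to tell an error frame from a regular chunk. The sketch below shows one way to do that; StreamError and iter_chunks_or_raise are hypothetical names, and the input is assumed to be an iterable of decoded text lines.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


class StreamError(Exception):
    """Hypothetical exception raised when an event: error frame arrives."""

    def __init__(self, error: Dict[str, Any]) -> None:
        super().__init__(error.get("message", "stream error"))
        self.type: Optional[str] = error.get("type")
        self.code: Optional[str] = error.get("code")


def iter_chunks_or_raise(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield data payloads; raise StreamError on a named error event."""
    event_name: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            event_name = None  # a blank line ends the current event
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return
            body = json.loads(payload)
            if event_name == "error":
                raise StreamError(body.get("error") or {})
            yield body
```

Callers can branch on the code attribute (for example timeout or cancelled) to decide whether a retry makes sense; any content yielded before the exception has already been delivered and should be kept or discarded deliberately.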

Client Disconnect

When a client disconnects during a stream, the router detects the broken connection and sends a cancellation request to the backend agent. The agent stops generation to free resources. Any tokens generated before disconnection are still billed.

Heartbeats

During idle periods (no chunks for 15 seconds), the router sends SSE comment frames to keep the connection alive:

SSE Comment
: heartbeat

Per the SSE specification, lines beginning with a colon are comments that clients silently ignore. These heartbeats prevent reverse proxies and load balancers from closing idle connections due to read timeouts. Heartbeats do not reset the idle stream timeout tracker.
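For clients that parse the stream by hand, comment frames can be filtered out before any data: handling. A minimal sketch, with the hypothetical helper name split_heartbeats and an iterable of decoded text lines assumed as input:

```python
from typing import Iterable, List, Tuple


def split_heartbeats(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Separate SSE comment lines from everything else.

    Returns (payload_lines, comment_count). Hypothetical helper: any line
    starting with a colon is treated as a comment, per the SSE format.
    """
    payload_lines: List[str] = []
    comments = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            comments += 1  # keep-alive only, carries no event data
            continue
        payload_lines.append(line)
    return payload_lines, comments
```

Counting the comments instead of discarding them silently gives a client a cheap signal that the connection is alive while the model is producing no output.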

Client Examples

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")

Node.js (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const stream = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
  if (chunk.usage) {
    console.log(`\nTokens: ${chunk.usage.prompt_tokens} + ${chunk.usage.completion_tokens}`);
  }
}

JavaScript / TypeScript (Raw Fetch)
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer xero_myproject_your_api_key"
    },
    body: JSON.stringify({
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true
    })
  }
);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";      // holds a partial line that spans two reads
let finished = false; // set when the [DONE] sentinel arrives

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters split across reads intact
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // last element is an incomplete line (or "")
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") {
      finished = true; // leave the outer read loop as well
      break;
    }
    const chunk = JSON.parse(data);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}

curl
curl --no-buffer -X POST \
  https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

The --no-buffer flag disables curl's output buffering so chunks are displayed as they arrive.