Streaming API
Server-Sent Events from token zero. Stream completions, tool calls, and reasoning channels to a browser or an SDK without polling, without a websocket library, and without holding a connection open in your worker pool.
SSE Format
When stream: true is set in the request, the response is delivered
as a stream of Server-Sent Events. The connection uses the following HTTP headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Wire Format
Each event is a single line that begins with data:, followed by a JSON object, and is terminated by two newline characters (that is, a blank line separates events):
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...}\n\n
The stream ends with a [DONE] sentinel:
data: [DONE]\n\n
Clients should parse each data: line, check for [DONE], and decode the JSON for all other lines.
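The parsing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference client: the function name iter_sse_payloads and the sample input are hypothetical, and the sketch assumes the caller already has the response split into text lines.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank separator lines and non-data lines carry no payload.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end of stream; nothing after the sentinel is read
        yield json.loads(data)


# Hypothetical input, shaped like the wire format described above.
sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
payloads = list(iter_sse_payloads(sample))
```

The generator stops at the sentinel rather than at end of input, so trailing bytes after [DONE] are never passed to the JSON decoder.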
Chunk Structure
Each streaming chunk follows this schema:
| Field | Type | Description |
|---|---|---|
| id | string | Request identifier (same across all chunks). |
| object | string | Always "chat.completion.chunk". |
| created | integer | Unix timestamp. |
| model | string | Model identifier. |
| service_tier | string \| null | Service tier. Always present, may be null. |
| system_fingerprint | string \| null | System fingerprint. Always present, may be null. |
| choices | array | Array of choice objects containing delta content, finish_reason, and optional logprobs (when logprobs: true is set in the request). |
| usage | object \| null | Token usage. Present only in the final chunk when stream_options.include_usage is true. |
Delta Object
The delta field in each choice contains the incremental content:
| Field | Type | Description |
|---|---|---|
| role | string \| null | Present only in the first chunk ("assistant"). |
| content | string \| null | New text tokens. Null when no text is generated (e.g., tool calls). |
| tool_calls | array \| null | Tool call deltas. See Tool Call Streaming. |
| refusal | string \| null | Content filter refusal message (streamed incrementally). |
Reasoning content: the chat-completions delta object does
not carry a reasoning_content field. Internal reasoning
produced by thinking models is dropped from this stream. To observe
incremental reasoning, use the Responses API streaming surface
(response.reasoning_summary_text.delta events) instead.
finish_reason Values
| Value | Description |
|---|---|
| null | Generation still in progress. |
| "stop" | Model completed naturally or hit a stop sequence. |
| "length" | Maximum token limit reached. |
| "content_filter" | Content was filtered. |
| "tool_calls" | Model generated tool calls. |
Per-Chunk Log Probabilities
When logprobs: true is set in the request, each streaming choice
includes a logprobs object with per-token log probabilities for the
tokens in that chunk. The structure mirrors the non-streaming logprobs format:
| Field | Type | Description |
|---|---|---|
| logprobs.content | array \| null | Token log probabilities for content tokens in this chunk. |
| logprobs.refusal | array \| null | Token log probabilities for refusal tokens in this chunk. Present when the model refuses to comply with a request. |
Each entry in the content and refusal arrays contains
token, logprob, bytes, and
top_logprobs fields, identical to the non-streaming logprobs format.
When logprobs is not requested, the field is null in all chunks.
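Because each chunk carries only the log probabilities for its own tokens, a client that wants the full sequence has to concatenate them. The sketch below shows one way to do that; the helper name collect_content_logprobs and the numeric values in the sample are illustrative assumptions, not output from a real model.

```python
from typing import Iterable, List


def collect_content_logprobs(chunks: Iterable[dict], choice_index: int = 0) -> List[dict]:
    """Concatenate logprobs.content entries for one choice across all chunks."""
    entries: List[dict] = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("index") != choice_index:
                continue
            # logprobs is null when not requested, so fall back to an empty dict.
            logprobs = choice.get("logprobs") or {}
            entries.extend(logprobs.get("content") or [])
    return entries


# Illustrative chunks; the logprob values are made up for the example.
chunks = [
    {"choices": [{"index": 0, "delta": {"content": "The"}, "logprobs": {
        "content": [{"token": "The", "logprob": -0.1, "bytes": [84, 104, 101], "top_logprobs": []}],
        "refusal": None}}]},
    {"choices": [{"index": 0, "delta": {"content": " capital"}, "logprobs": {
        "content": [{"token": " capital", "logprob": -0.25, "bytes": None, "top_logprobs": []}],
        "refusal": None}}]},
    {"choices": [{"index": 0, "delta": {}, "logprobs": None, "finish_reason": "stop"}]},
]

entries = collect_content_logprobs(chunks)
text = "".join(entry["token"] for entry in entries)
sequence_logprob = sum(entry["logprob"] for entry in entries)
```

The same loop works for logprobs.refusal by swapping the key.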
Refusal Streaming
When the model refuses a request due to content policy or safety filters, the
refusal text is delivered incrementally via delta.refusal instead
of delta.content. The finish_reason is typically
"stop" and content will be null in the final response.
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
# Refusal chunks (delta.refusal instead of delta.content)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":"I'm sorry, but I"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":" cannot help with that request."},"finish_reason":null}]}
# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Clients should concatenate delta.refusal strings across chunks just
like delta.content. When logprobs: true is set, refusal
token probabilities appear in logprobs.refusal on each chunk.
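A client can treat content and refusal as two parallel text channels and concatenate each one. The following sketch uses trimmed versions of the chunks shown above; the function name accumulate_text and the shape of its return value are assumptions made for illustration.

```python
from typing import Iterable, Optional


def accumulate_text(chunks: Iterable[dict], choice_index: int = 0) -> dict:
    """Join delta.content and delta.refusal fragments for a single choice."""
    content_parts = []
    refusal_parts = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != choice_index:
                continue
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            if delta.get("refusal"):
                refusal_parts.append(delta["refusal"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return {
        # An empty channel is reported as None, mirroring a null field.
        "content": "".join(content_parts) or None,
        "refusal": "".join(refusal_parts) or None,
        "finish_reason": finish_reason,
    }


refusal_stream = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"refusal": "I'm sorry, but I"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"refusal": " cannot help with that request."}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
result = accumulate_text(refusal_stream)
```

Checking result["refusal"] before result["content"] lets a UI render a refusal differently from a normal answer.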
Annotated Stream Example
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
# Content chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" of France is Paris."},"finish_reason":null}]}
# Final chunk: finish_reason set, usage included (requires stream_options.include_usage: true)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":null},"completion_tokens_details":{"reasoning_tokens":null,"audio_tokens":null,"accepted_prediction_tokens":null,"rejected_prediction_tokens":null}}}
# Stream terminator
data: [DONE]
Usage Reporting
To receive token usage data in a streaming response, set stream_options
in your request:
{
  "stream": true,
  "stream_options": {"include_usage": true}
}
When enabled, the final chunk (the one with finish_reason set)
includes a usage object:
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 12,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  }
}
The prompt_tokens_details.cached_tokens field shows how many prompt
tokens were served from the prefix cache, reducing time-to-first-token.
The completion_tokens_details object provides a breakdown of output
tokens. For reasoning models (when reasoning_effort is set),
reasoning_tokens shows how many tokens were used for internal
reasoning. When prediction is used,
accepted_prediction_tokens and rejected_prediction_tokens
indicate how effective the prediction was.
Without stream_options.include_usage, the usage field
is null in all chunks. Token usage is still tracked internally for billing.
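As a rough sketch of how a client might read these fields, the helpers below pick the usage object out of a list of decoded chunks and compute the share of prompt tokens served from the prefix cache. The function names extract_usage and cached_prompt_fraction are hypothetical; the sample values are taken from the usage object shown above.

```python
from typing import Iterable, Optional


def extract_usage(chunks: Iterable[dict]) -> Optional[dict]:
    """Return the last non-null usage object seen in the stream, if any."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage


def cached_prompt_fraction(usage: dict) -> float:
    """Share of prompt tokens that were served from the prefix cache."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    prompt = usage.get("prompt_tokens") or 0
    return cached / prompt if prompt else 0.0


chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": None}], "usage": None},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
     "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33,
               "prompt_tokens_details": {"cached_tokens": 12, "audio_tokens": None}}},
]
usage = extract_usage(chunks)
fraction = cached_prompt_fraction(usage)
```

When include_usage is not set, extract_usage returns None, so callers should guard before calling cached_prompt_fraction.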
Tool Call Streaming
When the model generates tool calls, the delta.tool_calls field
contains incremental data. Tool calls are streamed across multiple chunks:
- The first chunk for a tool call includes id, type, and the function name.
- Subsequent chunks include only the arguments string, accumulated incrementally.
- The index field identifies which tool call the delta belongs to (for parallel tool calls).
Tool Call Stream Example
# First chunk: tool call start
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}
# Arguments streamed incrementally
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Paris\"}"}}]},"finish_reason":null}]}
# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
data: [DONE]
Clients should concatenate the arguments strings across chunks
and parse the complete JSON after finish_reason: "tool_calls".
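The accumulation described above can be sketched as follows. The helper name assemble_tool_calls is hypothetical, and treating an empty arguments string as an empty JSON object is an assumption of this sketch, not documented behavior.

```python
import json
from typing import Iterable, List


def assemble_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge tool call deltas by index and parse arguments once the stream finishes."""
    calls = {}  # tool call index -> partially assembled call
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "type": None, "name": None, "arguments": ""}
                )
                # id, type, and name arrive only on the first delta for a call.
                if tc.get("id"):
                    slot["id"] = tc["id"]
                if tc.get("type"):
                    slot["type"] = tc["type"]
                fn = tc.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                slot["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason") == "tool_calls":
                finished = True
    if not finished:
        raise ValueError("stream ended without finish_reason 'tool_calls'")
    # Assumption: an empty arguments string means "no arguments".
    return [
        {**calls[i], "arguments": json.loads(calls[i]["arguments"] or "{}")}
        for i in sorted(calls)
    ]


chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "tool_calls": [
        {"index": 0, "id": "call_abc", "type": "function",
         "function": {"name": "get_weather", "arguments": ""}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "{\"location\":"}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "\"Paris\"}"}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]},
]
tool_calls = assemble_tool_calls(chunks)
```

Parsing is deferred until the end because an individual arguments fragment is usually not valid JSON on its own.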
Vendor Events
The platform emits vendor-prefixed events (x_*) as inline
data: lines within the Chat Completions stream. These events are
not standard OpenAI SSE events; they carry a JSON object with a type
field that starts with x_. Standard OpenAI SDK clients silently
discard unrecognized data lines, so these events are backward compatible.
Vendor events are emitted by the platform when platform-specific features are
active (research mode, deep think, analyst mode, artifacts, and interactive
user prompts). They flow inline between regular chat.completion.chunk
data lines. Custom consumers should inspect the type field of each
data line and handle both OpenAI-standard chunk objects and x_*
vendor payloads.
Wire Format
data: {"type":"x_research.searching","name":"web_search","arguments":"{\"query\":\"...\"}"}
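A custom consumer can route each decoded data line by inspecting type and object. The sketch below is one possible dispatcher; the function names classify_payload and vendor_family, and the "unknown" fallback, are assumptions introduced for illustration.

```python
def classify_payload(payload: dict) -> str:
    """Return 'vendor', 'chunk', or 'unknown' for a decoded data line."""
    kind = payload.get("type")
    if isinstance(kind, str) and kind.startswith("x_"):
        return "vendor"
    if payload.get("object") == "chat.completion.chunk":
        return "chunk"
    return "unknown"


def vendor_family(payload: dict) -> str:
    """'x_research.searching' -> 'x_research'; 'x_context_fork' -> 'x_context_fork'."""
    return payload["type"].split(".", 1)[0]


payloads = [
    {"type": "x_research.searching", "name": "web_search", "arguments": "{\"query\":\"...\"}"},
    {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", "choices": []},
    {"type": "x_context_fork", "branch_id": "b1", "branch_name": "alt", "message_count": 3},
]
kinds = [classify_payload(p) for p in payloads]
families = [vendor_family(p) for p in payloads if classify_payload(p) == "vendor"]
```

Grouping by family keeps handler code small: one handler per prefix rather than one per event type, with unrecognized families ignored so newly added events do not break the client.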
Research Events (x_research.*)
Emitted during agentic research loops (research mode).
Note: x_research.complete is a dashboard-only consumer signal
and is not emitted by the public API surface. Do not rely on it from SDK
clients; use the final chat.completion.chunk with
finish_reason set instead.
| Type | Description |
|---|---|
| x_research.searching | Web search tool invoked. Fields: name, arguments (JSON string with query). |
| x_research.reading | URL fetch tool invoked. Fields: name, arguments (JSON string with url). |
| x_research.code_searching | Code search tool invoked. Fields: name, arguments. |
| x_research.calculating | Calculator tool invoked. Fields: name, arguments. |
| x_research.result | Tool returned a result. Fields: name, arguments, metadata (object with summary). |
| x_research.gap_analysis | Loop is identifying gaps before a deepening pass. |
| x_research.deepening_round | A deepening iteration has begun. |
| x_research.context_compacted | Conversation context was compacted to free input budget. |
| x_research.tracking_decision | Loop is recording a decision into intelligence storage. |
| x_research.querying_decisions | Loop is querying recorded decisions. |
| x_research.tracking_milestone | Loop is recording a milestone. |
| x_research.querying_timeline | Loop is reading the timeline of tracked events. |
| x_research.briefing | Intelligence briefing is being prepared. |
| x_research.relating | Loop is computing entity/document relationships. |
| x_research.creating_mockup | Loop is dispatching create_mockup. |
| x_research.updating_mockup | Loop is dispatching update_mockup. |
| x_research.tool_call | Generic tool invocation notification (covers tools not modeled above). |
Deep Think Events (x_deep_think.*)
Emitted during deep think (multi-step research with planning and synthesis).
Note: x_deep_think.completed, x_deep_think.error,
x_deep_think.artifact_created, and
x_deep_think.memories_created are declared in the schema but
are not currently emitted by the router. x_deep_think.artifact_saved
and x_deep_think.subtask_artifact_saved are emitted only by the
dashboard frontend and are not visible to public API consumers.
| Type | Description |
|---|---|
| x_deep_think.planning_started | Planner has begun building the deep-think plan. |
| x_deep_think.plan_created | Planning phase complete. Fields: title, total_subtasks. |
| x_deep_think.plan_critiqued | Planner critique pass finished. |
| x_deep_think.discovery_started | Target-focused discovery phase begun. Fields: message. |
| x_deep_think.discovery_completed | Discovery phase complete. Fields: message. |
| x_deep_think.subtask_planned | A sub-task plan has been generated. |
| x_deep_think.subtask_started | Sub-task begun. Fields: subtask_id, subtask_index, subtask_query, total_subtasks. |
| x_deep_think.subtask_retried | Sub-task retried after a transient failure. |
| x_deep_think.subtask_completed | Sub-task finished. Fields: subtask_index, input_tokens, output_tokens. |
| x_deep_think.deepening_round_completed | A deepening round across sub-tasks finished. |
| x_deep_think.cross_reference | Cross-referencing between sub-task findings. |
| x_deep_think.synthesizing | Synthesis phase begun. |
| x_deep_think.structured_synthesis | Structured synthesis output produced. |
| x_deep_think.claim_emitted | A discrete synthesis claim was emitted. |
| x_deep_think.synthesis_failed | Synthesis pass failed. |
| x_deep_think.memory_extracted | A workspace memory candidate was extracted from synthesis. |
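These events carry enough information to drive a simple progress indicator. The sketch below is one possible approach: the class name DeepThinkProgress is hypothetical, the sample field values are invented, and deduplicating by subtask_index is a guess about how a retried sub-task might report completion, not documented behavior.

```python
from typing import Optional


class DeepThinkProgress:
    """Tracks sub-task completion from x_deep_think.* events."""

    def __init__(self) -> None:
        self.title: Optional[str] = None
        self.total_subtasks: Optional[int] = None
        self.completed = set()  # subtask_index values already finished
        self.output_tokens = 0

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "x_deep_think.plan_created":
            self.title = event.get("title")
            self.total_subtasks = event.get("total_subtasks")
        elif kind == "x_deep_think.subtask_started":
            # total_subtasks is repeated here, so it can be picked up late.
            if self.total_subtasks is None:
                self.total_subtasks = event.get("total_subtasks")
        elif kind == "x_deep_think.subtask_completed":
            # A set avoids double counting if an index completes more than once.
            self.completed.add(event.get("subtask_index"))
            self.output_tokens += event.get("output_tokens") or 0
        # All other event types are ignored by this sketch.

    def fraction(self) -> float:
        if not self.total_subtasks:
            return 0.0
        return min(1.0, len(self.completed) / self.total_subtasks)


progress = DeepThinkProgress()
for event in [
    {"type": "x_deep_think.plan_created", "title": "Example plan", "total_subtasks": 2},
    {"type": "x_deep_think.subtask_started", "subtask_id": "s-0", "subtask_index": 0,
     "subtask_query": "example", "total_subtasks": 2},
    {"type": "x_deep_think.subtask_completed", "subtask_index": 0,
     "input_tokens": 100, "output_tokens": 40},
    {"type": "x_deep_think.subtask_retried"},
]:
    progress.handle(event)
```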
Artifact Events (x_artifact.*)
Emitted when code artifacts are created or updated during generation.
| Type | Description |
|---|---|
| x_artifact.created | New artifact created. Fields: artifact_id, identifier, title, language, content_type, content. |
| x_artifact.updated | Existing artifact updated. Same fields as x_artifact.created. |
Mockup Events (x_mockup.*)
Emitted when an agentic-mode response creates or updates a multi-file
mockup bundle (see create_mockup
and update_mockup).
Bundle files are reachable from the preview iframe at
GET /v1/mockups/{bundleId}/{path}.
| Type | Description |
|---|---|
| x_mockup.calling | Placeholder fired before parse/persist begins so the UI can show a card immediately. Fields: identifier, title (when known). |
| x_mockup.created | Bundle creation succeeded. Fired once after create_mockup. Fields: bundleId, identifier, title, entry, files (array of {path, contentType, size}). |
| x_mockup.updated | Bundle update succeeded. Fired once after update_mockup. Fields: bundleId, identifier, title, entry, changed (array of paths added or replaced), deleted (array of paths removed). |
| x_mockup.error | Validation, storage, or partial-write failure. Fields: bundleId (optional, present when the bundle exists), code, message, successfulPaths (optional, paths persisted before the failure), failedPath (optional, the path that triggered the failure). |
Ask User Events (x_ask_user.*)
Emitted when the model needs clarification before it can continue.
| Type | Description |
|---|---|
| x_ask_user.question | Model needs user input. Fields: askUserId (camelCase correlation ID to pass when resuming), question (text to show the user), options (optional array of selectable choices), allowFreeText (boolean), multiSelect (boolean), style (string presentation hint), fields (optional structured form fields), toolCallId (originating tool call identifier). |
| x_ask_user.pending_state | Captures the assistant content and tool calls accumulated before the pause, for resumption after the user responds. Fields: assistantContent, toolCalls. |
Context Fork Event (x_context_fork)
Emitted when the user's message triggered creation of a new conversation branch.
Fields: branch_id, branch_name, message_count.
Chat Metadata Event (x_chat.metadata)
Dashboard-only event. x_chat.metadata is emitted by the
dashboard frontend controller, not by the public API surface. SDK clients
calling POST /:project_id/:endpoint_slug/v1/chat/completions
directly will not observe this event.
Emitted as the final data event before [DONE] on dashboard
chat streams. Contains server-side persistence identifiers, context budget
breakdown, and combined token usage (including any research or deep think
overhead).
data: {
  "type": "x_chat.metadata",
  "messageId": "msg_ext_abc123",
  "userMessageId": "msg_ext_xyz789",
  "sequence": 4,
  "context": {
    "systemTokens": 512,
    "summaryTokens": 0,
    "retrievedTokens": 1024,
    "recentTokens": 2048,
    "fileTokens": 0,
    "currentMessageTokens": 64,
    "totalTokens": 3648,
    "inputBudget": 8192,
    "retrievedCount": 6,
    "recentCount": 12,
    "usedSemanticRetrieval": true,
    "semanticRetrievalActive": true,
    "chunkSelectionMethod": "semantic"
  },
  "usage": {
    "input_tokens": 3648,
    "output_tokens": 256,
    "total_tokens": 3904
  }
}
Analyst Events (x_analyst.*)
Emitted when analyst mode builds or refreshes the workspace context brief.
| Type | Description |
|---|---|
| x_analyst.context_gathering | Workspace context gathering begun. |
| x_analyst.context_completed | Gathering finished. Contains item counts. |
| x_analyst.context_brief_created | LLM-generated context brief is ready. |
| x_analyst.context_refreshed | Cached context brief refreshed due to workspace changes. |
Error Handling
Pre-Stream Errors
Errors that occur before the SSE connection is established return standard HTTP status codes (400, 401, 404, 429, 503). The response body is a JSON error object, not an SSE stream.
Mid-Stream Errors
Errors that occur during an active stream are sent as a named SSE event:
an event: error line precedes the data: line that carries the JSON error
object. The [DONE] sentinel is always sent after the error:
event: error
data: {"error":{"message":"Request timed out after 30s. Your Free tier has a 30-second timeout limit.","type":"timeout_error","code":"timeout"}}
data: [DONE]
Error Types
| Type | Code | Description |
|---|---|---|
| server_error | varies | Backend agent reported an error during generation. The code field contains the specific error code (e.g., internal_error, backend_unavailable, preempted). |
| timeout_error | timeout | Request exceeded the tier's deadline timeout. |
| stream_idle_timeout | stream_idle_timeout | No chunks received for the idle timeout period. See tier timeouts. |
| (none) | cancelled | Request was cancelled (client disconnect or server cancellation). |
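A line-based parser needs to remember the most recent event: field so it can tell an error payload from a regular chunk. The sketch below shows one way to surface mid-stream errors as exceptions; the StreamError class and the function name iter_chunks_or_raise are hypothetical, and the input is assumed to be already split into text lines.

```python
import json
from typing import Iterable, Iterator, Optional


class StreamError(Exception):
    """Raised when the stream delivers an 'event: error' frame."""

    def __init__(self, type_: Optional[str], code: Optional[str], message: Optional[str]):
        super().__init__(message)
        self.type = type_  # may be None, e.g. for cancelled requests
        self.code = code


def iter_chunks_or_raise(lines: Iterable[str]) -> Iterator[dict]:
    event_name = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            event_name = None  # a blank line ends the current event
            continue
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue  # comments and unknown fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        payload = json.loads(data)
        if event_name == "error":
            err = payload.get("error") or {}
            raise StreamError(err.get("type"), err.get("code"), err.get("message"))
        yield payload


sample = [
    'data: {"object":"chat.completion.chunk","choices":[]}',
    "",
    "event: error",
    'data: {"error":{"message":"Request timed out after 30s.","type":"timeout_error","code":"timeout"}}',
    "",
    "data: [DONE]",
]
received = []
caught = None
try:
    for payload in iter_chunks_or_raise(sample):
        received.append(payload)
except StreamError as exc:
    caught = exc
```

Chunks received before the error remain in received, so partial output can still be shown to the user alongside the error.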
Client Disconnect
When a client disconnects during a stream, the router detects the broken connection and sends a cancellation request to the backend agent. The agent stops generation to free resources. Any tokens generated before disconnection are still billed.
Heartbeats
During idle periods (no chunks for 15 seconds), the router sends SSE comment frames to keep the connection alive:
: heartbeat
Per the SSE specification, lines beginning with a colon are comments that clients silently ignore. These heartbeats prevent reverse proxies and load balancers from closing idle connections due to read timeouts. Heartbeats do not reset the idle stream timeout tracker.
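For clients that read the stream by hand, it can help to track "bytes arrived" separately from "data arrived", since a heartbeat shows the connection is alive but not that generation is progressing. The sketch below illustrates that distinction; the StreamActivity class is hypothetical and timestamps are passed in explicitly so the example stays deterministic.

```python
from typing import Optional


class StreamActivity:
    """Separates 'connection is alive' from 'the stream is delivering data'."""

    def __init__(self) -> None:
        self.last_byte_at: Optional[float] = None
        self.last_data_at: Optional[float] = None

    def observe(self, line: str, now: float) -> None:
        self.last_byte_at = now
        # Comment frames such as ": heartbeat" and blank separators carry no payload.
        if line.startswith(":") or not line.strip():
            return
        self.last_data_at = now

    def seconds_without_data(self, now: float) -> Optional[float]:
        if self.last_data_at is None:
            return None
        return now - self.last_data_at


activity = StreamActivity()
activity.observe('data: {"object":"chat.completion.chunk"}', now=0.0)
activity.observe(": heartbeat", now=15.0)
activity.observe(": heartbeat", now=30.0)
```

In a real client the now argument would come from a monotonic clock such as time.monotonic().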
Client Examples
# Python, using the openai package
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    # Print text deltas as they arrive
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    # usage is populated only on the final chunk
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")
// Node.js, using the openai package
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const stream = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
  if (chunk.usage) {
    console.log(`\nTokens: ${chunk.usage.prompt_tokens} + ${chunk.usage.completion_tokens}`);
  }
}
// Raw fetch without an SDK (Node.js 18+; process.stdout is Node-specific)
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer xero_myproject_your_api_key"
    },
    body: JSON.stringify({
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true
    })
  }
);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact when split across reads
  buffer += decoder.decode(value, { stream: true });
  // A read can end mid-line, so keep the trailing partial line in the buffer
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6).trim();
    if (data === "[DONE]") {
      finished = true; // stop the outer read loop as well
      break;
    }
    const chunk = JSON.parse(data);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
curl --no-buffer -X POST \
https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
-H "Authorization: Bearer xero_myproject_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"stream_options": {"include_usage": true}
}'
The --no-buffer flag disables curl's output buffering so chunks
are displayed as they arrive.