Streaming API
Complete reference for consuming streaming responses via Server-Sent Events (SSE).
SSE Format
When stream: true is set in the request, the response is delivered
as a stream of Server-Sent Events. The connection uses the following HTTP headers:
```text
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Request-ID: chatcmpl-abc123
X-Xerotier-Worker-ID: worker-7f3a
```
Wire Format
Each event is a line prefixed with data: followed by a JSON object, terminated by two newlines:

```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...}\n\n
```

The stream ends with a [DONE] sentinel:

```text
data: [DONE]\n\n
```
Clients should parse each data: line, check for [DONE], and decode the JSON for all other lines.
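As a concrete sketch, a framework-agnostic parser for this wire format (the helper name is illustrative) might look like:

```python
import json

def parse_sse_lines(lines):
    """Yield decoded JSON chunks from an iterable of SSE text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":"):
            continue  # comment frame (e.g. heartbeat); clients ignore these
        if not line.startswith("data: "):
            continue  # blank separators between events
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # stream terminator
        yield json.loads(data)

# Example with captured wire lines:
wire = [
    'data: {"choices":[{"delta":{"content":"Hi"}}]}\n',
    "\n",
    ": heartbeat\n",
    "data: [DONE]\n",
]
chunks = list(parse_sse_lines(wire))  # one decoded chunk
```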
Chunk Structure
Each streaming chunk follows this schema:
| Field | Type | Description |
|---|---|---|
| id | string | Request identifier (same across all chunks). |
| object | string | Always "chat.completion.chunk". |
| created | integer | Unix timestamp. |
| model | string | Model identifier. |
| service_tier | string or null | Service tier. Always present, may be null. |
| system_fingerprint | string or null | System fingerprint. Always present, may be null. |
| choices | array | Array of choice objects containing delta content, finish_reason, and optional logprobs (when logprobs: true is set in the request). |
| usage | object or null | Token usage. Present only in the final chunk when stream_options.include_usage is true. |
Delta Object
The delta field in each choice contains the incremental content:
| Field | Type | Description |
|---|---|---|
| role | string or null | Present only in the first chunk ("assistant"). |
| content | string or null | New text tokens. Null when no text is generated (e.g., tool calls). |
| tool_calls | array or null | Tool call deltas. See Tool Call Streaming. |
| refusal | string or null | Content filter refusal message (streamed incrementally). |
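Since each chunk carries only an increment, clients rebuild the full message by concatenating deltas. A minimal sketch over decoded chunk dicts (helper name illustrative):

```python
def accumulate_deltas(chunks):
    """Reassemble a streamed message from decoded chunk dicts."""
    message = {"role": None, "content": "", "refusal": ""}
    for chunk in chunks:
        if not chunk["choices"]:
            continue  # a usage-only final chunk has an empty choices array
        delta = chunk["choices"][0]["delta"]
        if delta.get("role"):
            message["role"] = delta["role"]        # first chunk only
        if delta.get("content"):
            message["content"] += delta["content"]
        if delta.get("refusal"):
            message["refusal"] += delta["refusal"]
    return message

chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
]
message = accumulate_deltas(chunks)
```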
finish_reason Values
| Value | Description |
|---|---|
| null | Generation still in progress. |
| "stop" | Model completed naturally or hit a stop sequence. |
| "length" | Maximum token limit reached. |
| "content_filter" | Content was filtered. |
| "tool_calls" | Model generated tool calls. |
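In client code this typically becomes a dispatch on the first non-null value; an illustrative sketch (the action names are hypothetical):

```python
def handle_finish(finish_reason):
    """Map a chunk's finish_reason to a client-side action (illustrative)."""
    if finish_reason is None:
        return "continue"        # more chunks coming
    if finish_reason == "stop":
        return "done"            # natural completion or stop sequence
    if finish_reason == "length":
        return "truncated"       # consider a higher max_tokens
    if finish_reason == "content_filter":
        return "filtered"
    if finish_reason == "tool_calls":
        return "run_tools"       # execute the accumulated tool calls
    raise ValueError(f"unexpected finish_reason: {finish_reason!r}")
```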
Per-Chunk Log Probabilities
When logprobs: true is set in the request, each streaming choice
includes a logprobs object with per-token log probabilities for the
tokens in that chunk. The structure mirrors the non-streaming logprobs format:
| Field | Type | Description |
|---|---|---|
| logprobs.content | array or null | Token log probabilities for content tokens in this chunk. |
| logprobs.refusal | array or null | Token log probabilities for refusal tokens in this chunk. Present when the model refuses to comply with a request. |
Each entry in the content and refusal arrays contains
token, logprob, bytes, and
top_logprobs fields, identical to the non-streaming logprobs format.
When logprobs is not requested, the field is null in all chunks.
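Collecting per-token log probabilities across chunks can be sketched as follows (helper name illustrative):

```python
def collect_logprobs(chunks):
    """Gather (token, logprob) pairs from streamed chunk dicts (sketch)."""
    pairs = []
    for chunk in chunks:
        for choice in chunk["choices"]:
            lp = choice.get("logprobs")
            if lp and lp.get("content"):
                for entry in lp["content"]:
                    pairs.append((entry["token"], entry["logprob"]))
    return pairs

chunks = [
    {"choices": [{"logprobs": {"content": [
        {"token": "The", "logprob": -0.12, "bytes": [84, 104, 101],
         "top_logprobs": []}]}}]},
    {"choices": [{"logprobs": None}]},  # field is null when not requested
]
pairs = collect_logprobs(chunks)
```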
Refusal Streaming
When the model refuses a request due to content policy or safety filters, the
refusal text is delivered incrementally via delta.refusal instead
of delta.content. The finish_reason is typically
"stop" and content will be null in the final response.
```text
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Refusal chunks (delta.refusal instead of delta.content)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":"I'm sorry, but I"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":" cannot help with that request."},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```
Clients should concatenate delta.refusal strings across chunks just
like delta.content. When logprobs: true is set, refusal
token probabilities appear in logprobs.refusal on each chunk.
Annotated Stream Example
```text
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Content chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" of France is Paris."},"finish_reason":null}]}

# Final chunk: finish_reason set, usage included (requires stream_options.include_usage: true)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":null},"completion_tokens_details":{"reasoning_tokens":null,"audio_tokens":null,"accepted_prediction_tokens":null,"rejected_prediction_tokens":null}}}

# Stream terminator
data: [DONE]
```
Usage Reporting
To receive token usage data in a streaming response, set stream_options
in your request:
```json
{
  "stream": true,
  "stream_options": {"include_usage": true}
}
```
When enabled, the final chunk (the one with finish_reason set)
includes a usage object:
```json
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 12,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  }
}
```
The prompt_tokens_details.cached_tokens field shows how many prompt
tokens were served from the prefix cache, reducing time-to-first-token.
The completion_tokens_details object provides a breakdown of output
tokens. For reasoning models (when reasoning_effort is set),
reasoning_tokens shows how many tokens were used for internal
reasoning. When prediction is used,
accepted_prediction_tokens and rejected_prediction_tokens
indicate how effective the prediction was.
Without stream_options.include_usage, the usage field
is null in all chunks. Token usage is still tracked internally for billing.
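Picking the usage off the final chunk and deriving a cache-hit ratio from cached_tokens might be sketched as (helper name illustrative):

```python
def summarize_usage(chunks):
    """Return totals from the final usage-bearing chunk, or None."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]   # only the final chunk carries usage
    if usage is None:
        return None  # stream_options.include_usage was not set
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    prompt = usage["prompt_tokens"]
    return {
        "total": usage["total_tokens"],
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
    }

chunks = [
    {"usage": None},
    {"usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33,
               "prompt_tokens_details": {"cached_tokens": 12}}},
]
summary = summarize_usage(chunks)  # cache_hit_ratio = 12 / 25 = 0.48
```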
Tool Call Streaming
When the model generates tool calls, the delta.tool_calls field
contains incremental data. Tool calls are streamed across multiple chunks:
- The first chunk for a tool call includes id, type, and the function name.
- Subsequent chunks include only the arguments string, accumulated incrementally.
- The index field identifies which tool call the delta belongs to (for parallel tool calls).
Tool Call Stream Example
```text
# First chunk: tool call start
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

# Arguments streamed incrementally
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Paris\"}"}}]},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
data: [DONE]
```
Clients should concatenate the arguments strings across chunks
and parse the complete JSON after finish_reason: "tool_calls".
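That accumulate-then-parse step can be sketched as follows (helper name illustrative):

```python
import json

def accumulate_tool_calls(chunks):
    """Merge streamed tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        for tc in chunk["choices"][0]["delta"].get("tool_calls") or []:
            call = calls.setdefault(
                tc["index"], {"id": None, "name": None, "arguments": ""})
            if tc.get("id"):
                call["id"] = tc["id"]          # first chunk for this call
            fn = tc.get("function", {})
            if fn.get("name"):
                call["name"] = fn["name"]
            call["arguments"] += fn.get("arguments", "")
    # Parse each accumulated argument string once the stream is finished
    for call in calls.values():
        call["arguments"] = json.loads(call["arguments"])
    return calls

chunks = [
    {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
        "function": {"arguments": "{\"location\":"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
        "function": {"arguments": "\"Paris\"}"}}]}}]},
    {"choices": [{"delta": {}}]},
]
calls = accumulate_tool_calls(chunks)
```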
Error Handling
Pre-Stream Errors
Errors that occur before the SSE connection is established return standard HTTP status codes (400, 401, 404, 429, 503). The response body is a JSON error object, not an SSE stream.
Mid-Stream Errors
Errors that occur during an active stream use the event: error SSE
event type (not the standard data: prefix). The [DONE]
sentinel is always sent after the error:
```text
event: error
data: {"error":{"message":"Request timed out after 30s. Your Free tier has a 30-second timeout limit.","type":"timeout_error","code":"timeout"}}

data: [DONE]
```
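Handling these frames requires tracking the event: field as well as data: lines; a sketch (helper name illustrative):

```python
import json

def parse_sse_events(lines):
    """Yield ("chunk" | "error", payload) pairs from SSE text lines."""
    event_type = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event: "):
            event_type = line[len("event: "):]
            continue
        if not line.startswith("data: "):
            event_type = None  # blank line ends the current event
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # terminator is sent even after an error
        kind = "error" if event_type == "error" else "chunk"
        yield kind, json.loads(data)
        event_type = None

wire = [
    "event: error\n",
    'data: {"error":{"message":"Request timed out after 30s.",'
    '"type":"timeout_error","code":"timeout"}}\n',
    "\n",
    "data: [DONE]\n",
]
events = list(parse_sse_events(wire))  # one ("error", ...) pair
```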
Error Types
| Type | Code | Description |
|---|---|---|
| api_error | varies | Backend agent reported an error during generation. |
| timeout_error | timeout | Request exceeded the tier's deadline timeout. |
| stream_idle_timeout | stream_idle_timeout | No chunks received for the idle timeout period. See tier timeouts. |
| (none) | cancelled | Request was cancelled (client disconnect or server cancellation). |
Client Disconnect
When a client disconnects during a stream, the router detects the broken connection and sends a cancellation request to the backend agent. The agent stops generation to free resources. Any tokens generated before disconnection are still billed.
Heartbeats
During idle periods (no chunks for 15 seconds), the router sends SSE comment frames to keep the connection alive:
```text
: heartbeat
```
Per the SSE specification, lines beginning with a colon are comments that clients silently ignore. These heartbeats prevent reverse proxies and load balancers from closing idle connections due to read timeouts. Heartbeats do not reset the idle stream timeout tracker.
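A client that tracks its own idle time should likewise count only data frames, not comment frames; a sketch (helper names illustrative):

```python
def is_comment(line):
    """SSE comment frames begin with a colon; clients silently ignore them."""
    return line.startswith(":")

def frames_that_reset_idle(lines):
    """Return only the frames that should refresh a client-side idle timer."""
    return [ln for ln in lines if ln.strip() and not is_comment(ln)]

wire = [": heartbeat\n", "\n", 'data: {"choices":[]}\n', ": heartbeat\n"]
resets = frames_that_reset_idle(wire)  # only the data frame remains
```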
Client Examples
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const stream = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
  if (chunk.usage) {
    console.log(`\nTokens: ${chunk.usage.prompt_tokens} + ${chunk.usage.completion_tokens}`);
  }
}
```
```javascript
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer xero_myproject_your_api_key"
    },
    body: JSON.stringify({
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true
    })
  }
);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true holds back incomplete multi-byte sequences between reads
  buffer += decoder.decode(value, { stream: true });
  // Events may span reads; keep any trailing partial line for the next read
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") continue;
    const chunk = JSON.parse(data);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
```
```shell
curl --no-buffer -X POST \
  https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
```
The --no-buffer flag disables curl's output buffering so chunks
are displayed as they arrive.