Streaming API

Complete reference for consuming streaming responses via Server-Sent Events (SSE).

SSE Format

When stream: true is set in the request, the response is delivered as a stream of Server-Sent Events. The connection uses the following HTTP headers:

HTTP Headers
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Request-ID: chatcmpl-abc123
X-Xerotier-Worker-ID: worker-7f3a

Wire Format

Each event is a line prefixed with data: followed by a JSON object and terminated by two newlines:

SSE Event
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...}\n\n

The stream ends with a [DONE] sentinel:

SSE Terminator
data: [DONE]\n\n

Clients should parse each data: line, check for [DONE], and decode the JSON for all other lines.
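The parsing rule above can be sketched as a small Python helper. This is an illustrative sketch, not part of any SDK; the function name and return convention are assumptions. It also skips SSE comment lines (heartbeats, see below) and non-data lines:

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line: return a decoded chunk dict, the string "done",
    or None for lines that carry no chunk (comments, blanks, event lines)."""
    line = line.strip()
    if not line or line.startswith(":"):
        return None  # blank separator or heartbeat comment
    if not line.startswith("data: "):
        return None  # e.g. "event: error" lines, handled separately
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return "done"
    return json.loads(payload)
```

A client loop then reads lines off the connection, stops on `"done"`, and processes every dict it gets back.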

Chunk Structure

Each streaming chunk follows this schema:

Field Type Description
id string Request identifier (same across all chunks).
object string Always "chat.completion.chunk".
created integer Unix timestamp.
model string Model identifier.
service_tier string | null Service tier. Always present, may be null.
system_fingerprint string | null System fingerprint. Always present, may be null.
choices array Array of choice objects containing delta content, finish_reason, and optional logprobs (when logprobs: true is set in the request).
usage object | null Token usage. Present only in the final chunk when stream_options.include_usage is true.

Delta Object

The delta field in each choice contains the incremental content:

Field Type Description
role string | null Present only in the first chunk ("assistant").
content string | null New text tokens. Null when no text is generated (e.g., tool calls).
tool_calls array | null Tool call deltas. See Tool Call Streaming.
refusal string | null Content filter refusal message (streamed incrementally).
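Reassembling these deltas into a final message can be sketched as follows. This is a minimal Python helper working on chunks as plain dicts; the function name and return shape are illustrative assumptions:

```python
def accumulate_deltas(chunks):
    """Fold streamed choice deltas into final message fields."""
    role, content, refusal = None, "", ""
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("role") is not None:
                role = delta["role"]  # sent only in the first chunk
            if delta.get("content"):
                content += delta["content"]
            if delta.get("refusal"):
                refusal += delta["refusal"]
    return {"role": role, "content": content or None, "refusal": refusal or None}
```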

finish_reason Values

Value Description
null Generation still in progress.
"stop" Model completed naturally or hit a stop sequence.
"length" Maximum token limit reached.
"content_filter" Content was filtered.
"tool_calls" Model generated tool calls.

Per-Chunk Log Probabilities

When logprobs: true is set in the request, each streaming choice includes a logprobs object with per-token log probabilities for the tokens in that chunk. The structure mirrors the non-streaming logprobs format:

Field Type Description
logprobs.content array | null Token log probabilities for content tokens in this chunk.
logprobs.refusal array | null Token log probabilities for refusal tokens in this chunk. Present when the model refuses to comply with a request.

Each entry in the content and refusal arrays contains token, logprob, bytes, and top_logprobs fields, identical to the non-streaming logprobs format. When logprobs is not requested, the field is null in all chunks.
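Collecting the per-chunk entries into one list for the whole response can be sketched like this (an illustrative Python helper over chunks as plain dicts; the name is an assumption):

```python
def collect_logprobs(chunks, field="content"):
    """Gather token logprob entries ("content" or "refusal") across all chunks."""
    entries = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            lp = choice.get("logprobs")
            if lp and lp.get(field):
                entries.extend(lp[field])
    return entries
```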

Refusal Streaming

When the model refuses a request due to content policy or safety filters, the refusal text is delivered incrementally via delta.refusal instead of delta.content. The finish_reason is typically "stop" and content will be null in the final response.

SSE Stream (Refusal)
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Refusal chunks (delta.refusal instead of delta.content)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":"I'm sorry, but I"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":" cannot help with that request."},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Clients should concatenate delta.refusal strings across chunks just like delta.content. When logprobs: true is set, refusal token probabilities appear in logprobs.refusal on each chunk.

Annotated Stream Example

SSE Stream
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Content chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" of France is Paris."},"finish_reason":null}]}

# Final chunk: finish_reason set, usage included (requires stream_options.include_usage: true)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":null},"completion_tokens_details":{"reasoning_tokens":null,"audio_tokens":null,"accepted_prediction_tokens":null,"rejected_prediction_tokens":null}}}

# Stream terminator
data: [DONE]

Usage Reporting

To receive token usage data in a streaming response, set stream_options in your request:

JSON
{
  "stream": true,
  "stream_options": {"include_usage": true}
}

When enabled, the final chunk (the one with finish_reason set) includes a usage object:

JSON
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 12,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  }
}

The prompt_tokens_details.cached_tokens field shows how many prompt tokens were served from the prefix cache, reducing time-to-first-token.

The completion_tokens_details object provides a breakdown of output tokens. For reasoning models (when reasoning_effort is set), reasoning_tokens shows how many tokens were used for internal reasoning. When prediction is used, accepted_prediction_tokens and rejected_prediction_tokens indicate how effective the prediction was.

Without stream_options.include_usage, the usage field is null in all chunks. Token usage is still tracked internally for billing.

Tool Call Streaming

When the model generates tool calls, the delta.tool_calls field contains incremental data. Tool calls are streamed across multiple chunks:

  • The first chunk for a tool call includes id, type, and the function name.
  • Subsequent chunks include only the arguments string, accumulated incrementally.
  • The index field identifies which tool call the delta belongs to (for parallel tool calls).

Tool Call Stream Example

SSE Stream
# First chunk: tool call start
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

# Arguments streamed incrementally
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Paris\"}"}}]},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}

data: [DONE]

Clients should concatenate the arguments strings across chunks and parse the complete JSON after finish_reason: "tool_calls".
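The three rules above (first chunk carries id/type/name, later chunks append arguments, index keys parallel calls) can be sketched as a Python accumulator. The function name and dict-based chunk shape are illustrative assumptions:

```python
import json

def accumulate_tool_calls(chunks):
    """Merge streamed tool_call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            for tc in (choice.get("delta", {}).get("tool_calls") or []):
                slot = calls.setdefault(tc["index"], {
                    "id": None, "type": None,
                    "function": {"name": None, "arguments": ""},
                })
                if tc.get("id"):
                    slot["id"] = tc["id"]
                if tc.get("type"):
                    slot["type"] = tc["type"]
                fn = tc.get("function", {})
                if fn.get("name"):
                    slot["function"]["name"] = fn["name"]
                slot["function"]["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]
```

After the stream ends with finish_reason: "tool_calls", each accumulated arguments string should parse as complete JSON.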

Error Handling

Pre-Stream Errors

Errors that occur before the SSE connection is established return standard HTTP status codes (400, 401, 404, 429, 503). The response body is a JSON error object, not an SSE stream.

Mid-Stream Errors

Errors that occur during an active stream use the event: error SSE event type (not the standard data: prefix). The [DONE] sentinel is always sent after the error:

SSE Error Event
event: error
data: {"error":{"message":"Request timed out after 30s. Your Free tier has a 30-second timeout limit.","type":"timeout_error","code":"timeout"}}

data: [DONE]
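Telling error events apart from ordinary chunks requires tracking the SSE event field, which applies to the data lines that follow it until a blank line ends the event. A minimal Python sketch (the generator name and tuple convention are illustrative assumptions):

```python
import json

def iter_stream_events(lines):
    """Yield ("chunk", dict), ("error", dict), or ("done", None) from raw SSE lines."""
    event_type = "message"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":"):
            continue                      # heartbeat comment
        if not line:
            event_type = "message"        # blank line ends the current event
            continue
        if line.startswith("event: "):
            event_type = line[len("event: "):]
            continue
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":
                yield ("done", None)
            elif event_type == "error":
                yield ("error", json.loads(data))
            else:
                yield ("chunk", json.loads(data))
```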

Error Types

Type Code Description
api_error varies Backend agent reported an error during generation.
timeout_error timeout Request exceeded the tier's deadline timeout.
stream_idle_timeout stream_idle_timeout No chunks received for the idle timeout period. See tier timeouts.
(none) cancelled Request was cancelled (client disconnect or server cancellation).

Client Disconnect

When a client disconnects during a stream, the router detects the broken connection and sends a cancellation request to the backend agent. The agent stops generation to free resources. Any tokens generated before disconnection are still billed.

Heartbeats

During idle periods (no chunks for 15 seconds), the router sends SSE comment frames to keep the connection alive:

SSE Comment
: heartbeat

Per the SSE specification, lines beginning with a colon are comments that clients silently ignore. These heartbeats prevent reverse proxies and load balancers from closing idle connections due to read timeouts. Heartbeats do not reset the idle stream timeout tracker.

Client Examples

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")
Node.js (OpenAI SDK)
Node.js (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const stream = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
  if (chunk.usage) {
    console.log(`\nTokens: ${chunk.usage.prompt_tokens} + ${chunk.usage.completion_tokens}`);
  }
}
JavaScript / TypeScript (Raw Fetch)
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer xero_myproject_your_api_key"
    },
    body: JSON.stringify({
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true
    })
  }
);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

outer: while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // { stream: true } keeps multi-byte characters split across reads intact
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep any partial trailing line for the next read
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") break outer;
    const chunk = JSON.parse(data);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
curl
curl --no-buffer -X POST \
  https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

The --no-buffer flag disables curl's output buffering so chunks are displayed as they arrive.