Streaming API
Complete reference for consuming streaming responses via Server-Sent Events (SSE).
SSE Format
When stream: true is set in the request, the response is delivered
as a stream of Server-Sent Events. The connection uses the following HTTP headers:
```text
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Request-ID: chatcmpl-abc123
X-Xerotier-Worker-ID: worker-7f3a
```
Wire Format
Each event is a line prefixed with data: followed by a JSON object, terminated by two newlines:

```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...}\n\n
```

The stream ends with a [DONE] sentinel:

```text
data: [DONE]\n\n
```
Clients should parse each data: line, check for [DONE], and decode the JSON for all other lines.
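As a concrete sketch, a framework-agnostic parser for this wire format (the helper name is illustrative) might look like:

```python
import json

def parse_sse_lines(lines):
    """Yield decoded JSON chunks from an iterable of SSE text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":"):
            continue  # comment frame (e.g. heartbeat); clients ignore these
        if not line.startswith("data: "):
            continue  # blank separators between events
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # stream terminator
        yield json.loads(data)

# Example with captured wire lines:
wire = [
    'data: {"choices":[{"delta":{"content":"Hi"}}]}\n',
    "\n",
    ": heartbeat\n",
    "data: [DONE]\n",
]
chunks = list(parse_sse_lines(wire))  # one decoded chunk
```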
Chunk Structure
Each streaming chunk follows this schema:
| Field | Type | Description |
|---|---|---|
| id | string | Request identifier (same across all chunks). |
| object | string | Always "chat.completion.chunk". |
| created | integer | Unix timestamp. |
| model | string | Model identifier. |
| service_tier | string or null | Service tier. Always present, may be null. |
| system_fingerprint | string or null | System fingerprint. Always present, may be null. |
| choices | array | Array of choice objects containing delta content, finish_reason, and optional logprobs (when logprobs: true is set in the request). |
| usage | object or null | Token usage. Present only in the final chunk when stream_options.include_usage is true. |
Delta Object
The delta field in each choice contains the incremental content:
| Field | Type | Description |
|---|---|---|
| role | string or null | Present only in the first chunk ("assistant"). |
| content | string or null | New text tokens. Null when no text is generated (e.g., tool calls). |
| tool_calls | array or null | Tool call deltas. See Tool Call Streaming. |
| refusal | string or null | Content filter refusal message (streamed incrementally). |
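Since each chunk carries only an increment, clients rebuild the full message by concatenating deltas. A minimal sketch over decoded chunk dicts (helper name illustrative):

```python
def accumulate_deltas(chunks):
    """Reassemble a streamed message from decoded chunk dicts."""
    message = {"role": None, "content": "", "refusal": ""}
    for chunk in chunks:
        if not chunk["choices"]:
            continue  # a usage-only final chunk has an empty choices array
        delta = chunk["choices"][0]["delta"]
        if delta.get("role"):
            message["role"] = delta["role"]        # first chunk only
        if delta.get("content"):
            message["content"] += delta["content"]
        if delta.get("refusal"):
            message["refusal"] += delta["refusal"]
    return message

chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
]
message = accumulate_deltas(chunks)
```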
finish_reason Values
| Value | Description |
|---|---|
| null | Generation still in progress. |
| "stop" | Model completed naturally or hit a stop sequence. |
| "length" | Maximum token limit reached. |
| "content_filter" | Content was filtered. |
| "tool_calls" | Model generated tool calls. |
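In client code this typically becomes a dispatch on the first non-null value; an illustrative sketch (the action names are hypothetical):

```python
def handle_finish(finish_reason):
    """Map a chunk's finish_reason to a client-side action (illustrative)."""
    if finish_reason is None:
        return "continue"        # more chunks coming
    if finish_reason == "stop":
        return "done"            # natural completion or stop sequence
    if finish_reason == "length":
        return "truncated"       # consider a higher max_tokens
    if finish_reason == "content_filter":
        return "filtered"
    if finish_reason == "tool_calls":
        return "run_tools"       # execute the accumulated tool calls
    raise ValueError(f"unexpected finish_reason: {finish_reason!r}")
```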
Per-Chunk Log Probabilities
When logprobs: true is set in the request, each streaming choice
includes a logprobs object with per-token log probabilities for the
tokens in that chunk. The structure mirrors the non-streaming logprobs format:
| Field | Type | Description |
|---|---|---|
| logprobs.content | array or null | Token log probabilities for content tokens in this chunk. |
| logprobs.refusal | array or null | Token log probabilities for refusal tokens in this chunk. Present when the model refuses to comply with a request. |
Each entry in the content and refusal arrays contains
token, logprob, bytes, and
top_logprobs fields, identical to the non-streaming logprobs format.
When logprobs is not requested, the field is null in all chunks.
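Collecting per-token log probabilities across chunks can be sketched as follows (helper name illustrative):

```python
def collect_logprobs(chunks):
    """Gather (token, logprob) pairs from streamed chunk dicts (sketch)."""
    pairs = []
    for chunk in chunks:
        for choice in chunk["choices"]:
            lp = choice.get("logprobs")
            if lp and lp.get("content"):
                for entry in lp["content"]:
                    pairs.append((entry["token"], entry["logprob"]))
    return pairs

chunks = [
    {"choices": [{"logprobs": {"content": [
        {"token": "The", "logprob": -0.12, "bytes": [84, 104, 101],
         "top_logprobs": []}]}}]},
    {"choices": [{"logprobs": None}]},  # field is null when not requested
]
pairs = collect_logprobs(chunks)
```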
Refusal Streaming
When the model refuses a request due to content policy or safety filters, the
refusal text is delivered incrementally via delta.refusal instead
of delta.content. The finish_reason is typically
"stop" and content will be null in the final response.
```text
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Refusal chunks (delta.refusal instead of delta.content)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":"I'm sorry, but I"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"refusal":" cannot help with that request."},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```
Clients should concatenate delta.refusal strings across chunks just
like delta.content. When logprobs: true is set, refusal
token probabilities appear in logprobs.refusal on each chunk.
Annotated Stream Example
```text
# First chunk: role assignment
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

# Content chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" of France is Paris."},"finish_reason":null}]}

# Final chunk: finish_reason set, usage included (requires stream_options.include_usage: true)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":null},"completion_tokens_details":{"reasoning_tokens":null,"audio_tokens":null,"accepted_prediction_tokens":null,"rejected_prediction_tokens":null}}}

# Stream terminator
data: [DONE]
```
Usage Reporting
To receive token usage data in a streaming response, set stream_options
in your request:
```json
{
  "stream": true,
  "stream_options": {"include_usage": true}
}
```
When enabled, the final chunk (the one with finish_reason set)
includes a usage object:
```json
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 12,
      "audio_tokens": null
    },
    "completion_tokens_details": {
      "reasoning_tokens": null,
      "audio_tokens": null,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  }
}
```
The prompt_tokens_details.cached_tokens field shows how many prompt
tokens were served from the prefix cache, reducing time-to-first-token.
The completion_tokens_details object provides a breakdown of output
tokens. For reasoning models (when reasoning_effort is set),
reasoning_tokens shows how many tokens were used for internal
reasoning. When prediction is used,
accepted_prediction_tokens and rejected_prediction_tokens
indicate how effective the prediction was.
Without stream_options.include_usage, the usage field
is null in all chunks. Token usage is still tracked internally for billing.
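Picking the usage off the final chunk and deriving a cache-hit ratio from cached_tokens might be sketched as (helper name illustrative):

```python
def summarize_usage(chunks):
    """Return totals from the final usage-bearing chunk, or None."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]   # only the final chunk carries usage
    if usage is None:
        return None  # stream_options.include_usage was not set
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    prompt = usage["prompt_tokens"]
    return {
        "total": usage["total_tokens"],
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
    }

chunks = [
    {"usage": None},
    {"usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33,
               "prompt_tokens_details": {"cached_tokens": 12}}},
]
summary = summarize_usage(chunks)  # cache_hit_ratio = 12 / 25 = 0.48
```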
Tool Call Streaming
When the model generates tool calls, the delta.tool_calls field
contains incremental data. Tool calls are streamed across multiple chunks:
- The first chunk for a tool call includes id, type, and the function name.
- Subsequent chunks include only the arguments string, accumulated incrementally.
- The index field identifies which tool call the delta belongs to (for parallel tool calls).
Tool Call Stream Example
```text
# First chunk: tool call start
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

# Arguments streamed incrementally
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Paris\"}"}}]},"finish_reason":null}]}

# Final chunk
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706123456,"model":"llama-3.1-8b","service_tier":null,"system_fingerprint":null,"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
data: [DONE]
```
Clients should concatenate the arguments strings across chunks
and parse the complete JSON after finish_reason: "tool_calls".
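That accumulate-then-parse step can be sketched as follows (helper name illustrative):

```python
import json

def accumulate_tool_calls(chunks):
    """Merge streamed tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        for tc in chunk["choices"][0]["delta"].get("tool_calls") or []:
            call = calls.setdefault(
                tc["index"], {"id": None, "name": None, "arguments": ""})
            if tc.get("id"):
                call["id"] = tc["id"]          # first chunk for this call
            fn = tc.get("function", {})
            if fn.get("name"):
                call["name"] = fn["name"]
            call["arguments"] += fn.get("arguments", "")
    # Parse each accumulated argument string once the stream is finished
    for call in calls.values():
        call["arguments"] = json.loads(call["arguments"])
    return calls

chunks = [
    {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
        "function": {"arguments": "{\"location\":"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
        "function": {"arguments": "\"Paris\"}"}}]}}]},
    {"choices": [{"delta": {}}]},
]
calls = accumulate_tool_calls(chunks)
```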
Error Handling
Pre-Stream Errors
Errors that occur before the SSE connection is established return standard HTTP status codes (400, 401, 404, 429, 503). The response body is a JSON error object, not an SSE stream.
Mid-Stream Errors
Errors that occur during an active stream use the event: error SSE
event type (not the standard data: prefix). The [DONE]
sentinel is always sent after the error:
```text
event: error
data: {"error":{"message":"Request timed out after 30s. Your Free tier has a 30-second timeout limit.","type":"timeout_error","code":"timeout"}}

data: [DONE]
```
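Handling these frames requires tracking the event: field as well as data: lines; a sketch (helper name illustrative):

```python
import json

def parse_sse_events(lines):
    """Yield ("chunk" | "error", payload) pairs from SSE text lines."""
    event_type = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event: "):
            event_type = line[len("event: "):]
            continue
        if not line.startswith("data: "):
            event_type = None  # blank line ends the current event
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # terminator is sent even after an error
        kind = "error" if event_type == "error" else "chunk"
        yield kind, json.loads(data)
        event_type = None

wire = [
    "event: error\n",
    'data: {"error":{"message":"Request timed out after 30s.",'
    '"type":"timeout_error","code":"timeout"}}\n',
    "\n",
    "data: [DONE]\n",
]
events = list(parse_sse_events(wire))  # one ("error", ...) pair
```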
Error Types
| Type | Code | Description |
|---|---|---|
| api_error | varies | Backend agent reported an error during generation. |
| timeout_error | timeout | Request exceeded the tier's deadline timeout. |
| stream_idle_timeout | stream_idle_timeout | No chunks received for the idle timeout period. See tier timeouts. |
| (none) | cancelled | Request was cancelled (client disconnect or server cancellation). |
Client Disconnect
When a client disconnects during a stream, the router detects the broken connection and sends a cancellation request to the backend agent. The agent stops generation to free resources. Any tokens generated before disconnection are still billed.
Heartbeats
During idle periods (no chunks for 15 seconds), the router sends SSE comment frames to keep the connection alive:
```text
: heartbeat
```
Per the SSE specification, lines beginning with a colon are comments that clients silently ignore. These heartbeats prevent reverse proxies and load balancers from closing idle connections due to read timeouts. Heartbeats do not reset the idle stream timeout tracker.
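A client that tracks its own idle time should likewise count only data frames, not comment frames; a sketch (helper names illustrative):

```python
def is_comment(line):
    """SSE comment frames begin with a colon; clients silently ignore them."""
    return line.startswith(":")

def frames_that_reset_idle(lines):
    """Return only the frames that should refresh a client-side idle timer."""
    return [ln for ln in lines if ln.strip() and not is_comment(ln)]

wire = [": heartbeat\n", "\n", 'data: {"choices":[]}\n', ": heartbeat\n"]
resets = frames_that_reset_idle(wire)  # only the data frame remains
```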
Client Examples
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
    api_key="xero_myproject_your_api_key"
)

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1",
  apiKey: "xero_myproject_your_api_key"
});

const stream = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) process.stdout.write(content);
  if (chunk.usage) {
    console.log(`\nTokens: ${chunk.usage.prompt_tokens} + ${chunk.usage.completion_tokens}`);
  }
}
```
```javascript
const response = await fetch(
  "https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer xero_myproject_your_api_key"
    },
    body: JSON.stringify({
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true
    })
  }
);

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true holds back incomplete multi-byte sequences between reads
  buffer += decoder.decode(value, { stream: true });
  // Events may span reads; keep any trailing partial line for the next read
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") continue;
    const chunk = JSON.parse(data);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
```
```shell
curl --no-buffer -X POST \
  https://api.xerotier.ai/proj_ABC123/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer xero_myproject_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
```
The --no-buffer flag disables curl's output buffering so chunks
are displayed as they arrive.