Incident Response
Diagnose, contain, and recover from an XEM execution incident: a wrong tool result, a destructive call that bypassed approval, a stuck agent, or a misbehaving tool. Work through five stages in order: triage, contain, investigate, remediate, postmortem. Do not skip steps.
Triage
Within the first 5 minutes, answer three questions:
- Which executions are affected? xeroctl exec list returns the recent execution set. There are no built-in time-window or status filters; narrow the result client-side (for example by piping to jq) or cross-reference the approvals queue with xeroctl approvals list.
- What is the blast radius? Run xeroctl exec show <executionId> for each suspect execution. Look at the workspace, the tool, and the exit code and output. Use xeroctl exec raw <executionId> for the raw output blob.
- Is the XEM agent still healthy? xeroctl health dashboard shows the project-wide service health. xeroctl agents <id> --events prints the lifecycle event stream for a single agent.
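The blast-radius check can be scripted so that every suspect execution is captured the same way. The following is a minimal sketch that uses only the exec show and exec raw subcommands named above; the function name, the output directory layout, and the file names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: snapshot each suspect execution into a per-incident directory.
# Assumption: the directory layout and file names are illustrative only.

snapshot_executions() {
  local outdir="$1"; shift
  mkdir -p "$outdir" || return 1
  local id
  for id in "$@"; do
    # Summary record: workspace, tool, exit code and output.
    xeroctl exec show "$id" > "$outdir/$id.show.txt" \
      || echo "exec show failed for $id" >&2
    # Raw output blob, kept separately so it can be attached later.
    xeroctl exec raw "$id" > "$outdir/$id.raw.txt" \
      || echo "exec raw failed for $id" >&2
  done
}
```

Call it as snapshot_executions <directory> <executionId>... once you have the list of suspect ids. Failures are reported per id rather than aborting, so one unreadable execution does not block the rest of triage.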
Decide: contain locally (one workspace via a scoped agent pause) or contain globally (pause every XEM agent).
Contain
The containment primitive is per-agent: xeroctl agents pause-exec <agentId>. To scope the pause to a single workspace's dispatch on that agent, pass --workspace <name>. There is no workspace-wide pause command; contain a workspace by pausing each agent that serves it.
xeroctl agents pause-exec agt_xem_fleet_01 \
--workspace kubernetes-prod
A paused agent accepts no new tool invocations on the scoped workspace; in-flight calls complete. Record the incident id in your tracker; the CLI does not accept a reason flag, so attach it to the postmortem instead.
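Because there is no workspace-wide pause, containing a workspace means repeating the scoped pause for each agent that serves it. A minimal sketch of that loop follows; it assumes the operator already knows which agents serve the workspace and passes their ids in, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: contain one workspace by pausing every agent that serves it.
# Assumption: the agent ids are supplied by the operator; this script
# does not discover which agents serve the workspace.

pause_workspace() {
  local workspace="$1"; shift
  local agent failed=0
  for agent in "$@"; do
    if xeroctl agents pause-exec "$agent" --workspace "$workspace"; then
      echo "paused $agent on $workspace"
    else
      echo "FAILED to pause $agent on $workspace" >&2
      failed=1
    fi
  done
  # Non-zero if any agent could not be paused, so the caller can escalate.
  return "$failed"
}
```

The loop keeps going after a failure on purpose: partial containment is better than none, and the non-zero return tells you that at least one agent still needs attention.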
If the agent itself is the problem, omit --workspace to pause all of its dispatch:
xeroctl agents pause-exec agt_xem_fleet_01
The agent's lease remains valid but it no longer receives dispatch frames; the router queues any pending invocations for redispatch once you run xeroctl agents resume-exec <agentId>.
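When several agents were paused, resuming them one by one is easy to get wrong. The sketch below resumes every agent listed in a plain-text file, one id per line. The file and the function name are assumptions; the CLI is not assumed to track which agents you paused, so the list is something you keep yourself during containment.

```shell
#!/usr/bin/env bash
# Sketch: resume dispatch for every agent listed in a file, one id per line.
# Assumption: the list file was maintained by the operator during containment.

resume_from_list() {
  local list="$1" agent failed=0
  while IFS= read -r agent || [ -n "$agent" ]; do
    # Skip blank lines and comment lines.
    case "$agent" in ''|'#'*) continue ;; esac
    # Redirect stdin so the CLI cannot consume the rest of the list.
    xeroctl agents resume-exec "$agent" < /dev/null \
      || { echo "FAILED to resume $agent" >&2; failed=1; }
  done < "$list"
  return "$failed"
}
```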
Investigate
Sources you will consult in every incident:
- Execution record. xeroctl exec show <executionId> reports the execution status, workspace, tool, and result metadata. xeroctl exec raw <executionId> returns the raw output blob, and xeroctl exec watch <executionId> follows a live execution. There is no CLI surface for the immutable audit log; view audit_logs rows through the operator dashboard.
- Approval queue. xeroctl approvals list and xeroctl approvals show <id> surface stuck or rejected approvals. Destructive-call gating routes through this queue, so incidents often correlate with expired entries.
- Agent events. xeroctl agents <id> --events prints the lifecycle event stream for the suspected agent.
- Structured logs on the agent. sudo journalctl -u xerotier-xem --since "30 minutes ago" --grep <executionId> on the agent host.
- Distributed trace. With XEROTIER_OTEL_ENABLED=true (and an XEROTIER_OTEL_ENDPOINT configured) the trace backend holds the per-frame timing from router to agent to tool to result. Tune sample rate via XEROTIER_OTEL_SAMPLE_RATE.
Correlate by execution id. The execution record and traces share the same id; logs print it on every line.
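Since the execution id is the join key, it helps to pull the agent-side evidence for one id into a single place. This is a minimal sketch that reuses the agent events and journalctl commands from the list above; it assumes it runs on the agent host, and the function name and output file names are hypothetical. The --no-pager option is a standard journalctl flag added here for non-interactive use.

```shell
#!/usr/bin/env bash
# Sketch: gather agent-side evidence for one execution id.
# Assumption: runs on the agent host, where the xerotier-xem unit logs.

collect_agent_evidence() {
  local exec_id="$1" agent_id="$2" outdir="$3"
  mkdir -p "$outdir" || return 1
  # Lifecycle event stream for the suspected agent.
  xeroctl agents "$agent_id" --events > "$outdir/agent-events.txt"
  # Journal lines that mention the execution id, last 30 minutes only.
  sudo journalctl -u xerotier-xem --since "30 minutes ago" \
    --grep "$exec_id" --no-pager > "$outdir/agent-journal.txt"
}
```

Widen the --since window if the incident started earlier than the last 30 minutes.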
Remediate
Typical remediation paths, from cheapest to most invasive:
- Inspect the auto-fork-branch. If the tool call triggered auto-fork-branch, the parent chat has an unreached branch recorded immediately before the destructive call. Locate it with xeroctl chats branches list <chatId>. There is no CLI rollback verb today; consume the branch through the chat UI or document the manual recovery steps in the postmortem.
- Corrective tool call. Run a read-only probe first, then a reverse mutation (e.g. kubectl apply of the pre-change manifest).
- Rotate credentials. If the incident indicates a credential leak, see Credential Rotation.
- Rotate CURVE keys. If the incident indicates a CURVE-key leak, run xeroctl agents curve-rotate <agentId> [--reason <text>].
- Revoke the agent. Use xeroctl agents revoke-credentials <agentId> to invalidate the agent's credentials while keeping the record, or xeroctl agents <agentId> --delete for permanent removal.
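For the corrective tool call, the probe-then-mutate order can be enforced in a small wrapper. The sketch below uses kubectl diff as the read-only probe and applies the manifest only when the probe reports a difference. It assumes the pre-change manifest was saved to a file and that you are issuing kubectl directly; whether the call should instead go through an XEM tool invocation depends on your setup. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read-only probe first, then the reverse mutation.
# Assumption: the pre-change manifest is available as a local file.

revert_to_manifest() {
  local manifest="$1" rc=0
  # kubectl diff exits 0 when live state matches the manifest,
  # 1 when it differs, and greater than 1 on error.
  kubectl diff -f "$manifest" || rc=$?
  case "$rc" in
    0) echo "live state already matches $manifest; nothing to apply" ;;
    1) kubectl apply -f "$manifest" ;;
    *) echo "kubectl diff failed (exit $rc); not applying" >&2
       return "$rc" ;;
  esac
}
```

Refusing to apply when the probe itself errors keeps a broken cluster connection or a malformed manifest from turning into a second incident.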
Postmortem
After containment, a written postmortem is expected for every incident that altered production state unexpectedly. Minimum fields:
- Incident ID and window.
- Affected invocations (with IDs).
- Root cause (template, tool, credential, or model misbehavior).
- Detection path and time-to-detect.
- Remediation actions.
- Preventive follow-ups (chat template update, tool risk reclassification, approval policy tightening).
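To keep postmortems consistent, the minimum fields can be stamped out as a skeleton at the start of the write-up. This is only a sketch: the function name, the plain-text layout, and the TODO placeholders are assumptions, not a format the platform prescribes.

```shell
#!/usr/bin/env bash
# Sketch: print a postmortem skeleton containing the minimum fields.
# Assumption: layout and TODO placeholders are illustrative.

postmortem_skeleton() {
  local incident_id="$1" window="$2"
  cat <<EOF
Incident ID: $incident_id
Window: $window
Affected invocations (with IDs): TODO
Root cause (template, tool, credential, or model misbehavior): TODO
Detection path: TODO
Time-to-detect: TODO
Remediation actions: TODO
Preventive follow-ups: TODO
EOF
}
```

Redirect the output to a file in your tracker or repository and fill in the placeholders as the investigation closes.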
Attach an execution archive to the postmortem. Generate one with xeroctl exports create --type executions --since <ts> --until <ts>, then poll with xeroctl exports status and pull the bundle via xeroctl exports download.