// XEM Guides

Incident Response

Diagnose, contain, and recover from an XEM execution incident: a wrong tool result, a destructive call that bypassed approval, a stuck agent, or a misbehaving tool. Work through the five phases in order: triage, contain, investigate, remediate, postmortem. Do not skip steps.

Triage

Within the first 5 minutes, answer three questions:

  1. Which executions are affected? xeroctl exec list returns the recent execution set. There are no built-in time-window or status filters; narrow the result client-side (for example by piping to jq) or cross-reference the approvals queue with xeroctl approvals list.
  2. What is the blast radius? Run xeroctl exec show <executionId> for each suspect execution and check the workspace, the tool, and the exit code and output. Use xeroctl exec raw <executionId> for the raw output blob.
  3. Is the XEM agent still healthy? xeroctl health dashboard shows the project-wide service health. xeroctl agents <id> --events prints the lifecycle event stream for a single agent.
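The second and third questions can be answered in one pass once you have a list of suspect execution ids. The following is a minimal sketch, not a supported tool: triage_sweep is a hypothetical helper name, the ids are placeholders, and it uses only the xeroctl commands named above.

```bash
# Triage sketch: show the record for each suspect execution, then the
# lifecycle events for the agent that ran them.
# triage_sweep is a hypothetical helper, not part of xeroctl.
triage_sweep() {
  if [ "$#" -lt 2 ]; then
    echo "usage: triage_sweep <agentId> <executionId>..." >&2
    return 2
  fi
  local agent_id=$1
  shift
  local exec_id
  for exec_id in "$@"; do
    echo "=== execution ${exec_id} ==="
    # Record view: workspace, tool, exit code, output.
    xeroctl exec show "$exec_id" \
      || echo "WARN: exec show failed for ${exec_id}" >&2
  done
  echo "=== agent ${agent_id} events ==="
  xeroctl agents "$agent_id" --events \
    || echo "WARN: could not read events for ${agent_id}" >&2
}
```

Call it as triage_sweep <agentId> <executionId>... and redirect the output to a file if you want to attach the snapshot to the incident.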

Then decide whether to contain locally (one workspace, via a workspace-scoped agent pause) or globally (pause every XEM agent).

Contain

The containment primitive is per-agent: xeroctl agents pause-exec <agentId>. To scope the pause to a single workspace's dispatch on that agent, pass --workspace <name>. There is no workspace-wide pause command; contain a workspace by pausing each agent that serves it.

bash
xeroctl agents pause-exec agt_xem_fleet_01 \
  --workspace kubernetes-prod

A paused agent accepts no new tool invocations on the scoped workspace; in-flight calls complete. pause-exec does not accept a reason flag, so record the incident ID and the reason for the pause in your tracker and carry both into the postmortem.
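Because there is no workspace-wide pause, containing a workspace means repeating the scoped pause for each agent that serves it. A minimal sketch of that loop follows. It assumes you already know which agents serve the workspace (this guide names no command that maps a workspace to its agents), and pause_workspace is a hypothetical helper name.

```bash
# Pause dispatch for one workspace on every agent supplied by the operator.
# pause_workspace is a hypothetical helper, not part of xeroctl.
pause_workspace() {
  if [ "$#" -lt 2 ]; then
    echo "usage: pause_workspace <workspace> <agentId>..." >&2
    return 2
  fi
  local workspace=$1
  shift
  local agent_id failed=0
  for agent_id in "$@"; do
    if xeroctl agents pause-exec "$agent_id" --workspace "$workspace"; then
      echo "paused ${agent_id} on ${workspace}"
    else
      # Keep going: a partial pause is better than none during containment.
      echo "FAILED to pause ${agent_id} on ${workspace}" >&2
      failed=1
    fi
  done
  # Non-zero when at least one agent could not be paused.
  return "$failed"
}
```

The loop deliberately continues past a failed pause and reports the failure through the exit status, so one unreachable agent does not leave the remaining agents dispatching.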

If the agent itself is the problem, omit --workspace to pause all of its dispatch:

bash
xeroctl agents pause-exec agt_xem_fleet_01

The agent's lease remains valid but it no longer receives dispatch frames; the router queues any pending invocations for redispatch once you run xeroctl agents resume-exec <agentId>.
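When several agents were paused during containment, it helps to resume them from a recorded list rather than from memory. The sketch below assumes you kept a plain-text file with one agent id per line; the file and the helper name resume_agents are illustrative, not part of xeroctl.

```bash
# Resume dispatch for every agent listed in a file (one agent id per line).
# Blank lines and lines starting with '#' are skipped.
# resume_agents is a hypothetical helper, not part of xeroctl.
resume_agents() {
  local list_file=$1 agent_id failed=0
  if [ ! -r "$list_file" ]; then
    echo "usage: resume_agents <file-with-agent-ids>" >&2
    return 2
  fi
  while IFS= read -r agent_id || [ -n "$agent_id" ]; do
    case "$agent_id" in
      ''|'#'*) continue ;;
    esac
    # </dev/null keeps the command from consuming the rest of the list.
    if ! xeroctl agents resume-exec "$agent_id" </dev/null; then
      echo "FAILED to resume ${agent_id}" >&2
      failed=1
    fi
  done < "$list_file"
  return "$failed"
}
```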

Investigate

Sources you will consult in every incident:

  1. Execution record. xeroctl exec show <executionId> reports the execution status, workspace, tool, and result metadata. xeroctl exec raw <executionId> returns the raw output blob, and xeroctl exec watch <executionId> follows a live execution. There is no CLI surface for the immutable audit log; view audit_logs rows through the operator dashboard.
  2. Approval queue. xeroctl approvals list and xeroctl approvals show <id> surface stuck or rejected approvals. Destructive-call gating routes through this queue, so incidents often correlate with expired entries.
  3. Agent events. xeroctl agents <id> --events prints the lifecycle event stream for the suspected agent.
  4. Structured logs on the agent. Run sudo journalctl -u xerotier-xem --since "30 minutes ago" --grep <executionId> on the agent host.
  5. Distributed trace. With XEROTIER_OTEL_ENABLED=true (and an XEROTIER_OTEL_ENDPOINT configured) the trace backend holds the per-frame timing from router to agent to tool to result. Tune sample rate via XEROTIER_OTEL_SAMPLE_RATE.

Correlate by execution id. The execution record and traces share the same id; logs print it on every line.
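Since every source keys on the execution id, the CLI and journal sources can be captured into one directory per execution. This is a rough sketch under two assumptions: it runs on the agent host (for journalctl), and xeroctl is also available there. If that is not the case, run the journalctl step separately. collect_evidence and the output file names are hypothetical.

```bash
# Gather the CLI and journal evidence for one execution into a directory.
# collect_evidence and the file names are illustrative, not part of xeroctl.
collect_evidence() {
  if [ "$#" -ne 2 ]; then
    echo "usage: collect_evidence <executionId> <outDir>" >&2
    return 2
  fi
  local exec_id=$1 out_dir=$2
  mkdir -p "$out_dir" || return 1

  # Execution record and raw output blob.
  xeroctl exec show "$exec_id" > "$out_dir/exec-show.txt" 2>&1 \
    || echo "WARN: exec show failed" >&2
  xeroctl exec raw "$exec_id" > "$out_dir/exec-raw.out" 2>&1 \
    || echo "WARN: exec raw failed" >&2

  # Approval queue snapshot, to spot stuck, rejected, or expired entries.
  xeroctl approvals list > "$out_dir/approvals.txt" 2>&1 \
    || echo "WARN: approvals list failed" >&2

  # Agent-side structured logs; widen --since for older incidents.
  sudo journalctl -u xerotier-xem --since "30 minutes ago" \
    --grep "$exec_id" > "$out_dir/agent-journal.log" 2>&1 \
    || echo "WARN: journalctl returned no match or failed" >&2

  echo "evidence written to ${out_dir}"
}
```

Audit log rows and distributed traces are not covered here; the audit log has no CLI surface, and trace retrieval depends on your trace backend.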

Remediate

Typical remediation paths, from cheapest to most invasive:

  • Inspect the auto-fork-branch. If the tool call triggered auto-fork-branch, the parent chat has an unreached branch recorded immediately before the destructive call. Locate it with xeroctl chats branches list <chatId>. There is no CLI rollback verb today; consume the branch through the chat UI or document the manual recovery steps in the postmortem.
  • Corrective tool call. Run a read-only probe first, then a reverse mutation (e.g. kubectl apply of the pre-change manifest).
  • Rotate credentials. If the incident points to a credential leak, see Credential Rotation.
  • Rotate CURVE keys. If the incident points to a CURVE key leak, run xeroctl agents curve-rotate <agentId> [--reason <text>].
  • Revoke the agent. Use xeroctl agents revoke-credentials <agentId> to invalidate the agent's credentials while keeping the record, or xeroctl agents <agentId> --delete for permanent removal.
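For the corrective tool call, the probe-then-mutate pattern might look like the sketch below when the reverse mutation is a kubectl apply of the pre-change manifest. It relies on the documented kubectl diff exit codes (0 for no differences, 1 for differences, greater than 1 for an error). revert_to_manifest is a hypothetical helper name, and whether you run kubectl directly or route the call through your normal approval path is a decision for your environment.

```bash
# Read-only probe first (kubectl diff), then the reverse mutation
# (kubectl apply) only when the live state actually differs.
# revert_to_manifest is a hypothetical helper name.
revert_to_manifest() {
  if [ "$#" -ne 1 ]; then
    echo "usage: revert_to_manifest <pre-change-manifest>" >&2
    return 2
  fi
  local manifest=$1 rc=0
  kubectl diff -f "$manifest" || rc=$?
  case "$rc" in
    0)
      echo "live state already matches ${manifest}; nothing to apply"
      ;;
    1)
      # Differences found: re-apply the pre-change manifest.
      kubectl apply -f "$manifest"
      ;;
    *)
      echo "kubectl diff failed (exit ${rc}); not applying" >&2
      return "$rc"
      ;;
  esac
}
```

Refusing to apply when the probe itself fails keeps a broken kubeconfig or an unreachable cluster from turning into a second blind mutation.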

Postmortem

After containment, a written postmortem is expected for every incident that altered production state unexpectedly. Minimum fields:

  • Incident ID and window.
  • Affected invocations (with IDs).
  • Root cause (template, tool, credential, or model misbehavior).
  • Detection path and time-to-detect.
  • Remediation actions.
  • Preventive follow-ups (chat template update, tool risk reclassification, approval policy tightening).

Attach an execution archive to the postmortem. Generate one with xeroctl exports create --type executions --since <ts> --until <ts>, then poll with xeroctl exports status and pull the bundle via xeroctl exports download.
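A small wrapper can guard against a reversed incident window before the export is requested. This sketch assumes the <ts> values are UTC timestamps in a sortable form such as 2024-05-01T12:00:00Z; the guide does not state the accepted format, so check it against your xeroctl version. export_incident_window is a hypothetical helper, and the status and download steps are left to the commands named above because their arguments are not documented here.

```bash
# Request an execution archive for an incident window.
# Assumption: timestamps are UTC strings that sort lexically,
# for example 2024-05-01T12:00:00Z (format not confirmed by this guide).
# export_incident_window is a hypothetical helper, not part of xeroctl.
export_incident_window() {
  if [ "$#" -ne 2 ]; then
    echo "usage: export_incident_window <since> <until>" >&2
    return 2
  fi
  local since=$1 until_ts=$2
  # Lexical comparison is only valid under the format assumption above.
  if ! [ "$since" \< "$until_ts" ]; then
    echo "since (${since}) must be earlier than until (${until_ts})" >&2
    return 2
  fi
  xeroctl exports create --type executions \
    --since "$since" --until "$until_ts"
}
```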