// Execution Management (XEM)

Operator Guides Index

Task-first procedures: one operator, one outcome, five to fifteen minutes each. The order tracks the first-week lifecycle: deploy a XEM, author a template against it, run the approval cadence that template drives, and unwind a problem when one surfaces.

1. Deploy Your First XEM

Goal: take a fresh host with cluster credentials and get it serving tool calls inside a Xerotier project in under five minutes.

  1. On the XEM host, ensure local credentials exist for the bundles you intend to expose (a readable ~/.kube/config for Kubernetes, valid ~/.aws for cloud bundles).
  2. From your local workstation, mint a single-use join key: xeroctl agents join-keys --create --name <name> --region <region> --router-addr tcp://<router-host>:5555. The command prints the join-key value; capture it once.
  3. Copy the value to the XEM host and either export it as XEM_JOIN_KEY (or XEROTIER_AGENT_JOIN_KEY for XIM-mode workers) before launching the container, or pass it via the --join-key CLI flag. On a self-signed cluster also set XEROTIER_AGENT_ALLOW_INSECURE=true.
  4. Run xeroctl agents. The new agent should appear in the listing with a healthy status.
  5. Inspect what it published with xeroctl agents <agent-id>. The default handler renders the registered tool surface and last-heartbeat fields the XEM advertised.

The XEM is now reachable from any operational workspace bound to it. Bind one with xeroctl workspaces bind <ws> --agent <agent-id>.

Env vars worth knowing: XEM_JOIN_KEY (or XEROTIER_AGENT_JOIN_KEY for XIM workers), XEROTIER_AGENT_ALLOW_INSECURE, XEROTIER_AGENT_LOG_LEVEL, XEROTIER_AGENT_METRICS_PORT, XEROTIER_XEM_METRICS_PORT.
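
A minimal preflight sketch for step 3, useful only when the join key is supplied through the environment rather than the --join-key flag. The variable names come from the list above; the function names, the xim_mode and self_signed parameters, and the exact-match comparison against "true" are assumptions made for illustration, not documented agent behavior.

```python
from typing import List, Mapping, Optional


def resolve_join_key(env: Mapping[str, str], xim_mode: bool = False) -> Optional[str]:
    """Return the join key from the environment, or None if it is unset or blank."""
    # XIM-mode workers read a different variable than XEM hosts.
    name = "XEROTIER_AGENT_JOIN_KEY" if xim_mode else "XEM_JOIN_KEY"
    value = env.get(name, "").strip()
    return value or None


def preflight(env: Mapping[str, str], self_signed: bool = False,
              xim_mode: bool = False) -> List[str]:
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if resolve_join_key(env, xim_mode) is None:
        problems.append("join key is not set in the environment")
    # Assumption: the agent expects the literal string "true".
    if self_signed and env.get("XEROTIER_AGENT_ALLOW_INSECURE") != "true":
        problems.append("XEROTIER_AGENT_ALLOW_INSECURE=true is required on a self-signed cluster")
    return problems
```

Calling preflight(os.environ, self_signed=True) on the XEM host before launching the container would surface a missing key or insecure flag before the agent fails to enroll.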

2. Author a Chat Template

Goal: encode a recurring procedure (a runbook successor) as a chat template so operators and automation follow the same plan every time.

  1. Start from a known-good baseline: dump the nearest platform template to a local file with xeroctl templates show <template-id> -o json > local.json.
  2. Edit the JSON; the top-level fields are system_prompt, allowed_tools, approval_cadence, and idle_timeout.
  3. Write the system prompt as a numbered procedure. State the pre-conditions, the step sequence, the post-conditions, and what to do on failure. Keep it explicit about which steps MUST NOT be skipped.
  4. Restrict allowed_tools to the minimum set required. Less is safer.
  5. Create the new template from the edited file with xeroctl templates create --file local.json. To bind the resulting template to a workspace use xeroctl templates apply <template-id> --workspace <ws>.
  6. Dry-run in a test workspace before letting the template touch production. The template-authoring tutorial walks through the drain-node example end to end.
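
Before running xeroctl templates create, the edited file can be sanity-checked locally. The sketch below only confirms that the four top-level fields named in step 2 are present and that allowed_tools stays short. It assumes allowed_tools is a JSON array; the max_tools threshold and the function names are hypothetical, and the real template schema may enforce more than this.

```python
import json
from typing import List

# Top-level fields named in step 2 of this guide.
REQUIRED_FIELDS = ("system_prompt", "allowed_tools", "approval_cadence", "idle_timeout")


def check_template(template: dict, max_tools: int = 5) -> List[str]:
    """Return a list of problems found in a template dict; empty means none found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in template:
            problems.append(f"missing top-level field: {field}")
    tools = template.get("allowed_tools", [])
    if not isinstance(tools, list):
        # Assumption: allowed_tools is a JSON array.
        problems.append("allowed_tools should be a list")
    elif len(tools) > max_tools:
        # max_tools is an arbitrary local threshold, not a platform limit.
        problems.append(f"allowed_tools has {len(tools)} entries; trim to the minimum required")
    return problems


def check_template_file(path: str, max_tools: int = 5) -> List[str]:
    """Load a template JSON file (for example local.json) and check it."""
    with open(path, "r", encoding="utf-8") as handle:
        return check_template(json.load(handle), max_tools)
```

Running check_template_file("local.json") after step 4 and before step 5 catches a dropped field before the create call does.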

3. Manage Approvals

Goal: configure who approves what, and in what cadence, for every operational workspace.

  1. List the open approvals operators are waiting on with xeroctl approvals list --workspace <wid>. Filter by status, risk, or agent as needed.
  2. Inspect a single request (prompt, tool call, risk, timeout) with xeroctl approvals show <approval-id>.
  3. Stream new requests as they land: xeroctl approvals watch --workspace <wid> prints an SSE feed suitable for piping into a notifier.
  4. Decide a request from the CLI with xeroctl approvals approve <approval-id> --note <reason> or xeroctl approvals reject <approval-id> --note <reason>. Reviewers may also act in the dashboard Approvals tab.
  5. Approval-policy authoring (per-risk timeouts, escalation chains, delegation windows) lives in the workspace settings UI today; there is no xeroctl approvals policy subcommand. If your alert payload mentions retry_after seconds or the response carries a Retry-After header, wait that long before re-issuing the call; the router's rate-limit middleware sets both.
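
The wait rule in step 5 can be sketched as a small helper for a notifier or wrapper script. It assumes both hints are expressed in seconds and that the payload key is literally retry_after at the top level; the function name is hypothetical. An HTTP-date form of Retry-After is deliberately ignored here.

```python
from typing import Mapping, Optional


def retry_wait_seconds(headers: Mapping[str, str],
                       payload: Mapping[str, object]) -> Optional[float]:
    """Return how long to wait before re-issuing the call, or None if no hint exists."""
    # Header names are case-insensitive, so match on the lowered name.
    header_value = None
    for name, value in headers.items():
        if name.lower() == "retry-after":
            header_value = value

    candidates = []
    for raw in (header_value, payload.get("retry_after")):
        if raw is None:
            continue
        try:
            seconds = float(raw)
        except (TypeError, ValueError):
            continue  # for example an HTTP-date; not handled in this sketch
        if seconds >= 0:
            candidates.append(seconds)

    # If both hints are present and disagree, honor the longer one.
    return max(candidates) if candidates else None
```

Taking the larger of the two values is a conservative choice for the case where the header and payload disagree; the source states only that the middleware sets both.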

4. Respond to Alerts

Goal: triage a Prometheus alert (or PagerDuty page) the fleet emitted.

  1. Look at the alert label set. Every XEM alert carries project, workspace, and agent. The label set names the blast radius.
  2. Open /ops and filter the Operations tab by the same labels. The most recent failure should be obvious.
  3. Click through to the execution detail. The modal shows the full tool call, stdout, stderr, and audit timeline.
  4. If the XEM is misbehaving (wrong output, stale credentials), quiesce its executor with xeroctl agents pause-exec <agent-id>. Subsequent dispatch routes around it. Resume with xeroctl agents resume-exec <agent-id> once the host is healthy.
  5. If the target is misbehaving (cluster outage, cloud API throttle), there is no workspace-level pause switch. Pause each XEM bound to the workspace using xeroctl agents pause-exec until the upstream issue resolves.
  6. Watch the error envelope. Common operator-visible codes include endpoint_inactive, model_provisioning, capacity_exceeded, backend_unavailable, and rate_limit_exceeded. A 403 with scope_insufficient means the caller's API key lacks a required scope; a 500 carrying an x-request-id response header should be forwarded to platform support together with the timestamp.
  7. When the alert resolves, Prometheus will mark the rule inactive. If it does not, cross-reference Troubleshooting for the specific error code.
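
Steps 1 and 6 lend themselves to a small triage helper. The sketch below pulls the three labels that name the blast radius and maps a status, error code, and header set to the next action described above. The function names and return strings are hypothetical, and the shape of the real error envelope is not specified here, so the code and status are passed in as plain arguments.

```python
from typing import Dict, Mapping

# Labels every XEM alert carries, per step 1.
BLAST_RADIUS_LABELS = ("project", "workspace", "agent")

# Operator-visible codes listed in step 6.
KNOWN_CODES = {
    "endpoint_inactive",
    "model_provisioning",
    "capacity_exceeded",
    "backend_unavailable",
    "rate_limit_exceeded",
}


def blast_radius(labels: Mapping[str, str]) -> Dict[str, str]:
    """Extract project, workspace, and agent; raise KeyError if one is missing."""
    return {name: labels[name] for name in BLAST_RADIUS_LABELS}


def triage(status: int, code: str, headers: Mapping[str, str]) -> str:
    """Map an error response to the next action described in this guide."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 403 and code == "scope_insufficient":
        return "api key lacks a required scope"
    if status == 500 and "x-request-id" in lowered:
        return f"forward to platform support: request {lowered['x-request-id']} plus timestamp"
    if code in KNOWN_CODES:
        return f"known operator-visible code: {code}; see Troubleshooting"
    return "unrecognized; inspect the execution detail"
```

The dictionary returned by blast_radius can be used directly as the filter set for the Operations tab in step 2.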

5. Investigate an Execution

Goal: understand, in full, what a specific tool invocation did.

  1. Find the invocation ID. The user will usually paste it from the chat; the /ops dashboard exposes it via a copy button on every row.
  2. Inspect the lifecycle: xeroctl exec show <id>. The output names every state transition, every approval, and the final status.
  3. Pull the raw stdout/stderr: xeroctl exec raw <id>. Artifacts produced by the call are referenced inline; fetch them through the chat's artifact API (the x_read_artifact tool surface) or the dashboard artifact viewer. There is no xeroctl artifacts command group.
  4. For distributed traces, enable OTEL on the router with XEROTIER_OTEL_ENABLED=true and XEROTIER_OTEL_ENDPOINT=<collector> and query the configured tracing backend (Jaeger, Tempo, etc.) by request ID. There is no CLI wrapper for trace retrieval today.
  5. For post-incident review, capture the full JSON shape of the execution with xeroctl exec show <id> -o json and attach it to the incident record. Audit-log rows are retained for XEROTIER_FRONTEND_AUDIT_RETENTION_DAYS days; archive promptly if the incident may outlast the retention window.
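
The retention arithmetic in step 5 can be made explicit. This sketch assumes the window is counted from the execution timestamp and that retention_days holds the value of XEROTIER_FRONTEND_AUDIT_RETENTION_DAYS; the function names and the seven-day safety margin are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_deadline(executed_at: datetime, retention_days: int) -> datetime:
    """Return the moment after which the audit rows may no longer be retained."""
    # Assumption: retention is measured from the execution timestamp.
    return executed_at + timedelta(days=retention_days)


def should_archive_now(executed_at: datetime, retention_days: int,
                       now: datetime, safety_margin_days: int = 7) -> bool:
    """True once 'now' is within the safety margin of the retention deadline."""
    cutoff = retention_deadline(executed_at, retention_days) - timedelta(days=safety_margin_days)
    return now >= cutoff


if __name__ == "__main__":
    executed = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(retention_deadline(executed, 30).isoformat())
```

Use timezone-aware datetimes throughout; mixing naive and aware values raises a TypeError on comparison.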

6. Rotate Credentials

Goal: rotate a XEM's tool credentials (or its CURVE keys) without dropping in-flight work.

  1. Tool credentials: update the local credential on the XEM host (new kubeconfig, new ~/.aws/credentials). Restart the XEM container; it re-enrolls with the same join-key metadata and publishes a fresh capability manifest. The router accepts the new manifest atomically.
  2. CURVE keys: run xeroctl agents curve-rotate <agent-id>. The router generates a fresh Curve25519 keypair and pushes the public half to the agent over the existing connection. In-flight invocations complete on the old key; new connections pick up the rotated key. The overlap window is short but non-zero; the audit log records the cut-over timestamp. If you rely on a deployment-wide salt for vLLM worker keying, also confirm XEROTIER_AGENT_VLLM_SALT_SECRET is set consistently across the fleet.
  3. Join keys: immediately expire unused ones with xeroctl agents join-keys <id> --revoke. Reissue with a fresh xeroctl agents join-keys --create --name <name> --region <region> --router-addr tcp://<router-host>:5555 call.
  4. After any rotation, run xeroctl agents <agent-id> and confirm the agent's heartbeat advanced. A fresh heartbeat is evidence the rotation took effect.
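
The heartbeat check in step 4 amounts to comparing two timestamps: one noted before the rotation and one read afterwards. The sketch below assumes the heartbeat is available as an ISO 8601 string with an explicit UTC offset; the function name is hypothetical, and how the value is extracted from the xeroctl agents <agent-id> output is left open.

```python
from datetime import datetime


def heartbeat_advanced(before_iso: str, after_iso: str) -> bool:
    """Return True if the heartbeat read after rotation is strictly newer."""
    # fromisoformat accepts offsets such as +00:00; both inputs should use
    # the same convention (both timezone-aware) so they compare cleanly.
    before = datetime.fromisoformat(before_iso)
    after = datetime.fromisoformat(after_iso)
    return after > before
```

A heartbeat that has not moved after a reasonable wait suggests the agent did not reconnect, which is worth investigating before treating the rotation as complete.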