Operator Guides Index
Task-first procedures: one operator, one outcome, five to fifteen minutes each. The order tracks the first-week lifecycle: deploy a XEM, author a template against it, run the approval cadence that template drives, and unwind a problem when one surfaces.
1. Deploy Your First XEM
Goal: take a fresh host with cluster credentials and get it serving tool calls inside a Xerotier project in under five minutes.
- On the XEM host, ensure local credentials exist for the bundles you intend to expose (a readable ~/.kube/config for Kubernetes, valid ~/.aws for cloud bundles).
- From your local workstation, mint a single-use join key: xeroctl agents join-keys --create --name <name> --region <region> --router-addr tcp://<router-host>:5555. The command prints the join-key value; capture it once.
- Copy the value to the XEM host and either export it as XEM_JOIN_KEY (or XEROTIER_AGENT_JOIN_KEY for XIM-mode workers) before launching the container, or pass it via the --join-key CLI flag. On a self-signed cluster also set XEROTIER_AGENT_ALLOW_INSECURE=true.
- Run xeroctl agents. The new agent should appear in the listing with a healthy status.
- Inspect what it published with xeroctl agents <agent-id>. The default handler renders the registered tool surface and last-heartbeat fields the XEM advertised.
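The mint-and-capture step above can be sketched as a small shell function. This is a minimal sketch, not a supported wrapper: the function name mint_join_key is hypothetical, and it assumes the command's standard output is exactly the join-key value, which may not hold if the CLI prints additional text.

```shell
# Hypothetical helper: mint a single-use join key on the workstation.
# Arguments: <name> <region> <router-host>
# Assumption: stdout of the command is the join-key value and nothing else.
mint_join_key() {
  xeroctl agents join-keys --create \
    --name "$1" \
    --region "$2" \
    --router-addr "tcp://$3:5555"
}
```

Calling it inside a command substitution, for example key="$(mint_join_key <name> <region> <router-host>)", captures the value once so it can be copied to the XEM host.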
The XEM is now reachable from any operational workspace bound
to it. Bind one with
xeroctl workspaces bind <ws> --agent <agent-id>.
Env vars worth knowing:
XEM_JOIN_KEY (or
XEROTIER_AGENT_JOIN_KEY for XIM workers),
XEROTIER_AGENT_ALLOW_INSECURE,
XEROTIER_AGENT_LOG_LEVEL,
XEROTIER_AGENT_METRICS_PORT,
XEROTIER_XEM_METRICS_PORT.
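When the join key is supplied through the environment rather than the --join-key flag, a preflight check before launching the container can catch a missing variable early. The following is a sketch only; the function name have_join_key is hypothetical and not part of the product.

```shell
# Hypothetical preflight: succeed only if one of the join-key variables is set.
# XEM_JOIN_KEY applies to XEM hosts, XEROTIER_AGENT_JOIN_KEY to XIM workers.
have_join_key() {
  [ -n "${XEM_JOIN_KEY:-}" ] || [ -n "${XEROTIER_AGENT_JOIN_KEY:-}" ]
}
```

A launch script might run have_join_key || exit 1 before starting the container.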
3. Manage Approvals
Goal: configure who approves what, and in what cadence, for every operational workspace.
- List the open approvals operators are waiting on with xeroctl approvals list --workspace <wid>. Filter by status, risk, or agent as needed.
- Inspect a single request (prompt, tool call, risk, timeout) with xeroctl approvals show <approval-id>.
- Stream new requests as they land: xeroctl approvals watch --workspace <wid> prints an SSE feed suitable for piping into a notifier.
- Decide a request from the CLI with xeroctl approvals approve <approval-id> --note <reason> or xeroctl approvals reject <approval-id> --note <reason>. Reviewers may also act in the dashboard Approvals tab.
- Approval-policy authoring (per-risk timeouts, escalation chains, delegation windows) lives in the workspace settings UI today; there is no xeroctl approvals policy subcommand. If your alert payload mentions retry_after seconds, or the response returns a Retry-After header, wait that long before re-issuing the call; the router's rate-limit middleware sets both.
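The wait-before-retry rule above can be sketched as a helper that turns a retry_after or Retry-After value into a delay. This is an illustrative sketch: the function name retry_delay and the fallback of one second are assumptions, and it handles only whole-second values.

```shell
# Hypothetical helper: print how many seconds to wait before re-issuing a call.
# $1 is the retry_after / Retry-After value; $2 is a fallback (default 1).
# Only whole-second values are handled; anything else uses the fallback.
retry_delay() {
  case "$1" in
    ''|*[!0-9]*) printf '%s\n' "${2:-1}" ;;
    *)           printf '%s\n' "$1" ;;
  esac
}
```

A caller would then run sleep "$(retry_delay "$value")" before repeating the command.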
4. Respond to Alerts
Goal: triage a Prometheus alert (or PagerDuty page) the fleet emitted.
- Look at the alert label set. Every XEM alert carries project, workspace, and agent. The label set names the blast radius.
- Open /ops and filter the Operations tab by the same labels. The most recent failure should be obvious.
- Click through to the execution detail. The modal shows the full tool call, stdout, stderr, and audit timeline.
- If the XEM is misbehaving (wrong output, stale credentials), quiesce its executor with xeroctl agents pause-exec <agent-id>. Subsequent dispatch routes around it. Resume with xeroctl agents resume-exec <agent-id> once the host is healthy.
- If the target is misbehaving (cluster outage, cloud API throttle), there is no workspace-level pause switch. Pause each XEM bound to the workspace using xeroctl agents pause-exec until the upstream issue resolves.
- Watch the error envelope. Common operator-visible codes include endpoint_inactive, model_provisioning, capacity_exceeded, backend_unavailable, and rate_limit_exceeded. A 403 with scope_insufficient means the caller's API key lacks a required scope; a 500 carrying an x-request-id response header should be forwarded to platform support together with the timestamp.
- When the alert resolves, Prometheus will mark the rule inactive. If it does not, cross-reference Troubleshooting for the specific error code.
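Because there is no workspace-level pause switch, pausing every XEM bound to a workspace means repeating pause-exec per agent. A minimal sketch of that loop follows; the function name pause_agents is hypothetical, and the operator must supply the agent IDs, since this sketch does not look them up.

```shell
# Hypothetical helper: pause the executor of every agent ID passed as an argument.
# Keeps going if one pause fails and reports failure at the end.
pause_agents() {
  pause_status=0
  for agent_id in "$@"; do
    xeroctl agents pause-exec "$agent_id" || pause_status=1
  done
  return "$pause_status"
}
```

The same loop shape applies to xeroctl agents resume-exec once the upstream issue resolves.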
5. Investigate an Execution
Goal: understand, in full, what a specific tool invocation did.
- Find the invocation ID. The user will usually paste it from the chat; the /ops dashboard exposes it via a copy button on every row.
- Inspect the lifecycle: xeroctl exec show <id>. The output names every state transition, every approval, and the final status.
- Pull the raw stdout/stderr: xeroctl exec raw <id>. Artifacts produced by the call are referenced inline; fetch them through the chat's artifact API (the x_read_artifact tool surface) or the dashboard artifact viewer. There is no xeroctl artifacts command group.
- For distributed traces, enable OTEL on the router with XEROTIER_OTEL_ENABLED=true and XEROTIER_OTEL_ENDPOINT=<collector> and query the configured tracing backend (Jaeger, Tempo, etc.) by request ID. There is no CLI wrapper for trace retrieval today.
- For post-incident review, capture the full JSON shape of the execution with xeroctl exec show <id> -o json and attach it to the incident record. Audit-log rows are retained for XEROTIER_FRONTEND_AUDIT_RETENTION_DAYS days; archive promptly if the incident may outlast the retention window.
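The post-incident capture step can be sketched as a function that writes the JSON output to a file for the incident record. The function name capture_execution and the exec-<id>.json file naming are assumptions made for illustration.

```shell
# Hypothetical helper: save the JSON form of an execution for an incident record.
# $1 is the invocation ID, $2 is an existing directory to write into.
capture_execution() {
  xeroctl exec show "$1" -o json > "$2/exec-$1.json"
}
```

Writing to a file soon after the incident keeps a copy that does not depend on the audit-log retention window.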
6. Rotate Credentials
Goal: rotate a XEM's tool credentials (or its CURVE keys) without dropping in-flight work.
- Tool credentials: update the local credential on the XEM host (new kubeconfig, new ~/.aws/credentials). Restart the XEM container; it re-enrolls with the same join-key metadata and publishes a fresh capability manifest. The router accepts the new manifest atomically.
- CURVE keys: run xeroctl agents curve-rotate <agent-id>. The router generates a fresh Curve25519 keypair and pushes the public half to the agent over the existing connection. In-flight invocations complete on the old key; new connections pick up the rotated key. The overlap window is short but non-zero; the audit log records the cut-over timestamp. If you rely on a deployment-wide salt for vLLM worker keying, also confirm XEROTIER_AGENT_VLLM_SALT_SECRET is set consistently across the fleet.
- Join keys: expire unused ones immediately with xeroctl agents join-keys <id> --revoke. Reissue with a fresh xeroctl agents join-keys --create --name <name> --region <region> --router-addr tcp://<router-host>:5555 call.
- After any rotation, run xeroctl agents <agent-id> and confirm the agent's heartbeat advanced. A fresh heartbeat is evidence the rotation took effect.
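The CURVE rotation and the follow-up check can be chained so the agent detail is printed only when the rotation command succeeds. This is a sketch under that assumption; the function name rotate_and_inspect is hypothetical.

```shell
# Hypothetical helper: rotate an agent's CURVE keys, then print the agent
# detail so the operator can compare the heartbeat field by eye.
rotate_and_inspect() {
  xeroctl agents curve-rotate "$1" && xeroctl agents "$1"
}
```

If the heartbeat has not advanced yet, re-run xeroctl agents <agent-id> after a short wait before concluding the rotation failed.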