Triage and Incident Response

This page covers operational triage for ByteOr runtime and Cloud deployments.

First Triage Pass

  1. Check API and worker health endpoints (/healthz, /readyz)
  2. Confirm readiness before assuming the fault is limited to the UI
  3. Inspect /metrics for request, auth, unauthorized, forbidden, and rate-limited counters
  4. Identify category: auth, deployment, agent, or workflow failure
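The counter inspection in step 3 can be sketched as a small parser over a Prometheus-style /metrics exposition. The counter names below are illustrative, not the runtime's actual metric names.

```python
def parse_counters(metrics_text, names):
    """Return {name: summed value} for the requested counter families."""
    counters = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE metadata and comments
        name, _, value = line.rpartition(" ")
        base = name.split("{", 1)[0]  # drop labels, keep the family name
        if base in names:
            counters[base] = counters.get(base, 0.0) + float(value)
    return counters

sample = """\
# HELP http_requests_total Total requests
http_requests_total{route="/api"} 120
http_requests_total{route="/auth"} 30
auth_unauthorized_total 7
auth_forbidden_total 2
rate_limited_total 5
"""
result = parse_counters(sample, {"http_requests_total", "auth_unauthorized_total",
                                 "auth_forbidden_total", "rate_limited_total"})
```

Summing across label sets gives a quick per-family total; a spike in the unauthorized or rate-limited families usually decides the category in step 4.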

Incident Categories

Auth Failures

  • OIDC login or callback failures
  • Unauthorized request spikes
  • Membership or project-role mismatches

Confirm the operator can still access the expected organization and project scope.
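A quick way to bucket the failures above is by status code and route: 401 points at missing or expired credentials, 403 at a membership or role mismatch within an otherwise valid session. This is a hedged sketch; the route prefixes are assumptions, only the HTTP semantics are standard.

```python
def classify_auth_failure(status, route):
    """Map one observed failure to a triage bucket (illustrative names)."""
    if route.startswith("/auth/callback"):
        return "oidc-callback"      # login/callback flow itself failing
    if status == 401:
        return "unauthorized"       # missing or expired credentials
    if status == 403:
        return "role-mismatch"      # authenticated, wrong org/project scope
    return "other"
```

Classifying a sample of recent failures this way quickly shows whether the spike is a broken login flow or a scope problem for already-authenticated callers.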

Deployment Failures

  • Deployments stuck before artifact generation
  • Approval coverage rejection
  • Missing or invalid bundle hash resolution

Check deployment status transitions in audit records.
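For the bundle-hash case, a minimal check is to recompute the content hash and compare it to what the deployment resolved. This assumes SHA-256 content addressing, which may not match the runtime's actual scheme.

```python
import hashlib

def verify_bundle_hash(bundle_bytes, expected_hex):
    """Return (matches, actual_hex) for a bundle against its resolved hash."""
    actual = hashlib.sha256(bundle_bytes).hexdigest()
    return actual == expected_hex, actual
```

A mismatch here distinguishes a corrupted or wrong artifact from a resolution bug upstream of artifact generation.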

Agent Failures

  • Registration problems during enrollment
  • Missing heartbeats
  • Repeated authorization failures on heartbeat, artifact, or signing-key routes
  • Rate limiting on hot agents
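Missing heartbeats can be detected by comparing each agent's last-seen timestamp against a staleness threshold. The 90-second default below is an assumption for illustration, not a ByteOr contract.

```python
def stale_agents(last_seen, now, max_age_s=90):
    """last_seen: {agent_id: unix_ts}. Returns sorted ids past the threshold."""
    return sorted(a for a, ts in last_seen.items() if now - ts > max_age_s)
```

Cross-reference the stale set with the authorization-failure and rate-limit counters: an agent that is both stale and repeatedly rejected is usually looping on bad credentials rather than offline.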

Worker Failures

  • Workflow jobs in retrying state
  • Repeated retries for the same job type
  • Jobs reaching dead_letter

The worker uses bounded retries with backoff; jobs should not fail permanently on the first error.
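The bounded retry-and-backoff shape described above can be sketched as follows: exponentially growing, capped delays, and a hand-off to dead_letter once attempts are exhausted. Names and defaults are illustrative, not the worker's actual implementation.

```python
def run_with_retries(job, max_attempts=5, base_s=1.0, cap_s=30.0):
    """Run job(); retry with capped exponential backoff, then dead-letter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", job())
        except Exception as err:  # broad catch is fine for a triage sketch
            last_err = err
            delay = min(cap_s, base_s * 2 ** (attempt - 1))
            # a real worker would sleep `delay` (plus jitter) here
    return ("dead_letter", last_err)
```

If you see the same job type cycling through retrying repeatedly, the error is likely deterministic, and the job will reach dead_letter once the attempt budget runs out.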

Rate Limiting

The API applies route-level limits for:

  • Auth login and callback traffic
  • Agent protocol traffic
  • Artifact upload traffic

If requests return 429:

  1. Confirm whether traffic is legitimate burst or abuse
  2. Identify the caller identity involved
  3. Ask the client to back off before retrying
  4. Check for loops on invalid credentials or missing scopes
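Step 3's back-off advice can be sketched client-side: honor a Retry-After header when the server sends one, otherwise fall back to exponential delay. `send` stands in for any callable returning (status, headers, body); the shape is an assumption for illustration.

```python
def call_with_rate_limit(send, max_attempts=4, base_s=0.5):
    """Retry a request on 429, preferring the server's Retry-After hint."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        delay = float(headers.get("Retry-After", base_s * 2 ** attempt))
        # a real client would sleep `delay` seconds here
    return 429, body
```

A client that never backs off, or that retries immediately with the same invalid credentials, is exactly the loop step 4 asks you to rule out.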

Runtime Triage

For runtime-level issues (not Cloud control plane):

  1. Run doctor to verify host readiness
  2. Check effective tuning and compare requested vs. applied values
  3. Inspect snapshots for runtime state
  4. Export an incident bundle for offline triage
  5. Run dry-run replay to investigate
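The requested-vs-applied comparison in step 2 is just a structured diff: any knob that was clamped, ignored, or overridden shows up as a mismatch. Key names below are hypothetical.

```python
def tuning_drift(requested, applied):
    """Return {key: (requested, applied)} for every knob that differs."""
    return {k: (v, applied.get(k)) for k, v in requested.items()
            if applied.get(k) != v}
```

An empty result means the runtime honored the requested tuning; anything else is worth including in the incident bundle from step 4.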

Recovery

  • Restart API or worker only after capturing the visible failure mode
  • Prefer replaying a single failing workflow after root cause is understood
  • If a job reaches dead_letter, capture the input payload and error before remediation
  • Test backup and restore paths periodically
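The dead_letter capture step can be sketched as writing an immutable incident record before any remediation mutates state. Field names are assumptions, not the runtime's schema.

```python
import json
import time

def capture_dead_letter(job_id, job_type, payload, error):
    """Serialize a dead-lettered job's payload and error for the incident record."""
    record = {
        "job_id": job_id,
        "job_type": job_type,
        "payload": payload,
        "error": str(error),
        "captured_at": int(time.time()),
    }
    return json.dumps(record, sort_keys=True)
```

Capturing before remediation preserves the exact failing input, which is what a later dry-run replay needs.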

Provenance

Need the canonical source? Use the public hub to orient yourself, then jump to repo-owned docs or rustdoc when you need contract-level detail.