Triage and Incident Response

This page covers operational triage for ByteOr runtime and Cloud deployments.

First Triage Pass

  1. Check API and worker health endpoints (/healthz, /readyz)
  2. Confirm readiness before assuming the fault is limited to the UI
  3. Inspect /metrics for request, auth, unauthorized, forbidden, and rate-limited counters
  4. Identify category: auth, deployment, agent, or workflow failure
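The counter inspection in step 3 can be sketched as a small parser over a Prometheus-style /metrics exposition. The counter names below are illustrative, not the runtime's actual metric names.

```python
def parse_counters(metrics_text, names):
    """Return {name: summed value} for the requested counter families."""
    counters = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE metadata and comments
        name, _, value = line.rpartition(" ")
        base = name.split("{", 1)[0]  # drop labels, keep the family name
        if base in names:
            counters[base] = counters.get(base, 0.0) + float(value)
    return counters

sample = """\
# HELP http_requests_total Total requests
http_requests_total{route="/api"} 120
http_requests_total{route="/auth"} 30
auth_unauthorized_total 7
auth_forbidden_total 2
rate_limited_total 5
"""
result = parse_counters(sample, {"http_requests_total", "auth_unauthorized_total",
                                 "auth_forbidden_total", "rate_limited_total"})
```

Summing across label sets gives a quick per-family total; a spike in the unauthorized or rate-limited families usually decides the category in step 4.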

Incident Categories

Auth Failures

  • OIDC login or callback failures
  • Unauthorized request spikes
  • Membership or project-role mismatches

Confirm the operator can still access the expected organization and project scope.
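A quick way to bucket the failures above is by status code and route: 401 points at missing or expired credentials, 403 at a membership or role mismatch within an otherwise valid session. This is a hedged sketch; the route prefixes are assumptions, only the HTTP semantics are standard.

```python
def classify_auth_failure(status, route):
    """Map one observed failure to a triage bucket (illustrative names)."""
    if route.startswith("/auth/callback"):
        return "oidc-callback"      # login/callback flow itself failing
    if status == 401:
        return "unauthorized"       # missing or expired credentials
    if status == 403:
        return "role-mismatch"      # authenticated, wrong org/project scope
    return "other"
```

Classifying a sample of recent failures this way quickly shows whether the spike is a broken login flow or a scope problem for already-authenticated callers.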

Deployment Failures

  • Deployments stuck before artifact generation
  • Approval coverage rejection
  • Missing or invalid bundle hash resolution

Check deployment status transitions in audit records.
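For the bundle-hash case, a minimal check is to recompute the content hash and compare it to what the deployment resolved. This assumes SHA-256 content addressing, which may not match the runtime's actual scheme.

```python
import hashlib

def verify_bundle_hash(bundle_bytes, expected_hex):
    """Return (matches, actual_hex) for a bundle against its resolved hash."""
    actual = hashlib.sha256(bundle_bytes).hexdigest()
    return actual == expected_hex, actual
```

A mismatch here distinguishes a corrupted or wrong artifact from a resolution bug upstream of artifact generation.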

Agent Failures

  • Registration problems during enrollment
  • Missing heartbeats
  • Repeated authorization failures on heartbeat, artifact, or signing-key routes
  • Rate limiting on hot agents
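Missing heartbeats can be detected by comparing each agent's last-seen timestamp against a staleness threshold. The 90-second default below is an assumption for illustration, not a ByteOr contract.

```python
def stale_agents(last_seen, now, max_age_s=90):
    """last_seen: {agent_id: unix_ts}. Returns sorted ids past the threshold."""
    return sorted(a for a, ts in last_seen.items() if now - ts > max_age_s)
```

Cross-reference the stale set with the authorization-failure and rate-limit counters: an agent that is both stale and repeatedly rejected is usually looping on bad credentials rather than offline.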

Worker Failures

  • Workflow jobs in retrying state
  • Repeated retries for the same job type
  • Jobs reaching dead_letter

The worker uses bounded retries with backoff; jobs should not fail permanently on the first error.
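The bounded retry-and-backoff shape described above can be sketched as follows: exponentially growing, capped delays, and a hand-off to dead_letter once attempts are exhausted. Names and defaults are illustrative, not the worker's actual implementation.

```python
def run_with_retries(job, max_attempts=5, base_s=1.0, cap_s=30.0):
    """Run job(); retry with capped exponential backoff, then dead-letter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", job())
        except Exception as err:  # broad catch is fine for a triage sketch
            last_err = err
            delay = min(cap_s, base_s * 2 ** (attempt - 1))
            # a real worker would sleep `delay` (plus jitter) here
    return ("dead_letter", last_err)
```

If you see the same job type cycling through retrying repeatedly, the error is likely deterministic, and the job will reach dead_letter once the attempt budget runs out.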

Rate Limiting

The API applies route-level limits for:

  • Auth login and callback traffic
  • Agent protocol traffic
  • Artifact upload traffic

If requests return 429:

  1. Confirm whether traffic is legitimate burst or abuse
  2. Identify the caller identity involved
  3. Ask the client to back off before retrying
  4. Check for loops on invalid credentials or missing scopes
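Step 3's back-off advice can be sketched client-side: honor a Retry-After header when the server sends one, otherwise fall back to exponential delay. `send` stands in for any callable returning (status, headers, body); the shape is an assumption for illustration.

```python
def call_with_rate_limit(send, max_attempts=4, base_s=0.5):
    """Retry a request on 429, preferring the server's Retry-After hint."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        delay = float(headers.get("Retry-After", base_s * 2 ** attempt))
        # a real client would sleep `delay` seconds here
    return 429, body
```

A client that never backs off, or that retries immediately with the same invalid credentials, is exactly the loop step 4 asks you to rule out.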

Runtime Triage

For runtime-level issues (not Cloud control plane):

  1. Run doctor to verify host readiness
  2. Check effective tuning and compare requested vs. applied values
  3. Inspect snapshots for runtime state
  4. Export an incident bundle for offline triage
  5. Run dry-run replay to investigate
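The requested-vs-applied comparison in step 2 is just a structured diff: any knob that was clamped, ignored, or overridden shows up as a mismatch. Key names below are hypothetical.

```python
def tuning_drift(requested, applied):
    """Return {key: (requested, applied)} for every knob that differs."""
    return {k: (v, applied.get(k)) for k, v in requested.items()
            if applied.get(k) != v}
```

An empty result means the runtime honored the requested tuning; anything else is worth including in the incident bundle from step 4.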

Recovery

  • Restart API or worker only after capturing the visible failure mode
  • Prefer replaying a single failing workflow after root cause is understood
  • If a job reaches dead_letter, capture the input payload and error before remediation
  • Test backup and restore paths periodically
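The dead_letter capture step can be sketched as writing an immutable incident record before any remediation mutates state. Field names are assumptions, not the runtime's schema.

```python
import json
import time

def capture_dead_letter(job_id, job_type, payload, error):
    """Serialize a dead-lettered job's payload and error for the incident record."""
    record = {
        "job_id": job_id,
        "job_type": job_type,
        "payload": payload,
        "error": str(error),
        "captured_at": int(time.time()),
    }
    return json.dumps(record, sort_keys=True)
```

Capturing before remediation preserves the exact failing input, which is what a later dry-run replay needs.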

Provenance

Need the canonical source? Use the public hub to orient yourself, then jump to repo-owned docs or rustdoc when you need contract-level detail.