Triage and Incident Response
This page covers operational triage for ByteOr runtime and Cloud deployments.
First Triage Pass
- Check API and worker health endpoints (
/healthz,/readyz) - Check readiness before assuming only a UI fault
- Inspect
/metricsfor request, auth, unauthorized, forbidden, and rate-limited counters - Identify category: auth, deployment, agent, or workflow failure
Incident Categories
Auth Failures
- OIDC login or callback failures
- Unauthorized request spikes
- Membership or project-role mismatches
Confirm the operator can still access the expected organization and project scope.
Deployment Failures
- Deployments stuck before artifact generation
- Approval coverage rejection
- Missing or invalid bundle hash resolution
Check deployment status transitions in audit records.
Agent Failures
- Registration problems during enrollment
- Missing heartbeats
- Repeated authorization failures on heartbeat, artifact, or signing-key routes
- Rate limiting on hot agents
Worker Failures
- Workflow jobs in
retryingstate - Repeated retries for the same job type
- Jobs reaching
dead_letter
The worker uses bounded retry and backoff — jobs should not fail permanently on first error.
Rate Limiting
The API applies route-level limits for:
- Auth login and callback traffic
- Agent protocol traffic
- Artifact upload traffic
If requests return 429:
- Confirm whether traffic is legitimate burst or abuse
- Identify the caller identity involved
- Ask the client to back off before retrying
- Check for loops on invalid credentials or missing scopes
Runtime Triage
For runtime-level issues (not Cloud control plane):
- Run
doctorto verify host readiness - Check effective tuning — compare requested vs. applied
- Inspect snapshots for runtime state
- Export an incident bundle for offline triage
- Run dry-run replay to investigate
Recovery
- Restart API or worker only after capturing the visible failure mode
- Prefer replaying a single failing workflow after root cause is understood
- If a job reaches
dead_letter, capture the input payload and error before remediation - Test backup and restore paths periodically