Incident And Replay Operator Guide
Synced from repo docs
This page is synced from docs/guides/incident-replay-operator-guide.md via docs/public-docs.json. Edit the owning repo source instead of this generated copy. GitHub source: https://github.com/byteor-systems/byteor-cloud/blob/master/docs/guides/incident-replay-operator-guide.md
This guide covers the operator path from incident artifact collection to replay execution.
Investigation flow
- Identify the incident, agent, deployment, and environment involved.
- Open the stored artifact record for the incident bundle or snapshot.
- Confirm the artifact metadata and capture time.
- Launch a dry-run replay first to validate the reconstructed input.
- Escalate to execute-mode replay only when the environment posture and approval coverage allow it.
Available replay modes
Dry run
Use dry run when you want to inspect behavior without allowing the replayed action to execute against live infrastructure.
Execute mode
Use execute mode only when:
- the environment posture permits governed execution
- the replay can resolve a matching pipeline version from the incident spec hash
- an eligible approval is attached when required
What execute replay validates
Before execution starts, the backend validates:
- source agent environment posture
- pipeline version resolution from the incident artifact spec hash
- approval coverage when execute mode requires it
If these checks fail, the replay request is rejected before any governed execution begins.
Artifact handling notes
Artifact uploads use multipart submission and enforce:
- type validation
- size validation
- content-hash deduplication
- scoped agent API key authorization
Artifact upload routes also have abuse controls and can return 429 when an agent exceeds allowed request volume.
Response playbook
When investigating a production incident:
- confirm the latest deployment status and approval context
- inspect recent heartbeats and snapshots from the source agent
- verify the incident artifact metadata and source environment
- run a dry-run replay and capture the result
- only request execute-mode approval if dry run is insufficient
- review resulting audit records after the replay request completes or fails
Escalation points
Escalate when:
- the source environment cannot be resolved
- the artifact spec hash does not map to a known pipeline version
- approval coverage is missing or rejected
- the agent shows repeated
401,403, or429responses during investigation