Incident Bundles and Replay

ByteOr's incident capture and replay system provides a structured path from failure detection through investigation to governed re-execution.

Incident Bundles

An incident bundle packages everything needed for triage into a single durable unit:

  • Spec — the pipeline spec that was running
  • Validation result — the compile and validation output
  • Policy audit — the governance decisions that were in effect
  • Environment data — runtime environment details
  • Snapshots — optional state captures from the runtime

Bundles are created with incident-bundle on the runtime CLI or through the Cloud artifact system.

Replay Modes

Dry Run

Dry run replays captured inputs without executing against live infrastructure. Use dry run for:

  • Investigating what happened
  • Validating the reconstructed input
  • Confirming the pipeline version matches the incident
  • Testing remediation strategies before committing

Dry run does not require execute-mode approval in most environment postures.

Execute Mode

Execute mode re-runs captured inputs with full side effects. Use execute mode only when:

  • The environment posture permits governed execution
  • A matching pipeline version can be resolved from the incident spec hash
  • An eligible approval is attached (when required)

Execute Mode Validation

Before execution starts, the backend validates:

  • Source agent environment posture
  • Pipeline version resolution from the incident artifact spec hash
  • Approval coverage when the environment requires it

If validation fails, the replay request is rejected before any governed execution begins.

Replay Audit

Every replay produces a structured audit record:

  • Audit version — schema version
  • Bundle directory — source artifact location
  • Spec hash — the pipeline version
  • Input — journal lane, path, scanned records, selected bytes, sample hashes
  • Policy — environment, approval status, mode (dry_run or execute)
  • Actions — each action taken with role, lane, stage, target, and decision

Operational Playbook

  1. Identify the incident, agent, deployment, and environment
  2. Open the stored artifact record
  3. Confirm artifact metadata and capture time
  4. Launch a dry-run replay
  5. Review the replay audit
  6. Escalate to execute mode only if dry run is insufficient
  7. Obtain approval for execute-mode replay
  8. Review resulting audit records

Escalation Points

Escalate when:

  • The source environment cannot be resolved
  • The artifact spec hash does not map to a known pipeline version
  • Approval coverage is missing or rejected
  • The agent shows repeated 401, 403, or 429 responses during investigation
Provenance
Need the canonical source?
Use the public hub to orient yourself, then jump to repo-owned docs or rustdoc when you need contract-level detail.