IndexBus Ops Triage (v1)

Goal: answer “why is it slow/dropping?” using only:

  • indexbus-inspect output (layout, caps, init state)
  • router counters (throughput, drops, queue depth)
  • the normative failure contracts in v1 failure & lifecycle

This is intentionally best-effort v1 guidance: it favors boundedness and clear operator actions.

Related:

  • Router counters reference: ./router-counters.md
  • Performance & tuning: ./performance-tuning.md

Rustdoc entry points

For the API-level tools behind this runbook, start with:


1) Quick checklist

  1. Confirm the mapped region is valid:

    • initialized == 2 (when present)
    • mapped_bytes >= layout_bytes
    • capability bits match the intended region kind
  2. Confirm the fanout router is running:

    • routed/sec should be > 0 when producers are active
  3. Decide whether the bottleneck is producer-side, router-side, or consumer-side:

    • producer backlog: qdepth: src_spsc / src_mpsc grows
    • consumer backlog: qdepth: consumers=[..] grows for one or more consumers
    • drops: drops/sec > 0
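
The decision in step 3 can be sketched as a small classifier over two successive counter samples. The dict keys used here (`qdepth_src`, `qdepth_consumers`) are illustrative stand-ins for the qdepth counters printed by the router, and the rules simply mirror the bullets above:

```python
def triage(prev, curr):
    """Classify the bottleneck from two successive router counter samples.

    `prev` and `curr` are dicts of counter values. Key names are
    illustrative stand-ins for the router's printed counters
    (routed/sec, drops/sec, qdepth fields); map them to your output.
    """
    if curr.get("routed/sec", 0) == 0:
        return "router-side: router idle or not running"
    if curr.get("drops/sec", 0) > 0:
        return "overload: consumer queues full (drops)"
    # Producer backlog: the source queue depth keeps growing.
    if curr.get("qdepth_src", 0) > prev.get("qdepth_src", 0):
        return "producer-side: source queue backlog growing"
    # Consumer backlog: any per-consumer depth keeps growing.
    grew = [i for i, (p, c) in enumerate(zip(prev.get("qdepth_consumers", []),
                                             curr.get("qdepth_consumers", [])))
            if c > p]
    if grew:
        return f"consumer-side: backlog growing for consumers {grew}"
    return "healthy: no growing backlog, no drops"

# Example: consumer 1 lags between samples.
prev = {"routed/sec": 1000, "drops/sec": 0, "qdepth_src": 4,
        "qdepth_consumers": [2, 10]}
curr = {"routed/sec": 1000, "drops/sec": 0, "qdepth_src": 4,
        "qdepth_consumers": [2, 40]}
print(triage(prev, curr))
```

The thresholds ("any growth") are deliberately naive; in practice you would compare depths over several intervals to filter out transient spikes.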

2) Inspect a region

Run:

  • cargo run -p indexbus-inspect -- <path-to-region-file>
  • cargo run -p indexbus-inspect -- <path-to-region-file> --json

Interpretation:

  • initialized: 0/1/2 (or n/a in text output / null in JSON for layouts that do not use it, e.g. v1 state)
    • 0: uninitialized (treat as not ready)
    • 1: stuck initializing → treat as failed init; recreate region
    • 2: safe to operate (subject to other checks)
  • caps: should match the intended region kind
  • layout_bytes and mapped_bytes
    • If mapped_bytes < layout_bytes: treat as truncated/corrupted mapping; recreate
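
A minimal validity check over the --json output can be sketched as follows. The field names (initialized, caps, layout_bytes, mapped_bytes) mirror the bullets above, but treat the exact JSON shape as an assumption to verify against your build:

```python
import json

def check_region(report: dict) -> list:
    """Return a list of problems found in an indexbus-inspect JSON report.

    Field names (initialized, layout_bytes, mapped_bytes) follow the text
    output described above; verify them against your build's --json output.
    """
    problems = []
    init = report.get("initialized")  # may be null for layouts without it
    if init == 0:
        problems.append("uninitialized: treat as not ready")
    elif init == 1:
        problems.append("stuck initializing: treat as failed init; recreate region")
    if report.get("mapped_bytes", 0) < report.get("layout_bytes", 0):
        problems.append("mapped_bytes < layout_bytes: truncated/corrupted mapping; recreate")
    return problems

# Example with a hypothetical report.
report = json.loads('{"initialized": 1, "caps": 3, "layout_bytes": 4096, "mapped_bytes": 4096}')
for p in check_region(report):
    print(p)
```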

If validation/inspection fails, follow v1 failure & lifecycle “ABI compatibility and corruption”.


3) Run the router with periodic stats (fanout)

Run:

  • cargo run -p indexbus-route --bin indexbus-router -- --name fanout --interval-ms 1000
  • or against a specific file: --file <path>

Work-queue policy notes (best-effort):

  • If consumers are saturated and you want the router to stop dequeuing instead of dropping:
    • --mode work --policy spinthenblock
  • If the region supports blocking and you want OS-backed waits (lower idle CPU):
    • --mode work --policy block

The router prints key/value counters:

  • sent/sec: best-effort producer enqueue rate (derived from producer queue tail/write)
  • routed/sec: number of messages dequeued from the producer→router queue (router throughput)
  • recv/sec: best-effort consumer dequeue rate (sum of consumer queue head deltas)
  • delivered/sec: total per-consumer enqueues performed (broadcast can be > routed/sec)
  • drops/sec: total drops (best-effort)
  • drops_full/sec: drops attributed (best-effort) to a full destination queue
  • drops_all_full/sec: drops attributable to no eligible consumer having space (work-queue)
  • drops_no_credit/sec: drops attributable to credit exhaustion (best-effort; see note below)
  • credit_waits/sec: iterations where the router waited due to credits (work-queue)
  • detaches/sec: number of consumer detaches performed by the credit policy
  • idle_waits/sec: router “no work” iterations invoking the wait strategy
  • pressure_waits/sec: router throttling iterations because consumer queues have no capacity
  • wake_waits/sec: OS-backed wake waits (only when wake sections are present/used)
  • wake_timeouts/sec: bounded wake waits that timed out
  • batches/sec: routed batches per second
  • batch_avg: average batch size over the interval
  • batch_max: maximum batch size over the interval
  • qdepth: src_spsc / src_mpsc: producer backlog
  • qdepth: consumers=[..]: per-consumer backlog / lag proxy
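
If you need to scrape these counters (e.g. to feed a dashboard), a parser sketch could look like the following. The exact line format is an assumption (space-separated name=value pairs plus a bracketed consumer list); adjust the patterns to the real router output:

```python
import re

def parse_stats(line: str) -> dict:
    """Parse a hypothetical key=value router stats line into a dict.

    Assumes space-separated `name=value` pairs plus a qdepth consumer list,
    e.g. "routed/sec=1200 drops/sec=3 qdepth: consumers=[4, 17]".
    Adjust the patterns to your router build's actual output format.
    """
    stats = {}
    for key, val in re.findall(r"([\w/]+)=([\d.]+)", line):
        stats[key] = float(val)
    m = re.search(r"consumers=\[([^\]]*)\]", line)
    if m:
        stats["qdepth_consumers"] = [int(x) for x in m.group(1).split(",") if x.strip()]
    return stats

line = "routed/sec=1200 delivered/sec=2400 drops/sec=3 qdepth: consumers=[4, 17]"
print(parse_stats(line))
```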

4) Common incident patterns and actions

Router not running

Symptoms:

  • no routed/sec counter output (no router process is reporting for the region)
  • qdepth: src_* grows while producers are active

Action:

  • restart the router process for that region

See: v1 failure & lifecycle “Router death (fanout)”.
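
One way to automate the restart action is a minimal supervisor loop. Everything here (restart cap, backoff values) is illustrative, and a production deployment would normally lean on systemd or an equivalent supervisor instead:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=5, backoff_s=1.0):
    """Restart `cmd` whenever it exits nonzero, with linear backoff.

    Illustrative sketch of bounded restarts; use a real supervisor
    (systemd, runit, ...) for production routers.
    """
    restarts = 0
    while restarts < max_restarts:
        rc = subprocess.call(cmd)
        if rc == 0:
            return rc  # clean exit: stop supervising
        restarts += 1
        print(f"router exited rc={rc}; restart {restarts}/{max_restarts}",
              file=sys.stderr)
        time.sleep(backoff_s * restarts)
    return rc

# Example with a trivial command standing in for the router invocation
# from section 3, e.g.:
# supervise(["cargo", "run", "-p", "indexbus-route", "--bin", "indexbus-router",
#            "--", "--name", "fanout", "--interval-ms", "1000"])
rc = supervise([sys.executable, "-c", "raise SystemExit(0)"])
print("final rc:", rc)
```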

Slow consumer(s)

Symptoms:

  • one consumer depth grows while others stay low
  • in broadcast mode, drops may rise for that consumer only (observable, best-effort, via drops/sec)

Action:

  • restart or fix the slow consumer
  • consider work-queue mode if only one consumer should process each message

See: v1 failure & lifecycle “Consumer death” / “Stalls”.

Consumer queues full (drops)

Symptoms:

  • drops/sec > 0
  • consumer depth often near capacity

Action:

  • treat as explicit overload: reduce producer rate, scale consumers, or change policy at the edge

See: v1 failure & lifecycle “Stalls” (boundedness and backpressure policies).
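
"Reduce producer rate" can be enforced at the edge with a standard token bucket. This sketch is generic Python, not an IndexBus API, and the rate/burst numbers are placeholders:

```python
import time

class TokenBucket:
    """Token bucket: allow at most `rate` sends/sec with `burst` slack.

    Generic edge throttle for producers; not part of the IndexBus API.
    `now` is injectable so the behavior can be tested deterministically.
    """
    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = burst
        self.last = now()

    def try_send(self) -> bool:
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off or drop explicitly

# Deterministic example with a fake clock ticking 1 ms per call:
# attempts arrive at 1000/s against a 100/s budget, so most are refused.
clock = iter(x * 0.001 for x in range(1000))
bucket = TokenBucket(rate=100.0, burst=2.0, now=lambda: next(clock))
sent = sum(bucket.try_send() for _ in range(10))
print("sent", sent, "of 10 attempts")
```

Refusing a send at the producer turns an unattributed drops/sec signal into an explicit, observable decision at the edge, which is the spirit of the "treat as explicit overload" action above.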


5) Notes on v1 limitations

  • Drop reason attribution is best-effort.
    • In broadcast routing, drops may be counted as drops_full (destination queue full) or drops_no_credit (credit exhaustion), but per-consumer attribution is not guaranteed.
    • In work-queue routing, drops_all_full can include cases where no consumer was eligible (full and/or no-credit). The router reports drops_full vs drops_no_credit best-effort when credit masks are available.
    • Credit mask attribution is only meaningful when the fanout has N <= 64 consumers.

Provenance

Need the canonical source? Use the public hub to orient yourself, then jump to repo-owned docs or rustdoc when you need contract-level detail.