# IndexBus Ops Triage (v1)

Goal: answer “why is it slow/dropping?” using only:

- `indexbus-inspect` output (layout, caps, init state)
- router counters (throughput, drops, queue depth)
- the normative failure contracts in v1 failure & lifecycle

This is intentionally best-effort v1 guidance: it favors boundedness and clear operator actions.
Related:
- Router counters reference: ./router-counters.md
- Performance & tuning: ./performance-tuning.md
## Rustdoc entry points

For the API-level tools behind this runbook, start with:
## 1) Quick checklist

- Confirm the mapped region is valid:
  - `initialized == 2` (when present)
  - `mapped_bytes >= layout_bytes`
  - capability bits should match the intended region type
- Confirm the fanout router is running:
  - `routed/sec` should be > 0 when producers are active
- Decide whether the bottleneck is producer-side, router-side, or consumer-side:
  - producer backlog: `qdepth: src_spsc/src_mpsc` grows
  - consumer backlog: `qdepth: consumers=[..]` grows for one or more consumers
  - drops: `drops/sec` > 0
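The decision step above can be sketched as a small classifier over two counter snapshots. This is illustrative only: the function and the snapshot key names (`qdepth_src`, `qdepth_consumers`, `drops_per_sec`) are hypothetical, not part of indexbus; only the triage logic follows this runbook.

```python
# Hypothetical triage helper: classify the bottleneck from two snapshots of
# router counters, following the checklist order (drops first, then consumer
# backlog, then producer backlog).

def classify_bottleneck(prev: dict, curr: dict) -> str:
    """Return a coarse triage label from two counter snapshots."""
    src_growth = curr.get("qdepth_src", 0) - prev.get("qdepth_src", 0)
    consumer_growth = [
        c - p
        for p, c in zip(prev.get("qdepth_consumers", []),
                        curr.get("qdepth_consumers", []))
    ]
    if curr.get("drops_per_sec", 0) > 0:
        return "drops: explicit overload at a consumer queue"
    if any(g > 0 for g in consumer_growth):
        return "consumer backlog: one or more consumers are falling behind"
    if src_growth > 0:
        return "producer backlog: router is not keeping up (or not running)"
    return "no backlog detected"
```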
## 2) Inspect a region

Run:

```sh
cargo run -p indexbus-inspect -- <path-to-region-file>
cargo run -p indexbus-inspect -- <path-to-region-file> --json
```
Interpretation:

- `initialized`: 0/1/2 (or `n/a` in text output / `null` in JSON for layouts that do not use it, e.g. v1 state)
  - 0: uninitialized (treat as not ready)
  - 1: stuck initializing → treat as failed init; recreate region
  - 2: safe to operate (subject to other checks)
- `caps`: should match the intended region kind
- `layout_bytes` and `mapped_bytes`:
  - If `mapped_bytes < layout_bytes`: treat as truncated/corrupted mapping; recreate
If validation/inspection fails, follow v1 failure & lifecycle “ABI compatibility and corruption”.
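The readiness rules above can be checked mechanically against the `--json` output. A minimal sketch, assuming the JSON exposes the field names listed in this section (`initialized`, `layout_bytes`, `mapped_bytes`); the exact JSON shape of `indexbus-inspect` is an assumption:

```python
import json

# Hypothetical gate: decide whether a region is safe to operate on,
# applying the runbook's rules to indexbus-inspect --json output.
def region_is_ready(inspect_json: str) -> tuple[bool, str]:
    info = json.loads(inspect_json)
    init = info.get("initialized")  # may be null for layouts that don't use it
    if init is not None and init != 2:
        return False, f"initialized == {init}: not ready (1 = stuck init; recreate)"
    if info["mapped_bytes"] < info["layout_bytes"]:
        return False, "mapped_bytes < layout_bytes: truncated/corrupted mapping; recreate"
    return True, "ok"
```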
## 3) Run the router with periodic stats (fanout)

Run:

```sh
cargo run -p indexbus-route --bin indexbus-router -- --name fanout --interval-ms 1000
```

or against a specific file: add `--file <path>`.
Work-queue policy notes (best-effort):

- If consumers are saturated and you want the router to stop dequeuing instead of dropping: `--mode work --policy spinthenblock`
- If the region supports blocking and you want OS-backed waits (lower idle CPU): `--mode work --policy block`
The router prints key/value counters:

- `sent/sec`: best-effort producer enqueue rate (derived from producer queue tail/write)
- `routed/sec`: number of messages dequeued from the producer→router queue (router throughput)
- `recv/sec`: best-effort consumer dequeue rate (sum of consumer queue head deltas)
- `delivered/sec`: total per-consumer enqueues performed (broadcast can be > `routed/sec`)
- `drops/sec`: total drops (best-effort)
- `drops_full/sec`: drops best-effort attributed to destination queue full
- `drops_all_full/sec`: drops attributable to no eligible consumer having space (work-queue)
- `drops_no_credit/sec`: drops attributable to credit exhaustion (best-effort; see note below)
- `credit_waits/sec`: iterations where the router waited due to credits (work-queue)
- `detaches/sec`: number of consumer detaches performed by the credit policy
- `idle_waits/sec`: router “no work” iterations invoking the wait strategy
- `pressure_waits/sec`: router throttling iterations because consumer queues have no capacity
- `wake_waits/sec`: OS-backed wake waits (only when wake sections are present/used)
- `wake_timeouts/sec`: bounded wake waits that timed out
- `batches/sec`: routed batches per second
- `batch_avg`: average batch size over the interval
- `batch_max`: maximum batch size over the interval
- `qdepth: src_spsc / src_mpsc`: producer backlog
- `qdepth: consumers=[..]`: per-consumer backlog / lag proxy
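For scripting around these stats, a parser sketch. This assumes the router emits whitespace-separated `key=value` tokens; the real output format (e.g. the `qdepth: consumers=[..]` lines) may differ, so treat this as a starting point, not a spec:

```python
# Hypothetical parser for one router stats line, assuming whitespace-separated
# key=value pairs. Numeric values are converted to float; anything else
# (e.g. a consumers=[..] token) is kept as a raw string.

def parse_counters(line: str) -> dict:
    out = {}
    for tok in line.split():
        if "=" not in tok:
            continue
        key, value = tok.split("=", 1)
        try:
            out[key] = float(value)
        except ValueError:
            out[key] = value
    return out

# Example: flag sustained drops from a parsed line (see section 4).
stats = parse_counters("routed/sec=120000 delivered/sec=240000 drops/sec=35")
assert stats["drops/sec"] > 0
```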
## 4) Common incident patterns and actions

### Router not running

Symptoms:

- `routed/sec` absent (router not running)
- `qdepth: src_*` grows while producers are active

Action:

- restart the router process for that region

See: v1 failure & lifecycle “Router death (fanout)”.
### Slow consumer(s)

Symptoms:

- one consumer depth grows while others stay low
- in broadcast mode, drops may rise for that consumer only (best-effort observable as `drops/sec`)

Action:

- restart or fix the slow consumer
- consider work-queue mode if only one consumer should process each message

See: v1 failure & lifecycle “Consumer death” / “Stalls”.
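The “one consumer grows while others stay low” symptom can be automated over successive `qdepth: consumers=[..]` samples. A minimal sketch; the function and its growth threshold are hypothetical and should be tuned per deployment:

```python
# Hypothetical detector: single out lagging consumers by comparing
# per-consumer queue depths across two sampling intervals.

def lagging_consumers(prev_depths: list[int],
                      curr_depths: list[int],
                      growth_threshold: int = 100) -> list[int]:
    """Indices of consumers whose backlog grew past the threshold."""
    return [
        i for i, (p, c) in enumerate(zip(prev_depths, curr_depths))
        if c - p > growth_threshold
    ]
```

For example, depths `[10, 10, 10]` then `[12, 900, 15]` flag consumer index 1 as the one to restart or fix.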
### Consumer queues full (drops)

Symptoms:

- `drops/sec` > 0
- consumer depth often near capacity

Action:

- treat as explicit overload: reduce producer rate, scale consumers, or change policy at the edge

See: v1 failure & lifecycle “Stalls” (boundedness and backpressure policies).
## 5) Notes on v1 limitations

- Drop reason attribution is best-effort.
  - In broadcast routing, drops may be counted as `drops_full` (destination queue full) or `drops_no_credit` (credit exhaustion), but per-consumer attribution is not guaranteed.
  - In work-queue routing, `drops_all_full` can include cases where no consumer was eligible (full and/or no-credit). The router reports `drops_full` vs `drops_no_credit` best-effort when credit masks are available.
- Credit mask attribution is only meaningful when the fanout has $N \le 64$ consumers.
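The $N \le 64$ limit suggests per-consumer eligibility is tracked in a single 64-bit mask. The sketch below illustrates why, and how a simplified work-queue drop attribution could fall out of two such masks; every name here is hypothetical and the real indexbus attribution logic may differ:

```python
# Illustrative only: attribute a work-queue drop from two 64-bit eligibility
# masks, bit i describing consumer i. A single u64 is why attribution is only
# meaningful for N <= 64 consumers.

def attribute_drop(space_mask: int, credit_mask: int) -> str:
    eligible = space_mask & credit_mask  # consumers with both space and credit
    if eligible:
        return "no drop: at least one consumer is eligible"
    if space_mask == 0:
        return "drops_full: every consumer queue is full"
    return "drops_no_credit: space exists but credits are exhausted"

# Consumers 0 and 2 have space, but neither has credit left:
assert attribute_drop(0b101, 0b010).startswith("drops_no_credit")
```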