IndexBus Ops Triage (v1)

Goal: answer “why is it slow/dropping?” using only:

  • indexbus-inspect output (layout, caps, init state)
  • router counters (throughput, drops, queue depth)
  • the normative failure contracts in v1 failure & lifecycle

This is intentionally best-effort v1 guidance: it favors boundedness and clear operator actions.

Related:

  • Router counters reference: ./router-counters.md
  • Performance & tuning: ./performance-tuning.md

Rustdoc entry points

For the API-level tools behind this runbook, start with:


1) Quick checklist

  1. Confirm the mapped region is valid:

    • initialized == 2 (when present)
    • mapped_bytes >= layout_bytes
    • capability bits match the intended region kind
  2. Confirm the fanout router is running:

    • routed/sec should be > 0 when producers are active
  3. Decide whether the bottleneck is producer-side, router-side, or consumer-side:

    • producer backlog: qdepth: src_spsc / src_mpsc grows
    • consumer backlog: qdepth: consumers=[..] grows for one or more consumers
    • drops: drops/sec > 0
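
The decision in step 3 can be sketched as a small classifier over two successive counter samples. The dict keys used here (`qdepth_src`, `qdepth_consumers`) are illustrative stand-ins for the qdepth counters printed by the router, and the rules simply mirror the bullets above:

```python
def triage(prev, curr):
    """Classify the bottleneck from two successive router counter samples.

    `prev` and `curr` are dicts of counter values. Key names are
    illustrative stand-ins for the router's printed counters
    (routed/sec, drops/sec, qdepth fields); map them to your output.
    """
    if curr.get("routed/sec", 0) == 0:
        return "router-side: router idle or not running"
    if curr.get("drops/sec", 0) > 0:
        return "overload: consumer queues full (drops)"
    # Producer backlog: the source queue depth keeps growing.
    if curr.get("qdepth_src", 0) > prev.get("qdepth_src", 0):
        return "producer-side: source queue backlog growing"
    # Consumer backlog: any per-consumer depth keeps growing.
    grew = [i for i, (p, c) in enumerate(zip(prev.get("qdepth_consumers", []),
                                             curr.get("qdepth_consumers", [])))
            if c > p]
    if grew:
        return f"consumer-side: backlog growing for consumers {grew}"
    return "healthy: no growing backlog, no drops"

# Example: consumer 1 lags between samples.
prev = {"routed/sec": 1000, "drops/sec": 0, "qdepth_src": 4,
        "qdepth_consumers": [2, 10]}
curr = {"routed/sec": 1000, "drops/sec": 0, "qdepth_src": 4,
        "qdepth_consumers": [2, 40]}
print(triage(prev, curr))
```

The thresholds ("any growth") are deliberately naive; in practice you would compare depths over several intervals to filter out transient spikes.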

2) Inspect a region

Run:

  • cargo run -p indexbus-inspect -- <path-to-region-file>
  • cargo run -p indexbus-inspect -- <path-to-region-file> --json

Interpretation:

  • initialized: 0/1/2 (or n/a in text output / null in JSON for layouts that do not use it, e.g. v1 state)
    • 0: uninitialized (treat as not ready)
    • 1: stuck initializing → treat as failed init; recreate region
    • 2: safe to operate (subject to other checks)
  • caps: should match the intended region kind
  • layout_bytes and mapped_bytes
    • If mapped_bytes < layout_bytes: treat as truncated/corrupted mapping; recreate
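
A minimal validity check over the --json output can be sketched as follows. The field names (initialized, caps, layout_bytes, mapped_bytes) mirror the bullets above, but treat the exact JSON shape as an assumption to verify against your build:

```python
import json

def check_region(report: dict) -> list:
    """Return a list of problems found in an indexbus-inspect JSON report.

    Field names (initialized, layout_bytes, mapped_bytes) follow the text
    output described above; verify them against your build's --json output.
    """
    problems = []
    init = report.get("initialized")  # may be null for layouts without it
    if init == 0:
        problems.append("uninitialized: treat as not ready")
    elif init == 1:
        problems.append("stuck initializing: treat as failed init; recreate region")
    if report.get("mapped_bytes", 0) < report.get("layout_bytes", 0):
        problems.append("mapped_bytes < layout_bytes: truncated/corrupted mapping; recreate")
    return problems

# Example with a hypothetical report.
report = json.loads('{"initialized": 1, "caps": 3, "layout_bytes": 4096, "mapped_bytes": 4096}')
for p in check_region(report):
    print(p)
```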

If validation/inspection fails, follow v1 failure & lifecycle “ABI compatibility and corruption”.


3) Run the router with periodic stats (fanout)

Run:

  • cargo run -p indexbus-route --bin indexbus-router -- --name fanout --interval-ms 1000
  • or against a specific file: --file <path>

Work-queue policy notes (best-effort):

  • If consumers are saturated and you want the router to stop dequeuing instead of dropping:
    • --mode work --policy spinthenblock
  • If the region supports blocking and you want OS-backed waits (lower idle CPU):
    • --mode work --policy block

The router prints key/value counters:

  • sent/sec: best-effort producer enqueue rate (derived from producer queue tail/write)
  • routed/sec: number of messages dequeued from the producer→router queue (router throughput)
  • recv/sec: best-effort consumer dequeue rate (sum of consumer queue head deltas)
  • delivered/sec: total per-consumer enqueues performed (broadcast can be > routed/sec)
  • drops/sec: total drops (best-effort)
  • drops_full/sec: drops attributed (best-effort) to a full destination queue
  • drops_all_full/sec: drops attributable to no eligible consumer having space (work-queue)
  • drops_no_credit/sec: drops attributable to credit exhaustion (best-effort; see note below)
  • credit_waits/sec: iterations where the router waited due to credits (work-queue)
  • detaches/sec: number of consumer detaches performed by the credit policy
  • idle_waits/sec: router “no work” iterations invoking the wait strategy
  • pressure_waits/sec: router throttling iterations because consumer queues have no capacity
  • wake_waits/sec: OS-backed wake waits (only when wake sections are present/used)
  • wake_timeouts/sec: bounded wake waits that timed out
  • batches/sec: routed batches per second
  • batch_avg: average batch size over the interval
  • batch_max: maximum batch size over the interval
  • qdepth: src_spsc / src_mpsc: producer backlog
  • qdepth: consumers=[..]: per-consumer backlog / lag proxy
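
If you need to scrape these counters (e.g. to feed a dashboard), a parser sketch could look like the following. The exact line format is an assumption (space-separated name=value pairs plus a bracketed consumer list); adjust the patterns to the real router output:

```python
import re

def parse_stats(line: str) -> dict:
    """Parse a hypothetical key=value router stats line into a dict.

    Assumes space-separated `name=value` pairs plus a qdepth consumer list,
    e.g. "routed/sec=1200 drops/sec=3 qdepth: consumers=[4, 17]".
    Adjust the patterns to your router build's actual output format.
    """
    stats = {}
    for key, val in re.findall(r"([\w/]+)=([\d.]+)", line):
        stats[key] = float(val)
    m = re.search(r"consumers=\[([^\]]*)\]", line)
    if m:
        stats["qdepth_consumers"] = [int(x) for x in m.group(1).split(",") if x.strip()]
    return stats

line = "routed/sec=1200 delivered/sec=2400 drops/sec=3 qdepth: consumers=[4, 17]"
print(parse_stats(line))
```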

4) Common incident patterns and actions

Router not running

Symptoms:

  • no routed/sec counter output (no router process is reporting for the region)
  • qdepth: src_* grows while producers are active

Action:

  • restart the router process for that region

See: v1 failure & lifecycle “Router death (fanout)”.
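
One way to automate the restart action is a minimal supervisor loop. Everything here (restart cap, backoff values) is illustrative, and a production deployment would normally lean on systemd or an equivalent supervisor instead:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=5, backoff_s=1.0):
    """Restart `cmd` whenever it exits nonzero, with linear backoff.

    Illustrative sketch of bounded restarts; use a real supervisor
    (systemd, runit, ...) for production routers.
    """
    restarts = 0
    while restarts < max_restarts:
        rc = subprocess.call(cmd)
        if rc == 0:
            return rc  # clean exit: stop supervising
        restarts += 1
        print(f"router exited rc={rc}; restart {restarts}/{max_restarts}",
              file=sys.stderr)
        time.sleep(backoff_s * restarts)
    return rc

# Example with a trivial command standing in for the router invocation
# from section 3, e.g.:
# supervise(["cargo", "run", "-p", "indexbus-route", "--bin", "indexbus-router",
#            "--", "--name", "fanout", "--interval-ms", "1000"])
rc = supervise([sys.executable, "-c", "raise SystemExit(0)"])
print("final rc:", rc)
```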

Slow consumer(s)

Symptoms:

  • one consumer depth grows while others stay low
  • in broadcast mode, drops may rise for that consumer only (observable, best-effort, via drops/sec)

Action:

  • restart or fix the slow consumer
  • consider work-queue mode if only one consumer should process each message

See: v1 failure & lifecycle “Consumer death” / “Stalls”.

Consumer queues full (drops)

Symptoms:

  • drops/sec > 0
  • consumer depth often near capacity

Action:

  • treat as explicit overload: reduce producer rate, scale consumers, or change policy at the edge

See: v1 failure & lifecycle “Stalls” (boundedness and backpressure policies).
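
"Reduce producer rate" can be enforced at the edge with a standard token bucket. This sketch is generic Python, not an IndexBus API, and the rate/burst numbers are placeholders:

```python
import time

class TokenBucket:
    """Token bucket: allow at most `rate` sends/sec with `burst` slack.

    Generic edge throttle for producers; not part of the IndexBus API.
    `now` is injectable so the behavior can be tested deterministically.
    """
    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = burst
        self.last = now()

    def try_send(self) -> bool:
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off or drop explicitly

# Deterministic example with a fake clock ticking 1 ms per call:
# attempts arrive at 1000/s against a 100/s budget, so most are refused.
clock = iter(x * 0.001 for x in range(1000))
bucket = TokenBucket(rate=100.0, burst=2.0, now=lambda: next(clock))
sent = sum(bucket.try_send() for _ in range(10))
print("sent", sent, "of 10 attempts")
```

Refusing a send at the producer turns an unattributed drops/sec signal into an explicit, observable decision at the edge, which is the spirit of the "treat as explicit overload" action above.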


5) Notes on v1 limitations

  • Drop reason attribution is best-effort.
    • In broadcast routing, drops may be counted as drops_full (destination queue full) or drops_no_credit (credit exhaustion), but per-consumer attribution is not guaranteed.
    • In work-queue routing, drops_all_full can include cases where no consumer was eligible (full and/or no-credit). The router reports drops_full vs drops_no_credit best-effort when credit masks are available.
    • Credit mask attribution is only meaningful when the fanout has N <= 64 consumers.

Provenance

Need the canonical source? Use the public hub to orient yourself, then jump to repo-owned docs or rustdoc when you need contract-level detail.