Operational Runbook

This runbook covers first-response triage, key metrics, common incident patterns, and performance tuning guidance for production ByteOr deployments.

First-Pass Checklist

Before diving into metrics or logs, confirm the basics:

  1. Validate the region — Verify the alert is firing in the expected region and that the control plane is reachable from that region.
  2. Confirm the router — Identify which router instance is involved. Check the router_id label on the alert and cross-reference it with the environment's agent list.
  3. Identify the bottleneck — Determine whether the issue is upstream (producers backing up), downstream (consumers stalled), or internal (router loop saturation).
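
The bottleneck classification in step 3 can be sketched as a shell heuristic. The metric names, the inlined snapshot, and the threshold logic are illustrative assumptions; in production the snapshot would come from the router's /metrics endpoint (host and port vary by deployment):

```shell
#!/bin/sh
# Heuristic bottleneck classifier from one metrics snapshot.
# The snapshot is inlined here; in production, fetch it with e.g.:
#   curl -s "http://$ROUTER_HOST:$METRICS_PORT/metrics"
snapshot='sent_per_sec 5000
recv_per_sec 1200
credit_waits_per_sec 300'

sent=$(printf '%s\n' "$snapshot" | awk '$1=="sent_per_sec"{print $2}')
recv=$(printf '%s\n' "$snapshot" | awk '$1=="recv_per_sec"{print $2}')
waits=$(printf '%s\n' "$snapshot" | awk '$1=="credit_waits_per_sec"{print $2}')

if [ "$waits" -gt 0 ]; then
  verdict="producers back-pressured (credit_waits/sec=$waits): look downstream"
elif [ "$recv" -lt "$sent" ]; then
  verdict="consumers lagging (recv/sec=$recv < sent/sec=$sent)"
else
  verdict="rates balanced: suspect internal router loop saturation"
fi
echo "$verdict"
```

A real triage script would compare two scrapes a few seconds apart rather than a single sample, since all of these counters are rates.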

Router Counters

These counters are exposed on the /metrics endpoint of every router instance. Use them to build dashboards and alert rules.

  • sent/sec — Messages sent into the router by producers.
  • routed/sec — Messages matched to at least one output lane.
  • delivered/sec — Messages successfully written to a consumer's input buffer.
  • recv/sec — Messages pulled from the router by consumers.
  • drops/sec — Total messages dropped (sum of all drop sub-categories).
  • drops_full/sec — Messages dropped because the target lane buffer was full.
  • drops_no_credit/sec — Messages dropped because the consumer had no flow-control credit remaining.
  • drops_all_full/sec — Messages dropped because every output lane was full.
  • credit_waits/sec — Times a producer blocked waiting for flow-control credit to become available.
  • detaches/sec — Consumer detach events (graceful disconnects or timeouts).
  • qdepth — Current queue depth across all lanes (a gauge, not a rate).
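
Since drops/sec is defined as the sum of its sub-categories, a dashboard sanity check can cross-check them. A minimal sketch, assuming the counters are scraped as plain `name value` lines (the exact exposition format is an assumption):

```shell
#!/bin/sh
# Cross-check that drops/sec equals the sum of its sub-categories.
snapshot='drops_per_sec 120
drops_full_per_sec 90
drops_no_credit_per_sec 25
drops_all_full_per_sec 5'

total=$(printf '%s\n' "$snapshot" | awk '$1=="drops_per_sec"{print $2}')
sum=$(printf '%s\n' "$snapshot" | awk \
  '$1=="drops_full_per_sec" || $1=="drops_no_credit_per_sec" || $1=="drops_all_full_per_sec" {s+=$2} END{print s}')

if [ "$total" -eq "$sum" ]; then
  echo "drop accounting consistent ($total/sec)"
else
  echo "discrepancy: total=$total sum=$sum (small skew is expected; counters are sampled)"
fi
```

Small mismatches are normal under contention (see Anti-Patterns); alert only on sustained, large discrepancies.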

Common Incident Patterns

  • drops_full/sec spikes on a single lane — Likely cause: one consumer is slow or stuck. Action: check the consumer's CPU and memory; restart it if unresponsive; increase the lane buffer size if the consumer is healthy but bursty.
  • drops_all_full/sec sustained above zero — Likely cause: all consumers are falling behind the producer rate. Action: scale out consumers or reduce producer throughput; check for downstream dependency failures (database, network).
  • credit_waits/sec rising while sent/sec stays flat — Likely cause: producers are being back-pressured by flow control. Action: this is healthy back-pressure; if latency is unacceptable, increase credit limits or add router capacity.
  • detaches/sec spikes — Likely cause: consumers are crashing or being evicted. Action: review consumer logs for panics or OOM kills; check Kubernetes pod eviction events.
  • qdepth growing unbounded — Likely cause: messages are being produced faster than consumed and drops are disabled. Action: enable drop policies or scale consumers; investigate whether the workload spike is expected.
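
For the last pattern, a quick way to quantify growth is two qdepth samples over a fixed interval. The sample values below are illustrative; in production they would be read from /metrics:

```shell
#!/bin/sh
# Estimate qdepth growth rate from two gauge samples.
q1=1000         # qdepth at t0 (illustrative value)
q2=4000         # qdepth at t0 + interval (illustrative value)
interval=60     # seconds between the two samples

growth=$(( (q2 - q1) / interval ))
if [ "$growth" -gt 0 ]; then
  echo "qdepth growing at ~${growth} msgs/s: scale consumers or enable drop policies"
else
  echo "qdepth stable or draining"
fi
```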

Performance Tuning

Router Loop

The router loop is the hot path. Tuning it has the largest impact on throughput and latency.

  • Batch size — Increase the router batch size (router_batch_size) to amortize per-message overhead. Values between 256 and 4096 work well for most workloads. Larger batches increase tail latency.
  • Poll interval — The router polls for new messages on a configurable interval (router_poll_interval_us). Lower values reduce latency but increase CPU usage. Start at 100 µs and adjust based on your latency budget.
  • Affinity — Pin the router thread to a dedicated CPU core to avoid context-switch jitter. Use taskset or cgroup cpusets.
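
The three knobs above might be combined as follows. The config keys come from the text; the file location, syntax, and router binary name are assumptions:

```shell
# Hypothetical config fragment; key names match the settings described
# above, but the path and file syntax are assumptions.
cat > /etc/byteor/router.conf <<'EOF'
router_batch_size = 1024        # 256-4096; larger amortizes overhead, raises tail latency
router_poll_interval_us = 100   # lower = less latency, more CPU
EOF

# Pin the router to a dedicated core to avoid context-switch jitter
# (core id and binary name are illustrative):
taskset -c 3 byteor-router --config /etc/byteor/router.conf
```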

CPU

  • Isolate router cores — Reserve 1–2 cores exclusively for the router. Use isolcpus or cgroup cpusets to prevent the OS scheduler from placing other work on those cores.
  • Disable hyper-threading — On latency-sensitive deployments, disable SMT to avoid contention on shared execution units.
  • Governor — Set the CPU frequency governor to performance to prevent frequency scaling during bursts.
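
A sketch of the CPU-level settings, assuming a Linux host; the router binary name is an assumption:

```shell
# Kernel boot parameter to keep the scheduler off cores 2-3
# (add to the kernel command line, then reboot):
#   isolcpus=2,3
# Then pin the router onto the isolated cores:
taskset -c 2,3 byteor-router

# Lock every core to the performance governor:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done

# Disable SMT at runtime (alternative to a BIOS setting):
echo off > /sys/devices/system/cpu/smt/control
```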

Memory

  • Huge pages — Enable 2 MiB huge pages for shared-memory segments to reduce TLB misses. Pre-allocate the required number at boot via vm.nr_hugepages.
  • NUMA locality — Ensure producers, routers, and consumers sharing a segment are all scheduled on the same NUMA node. Use numactl --membind and --cpunodebind.
  • Lock pages — Use mlockall or equivalent to prevent the OS from swapping shared-memory pages to disk.
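
The memory settings above might look like this on a Linux host; the router binary name is an assumption:

```shell
# Pre-allocate 1024 x 2 MiB huge pages (2 GiB total); persist the
# setting in /etc/sysctl.conf so it survives reboots:
sysctl -w vm.nr_hugepages=1024

# Keep the router and its shared-memory peers on NUMA node 0 so the
# segment's pages and the threads touching them stay node-local:
numactl --cpunodebind=0 --membind=0 byteor-router

# mlockall must be called in-process; raise the limit governing how
# much memory the process may lock before launching it:
ulimit -l unlimited
```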

Anti-Patterns

Avoid these common mistakes when operating IndexBus routers:

  • Expecting fairness — The router does not guarantee fair scheduling across lanes. A high-volume lane can starve a low-volume lane. Use separate router instances if strict isolation is required.
  • Treating drops as exact accounting — Drop counters are sampled, not transactional. A small discrepancy between sent/sec and delivered/sec + drops/sec is normal under high contention.
  • Assuming blocking changes semantics — Switching a lane from non-blocking to blocking mode does not retroactively recover dropped messages. It only changes future behavior.