Operational Runbook

This runbook covers first-response triage, key metrics, common incident patterns, and performance tuning guidance for production ByteOr deployments.

First-Pass Checklist

Before diving into metrics or logs, confirm the basics:

  1. Validate the region — Verify the alert is firing in the expected region and that the control plane is reachable from that region.
  2. Confirm the router — Identify which router instance is involved. Check the router_id label on the alert and cross-reference it with the environment's agent list.
  3. Identify the bottleneck — Determine whether the issue is upstream (producers backing up), downstream (consumers stalled), or internal (router loop saturation).
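
The bottleneck classification in step 3 can be sketched as a shell heuristic. The metric names, the inlined snapshot, and the threshold logic are illustrative assumptions; in production the snapshot would come from the router's /metrics endpoint (host and port vary by deployment):

```shell
#!/bin/sh
# Heuristic bottleneck classifier from one metrics snapshot.
# The snapshot is inlined here; in production, fetch it with e.g.:
#   curl -s "http://$ROUTER_HOST:$METRICS_PORT/metrics"
snapshot='sent_per_sec 5000
recv_per_sec 1200
credit_waits_per_sec 300'

sent=$(printf '%s\n' "$snapshot" | awk '$1=="sent_per_sec"{print $2}')
recv=$(printf '%s\n' "$snapshot" | awk '$1=="recv_per_sec"{print $2}')
waits=$(printf '%s\n' "$snapshot" | awk '$1=="credit_waits_per_sec"{print $2}')

if [ "$waits" -gt 0 ]; then
  verdict="producers back-pressured (credit_waits/sec=$waits): look downstream"
elif [ "$recv" -lt "$sent" ]; then
  verdict="consumers lagging (recv/sec=$recv < sent/sec=$sent)"
else
  verdict="rates balanced: suspect internal router loop saturation"
fi
echo "$verdict"
```

A real triage script would compare two scrapes a few seconds apart rather than a single sample, since all of these counters are rates.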

Router Counters

These counters are exposed on the /metrics endpoint of every router instance. Use them to build dashboards and alert rules.

  • sent/sec — Messages sent into the router by producers.
  • routed/sec — Messages matched to at least one output lane.
  • delivered/sec — Messages successfully written to a consumer's input buffer.
  • recv/sec — Messages pulled from the router by consumers.
  • drops/sec — Total messages dropped (sum of all drop sub-categories).
  • drops_full/sec — Messages dropped because the target lane buffer was full.
  • drops_no_credit/sec — Messages dropped because the consumer had no flow-control credit remaining.
  • drops_all_full/sec — Messages dropped because every output lane was full.
  • credit_waits/sec — Times a producer blocked waiting for flow-control credit to become available.
  • detaches/sec — Consumer detach events (graceful disconnects or timeouts).
  • qdepth — Current queue depth across all lanes (a gauge, not a rate).
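
Since drops/sec is defined as the sum of its sub-categories, a dashboard sanity check can cross-check them. A minimal sketch, assuming the counters are scraped as plain `name value` lines (the exact exposition format is an assumption):

```shell
#!/bin/sh
# Cross-check that drops/sec equals the sum of its sub-categories.
snapshot='drops_per_sec 120
drops_full_per_sec 90
drops_no_credit_per_sec 25
drops_all_full_per_sec 5'

total=$(printf '%s\n' "$snapshot" | awk '$1=="drops_per_sec"{print $2}')
sum=$(printf '%s\n' "$snapshot" | awk \
  '$1=="drops_full_per_sec" || $1=="drops_no_credit_per_sec" || $1=="drops_all_full_per_sec" {s+=$2} END{print s}')

if [ "$total" -eq "$sum" ]; then
  echo "drop accounting consistent ($total/sec)"
else
  echo "discrepancy: total=$total sum=$sum (small skew is expected; counters are sampled)"
fi
```

Small mismatches are normal under contention (see Anti-Patterns); alert only on sustained, large discrepancies.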

Common Incident Patterns

  • drops_full/sec spikes on a single lane — Likely cause: one consumer is slow or stuck. Action: check the consumer's CPU and memory; restart it if unresponsive; increase the lane buffer size if the consumer is healthy but bursty.
  • drops_all_full/sec sustained above zero — Likely cause: all consumers are falling behind the producer rate. Action: scale out consumers or reduce producer throughput; check for downstream dependency failures (database, network).
  • credit_waits/sec rising while sent/sec stays flat — Likely cause: producers are being back-pressured by flow control. Action: this is healthy back-pressure; if latency is unacceptable, increase credit limits or add router capacity.
  • detaches/sec spikes — Likely cause: consumers are crashing or being evicted. Action: review consumer logs for panics or OOM kills; check Kubernetes pod eviction events.
  • qdepth growing unbounded — Likely cause: messages are being produced faster than consumed and drops are disabled. Action: enable drop policies or scale consumers; investigate whether the workload spike is expected.
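
For the last pattern, a quick way to quantify growth is two qdepth samples over a fixed interval. The sample values below are illustrative; in production they would be read from /metrics:

```shell
#!/bin/sh
# Estimate qdepth growth rate from two gauge samples.
q1=1000         # qdepth at t0 (illustrative value)
q2=4000         # qdepth at t0 + interval (illustrative value)
interval=60     # seconds between the two samples

growth=$(( (q2 - q1) / interval ))
if [ "$growth" -gt 0 ]; then
  echo "qdepth growing at ~${growth} msgs/s: scale consumers or enable drop policies"
else
  echo "qdepth stable or draining"
fi
```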

Performance Tuning

Router Loop

The router loop is the hot path. Tuning it has the largest impact on throughput and latency.

  • Batch size — Increase the router batch size (router_batch_size) to amortize per-message overhead. Values between 256 and 4096 work well for most workloads. Larger batches increase tail latency.
  • Poll interval — The router polls for new messages on a configurable interval (router_poll_interval_us). Lower values reduce latency but increase CPU usage. Start at 100 µs and adjust based on your latency budget.
  • Affinity — Pin the router thread to a dedicated CPU core to avoid context-switch jitter. Use taskset or cgroup cpusets.
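
The three knobs above might be combined as follows. The config keys come from the text; the file location, syntax, and router binary name are assumptions:

```shell
# Hypothetical config fragment; key names match the settings described
# above, but the path and file syntax are assumptions.
cat > /etc/byteor/router.conf <<'EOF'
router_batch_size = 1024        # 256-4096; larger amortizes overhead, raises tail latency
router_poll_interval_us = 100   # lower = less latency, more CPU
EOF

# Pin the router to a dedicated core to avoid context-switch jitter
# (core id and binary name are illustrative):
taskset -c 3 byteor-router --config /etc/byteor/router.conf
```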

CPU

  • Isolate router cores — Reserve 1–2 cores exclusively for the router. Use isolcpus or cgroup cpusets to prevent the OS scheduler from placing other work on those cores.
  • Disable hyper-threading — On latency-sensitive deployments, disable SMT to avoid contention on shared execution units.
  • Governor — Set the CPU frequency governor to performance to prevent frequency scaling during bursts.
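
A sketch of the CPU-level settings, assuming a Linux host; the router binary name is an assumption:

```shell
# Kernel boot parameter to keep the scheduler off cores 2-3
# (add to the kernel command line, then reboot):
#   isolcpus=2,3
# Then pin the router onto the isolated cores:
taskset -c 2,3 byteor-router

# Lock every core to the performance governor:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done

# Disable SMT at runtime (alternative to a BIOS setting):
echo off > /sys/devices/system/cpu/smt/control
```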

Memory

  • Huge pages — Enable 2 MiB huge pages for shared-memory segments to reduce TLB misses. Pre-allocate the required number at boot via vm.nr_hugepages.
  • NUMA locality — Ensure producers, routers, and consumers sharing a segment are all scheduled on the same NUMA node. Use numactl --membind and --cpunodebind.
  • Lock pages — Use mlockall or equivalent to prevent the OS from swapping shared-memory pages to disk.
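
The memory settings above might look like this on a Linux host; the router binary name is an assumption:

```shell
# Pre-allocate 1024 x 2 MiB huge pages (2 GiB total); persist the
# setting in /etc/sysctl.conf so it survives reboots:
sysctl -w vm.nr_hugepages=1024

# Keep the router and its shared-memory peers on NUMA node 0 so the
# segment's pages and the threads touching them stay node-local:
numactl --cpunodebind=0 --membind=0 byteor-router

# mlockall must be called in-process; raise the limit governing how
# much memory the process may lock before launching it:
ulimit -l unlimited
```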

Anti-Patterns

Avoid these common mistakes when operating IndexBus routers:

  • Expecting fairness — The router does not guarantee fair scheduling across lanes. A high-volume lane can starve a low-volume lane. Use separate router instances if strict isolation is required.
  • Treating drops as exact accounting — Drop counters are sampled, not transactional. A small discrepancy between sent/sec and delivered/sec + drops/sec is normal under high contention.
  • Assuming blocking changes semantics — Switching a lane from non-blocking to blocking mode does not retroactively recover dropped messages. It only changes future behavior.