Operational Runbook
This runbook covers first-response triage, key metrics, common incident patterns, and performance tuning guidance for production ByteOr deployments.
First-Pass Checklist
Before diving into metrics or logs, confirm the basics:
- Validate the region — Verify the alert is firing in the expected region and that the control plane is reachable from that region.
- Confirm the router — Identify which router instance is involved. Check the `router_id` label on the alert and cross-reference it with the environment's agent list.
- Identify the bottleneck — Determine whether the issue is upstream (producers backing up), downstream (consumers stalled), or internal (router loop saturation).
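The `router_id` cross-reference in the checklist above can be sketched as a small lookup. The alert-label and agent-record shapes here are assumptions for illustration, not a real ByteOr API:

```python
# Hypothetical shapes: an alert's label set and the environment's agent
# list are both plain dicts here for illustration.

def find_router(alert_labels, agents):
    """Return the agent record matching the alert's router_id, if any."""
    router_id = alert_labels.get("router_id")
    for agent in agents:
        if agent.get("router_id") == router_id:
            return agent
    return None

agents = [
    {"router_id": "rtr-7", "region": "eu-west-1", "host": "node-03"},
    {"router_id": "rtr-9", "region": "us-east-2", "host": "node-11"},
]

match = find_router({"router_id": "rtr-9", "severity": "page"}, agents)
```

If the lookup comes back empty, the alert is likely stale or the agent list is out of date — resolve that before triaging further.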
Router Counters
These counters are exposed on the /metrics endpoint of every router instance. Use them to build dashboards and alert rules.
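The `/metrics` endpoint serves the standard Prometheus text format, which is easy to inspect ad hoc. Below is a minimal parser sketch; the counter names and label in the sample are invented for illustration and will differ in your deployment:

```python
# Minimal reader for the Prometheus text exposition format: one
# "name{labels} value" pair per line, comments prefixed with '#'.

def parse_metrics(text):
    """Map metric name (including labels) -> float value."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

sample = """\
# HELP router_msgs_total Messages routed (illustrative name).
router_msgs_total{router_id="rtr-7"} 123456
router_drops_total{router_id="rtr-7"} 42
"""
metrics = parse_metrics(sample)
```

For dashboards and alert rules, scrape the endpoint with Prometheus directly rather than hand-parsing; the sketch is for one-off triage.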
Common Incident Patterns
Performance Tuning
Router Loop
The router loop is the hot path. Tuning it has the largest impact on throughput and latency.
- Batch size — Increase the router batch size (`router_batch_size`) to amortize per-message overhead. Values between 256 and 4096 work well for most workloads. Larger batches increase tail latency.
- Poll interval — The router polls for new messages on a configurable interval (`router_poll_interval_us`). Lower values reduce latency but increase CPU usage. Start at 100 µs and adjust based on your latency budget.
- Affinity — Pin the router thread to a dedicated CPU core to avoid context-switch jitter. Use `taskset` or cgroup cpusets.
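The batch-size/poll-interval trade-off can be made concrete with back-of-envelope arithmetic. The per-message and per-batch costs below are illustrative assumptions, not measured ByteOr numbers:

```python
# Toy model of the router loop: each cycle is one poll wait, one fixed
# per-batch cost, and a per-message cost for every message in the batch.

def max_throughput(batch_size, poll_interval_us, per_msg_us=0.1, per_batch_us=5.0):
    """Messages/sec sustained if every poll returns a full batch."""
    cycle_us = poll_interval_us + per_batch_us + batch_size * per_msg_us
    return batch_size / cycle_us * 1_000_000

def worst_case_wait_us(batch_size, poll_interval_us, per_msg_us=0.1, per_batch_us=5.0):
    """A message arriving just after a poll waits up to one full cycle."""
    return poll_interval_us + per_batch_us + batch_size * per_msg_us

# Growing the batch amortizes the fixed per-batch cost...
small = max_throughput(batch_size=256, poll_interval_us=100)
large = max_throughput(batch_size=4096, poll_interval_us=100)
# ...but also stretches the cycle a message may have to wait out.
```

Under these assumed costs the larger batch wins on throughput while its worst-case wait grows with the batch, which is the tail-latency effect noted above.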
CPU
- Isolate router cores — Reserve 1–2 cores exclusively for the router. Use `isolcpus` or cgroup cpusets to prevent the OS scheduler from placing other work on those cores.
- Disable hyper-threading — On latency-sensitive deployments, disable SMT to avoid contention on shared execution units.
- Governor — Set the CPU frequency governor to `performance` to prevent frequency scaling during bursts.
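A quick audit that the governor setting above actually took effect can read the Linux sysfs files directly. This is a sketch; pass a different root path when testing outside a Linux host:

```python
# Check every core's scaling_governor under the standard Linux sysfs
# layout: /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor.
import glob
import os

def non_performance_cores(sysfs_root="/sys/devices/system/cpu"):
    """Return the governor files whose value is not 'performance'."""
    bad = []
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in glob.glob(pattern):
        with open(path) as f:
            if f.read().strip() != "performance":
                bad.append(path)
    return sorted(bad)
```

An empty return means all discovered cores are pinned to `performance`; anything listed needs `cpupower frequency-set -g performance` (or your distro's equivalent) re-applied.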
Memory
- Huge pages — Enable 2 MiB huge pages for shared-memory segments to reduce TLB misses. Pre-allocate the required number at boot via `vm.nr_hugepages`.
- NUMA locality — Ensure producers, routers, and consumers sharing a segment are all scheduled on the same NUMA node. Use `numactl --membind` and `--cpunodebind`.
- Lock pages — Use `mlockall` or equivalent to prevent the OS from swapping shared-memory pages to disk.
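Sizing `vm.nr_hugepages` is straightforward arithmetic: each segment rounds up to a whole number of 2 MiB pages. The segment sizes below are examples, not ByteOr defaults:

```python
# Pages to pre-allocate for vm.nr_hugepages, given segment sizes in
# bytes. Each segment consumes whole 2 MiB pages, so round each up.

HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB

def nr_hugepages(segment_sizes):
    """Total huge pages needed for the given segment sizes (bytes)."""
    return sum((size + HUGE_PAGE - 1) // HUGE_PAGE for size in segment_sizes)

# Example: a 1 GiB data ring plus a 100 MiB control segment.
pages = nr_hugepages([1 * 1024**3, 100 * 1024**2])
```

Apply the result with `sysctl vm.nr_hugepages=<pages>` (or the boot command line, which is more reliable since huge pages are easiest to allocate before memory fragments).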
Anti-Patterns
Avoid these common mistakes when operating ByteOr routers:
- Expecting fairness — The router does not guarantee fair scheduling across lanes. A high-volume lane can starve a low-volume lane. Use separate router instances if strict isolation is required.
- Treating drops as exact accounting — Drop counters are sampled, not transactional. A small discrepancy between `sent/sec` and `delivered/sec + drops/sec` is normal under high contention.
- Assuming blocking changes semantics — Switching a lane from non-blocking to blocking mode does not retroactively recover dropped messages. It only changes future behavior.
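The drop-accounting point above suggests alerting on a tolerance rather than an exact identity. A sketch, with the 1% default tolerance chosen arbitrarily for illustration:

```python
# Sampled counters mean sent ~= delivered + drops, never exactly equal.
# Alert only when the relative gap exceeds a tolerance.

def rates_consistent(sent, delivered, drops, rel_tol=0.01):
    """True if per-second rates agree to within rel_tol (default 1%)."""
    if sent == 0:
        return delivered == 0 and drops == 0
    return abs(sent - (delivered + drops)) / sent <= rel_tol

ok = rates_consistent(sent=100_000, delivered=99_480, drops=450)       # 0.07% gap
alarm = not rates_consistent(sent=100_000, delivered=80_000, drops=450)  # ~20% gap
```

A gap well outside the tolerance points at a real problem (a stalled consumer or a mis-scraped counter) rather than sampling noise.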