IndexBus Failure & Lifecycle (normative, v1)

This document defines how IndexBus v1 behaves under failures and lifecycle events.

Normative language

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are to be interpreted as described in RFC 2119 and RFC 8174.

Scope:

  • SHM mappings and validation (indexbus-transport-shm, indexbus-core::validate)
  • In-memory/local regions (indexbus-transport-local)
  • Core primitives (indexbus-core) and the envelope (indexbus-msg)

1) Lifecycle states and validation

1.1 Initialization state machine

Regions with an initialized field use this state machine:

  • 0: uninitialized
  • 1: initializing
  • 2: initialized

Normative rules:

  • Producers/consumers/routers MUST only operate on regions that validate successfully.
  • A region that remains at initialized = 1 and never reaches 2 MUST be treated as a failed initialization.
    • Required operator action: delete/recreate the region (or reinitialize in-place with exclusive access).

Note:

  • Not all v1 layouts have initialized. For example, v1 state regions use a seq parity protocol (even = stable, odd = writer in progress) and validation does not require initialized == 2.
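
The seq parity rule in the note above can be sketched as follows. This is a minimal illustration rather than the actual indexbus-core layout: StateCell, its fields, and try_read are hypothetical names, a single writer is assumed, and SeqCst ordering is used throughout for simplicity rather than performance.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Even seq = stable, odd seq = writer in progress.
pub fn is_stable(seq: u64) -> bool {
    seq % 2 == 0
}

/// Hypothetical state cell: a sequence counter guarding one 64-bit value.
pub struct StateCell {
    seq: AtomicU64,
    value: AtomicU64,
}

impl StateCell {
    pub fn new(value: u64) -> Self {
        Self { seq: AtomicU64::new(0), value: AtomicU64::new(value) }
    }

    /// Single writer assumed: bump seq to odd, store, bump seq back to even.
    pub fn write(&self, v: u64) {
        let s = self.seq.load(Ordering::SeqCst);
        self.seq.store(s.wrapping_add(1), Ordering::SeqCst);
        self.value.store(v, Ordering::SeqCst);
        self.seq.store(s.wrapping_add(2), Ordering::SeqCst);
    }

    /// Returns None if a write was in progress or completed during the read;
    /// the caller is expected to retry.
    pub fn try_read(&self) -> Option<u64> {
        let before = self.seq.load(Ordering::SeqCst);
        if !is_stable(before) {
            return None;
        }
        let v = self.value.load(Ordering::SeqCst);
        let after = self.seq.load(Ordering::SeqCst);
        if before == after { Some(v) } else { None }
    }
}
```

A reader that repeatedly gets None because seq stays odd is observing a writer that died mid-update, which is the state-region analogue of a stuck initialized = 1.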

Implementation note (v1):

  • Validation reports NotInitialized for any initialized != 2 and does not distinguish 0 vs 1; operators may still treat a persistently-non-2 region as a failed or incomplete initialization.
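
As a rough sketch of the state machine and the implementation note above: the u32 field width, the function names, and the looks_stuck helper are assumptions made for illustration, not the indexbus-core API. Only the values 0/1/2 and the NotInitialized outcome come from this document.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InitState {
    Uninitialized, // 0
    Initializing,  // 1
    Initialized,   // 2
}

#[derive(Debug, PartialEq, Eq)]
pub enum ValidateError {
    NotInitialized,
}

/// Decode the raw field; values outside 0..=2 are not part of the state machine.
pub fn decode_init(raw: u32) -> Option<InitState> {
    match raw {
        0 => Some(InitState::Uninitialized),
        1 => Some(InitState::Initializing),
        2 => Some(InitState::Initialized),
        _ => None,
    }
}

/// Mirrors the v1 behavior described above: anything other than 2 is
/// reported as NotInitialized, without distinguishing 0 from 1.
pub fn check_initialized(raw: u32) -> Result<(), ValidateError> {
    if raw == 2 { Ok(()) } else { Err(ValidateError::NotInitialized) }
}

/// Hypothetical operator-side heuristic: a region observed as non-2 for at
/// least `limit` is treated as a failed or incomplete initialization.
pub fn looks_stuck(raw: u32, observed_for: Duration, limit: Duration) -> bool {
    raw != 2 && observed_for >= limit
}
```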

1.2 ABI compatibility and corruption

All mapped regions MUST be validated before use:

  • magic/version compatibility
  • region kind discriminator (LayoutHeader.flags)
  • required capability bits
  • layout_bytes is large enough for the required base layout
  • (where applicable) initialized state is 2
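
The checklist above might be expressed along these lines. The Header and Expect structs, their field widths, the kind mask, and the error variant names are all hypothetical; the real LayoutHeader and indexbus-core::validate may differ. The sketch only shows the order and shape of the checks.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ValidateError {
    BadMagic,
    IncompatibleVersion,
    WrongKind,
    MissingCaps,
    TooSmall,
    NotInitialized,
}

/// Hypothetical header as read from a mapped region.
#[derive(Debug, Clone, Copy)]
pub struct Header {
    pub magic: u32,
    pub version: u16,
    pub flags: u16,               // carries the region kind discriminator
    pub caps: u32,                // capability bits
    pub layout_bytes: u32,
    pub initialized: Option<u32>, // None for layouts without the field
}

/// What this participant was built to expect (assumed shape).
pub struct Expect {
    pub magic: u32,
    pub version: u16,
    pub kind_mask: u16,
    pub kind: u16,
    pub required_caps: u32,
    pub min_layout_bytes: u32,
}

pub fn validate(h: &Header, e: &Expect) -> Result<(), ValidateError> {
    if h.magic != e.magic {
        return Err(ValidateError::BadMagic);
    }
    if h.version != e.version {
        return Err(ValidateError::IncompatibleVersion);
    }
    if h.flags & e.kind_mask != e.kind {
        return Err(ValidateError::WrongKind);
    }
    if h.caps & e.required_caps != e.required_caps {
        return Err(ValidateError::MissingCaps);
    }
    if h.layout_bytes < e.min_layout_bytes {
        return Err(ValidateError::TooSmall);
    }
    // Only layouts that carry an initialized field are checked for == 2.
    if let Some(state) = h.initialized {
        if state != 2 {
            return Err(ValidateError::NotInitialized);
        }
    }
    Ok(())
}
```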

2) Sequencer: stalls and restart behavior

Sequencer regions are coordination primitives: they do not store payloads themselves.

Normative stall behavior (v1):

  • If any consumer/stage stops advancing its gating sequence indefinitely, producers will eventually be unable to advance due to wrap prevention (bounded backpressure).
  • Operator action (v1): restart the stalled consumer/stage, or recreate the region if the gating sequence cannot be advanced safely.
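
To illustrate why one stalled gating sequence eventually blocks producers, here is a minimal sketch of a wrap-prevention check. It assumes sequences start at 0, that each gating value is the next sequence that consumer will read, and that slot index is sequence modulo capacity; can_claim is a hypothetical name, not the sequencer API.

```rust
/// Returns true if a producer may claim sequence `next` without overwriting a
/// slot that the slowest consumer has not yet passed.
///
/// Assumptions (not taken from the IndexBus sources): `gating[i]` is the next
/// sequence consumer `i` will read, and the ring has `capacity` slots.
pub fn can_claim(next: u64, gating: &[u64], capacity: u64) -> bool {
    match gating.iter().copied().min() {
        // No registered consumers: nothing to protect.
        None => true,
        // The slowest consumer bounds how far ahead the producer may run.
        Some(slowest) => next.saturating_sub(slowest) < capacity,
    }
}
```

With capacity 4 and a consumer parked at sequence 0, sequences 0 through 3 can be claimed and sequence 4 cannot. The producer stays blocked until that consumer advances or the region is recreated.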

Restart notes (v1):

  • Gating sequences are stored in the shared region, so producer wrap prevention persists across producer restarts.
  • Any wake-backed blocking is best-effort and does not change correctness; after restarts, participants SHOULD resume polling/waiting based on cursor and gating[].

If validation (section 1.2) fails for any region kind, the region MUST be treated as one of:

  • incompatible (wrong version/caps)
  • corrupted (truncated mapping, bad fields)

Required operator action:

  • recreate the region with the expected version/caps, or rebuild the producer/consumer to match.

3) Participant death

3.1 Producer death

If a producer process/thread dies:

  • Messages that were fully committed prior to death may still be received.
  • Messages in-flight at the time of death may be lost.
  • A crash can strand resources (e.g., a slot allocated but not enqueued) depending on where it occurs.

Normative operational guidance:

  • IndexBus v1 does not provide transactional cleanup or exactly-once guarantees.
  • If the system requires strong recovery semantics after crashes, treat the region as disposable and recreate it.

3.2 Consumer death

If a consumer stops permanently:

  • In SPSC/MPSC events: the queue may fill, causing producers to observe Error::Full.
  • In fanout: a slow/dead consumer may accumulate backlog in its consumer queue.

With router-enforced credits enabled:

  • A slow/dead consumer will eventually become ineligible (typically because depth reaches credit_max, and/or because its destination queue becomes full).
  • Depending on CreditPolicy, the router may:
    • skip ineligible consumers (broadcast), or drop the message when no consumer is eligible (work-queue)
    • park (work-queue): leave the message in place instead of dequeueing it
    • detach persistently-over-credit consumers (router-local)
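
The eligibility rule and the per-mode outcomes above can be sketched as follows. Apart from credit_max, every name here (ConsumerView, NoneEligible, Action, the function names) is hypothetical and chosen for illustration; the real CreditPolicy type may be shaped differently.

```rust
/// Router-side view of one consumer (assumed shape).
pub struct ConsumerView {
    pub depth: u64,          // messages queued but not yet consumed
    pub queue_capacity: u64, // size of the destination queue
    pub detached: bool,      // router-local flag, not persisted in the region
}

/// A consumer is ineligible once depth reaches credit_max or its queue is full.
pub fn is_eligible(c: &ConsumerView, credit_max: u64) -> bool {
    !c.detached && c.depth < credit_max && c.depth < c.queue_capacity
}

/// What a work-queue router does when no consumer is eligible.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NoneEligible {
    Drop,
    Park,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Deliver(usize),
    Drop,
    Park,
}

/// Work-queue mode: deliver to the first eligible consumer, else apply policy.
pub fn route_work_queue(cs: &[ConsumerView], credit_max: u64, policy: NoneEligible) -> Action {
    match cs.iter().position(|c| is_eligible(c, credit_max)) {
        Some(i) => Action::Deliver(i),
        None => match policy {
            NoneEligible::Drop => Action::Drop,
            NoneEligible::Park => Action::Park,
        },
    }
}

/// Broadcast mode: deliver to every eligible consumer and skip the rest.
pub fn broadcast_targets(cs: &[ConsumerView], credit_max: u64) -> Vec<usize> {
    cs.iter()
        .enumerate()
        .filter(|(_, c)| is_eligible(c, credit_max))
        .map(|(i, _)| i)
        .collect()
}
```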

Required operator action:

  • restart the consumer, or recreate/reset the region depending on the application’s policy.

3.3 Router death (fanout)

If the router is not running:

  • Producers may continue to publish into the producer→router queue until it fills.
  • Consumers will stop receiving new messages (their queues are not being filled).

Required operator action:

  • restart the router loop, or fail over to another router instance for that region.

4) Stalls (participants stop polling)

IndexBus is fundamentally a polling-based system at Tier 0.

  • If any participant stops polling, bounded queues will eventually fill.
  • When full, producers observe Error::Full and MUST apply a policy at the edges:
    • drop
    • backoff/spin
    • block (only if you are using wake/blocking capabilities and an adapter that supports it)
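
One way to apply such a policy at the edge is sketched below for the drop and backoff cases. The Error enum here is a local stand-in for the core error type, and FullPolicy, Outcome, and publish are hypothetical names. The send operation is passed in as a closure so the sketch does not depend on any particular queue API.

```rust
use std::thread;
use std::time::Duration;

/// Local stand-in for the core error; only the Full case matters here.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    Full,
}

/// Hypothetical edge policy covering the drop and backoff options.
pub enum FullPolicy {
    Drop,
    Backoff { max_retries: u32, pause: Duration },
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Sent,
    Dropped,
}

/// Try to publish `msg`; on Error::Full apply the configured policy.
pub fn publish<F>(mut try_send: F, msg: u64, policy: FullPolicy) -> Outcome
where
    F: FnMut(u64) -> Result<(), Error>,
{
    match policy {
        FullPolicy::Drop => match try_send(msg) {
            Ok(()) => Outcome::Sent,
            Err(Error::Full) => Outcome::Dropped,
        },
        FullPolicy::Backoff { max_retries, pause } => {
            // One initial attempt plus up to `max_retries` retries.
            for attempt in 0..=max_retries {
                if try_send(msg).is_ok() {
                    return Outcome::Sent;
                }
                if attempt < max_retries {
                    thread::sleep(pause);
                }
            }
            // Bounded backoff: give up instead of spinning forever.
            Outcome::Dropped
        }
    }
}
```

Bounding the retries keeps a permanently stalled consumer from turning into a permanently stalled producer; the caller can count Dropped outcomes and surface them as a metric.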

5) Restart / reattach rules

5.1 Local/in-process regions

  • If the region lives in the same address space and is still valid, handles can continue to operate.
  • If the owning memory is dropped/freed, all handles become invalid.

5.2 SHM regions

  • A process may re-open and re-attach to an existing SHM file mapping.
  • Participants MUST ensure:
    • successful validation
    • compatible version/capability bits
    • agreed-upon region type and consumer indexing

If you cannot trust the previous writer’s shutdown behavior (e.g., crash, forced kill), the safest v1 action is:

  • recreate the region and restart all participants.

Credit state notes:

  • The v1 credit model is depth-based, so the effective credit state is derived from queue head/tail and therefore survives router restarts.
  • Any router-local bookkeeping (e.g., CreditPolicy::Detach timing/detach flags) is not persisted in the region and will reset when the router restarts.
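
The reason depth-based credit survives a router restart can be seen in a small sketch: depth is a pure function of counters that live in the region. The function names and the use of free-running u64 counters are assumptions for illustration.

```rust
/// Depth of a queue given free-running head (consumer) and tail (producer)
/// counters. Wrapping subtraction keeps the result correct across overflow.
pub fn depth(head: u64, tail: u64) -> u64 {
    tail.wrapping_sub(head)
}

/// Credits remaining for a consumer, recomputed from region state alone.
/// A restarted router calling this sees the same answer as the old one did.
pub fn credits_left(head: u64, tail: u64, credit_max: u64) -> u64 {
    credit_max.saturating_sub(depth(head, tail))
}
```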

6) Envelope and decoding failures

  • indexbus-msg rejects malformed headers deterministically (bad magic, truncated, bad payload length, unknown flags).
  • Typed decoding failures are surfaced to the caller (codec decode error vs core recv error).

Normative rule:

  • Decode errors do not automatically poison the region; they indicate that the bytes in a slot do not match the expected schema/codec for that consumer.
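
Deterministic header rejection might look like the sketch below. The 10-byte layout (u32 magic, u16 flags, u32 payload length, all little-endian), the MAGIC value, and the KNOWN_FLAGS mask are invented for this example and are not the indexbus-msg wire format.

```rust
pub const MAGIC: u32 = 0x4942_5553; // hypothetical value
pub const KNOWN_FLAGS: u16 = 0b0000_0011; // hypothetical set of defined flag bits
pub const HEADER_LEN: usize = 10;

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    Truncated,
    BadMagic,
    UnknownFlags,
    BadPayloadLen,
}

/// Build an envelope in the assumed layout (used here to exercise `parse`).
pub fn encode(flags: u16, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(&MAGIC.to_le_bytes());
    out.extend_from_slice(&flags.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Validate the header and return the payload slice. Each malformed input
/// maps to exactly one error, and the check order is fixed.
pub fn parse(buf: &[u8]) -> Result<&[u8], HeaderError> {
    if buf.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    let magic = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if magic != MAGIC {
        return Err(HeaderError::BadMagic);
    }
    let flags = u16::from_le_bytes([buf[4], buf[5]]);
    if flags & !KNOWN_FLAGS != 0 {
        return Err(HeaderError::UnknownFlags);
    }
    let len = u32::from_le_bytes([buf[6], buf[7], buf[8], buf[9]]) as usize;
    if len > buf.len() - HEADER_LEN {
        return Err(HeaderError::BadPayloadLen);
    }
    Ok(&buf[HEADER_LEN..HEADER_LEN + len])
}
```

A consumer that gets one of these errors can skip or log the message and keep polling; nothing in the region needs to be reset.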

7) Required operator action summary (v1 production profile)

  • Validation failure: recreate region or upgrade/downgrade participants to match.
  • Router stopped: restart router.
  • Persistent Full: apply explicit policy (drop/backoff/block) or increase capacity via a new layout/config.
  • initialized stuck at 1: recreate the region (or reinitialize in place with exclusive access).