# Parrhesia clustering and distributed fanout This document describes: 1. the **current** distributed fanout behavior implemented today, and 2. a practical evolution path to a more production-grade clustered relay. --- ## 1) Current state (implemented today) ### 1.1 What exists right now Parrhesia currently includes a lightweight multi-node live fanout path (untested!): - `Parrhesia.Fanout.MultiNode` (`lib/parrhesia/fanout/multi_node.ex`) - GenServer that joins a `:pg` process group. - Receives locally-published events and forwards them to other group members. - Receives remote events and performs local fanout lookup. - `Parrhesia.Web.Connection` (`lib/parrhesia/web/connection.ex`) - On successful ingest, after ACK scheduling, it does: 1. local fanout (`fanout_event/1`), then 2. cross-node publish (`maybe_publish_multi_node/1`). - `Parrhesia.Subscriptions.Supervisor` (`lib/parrhesia/subscriptions/supervisor.ex`) - Starts `Parrhesia.Fanout.MultiNode` unconditionally. In other words: **if BEAM nodes are connected, live events are fanned out cross-node**. ### 1.2 What is not included yet - No automatic cluster formation/discovery (no `libcluster`, DNS polling, gossip, etc.). - No durable inter-node event transport. - No replay/recovery of missed cross-node live events. - No explicit per-node delivery ACK between relay nodes. --- ## 2) Current runtime behavior in detail ### 2.1 Local ingest flow and publish ordering For an accepted event in `Parrhesia.Web.Connection`: 1. validate/policy/persist path runs. 2. Client receives `OK` reply. 3. A post-ACK message triggers: - local fanout (`Index.candidate_subscription_keys/1` + send `{:fanout_event, ...}`), - multi-node publish (`MultiNode.publish/1`). Important semantics: - Regular persisted events: ACK implies DB persistence succeeded. - Ephemeral events: ACK implies accepted by policy, but no DB durability. - Cross-node fanout happens **after** ACK path is scheduled. ### 2.2 Multi-node transport mechanics `Parrhesia.Fanout.MultiNode` uses `:pg` membership: - On init: - ensures `:pg` is started, - joins group `Parrhesia.Fanout.MultiNode`. - On publish: - gets all group members, - excludes itself, - sends `{:remote_fanout_event, event}` to each member pid. - On remote receive: - runs local subscription candidate narrowing via `Parrhesia.Subscriptions.Index`, - forwards matching candidates to local connection owners as `{:fanout_event, sub_id, event}`. No republish on remote receive, so this path does not create fanout loops. ### 2.3 Subscription index locality The subscription index is local ETS state per node (`Parrhesia.Subscriptions.Index`). - Each node only tracks subscriptions of its local websocket processes. - Each node independently decides which local subscribers match a remote event. - There is no global cross-node subscription registry. ### 2.4 Delivery model and guarantees (current) Current model is **best-effort live propagation** among connected nodes. - If nodes are connected and healthy, remote live subscribers should receive events quickly. - If there is a netsplit or temporary disconnection: - remote live subscribers may miss events, - persisted events can still be recovered by normal `REQ`/history query, - ephemeral events are not recoverable. ### 2.5 Cluster preconditions For cross-node fanout to work, operators must provide distributed BEAM connectivity: - consistent Erlang cookie, - named nodes (`--name`/`--sname`), - network reachability for Erlang distribution ports, - explicit node connections (or external discovery tooling). Parrhesia currently does not automate these steps. --- ## 3) Operational characteristics of current design ### 3.1 Performance shape For each accepted event on one node: - one local fanout lookup + local sends, - one cluster publish that sends to `N - 1` remote bus members, - on each remote node: one local fanout lookup + local sends. So inter-node traffic scales roughly linearly with node count per event (full-cluster broadcast). This is simple and low-latency for small-to-medium clusters, but can become expensive as node count grows. ### 3.2 Failure behavior - Remote node down: send attempts to that member stop once membership updates; no replay. - Netsplit: live propagation gap during split. - Recovery: local clients can catch up via DB-backed queries (except ephemeral kinds). ### 3.3 Consistency expectations - No global total-ordering guarantee for live delivery across nodes. - Per-connection ordering is preserved by each connection process queue/drain behavior. - Duplicate suppression for ingestion uses storage semantics (`duplicate_event`), but transport itself is not exactly-once. ### 3.4 Observability today Relevant metrics exist for fanout/queue pressure (see `Parrhesia.Telemetry`), e.g.: - `parrhesia.fanout.duration.ms` - `parrhesia.connection.outbound_queue.depth` - `parrhesia.connection.outbound_queue.pressure` - `parrhesia.connection.outbound_queue.overflow.count` These are useful but do not yet fully separate local-vs-remote fanout pipeline stages. --- ## 4) Practical extension path to a fully-fledged clustered system A realistic path is incremental. Suggested phases: ### Phase A — hardened BEAM cluster control plane 1. Add cluster discovery/formation (e.g. `libcluster`) with environment-specific topology: - Kubernetes DNS, - static nodes, - cloud VM discovery. 2. Add clear node liveness/partition telemetry and alerts. 3. Provide operator docs for cookie, node naming, and network requirements. Outcome: simpler and safer cluster operations, same data plane semantics. ### Phase B — resilient distributed fanout data plane Introduce a durable fanout stream for persisted events. Recommended pattern: 1. On successful DB commit of event, append to a monotonic fanout log (or use DB sequence-based stream view). 2. Each relay node runs a consumer with a stored cursor. 3. On restart/partition recovery, node resumes from cursor and replays missed events. 4. Local fanout remains same (subscription index + per-connection queues). Semantics target: - **at-least-once** node-to-node propagation, - replay after downtime, - idempotent handling keyed by event id. Notes: - Ephemeral events can remain best-effort (or have a separate short-lived transport), since no storage source exists for replay. ### Phase C — scale and efficiency improvements As cluster size grows, avoid naive full broadcast where possible: 1. Optional node-level subscription summaries (coarse bloom/bitset or keyed summaries) to reduce unnecessary remote sends. 2. Shard fanout workers for CPU locality and mailbox control. 3. Batch remote delivery payloads. 4. Separate traffic classes (e.g. Marmot-heavy streams vs generic) with independent queues. Outcome: higher throughput per node and lower inter-node amplification. ### Phase D — stronger observability and SLOs Add explicit distributed pipeline metrics: - publish enqueue/dequeue latency, - cross-node delivery lag (commit -> remote fanout enqueue), - replay backlog depth, - per-node dropped/expired transport messages, - partition detection counters. Define cluster SLO examples: - p95 commit->remote-live enqueue under nominal load, - max replay catch-up time after node restart, - bounded message loss for best-effort channels. --- ## 5) How a fully-fledged system would behave in practice With Phases A-D implemented, expected behavior: - **Normal operation:** - low-latency local fanout, - remote nodes receive events via stream consumers quickly, - consistent operational visibility of end-to-end lag. - **Node restart:** - node reconnects and replays from stored cursor, - local subscribers begin receiving new + missed persisted events. - **Transient partition:** - live best-effort path may degrade, - persisted events converge after partition heals via replay. - **High fanout bursts:** - batching + sharding keeps queue pressure bounded, - overflow policies remain connection-local and measurable. This approach gives a good trade-off between Nostr relay latency and distributed robustness without requiring strict exactly-once semantics. --- ## 6) Current status summary Today, Parrhesia already supports **lightweight distributed live fanout** when BEAM nodes are connected. It is intentionally simple and fast for smaller clusters, and provides a solid base for a more durable, observable cluster architecture as relay scale and availability requirements grow.