Parrhesia clustering and distributed fanout
This document describes:
- the current distributed fanout behavior implemented today, and
- a practical evolution path to a more production-grade clustered relay.
1) Current state (implemented today)
1.1 What exists right now
Parrhesia currently includes a lightweight multi-node live fanout path (untested!):
- `Parrhesia.Fanout.MultiNode` (`lib/parrhesia/fanout/multi_node.ex`)
  - GenServer that joins a `:pg` process group.
  - Receives locally-published events and forwards them to other group members.
  - Receives remote events and performs local fanout lookup.
- `Parrhesia.Web.Connection` (`lib/parrhesia/web/connection.ex`)
  - On successful ingest, after ACK scheduling, it performs:
    - local fanout (`fanout_event/1`), then
    - cross-node publish (`maybe_publish_multi_node/1`).
- `Parrhesia.Subscriptions.Supervisor` (`lib/parrhesia/subscriptions/supervisor.ex`)
  - Starts `Parrhesia.Fanout.MultiNode` unconditionally.
In other words: if BEAM nodes are connected, live events are fanned out cross-node.
1.2 What is not included yet
- No automatic cluster formation/discovery (no `libcluster`, DNS polling, gossip, etc.).
- No durable inter-node event transport.
- No replay/recovery of missed cross-node live events.
- No explicit per-node delivery ACK between relay nodes.
2) Current runtime behavior in detail
2.1 Local ingest flow and publish ordering
For an accepted event in `Parrhesia.Web.Connection`:
- The validate/policy/persist path runs.
- The client receives an `OK` reply.
- A post-ACK message triggers:
  - local fanout (`Index.candidate_subscription_keys/1` + send `{:fanout_event, ...}`),
  - multi-node publish (`MultiNode.publish/1`).
Important semantics:
- Regular persisted events: ACK implies DB persistence succeeded.
- Ephemeral events: ACK implies accepted by policy, but no DB durability.
- Cross-node fanout happens after ACK path is scheduled.
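The ordering above can be condensed into a small sketch. `fanout_event/1` and `maybe_publish_multi_node/1` are the names from this document; the module name and the function-passing style here are illustrative, not Parrhesia's actual code:

```elixir
defmodule PostAckSketch do
  # Condensed view of the post-ACK step: local fanout first, then the
  # best-effort cross-node publish. In Parrhesia this runs inside the
  # Parrhesia.Web.Connection process after the OK reply is scheduled.
  def after_ack(event, fanout_fun, publish_fun) do
    fanout_fun.(event)   # 1) local fanout lookup + sends to local subscribers
    publish_fun.(event)  # 2) MultiNode.publish/1-style cross-node broadcast
    :ok
  end
end
```

The point of the ordering is that the client's ACK never waits on remote nodes; cross-node delivery is strictly after (and independent of) the reply path.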
2.2 Multi-node transport mechanics
`Parrhesia.Fanout.MultiNode` uses `:pg` membership:
- On init:
  - ensures `:pg` is started,
  - joins the `Parrhesia.Fanout.MultiNode` group.
- On publish:
  - gets all group members,
  - excludes itself,
  - sends `{:remote_fanout_event, event}` to each member pid.
- On remote receive:
  - runs local subscription candidate narrowing via `Parrhesia.Subscriptions.Index`,
  - forwards matching candidates to local connection owners as `{:fanout_event, sub_id, event}`.
No republish on remote receive, so this path does not create fanout loops.
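A minimal, loop-free version of this bus can be sketched directly on `:pg`. The module name and the stubbed remote-receive handler are illustrative only; the message shapes (`{:publish, ...}`, `{:remote_fanout_event, ...}`) follow the document:

```elixir
defmodule MultiNodeSketch do
  use GenServer

  @group __MODULE__

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Best-effort cross-node publish of a locally-accepted event.
  def publish(event), do: GenServer.cast(__MODULE__, {:publish, event})

  @impl true
  def init(_opts) do
    # Ensure the default :pg scope is running, then join the shared group.
    case :pg.start_link() do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
    end

    :ok = :pg.join(@group, self())
    {:ok, %{}}
  end

  @impl true
  def handle_cast({:publish, event}, state) do
    # Send to every group member except ourselves.
    for pid <- :pg.get_members(@group), pid != self() do
      send(pid, {:remote_fanout_event, event})
    end

    {:noreply, state}
  end

  @impl true
  def handle_info({:remote_fanout_event, _event}, state) do
    # Real code would narrow candidates via Parrhesia.Subscriptions.Index
    # and forward {:fanout_event, sub_id, event} to local connection owners.
    # There is deliberately no re-publish here, which prevents fanout loops.
    {:noreply, state}
  end
end
```

In a real cluster, `:pg` membership spans connected BEAM nodes, so the same send loop reaches remote members transparently.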
2.3 Subscription index locality
The subscription index is local ETS state per node (Parrhesia.Subscriptions.Index).
- Each node only tracks subscriptions of its local websocket processes.
- Each node independently decides which local subscribers match a remote event.
- There is no global cross-node subscription registry.
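The locality property can be illustrated with a toy ETS index. Table name, the kind-only filter shape, and the function names are simplifications of `Parrhesia.Subscriptions.Index`, not its real API:

```elixir
defmodule LocalIndexSketch do
  # Minimal per-node ETS index in the spirit of
  # Parrhesia.Subscriptions.Index. Each node only holds rows for its
  # own local websocket processes; there is no cross-node registry.
  @table :local_subscriptions_sketch

  def init do
    :ets.new(@table, [:named_table, :public, :bag])
  end

  # Register a local subscription keyed by event kind.
  def register(sub_id, owner_pid, kind) do
    :ets.insert(@table, {kind, sub_id, owner_pid})
  end

  # Candidate narrowing for an incoming (local or remote) event.
  # Every node makes this decision independently against local state.
  def candidates(%{kind: kind}) do
    for {^kind, sub_id, pid} <- :ets.lookup(@table, kind), do: {sub_id, pid}
  end
end
```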
2.4 Delivery model and guarantees (current)
Current model is best-effort live propagation among connected nodes.
- If nodes are connected and healthy, remote live subscribers should receive events quickly.
- If there is a netsplit or temporary disconnection:
  - remote live subscribers may miss events,
  - persisted events can still be recovered via a normal `REQ`/history query,
  - ephemeral events are not recoverable.
2.5 Cluster preconditions
For cross-node fanout to work, operators must provide distributed BEAM connectivity:
- consistent Erlang cookie,
- named nodes (`--name` / `--sname`),
- network reachability for Erlang distribution ports,
- explicit node connections (or external discovery tooling).
Parrhesia currently does not automate these steps.
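For reference, the manual wiring looks roughly like the following (node names, hostnames, and the cookie value are placeholders):

```shell
# Terminal 1:
iex --sname relay_a --cookie parrhesia_secret -S mix

# Terminal 2:
iex --sname relay_b --cookie parrhesia_secret -S mix
```

Then, inside relay_b's IEx shell, connect explicitly and verify membership:

```elixir
Node.connect(:"relay_a@myhost")
Node.list()  # e.g. [:"relay_a@myhost"]
```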
3) Operational characteristics of current design
3.1 Performance shape
For each accepted event on one node:
- one local fanout lookup + local sends,
- one cluster publish that sends to `N - 1` remote bus members,
- on each remote node: one local fanout lookup + local sends.
So inter-node traffic scales roughly linearly with node count per event (full-cluster broadcast).
This is simple and low-latency for small-to-medium clusters, but can become expensive as node count grows.
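A back-of-envelope helper makes the amplification concrete: per event the remote cost is linear in node count, so the cluster-wide remote message rate grows quadratically when every node ingests at a similar rate. The numbers below are purely illustrative:

```elixir
defmodule BroadcastCostSketch do
  # Cost model for the full-broadcast bus: every accepted event on one
  # node produces node_count - 1 remote messages.
  def remote_sends_per_event(node_count), do: node_count - 1

  # Cluster-wide remote message rate assuming uniform per-node ingest.
  def cluster_remote_msgs_per_sec(node_count, events_per_node_per_sec) do
    node_count * events_per_node_per_sec * remote_sends_per_event(node_count)
  end
end
```

For example, 5 nodes each ingesting 100 events/s yields 5 × 100 × 4 = 2,000 remote messages/s across the cluster.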
3.2 Failure behavior
- Remote node down: send attempts to that member stop once membership updates; no replay.
- Netsplit: live propagation gap during split.
- Recovery: local clients can catch up via DB-backed queries (except ephemeral kinds).
3.3 Consistency expectations
- No global total-ordering guarantee for live delivery across nodes.
- Per-connection ordering is preserved by each connection process queue/drain behavior.
- Duplicate suppression for ingestion uses storage semantics (`duplicate_event`), but the transport itself is not exactly-once.
3.4 Observability today
Relevant metrics exist for fanout/queue pressure (see Parrhesia.Telemetry), e.g.:
- `parrhesia.fanout.duration.ms`
- `parrhesia.connection.outbound_queue.depth`
- `parrhesia.connection.outbound_queue.pressure`
- `parrhesia.connection.outbound_queue.overflow.count`
These are useful but do not yet fully separate local-vs-remote fanout pipeline stages.
4) Practical extension path to a fully-fledged clustered system
A realistic path is incremental. Suggested phases:
Phase A — hardened BEAM cluster control plane
- Add cluster discovery/formation (e.g. `libcluster`) with environment-specific topology:
  - Kubernetes DNS,
  - static nodes,
  - cloud VM discovery.
- Add clear node liveness/partition telemetry and alerts.
- Provide operator docs for cookie, node naming, and network requirements.
Outcome: simpler and safer cluster operations, same data plane semantics.
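A Kubernetes-flavored `libcluster` topology might look like the fragment below. The service name, topology key, and strategy choice are illustrative; nothing like this ships with Parrrhesia today:

```elixir
# config/runtime.exs (illustrative, not part of Parrhesia)
import Config

config :libcluster,
  topologies: [
    parrhesia: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "parrhesia-headless",
        application_name: "parrhesia"
      ]
    ]
  ]
```

Static-node and EPMD-based strategies slot into the same `topologies` shape, which is what makes the topology environment-specific while keeping application code unchanged.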
Phase B — resilient distributed fanout data plane
Introduce a durable fanout stream for persisted events.
Recommended pattern:
- On successful DB commit of event, append to a monotonic fanout log (or use DB sequence-based stream view).
- Each relay node runs a consumer with a stored cursor.
- On restart/partition recovery, node resumes from cursor and replays missed events.
- Local fanout remains same (subscription index + per-connection queues).
Semantics target:
- at-least-once node-to-node propagation,
- replay after downtime,
- idempotent handling keyed by event id.
Notes:
- Ephemeral events can remain best-effort (or have a separate short-lived transport), since no storage source exists for replay.
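The cursor-plus-idempotency loop from Phase B can be sketched against an in-memory log. A real implementation would read a DB-backed, monotonically-sequenced stream and persist the cursor durably; the module and shapes here are assumptions:

```elixir
defmodule FanoutCursorSketch do
  # Consume all log entries past `cursor`, deduplicating by event id so
  # that at-least-once transport still yields at-most-once local fanout.
  # Returns {new_cursor, seen_ids, events_to_fan_out}.
  def consume(log, cursor, seen_ids) do
    log
    |> Enum.filter(fn {seq, _event} -> seq > cursor end)
    |> Enum.reduce({cursor, seen_ids, []}, fn {seq, event}, {_c, seen, out} ->
      if MapSet.member?(seen, event.id) do
        # Duplicate delivery: advance the cursor, fan out nothing.
        {seq, seen, out}
      else
        {seq, MapSet.put(seen, event.id), [event | out]}
      end
    end)
    |> then(fn {new_cursor, seen, out} -> {new_cursor, seen, Enum.reverse(out)} end)
  end
end
```

After a restart or partition heal, calling `consume/3` with the stored cursor replays exactly the missed persisted events, which is the recovery behavior Phase B targets.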
Phase C — scale and efficiency improvements
As cluster size grows, avoid naive full broadcast where possible:
- Optional node-level subscription summaries (coarse bloom/bitset or keyed summaries) to reduce unnecessary remote sends.
- Shard fanout workers for CPU locality and mailbox control.
- Batch remote delivery payloads.
- Separate traffic classes (e.g. Marmot-heavy streams vs generic) with independent queues.
Outcome: higher throughput per node and lower inter-node amplification.
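The node-level summary idea can be shown with a coarse kind-set; a production version would likely use a bloom filter plus periodic summary exchange, and the names below are invented for illustration:

```elixir
defmodule KindSummarySketch do
  # Coarse per-node subscription summary: the set of kinds any local
  # subscriber cares about. A publisher consults a peer's summary and
  # skips remote sends that cannot match anything there.
  def summarize(subscriptions), do: MapSet.new(subscriptions, & &1.kind)

  # False positives are acceptable (the peer just ignores the event);
  # false negatives are not, so summaries must be refreshed on change.
  def should_send?(summary, %{kind: kind}), do: MapSet.member?(summary, kind)
end
```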
Phase D — stronger observability and SLOs
Add explicit distributed pipeline metrics:
- publish enqueue/dequeue latency,
- cross-node delivery lag (commit -> remote fanout enqueue),
- replay backlog depth,
- per-node dropped/expired transport messages,
- partition detection counters.
Define cluster SLO examples:
- p95 commit->remote-live enqueue under nominal load,
- max replay catch-up time after node restart,
- bounded message loss for best-effort channels.
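Two of the metrics above (cross-node delivery lag and its p95) are easy to compute once both sides stamp monotonic times; the module below is a dependency-free sketch, with emission to a telemetry backend left out:

```elixir
defmodule LagSketch do
  # Cross-node delivery lag: commit timestamp on the origin node vs.
  # remote fanout enqueue timestamp, both in monotonic milliseconds.
  # Clamped at zero to tolerate clock-source quirks.
  def delivery_lag_ms(committed_at_ms, remote_enqueued_at_ms) do
    max(remote_enqueued_at_ms - committed_at_ms, 0)
  end

  # Nearest-rank p95 over a non-empty list of lag samples.
  def p95(lags) when lags != [] do
    sorted = Enum.sort(lags)
    idx = min(ceil(0.95 * length(sorted)) - 1, length(sorted) - 1)
    Enum.at(sorted, idx)
  end
end
```

An SLO check then reduces to comparing `p95(window_samples)` against the target under nominal load.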
5) How a fully-fledged system would behave in practice
With Phases A-D implemented, expected behavior:
- Normal operation:
- low-latency local fanout,
- remote nodes receive events via stream consumers quickly,
- consistent operational visibility of end-to-end lag.
- Node restart:
- node reconnects and replays from stored cursor,
- local subscribers begin receiving new + missed persisted events.
- Transient partition:
- live best-effort path may degrade,
- persisted events converge after partition heals via replay.
- High fanout bursts:
- batching + sharding keeps queue pressure bounded,
- overflow policies remain connection-local and measurable.
This approach gives a good trade-off between the low latency expected of a Nostr relay and distributed robustness, without requiring strict exactly-once semantics.
6) Current status summary
Today, Parrhesia already supports lightweight distributed live fanout when BEAM nodes are connected.
It is intentionally simple and fast for smaller clusters, and provides a solid base for a more durable, observable cluster architecture as relay scale and availability requirements grow.