Parrhesia clustering and distributed fanout
This document describes:
- the current distributed fanout behavior implemented today, and
- a practical evolution path to a more production-grade clustered relay.
1) Current state (implemented today)
1.1 What exists right now
Parrhesia currently includes a lightweight multi-node live fanout path (untested!):
- `Parrhesia.Fanout.MultiNode` (`lib/parrhesia/fanout/multi_node.ex`)
  - GenServer that joins a `:pg` process group.
  - Receives locally-published events and forwards them to other group members.
  - Receives remote events and performs local fanout lookup.
- `Parrhesia.Web.Connection` (`lib/parrhesia/web/connection.ex`)
  - On successful ingest, after ACK scheduling, it performs:
    - local fanout (`fanout_event/1`), then
    - cross-node publish (`maybe_publish_multi_node/1`).
- `Parrhesia.Subscriptions.Supervisor` (`lib/parrhesia/subscriptions/supervisor.ex`)
  - Starts `Parrhesia.Fanout.MultiNode` unconditionally.
In other words: if BEAM nodes are connected, live events are fanned out cross-node.
1.2 What is not included yet
- No automatic cluster formation/discovery (no `libcluster`, DNS polling, gossip, etc.).
- No durable inter-node event transport.
- No replay/recovery of missed cross-node live events.
- No explicit per-node delivery ACK between relay nodes.
2) Current runtime behavior in detail
2.1 Local ingest flow and publish ordering
For an accepted event in `Parrhesia.Web.Connection`:
- The validate/policy/persist path runs.
- The client receives an `OK` reply.
- A post-ACK message triggers:
  - local fanout (`Index.candidate_subscription_keys/1` + send `{:fanout_event, ...}`),
  - multi-node publish (`MultiNode.publish/1`).
Important semantics:
- Regular persisted events: ACK implies DB persistence succeeded.
- Ephemeral events: ACK implies accepted by policy, but no DB durability.
- Cross-node fanout happens after ACK path is scheduled.
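The ordering above can be condensed into a small sketch. `fanout_event/1` and `maybe_publish_multi_node/1` are the names from this document; the module name and the function-passing style here are illustrative, not Parrhesia's actual code:

```elixir
defmodule PostAckSketch do
  # Condensed view of the post-ACK step: local fanout first, then the
  # best-effort cross-node publish. In Parrhesia this runs inside the
  # Parrhesia.Web.Connection process after the OK reply is scheduled.
  def after_ack(event, fanout_fun, publish_fun) do
    fanout_fun.(event)   # 1) local fanout lookup + sends to local subscribers
    publish_fun.(event)  # 2) MultiNode.publish/1-style cross-node broadcast
    :ok
  end
end
```

The point of the ordering is that the client's ACK never waits on remote nodes; cross-node delivery is strictly after (and independent of) the reply path.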
2.2 Multi-node transport mechanics
`Parrhesia.Fanout.MultiNode` uses `:pg` membership:
- On init:
  - ensures `:pg` is started,
  - joins the `Parrhesia.Fanout.MultiNode` group.
- On publish:
  - gets all group members,
  - excludes itself,
  - sends `{:remote_fanout_event, event}` to each member pid.
- On remote receive:
  - runs local subscription candidate narrowing via `Parrhesia.Subscriptions.Index`,
  - forwards matching candidates to local connection owners as `{:fanout_event, sub_id, event}`.
No republish on remote receive, so this path does not create fanout loops.
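A minimal, loop-free version of this bus can be sketched directly on `:pg`. The module name and the stubbed remote-receive handler are illustrative only; the message shapes (`{:publish, ...}`, `{:remote_fanout_event, ...}`) follow the document:

```elixir
defmodule MultiNodeSketch do
  use GenServer

  @group __MODULE__

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Best-effort cross-node publish of a locally-accepted event.
  def publish(event), do: GenServer.cast(__MODULE__, {:publish, event})

  @impl true
  def init(_opts) do
    # Ensure the default :pg scope is running, then join the shared group.
    case :pg.start_link() do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
    end

    :ok = :pg.join(@group, self())
    {:ok, %{}}
  end

  @impl true
  def handle_cast({:publish, event}, state) do
    # Send to every group member except ourselves.
    for pid <- :pg.get_members(@group), pid != self() do
      send(pid, {:remote_fanout_event, event})
    end

    {:noreply, state}
  end

  @impl true
  def handle_info({:remote_fanout_event, _event}, state) do
    # Real code would narrow candidates via Parrhesia.Subscriptions.Index
    # and forward {:fanout_event, sub_id, event} to local connection owners.
    # There is deliberately no re-publish here, which prevents fanout loops.
    {:noreply, state}
  end
end
```

In a real cluster, `:pg` membership spans connected BEAM nodes, so the same send loop reaches remote members transparently.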
2.3 Subscription index locality
The subscription index is local ETS state per node (Parrhesia.Subscriptions.Index).
- Each node only tracks subscriptions of its local websocket processes.
- Each node independently decides which local subscribers match a remote event.
- There is no global cross-node subscription registry.
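The locality property can be illustrated with a toy ETS index. Table name, the kind-only filter shape, and the function names are simplifications of `Parrhesia.Subscriptions.Index`, not its real API:

```elixir
defmodule LocalIndexSketch do
  # Minimal per-node ETS index in the spirit of
  # Parrhesia.Subscriptions.Index. Each node only holds rows for its
  # own local websocket processes; there is no cross-node registry.
  @table :local_subscriptions_sketch

  def init do
    :ets.new(@table, [:named_table, :public, :bag])
  end

  # Register a local subscription keyed by event kind.
  def register(sub_id, owner_pid, kind) do
    :ets.insert(@table, {kind, sub_id, owner_pid})
  end

  # Candidate narrowing for an incoming (local or remote) event.
  # Every node makes this decision independently against local state.
  def candidates(%{kind: kind}) do
    for {^kind, sub_id, pid} <- :ets.lookup(@table, kind), do: {sub_id, pid}
  end
end
```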
2.4 Delivery model and guarantees (current)
Current model is best-effort live propagation among connected nodes.
- If nodes are connected and healthy, remote live subscribers should receive events quickly.
- If there is a netsplit or temporary disconnection:
  - remote live subscribers may miss events,
  - persisted events can still be recovered via a normal `REQ`/history query,
  - ephemeral events are not recoverable.
2.5 Cluster preconditions
For cross-node fanout to work, operators must provide distributed BEAM connectivity:
- consistent Erlang cookie,
- named nodes (`--name` / `--sname`),
- network reachability for Erlang distribution ports,
- explicit node connections (or external discovery tooling).
Parrhesia currently does not automate these steps.
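For reference, the manual wiring looks roughly like the following (node names, hostnames, and the cookie value are placeholders):

```shell
# Terminal 1:
iex --sname relay_a --cookie parrhesia_secret -S mix

# Terminal 2:
iex --sname relay_b --cookie parrhesia_secret -S mix
```

Then, inside relay_b's IEx shell, connect explicitly and verify membership:

```elixir
Node.connect(:"relay_a@myhost")
Node.list()  # e.g. [:"relay_a@myhost"]
```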
3) Operational characteristics of current design
3.1 Performance shape
For each accepted event on one node:
- one local fanout lookup + local sends,
- one cluster publish that sends to `N - 1` remote bus members,
- on each remote node: one local fanout lookup + local sends.
So inter-node traffic scales roughly linearly with node count per event (full-cluster broadcast).
This is simple and low-latency for small-to-medium clusters, but can become expensive as node count grows.
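A back-of-envelope helper makes the amplification concrete: per event the remote cost is linear in node count, so the cluster-wide remote message rate grows quadratically when every node ingests at a similar rate. The numbers below are purely illustrative:

```elixir
defmodule BroadcastCostSketch do
  # Cost model for the full-broadcast bus: every accepted event on one
  # node produces node_count - 1 remote messages.
  def remote_sends_per_event(node_count), do: node_count - 1

  # Cluster-wide remote message rate assuming uniform per-node ingest.
  def cluster_remote_msgs_per_sec(node_count, events_per_node_per_sec) do
    node_count * events_per_node_per_sec * remote_sends_per_event(node_count)
  end
end
```

For example, 5 nodes each ingesting 100 events/s yields 5 × 100 × 4 = 2,000 remote messages/s across the cluster.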
3.2 Failure behavior
- Remote node down: send attempts to that member stop once membership updates; no replay.
- Netsplit: live propagation gap during split.
- Recovery: local clients can catch up via DB-backed queries (except ephemeral kinds).
3.3 Consistency expectations
- No global total-ordering guarantee for live delivery across nodes.
- Per-connection ordering is preserved by each connection process queue/drain behavior.
- Duplicate suppression for ingestion uses storage semantics (`duplicate_event`), but the transport itself is not exactly-once.
3.4 Observability today
Relevant metrics exist for fanout/queue pressure (see Parrhesia.Telemetry), e.g.:
- `parrhesia.fanout.duration.ms`
- `parrhesia.connection.outbound_queue.depth`
- `parrhesia.connection.outbound_queue.pressure`
- `parrhesia.connection.outbound_queue.overflow.count`
These are useful but do not yet fully separate local-vs-remote fanout pipeline stages.
4) Practical extension path to a fully-fledged clustered system
A realistic path is incremental. Suggested phases:
Phase A — hardened BEAM cluster control plane
- Add cluster discovery/formation (e.g. `libcluster`) with environment-specific topology:
  - Kubernetes DNS,
  - static nodes,
  - cloud VM discovery.
- Add clear node liveness/partition telemetry and alerts.
- Provide operator docs for cookie, node naming, and network requirements.
Outcome: simpler and safer cluster operations, same data plane semantics.
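A Kubernetes-flavored `libcluster` topology might look like the fragment below. The service name, topology key, and strategy choice are illustrative; nothing like this ships with Parrrhesia today:

```elixir
# config/runtime.exs (illustrative, not part of Parrhesia)
import Config

config :libcluster,
  topologies: [
    parrhesia: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "parrhesia-headless",
        application_name: "parrhesia"
      ]
    ]
  ]
```

Static-node and EPMD-based strategies slot into the same `topologies` shape, which is what makes the topology environment-specific while keeping application code unchanged.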
Phase B — resilient distributed fanout data plane
Introduce a durable fanout stream for persisted events.
Recommended pattern:
- On successful DB commit of event, append to a monotonic fanout log (or use DB sequence-based stream view).
- Each relay node runs a consumer with a stored cursor.
- On restart/partition recovery, node resumes from cursor and replays missed events.
- Local fanout remains same (subscription index + per-connection queues).
Semantics target:
- at-least-once node-to-node propagation,
- replay after downtime,
- idempotent handling keyed by event id.
Notes:
- Ephemeral events can remain best-effort (or have a separate short-lived transport), since no storage source exists for replay.
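The cursor-plus-idempotency loop from Phase B can be sketched against an in-memory log. A real implementation would read a DB-backed, monotonically-sequenced stream and persist the cursor durably; the module and shapes here are assumptions:

```elixir
defmodule FanoutCursorSketch do
  # Consume all log entries past `cursor`, deduplicating by event id so
  # that at-least-once transport still yields at-most-once local fanout.
  # Returns {new_cursor, seen_ids, events_to_fan_out}.
  def consume(log, cursor, seen_ids) do
    log
    |> Enum.filter(fn {seq, _event} -> seq > cursor end)
    |> Enum.reduce({cursor, seen_ids, []}, fn {seq, event}, {_c, seen, out} ->
      if MapSet.member?(seen, event.id) do
        # Duplicate delivery: advance the cursor, fan out nothing.
        {seq, seen, out}
      else
        {seq, MapSet.put(seen, event.id), [event | out]}
      end
    end)
    |> then(fn {new_cursor, seen, out} -> {new_cursor, seen, Enum.reverse(out)} end)
  end
end
```

After a restart or partition heal, calling `consume/3` with the stored cursor replays exactly the missed persisted events, which is the recovery behavior Phase B targets.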
Phase C — scale and efficiency improvements
As cluster size grows, avoid naive full broadcast where possible:
- Optional node-level subscription summaries (coarse bloom/bitset or keyed summaries) to reduce unnecessary remote sends.
- Shard fanout workers for CPU locality and mailbox control.
- Batch remote delivery payloads.
- Separate traffic classes (e.g. Marmot-heavy streams vs generic) with independent queues.
Outcome: higher throughput per node and lower inter-node amplification.
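The node-level summary idea can be shown with a coarse kind-set; a production version would likely use a bloom filter plus periodic summary exchange, and the names below are invented for illustration:

```elixir
defmodule KindSummarySketch do
  # Coarse per-node subscription summary: the set of kinds any local
  # subscriber cares about. A publisher consults a peer's summary and
  # skips remote sends that cannot match anything there.
  def summarize(subscriptions), do: MapSet.new(subscriptions, & &1.kind)

  # False positives are acceptable (the peer just ignores the event);
  # false negatives are not, so summaries must be refreshed on change.
  def should_send?(summary, %{kind: kind}), do: MapSet.member?(summary, kind)
end
```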
Phase D — stronger observability and SLOs
Add explicit distributed pipeline metrics:
- publish enqueue/dequeue latency,
- cross-node delivery lag (commit -> remote fanout enqueue),
- replay backlog depth,
- per-node dropped/expired transport messages,
- partition detection counters.
Define cluster SLO examples:
- p95 commit->remote-live enqueue under nominal load,
- max replay catch-up time after node restart,
- bounded message loss for best-effort channels.
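Two of the metrics above (cross-node delivery lag and its p95) are easy to compute once both sides stamp monotonic times; the module below is a dependency-free sketch, with emission to a telemetry backend left out:

```elixir
defmodule LagSketch do
  # Cross-node delivery lag: commit timestamp on the origin node vs.
  # remote fanout enqueue timestamp, both in monotonic milliseconds.
  # Clamped at zero to tolerate clock-source quirks.
  def delivery_lag_ms(committed_at_ms, remote_enqueued_at_ms) do
    max(remote_enqueued_at_ms - committed_at_ms, 0)
  end

  # Nearest-rank p95 over a non-empty list of lag samples.
  def p95(lags) when lags != [] do
    sorted = Enum.sort(lags)
    idx = min(ceil(0.95 * length(sorted)) - 1, length(sorted) - 1)
    Enum.at(sorted, idx)
  end
end
```

An SLO check then reduces to comparing `p95(window_samples)` against the target under nominal load.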
5) How a fully-fledged system would behave in practice
With Phases A-D implemented, expected behavior:
- Normal operation:
- low-latency local fanout,
- remote nodes receive events via stream consumers quickly,
- consistent operational visibility of end-to-end lag.
- Node restart:
- node reconnects and replays from stored cursor,
- local subscribers begin receiving new + missed persisted events.
- Transient partition:
- live best-effort path may degrade,
- persisted events converge after partition heals via replay.
- High fanout bursts:
- batching + sharding keeps queue pressure bounded,
- overflow policies remain connection-local and measurable.
This approach gives a good trade-off between the low latency expected of a Nostr relay and distributed robustness, without requiring strict exactly-once semantics.
6) Current status summary
Today, Parrhesia already supports lightweight distributed live fanout when BEAM nodes are connected.
It is intentionally simple and fast for smaller clusters, and provides a solid base for a more durable, observable cluster architecture as relay scale and availability requirements grow.