# Parrhesia Nostr Relay Architecture
## 1) Goals
Build a **robust, high-performance Nostr relay** in Elixir/OTP with PostgreSQL as the first storage adapter, keeping a strict storage boundary so the backend can be swapped later.
Primary targets:
- Broad relay feature support (core + modern relay-facing NIPs)
- Strong correctness around NIP-01 semantics
- Clear OTP supervision and failure isolation
- High fanout throughput and bounded resource usage
- Storage abstraction via behavior-driven ports/adapters
- Full test suite (unit, integration, conformance, perf, fault-injection)
- Support for experimental MLS flow (NIP-EE), behind feature flags
## 2) NIP support scope
### Mandatory baseline
- NIP-01 (includes behavior moved from NIP-12/NIP-16/NIP-20/NIP-33)
- NIP-11 (relay info document)
### Relay-facing features to include
- NIP-09 (deletion requests)
- NIP-13 (PoW gating)
- NIP-17 + NIP-44 + NIP-59 (private DMs / gift wraps)
- NIP-40 (expiration)
- NIP-42 (AUTH)
- NIP-43 (relay membership requests/metadata)
- NIP-45 (COUNT, optional HLL)
- NIP-50 (search)
- NIP-62 (request to vanish)
- NIP-66 (relay discovery events; store/serve as normal events)
- NIP-70 (protected events)
- NIP-77 (negentropy sync)
- NIP-86 + NIP-98 (relay management API auth)
### Experimental MLS
- NIP-EE (unrecommended/upstream-superseded, but requested):
  - kind `443` KeyPackage events
  - kind `445` group events (policy-controlled retention/ephemeral treatment)
  - kind `10051` keypackage relay lists
  - interop with wrapped delivery (`1059`) and auth/privacy policies
## 3) System architecture (high level)
```text
WS/HTTP Edge (Bandit/Plug)
  -> Protocol Decoder/Encoder
  -> Command Router (EVENT/REQ/CLOSE/AUTH/COUNT/NEG-*)
  -> Policy Pipeline (validation, auth, ACL, PoW, NIP-70)
  -> Event Service / Query Service
      -> Storage Port (behavior)
          -> Postgres Adapter (Ecto)
  -> Subscription Index (ETS)
  -> Fanout Dispatcher
  -> Telemetry + Metrics + Tracing
```
## 4) OTP supervision design
`Parrhesia.Application` children (top-level):
1. `Parrhesia.Telemetry`: metric definitions/reporters
2. `Parrhesia.Config`: runtime config cache (ETS-backed)
3. `Parrhesia.Storage.Supervisor`: adapter processes (`Repo`, pools)
4. `Parrhesia.Subscriptions.Supervisor`: subscription index + fanout workers
5. `Parrhesia.Auth.Supervisor`: AUTH challenge/session tracking
6. `Parrhesia.Policy.Supervisor`: rate limiters / ACL caches
7. `Parrhesia.Web.Endpoint`: WS + HTTP ingress
8. `Parrhesia.Tasks.Supervisor`: background jobs (expiry purge, maintenance)
Failure model:
- Connection failures are isolated per socket process.
- Storage outages degrade gracefully: affected commands are answered with a rejecting `OK` or a `CLOSED` message carrying the machine-readable `error:` prefix defined in NIP-01.
- Non-critical workers are `:transient`; core infra is `:permanent`.
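
The restart rule above can be sketched with stand-in children. This is a minimal illustration, not the relay's actual tree: `SupervisionSketch`, the child ids, the use of `Agent` and `Task.Supervisor` as placeholders, and the `:one_for_one` strategy are all assumptions made for the example.

```elixir
defmodule SupervisionSketch do
  @moduledoc "Illustrative only: stand-in children, not the relay's real modules."

  # Agent stands in for core infrastructure (e.g. a config cache) and
  # Task.Supervisor for a non-critical background worker (assumed mapping).
  def children do
    [
      # Core infra: always restarted, whatever the exit reason.
      Supervisor.child_spec({Agent, fn -> %{} end}, id: :config_cache, restart: :permanent),
      # Non-critical: restarted only after an abnormal exit.
      Supervisor.child_spec({Task.Supervisor, []}, id: :background_tasks, restart: :transient)
    ]
  end

  def start_link do
    # :one_for_one (assumed) keeps one child's crash from restarting siblings.
    Supervisor.start_link(children(), strategy: :one_for_one)
  end
end
```

With `:transient`, a worker that exits with `:normal` or `:shutdown` stays down, which suits jobs that are allowed to finish.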
## 5) Core runtime components
### 5.1 Connection process
Per websocket connection:
- Parse frames, enforce max frame/message limits
- Maintain authenticated pubkeys (NIP-42)
- Track active subscriptions (`sub_id` scoped to connection)
- Handle backpressure (bounded outbound queue + drop/close strategy)
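
One way to picture the bounded outbound queue is a small wrapper around Erlang's `:queue` that refuses to grow past a limit. The module name `OutboundQueueSketch`, the `:overflow` error, and the choice to signal the caller (who may then drop the frame or close the socket) are assumptions for this sketch.

```elixir
defmodule OutboundQueueSketch do
  @moduledoc "Bounded per-connection outbound queue (illustrative sketch)."

  defstruct queue: :queue.new(), size: 0, max: 256

  def new(max) when is_integer(max) and max > 0, do: %__MODULE__{max: max}

  # Full queue: report overflow instead of growing without bound. The caller
  # decides whether to drop the frame or close the connection.
  def push(%__MODULE__{size: size, max: max}, _frame) when size >= max,
    do: {:error, :overflow}

  def push(%__MODULE__{queue: q, size: size} = state, frame) do
    {:ok, %{state | queue: :queue.in(frame, q), size: size + 1}}
  end

  # Dequeue the oldest frame, or return :empty when nothing is pending.
  def pop(%__MODULE__{queue: q, size: size} = state) do
    case :queue.out(q) do
      {{:value, frame}, rest} -> {:ok, frame, %{state | queue: rest, size: size - 1}}
      {:empty, _} -> :empty
    end
  end
end
```

Tracking `size` alongside the queue avoids calling `:queue.len/1`, which is linear in the queue length.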
### 5.2 Command router
Dispatches:
- `EVENT` -> ingest pipeline
- `REQ` -> initial DB query + live subscription
- `CLOSE` -> unsubscribe
- `AUTH` -> challenge validation, session update
- `COUNT` -> aggregate path
- `NEG-OPEN`/`NEG-MSG`/`NEG-CLOSE` -> negentropy session engine
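
The dispatch table above maps naturally onto function-head pattern matching over the JSON-decoded client array. In this sketch the module name `RouterSketch` and the returned command tuples are hypothetical; only the message shapes follow NIP-01, NIP-42, NIP-45, and NIP-77.

```elixir
defmodule RouterSketch do
  @moduledoc "Maps decoded client messages to internal commands (illustrative)."

  # Input is the already JSON-decoded client array; the returned tuples are
  # hypothetical internal commands, not an API defined by this document.
  def route(["EVENT", %{} = event]), do: {:ingest, event}

  def route(["REQ", sub_id | filters]) when is_binary(sub_id) and filters != [],
    do: {:subscribe, sub_id, filters}

  def route(["CLOSE", sub_id]) when is_binary(sub_id), do: {:unsubscribe, sub_id}
  def route(["AUTH", %{} = event]), do: {:auth, event}

  def route(["COUNT", sub_id | filters]) when is_binary(sub_id) and filters != [],
    do: {:count, sub_id, filters}

  def route(["NEG-OPEN", sub_id, %{} = filter, msg]) when is_binary(msg),
    do: {:neg_open, sub_id, filter, msg}

  def route(["NEG-MSG", sub_id, msg]) when is_binary(msg), do: {:neg_msg, sub_id, msg}
  def route(["NEG-CLOSE", sub_id]), do: {:neg_close, sub_id}

  # Anything else (unknown verb, wrong arity) is rejected in one place.
  def route(_other), do: {:error, :unsupported_message}
end
```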
### 5.3 Event ingest pipeline
Ordered stages:
1. Decode + schema checks
2. `id` recomputation and signature verification
3. NIP semantic checks (timestamps, tag forms, size limits)
4. Policy checks (banlists, kind allowlists, auth-required, NIP-70, PoW)
5. Storage write (including ephemeral events with short TTL retention)
6. Live fanout to matching subscriptions
7. Return canonical `OK` response, with a machine-readable prefix when needed, **only after durable DB commit succeeds**
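
The ordered stages can be sketched as a list of functions that is walked until the first failure, with the `OK` message built only once every stage, including the commit, has returned. `IngestSketch`, the stage signature (`event -> :ok | {:error, message}`), and the example `check_id/1` stage are assumptions; real stages would do the verification and storage work described above.

```elixir
defmodule IngestSketch do
  @moduledoc "Ordered ingest stages ending in a NIP-01 OK message (illustrative)."

  # Each stage returns :ok or {:error, prefixed_message}. The first failure
  # halts the pipeline, so later stages (such as the commit) never run.
  def run(event, stages) do
    result =
      Enum.reduce_while(stages, :ok, fn stage, :ok ->
        case stage.(event) do
          :ok -> {:cont, :ok}
          {:error, message} -> {:halt, {:error, message}}
        end
      end)

    case result do
      :ok -> ["OK", event["id"], true, ""]
      {:error, message} -> ["OK", event["id"], false, message]
    end
  end

  # Example stage: a trivial structural check standing in for real validation.
  def check_id(%{"id" => id}) when is_binary(id) and byte_size(id) == 64, do: :ok
  def check_id(_event), do: {:error, "invalid: malformed id"}
end
```

Placing the commit stage last in the list is what ties the positive `OK` to a successful write in this sketch.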
### 5.4 Subscription index + fanout
- ETS-backed inverted indices (`kind`, `author`, single-letter tags)
- Candidate narrowing before full filter evaluation
- OR semantics across filters, AND within filter
- `limit` only for initial query phase; ignored in live phase (NIP-01)
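
The full filter evaluation that runs after candidate narrowing might look like the sketch below: OR across a subscription's filters, AND across the conditions inside one filter. `FilterSketch` is a hypothetical module, events and filters are assumed to be JSON-decoded maps with string keys, and treating unknown filter keys as matching is an assumption of this example.

```elixir
defmodule FilterSketch do
  @moduledoc "NIP-01 filter evaluation over decoded events (illustrative)."

  # A subscription matches when ANY of its filters matches (OR).
  def matches_any?(event, filters), do: Enum.any?(filters, &matches?(event, &1))

  # A filter matches when ALL of its conditions hold (AND). `limit` is skipped
  # here because it only applies to the initial stored-event query.
  def matches?(event, filter) do
    Enum.all?(filter, fn
      {"ids", ids} -> event["id"] in ids
      {"authors", authors} -> event["pubkey"] in authors
      {"kinds", kinds} -> event["kind"] in kinds
      {"since", since} -> event["created_at"] >= since
      {"until", until} -> event["created_at"] <= until
      {"limit", _n} -> true
      {"#" <> name, values} -> Enum.any?(event["tags"], &tag_match?(&1, name, values))
      # Unknown keys are ignored in this sketch (assumption).
      {_unknown, _value} -> true
    end)
  end

  # A tag matches when its name equals the filter's letter and its first
  # value is one of the requested values.
  defp tag_match?([name, value | _rest], name, values), do: value in values
  defp tag_match?(_tag, _name, _values), do: false
end
```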
### 5.5 Query service
- Compiles NIP filters into adapter-neutral query AST
- Pushes AST to storage adapter
- Deterministic ordering (`created_at` desc, `id` lexical tie-break)
- Emits `EOSE` exactly once per subscription initial catch-up
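
The deterministic ordering can be expressed as a single composite sort key. The sketch below works on in-memory maps with atom keys (an assumption; the real path would push ordering into the adapter), and `OrderingSketch` is a hypothetical name.

```elixir
defmodule OrderingSketch do
  @moduledoc "Initial-query ordering: newest first, lexical id tie-break (illustrative)."

  # Negating the timestamp turns the default ascending sort into
  # newest-first; equal timestamps fall back to ascending lexical id.
  def sort(events) do
    Enum.sort_by(events, fn %{created_at: ts, id: id} -> {-ts, id} end)
  end

  # The filter's limit is applied only after ordering.
  def initial_page(events, limit), do: events |> sort() |> Enum.take(limit)
end
```

Because the key is total over `(created_at, id)`, repeated queries over the same data return the same page.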
## 6) Storage boundary (swap-friendly by design)
### 6.1 Port/adapter contract
Define behaviors under `Parrhesia.Storage`:
- `Parrhesia.Storage.Events`
  - `put_event/2`, `get_event/2`, `query/3`, `count/3`
  - `delete_by_request/2`, `vanish/2`, `purge_expired/1`
- `Parrhesia.Storage.Moderation`
  - pubkey/event bans, allowlists, blocked IPs
- `Parrhesia.Storage.Groups`
  - NIP-29/NIP-43 membership + role operations
- `Parrhesia.Storage.Admin`
  - backing for NIP-86 methods
All domain logic depends only on these behaviors.
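
As a rough illustration of the port/adapter split, the sketch below defines a behaviour with two of the listed callbacks and a toy in-memory adapter. Only the function names and arities come from the list above; the module names under `StorageSketch`, the argument meanings (a context handle first, then the payload), and the return shapes are assumptions.

```elixir
defmodule StorageSketch.Events do
  @moduledoc "Assumed callback shapes for the events port (illustrative)."

  @type ctx :: term()
  @type event :: map()

  @callback put_event(ctx(), event()) :: {:ok, event()} | {:error, term()}
  @callback get_event(ctx(), id :: binary()) :: {:ok, event()} | {:error, :not_found}
end

defmodule StorageSketch.Memory do
  @moduledoc "Toy adapter backed by an Agent; handy for contract tests."
  @behaviour StorageSketch.Events

  # The Agent pid plays the role of the context handle.
  def start_link, do: Agent.start_link(fn -> %{} end)

  @impl true
  def put_event(agent, %{"id" => id} = event) do
    Agent.update(agent, &Map.put(&1, id, event))
    {:ok, event}
  end

  @impl true
  def get_event(agent, id) do
    case Agent.get(agent, &Map.fetch(&1, id)) do
      {:ok, event} -> {:ok, event}
      :error -> {:error, :not_found}
    end
  end
end
```

Domain code that only calls the behaviour's functions can run against either adapter, which is what makes shared contract tests possible.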
### 6.2 Postgres adapter notes
Initial adapter: `Parrhesia.Storage.Adapters.Postgres` with Ecto.
Schema outline:
- `events` (partitioned by `created_at`; `id`, `pubkey`, `sig` stored in compact binary form; `kind`, `content`, `d_tag`, `deleted_at`, `expires_at`)
- `event_tags` (event_id, name, value, idx)
- moderation tables (banned/allowed pubkeys, banned events, blocked IPs)
- relay/group membership tables
- optional count/HLL helper tables
Indexing strategy:
- `(kind, created_at DESC)`
- `(pubkey, created_at DESC)`
- `(created_at DESC)`
- `(name, value, created_at DESC)` on `event_tags`
- partial/unique indexes and deterministic upsert paths for replaceable `(pubkey, kind)` and addressable `(pubkey, kind, d_tag)` semantics
- targeted partial indexes for high-traffic single-letter tags (`e`, `p`, `d` first), with additional tag indexes added from production query telemetry
Retention strategy:
- Mandatory time partitioning for `events` (monthly default, configurable)
- Partition-aligned pruning for expired/deleted data where possible
- Periodic purge job for expired/deleted tombstoned rows
### 6.3 Postgres operating defaults (locked before implementation)
- **Durability invariant:** relay returns `OK` only after transaction commit for accepted events.
- **Pool separation:** independent DB pools/queues for ingest writes, REQ/COUNT reads, and maintenance/admin operations.
- **Server-side guardrails:** enforce `max_filter_limit`, max filters per REQ, max entries for `ids`/`authors`/`#tag`, and bounded `since/until` windows.
- **Deterministic conflict resolution:** for replaceable/addressable collisions, keep the event with the highest `created_at`; on a tie, keep the lexically lowest `id` (NIP-01-consistent).
- **Conformance lock-in:** treat `since <= created_at <= until`, newest-first initial query ordering, and single `EOSE` emission as fixed behavior.
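
The conflict-resolution default can be sketched as a pure function over two candidate events. `ReplaceableSketch`, the atom-keyed event maps, and the `d_tag` field are assumptions for illustration; in the Postgres adapter the same rule would be enforced by the upsert path.

```elixir
defmodule ReplaceableSketch do
  @moduledoc "Replaceable/addressable collision handling (illustrative)."

  # Returns the event to retain for one replacement key:
  # higher created_at wins; on a tie the lexically lowest id wins.
  def winner(a, b) do
    cond do
      a.created_at > b.created_at -> a
      a.created_at < b.created_at -> b
      a.id <= b.id -> a
      true -> b
    end
  end

  # Replacement key: (pubkey, kind, d_tag) for addressable kinds
  # (30000..39999), (pubkey, kind) otherwise.
  def key(%{kind: kind} = e) when kind in 30_000..39_999, do: {e.pubkey, kind, e.d_tag}
  def key(e), do: {e.pubkey, e.kind}
end
```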
## 7) Feature-specific implementation notes
### 7.1 NIP-11
- Serve over HTTP on the same URL as the WebSocket endpoint when the request carries `Accept: application/nostr+json`
- Include accurate `supported_nips` and `limitation`
### 7.2 NIP-42 + NIP-70
- Connection-scoped challenge store
- Protected events (those carrying the `["-"]` tag) are rejected by default unless the connection has authenticated via NIP-42 as the event's `pubkey`
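
A policy predicate for protected events might look like the following sketch. `ProtectedSketch` and the `MapSet` of authenticated pubkeys are hypothetical; `auth-required:` and `restricted:` are standard NIP-01 prefixes, and the wording after them is illustrative.

```elixir
defmodule ProtectedSketch do
  @moduledoc "NIP-70 protected-event gate (illustrative)."

  @reason "this event may only be published by its author"

  # An event is protected when it carries the one-element tag ["-"].
  def protected?(%{"tags" => tags}), do: Enum.any?(tags, &(&1 == ["-"]))

  # authed_pubkeys: pubkeys this connection has proven via NIP-42 AUTH
  # (representation as a MapSet is an assumption of this sketch).
  def check(event, authed_pubkeys) do
    cond do
      not protected?(event) -> :ok
      MapSet.member?(authed_pubkeys, event["pubkey"]) -> :ok
      # Not authenticated at all: invite the client to AUTH and retry.
      MapSet.size(authed_pubkeys) == 0 -> {:error, "auth-required: " <> @reason}
      # Authenticated, but as someone other than the author.
      true -> {:error, "restricted: " <> @reason}
    end
  end
end
```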
### 7.3 NIP-17/59 privacy guardrails
- Relay can enforce recipient-only reads for kind `1059` (AUTH required)
- Query path validates requester access for wrapped DM fetches
### 7.4 NIP-45 COUNT
- Exact count baseline
- Optional approximate mode and HLL payloads for common queries
### 7.5 NIP-50 search
- Use Postgres FTS (`tsvector`) with ranking
- Apply `limit` after ranking
### 7.6 NIP-77 negentropy
- Track per-negentropy-session state in dedicated GenServer
- Use bounded resources + inactivity timeout
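
The inactivity timeout can lean on GenServer's built-in timeout mechanism. In the sketch below the module name `NegSessionSketch`, the message shape, and the counter standing in for real negentropy state are assumptions; the reconciliation protocol itself is not implemented.

```elixir
defmodule NegSessionSketch do
  @moduledoc "Per-session process that stops itself after inactivity (illustrative)."
  use GenServer

  # idle_ms: inactivity budget in milliseconds, chosen by the caller.
  def start(idle_ms), do: GenServer.start(__MODULE__, idle_ms)

  def message(pid, msg), do: GenServer.call(pid, {:msg, msg})

  @impl true
  def init(idle_ms), do: {:ok, %{idle_ms: idle_ms, seen: 0}, idle_ms}

  @impl true
  def handle_call({:msg, _msg}, _from, state) do
    state = %{state | seen: state.seen + 1}
    # Returning the timeout again re-arms the inactivity timer.
    {:reply, {:ok, state.seen}, state, state.idle_ms}
  end

  # Delivered by GenServer when no message arrives within idle_ms.
  @impl true
  def handle_info(:timeout, state), do: {:stop, :normal, state}
end
```

Stopping with `:normal` pairs well with a `:transient` restart policy, since the supervisor will not restart a session that simply expired.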
### 7.7 NIP-62 vanish
- Hard-delete all events by the requesting pubkey with `created_at` up to the vanish request's `created_at`
- Also delete gift wraps (kind `1059`) whose `#p` tag targets that pubkey, where feasible
- Persist minimal audit record if needed for operations/legal trace
### 7.8 NIP-EE MLS (feature-flagged)
- Accept/store kind `443` KeyPackage events
- Process kind `445` under configurable retention policy (default short TTL)
- Ensure kind `10051` replaceable semantics
- Keep relay MLS-agnostic cryptographically (no MLS decryption in relay path)
## 8) Performance model
- Bounded mailbox and queue limits on connections
- ETS-heavy hot path (subscription match, auth/session cache)
- DB writes batched where safe; reads via prepared plans
- Avoid global locks; prefer partitioned workers and sharded ETS tables
- Telemetry-first tuning: p50/p95/p99 for ingest, query, fanout
- Expose Prometheus-compatible `/metrics` endpoint for scraping
Targets (initial):
- p95 EVENT ack < 50ms under nominal load
- p95 REQ initial response start < 120ms on indexed queries
- predictable degradation under overload via rate-limit + backpressure
## 9) Testing strategy (full suite)
1. **Unit tests**: parser, filter evaluator, policy predicates, NIP validators
2. **Property tests**: filter semantics, replaceable/addressable conflict resolution
3. **Adapter contract tests**: shared behavior tests run against Postgres adapter
4. **Integration tests**: websocket protocol flows (`EVENT/REQ/CLOSE/AUTH/COUNT/NEG-*`)
5. **NIP conformance tests**: machine-prefix responses, ordering, EOSE behavior
6. **MLS scenario tests**: keypackage/group-event acceptance and policy handling
7. **Performance tests**: soak + burst + large fanout profiles
8. **Query-plan regression tests**: representative `EXPLAIN (ANALYZE, BUFFERS)` checks for core REQ/COUNT shapes
9. **Fault-injection tests**: DB outage, slow query, connection churn, node restart
## 10) Implementation principles
- Keep relay event-kind agnostic by default; special-case only where NIPs require
- Prefer explicit feature flags for expensive/experimental modules
- No direct Ecto usage outside Postgres adapter and migration layer
- Every feature lands with tests + telemetry hooks
---
Implementation task breakdown is tracked in `./PROGRESS.md`.