From 8dbf05b7fe519860282cea695b73793f29ac11b4 Mon Sep 17 00:00:00 2001 From: Steffen Beyer Date: Wed, 18 Mar 2026 13:23:06 +0100 Subject: [PATCH] docs: Opus review --- docs/slop/REVIEW.md | 984 ++++++++++++++++++-------------------------- 1 file changed, 390 insertions(+), 594 deletions(-) diff --git a/docs/slop/REVIEW.md b/docs/slop/REVIEW.md index 6fee164..25e6776 100644 --- a/docs/slop/REVIEW.md +++ b/docs/slop/REVIEW.md @@ -1,670 +1,466 @@ -# Parrhesia Relay — Technical Review +# Parrhesia Technical Audit -**Reviewer:** Case, Senior Systems & Protocol Engineer -**Date:** 2026-03-14 -**Commit:** `63d3e7d` (master) -**Scope:** Full codebase review against Nostr NIPs, MARMOT specs, and production readiness criteria +**Date:** 2026-03-17 +**Reviewer:** BEAM Architecture & Security Specialist +**Version audited:** 0.5.0 (Alpha) +**Scope:** Full-spectrum audit — protocol compliance, architecture, OTP supervision, backpressure, data integrity, security --- -# Executive Summary +## Table of Contents -Parrhesia is a well-structured Nostr relay built on Elixir/OTP with PostgreSQL storage. The architecture is clean — clear separation between web, protocol, policy, and storage layers with a pluggable adapter pattern. Code quality is above average: consistent error handling, good use of `with` chains, comprehensive policy enforcement for MARMOT-specific concerns, and thoughtful outbound backpressure management. The developer clearly understands both the BEAM and the Nostr protocol. - -However, the relay has **two critical defects** that make it unsuitable for any deployment beyond trusted local development: (1) **no Schnorr signature verification** — any client can forge events with arbitrary pubkeys, and (2) **lossy tag storage** — events returned from queries have truncated tags, violating NIP-01's data integrity guarantees. 
Several additional high-severity issues (no ephemeral event handling, missing NIP-42 relay tag validation, SQL LIKE injection vector, no ingest rate limiting) compound the risk. - -**Overall risk rating: Critical** - -This relay is **not production-ready** for any public deployment. It is suitable for local development and internal testing with trusted clients. With the critical and high findings addressed, it could serve as a solid private relay. Public internet deployment requires significant additional hardening. +1. [Feature Parity & Protocol Completeness](#1-feature-parity--protocol-completeness) +2. [Architecture & Design Coherence](#2-architecture--design-coherence) +3. [OTP & Supervision Tree](#3-otp--supervision-tree) +4. [Backpressure & Load Handling](#4-backpressure--load-handling) +5. [Data Integrity & PostgreSQL](#5-data-integrity--postgresql) +6. [Security](#6-security) +7. [Overall Assessment](#7-overall-assessment) +8. [Prioritised Remediation List](#8-prioritised-remediation-list) --- -# Top Findings +## 1. Feature Parity & Protocol Completeness -## [Critical] No Schnorr Signature Verification +**Verdict: ⚠️ Concern — two ghost features, two partial implementations** -**Area:** protocol correctness, security +### Claimed NIPs: 1, 9, 11, 13, 17, 40, 42, 43, 44, 45, 50, 59, 62, 66, 70, 77, 86, 98 -**Why it matters:** -NIP-01 mandates that relays MUST verify event signatures using Schnorr signatures over secp256k1. Without signature verification, any client can publish events with any pubkey. This completely breaks the identity and trust model of the Nostr protocol. Authentication (NIP-42), protected events (NIP-70), deletion (NIP-09), replaceable events — all rely on pubkey authenticity. 
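The missing check is small once a binding is in place — a sketch only, where `SchnorrBinding.verify/3` is a stand-in for whichever secp256k1 library is adopted (e.g. `ex_secp256k1`; confirm the real module name and argument order against that library's docs):

```elixir
# Sketch only — `SchnorrBinding.verify/3` is a placeholder for the
# chosen secp256k1 library; check its actual API before wiring in.
# Per NIP-01 the signed message is the 32-byte event id itself,
# which validate_id_hash/1 has already tied to the event contents.
defp validate_signature(%{"id" => id, "pubkey" => pubkey, "sig" => sig}) do
  with {:ok, message} <- Base.decode16(id, case: :lower),
       {:ok, pk} <- Base.decode16(pubkey, case: :lower),
       {:ok, signature} <- Base.decode16(sig, case: :lower),
       true <- SchnorrBinding.verify(signature, message, pk) do
    :ok
  else
    _ -> {:error, :invalid_signature}
  end
end
```

Running this after `validate_id_hash/1` means a valid signature transitively authenticates the whole serialised event, not just the id.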
+| NIP | Title | Status | Notes | +|-----|-------|--------|-------| +| 01 | Basic Protocol | ✅ Full | EVENT, REQ, CLOSE, EOSE, OK, NOTICE, CLOSED all correct | +| 09 | Event Deletion | ✅ Full | Kind 5 with `e`/`a`/`k` tags; author-only enforcement; soft delete | +| 11 | Relay Information | ✅ Full | `application/nostr+json` on same URI; limitations section present | +| 13 | Proof of Work | ✅ Full | Leading-zero-bit calculation; configurable `min_pow_difficulty` | +| 17 | Private DMs | ⚠️ Partial | Kind 14/15 accepted; encryption/decryption client-side only; relay stores giftwraps transparently | +| 40 | Expiration | ✅ Full | Tag parsed at ingest; background worker for cleanup | +| 42 | Authentication | ✅ Full | Challenge/response with Kind 22242; freshness check; challenge rotation | +| 43 | Relay Access Metadata | ⚠️ Partial | Kind 13534 membership tracking present; join/leave request mechanics (28934/28935/28936) unimplemented | +| 44 | Encrypted Payloads | ✅ Full | Relay-transparent; version 2 format recognised | +| 45 | Count | ✅ Full | COUNT messages with optional HyperLogLog; `approximate` field | +| 50 | Search | ⚠️ Partial | Basic `ILIKE` substring match only; no tsvector/pg_trgm; no extensions (domain, language, sentiment) | +| 59 | Gift Wrap | ⚠️ Partial | Kind 1059/13 accepted; encryption at client level; recipient auth gating present | +| 62 | Request to Vanish | ✅ Full | Kind 62; relay tag requirement; `ALL_RELAYS` support; hard delete | +| **66** | **Relay Discovery** | **🔴 Ghost** | **Listed in `supported_nips` but no relay-specific liveness/discovery logic exists. 
Kinds 30166/10166 stored as regular events only.** | +| 70 | Protected Events | ✅ Full | `["-"]` tag detection; AUTH required; pubkey matching | +| 77 | Negentropy Sync | ✅ Full | NEG-OPEN/MSG/CLOSE/ERR; session management; configurable limits | +| 86 | Management API | ✅ Full | JSON-RPC over HTTP; NIP-98 auth; full moderation method set | +| 98 | HTTP Auth | ✅ Full | Kind 27235; URL/method binding; SHA256 payload hash; freshness | -**Evidence:** -`lib/parrhesia/protocol/event_validator.ex` validates the event ID hash (`validate_id_hash/1` at line 188) but never verifies the `sig` field against the `pubkey` using Schnorr/secp256k1. A `grep` for `schnorr`, `secp256k1`, `verify`, and `:crypto.verify` across the entire `lib/` directory returns zero results. The `validate_sig/1` function (line 182) only checks that `sig` is 64-byte lowercase hex — a format check, not a cryptographic verification. +### Ghost Features -**Spec reference:** -NIP-01: "Each user has a keypair. Signatures, public key, and encodings are done according to the Schnorr signatures standard for the curve secp256k1." The relay is expected to verify signatures to ensure event integrity. +1. **NIP-66 (Relay Discovery and Liveness Monitoring)** — Listed in supported NIPs but the relay performs no special handling for kinds 30166 or 10166. They are accepted and stored as generic events. No liveness data generation exists. **Remove from `supported_nips` or implement.** -**Attack scenario:** -An unauthenticated client connects and publishes `["EVENT", {"id": "", "pubkey": "", "sig": "", ...}]`. The relay stores and fans out the forged event as if the victim authored it. This enables impersonation, reputation attacks, and poisoning of replaceable events (kind 0 profile, kind 3 contacts, kind 10002 relay lists). +2. 
**NIP-43 (Relay Access Metadata)** — Group membership tracking (kind 13534, 8000, 8001) is present, but the join/leave request mechanics (kinds 28934, 28935, 28936) and invite-code validation are absent. This is a partial implementation marketed as complete. **Document as partial or implement join/leave flow.** -**Recommended fix:** -Add a secp256k1 library dependency (e.g., `ex_secp256k1` or `:crypto` with OTP 26+ Schnorr support) and add signature verification to `EventValidator.validate/1` after `validate_id_hash/1`. This is the single most important fix. +### Undocumented Features -**Confidence:** High +- **Marmot MLS groups** (kinds 443-449) with specialised media validation, push notification policies, and `h`-tag scoping are implemented but are not standard NIPs. These are correctly treated as extensions rather than NIP claims. + +### Remediation + +| Item | Action | File(s) | +|------|--------|---------| +| NIP-66 ghost | Remove from `supported_nips` in `relay_info.ex` | `lib/parrhesia/web/relay_info.ex` | +| NIP-43 partial | Document as partial or implement join/leave | `lib/parrhesia/groups/flow.ex` | +| NIP-50 extensions | Consider tsvector for FTS; document missing extensions | `lib/parrhesia/storage/adapters/postgres/events.ex` | --- -## [Critical] Lossy Tag Storage — Events Returned With Truncated Tags +## 2. Architecture & Design Coherence -**Area:** protocol correctness, database +**Verdict: ✅ Sound — clean layered architecture with proper separation of concerns** -**Why it matters:** -NIP-01 events have tags with arbitrary numbers of elements (e.g., `["e", "", "", ""]`, `["p", "", ""]`, `["a", "::", ""]`). The relay only stores the first two elements (`name` and `value`) of each tag in the `event_tags` table, and single-element tags (like `["-"]` for NIP-70 protected events) are dropped entirely. When events are queried back, the reconstructed tags are truncated. 
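One way to make the round-trip lossless — a sketch assuming a `tags` JSONB column is added to `events` and populated verbatim from `event["tags"]` at ingest, with `event_tags` rows demoted to a pure filter index:

```elixir
# Sketch — assumes a `tags` jsonb column on `events`; the normalised
# `event_tags` rows remain only a query index, never the read path.
defp to_nostr_event(row) do
  %{
    "id" => Base.encode16(row.id, case: :lower),
    "pubkey" => Base.encode16(row.pubkey, case: :lower),
    "kind" => row.kind,
    "created_at" => row.created_at,
    "content" => row.content,
    # Returned exactly as stored: relay hints, markers, and
    # single-element tags such as ["-"] all survive the round-trip.
    "tags" => row.tags
  }
end
```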
+### Layer Map -**Evidence:** -`lib/parrhesia/storage/adapters/postgres/events.ex`: -- `insert_tags!/2` (line 266): pattern matches `[name, value | _rest]` — discards `_rest`, ignores tags with fewer than 2 elements. -- `load_tags/1` (line 739): reconstructs tags as `[tag.name, tag.value]` — only 2 elements. -- `to_nostr_event/2` (line 763): uses the truncated tags directly. - -The `events` table itself does not store the full tag array. The full tags exist only in the original JSON during ingest, then are lost. - -**Spec reference:** -NIP-01: Tags are arrays of arbitrary strings. Relay implementations MUST return events with their complete, unmodified tags. Relay hints in `e`/`p` tags, markers, and other metadata are essential for client operation. - -**Attack/failure scenario:** -1. Client publishes event with `["e", "", "wss://relay.example.com", "reply"]`. -2. Another client queries and receives `["e", ""]` — relay hint and marker lost. -3. Client cannot follow the event reference to the correct relay. -4. Protected events with `["-"]` tag lose their protection marker on retrieval, breaking NIP-70 semantics. - -**Recommended fix:** -Either (a) store the full tag JSON array in the events table (e.g., a `tags` JSONB column), using `event_tags` only as a query index, or (b) add additional columns to `event_tags` to preserve all elements (e.g., a `rest` text array column or store the full tag as a JSONB column). Option (a) is simpler and more correct. - -**Confidence:** High - ---- - -## [High] No Ephemeral Event Handling (Kind 20000–29999) - -**Area:** protocol correctness, performance - -**Why it matters:** -NIP-01 defines kinds 20000–29999 as ephemeral events that relays are NOT expected to store. They should be fanned out to matching subscribers but never persisted. The relay currently persists all events regardless of kind, which wastes storage and violates client expectations. 
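The guard itself is a few lines — a sketch, deliberately simplified to omit the existing kind-5/kind-62 routing and any `accept_ephemeral_events` config gating:

```elixir
# Sketch — NIP-01 ephemeral range: fan out, never persist.
defp ephemeral_kind?(kind), do: kind in 20_000..29_999

defp persist_event(event) do
  if ephemeral_kind?(event["kind"]) do
    # Live subscribers with matching filters still receive the
    # event; nothing touches the database or the tag index.
    fanout_event(event)
    :ok
  else
    put_event(event)
  end
end
```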
- -**Evidence:** -`lib/parrhesia/storage/adapters/postgres/events.ex`: -- `replaceable_kind?/1` (line 515): handles kinds 0, 3, 10000–19999. -- `addressable_kind?/1` (line 517): handles kinds 30000–39999. -- No function checks for ephemeral kinds (20000–29999). -- `put_event/2` persists all non-deletion, non-vanish events unconditionally. - -`lib/parrhesia/web/connection.ex`: -- `persist_event/1` (line 420): routes kind 5 to deletion, kind 62 to vanish, everything else to `put_event`. No ephemeral bypass. - -The config has `accept_ephemeral_events: true` but it's never checked anywhere. - -**Spec reference:** -NIP-01: "Upon receiving an ephemeral event, a relay is NOT expected to store it and SHOULD send it directly to the clients that have matching filters open." - -**Recommended fix:** -In `persist_event/1`, check if the event kind is in the ephemeral range. If so, skip DB persistence and only fan out. The `accept_ephemeral_events` config should gate whether ephemeral events are accepted at all. - -**Confidence:** High - ---- - -## [High] NIP-42 AUTH Missing Relay Tag Validation - -**Area:** protocol correctness, security - -**Why it matters:** -NIP-42 requires AUTH events to include a `["relay", ""]` tag that matches the relay's URL. Without this check, an AUTH event created for relay A can be replayed against relay B, enabling cross-relay authentication bypass. - -**Evidence:** -`lib/parrhesia/web/connection.ex`: -- `validate_auth_event/1` (line 573): checks kind 22242 and presence of `challenge` tag. -- `validate_auth_challenge/2` (line 590): checks challenge value matches. -- **No validation of `relay` tag** anywhere in the auth flow. - -**Spec reference:** -NIP-42: AUTH event "MUST include `['relay', '']` tag". The relay MUST verify this tag matches its own URL. - -**Attack scenario:** -Attacker obtains an AUTH event from user for relay A (which may be the attacker's relay). 
Attacker replays this AUTH event against Parrhesia, which accepts it because the challenge is the only thing checked. If the challenge can be predicted or leaked, authentication is fully bypassed. - -**Recommended fix:** -Add relay URL validation to `validate_auth_event/1`. The relay should know its own canonical URL (from config or NIP-11 document) and verify the `relay` tag matches. - -**Confidence:** High - ---- - -## [High] SQL LIKE Pattern Injection in NIP-50 Search - -**Area:** security, performance - -**Why it matters:** -The NIP-50 search implementation uses PostgreSQL `ILIKE` with unsanitized user input interpolated into the pattern. While not traditional SQL injection (the value is parameterized), LIKE metacharacters (`%`, `_`) in the search string alter the matching semantics and can cause catastrophic performance. - -**Evidence:** -`lib/parrhesia/storage/adapters/postgres/events.ex` line 627: -```elixir -where(query, [event], ilike(event.content, ^"%#{search}%")) +``` +┌─────────────────────────────────────────────────────┐ +│ Transport Layer (Bandit + WebSock) │ +│ lib/parrhesia/web/ │ +│ connection.ex · router.ex · endpoint.ex · listener │ +├─────────────────────────────────────────────────────┤ +│ Protocol Layer (codec + validation) │ +│ lib/parrhesia/protocol/ │ +│ protocol.ex · event_validator.ex · filter.ex │ +├─────────────────────────────────────────────────────┤ +│ API Layer (domain operations) │ +│ lib/parrhesia/api/ │ +│ events.ex · auth.ex · stream.ex · sync.ex · admin │ +├─────────────────────────────────────────────────────┤ +│ Policy Layer (authorisation & rules) │ +│ lib/parrhesia/policy/ │ +│ event_policy.ex · connection_policy.ex │ +├─────────────────────────────────────────────────────┤ +│ Storage Layer (adapter pattern) │ +│ lib/parrhesia/storage/ │ +│ behaviours → postgres adapters · memory adapters │ +└─────────────────────────────────────────────────────┘ ``` -The `search` variable is directly interpolated into the LIKE pattern. 
User-supplied values like `%a%b%c%d%e%f%g%h%i%j%` create pathological patterns that force PostgreSQL into exponential backtracking against the full `content` column of every matching row. +### Strengths -**Attack scenario:** -Client sends `["REQ", "sub1", {"search": "%a%b%c%d%e%f%g%h%i%j%k%l%m%n%o%p%q%r%", "kinds": [1]}]`. PostgreSQL executes an expensive sequential scan with exponential LIKE pattern matching. A handful of concurrent requests with adversarial patterns can saturate the DB connection pool and CPU. +- **Clean adapter pattern**: Storage behaviours (`Storage.Events`, `Storage.Moderation`, etc.) are resolved at runtime via `Parrhesia.Storage`. PostgreSQL and in-memory implementations are interchangeable. +- **Protocol layer is pure**: `Protocol.decode_client/1` and `Protocol.encode_relay/1` contain no side effects — they are pure codecs. Validation is separated into `EventValidator`. +- **API layer mediates all domain logic**: `API.Events.publish/2` orchestrates validation → policy → persistence → fanout. The web layer never touches the DB directly. +- **Connection handler is a state machine**: `Web.Connection` implements `WebSock` behaviour cleanly — no business logic bleeds in; it delegates to API functions. +- **Configuration is centralised**: `Parrhesia.Config` provides an ETS-backed config cache with runtime environment override support via 50+ env vars. -**Recommended fix:** -1. Escape `%` and `_` characters in user search input before interpolation: `search |> String.replace("%", "\\%") |> String.replace("_", "\\_")`. -2. Consider PostgreSQL full-text search (`tsvector`/`tsquery`) instead of ILIKE for better performance and correct semantics. -3. Add a minimum search term length (e.g., 3 characters). +### Concerns -**Confidence:** High +- **Connection handler size**: `web/connection.ex` is a large module handling all WebSocket message dispatch. While each handler delegates to API, the dispatch surface is wide. 
Consider extracting message handlers into sub-modules. +- **Fanout is embedded in API.Events**: The `fanout_event/1` call inside `publish/2` couples persistence with subscription delivery. Under extreme load, this means the publisher blocks on fanout. Decoupling into an async step would improve isolation. + +### Scalability Assessment + +The architecture would scale gracefully to moderate load (thousands of concurrent connections) without a rewrite. For truly high-volume deployments (10k+ connections, 100k+ events/sec), the synchronous fanout path and lack of global rate limiting would need attention. The BEAM's per-process isolation provides natural horizontal scaling if combined with a load balancer. --- -## [High] No Per-Connection or Per-IP Rate Limiting on Event Ingestion +## 3. OTP & Supervision Tree -**Area:** security, robustness +**Verdict: ✅ Sound — exemplary OTP compliance with zero unsafe practices** -**Why it matters:** -There is no rate limiting on EVENT submissions. A single client can flood the relay with events at wire speed, consuming DB connections, CPU (for validation), and disk I/O. The outbound queue has backpressure, but the ingest path is completely unbounded. +### Supervision Hierarchy -**Evidence:** -`lib/parrhesia/web/connection.ex`: -- `handle_event_ingest/2` (line 186): processes every EVENT message immediately with no throttle. -- No token bucket, sliding window, or any rate-limiting mechanism anywhere in the codebase. -- `grep` for `rate.limit`, `throttle`, `rate_limit` across `lib/` returns only error message strings, not actual rate-limiting logic. - -**Attack scenario:** -A single WebSocket connection sends 10,000 EVENT messages per second. Each triggers validation, policy checks, and a DB transaction. The DB connection pool (default 32) saturates within milliseconds. All other clients experience timeouts. - -**Recommended fix:** -Implement per-connection rate limiting in the WebSocket handler (token bucket per connection state). 
Consider also per-pubkey and per-IP rate limiting as a separate layer. Start with a simple `{count, window_start}` in connection state. - -**Confidence:** High - ---- - -## [High] max_frame_bytes and max_event_bytes Not Enforced - -**Area:** security, robustness - -**Why it matters:** -The configuration defines `max_frame_bytes: 1_048_576` and `max_event_bytes: 262_144` but neither value is actually used to limit incoming data. The max_frame_bytes is only reported in the NIP-11 document. An attacker can send arbitrarily large WebSocket frames and events. - -**Evidence:** -- `grep` for `max_frame_bytes` in `lib/`: only found in `relay_info.ex` for NIP-11 output. -- `grep` for `max_event_bytes` in `lib/`: no results at all. -- The Bandit WebSocket upgrade in `router.ex` line 53 passes `timeout: 60_000` but no `max_frame_size` option. -- No payload size check in `handle_in/2` before JSON decoding. - -**Attack scenario:** -Client sends a 100MB WebSocket frame containing a single event with a massive `content` field or millions of tags. The relay attempts to JSON-decode the entire payload in memory, potentially causing OOM or extreme GC pressure. - -**Recommended fix:** -1. Pass `max_frame_size` to Bandit's WebSocket upgrade options. -2. Check `byte_size(payload)` in `handle_in/2` before calling `Protocol.decode_client/1`. -3. Optionally check individual event size after JSON decoding. - -**Confidence:** High - ---- - -## [Medium] NIP-09 Deletion Missing "a" Tag Support for Addressable Events - -**Area:** protocol correctness - -**Why it matters:** -NIP-09 specifies that deletion events (kind 5) can reference addressable/replaceable events via `"a"` tags (format: `"::"`). The current implementation only handles `"e"` tags. - -**Evidence:** -`lib/parrhesia/storage/adapters/postgres/events.ex`: -- `extract_delete_event_ids/1` (line 821): only extracts `["e", event_id | _rest]` tags. -- No handling of `["a", ...]` tags. 
-- No query against addressable_event_state or events by kind+pubkey+d_tag. - -**Spec reference:** -NIP-09: "The deletion event MAY contain `a` tags pointing to the replaceable/addressable events to be deleted." - -**Recommended fix:** -Extract `"a"` tags from the deletion event, parse the `kind:pubkey:d_tag` format, and soft-delete matching events from the addressable/replaceable state tables, ensuring the deleter's pubkey matches. - -**Confidence:** High - ---- - -## [Medium] Subscription Index GenServer Is a Single-Point Bottleneck - -**Area:** performance, OTP/design - -**Why it matters:** -Every event fanout goes through `Index.candidate_subscription_keys/1`, which is a synchronous `GenServer.call` to a single process. Under load with many connections and high event throughput, this process becomes the serialization point for all fanout operations. - -**Evidence:** -`lib/parrhesia/subscriptions/index.ex`: -- `candidate_subscription_keys/2` (line 68): `GenServer.call(server, {:candidate_subscription_keys, event})` -- This is called from every connection process for every ingested event (via `fanout_event/1` in `connection.ex` line 688). -- The ETS tables are `:protected`, meaning only the owning GenServer can write but any process can read. - -**Recommended fix:** -Since the ETS tables are already `:protected` (readable by all processes), make `candidate_subscription_keys/1` read directly from ETS without going through the GenServer. Only mutations (upsert/remove) need to go through the GenServer. This eliminates the serialization bottleneck entirely. - -Actually, the tables are `:protected` which means other processes CAN read. Refactor `candidate_subscription_keys` to read ETS directly from the caller's process, bypassing the GenServer. 
- -**Confidence:** High - ---- - -## [Medium] Moderation Cache ETS Table Creation Race Condition - -**Area:** robustness, OTP/design - -**Why it matters:** -The moderation cache ETS table is lazily created on first access via `cache_table_ref/0`. If two processes simultaneously call a moderation function before the table exists, both will attempt `ets.new(:parrhesia_moderation_cache, [:named_table, ...])` — one will succeed and one will hit the rescue clause. While the rescue catches the `ArgumentError`, this is a race-prone pattern. - -**Evidence:** -`lib/parrhesia/storage/adapters/postgres/moderation.ex` lines 211–231: -```elixir -defp cache_table_ref do - case :ets.whereis(@cache_table) do - :undefined -> - try do - :ets.new(@cache_table, [...]) - rescue - ArgumentError -> @cache_table - end - @cache_table - _table_ref -> - @cache_table - end -end +``` +Parrhesia.Supervisor (:one_for_one) +├── Telemetry (Supervisor, :one_for_one) +│ ├── TelemetryMetricsPrometheus.Core +│ └── :telemetry_poller +├── Config (GenServer, ETS-backed) +├── Storage.Supervisor (:one_for_one) +│ ├── ModerationCache (GenServer) +│ └── Repo (Ecto.Repo) +├── Subscriptions.Supervisor (:one_for_one) +│ ├── Index (GenServer, 7 ETS tables) +│ ├── Stream.Registry (Registry) +│ ├── Stream.Supervisor (DynamicSupervisor) +│ │ └── [dynamic] Stream.Subscription (GenServer, :temporary) +│ ├── Negentropy.Sessions (GenServer) +│ └── Fanout.MultiNode (GenServer, :pg group) +├── Auth.Supervisor (:one_for_one) +│ ├── Challenges (GenServer) +│ └── Identity.Manager (GenServer) +├── Sync.Supervisor (:one_for_one) +│ ├── WorkerRegistry (Registry) +│ ├── WorkerSupervisor (DynamicSupervisor) +│ │ └── [dynamic] Sync.Worker (GenServer, :transient) +│ └── Sync.Manager (GenServer) +├── Policy.Supervisor (:one_for_one) [empty — stateless modules] +├── Web.Endpoint (Supervisor, :one_for_one) +│ └── [per-listener] Bandit +│ └── [per-connection] Web.Connection (WebSock) +└── Tasks.Supervisor (:one_for_one) + ├── 
ExpirationWorker (GenServer, optional) + └── PartitionRetentionWorker (GenServer) ``` -Additionally, `ensure_cache_scope_loaded/1` has a TOCTOU race: it checks `ets.member(table, loaded_key)`, then loads from DB and inserts — two processes could both load and insert simultaneously, though this is less harmful (just redundant work). +### Findings -**Recommended fix:** -Create the ETS table in a supervised process (e.g., in `Parrhesia.Policy.Supervisor` or `Parrhesia.Storage.Supervisor`) at startup, not lazily. This eliminates the race entirely. +| Check | Result | +|-------|--------| +| Bare `spawn()` calls | ✅ None found | +| `Task.start` without supervision | ✅ None found | +| `Task.async` without supervision | ✅ None found | +| Unsupervised `Process.` calls | ✅ None found | +| All `Process.monitor` paired with `Process.demonitor(ref, [:flush])` | ✅ Confirmed in Index, Challenges, Subscription, Sessions | +| Restart strategies appropriate | ✅ All `:one_for_one` — children are independent | +| `:transient` for Sync.Worker | ✅ Correct — no restart on normal shutdown | +| `:temporary` for Stream.Subscription | ✅ Correct — never restart dead subscriptions | +| Crash cascade possibility | ✅ None — all supervisors isolate failures | +| Deadlock potential | ✅ None — no circular GenServer calls detected | +| ETS table ownership | ✅ Owned by supervising GenServer; recreated on restart | -**Confidence:** High +### Minor Observations + +- **Policy.Supervisor is empty**: Contains no children (stateless policy modules). Exists as a future extension point — no performance cost. +- **Identity.Manager** silently catches init failures: Returns `{:ok, %{}}` even if identity setup fails. Correct for startup resilience but consider a telemetry event for visibility. +- **4-level max depth** in the tree: Optimal — deep enough for isolation, shallow enough for comprehension. --- -## [Medium] Archiver SQL Injection +## 4. 
Backpressure & Load Handling -**Area:** security +**Verdict: ⚠️ Concern — per-connection defences are solid; global protection is absent** -**Why it matters:** -The `Parrhesia.Storage.Archiver.archive_sql/2` function directly interpolates arguments into a SQL string without any sanitization or quoting. +### What Exists -**Evidence:** -`lib/parrhesia/storage/archiver.ex` line 32: -```elixir -def archive_sql(partition_name, archive_table_name) do - "INSERT INTO #{archive_table_name} SELECT * FROM #{partition_name};" -end +| Mechanism | Implementation | Default | Location | +|-----------|---------------|---------|----------| +| Per-connection event rate limit | Window-based counter | 120 events/sec | `connection.ex:1598-1624` | +| Per-connection subscription cap | Hard limit at REQ time | 32 subs | `connection.ex:1098-1109` | +| Per-connection outbound queue | Bounded `:queue` | 256 entries | `connection.ex` state | +| Outbound overflow strategy | `:close` / `:drop_newest` / `:drop_oldest` | `:close` | `connection.ex:987-1013` | +| Negentropy session limits | Per-connection + global cap | 8/conn, 10k global | `negentropy/sessions.ex` | +| Filter count per REQ | Hard limit | 16 filters | `connection.ex` | +| Queue pressure telemetry | Pressure gauge (0-1 scale) | Alert at ≥ 0.75 | `telemetry.ex:57-81` | + +### What's Missing + +| Gap | Risk | Recommendation | +|-----|------|----------------| +| **No GenStage/Broadway backpressure** | Event ingestion is synchronous — publisher blocks on DB write + fanout | Consider async fanout via GenStage or Task.Supervisor | +| **No global rate limit** | 1000 connections × 120 evt/s = 120k evt/s with no relay-wide cap | Add global event counter with configurable ceiling | +| **No connection cap** | Bandit accepts unlimited connections up to OS limits | Set `max_connections` in Bandit options or add a plug-based cap | +| **No per-IP rate limit** | Single IP can open many connections, each with independent limits | Add IP-level rate 
limiting or document reverse-proxy requirement | +| **No subscription index size cap** | Global ETS subscription index can grow unbounded | Add global subscription count metric + circuit breaker | +| **Synchronous fanout** | `fanout_event/1` iterates all matching subs in publisher's process | Decouple fanout into async dispatch | +| **No read/write pool separation** | Writes can starve read queries under heavy ingest | Consider separate Ecto repos or query prioritisation | +| **No graceful drain on shutdown** | Listener shutdown doesn't flush queued events | Add drain timer before connection termination | +| **No mailbox depth monitoring** | `Process.info(pid, :message_queue_len)` never checked | Add periodic mailbox depth telemetry for subscription processes | + +### Blast Radius: Flood Attack Scenario + +A high-volume client flooding the relay with valid events at maximum rate: + +1. **120 events/sec per connection** — rate-limited per-connection ✅ +2. **But**: attacker opens 100 connections from different IPs → 12k events/sec ⚠️ +3. Each event hits the Ecto pool (32 connections) → pool saturation at ~100 concurrent writes +4. Fanout blocks publisher process → connection mailbox grows +5. No global circuit breaker → relay degrades gradually via Ecto queue timeouts +6. Memory grows in subscription process mailboxes → eventual OOM + +**Mitigation**: Deploy behind a reverse proxy (nginx/HAProxy) with connection and rate limits. This is a common pattern for BEAM applications but should be documented as a deployment requirement. + +--- + +## 5. 
Data Integrity & PostgreSQL + +**Verdict: ✅ Sound — well-designed schema with strong indexing; minor gaps in constraints and FTS** + +### Schema Overview + +Events are stored in monthly RANGE-partitioned tables keyed on `created_at`: + +```sql +events (PARTITION BY RANGE (created_at)) + id bytea (32 bytes, SHA256) + pubkey bytea (32 bytes) + created_at bigint (unix timestamp, partition key) + kind integer + content text + sig bytea (64 bytes) + d_tag text (nullable, for addressable events) + tags jsonb (denormalised tag array) + deleted_at bigint (soft delete) + expires_at bigint (NIP-40) + +event_tags (PARTITION BY RANGE (event_created_at)) + event_created_at bigint + event_id bytea + name text + value text + idx integer + +-- State tables for NIP-16 replaceable/addressable events +replaceable_event_state (pubkey, kind → event_id, created_at) +addressable_event_state (pubkey, kind, d_tag → event_id, created_at) ``` -If either argument is derived from user input or external configuration, this is a SQL injection vector. +### Indexing Strategy: ✅ Excellent -**Attack scenario:** -If the management API or any admin tool passes user-controlled input to this function (e.g., a partition name from a web request), an attacker could inject: `archive_sql("events_default; DROP TABLE events; --", "archive")`. 
+| Index | Columns | Purpose | Assessment | +|-------|---------|---------|------------| +| `events_kind_created_at_idx` | (kind, created_at DESC) | Kind filter | ✅ Primary access pattern | +| `events_pubkey_created_at_idx` | (pubkey, created_at DESC) | Author filter | ✅ Primary access pattern | +| `events_created_at_idx` | (created_at DESC) | Time range | ✅ Partition pruning | +| `events_id_idx` | (id) | Direct lookup / dedup | ✅ Essential | +| `events_expires_at_idx` | Partial, WHERE expires_at IS NOT NULL | Expiration worker | ✅ Space-efficient | +| `events_deleted_at_idx` | Partial, WHERE deleted_at IS NOT NULL | Soft-delete filter | ✅ Space-efficient | +| `event_tags_name_value_idx` | (name, value, event_created_at DESC) | Generic tag lookup | ✅ Covers all tag queries | +| `event_tags_i_idx` | Partial (value, event_created_at) WHERE name='i' | Hashtag lookups | ✅ Optimised for common pattern | +| `event_tags_h_idx` | Partial (value, event_created_at) WHERE name='h' | Group/relay hint | ✅ Marmot-specific optimisation | -**Recommended fix:** -Quote identifiers using `~s("#{identifier}")` or better, use Ecto's `Ecto.Adapters.SQL.query/3` with proper identifier quoting. Validate that inputs match expected partition name patterns (e.g., `events_YYYYMM`). 
+### Query Construction: ✅ Sound -**Confidence:** Medium (depends on whether this function is exposed to external input) +Nostr filters are translated to Ecto queries via `postgres/events.ex`: +- Single-filter: composed dynamically with `where/3` clauses +- Multi-filter: `UNION ALL` of individual filter queries, then Elixir-level dedup +- Tag filters: primary tag uses index join; secondary tags use `EXISTS` subqueries (properly parameterised) +- All values bound via Ecto's `^` operator — no SQL injection vectors + +### N+1 Analysis: ✅ No Critical Issues + +- Tag filtering uses `EXISTS` subqueries rather than joins — one query per filter, not per tag +- Multiple filters produce N queries (one per filter), not N per subscription +- `max_filters_per_req: 16` bounds this naturally +- Deduplication is done in Elixir (`MapSet`) after DB fetch — acceptable for bounded result sets + +### Issues Found + +| Issue | Severity | Detail | Remediation | +|-------|----------|--------|-------------| +| **No CHECK constraints on binary lengths** | Low | `id`, `pubkey`, `sig` have no `octet_length` checks at DB level | Add `CHECK (octet_length(id) = 32)` etc. 
via migration |
| **No length constraint on `d_tag`** | Low | Unbounded text field; could store arbitrarily large values | Add `CHECK (octet_length(d_tag) <= 256)` |
| **FTS is ILIKE-based** | Medium | NIP-50 search uses `ILIKE '%term%'` — linear scan, no ranking | Implement `tsvector` with GIN index for production-grade search |
| **Replaceable state race condition** | Low | SELECT → INSERT pattern in `insert_state_or_resolve_race/2` has a TOCTOU window | Mitigated by `ON CONFLICT :nothing` + retry, but native `ON CONFLICT DO UPDATE` would be cleaner |
| **Redundant tag storage** | Info | Tags stored in both `event_tags` table and `events.tags` JSONB column | Intentional denormalisation for different access patterns; documented tradeoff |
| **No indexes on state tables** | Low | `replaceable_event_state` and `addressable_event_state` lack explicit indexes | These are likely small tables; monitor query plans under load |

### Partitioning: ✅ Excellent

- Monthly partitions (`events_YYYY_MM`, `event_tags_YYYY_MM`) with lifecycle automation
- Default partition catches edge-case timestamps
- Retention worker pre-creates 2 months ahead; caps drops at 1 partition per run
- `protected_partition?/1` prevents accidental drop of parent/default
- Detach-before-drop pattern for safe concurrent operation

### Connection Pooling: ✅ Well-Configured

- Pool: 32 connections (configurable via `POOL_SIZE`)
- Queue target: 1000ms / Queue interval: 5000ms
- Standard Ecto queue management with graceful degradation

---

## 6. Security

### 6.1 Signature Validation
**Verdict: ✅ Sound**

- **Library**: `lib_secp256k1` (NIF-based Schnorr implementation)
- **Flow**: `API.Events.publish/2` → `Protocol.validate_event/1` → `EventValidator.validate_signature/1` → `Secp256k1.schnorr_valid?/3`
- **Timing**: Validation occurs BEFORE any database write — `persist_event/2` is only called after validation passes (`api/events.ex:41-44`)
- **Hex decoding**: Properly decodes hex sig/id/pubkey to binary before crypto verification
- **Error handling**: Wrapped in rescue clause; returns `{:error, :invalid_signature}`

**Concern**: Feature flag `verify_event_signatures` can disable verification. Currently only `false` in `test.exs`. **Ensure this is never disabled in production config.** Consider adding a startup assertion.

### 6.2 SQL Injection

**Verdict: ✅ Sound — all queries parameterised**

- Only 2 `fragment()` uses in `postgres/events.ex`, both with bound parameters:
  ```elixir
  fragment("EXISTS (SELECT 1 FROM event_tags AS tag WHERE
    tag.event_created_at = ? AND tag.event_id = ? AND tag.name = ? 
AND tag.value = ANY(?))", + event.created_at, event.id, ^tag_name, type(^values, {:array, :string})) + ``` +- All dynamic query conditions built via Ecto's `dynamic()` macro with `^` bindings +- No string interpolation of user input in any query path +- No calls to `Ecto.Adapters.SQL.query/3` with raw SQL + +### 6.3 Resource Exhaustion + +**Verdict: ⚠️ Concern — payload size enforced but tag arrays unbounded** + +| Control | Status | Default | Location | +|---------|--------|---------|----------| +| Max frame size | ✅ Enforced | 1 MB | `connection.ex:135-140` (before JSON parse) | +| Max event size | ✅ Enforced | 256 KB | `api/events.ex:40` (after JSON parse, before DB) | +| Max filters per REQ | ✅ Enforced | 16 | `connection.ex` | +| Max filter limit | ✅ Enforced | 500 | `filter.ex` | +| Max subscriptions | ✅ Enforced | 32/conn | `connection.ex:1098` | +| **Max tags per event** | **🔴 Missing** | **None** | **Event with 10k tags accepted if within 256KB** | +| **Max tag values per filter** | **🔴 Missing** | **None** | **Filter with 1000 `#p` values generates expensive query** | +| **Max content field size** | **⚠️ Indirect** | **256KB overall** | **No separate limit; large content consumes storage** | +| Max negentropy sessions | ✅ Enforced | 8/conn, 10k global | `negentropy/sessions.ex` | +| Event rate limit | ✅ Enforced | 120/sec/conn | `connection.ex:1598` | + +**Remediation**: Add configurable `max_tags_per_event` and `max_tag_values_per_filter` limits. 
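A minimal sketch of how the missing limits could be enforced at validation time. `TagLimits`, the limit values, and the map shapes are assumptions for illustration — the relay's actual enforcement points would be `event_validator.ex` and `filter.ex` as noted above.

```elixir
defmodule TagLimits do
  # Hypothetical ceilings; the remediation list suggests ~2000 tags per
  # event and ~256 values per tag filter as starting points.
  @max_tags_per_event 2_000
  @max_tag_values_per_filter 256

  # Reject events whose tag array exceeds the ceiling before any DB work.
  def check_event(%{"tags" => tags}) when length(tags) > @max_tags_per_event,
    do: {:error, :too_many_tags}

  def check_event(_event), do: :ok

  # Reject filters where any "#x" tag list is pathologically wide,
  # bounding the size of the generated ANY(...) query.
  def check_filter(filter) when is_map(filter) do
    too_wide? =
      filter
      |> Enum.filter(fn {key, _values} -> String.starts_with?(key, "#") end)
      |> Enum.any?(fn {_key, values} -> length(values) > @max_tag_values_per_filter end)

    if too_wide?, do: {:error, :filter_too_broad}, else: :ok
  end
end
```

Both checks are O(n) over already-parsed JSON, so they add no meaningful latency to the accept path.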
+ +### 6.4 NIP-42 AUTH Flow + +**Verdict: ✅ Sound** + +- Challenge generated with 16 bytes of cryptographic randomness (`Auth.Challenges`) +- Challenge is connection-scoped (one per connection, tracked by PID) +- On successful AUTH: challenge is rotated via `rotate_auth_challenge/1` +- Freshness validated: `created_at` must be within `max_age_seconds` (default 600s) +- Relay URL tag must match connection's relay URL +- Process monitoring cleans up challenges when connections die + +### 6.5 NIP-98 HTTP Auth + +**Verdict: ⚠️ Concern — token replay window** + +- **Signature**: Properly verified via same `EventValidator.validate/1` path ✅ +- **URL binding**: Exact method + URL match required ✅ +- **Freshness**: 60-second window ✅ +- **🔴 Token reuse**: Signed NIP-98 events are NOT invalidated after use. Within the 60-second window, the same signed event can be replayed on the same endpoint. +- **Remediation**: Track used event IDs in a TTL-expiring ETS table or similar. Reject events with previously-seen IDs. + +### 6.6 Injection Surfaces + +**Verdict: ✅ Sound** + +- All filter tag values parameterised through Ecto +- LIKE patterns properly escaped via `escape_like_pattern/1` (`%`, `_`, `\` escaped) +- Event field validation is comprehensive: + - IDs/pubkeys: 64-char lowercase hex (32 bytes) + - Signatures: 128-char lowercase hex (64 bytes) + - Kinds: integer 0-65535 + - Tags: validated as non-empty string arrays + +### 6.7 Other Security Notes + +- **No `IO.inspect` of sensitive data**: Only 2 uses, both for config logging +- **No hardcoded secrets**: All credentials from environment variables +- **Binary frames rejected**: WebSocket handler only accepts text frames +- **Subscription ID validated**: Non-empty, max 64 chars --- -## [Medium] NIP-42 AUTH Does Not Validate created_at Freshness +## 7. 
Overall Assessment

Parrhesia is a **well-engineered Nostr relay** that demonstrates expert-level Elixir/OTP knowledge and careful attention to the Nostr protocol. The codebase is clean, layered, and maintainable.

### Strengths

- **Exemplary OTP architecture**: Zero unsafe practices; proper supervision, monitoring, and crash isolation throughout
- **Clean separation of concerns**: Transport → Protocol → API → Policy → Storage layers with no bleed-through
- **Adapter pattern for storage**: PostgreSQL and in-memory backends are interchangeable — excellent for testing
- **Comprehensive protocol support**: 14 of 18 claimed NIPs fully implemented
- **Strong input validation**: All event fields rigorously validated; SQL injection surface is nil
- **Signature-first validation**: Schnorr verification occurs before any persistence — correct ordering
- **Rich telemetry**: Queue depth, pressure, overflow, ingest/query/fanout latency all instrumented
- **Production-ready partitioning**: Monthly partitions with automated lifecycle management
- **Extensive test coverage**: Unit, integration, and E2E test suites with property-based testing
+### Weaknesses -**Recommended fix:** -Add a staleness check: reject AUTH events where `created_at` is more than N seconds in the past (e.g., 600 seconds matching NIP-42 suggestion). - -**Confidence:** High +- **No global load protection**: Per-connection limits exist but no relay-wide circuit breakers +- **Synchronous fanout**: Publisher blocks on subscription delivery — could bottleneck under high cardinality +- **NIP-50 search is ILIKE-based**: Will not scale; needs tsvector/GIN for production workloads +- **Two ghost NIPs**: NIP-66 and NIP-43 (partial) claimed but not meaningfully implemented +- **NIP-98 token replay**: 60-second replay window without event ID deduplication +- **Unbounded tag arrays**: No limit on tags per event or tag values per filter --- -## [Low] NIP-11 Missing CORS Headers +## 8. Prioritised Remediation List -**Area:** protocol correctness +### 🔴 Critical (address before production) -**Why it matters:** -NIP-11 states relays MUST accept CORS requests by sending appropriate headers. The relay info endpoint does not set CORS headers. 
+| # | Issue | Impact | Fix | +|---|-------|--------|-----| +| 1 | **Unbounded tag arrays** | Event with 10k tags causes expensive DB inserts and index bloat; filter with 1000 tag values generates pathologically expensive queries | Add `max_tags_per_event` (e.g., 2000) and `max_tag_values_per_filter` (e.g., 256) to config and enforce in `event_validator.ex` and `filter.ex` | +| 2 | **No global rate limiting** | No relay-wide cap on event throughput; many connections can overwhelm DB pool | Add global event counter (ETS atomic) with configurable ceiling; return `rate-limited:` when exceeded | +| 3 | **NIP-98 token replay** | Signed management API tokens replayable within 60s window | Track used NIP-98 event IDs in TTL-expiring store; reject duplicates in `auth/nip98.ex` | -**Evidence:** -`lib/parrhesia/web/router.ex` line 44–55: the `/relay` GET handler returns NIP-11 JSON but does not set `Access-Control-Allow-Origin`, `Access-Control-Allow-Headers`, or `Access-Control-Allow-Methods` headers. No CORS plug is configured in the router. +### ⚠️ Important (address before scale) -**Recommended fix:** -Add CORS headers to the NIP-11 response, at minimum `Access-Control-Allow-Origin: *`. 
+| # | Issue | Impact | Fix | +|---|-------|--------|-----| +| 4 | **Synchronous fanout** | Publisher blocks while iterating all matching subscriptions; latency spikes under high subscription cardinality | Decouple `fanout_event/1` into async dispatch via `Task.Supervisor` or dedicated GenServer pool | +| 5 | **No connection cap** | Unlimited WebSocket connections accepted up to OS limits | Set `max_connections` in Bandit listener options; document reverse-proxy requirement | +| 6 | **NIP-50 FTS is ILIKE** | Linear scan of content column; unacceptable at scale | Add `tsvector` column + GIN index; update query to use `@@` operator | +| 7 | **Ghost NIP-66** | Claimed but unimplemented; misleads clients querying relay info | Remove 66 from `supported_nips` in `relay_info.ex` | +| 8 | **No DB pool separation** | Writes can starve reads under heavy ingest | Consider separate Ecto repo for reads, or use Ecto's `:queue_target` tuning | +| 9 | **Missing CHECK constraints** | Binary fields (`id`, `pubkey`, `sig`) lack length constraints at DB level | Add migration with `CHECK (octet_length(id) = 32)` etc. 
|

### 💡 Nice to Have

| # | Issue | Impact | Fix |
|---|-------|--------|-----|
| 10 | **NIP-43 partial** | Join/leave mechanics unimplemented | Document as partial or implement kinds 28934-28936 |
| 11 | **Signature verification feature flag** | Could be disabled in production config | Add startup assertion or compile-time check |
| 12 | **Mailbox depth monitoring** | No visibility into subscription process queue sizes | Add periodic `Process.info/2` telemetry for dynamic children |
| 13 | **Graceful shutdown drain** | No flush of queued events before connection termination | Add drain timer in listener reload/shutdown path |
| 14 | **Replaceable state race** | SELECT→INSERT TOCTOU in state tables | Refactor to native `ON CONFLICT DO UPDATE` |
| 15 | **Connection handler size** | `web/connection.ex` is large | Extract message handlers into sub-modules for maintainability |
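The `tsvector` remediation (item 6, and the FTS issue flagged in section 5) could take this shape — a sketch assuming the `events.content` column described earlier, with names illustrative:

```sql
-- Generated column keeps the tsvector in sync with content automatically;
-- the two-argument to_tsvector form is immutable, as generated columns require.
ALTER TABLE events
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED;

CREATE INDEX events_content_tsv_idx ON events USING GIN (content_tsv);

-- Query side, replacing ILIKE '%term%':
-- SELECT ... WHERE content_tsv @@ plainto_tsquery('simple', 'term');
```

This trades some write amplification and storage for indexed search with ranking support, and removes the sequential scan entirely.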
- ---- - -# Database and Schema Review - -**Strengths:** -- Range partitioning on `events.created_at` — good for time-based queries and partition pruning. -- Composite primary key `(created_at, id)` enables partition pruning on most queries. -- `event_ids` table for deduplication with `ON CONFLICT :nothing` — clean idempotency. -- State tables for replaceable/addressable events — correct approach with proper upsert/retire logic. -- Partial indexes on `expires_at` and `deleted_at` — avoids indexing NULLs. -- FK cascade from `event_tags` to `events` — ensures tag cleanup on delete. - -**Weaknesses:** -1. **No unique index on `events.id`** — only a non-unique index. Two events with the same ID but different `created_at` could theoretically exist (the `event_ids` table prevents this at the application level, but there's no DB-level constraint on the events table). -2. **`event_tags` stores only name+value** — data loss for multi-element tags (Critical finding above). -3. **No `content` index for search** — ILIKE without index = sequential scan. -4. **`events.kind` is `integer` (4 bytes)** — NIP-01 allows kinds 0–65535, so `smallint` (2 bytes) would suffice and save space. -5. **No retention/partitioning strategy documented** — the default partition catches everything. No automated partition creation or cleanup. -6. **`d_tag` column in events table** — redundant with tag storage (but useful for addressable event queries). Not indexed, so no direct benefit. The addressable_event_state table handles this. -7. **No index on `events (id, created_at)` for deletion queries** — `delete_by_request` queries by `id` and `pubkey` but the `id` index doesn't include `pubkey`. - -**Missing DB-Level Invariants:** -- Events table should have a unique constraint on `id` (across partitions, which is tricky with range partitioning — the `event_ids` table compensates). -- No CHECK constraint on `kind >= 0`. -- No CHECK constraint on `created_at >= 0`. 
- ---- - -# Test Review - -**Well-Covered Areas:** -- Protocol encode/decode (`protocol_test.exs`) -- Filter validation and matching, including property-based tests (`filter_test.exs`, `filter_property_test.exs`) -- Event validation including MARMOT-specific kinds (`event_validator_marmot_test.exs`) -- Policy enforcement (`event_policy_test.exs`) -- Storage adapter contract compliance (`adapter_contract_test.exs`, `behaviour_contracts_test.exs`) -- PostgreSQL event lifecycle (put, query, delete, replace) (`events_lifecycle_test.exs`) -- WebSocket connection lifecycle (`connection_test.exs`) -- Auth challenges (`challenges_test.exs`) -- NIP-98 HTTP auth (`nip98_test.exs`) -- Fault injection (`fault_injection_test.exs`) -- Query plan regression (`query_plan_regression_test.exs`) — excellent practice - -**Missing Critical Tests:** -1. **No signature verification tests** (because the feature doesn't exist). -2. **No test for tag data integrity** — round-trip test that verifies events with multi-element tags are returned unchanged. -3. **No ephemeral event test** — verifying kind 20000+ events are not persisted. -4. **No NIP-09 `a` tag deletion test**. -5. **No adversarial input tests** — LIKE injection patterns, oversized payloads, events with extreme tag counts. -6. **No concurrent write tests** — multiple processes writing the same replaceable event simultaneously. -7. **No AUTH relay tag validation test**. -8. **No test for expired event rejection on publish** (NIP-40). - -**5 Most Valuable Tests to Add:** -1. Round-trip tag integrity: publish event with multi-element tags, query back, verify tags are identical. -2. Signature verification: publish event with wrong signature, verify rejection. -3. Concurrent replaceable event upsert: 10 processes writing same pubkey+kind, verify only one winner. -4. Adversarial search pattern: verify ILIKE with `%` metacharacters doesn't cause excessive query time. -5. 
---

*End of audit. Generated 2026-03-17.*