# Hardening Review: Parrhesia Nostr Relay

You are a security engineer specialising in real-time WebSocket servers, Erlang/OTP systems, and protocol-level abuse. You are reviewing **Parrhesia**, a Nostr relay (NIP-01 compliant) written in Elixir, for hardening opportunities — with a primary focus on **denial-of-service resilience** and a secondary focus on the full attack surface.

Produce a prioritised list of **specific, actionable recommendations** with rationale. For each recommendation, state:

1. The attack or failure mode it mitigates
2. Suggested implementation (config change, code change, or architectural change)
3. Severity estimate (critical / high / medium / low)

---

## 1. Architecture Overview

| Component | Technology | Notes |
|---|---|---|
| Runtime | Elixir/OTP 27, BEAM VM | Each WS connection is a separate process |
| HTTP server | Bandit (pure Elixir) | HTTP/1.1 only, no HTTP/2 |
| WebSocket | `websock_adapter` | Text frames only; binary rejected |
| Database | PostgreSQL via Ecto | Range-partitioned `events` table by `created_at` |
| Caching | ETS | Config snapshot + moderation ban/allow lists |
| Multi-node | Erlang `:pg` groups | Fanout across BEAM cluster nodes |
| Metrics | Prometheus (Telemetry) | `/metrics` endpoint |
| TLS termination | Reverse proxy (nginx/Caddy) | **Out of scope** for this review |

### Supervision Tree

```
Parrhesia.Supervisor
├─ Telemetry (Prometheus exporter)
├─ Config (ETS snapshot of runtime config)
├─ Storage.Supervisor (Ecto repo + moderation cache)
├─ Subscriptions.Supervisor (ETS subscription index for fanout)
├─ Auth.Supervisor (NIP-42 challenge GenServer)
├─ Policy.Supervisor (policy enforcement)
├─ Web.Endpoint (Bandit listener)
└─ Tasks.Supervisor (ExpirationWorker, 30s GC loop)
```

### Data Flow

1. Client connects via WebSocket at `/relay`
2. NIP-42 AUTH challenge issued immediately (16-byte random, base64url)
3. Inbound text frames are: size-checked → JSON-decoded → rate-limited → protocol-dispatched
4. 
EVENT messages: validated → policy-checked → stored in Postgres → ACK → async fanout to matching subscriptions
5. REQ messages: filters validated → Postgres query → results streamed → EOSE → live subscription registered
6. Fanout: post-ingest, subscription index (ETS) is traversed; matching connection processes receive events via `send/2`

---

## 2. Current Defences Inventory

### Connection Layer

| Defence | Value | Enforcement Point |
|---|---|---|
| Max WebSocket frame size | **1,048,576 bytes (1 MiB)** | Checked in `handle_in` *before* JSON decode, and at Bandit upgrade (`max_frame_size`) |
| WebSocket upgrade timeout | **60,000 ms** | Passed to `WebSockAdapter.upgrade` |
| Binary frame rejection | Returns NOTICE, connection stays open | `handle_in` opcode check |
| Outbound queue limit | **256 events** per connection | Overflow strategy: **`:close`** (WS 1008) |
| Outbound drain batch | **64 events** | Async drain via `send(self(), :drain_outbound_queue)` |
| Outbound pressure telemetry | Threshold at **75%** of queue | Emits telemetry event only, no enforcement |
| IP blocking | Via moderation cache (ETS) | Management API can add blocked IPs |

### Protocol Layer

| Defence | Value | Notes |
|---|---|---|
| Max event JSON size | **262,144 bytes (256 KiB)** | Re-serialises decoded event and checks byte size |
| Max filters per REQ | **16** | Rejected at filter validation |
| Max filter `limit` | **500** | `min(client_limit, 500)` applied at query time |
| Max subscriptions per connection | **32** | Existing sub IDs updated without counting toward limit |
| Subscription ID max length | **64 characters** | Must be non-empty |
| Event kind range | **0–65,535** | Integer range check |
| Max future event skew | **900 seconds (15 min)** | Events with `created_at > now + 900` rejected |
| Unknown filter keys | **Rejected** | Allowed: `ids`, `authors`, `kinds`, `since`, `until`, `limit`, `search`, `#` |

### Event Validation Pipeline

Strict order:

1. 
Required fields present (`id`, `pubkey`, `created_at`, `kind`, `tags`, `content`, `sig`)
2. `id` — 64-char lowercase hex
3. `pubkey` — 64-char lowercase hex
4. `created_at` — non-negative integer, max 900s future skew
5. `kind` — integer in [0, 65535]
6. `tags` — list of non-empty string arrays (**no length limit on tags array or individual tag values**)
7. `content` — any binary string
8. `sig` — 128-char lowercase hex
9. ID hash recomputation and comparison
10. Schnorr signature verification via `lib_secp256k1` (gated by `verify_event_signatures` flag, default `true`)

### Rate Limiting

| Defence | Value | Notes |
|---|---|---|
| Event ingest rate | **120 events per window** | Per-connection sliding window |
| Ingest window | **1 second** | Resets on first event after expiry |
| No per-IP connection rate limiting | — | Must be handled at reverse proxy |
| No global connection count ceiling | — | BEAM handles thousands but no configured limit |

### Authentication (NIP-42)

- Challenge issued to **all** connections on connect (optional escalation model)
- AUTH event must: pass full NIP-01 validation, be kind `22242`, contain matching `challenge` tag, contain matching `relay` tag
- `created_at` freshness: must be `>= now - 600s` (10 min)
- On success: pubkey added to `authenticated_pubkeys` MapSet; challenge rotated
- Supports multiple authenticated pubkeys per connection

### Authentication (NIP-98 HTTP)

- Management endpoint (`POST /management`) requires NIP-98 header
- Auth event must be kind `27235`, `created_at` within **60 seconds** of now
- Must include `method` and `u` tags matching request exactly

### Access Control

- `auth_required_for_writes`: default **false** (configurable)
- `auth_required_for_reads`: default **false** (configurable)
- Protected events (NIP-70, tagged `["-"]`): require auth + pubkey match
- Giftwrap (kind 1059): unauthenticated REQ → CLOSED; authenticated REQ must include `#p` containing own pubkey

### Database

- All queries use Ecto
parameterised bindings — no raw string interpolation
- LIKE search patterns escaped (`%`, `_`, `\` characters)
- Deletion enforces `pubkey == deleter_pubkey` in WHERE clause
- Soft-delete via `deleted_at`; hard-delete only via vanish (NIP-62) or expiration purge
- DB pool: **32 connections** (prod), queue target 1s, interval 5s

### Moderation

- Banned pubkeys, allowed pubkeys, banned events, blocked IPs stored in ETS cache
- Management API (NIP-98 authed) for CRUD on moderation lists
- Cache invalidated atomically on writes

---

## 3. Known Gaps and Areas of Concern

The following are areas where the current implementation may be vulnerable or where defences could be strengthened. **Please evaluate each and provide recommendations.**

### 3.1 Connection Exhaustion

- There is **no global limit on concurrent WebSocket connections**. Each connection is an Elixir process (~2–3 KiB base), but subscriptions, auth state, and outbound queues add per-connection memory.
- There is **no per-IP connection rate limiting at the application layer**. IP blocking exists but is reactive (management API), not automatic.
- There is **no idle timeout** after the WebSocket upgrade completes. A connection can remain open indefinitely without sending or receiving messages.

**Questions:**

- What connection limits should be configured at the Bandit/BEAM level?
- Should an idle timeout be implemented? If so, what value balances real-time subscription use against resource waste?
- Should per-IP connection counting be implemented at the application layer, or is this strictly a reverse proxy concern?

### 3.2 Subscription Abuse

- A single connection can hold **32 subscriptions**, each with up to **16 filters**. That's 512 filter predicates per connection being evaluated on every fanout.
- Filter arrays (`ids`, `authors`, `kinds`, tag values) have **no element count limits**. A filter could contain thousands of author pubkeys.
- There is no cost accounting for "expensive" subscriptions (e.g., wide-open filters matching all events).

**Questions:**

- Should filter array element counts be bounded? If so, what limits per field?
- Should there be a per-connection "filter complexity" budget?
- How expensive is the current ETS subscription index traversal at scale (e.g., 10K concurrent connections × 32 subs each)?

### 3.3 Tag Array Size

- Event validation does **not limit the number of tags** or the length of individual tag values beyond the 256 KiB total event size cap.
- A maximally-tagged event could contain thousands of short tags, causing amplification in `event_tags` table inserts (one row per tag).

**Questions:**

- Should a max tag count be enforced? What is a reasonable limit?
- What is the insert cost of storing e.g. 1,000 tags per event? Could this be used for write amplification?
- Should individual tag value lengths be bounded?

### 3.4 AUTH Timing

- AUTH event `created_at` freshness only checks the **lower bound** (`>= now - 600`). An AUTH event with `created_at` far in the future passes validation.
- Regular events have a future skew cap of 900s, but AUTH events do not.

**Questions:**

- Should AUTH events also enforce a future `created_at` bound?
- Is a 600-second AUTH window too wide? Could it be reduced?

### 3.5 Outbound Amplification

- A single inbound EVENT can fan out to an unbounded number of matching subscriptions across all connections.
- The outbound queue (256 events, `:close` strategy) protects individual connections but does not limit total fanout work per event.
- The fanout traverses the ETS subscription index synchronously in the ingesting connection's process.

**Questions:**

- Should fanout be bounded per event (e.g., max N recipients before yielding)?
- Should fanout happen in a separate process pool rather than inline?
- Is the `:close` overflow strategy optimal, or would `:drop_oldest` be better for well-behaved clients with temporary backpressure?
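For the §3.4 question, a symmetric freshness check is small enough to sketch. This is an illustrative fragment, not Parrhesia's actual code: `AuthFreshness` and its function name are hypothetical, the 600s lower bound is the documented AUTH window, and the 900s upper bound is borrowed from the relay's regular-event skew cap as one plausible choice.

```elixir
# Hedged sketch: bound AUTH `created_at` in both directions.
# Module/function names are hypothetical; bounds come from the review text
# (600s documented AUTH window, 900s regular-event future skew).
defmodule AuthFreshness do
  @auth_past_skew 600
  @auth_future_skew 900

  # True when `created_at` lies within [now - 600, now + 900].
  def fresh?(created_at, now \\ System.system_time(:second)) do
    created_at >= now - @auth_past_skew and created_at <= now + @auth_future_skew
  end
end
```

With such a check, an AUTH event dated far in the future would be rejected just like a far-future regular event, closing the asymmetry noted above.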
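For the last §3.5 question, the `:drop_oldest` alternative is easy to prototype. The sketch below assumes the outbound queue is held as an Erlang `:queue` in connection state; `OutboundQueue` and `@max_outbound` are hypothetical names (the documented limit is 256), not the relay's real internals.

```elixir
# Hedged sketch: `:drop_oldest` overflow strategy for the per-connection
# outbound queue, instead of the current `:close` (WS 1008) behaviour.
defmodule OutboundQueue do
  @max_outbound 256

  # Enqueue an event; when the queue is full, evict the oldest entry
  # rather than closing the connection.
  def enqueue(queue, event) do
    queue = :queue.in(event, queue)

    if :queue.len(queue) > @max_outbound do
      {{:value, _dropped}, queue} = :queue.out(queue)
      {:dropped_oldest, queue}
    else
      {:ok, queue}
    end
  end
end
```

The trade-off to evaluate is delivery semantics: `:drop_oldest` keeps slow-but-honest clients connected through transient backpressure, at the cost of silently losing events they were subscribed to.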
### 3.6 Query Amplification

- A single REQ with 16 filters, each with `limit: 500`, could trigger 16 separate Postgres queries returning up to 8,000 events total.
- COUNT requests also execute per-filter queries (now deduplicated via UNION ALL).
- `search` filters use `ILIKE %pattern%` which cannot use B-tree indexes.

**Questions:**

- Should there be a per-REQ total result cap (across all filters)?
- Should `search` queries be rate-limited or require a minimum pattern length?
- Should COUNT be disabled or rate-limited separately?
- Are there missing indexes that would help common query patterns?

### 3.7 Multi-Node Trust

- Events received via `:remote_fanout_event` from peer BEAM nodes **skip all validation and policy checks** and go directly to the subscription index.
- This assumes all cluster peers are trusted.

**Questions:**

- If cluster membership is dynamic or spans trust boundaries, should remote events be re-validated?
- Should there be a shared secret or HMAC on inter-node messages?

### 3.8 Metrics Endpoint

- `/metrics` (Prometheus) is **unauthenticated**.
- Exposes internal telemetry: connection counts, event throughput, queue depths, database timing.

**Questions:**

- Should `/metrics` require authentication or be restricted to internal networks?
- Could metrics data be used to profile the relay's capacity and craft targeted attacks?

### 3.9 Negentropy Stub

- NEG-OPEN, NEG-MSG, NEG-CLOSE messages are accepted and acknowledged but the reconciliation logic is a stub (cursor counter only).
- Are there resource implications of accepting negentropy sessions without real implementation?

### 3.10 Event Re-Serialisation Cost

- To enforce the 256 KiB event size limit, the relay calls `JSON.encode!(event)` on the already-decoded event map. This re-serialisation happens on every inbound EVENT.
- Could this be replaced with a byte-length check on the raw frame payload (already available)?

---

## 4. Specific Review Requests

Beyond the gaps above, please also evaluate:

1. **Bandit configuration**: Are there Bandit-level options (max connections, header limits, request timeouts, keepalive settings) that should be tuned for a public-facing relay?
2. **BEAM VM flags**: Are there any Erlang VM flags (`+P`, `+Q`, `+S`, memory limits) that should be set for production hardening?
3. **Ecto pool exhaustion**: With 32 DB connections and potentially thousands of concurrent REQ queries, what happens under pool exhaustion? Is the 1s queue target + 5s interval appropriate?
4. **ETS table sizing**: The subscription index and moderation cache use ETS. Are there memory limits or table options (`read_concurrency`, `write_concurrency`, `compressed`) that should be tuned?
5. **Process mailbox overflow**: Connection processes receive events via `send/2` during fanout. If a process is slow to consume, its mailbox grows. The outbound queue mechanism is application-level — but is the BEAM-level mailbox also protected?
6. **Reverse proxy recommendations**: What nginx/Caddy configuration should complement the relay's defences? (Rate limiting, connection limits, WebSocket-specific settings, request body size.)
7. **Monitoring and alerting**: What telemetry signals should trigger alerts? (Connection count spikes, queue overflow rates, DB pool saturation, error rates.)

---

## 5. Out of Scope

The following are **not** in scope for this review:

- TLS configuration (handled by reverse proxy)
- DNS and network-level DDoS mitigation
- Operating system hardening
- Key management for the relay identity
- Client-side security
- Nostr protocol design flaws (we implement the spec as-is)

---

## 6. Response Format

For each recommendation, use this format:

### [Severity] Title

**Attack/failure mode:** What goes wrong without this mitigation.

**Current state:** What exists today (or doesn't).

**Recommendation:** Specific change — config value, code change, or architectural decision.

**Trade-offs:** Any impact on legitimate users or operational complexity.
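
As a worked illustration of the format (not a prescribed finding — severity and values are placeholders for the reviewer to refine), here is an entry based on the idle-timeout gap in §3.1:

### [High] Enforce a post-upgrade idle timeout

**Attack/failure mode:** Connections held open indefinitely without traffic accumulate per-connection memory (process, subscriptions, auth state, outbound queue) at near-zero cost to an attacker.

**Current state:** The 60,000 ms timeout applies only to the WebSocket upgrade; no idle timeout exists after the upgrade completes (§3.1).

**Recommendation:** Close connections that neither send nor receive frames for a configurable period, ideally preceded by a ping/pong probe so quiet-but-live clients survive.

**Trade-offs:** Long-lived subscribers with sparse traffic must answer pings or reconnect; too aggressive a value evicts legitimate real-time subscriptions.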