# Hardening Review: Parrhesia Nostr Relay
You are a security engineer specialising in real-time WebSocket servers, Erlang/OTP systems, and protocol-level abuse. You are reviewing **Parrhesia**, a Nostr relay (NIP-01 compliant) written in Elixir, for hardening opportunities — with a primary focus on **denial-of-service resilience** and a secondary focus on the full attack surface.

Produce a prioritised list of **specific, actionable recommendations** with rationale. For each recommendation, state:

1. The attack or failure mode it mitigates
2. Suggested implementation (config change, code change, or architectural change)
3. Severity estimate (critical / high / medium / low)

---
## 1. Architecture Overview

| Component | Technology | Notes |
|---|---|---|
| Runtime | Elixir/OTP 27, BEAM VM | Each WS connection is a separate process |
| HTTP server | Bandit (pure Elixir) | HTTP/1.1 only, no HTTP/2 |
| WebSocket | `websock_adapter` | Text frames only; binary rejected |
| Database | PostgreSQL via Ecto | Range-partitioned `events` table by `created_at` |
| Caching | ETS | Config snapshot + moderation ban/allow lists |
| Multi-node | Erlang `:pg` groups | Fanout across BEAM cluster nodes |
| Metrics | Prometheus (Telemetry) | `/metrics` endpoint |
| TLS termination | Reverse proxy (nginx/Caddy) | **Out of scope** for this review |
### Supervision Tree

```
Parrhesia.Supervisor
├─ Telemetry (Prometheus exporter)
├─ Config (ETS snapshot of runtime config)
├─ Storage.Supervisor (Ecto repo + moderation cache)
├─ Subscriptions.Supervisor (ETS subscription index for fanout)
├─ Auth.Supervisor (NIP-42 challenge GenServer)
├─ Policy.Supervisor (policy enforcement)
├─ Web.Endpoint (Bandit listener)
└─ Tasks.Supervisor (ExpirationWorker, 30s GC loop)
```
### Data Flow

1. Client connects via WebSocket at `/relay`
2. NIP-42 AUTH challenge issued immediately (16-byte random, base64url)
3. Inbound text frames are: size-checked → JSON-decoded → rate-limited → protocol-dispatched
4. EVENT messages: validated → policy-checked → stored in Postgres → ACK → async fanout to matching subscriptions
5. REQ messages: filters validated → Postgres query → results streamed → EOSE → live subscription registered
6. Fanout: post-ingest, subscription index (ETS) is traversed; matching connection processes receive events via `send/2`
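The per-frame pipeline in step 3 can be sketched as a single `with` chain. This is an illustrative reconstruction: the helper names (`check_size/2`, `check_rate/1`, `dispatch/2`, `notice/1`) and state fields are assumptions, not the actual Parrhesia API.

```elixir
# Hypothetical sketch of the step-3 inbound pipeline (WebSock handle_in).
# Helper names and state fields are illustrative, not the real codebase.
def handle_in({frame, [opcode: :text]}, state) do
  with :ok <- check_size(frame, state.max_frame_bytes),
       {:ok, message} <- JSON.decode(frame),
       {:ok, state} <- check_rate(state),
       {:ok, reply, state} <- dispatch(message, state) do
    {:reply, :ok, {:text, reply}, state}
  else
    {:error, reason} -> {:reply, :ok, {:text, notice(reason)}, state}
  end
end
```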
---
## 2. Current Defences Inventory

### Connection Layer

| Defence | Value | Enforcement Point |
|---|---|---|
| Max WebSocket frame size | **1,048,576 bytes (1 MiB)** | Checked in `handle_in` *before* JSON decode, and at Bandit upgrade (`max_frame_size`) |
| WebSocket upgrade timeout | **60,000 ms** | Passed to `WebSockAdapter.upgrade` |
| Binary frame rejection | Returns NOTICE, connection stays open | `handle_in` opcode check |
| Outbound queue limit | **256 events** per connection | Overflow strategy: **`:close`** (WS 1008) |
| Outbound drain batch | **64 events** | Async drain via `send(self(), :drain_outbound_queue)` |
| Outbound pressure telemetry | Threshold at **75%** of queue | Emits telemetry event only, no enforcement |
| IP blocking | Via moderation cache (ETS) | Management API can add blocked IPs |
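As a reading aid, the queue-limit and drain rows above could correspond to an enqueue path along these lines. This is a sketch under an assumed state shape and an assumed WebSock-style close tuple, not the relay's actual code:

```elixir
# Sketch only: field names, return shapes, and the notice text are assumptions.
@max_outbound_queue 256

defp enqueue_outbound(state, event) do
  if state.queue_len >= @max_outbound_queue do
    # Overflow strategy :close — drop the slow consumer with WS code 1008.
    {:stop, :normal, {1008, "outbound queue overflow"}, state}
  else
    send(self(), :drain_outbound_queue)
    {:ok, %{state | queue: :queue.in(event, state.queue), queue_len: state.queue_len + 1}}
  end
end
```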
### Protocol Layer

| Defence | Value | Notes |
|---|---|---|
| Max event JSON size | **262,144 bytes (256 KiB)** | Re-serialises decoded event and checks byte size |
| Max filters per REQ | **16** | Rejected at filter validation |
| Max filter `limit` | **500** | `min(client_limit, 500)` applied at query time |
| Max subscriptions per connection | **32** | Existing sub IDs updated without counting toward limit |
| Subscription ID max length | **64 characters** | Must be non-empty |
| Event kind range | **0–65,535** | Integer range check |
| Max future event skew | **900 seconds (15 min)** | Events with `created_at > now + 900` rejected |
| Unknown filter keys | **Rejected** | Allowed: `ids`, `authors`, `kinds`, `since`, `until`, `limit`, `search`, `#<letter>` |
### Event Validation Pipeline

Strict order:

1. Required fields present (`id`, `pubkey`, `created_at`, `kind`, `tags`, `content`, `sig`)
2. `id` — 64-char lowercase hex
3. `pubkey` — 64-char lowercase hex
4. `created_at` — non-negative integer, max 900s future skew
5. `kind` — integer in [0, 65535]
6. `tags` — list of non-empty string arrays (**no length limit on tags array or individual tag values**)
7. `content` — any binary string
8. `sig` — 128-char lowercase hex
9. ID hash recomputation and comparison
10. Schnorr signature verification via `lib_secp256k1` (gated by `verify_event_signatures` flag, default `true`)
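The ten steps above suggest a `with` chain in the validator. The following condensed sketch mirrors the documented order only; the helper names (`hex64/1`, `check_id_hash/1`, and so on) are assumptions, not the actual module API:

```elixir
# Condensed sketch of the strict validation order; each helper returns
# :ok or {:error, reason}. All helper names are illustrative.
def validate(event) do
  with :ok <- require_fields(event),
       :ok <- hex64(event["id"]),
       :ok <- hex64(event["pubkey"]),
       :ok <- check_created_at(event["created_at"], max_future_skew: 900),
       :ok <- check_kind(event["kind"]),
       :ok <- check_tags(event["tags"]),
       :ok <- check_content(event["content"]),
       :ok <- hex128(event["sig"]),
       :ok <- check_id_hash(event),
       do: verify_signature(event)
end
```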
### Rate Limiting

| Defence | Value | Notes |
|---|---|---|
| Event ingest rate | **120 events per window** | Per-connection sliding window |
| Ingest window | **1 second** | Resets on first event after expiry |
| No per-IP connection rate limiting | — | Must be handled at reverse proxy |
| No global connection count ceiling | — | BEAM handles thousands but no configured limit |
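The two ingest rows above describe a window that restarts on the first event after expiry, which can be sketched as follows (state field names are assumptions):

```elixir
# Sketch of the per-connection ingest limiter: 120 events per 1-second
# window, restarting the window on the first event after expiry.
defp check_rate(%{window_start: start, count: count} = state) do
  now = System.monotonic_time(:millisecond)

  cond do
    now - start >= 1_000 ->
      # Window expired: this event starts a fresh window.
      {:ok, %{state | window_start: now, count: 1}}

    count < 120 ->
      {:ok, %{state | count: count + 1}}

    true ->
      {:error, :rate_limited}
  end
end
```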
### Authentication (NIP-42)

- Challenge issued to **all** connections on connect (optional escalation model)
- AUTH event must: pass full NIP-01 validation, be kind `22242`, contain matching `challenge` tag, contain matching `relay` tag
- `created_at` freshness: must be `>= now - 600s` (10 min)
- On success: pubkey added to `authenticated_pubkeys` MapSet; challenge rotated
- Supports multiple authenticated pubkeys per connection
### Authentication (NIP-98 HTTP)

- Management endpoint (`POST /management`) requires NIP-98 header
- Auth event must be kind `27235`, `created_at` within **60 seconds** of now
- Must include `method` and `u` tags matching request exactly
### Access Control

- `auth_required_for_writes`: default **false** (configurable)
- `auth_required_for_reads`: default **false** (configurable)
- Protected events (NIP-70, tagged `["-"]`): require auth + pubkey match
- Giftwrap (kind 1059): unauthenticated REQ → CLOSED; authenticated REQ must include `#p` containing own pubkey
### Database

- All queries use Ecto parameterised bindings — no raw string interpolation
- LIKE search patterns escaped (`%`, `_`, `\` characters)
- Deletion enforces `pubkey == deleter_pubkey` in WHERE clause
- Soft-delete via `deleted_at`; hard-delete only via vanish (NIP-62) or expiration purge
- DB pool: **32 connections** (prod), queue target 1s, interval 5s
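The LIKE escaping noted above amounts to prefixing the three metacharacters with a backslash before the term is embedded in a parameterised pattern. A minimal sketch (the helper name is assumed):

```elixir
# Sketch: escape LIKE/ILIKE metacharacters before building a "%term%" pattern.
# The backslash must be handled first so the later replacements aren't doubled.
defp escape_like(term) do
  term
  |> String.replace("\\", "\\\\")
  |> String.replace("%", "\\%")
  |> String.replace("_", "\\_")
end
```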
### Moderation

- Banned pubkeys, allowed pubkeys, banned events, blocked IPs stored in ETS cache
- Management API (NIP-98 authed) for CRUD on moderation lists
- Cache invalidated atomically on writes
---
## 3. Known Gaps and Areas of Concern

The following are areas where the current implementation may be vulnerable or where defences could be strengthened. **Please evaluate each and provide recommendations.**
### 3.1 Connection Exhaustion

- There is **no global limit on concurrent WebSocket connections**. Each connection is an Elixir process (~2–3 KiB base), but subscriptions, auth state, and outbound queues add per-connection memory.
- There is **no per-IP connection rate limiting at the application layer**. IP blocking exists but is reactive (management API), not automatic.
- There is **no idle timeout** after the WebSocket upgrade completes. A connection can remain open indefinitely without sending or receiving messages.

**Questions:**

- What connection limits should be configured at the Bandit/BEAM level?
- Should an idle timeout be implemented? If so, what value balances real-time subscription use against resource waste?
- Should per-IP connection counting be implemented at the application layer, or is this strictly a reverse proxy concern?
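If per-IP counting were done at the application layer, one lightweight option is an ETS counter keyed by peer IP, checked in the upgrade path. Everything below (table name, limit, function names) is a hypothetical sketch, not existing Parrhesia code:

```elixir
# Hypothetical per-IP connection counter. Assumes the table was created at
# startup with :ets.new(@table, [:named_table, :public, write_concurrency: true]).
@table :conn_counts_by_ip
@max_per_ip 20

def register(ip) do
  # Atomically increment, inserting {ip, 0} if the key is absent.
  case :ets.update_counter(@table, ip, {2, 1}, {ip, 0}) do
    n when n > @max_per_ip ->
      :ets.update_counter(@table, ip, {2, -1})
      {:error, :too_many_connections}

    _n ->
      :ok
  end
end

# Decrement on disconnect, clamping at zero via the threshold form.
def unregister(ip), do: :ets.update_counter(@table, ip, {2, -1, 0, 0})
```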
### 3.2 Subscription Abuse

- A single connection can hold **32 subscriptions**, each with up to **16 filters**. That's 512 filter predicates per connection being evaluated on every fanout.
- Filter arrays (`ids`, `authors`, `kinds`, tag values) have **no element count limits**. A filter could contain thousands of author pubkeys.
- There is no cost accounting for "expensive" subscriptions (e.g., wide open filters matching all events).

**Questions:**

- Should filter array element counts be bounded? If so, what limits per field?
- Should there be a per-connection "filter complexity" budget?
- How expensive is the current ETS subscription index traversal at scale (e.g., 10K concurrent connections × 32 subs each)?
### 3.3 Tag Array Size

- Event validation does **not limit the number of tags** or the length of individual tag values beyond the 256 KiB total event size cap.
- A maximally-tagged event could contain thousands of short tags, causing amplification in `event_tags` table inserts (one row per tag).

**Questions:**

- Should a max tag count be enforced? What is a reasonable limit?
- What is the insert cost of storing e.g. 1,000 tags per event? Could this be used for write amplification?
- Should individual tag value lengths be bounded?
### 3.4 AUTH Timing

- AUTH event `created_at` freshness only checks the **lower bound** (`>= now - 600`). An AUTH event with `created_at` far in the future passes validation.
- Regular events have a future skew cap of 900s, but AUTH events do not.

**Questions:**

- Should AUTH events also enforce a future `created_at` bound?
- Is a 600-second AUTH window too wide? Could it be reduced?
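A fix for the missing upper bound could mirror the 900-second future skew already applied to regular events. A sketch, with illustrative window values and an assumed helper name:

```elixir
# Sketch: bound AUTH `created_at` on both sides. The 600s lower bound is
# the current window; the 900s upper bound mirrors the event skew cap.
defp auth_created_at_fresh?(created_at, now) do
  created_at >= now - 600 and created_at <= now + 900
end
```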
### 3.5 Outbound Amplification

- A single inbound EVENT can fan out to an unbounded number of matching subscriptions across all connections.
- The outbound queue (256 events, `:close` strategy) protects individual connections but does not limit total fanout work per event.
- The fanout traverses the ETS subscription index synchronously in the ingesting connection's process.

**Questions:**

- Should fanout be bounded per event (e.g., max N recipients before yielding)?
- Should fanout happen in a separate process pool rather than inline?
- Is the `:close` overflow strategy optimal, or would `:drop_oldest` be better for well-behaved clients with temporary backpressure?
### 3.6 Query Amplification

- A single REQ with 16 filters, each with `limit: 500`, could trigger 16 separate Postgres queries returning up to 8,000 events total.
- COUNT requests also execute per-filter queries (now deduplicated via UNION ALL).
- `search` filters use `ILIKE %pattern%` which cannot use B-tree indexes.

**Questions:**

- Should there be a per-REQ total result cap (across all filters)?
- Should `search` queries be rate-limited or require a minimum pattern length?
- Should COUNT be disabled or rate-limited separately?
- Are there missing indexes that would help common query patterns?
### 3.7 Multi-Node Trust

- Events received via `:remote_fanout_event` from peer BEAM nodes **skip all validation and policy checks** and go directly to the subscription index.
- This assumes all cluster peers are trusted.

**Questions:**

- If cluster membership is dynamic or spans trust boundaries, should remote events be re-validated?
- Should there be a shared secret or HMAC on inter-node messages?
### 3.8 Metrics Endpoint

- `/metrics` (Prometheus) is **unauthenticated**.
- Exposes internal telemetry: connection counts, event throughput, queue depths, database timing.

**Questions:**

- Should `/metrics` require authentication or be restricted to internal networks?
- Could metrics data be used to profile the relay's capacity and craft targeted attacks?
### 3.9 Negentropy Stub

- NEG-OPEN, NEG-MSG, NEG-CLOSE messages are accepted and acknowledged, but the reconciliation logic is a stub (cursor counter only).

**Questions:**

- Are there resource implications of accepting negentropy sessions without a real implementation?
### 3.10 Event Re-Serialisation Cost

- To enforce the 256 KiB event size limit, the relay calls `JSON.encode!(event)` on the already-decoded event map. This re-serialisation happens on every inbound EVENT.

**Questions:**

- Could this be replaced with a byte-length check on the raw frame payload (already available)?
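Since the raw text frame is strictly larger than any event object it contains, a guard on the frame bytes would reject oversized EVENTs before decoding. This is a sketch only; the limit name, the envelope allowance, and `notice/1` are assumptions:

```elixir
# Sketch: the ["EVENT", {...}] envelope adds a small constant overhead, so
# the raw frame length is a cheap upper-bound pre-filter. An exact
# post-decode check could remain for borderline cases.
@max_event_bytes 262_144

def handle_in({frame, [opcode: :text]}, state)
    when byte_size(frame) > @max_event_bytes + 32 do
  {:reply, :ok, {:text, notice("invalid: message too large")}, state}
end
```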
---
## 4. Specific Review Requests

Beyond the gaps above, please also evaluate:

1. **Bandit configuration**: Are there Bandit-level options (max connections, header limits, request timeouts, keepalive settings) that should be tuned for a public-facing relay?
2. **BEAM VM flags**: Are there any Erlang VM flags (`+P`, `+Q`, `+S`, memory limits) that should be set for production hardening?
3. **Ecto pool exhaustion**: With 32 DB connections and potentially thousands of concurrent REQ queries, what happens under pool exhaustion? Is the 1s queue target + 5s interval appropriate?
4. **ETS table sizing**: The subscription index and moderation cache use ETS. Are there memory limits or table options (`read_concurrency`, `write_concurrency`, `compressed`) that should be tuned?
5. **Process mailbox overflow**: Connection processes receive events via `send/2` during fanout. If a process is slow to consume, its mailbox grows. The outbound queue mechanism is application-level — but is the BEAM-level mailbox also protected?
6. **Reverse proxy recommendations**: What nginx/Caddy configuration should complement the relay's defences? (Rate limiting, connection limits, WebSocket-specific settings, request body size.)
7. **Monitoring and alerting**: What telemetry signals should trigger alerts? (Connection count spikes, queue overflow rates, DB pool saturation, error rates.)

---
## 5. Out of Scope

The following are **not** in scope for this review:

- TLS configuration (handled by reverse proxy)
- DNS and network-level DDoS mitigation
- Operating system hardening
- Key management for the relay identity
- Client-side security
- Nostr protocol design flaws (we implement the spec as-is)

---
## 6. Response Format

For each recommendation, use this format:

### [Severity] Title

**Attack/failure mode:** What goes wrong without this mitigation.

**Current state:** What exists today (or doesn't).

**Recommendation:** Specific change — config value, code change, or architectural decision.

**Trade-offs:** Any impact on legitimate users or operational complexity.