# Hardening Review: Parrhesia Nostr Relay
You are a security engineer specialising in real-time WebSocket servers, Erlang/OTP systems, and protocol-level abuse. You are reviewing **Parrhesia**, a Nostr relay (NIP-01 compliant) written in Elixir, for hardening opportunities — with a primary focus on **denial-of-service resilience** and a secondary focus on the full attack surface.
Produce a prioritised list of **specific, actionable recommendations** with rationale. For each recommendation, state:
1. The attack or failure mode it mitigates
2. Suggested implementation (config change, code change, or architectural change)
3. Severity estimate (critical / high / medium / low)
---
## 1. Architecture Overview
| Component | Technology | Notes |
|---|---|---|
| Runtime | Elixir/OTP 27, BEAM VM | Each WS connection is a separate process |
| HTTP server | Bandit (pure Elixir) | HTTP/1.1 only, no HTTP/2 |
| WebSocket | `websock_adapter` | Text frames only; binary rejected |
| Database | PostgreSQL via Ecto | Range-partitioned `events` table by `created_at` |
| Caching | ETS | Config snapshot + moderation ban/allow lists |
| Multi-node | Erlang `:pg` groups | Fanout across BEAM cluster nodes |
| Metrics | Prometheus (Telemetry) | `/metrics` endpoint |
| TLS termination | Reverse proxy (nginx/Caddy) | **Out of scope** for this review |
### Supervision Tree
```
Parrhesia.Supervisor
├─ Telemetry (Prometheus exporter)
├─ Config (ETS snapshot of runtime config)
├─ Storage.Supervisor (Ecto repo + moderation cache)
├─ Subscriptions.Supervisor (ETS subscription index for fanout)
├─ Auth.Supervisor (NIP-42 challenge GenServer)
├─ Policy.Supervisor (policy enforcement)
├─ Web.Endpoint (Bandit listener)
└─ Tasks.Supervisor (ExpirationWorker, 30s GC loop)
```
### Data Flow
1. Client connects via WebSocket at `/relay`
2. NIP-42 AUTH challenge issued immediately (16-byte random, base64url)
3. Inbound text frames are: size-checked → JSON-decoded → rate-limited → protocol-dispatched
4. EVENT messages: validated → policy-checked → stored in Postgres → ACK → async fanout to matching subscriptions
5. REQ messages: filters validated → Postgres query → results streamed → EOSE → live subscription registered
6. Fanout: post-ingest, subscription index (ETS) is traversed; matching connection processes receive events via `send/2`
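
The frame handling in step 3 can be sketched as a `with` chain. This is an illustrative sketch only: the helper names (`check_frame_size/1`, `rate_limit/1`, `dispatch/2`, `notice/1`) are hypothetical stand-ins for the relay's actual internals, not its real API.

```elixir
# Hypothetical sketch of the inbound pipeline ordering from step 3.
# Cheap checks run first, so oversized frames never reach the JSON decoder.
def handle_in({payload, opcode: :text}, state) do
  with :ok <- check_frame_size(payload),        # <= 1 MiB, checked pre-decode
       {:ok, message} <- JSON.decode(payload),  # only now pay the decode cost
       :ok <- rate_limit(state),                # per-connection sliding window
       {:ok, state} <- dispatch(message, state) do # EVENT / REQ / CLOSE / AUTH ...
    {:ok, state}
  else
    {:error, reason} ->
      {:push, {:text, notice(reason)}, state}
  end
end
```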
---
## 2. Current Defences Inventory
### Connection Layer
| Defence | Value | Enforcement Point |
|---|---|---|
| Max WebSocket frame size | **1,048,576 bytes (1 MiB)** | Checked in `handle_in` *before* JSON decode, and at Bandit upgrade (`max_frame_size`) |
| WebSocket upgrade timeout | **60,000 ms** | Passed to `WebSockAdapter.upgrade` |
| Binary frame rejection | Returns NOTICE, connection stays open | `handle_in` opcode check |
| Outbound queue limit | **256 events** per connection | Overflow strategy: **`:close`** (WS 1008) |
| Outbound drain batch | **64 events** | Async drain via `send(self(), :drain_outbound_queue)` |
| Outbound pressure telemetry | Threshold at **75%** of queue | Emits telemetry event only, no enforcement |
| IP blocking | Via moderation cache (ETS) | Management API can add blocked IPs |
### Protocol Layer
| Defence | Value | Notes |
|---|---|---|
| Max event JSON size | **262,144 bytes (256 KiB)** | Re-serialises decoded event and checks byte size |
| Max filters per REQ | **16** | Rejected at filter validation |
| Max filter `limit` | **500** | `min(client_limit, 500)` applied at query time |
| Max subscriptions per connection | **32** | Existing sub IDs updated without counting toward limit |
| Subscription ID max length | **64 characters** | Must be non-empty |
| Event kind range | **0–65,535** | Integer range check |
| Max future event skew | **900 seconds (15 min)** | Events with `created_at > now + 900` rejected |
| Unknown filter keys | **Rejected** | Allowed: `ids`, `authors`, `kinds`, `since`, `until`, `limit`, `search`, `#<letter>` |
### Event Validation Pipeline
Strict order:
1. Required fields present (`id`, `pubkey`, `created_at`, `kind`, `tags`, `content`, `sig`)
2. `id` — 64-char lowercase hex
3. `pubkey` — 64-char lowercase hex
4. `created_at` — non-negative integer, max 900s future skew
5. `kind` — integer in [0, 65535]
6. `tags` — list of non-empty string arrays (**no length limit on tags array or individual tag values**)
7. `content` — any binary string
8. `sig` — 128-char lowercase hex
9. ID hash recomputation and comparison
10. Schnorr signature verification via `lib_secp256k1` (gated by `verify_event_signatures` flag, default `true`)
### Rate Limiting
| Defence | Value | Notes |
|---|---|---|
| Event ingest rate | **120 events per window** | Per-connection sliding window |
| Ingest window | **1 second** | Resets on first event after expiry |
| No per-IP connection rate limiting | — | Must be handled at reverse proxy |
| No global connection count ceiling | — | BEAM handles thousands but no configured limit |
### Authentication (NIP-42)
- Challenge issued to **all** connections on connect (optional escalation model)
- AUTH event must: pass full NIP-01 validation, be kind `22242`, contain matching `challenge` tag, contain matching `relay` tag
- `created_at` freshness: must be `>= now - 600s` (10 min)
- On success: pubkey added to `authenticated_pubkeys` MapSet; challenge rotated
- Supports multiple authenticated pubkeys per connection
### Authentication (NIP-98 HTTP)
- Management endpoint (`POST /management`) requires NIP-98 header
- Auth event must be kind `27235`, `created_at` within **60 seconds** of now
- Must include `method` and `u` tags matching request exactly
### Access Control
- `auth_required_for_writes`: default **false** (configurable)
- `auth_required_for_reads`: default **false** (configurable)
- Protected events (NIP-70, tagged `["-"]`): require auth + pubkey match
- Giftwrap (kind 1059): unauthenticated REQ → CLOSED; authenticated REQ must include `#p` containing own pubkey
### Database
- All queries use Ecto parameterised bindings — no raw string interpolation
- LIKE search patterns escaped (`%`, `_`, `\` characters)
- Deletion enforces `pubkey == deleter_pubkey` in WHERE clause
- Soft-delete via `deleted_at`; hard-delete only via vanish (NIP-62) or expiration purge
- DB pool: **32 connections** (prod), queue target 1s, interval 5s
### Moderation
- Banned pubkeys, allowed pubkeys, banned events, blocked IPs stored in ETS cache
- Management API (NIP-98 authed) for CRUD on moderation lists
- Cache invalidated atomically on writes
---
## 3. Known Gaps and Areas of Concern
The following are areas where the current implementation may be vulnerable or where defences could be strengthened. **Please evaluate each and provide recommendations.**
### 3.1 Connection Exhaustion
- There is **no global limit on concurrent WebSocket connections**. Each connection is an Elixir process (~2–3 KiB base), but subscriptions, auth state, and outbound queues add per-connection memory.
- There is **no per-IP connection rate limiting at the application layer**. IP blocking exists but is reactive (management API), not automatic.
- There is **no idle timeout** after the WebSocket upgrade completes. A connection can remain open indefinitely without sending or receiving messages.
**Questions:**
- What connection limits should be configured at the Bandit/BEAM level?
- Should an idle timeout be implemented? If so, what value balances real-time subscription use against resource waste?
- Should per-IP connection counting be implemented at the application layer, or is this strictly a reverse proxy concern?
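
For reference while answering these questions, the relevant knobs could look like the sketch below, assuming the endpoint reads its Bandit options from application config. Option names are Bandit/ThousandIsland options and should be verified against the versions in use; all values are illustrative, not recommendations.

```elixir
# config/runtime.exs — illustrative values only; verify option names
# against the Bandit and ThousandIsland versions actually in use.
config :parrhesia, Parrhesia.Web.Endpoint,
  http: [
    thousand_island_options: [
      num_acceptors: 100,    # acceptor pool size
      num_connections: 500   # per-acceptor cap; global ceiling ~ 100 * 500
    ],
    websocket_options: [
      timeout: 300_000       # idle timeout (ms) after upgrade
    ]
  ]
```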
### 3.2 Subscription Abuse
- A single connection can hold **32 subscriptions**, each with up to **16 filters**. That's 512 filter predicates per connection being evaluated on every fanout.
- Filter arrays (`ids`, `authors`, `kinds`, tag values) have **no element count limits**. A filter could contain thousands of author pubkeys.
- There is no cost accounting for "expensive" subscriptions (e.g., wide open filters matching all events).
**Questions:**
- Should filter array element counts be bounded? If so, what limits per field?
- Should there be a per-connection "filter complexity" budget?
- How expensive is the current ETS subscription index traversal at scale (e.g., 10K concurrent connections × 32 subs each)?
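
One concrete shape for the "filter complexity budget" idea: charge each filter for the elements in its list fields and reject the REQ when the sum crosses a cap. A hypothetical sketch only; the module name and budget value are invented for illustration.

```elixir
defmodule FilterBudget do
  @moduledoc "Sketch: per-REQ filter complexity budget (illustrative only)."

  # Arbitrary cap for illustration; would need tuning against real traffic.
  @budget 1_000

  def within_budget?(filters) do
    total = filters |> Enum.map(&cost/1) |> Enum.sum()
    total <= @budget
  end

  # Cost = total elements across ids/authors/kinds plus all #tag value lists.
  defp cost(filter) do
    list_fields = Map.take(filter, ["ids", "authors", "kinds"]) |> Map.values()
    tag_fields = for {"#" <> _letter, values} <- filter, do: values
    Enum.sum(Enum.map(list_fields ++ tag_fields, &length/1))
  end
end
```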
### 3.3 Tag Array Size
- Event validation does **not limit the number of tags** or the length of individual tag values beyond the 256 KiB total event size cap.
- A maximally-tagged event could contain thousands of short tags, causing amplification in `event_tags` table inserts (one row per tag).
**Questions:**
- Should a max tag count be enforced? What is a reasonable limit?
- What is the insert cost of storing e.g. 1,000 tags per event? Could this be used for write amplification?
- Should individual tag value lengths be bounded?
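
As a reference point for these questions, a bounds check slotting into validation step 6 might look like this sketch. Both limits and the function name are invented for illustration; nothing here is spec-mandated.

```elixir
# Sketch: close the gap in validation step 6 with explicit tag bounds.
# Both limits are illustrative placeholders, not recommendations.
@max_tags 2_000
@max_tag_value_bytes 1_024

defp validate_tag_bounds(tags) do
  cond do
    length(tags) > @max_tags ->
      {:error, "invalid: too many tags"}

    Enum.any?(tags, fn tag ->
      Enum.any?(tag, &(byte_size(&1) > @max_tag_value_bytes))
    end) ->
      {:error, "invalid: tag value too long"}

    true ->
      :ok
  end
end
```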
### 3.4 AUTH Timing
- AUTH event `created_at` freshness only checks the **lower bound** (`>= now - 600`). An AUTH event with `created_at` far in the future passes validation.
- Regular events have a future skew cap of 900s, but AUTH events do not.
**Questions:**
- Should AUTH events also enforce a future `created_at` bound?
- Is a 600-second AUTH window too wide? Could it be reduced?
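
Whatever values the review settles on, the asymmetry itself is a one-line fix: bound `created_at` on both sides, mirroring the 900 s future cap regular events already get. A sketch with illustrative names:

```elixir
# Sketch: symmetric AUTH freshness window (helper name is illustrative).
@auth_past_window 600   # existing lower bound (10 min)
@auth_future_skew 900   # reuse the future cap regular events already have

defp auth_fresh?(created_at, now) do
  created_at >= now - @auth_past_window and
    created_at <= now + @auth_future_skew
end
```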
### 3.5 Outbound Amplification
- A single inbound EVENT can fan out to an unbounded number of matching subscriptions across all connections.
- The outbound queue (256 events, `:close` strategy) protects individual connections but does not limit total fanout work per event.
- The fanout traverses the ETS subscription index synchronously in the ingesting connection's process.
**Questions:**
- Should fanout be bounded per event (e.g., max N recipients before yielding)?
- Should fanout happen in a separate process pool rather than inline?
- Is the `:close` overflow strategy optimal, or would `:drop_oldest` be better for well-behaved clients with temporary backpressure?
### 3.6 Query Amplification
- A single REQ with 16 filters, each with `limit: 500`, could trigger 16 separate Postgres queries returning up to 8,000 events total.
- COUNT requests also execute per-filter queries (now deduplicated via UNION ALL).
- `search` filters use `ILIKE %pattern%` which cannot use B-tree indexes.
**Questions:**
- Should there be a per-REQ total result cap (across all filters)?
- Should `search` queries be rate-limited or require a minimum pattern length?
- Should COUNT be disabled or rate-limited separately?
- Are there missing indexes that would help common query patterns?
### 3.7 Multi-Node Trust
- Events received via `:remote_fanout_event` from peer BEAM nodes **skip all validation and policy checks** and go directly to the subscription index.
- This assumes all cluster peers are trusted.
**Questions:**
- If cluster membership is dynamic or spans trust boundaries, should remote events be re-validated?
- Should there be a shared secret or HMAC on inter-node messages?
### 3.8 Metrics Endpoint
- `/metrics` (Prometheus) is **unauthenticated**.
- Exposes internal telemetry: connection counts, event throughput, queue depths, database timing.
**Questions:**
- Should `/metrics` require authentication or be restricted to internal networks?
- Could metrics data be used to profile the relay's capacity and craft targeted attacks?
### 3.9 Negentropy Stub
- NEG-OPEN, NEG-MSG, NEG-CLOSE messages are accepted and acknowledged but the reconciliation logic is a stub (cursor counter only).
- Are there resource implications of accepting negentropy sessions without real implementation?
### 3.10 Event Re-Serialisation Cost
- To enforce the 256 KiB event size limit, the relay calls `JSON.encode!(event)` on the already-decoded event map. This re-serialisation happens on every inbound EVENT.
- Could this be replaced with a byte-length check on the raw frame payload (already available)?
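
A raw-payload check would move the size guard before the decode. One caveat worth noting in the review: the raw frame includes the `["EVENT", ...]` envelope, so the two checks measure slightly different byte counts. A sketch (assuming the raw payload is still in scope at dispatch time):

```elixir
# Current approach: re-serialise the decoded map on every inbound EVENT.
#   byte_size(JSON.encode!(event)) <= @max_event_bytes
#
# Sketched alternative: check the raw text frame once, before decoding.
# The raw payload also includes the ["EVENT", ...] envelope bytes, so
# the effective limit is marginally stricter than the current check.
defp event_size_ok?(raw_payload) when is_binary(raw_payload) do
  byte_size(raw_payload) <= @max_event_bytes
end
```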
---
## 4. Specific Review Requests
Beyond the gaps above, please also evaluate:
1. **Bandit configuration**: Are there Bandit-level options (max connections, header limits, request timeouts, keepalive settings) that should be tuned for a public-facing relay?
2. **BEAM VM flags**: Are there any Erlang VM flags (`+P`, `+Q`, `+S`, memory limits) that should be set for production hardening?
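
   As a concrete starting point, a `vm.args` fragment might look like the following. The flags are standard `erl` flags, but the values are illustrative and should be verified against the OTP 27 documentation and load-tested:

   ```
   ## rel/vm.args — illustrative values, not tuned recommendations.

   ## Max Erlang processes; one per WebSocket connection plus internals
   ## (default 262144).
   +P 1048576

   ## Max ports; every TCP socket consumes one.
   +Q 262144

   ## Default per-process max heap size, in words; processes exceeding it
   ## are killed, which caps runaway mailboxes (default 0 = unlimited).
   +hmax 4194304
   ```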
3. **Ecto pool exhaustion**: With 32 DB connections and potentially thousands of concurrent REQ queries, what happens under pool exhaustion? Is the 1s queue target + 5s interval appropriate?
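
   For context, the pool parameters in question are standard Ecto/DBConnection options; the repo module name below is assumed from the supervision tree, and the `timeout` value is illustrative:

   ```elixir
   # config/runtime.exs — the values under review, with the DBConnection
   # options that govern behaviour when the pool saturates.
   config :parrhesia, Parrhesia.Storage.Repo,
     pool_size: 32,
     # Checkouts may queue up to ~queue_target ms; if every checkout within
     # a queue_interval window hits that target, DBConnection starts
     # rejecting new checkouts (DBConnection.ConnectionError) to shed load.
     queue_target: 1_000,
     queue_interval: 5_000,
     # Upper bound on how long a single query may hold a connection (ms).
     timeout: 15_000
   ```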
4. **ETS table sizing**: The subscription index and moderation cache use ETS. Are there memory limits or table options (`read_concurrency`, `write_concurrency`, `compressed`) that should be tuned?
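
   For reference, these options are set at table creation. A sketch of how they might be applied to the two caches (table names and access modes are assumed for illustration, not the relay's actual ones). Note that ETS has no built-in memory cap, so any size limit must be enforced by the owning process:

   ```elixir
   # Read-heavy, rarely written (moderation lists, config snapshot):
   # optimise reads; :compressed trades CPU for memory on large values.
   :ets.new(:moderation_cache, [:set, :named_table, :protected,
                                {:read_concurrency, true}, :compressed])

   # Subscription index: read on every fanout, written on REQ/CLOSE.
   :ets.new(:subscription_index, [:bag, :named_table, :public,
                                  {:read_concurrency, true},
                                  {:write_concurrency, true}])
   ```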
5. **Process mailbox overflow**: Connection processes receive events via `send/2` during fanout. If a process is slow to consume, its mailbox grows. The outbound queue mechanism is application-level — but is the BEAM-level mailbox also protected?
6. **Reverse proxy recommendations**: What nginx/Caddy configuration should complement the relay's defences? (Rate limiting, connection limits, WebSocket-specific settings, request body size.)
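
   As a baseline for this item, an illustrative nginx fragment (the upstream port is assumed; all values are starting points to tune, not recommendations):

   ```nginx
   # Illustrative only — tune limits to observed traffic.
   limit_conn_zone $binary_remote_addr zone=relay_conn:10m;
   limit_req_zone  $binary_remote_addr zone=relay_req:10m rate=10r/s;

   server {
       listen 443 ssl;

       location /relay {
           limit_conn relay_conn 20;         # concurrent connections per IP
           limit_req  zone=relay_req burst=20;

           proxy_pass http://127.0.0.1:4000;
           proxy_http_version 1.1;           # required for WebSocket upgrade
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection "upgrade";
           proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
           proxy_read_timeout 300s;          # idle cutoff, complements the app
           client_max_body_size 1m;          # aligns with the 1 MiB frame cap
       }

       location /metrics {
           allow 10.0.0.0/8;                 # scrape from internal network only
           deny  all;
           proxy_pass http://127.0.0.1:4000;
       }
   }
   ```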
7. **Monitoring and alerting**: What telemetry signals should trigger alerts? (Connection count spikes, queue overflow rates, DB pool saturation, error rates.)
---
## 5. Out of Scope
The following are **not** in scope for this review:
- TLS configuration (handled by reverse proxy)
- DNS and network-level DDoS mitigation
- Operating system hardening
- Key management for the relay identity
- Client-side security
- Nostr protocol design flaws (we implement the spec as-is)
---
## 6. Response Format
For each recommendation, use this format:
### [Severity] Title
**Attack/failure mode:** What goes wrong without this mitigation.
**Current state:** What exists today (or doesn't).
**Recommendation:** Specific change — config value, code change, or architectural decision.
**Trade-offs:** Any impact on legitimate users or operational complexity.