parrhesia/docs/slop/HARDEN.md

Hardening Review: Parrhesia Nostr Relay

You are a security engineer specialising in real-time WebSocket servers, Erlang/OTP systems, and protocol-level abuse. You are reviewing Parrhesia, a Nostr relay (NIP-01 compliant) written in Elixir, for hardening opportunities — with a primary focus on denial-of-service resilience and a secondary focus on the full attack surface.

Produce a prioritised list of specific, actionable recommendations with rationale. For each recommendation, state:

  1. The attack or failure mode it mitigates
  2. Suggested implementation (config change, code change, or architectural change)
  3. Severity estimate (critical / high / medium / low)

1. Architecture Overview

| Component | Technology | Notes |
|---|---|---|
| Runtime | Elixir/OTP 27, BEAM VM | Each WS connection is a separate process |
| HTTP server | Bandit (pure Elixir) | HTTP/1.1 only, no HTTP/2 |
| WebSocket | websock_adapter | Text frames only; binary rejected |
| Database | PostgreSQL via Ecto | Range-partitioned events table by created_at |
| Caching | ETS | Config snapshot + moderation ban/allow lists |
| Multi-node | Erlang :pg groups | Fanout across BEAM cluster nodes |
| Metrics | Prometheus (Telemetry) | /metrics endpoint |
| TLS termination | Out of scope | Handled by reverse proxy (nginx/Caddy) |

Supervision Tree

Parrhesia.Supervisor
  ├─ Telemetry (Prometheus exporter)
  ├─ Config (ETS snapshot of runtime config)
  ├─ Storage.Supervisor (Ecto repo + moderation cache)
  ├─ Subscriptions.Supervisor (ETS subscription index for fanout)
  ├─ Auth.Supervisor (NIP-42 challenge GenServer)
  ├─ Policy.Supervisor (policy enforcement)
  ├─ Web.Endpoint (Bandit listener)
  └─ Tasks.Supervisor (ExpirationWorker, 30s GC loop)

Data Flow

  1. Client connects via WebSocket at /relay
  2. NIP-42 AUTH challenge issued immediately (16-byte random, base64url)
  3. Inbound text frames are: size-checked → JSON-decoded → rate-limited → protocol-dispatched
  4. EVENT messages: validated → policy-checked → stored in Postgres → ACK → async fanout to matching subscriptions
  5. REQ messages: filters validated → Postgres query → results streamed → EOSE → live subscription registered
  6. Fanout: post-ingest, subscription index (ETS) is traversed; matching connection processes receive events via send/2
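Steps 3–6 above can be condensed into a single dispatch sketch. Module and function names here are illustrative, not the relay's actual API; `JSON.decode/1` assumes the built-in `JSON` module the codebase already uses elsewhere:

```elixir
# Illustrative sketch of the inbound frame path: size check, JSON decode,
# then dispatch on the NIP-01 message type. Not the relay's real module.
defmodule InboundFrame do
  @max_frame_bytes 1_048_576

  def classify(payload) when byte_size(payload) > @max_frame_bytes,
    do: {:error, :frame_too_large}

  def classify(payload) do
    case JSON.decode(payload) do
      {:ok, ["EVENT", event]} -> {:event, event}
      {:ok, ["REQ", sub_id | filters]} -> {:req, sub_id, filters}
      {:ok, ["CLOSE", sub_id]} -> {:close, sub_id}
      {:ok, ["AUTH", event]} -> {:auth, event}
      {:ok, _other} -> {:error, :unknown_message}
      {:error, _} -> {:error, :invalid_json}
    end
  end
end
```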

2. Current Defences Inventory

Connection Layer

| Defence | Value | Enforcement point |
|---|---|---|
| Max WebSocket frame size | 1,048,576 bytes (1 MiB) | Checked in handle_in before JSON decode, and at Bandit upgrade (max_frame_size) |
| WebSocket upgrade timeout | 60,000 ms | Passed to WebSockAdapter.upgrade |
| Binary frame rejection | Returns NOTICE, connection stays open | handle_in opcode check |
| Outbound queue limit | 256 events per connection | Overflow strategy: :close (WS 1008) |
| Outbound drain batch | 64 events | Async drain via send(self(), :drain_outbound_queue) |
| Outbound pressure telemetry | Threshold at 75% of queue | Emits telemetry event only, no enforcement |
| IP blocking | Via moderation cache (ETS) | Management API can add blocked IPs |

Protocol Layer

| Defence | Value | Notes |
|---|---|---|
| Max event JSON size | 262,144 bytes (256 KiB) | Re-serialises the decoded event and checks byte size |
| Max filters per REQ | 16 | Rejected at filter validation |
| Max filter limit | 500 | min(client_limit, 500) applied at query time |
| Max subscriptions per connection | 32 | Existing sub IDs are updated without counting toward the limit |
| Subscription ID max length | 64 characters | Must be non-empty |
| Event kind range | 0–65,535 | Integer range check |
| Max future event skew | 900 seconds (15 min) | Events with created_at > now + 900 rejected |
| Unknown filter keys | Rejected | Allowed: ids, authors, kinds, since, until, limit, search, #<letter> |

Event Validation Pipeline

Strict order:

  1. Required fields present (id, pubkey, created_at, kind, tags, content, sig)
  2. id — 64-char lowercase hex
  3. pubkey — 64-char lowercase hex
  4. created_at — non-negative integer, max 900s future skew
  5. kind — integer in [0, 65535]
  6. tags — list of non-empty string arrays (no length limit on tags array or individual tag values)
  7. content — any binary string
  8. sig — 128-char lowercase hex
  9. ID hash recomputation and comparison
  10. Schnorr signature verification via lib_secp256k1 (gated by verify_event_signatures flag, default true)

Rate Limiting

| Defence | Value | Notes |
|---|---|---|
| Event ingest rate | 120 events per window | Per-connection sliding window |
| Ingest window | 1 second | Resets on first event after expiry |
| Per-IP connection rate limiting | None | Must be handled at the reverse proxy |
| Global connection count ceiling | None | BEAM handles thousands of processes, but no limit is configured |
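The windowed counter described above (120 events per 1 s window, restarting on the first event after expiry) can be sketched as a pure function over per-connection state. State shape and names are illustrative:

```elixir
# Illustrative per-connection ingest window: 120 events per 1-second window,
# where the window restarts on the first event after expiry.
defmodule IngestWindow do
  @limit 120
  @window_ms 1_000

  def new(now_ms), do: %{started_at: now_ms, count: 0}

  # Window expired: reset and count this event as the first of a new window.
  def allow(%{started_at: t0}, now_ms) when now_ms - t0 >= @window_ms do
    {:ok, %{started_at: now_ms, count: 1}}
  end

  # Window still open but full: reject.
  def allow(%{count: n} = win, _now_ms) when n >= @limit do
    {:rate_limited, win}
  end

  # Window open with headroom: count the event.
  def allow(%{count: n} = win, _now_ms) do
    {:ok, %{win | count: n + 1}}
  end
end
```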

Authentication (NIP-42)

  • Challenge issued to all connections on connect (optional escalation model)
  • AUTH event must: pass full NIP-01 validation, be kind 22242, contain matching challenge tag, contain matching relay tag
  • created_at freshness: must be >= now - 600s (10 min)
  • On success: pubkey added to authenticated_pubkeys MapSet; challenge rotated
  • Supports multiple authenticated pubkeys per connection

Authentication (NIP-98 HTTP)

  • Management endpoint (POST /management) requires NIP-98 header
  • Auth event must be kind 27235, created_at within 60 seconds of now
  • Must include method and u tags matching request exactly

Access Control

  • auth_required_for_writes: default false (configurable)
  • auth_required_for_reads: default false (configurable)
  • Protected events (NIP-70, tagged ["-"]): require auth + pubkey match
  • Giftwrap (kind 1059): unauthenticated REQ → CLOSED; authenticated REQ must include #p containing own pubkey

Database

  • All queries use Ecto parameterised bindings — no raw string interpolation
  • LIKE search patterns escaped (%, _, \ characters)
  • Deletion enforces pubkey == deleter_pubkey in WHERE clause
  • Soft-delete via deleted_at; hard-delete only via vanish (NIP-62) or expiration purge
  • DB pool: 32 connections (prod), queue target 1s, interval 5s

Moderation

  • Banned pubkeys, allowed pubkeys, banned events, blocked IPs stored in ETS cache
  • Management API (NIP-98 authed) for CRUD on moderation lists
  • Cache invalidated atomically on writes

3. Known Gaps and Areas of Concern

The following are areas where the current implementation may be vulnerable or where defences could be strengthened. Please evaluate each and provide recommendations.

3.1 Connection Exhaustion

  • There is no global limit on concurrent WebSocket connections. Each connection is an Elixir process (~23 KiB base), but subscriptions, auth state, and outbound queues add per-connection memory.
  • There is no per-IP connection rate limiting at the application layer. IP blocking exists but is reactive (management API), not automatic.
  • There is no idle timeout after the WebSocket upgrade completes. A connection can remain open indefinitely without sending or receiving messages.

Questions:

  • What connection limits should be configured at the Bandit/BEAM level?
  • Should an idle timeout be implemented? If so, what value balances real-time subscription use against resource waste?
  • Should per-IP connection counting be implemented at the application layer, or is this strictly a reverse proxy concern?
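One possible starting point for the first two questions, expressed as a Bandit listener spec. Option names follow the Bandit and ThousandIsland docs as I understand them, and all numeric values are illustrative — verify against the versions pinned in mix.lock before adopting:

```elixir
# Illustrative Bandit child spec for a public-facing relay. The endpoint
# module and every numeric value here are placeholders to tune.
{Bandit,
 plug: Parrhesia.Web.Endpoint,
 port: 4000,
 thousand_island_options: [
   num_acceptors: 100,
   # Hard ceiling on concurrent connections *per acceptor*; total capacity
   # is roughly num_acceptors * num_connections.
   num_connections: 1_000,
   # Socket read timeout, which also bounds idle pre-upgrade connections.
   read_timeout: 120_000
 ],
 websocket_options: [
   max_frame_size: 1_048_576,
   # Closes WebSockets that have been idle for this long.
   timeout: 120_000
 ]}
```

Per-IP connection counting is arguably still worth doing at the application layer as defence in depth, since the reverse proxy sees only pre-upgrade requests and cannot distinguish relay-specific abuse patterns.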

3.2 Subscription Abuse

  • A single connection can hold 32 subscriptions, each with up to 16 filters. That's 512 filter predicates per connection being evaluated on every fanout.
  • Filter arrays (ids, authors, kinds, tag values) have no element count limits. A filter could contain thousands of author pubkeys.
  • There is no cost accounting for "expensive" subscriptions (e.g., wide open filters matching all events).

Questions:

  • Should filter array element counts be bounded? If so, what limits per field?
  • Should there be a per-connection "filter complexity" budget?
  • How expensive is the current ETS subscription index traversal at scale (e.g., 10K concurrent connections × 32 subs each)?
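One way to answer the "filter complexity budget" question is to charge each filter a cost proportional to its array sizes and reject REQs that exceed a per-REQ budget. The cost model and the 1,000 budget below are illustrative, not derived from the codebase:

```elixir
# Illustrative filter-complexity budget: each filter costs 1 plus one unit
# per element in its array-valued fields (ids, authors, kinds, #<tag>).
defmodule FilterBudget do
  @array_fields ~w(ids authors kinds)
  @max_cost 1_000

  def cost(filters) when is_list(filters) do
    Enum.sum(Enum.map(filters, &filter_cost/1))
  end

  defp filter_cost(filter) do
    array_cost =
      filter
      |> Enum.filter(fn {k, v} ->
        is_list(v) and (k in @array_fields or String.starts_with?(k, "#"))
      end)
      |> Enum.map(fn {_k, v} -> length(v) end)
      |> Enum.sum()

    1 + array_cost
  end

  def within_budget?(filters), do: cost(filters) <= @max_cost
end
```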

3.3 Tag Array Size

  • Event validation does not limit the number of tags or the length of individual tag values beyond the 256 KiB total event size cap.
  • A maximally-tagged event could contain thousands of short tags, causing amplification in event_tags table inserts (one row per tag).

Questions:

  • Should a max tag count be enforced? What is a reasonable limit?
  • What is the insert cost of storing e.g. 1,000 tags per event? Could this be used for write amplification?
  • Should individual tag value lengths be bounded?
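A bounded version of step 6 of the validation pipeline might look like the following. The specific limits (1,000 tags, 1,024-byte values) are illustrative starting points, not recommendations derived from measurement:

```elixir
# Illustrative tag bounds layered on the existing "list of non-empty string
# arrays" check: cap total tag count and individual tag value length.
defmodule TagLimits do
  @max_tags 1_000
  @max_tag_value_bytes 1_024

  def valid_tags?(tags) when is_list(tags) do
    length(tags) <= @max_tags and Enum.all?(tags, &valid_tag?/1)
  end

  def valid_tags?(_), do: false

  # Tags must be non-empty lists of bounded binaries, as today plus the cap.
  defp valid_tag?([_ | _] = tag) do
    Enum.all?(tag, fn v ->
      is_binary(v) and byte_size(v) <= @max_tag_value_bytes
    end)
  end

  defp valid_tag?(_), do: false
end
```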

3.4 AUTH Timing

  • AUTH event created_at freshness only checks the lower bound (>= now - 600). An AUTH event with created_at far in the future passes validation.
  • Regular events have a future skew cap of 900s, but AUTH events do not.

Questions:

  • Should AUTH events also enforce a future created_at bound?
  • Is a 600-second AUTH window too wide? Could it be reduced?
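A two-sided freshness check would mirror the future-skew cap that regular events already get. Both window values below are illustrative:

```elixir
# Illustrative two-sided AUTH created_at check: bounded in the past (as
# today) and in the future (currently missing).
defmodule AuthFreshness do
  @max_age_s 600
  @max_future_skew_s 60

  def fresh?(created_at, now) do
    created_at >= now - @max_age_s and created_at <= now + @max_future_skew_s
  end
end
```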

3.5 Outbound Amplification

  • A single inbound EVENT can fan out to an unbounded number of matching subscriptions across all connections.
  • The outbound queue (256 events, :close strategy) protects individual connections but does not limit total fanout work per event.
  • The fanout traverses the ETS subscription index synchronously in the ingesting connection's process.

Questions:

  • Should fanout be bounded per event (e.g., max N recipients before yielding)?
  • Should fanout happen in a separate process pool rather than inline?
  • Is the :close overflow strategy optimal, or would :drop_oldest be better for well-behaved clients with temporary backpressure?
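One low-cost answer to the first question is chunked delivery with scheduler yields, so a hot event cannot monopolise the ingesting connection's process. The chunk size and names below are illustrative; the real ETS index traversal is more involved than a plain recipient list:

```elixir
# Illustrative bounded fanout: deliver in fixed-size chunks and yield to the
# scheduler between chunks.
defmodule Fanout do
  @chunk_size 200

  def deliver(event, recipient_pids) do
    recipient_pids
    |> Enum.chunk_every(@chunk_size)
    |> Enum.each(fn chunk ->
      Enum.each(chunk, fn pid -> send(pid, {:fanout_event, event}) end)
      # Let other processes run before the next chunk.
      :erlang.yield()
    end)
  end
end
```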

3.6 Query Amplification

  • A single REQ with 16 filters, each with limit: 500, could trigger 16 separate Postgres queries returning up to 8,000 events total.
  • COUNT requests also execute per-filter queries (now deduplicated via UNION ALL).
  • search filters use ILIKE %pattern% which cannot use B-tree indexes.

Questions:

  • Should there be a per-REQ total result cap (across all filters)?
  • Should search queries be rate-limited or require a minimum pattern length?
  • Should COUNT be disabled or rate-limited separately?
  • Are there missing indexes that would help common query patterns?
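On the last question: if search stays ILIKE-based, a trigram index is the standard way to make `%pattern%` queries indexable. This assumes the pg_trgm extension is available and that the searched column is `events.content` (assumed from context — verify against the actual schema); on the range-partitioned events table, creating the index on the parent cascades to partitions in PostgreSQL 11+:

```sql
-- Illustrative: trigram GIN index so ILIKE '%pattern%' can use an index
-- scan instead of a sequential scan. Requires the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX IF NOT EXISTS events_content_trgm_idx
  ON events USING gin (content gin_trgm_ops);
```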

3.7 Multi-Node Trust

  • Events received via :remote_fanout_event from peer BEAM nodes skip all validation and policy checks and go directly to the subscription index.
  • This assumes all cluster peers are trusted.

Questions:

  • If cluster membership is dynamic or spans trust boundaries, should remote events be re-validated?
  • Should there be a shared secret or HMAC on inter-node messages?
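A minimal shape for the HMAC option, assuming a shared secret distributed via runtime config (key management itself is out of scope here). `:crypto.hash_equals/2` requires OTP 25+, which the OTP 27 runtime satisfies:

```elixir
# Illustrative HMAC envelope for inter-node fanout messages: sign the
# serialised term with a shared secret, verify in constant time on receipt.
defmodule NodeAuth do
  def sign(term, secret) do
    payload = :erlang.term_to_binary(term)
    {payload, :crypto.mac(:hmac, :sha256, secret, payload)}
  end

  def verify({payload, mac}, secret) do
    expected = :crypto.mac(:hmac, :sha256, secret, payload)

    if :crypto.hash_equals(mac, expected) do
      # :safe blocks creation of new atoms from untrusted input.
      {:ok, :erlang.binary_to_term(payload, [:safe])}
    else
      {:error, :bad_mac}
    end
  end
end
```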

3.8 Metrics Endpoint

  • /metrics (Prometheus) is unauthenticated.
  • Exposes internal telemetry: connection counts, event throughput, queue depths, database timing.

Questions:

  • Should /metrics require authentication or be restricted to internal networks?
  • Could metrics data be used to profile the relay's capacity and craft targeted attacks?

3.9 Negentropy Stub

  • NEG-OPEN, NEG-MSG, NEG-CLOSE messages are accepted and acknowledged but the reconciliation logic is a stub (cursor counter only).
  • Are there resource implications of accepting negentropy sessions without real implementation?

3.10 Event Re-Serialisation Cost

  • To enforce the 256 KiB event size limit, the relay calls JSON.encode!(event) on the already-decoded event map. This re-serialisation happens on every inbound EVENT.
  • Could this be replaced with a byte-length check on the raw frame payload (already available)?
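The raw-payload alternative suggested above is a one-line guard. Because the `["EVENT",...]` envelope adds a few bytes on top of the event itself, checking the whole frame against the event cap is marginally stricter than the current re-serialisation check — usually acceptable for a size guard, and it avoids re-encoding entirely:

```elixir
# Illustrative raw-frame size guard, applied before JSON decode.
defmodule EventSize do
  @max_event_bytes 262_144

  # Slightly stricter than re-serialising: the envelope counts against the
  # cap, so events within a few bytes of the limit would now be rejected.
  def frame_within_limit?(raw_payload) when is_binary(raw_payload) do
    byte_size(raw_payload) <= @max_event_bytes
  end
end
```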

4. Specific Review Requests

Beyond the gaps above, please also evaluate:

  1. Bandit configuration: Are there Bandit-level options (max connections, header limits, request timeouts, keepalive settings) that should be tuned for a public-facing relay?

  2. BEAM VM flags: Are there any Erlang VM flags (+P, +Q, +S, memory limits) that should be set for production hardening?

  3. Ecto pool exhaustion: With 32 DB connections and potentially thousands of concurrent REQ queries, what happens under pool exhaustion? Is the 1s queue target + 5s interval appropriate?

  4. ETS table sizing: The subscription index and moderation cache use ETS. Are there memory limits or table options (read_concurrency, write_concurrency, compressed) that should be tuned?

  5. Process mailbox overflow: Connection processes receive events via send/2 during fanout. If a process is slow to consume, its mailbox grows. The outbound queue mechanism is application-level — but is the BEAM-level mailbox also protected?

  6. Reverse proxy recommendations: What nginx/Caddy configuration should complement the relay's defences? (Rate limiting, connection limits, WebSocket-specific settings, request body size.)

  7. Monitoring and alerting: What telemetry signals should trigger alerts? (Connection count spikes, queue overflow rates, DB pool saturation, error rates.)
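For item 6, an nginx fragment of the kind a response might propose. This is a sketch, not a complete server block (TLS directives omitted as out of scope); zone names, networks, and numbers are placeholders:

```nginx
# Illustrative nginx settings complementing the relay's own defences.
limit_conn_zone $binary_remote_addr zone=ws_conn:10m;
limit_req_zone  $binary_remote_addr zone=ws_new:10m rate=5r/s;

server {
    listen 443 ssl;

    location /relay {
        limit_conn ws_conn 20;              # concurrent connections per IP
        limit_req  zone=ws_new burst=10;    # throttle new connection attempts

        proxy_pass http://127.0.0.1:4000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Close WebSockets idle in either direction for 2 minutes.
        proxy_read_timeout 120s;
        proxy_send_timeout 120s;

        client_max_body_size 1m;            # aligns with the 1 MiB frame cap
    }

    location /metrics {
        allow 10.0.0.0/8;                   # internal scrape network only
        deny  all;
        proxy_pass http://127.0.0.1:4000;
    }
}
```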


5. Out of Scope

The following are not in scope for this review:

  • TLS configuration (handled by reverse proxy)
  • DNS and network-level DDoS mitigation
  • Operating system hardening
  • Key management for the relay identity
  • Client-side security
  • Nostr protocol design flaws (we implement the spec as-is)

6. Response Format

For each recommendation, use this format:

[Severity] Title

Attack/failure mode: What goes wrong without this mitigation.

Current state: What exists today (or doesn't).

Recommendation: Specific change — config value, code change, or architectural decision.

Trade-offs: Any impact on legitimate users or operational complexity.