parrhesia/docs/slop/REVIEW.md
2026-03-18 13:23:06 +01:00

# Parrhesia Technical Audit
**Date:** 2026-03-17
**Reviewer:** BEAM Architecture & Security Specialist
**Version audited:** 0.5.0 (Alpha)
**Scope:** Full-spectrum audit — protocol compliance, architecture, OTP supervision, backpressure, data integrity, security
---
## Table of Contents
1. [Feature Parity & Protocol Completeness](#1-feature-parity--protocol-completeness)
2. [Architecture & Design Coherence](#2-architecture--design-coherence)
3. [OTP & Supervision Tree](#3-otp--supervision-tree)
4. [Backpressure & Load Handling](#4-backpressure--load-handling)
5. [Data Integrity & PostgreSQL](#5-data-integrity--postgresql)
6. [Security](#6-security)
7. [Overall Assessment](#7-overall-assessment)
8. [Prioritised Remediation List](#8-prioritised-remediation-list)
---
## 1. Feature Parity & Protocol Completeness
**Verdict: ⚠️ Concern — one ghost feature (NIP-66), four partial implementations (NIPs 17, 43, 50, 59)**
### Claimed NIPs: 1, 9, 11, 13, 17, 40, 42, 43, 44, 45, 50, 59, 62, 66, 70, 77, 86, 98
| NIP | Title | Status | Notes |
|-----|-------|--------|-------|
| 01 | Basic Protocol | ✅ Full | EVENT, REQ, CLOSE, EOSE, OK, NOTICE, CLOSED all correct |
| 09 | Event Deletion | ✅ Full | Kind 5 with `e`/`a`/`k` tags; author-only enforcement; soft delete |
| 11 | Relay Information | ✅ Full | `application/nostr+json` on same URI; limitations section present |
| 13 | Proof of Work | ✅ Full | Leading-zero-bit calculation; configurable `min_pow_difficulty` |
| 17 | Private DMs | ⚠️ Partial | Kind 14/15 accepted; encryption/decryption client-side only; relay stores giftwraps transparently |
| 40 | Expiration | ✅ Full | Tag parsed at ingest; background worker for cleanup |
| 42 | Authentication | ✅ Full | Challenge/response with Kind 22242; freshness check; challenge rotation |
| 43 | Relay Access Metadata | ⚠️ Partial | Kind 13534 membership tracking present; join/leave request mechanics (28934/28935/28936) unimplemented |
| 44 | Encrypted Payloads | ✅ Full | Relay-transparent; version 2 format recognised |
| 45 | Count | ✅ Full | COUNT messages with optional HyperLogLog; `approximate` field |
| 50 | Search | ⚠️ Partial | Basic `ILIKE` substring match only; no tsvector/pg_trgm; no extensions (domain, language, sentiment) |
| 59 | Gift Wrap | ⚠️ Partial | Kind 1059/13 accepted; encryption at client level; recipient auth gating present |
| 62 | Request to Vanish | ✅ Full | Kind 62; relay tag requirement; `ALL_RELAYS` support; hard delete |
| **66** | **Relay Discovery** | **🔴 Ghost** | **Listed in `supported_nips` but no relay-specific liveness/discovery logic exists. Kinds 30166/10166 stored as regular events only.** |
| 70 | Protected Events | ✅ Full | `["-"]` tag detection; AUTH required; pubkey matching |
| 77 | Negentropy Sync | ✅ Full | NEG-OPEN/MSG/CLOSE/ERR; session management; configurable limits |
| 86 | Management API | ✅ Full | JSON-RPC over HTTP; NIP-98 auth; full moderation method set |
| 98 | HTTP Auth | ✅ Full | Kind 27235; URL/method binding; SHA256 payload hash; freshness |
### Ghost Features
1. **NIP-66 (Relay Discovery and Liveness Monitoring)** — Listed in supported NIPs but the relay performs no special handling for kinds 30166 or 10166. They are accepted and stored as generic events. No liveness data generation exists. **Remove from `supported_nips` or implement.**
2. **NIP-43 (Relay Access Metadata)** — Group membership tracking (kind 13534, 8000, 8001) is present, but the join/leave request mechanics (kinds 28934, 28935, 28936) and invite-code validation are absent. This is a partial implementation marketed as complete. **Document as partial or implement join/leave flow.**
### Undocumented Features
- **Marmot MLS groups** (kinds 443-449) with specialised media validation, push notification policies, and `h`-tag scoping are implemented but are not standard NIPs. These are correctly treated as extensions rather than NIP claims.
### Remediation
| Item | Action | File(s) |
|------|--------|---------|
| NIP-66 ghost | Remove from `supported_nips` in `relay_info.ex` | `lib/parrhesia/web/relay_info.ex` |
| NIP-43 partial | Document as partial or implement join/leave | `lib/parrhesia/groups/flow.ex` |
| NIP-50 extensions | Consider tsvector for FTS; document missing extensions | `lib/parrhesia/storage/adapters/postgres/events.ex` |
---
## 2. Architecture & Design Coherence
**Verdict: ✅ Sound — clean layered architecture with proper separation of concerns**
### Layer Map
```
┌─────────────────────────────────────────────────────┐
│ Transport Layer (Bandit + WebSock) │
│ lib/parrhesia/web/ │
│ connection.ex · router.ex · endpoint.ex · listener │
├─────────────────────────────────────────────────────┤
│ Protocol Layer (codec + validation) │
│ lib/parrhesia/protocol/ │
│ protocol.ex · event_validator.ex · filter.ex │
├─────────────────────────────────────────────────────┤
│ API Layer (domain operations) │
│ lib/parrhesia/api/ │
│ events.ex · auth.ex · stream.ex · sync.ex · admin │
├─────────────────────────────────────────────────────┤
│ Policy Layer (authorisation & rules) │
│ lib/parrhesia/policy/ │
│ event_policy.ex · connection_policy.ex │
├─────────────────────────────────────────────────────┤
│ Storage Layer (adapter pattern) │
│ lib/parrhesia/storage/ │
│ behaviours → postgres adapters · memory adapters │
└─────────────────────────────────────────────────────┘
```
### Strengths
- **Clean adapter pattern**: Storage behaviours (`Storage.Events`, `Storage.Moderation`, etc.) are resolved at runtime via `Parrhesia.Storage`. PostgreSQL and in-memory implementations are interchangeable.
- **Protocol layer is pure**: `Protocol.decode_client/1` and `Protocol.encode_relay/1` contain no side effects — they are pure codecs. Validation is separated into `EventValidator`.
- **API layer mediates all domain logic**: `API.Events.publish/2` orchestrates validation → policy → persistence → fanout. The web layer never touches the DB directly.
- **Connection handler is a state machine**: `Web.Connection` implements `WebSock` behaviour cleanly — no business logic bleeds in; it delegates to API functions.
- **Configuration is centralised**: `Parrhesia.Config` provides an ETS-backed config cache with runtime environment override support via 50+ env vars.
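The adapter indirection can be pictured with a minimal sketch; the module and config-key names below are illustrative, not the actual Parrhesia code:

```elixir
# Hypothetical sketch of the behaviour-based storage adapter pattern.
defmodule Storage.Events do
  @callback insert(event :: map()) :: :ok | {:error, term()}
  @callback fetch(filter :: map()) :: {:ok, [map()]}
end

defmodule Storage do
  # Resolved at runtime, so tests can swap in the memory adapter:
  #   config :parrhesia, :events_adapter, Storage.Adapters.Memory.Events
  def events_adapter do
    Application.get_env(:parrhesia, :events_adapter, Storage.Adapters.Postgres.Events)
  end

  def insert_event(event), do: events_adapter().insert(event)
end
```

Because the lookup happens per call rather than at compile time, the test suite can flip adapters without recompiling — the property the audit highlights.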
### Concerns
- **Connection handler size**: `web/connection.ex` is a large module handling all WebSocket message dispatch. While each handler delegates to API, the dispatch surface is wide. Consider extracting message handlers into sub-modules.
- **Fanout is embedded in API.Events**: The `fanout_event/1` call inside `publish/2` couples persistence with subscription delivery. Under extreme load, this means the publisher blocks on fanout. Decoupling into an async step would improve isolation.
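The decoupling suggested above could look roughly like this; `Parrhesia.FanoutTasks` is a hypothetical `Task.Supervisor` that would be added to the supervision tree, and `validate/1`, `persist/1`, and `fanout_event/1` stand in for the existing pipeline steps:

```elixir
# Sketch: hand fanout to a supervised task so publish/2 returns as soon
# as the write commits, instead of blocking on subscription delivery.
defmodule API.Events do
  def publish(event, _conn) do
    with :ok <- validate(event),
         :ok <- persist(event) do
      {:ok, _pid} =
        Task.Supervisor.start_child(Parrhesia.FanoutTasks, fn ->
          fanout_event(event)
        end)

      :ok
    end
  end
end
```

The tradeoff: per-publisher delivery ordering is no longer guaranteed, and under sustained overload the task supervisor itself needs a `max_children` cap to avoid becoming the new unbounded queue.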
### Scalability Assessment
The architecture would scale gracefully to moderate load (thousands of concurrent connections) without a rewrite. For truly high-volume deployments (10k+ connections, 100k+ events/sec), the synchronous fanout path and lack of global rate limiting would need attention. The BEAM's per-process isolation provides natural horizontal scaling if combined with a load balancer.
---
## 3. OTP & Supervision Tree
**Verdict: ✅ Sound — exemplary OTP compliance with zero unsafe practices**
### Supervision Hierarchy
```
Parrhesia.Supervisor (:one_for_one)
├── Telemetry (Supervisor, :one_for_one)
│ ├── TelemetryMetricsPrometheus.Core
│ └── :telemetry_poller
├── Config (GenServer, ETS-backed)
├── Storage.Supervisor (:one_for_one)
│ ├── ModerationCache (GenServer)
│ └── Repo (Ecto.Repo)
├── Subscriptions.Supervisor (:one_for_one)
│ ├── Index (GenServer, 7 ETS tables)
│ ├── Stream.Registry (Registry)
│ ├── Stream.Supervisor (DynamicSupervisor)
│ │ └── [dynamic] Stream.Subscription (GenServer, :temporary)
│ ├── Negentropy.Sessions (GenServer)
│ └── Fanout.MultiNode (GenServer, :pg group)
├── Auth.Supervisor (:one_for_one)
│ ├── Challenges (GenServer)
│ └── Identity.Manager (GenServer)
├── Sync.Supervisor (:one_for_one)
│ ├── WorkerRegistry (Registry)
│ ├── WorkerSupervisor (DynamicSupervisor)
│ │ └── [dynamic] Sync.Worker (GenServer, :transient)
│ └── Sync.Manager (GenServer)
├── Policy.Supervisor (:one_for_one) [empty — stateless modules]
├── Web.Endpoint (Supervisor, :one_for_one)
│ └── [per-listener] Bandit
│ └── [per-connection] Web.Connection (WebSock)
└── Tasks.Supervisor (:one_for_one)
├── ExpirationWorker (GenServer, optional)
└── PartitionRetentionWorker (GenServer)
```
### Findings
| Check | Result |
|-------|--------|
| Bare `spawn()` calls | ✅ None found |
| `Task.start` without supervision | ✅ None found |
| `Task.async` without supervision | ✅ None found |
| Unsupervised `Process.` calls | ✅ None found |
| All `Process.monitor` paired with `Process.demonitor(ref, [:flush])` | ✅ Confirmed in Index, Challenges, Subscription, Sessions |
| Restart strategies appropriate | ✅ All `:one_for_one` — children are independent |
| `:transient` for Sync.Worker | ✅ Correct — no restart on normal shutdown |
| `:temporary` for Stream.Subscription | ✅ Correct — never restart dead subscriptions |
| Crash cascade possibility | ✅ None — all supervisors isolate failures |
| Deadlock potential | ✅ None — no circular GenServer calls detected |
| ETS table ownership | ✅ Owned by supervising GenServer; recreated on restart |
### Minor Observations
- **Policy.Supervisor is empty**: Contains no children (stateless policy modules). Exists as a future extension point — no performance cost.
- **Identity.Manager silently catches init failures**: it returns `{:ok, %{}}` even if identity setup fails. Correct for startup resilience, but consider emitting a telemetry event for visibility.
- **4-level max depth** in the tree: Optimal — deep enough for isolation, shallow enough for comprehension.
---
## 4. Backpressure & Load Handling
**Verdict: ⚠️ Concern — per-connection defences are solid; global protection is absent**
### What Exists
| Mechanism | Implementation | Default | Location |
|-----------|---------------|---------|----------|
| Per-connection event rate limit | Window-based counter | 120 events/sec | `connection.ex:1598-1624` |
| Per-connection subscription cap | Hard limit at REQ time | 32 subs | `connection.ex:1098-1109` |
| Per-connection outbound queue | Bounded `:queue` | 256 entries | `connection.ex` state |
| Outbound overflow strategy | `:close` / `:drop_newest` / `:drop_oldest` | `:close` | `connection.ex:987-1013` |
| Negentropy session limits | Per-connection + global cap | 8/conn, 10k global | `negentropy/sessions.ex` |
| Filter count per REQ | Hard limit | 16 filters | `connection.ex` |
| Queue pressure telemetry | Pressure gauge (0-1 scale) | Alert at ≥ 0.75 | `telemetry.ex:57-81` |
### What's Missing
| Gap | Risk | Recommendation |
|-----|------|----------------|
| **No GenStage/Broadway backpressure** | Event ingestion is synchronous — publisher blocks on DB write + fanout | Consider async fanout via GenStage or Task.Supervisor |
| **No global rate limit** | 1000 connections × 120 evt/s = 120k evt/s with no relay-wide cap | Add global event counter with configurable ceiling |
| **No connection cap** | Bandit accepts unlimited connections up to OS limits | Set `max_connections` in Bandit options or add a plug-based cap |
| **No per-IP rate limit** | Single IP can open many connections, each with independent limits | Add IP-level rate limiting or document reverse-proxy requirement |
| **No subscription index size cap** | Global ETS subscription index can grow unbounded | Add global subscription count metric + circuit breaker |
| **Synchronous fanout** | `fanout_event/1` iterates all matching subs in publisher's process | Decouple fanout into async dispatch |
| **No read/write pool separation** | Writes can starve read queries under heavy ingest | Consider separate Ecto repos or query prioritisation |
| **No graceful drain on shutdown** | Listener shutdown doesn't flush queued events | Add drain timer before connection termination |
| **No mailbox depth monitoring** | `Process.info(pid, :message_queue_len)` never checked | Add periodic mailbox depth telemetry for subscription processes |
### Blast Radius: Flood Attack Scenario
A high-volume client flooding the relay with valid events at maximum rate:
1. **120 events/sec per connection** — rate-limited per-connection ✅
2. **But**: an attacker opens 100 connections — with no per-IP limit, even from a single address — reaching 12k events/sec ⚠️
3. Each event hits the Ecto pool (32 connections) → pool saturation at ~100 concurrent writes
4. Fanout blocks publisher process → connection mailbox grows
5. No global circuit breaker → relay degrades gradually via Ecto queue timeouts
6. Memory grows in subscription process mailboxes → eventual OOM
**Mitigation**: Deploy behind a reverse proxy (nginx/HAProxy) with connection and rate limits. This is a common pattern for BEAM applications but should be documented as a deployment requirement.
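The relay-wide cap proposed in the gap table could be sketched as follows — a hypothetical module, not present in the audited codebase; the ceiling value is illustrative:

```elixir
# Sketch of a relay-wide rate window using :atomics: one counter is
# bumped per accepted event and reset every second by a timer message;
# publishers call allow?/0 before ingesting.
defmodule GlobalRateLimit do
  use GenServer

  @ceiling 10_000  # events/sec, illustrative default

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Lock-free fast path; safe to call from any connection process
  # (assumes the GenServer has started and stored the counter ref).
  def allow? do
    ref = :persistent_term.get(__MODULE__)
    :atomics.add_get(ref, 1, 1) <= @ceiling
  end

  @impl true
  def init(_) do
    :persistent_term.put(__MODULE__, :atomics.new(1, []))
    :timer.send_interval(1_000, :reset)
    {:ok, nil}
  end

  @impl true
  def handle_info(:reset, state) do
    :atomics.put(:persistent_term.get(__MODULE__), 1, 0)
    {:noreply, state}
  end
end
```

When `allow?/0` returns `false`, the connection handler would respond with an `OK` carrying the standard `rate-limited:` machine-readable prefix.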
---
## 5. Data Integrity & PostgreSQL
**Verdict: ✅ Sound — well-designed schema with strong indexing; minor gaps in constraints and FTS**
### Schema Overview
Events are stored in monthly RANGE-partitioned tables keyed on `created_at`:
```sql
events (PARTITION BY RANGE (created_at))
id bytea (32 bytes, SHA256)
pubkey bytea (32 bytes)
created_at bigint (unix timestamp, partition key)
kind integer
content text
sig bytea (64 bytes)
d_tag text (nullable, for addressable events)
tags jsonb (denormalised tag array)
deleted_at bigint (soft delete)
expires_at bigint (NIP-40)
event_tags (PARTITION BY RANGE (event_created_at))
event_created_at bigint
event_id bytea
name text
value text
idx integer
-- State tables for NIP-16 replaceable/addressable events
replaceable_event_state (pubkey, kind → event_id, created_at)
addressable_event_state (pubkey, kind, d_tag → event_id, created_at)
```
### Indexing Strategy: ✅ Excellent
| Index | Columns | Purpose | Assessment |
|-------|---------|---------|------------|
| `events_kind_created_at_idx` | (kind, created_at DESC) | Kind filter | ✅ Primary access pattern |
| `events_pubkey_created_at_idx` | (pubkey, created_at DESC) | Author filter | ✅ Primary access pattern |
| `events_created_at_idx` | (created_at DESC) | Time range | ✅ Partition pruning |
| `events_id_idx` | (id) | Direct lookup / dedup | ✅ Essential |
| `events_expires_at_idx` | Partial, WHERE expires_at IS NOT NULL | Expiration worker | ✅ Space-efficient |
| `events_deleted_at_idx` | Partial, WHERE deleted_at IS NOT NULL | Soft-delete filter | ✅ Space-efficient |
| `event_tags_name_value_idx` | (name, value, event_created_at DESC) | Generic tag lookup | ✅ Covers all tag queries |
| `event_tags_i_idx` | Partial (value, event_created_at) WHERE name='i' | Hashtag lookups | ✅ Optimised for common pattern |
| `event_tags_h_idx` | Partial (value, event_created_at) WHERE name='h' | Group/relay hint | ✅ Marmot-specific optimisation |
### Query Construction: ✅ Sound
Nostr filters are translated to Ecto queries via `postgres/events.ex`:
- Single-filter: composed dynamically with `where/3` clauses
- Multi-filter: `UNION ALL` of individual filter queries, then Elixir-level dedup
- Tag filters: primary tag uses index join; secondary tags use `EXISTS` subqueries (properly parameterised)
- All values bound via Ecto's `^` operator — no SQL injection vectors
### N+1 Analysis: ✅ No Critical Issues
- Tag filtering uses `EXISTS` subqueries rather than joins — one query per filter, not per tag
- Multiple filters produce N queries (one per filter), not N per subscription
- `max_filters_per_req: 16` bounds this naturally
- Deduplication is done in Elixir (`MapSet`) after DB fetch — acceptable for bounded result sets
### Issues Found
| Issue | Severity | Detail | Remediation |
|-------|----------|--------|-------------|
| **No CHECK constraints on binary lengths** | Low | `id`, `pubkey`, `sig` have no `octet_length` checks at DB level | Add `CHECK (octet_length(id) = 32)` etc. via migration |
| **No length constraint on `d_tag`** | Low | Unbounded text field; could store arbitrarily large values | Add `CHECK (octet_length(d_tag) <= 256)` |
| **FTS is ILIKE-based** | Medium | NIP-50 search uses `ILIKE '%term%'` — linear scan, no ranking | Implement `tsvector` with GIN index for production-grade search |
| **Replaceable state race condition** | Low | SELECT → INSERT pattern in `insert_state_or_resolve_race/2` has a TOCTOU window | Mitigated by `on_conflict: :nothing` + retry, but a native `ON CONFLICT ... DO UPDATE` would be cleaner |
| **Redundant tag storage** | Info | Tags stored in both `event_tags` table and `events.tags` JSONB column | Intentional denormalisation for different access patterns; documented tradeoff |
| **No indexes on state tables** | Low | `replaceable_event_state` and `addressable_event_state` lack explicit indexes | These are likely small tables; monitor query plans under load |
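The constraint gaps above could be closed with a migration along these lines — a sketch, not the project's actual migration; on partitioned tables the constraints should be applied to the parent so they propagate to existing and future partitions:

```elixir
# Hypothetical Ecto migration adding the suggested CHECK constraints.
defmodule Parrhesia.Repo.Migrations.AddEventCheckConstraints do
  use Ecto.Migration

  def change do
    create constraint(:events, :id_is_32_bytes, check: "octet_length(id) = 32")
    create constraint(:events, :pubkey_is_32_bytes, check: "octet_length(pubkey) = 32")
    create constraint(:events, :sig_is_64_bytes, check: "octet_length(sig) = 64")

    create constraint(:events, :d_tag_max_256,
             check: "d_tag IS NULL OR octet_length(d_tag) <= 256"
           )
  end
end
```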
### Partitioning: ✅ Excellent
- Monthly partitions (`events_YYYY_MM`, `event_tags_YYYY_MM`) with lifecycle automation
- Default partition catches edge-case timestamps
- Retention worker pre-creates 2 months ahead; caps drops at 1 partition per run
- `protected_partition?/1` prevents accidental drop of parent/default
- Detach-before-drop pattern for safe concurrent operation
### Connection Pooling: ✅ Well-Configured
- Pool: 32 connections (configurable via `POOL_SIZE`)
- Queue target: 1000ms / Queue interval: 5000ms
- Standard Ecto queue management with graceful degradation
---
## 6. Security
### 6.1 Signature Validation
**Verdict: ✅ Sound**
- **Library**: `lib_secp256k1` (NIF-based Schnorr implementation)
- **Flow**: `API.Events.publish/2` → `Protocol.validate_event/1` → `EventValidator.validate_signature/1` → `Secp256k1.schnorr_valid?/3`
- **Timing**: Validation occurs BEFORE any database write — `persist_event/2` is only called after validation passes (`api/events.ex:41-44`)
- **Hex decoding**: Properly decodes hex sig/id/pubkey to binary before crypto verification
- **Error handling**: Wrapped in rescue clause; returns `{:error, :invalid_signature}`
**Concern**: Feature flag `verify_event_signatures` can disable verification. Currently only `false` in `test.exs`. **Ensure this is never disabled in production config.** Consider adding a startup assertion.
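The suggested startup assertion could be as small as the following sketch, called from `Application.start/2`; the config key mirrors the flag name in the text, and the module placement is hypothetical:

```elixir
# Hypothetical fail-fast check: refuse to boot with verification disabled.
defmodule Parrhesia.StartupChecks do
  def assert_signature_verification! do
    unless Application.get_env(:parrhesia, :verify_event_signatures, true) do
      raise "verify_event_signatures is disabled — refusing to start (allowed only in test)"
    end
  end
end
```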
### 6.2 SQL Injection
**Verdict: ✅ Sound — all queries parameterised**
- Only 2 `fragment()` uses in `postgres/events.ex`, both with bound parameters:
```elixir
fragment("EXISTS (SELECT 1 FROM event_tags AS tag WHERE
tag.event_created_at = ? AND tag.event_id = ? AND tag.name = ? AND tag.value = ANY(?))",
event.created_at, event.id, ^tag_name, type(^values, {:array, :string}))
```
- All dynamic query conditions built via Ecto's `dynamic()` macro with `^` bindings
- No string interpolation of user input in any query path
- No calls to `Ecto.Adapters.SQL.query/3` with raw SQL
### 6.3 Resource Exhaustion
**Verdict: ⚠️ Concern — payload size enforced but tag arrays unbounded**
| Control | Status | Default | Location |
|---------|--------|---------|----------|
| Max frame size | ✅ Enforced | 1 MB | `connection.ex:135-140` (before JSON parse) |
| Max event size | ✅ Enforced | 256 KB | `api/events.ex:40` (after JSON parse, before DB) |
| Max filters per REQ | ✅ Enforced | 16 | `connection.ex` |
| Max filter limit | ✅ Enforced | 500 | `filter.ex` |
| Max subscriptions | ✅ Enforced | 32/conn | `connection.ex:1098` |
| **Max tags per event** | **🔴 Missing** | **None** | **Event with 10k tags accepted if within 256KB** |
| **Max tag values per filter** | **🔴 Missing** | **None** | **Filter with 1000 `#p` values generates expensive query** |
| **Max content field size** | **⚠️ Indirect** | **256KB overall** | **No separate limit; large content consumes storage** |
| Max negentropy sessions | ✅ Enforced | 8/conn, 10k global | `negentropy/sessions.ex` |
| Event rate limit | ✅ Enforced | 120/sec/conn | `connection.ex:1598` |
**Remediation**: Add configurable `max_tags_per_event` and `max_tag_values_per_filter` limits.
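Enforcement could be sketched as below; the config keys and function names are hypothetical, placed where the audit suggests (event validation and filter parsing):

```elixir
# Sketch of the proposed tag-count limits.
def validate_tag_count(%{"tags" => tags}) do
  max = Application.get_env(:parrhesia, :max_tags_per_event, 2_000)
  if length(tags) <= max, do: :ok, else: {:error, :too_many_tags}
end

def validate_filter_tag_values(filter) do
  max = Application.get_env(:parrhesia, :max_tag_values_per_filter, 256)

  over? =
    filter
    |> Enum.filter(fn {key, _} -> String.starts_with?(key, "#") end)
    |> Enum.any?(fn {_key, values} -> length(values) > max end)

  if over?, do: {:error, :too_many_tag_values}, else: :ok
end
```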
### 6.4 NIP-42 AUTH Flow
**Verdict: ✅ Sound**
- Challenge generated with 16 bytes of cryptographic randomness (`Auth.Challenges`)
- Challenge is connection-scoped (one per connection, tracked by PID)
- On successful AUTH: challenge is rotated via `rotate_auth_challenge/1`
- Freshness validated: `created_at` must be within `max_age_seconds` (default 600s)
- Relay URL tag must match connection's relay URL
- Process monitoring cleans up challenges when connections die
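The challenge lifecycle described above can be condensed into a sketch — the table and function names are illustrative, not the actual `Auth.Challenges` internals:

```elixir
# Sketch: 16 random bytes, hex-encoded, keyed by connection pid;
# the monitor ref lets a DOWN message trigger cleanup.
def issue_challenge(conn_pid) do
  challenge = :crypto.strong_rand_bytes(16) |> Base.encode16(case: :lower)
  ref = Process.monitor(conn_pid)
  :ets.insert(:auth_challenges, {conn_pid, challenge, ref})
  challenge
end
```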
### 6.5 NIP-98 HTTP Auth
**Verdict: ⚠️ Concern — token replay window**
- **Signature**: Properly verified via same `EventValidator.validate/1` path ✅
- **URL binding**: Exact method + URL match required ✅
- **Freshness**: 60-second window ✅
- **🔴 Token reuse**: Signed NIP-98 events are NOT invalidated after use. Within the 60-second window, the same signed event can be replayed on the same endpoint.
- **Remediation**: Track used event IDs in a TTL-expiring ETS table or similar. Reject events with previously-seen IDs.
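The suggested replay defence might look like this sketch (module, table name, and sweep cadence are assumptions): remember each accepted NIP-98 event ID for the 60-second freshness window and reject anything seen before.

```elixir
# Hypothetical replay guard: insert_new/2 is atomic, so the first caller
# wins and every repeat within the TTL is flagged as a replay.
defmodule Nip98Replay do
  @table :nip98_seen
  @ttl 60

  def seen_before?(event_id) do
    expires_at = System.system_time(:second) + @ttl

    case :ets.insert_new(@table, {event_id, expires_at}) do
      true -> false   # first use — remembered
      false -> true   # already present — reject as replay
    end
  end

  # Run periodically (e.g. from a timer) to evict expired entries.
  def sweep do
    now = System.system_time(:second)
    :ets.select_delete(@table, [{{:_, :"$1"}, [{:<, :"$1", now}], [true]}])
  end
end
```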
### 6.6 Injection Surfaces
**Verdict: ✅ Sound**
- All filter tag values parameterised through Ecto
- LIKE patterns properly escaped via `escape_like_pattern/1` (`%`, `_`, `\` escaped)
- Event field validation is comprehensive:
- IDs/pubkeys: 64-char lowercase hex (32 bytes)
- Signatures: 128-char lowercase hex (64 bytes)
- Kinds: integer 0-65535
- Tags: validated as non-empty string arrays
### 6.7 Other Security Notes
- **No `IO.inspect` of sensitive data**: Only 2 uses, both for config logging
- **No hardcoded secrets**: All credentials from environment variables
- **Binary frames rejected**: WebSocket handler only accepts text frames
- **Subscription ID validated**: Non-empty, max 64 chars
---
## 7. Overall Assessment
Parrhesia is a **well-engineered Nostr relay** that demonstrates expert-level Elixir/OTP knowledge and careful attention to the Nostr protocol. The codebase is clean, layered, and maintainable.
### Strengths
- **Exemplary OTP architecture**: Zero unsafe practices; proper supervision, monitoring, and crash isolation throughout
- **Clean separation of concerns**: Transport → Protocol → API → Policy → Storage layers with no bleed-through
- **Adapter pattern for storage**: PostgreSQL and in-memory backends are interchangeable — excellent for testing
- **Comprehensive protocol support**: 13 of 18 claimed NIPs fully implemented
- **Strong input validation**: All event fields rigorously validated; SQL injection surface is nil
- **Signature-first validation**: Schnorr verification occurs before any persistence — correct ordering
- **Rich telemetry**: Queue depth, pressure, overflow, ingest/query/fanout latency all instrumented
- **Production-ready partitioning**: Monthly partitions with automated lifecycle management
- **Extensive test coverage**: Unit, integration, and E2E test suites with property-based testing
### Weaknesses
- **No global load protection**: Per-connection limits exist but no relay-wide circuit breakers
- **Synchronous fanout**: Publisher blocks on subscription delivery — could bottleneck under high cardinality
- **NIP-50 search is ILIKE-based**: Will not scale; needs tsvector/GIN for production workloads
- **Misleading NIP claims**: NIP-66 is listed but not implemented; NIP-43 is partial yet claimed in full
- **NIP-98 token replay**: 60-second replay window without event ID deduplication
- **Unbounded tag arrays**: No limit on tags per event or tag values per filter
---
## 8. Prioritised Remediation List
### 🔴 Critical (address before production)
| # | Issue | Impact | Fix |
|---|-------|--------|-----|
| 1 | **Unbounded tag arrays** | Event with 10k tags causes expensive DB inserts and index bloat; filter with 1000 tag values generates pathologically expensive queries | Add `max_tags_per_event` (e.g., 2000) and `max_tag_values_per_filter` (e.g., 256) to config and enforce in `event_validator.ex` and `filter.ex` |
| 2 | **No global rate limiting** | No relay-wide cap on event throughput; many connections can overwhelm DB pool | Add global event counter (ETS atomic) with configurable ceiling; return `rate-limited:` when exceeded |
| 3 | **NIP-98 token replay** | Signed management API tokens replayable within 60s window | Track used NIP-98 event IDs in TTL-expiring store; reject duplicates in `auth/nip98.ex` |
### ⚠️ Important (address before scale)
| # | Issue | Impact | Fix |
|---|-------|--------|-----|
| 4 | **Synchronous fanout** | Publisher blocks while iterating all matching subscriptions; latency spikes under high subscription cardinality | Decouple `fanout_event/1` into async dispatch via `Task.Supervisor` or dedicated GenServer pool |
| 5 | **No connection cap** | Unlimited WebSocket connections accepted up to OS limits | Set `max_connections` in Bandit listener options; document reverse-proxy requirement |
| 6 | **NIP-50 FTS is ILIKE** | Linear scan of content column; unacceptable at scale | Add `tsvector` column + GIN index; update query to use `@@` operator |
| 7 | **Ghost NIP-66** | Claimed but unimplemented; misleads clients querying relay info | Remove 66 from `supported_nips` in `relay_info.ex` |
| 8 | **No DB pool separation** | Writes can starve reads under heavy ingest | Consider separate Ecto repo for reads, or use Ecto's `:queue_target` tuning |
| 9 | **Missing CHECK constraints** | Binary fields (`id`, `pubkey`, `sig`) lack length constraints at DB level | Add migration with `CHECK (octet_length(id) = 32)` etc. |
### 💡 Nice to Have
| # | Issue | Impact | Fix |
|---|-------|--------|-----|
| 10 | **NIP-43 partial** | Join/leave mechanics unimplemented | Document as partial or implement kinds 28934-28936 |
| 11 | **Signature verification feature flag** | Could be disabled in production config | Add startup assertion or compile-time check |
| 12 | **Mailbox depth monitoring** | No visibility into subscription process queue sizes | Add periodic `Process.info/2` telemetry for dynamic children |
| 13 | **Graceful shutdown drain** | No flush of queued events before connection termination | Add drain timer in listener reload/shutdown path |
| 14 | **Replaceable state race** | SELECT→INSERT TOCTOU in state tables | Refactor to native `ON CONFLICT DO UPDATE` |
| 15 | **Connection handler size** | `web/connection.ex` is large | Extract message handlers into sub-modules for maintainability |
---
*End of audit. Generated 2026-03-17.*