self e84840b9d0 docs: clarify Windows daemon posture

Document the current Windows named-pipe daemon as a trusted single-user GUI/session bridge rather than a Windows Service.

2026-06-26 04:40:09 +02:00

Legion Architecture

Status: current overview, last checked 2026-06-26. Target reading time: under 10 minutes.

Legion is an Electron desktop app and headless CLI/daemon for managing Tribes cloud infrastructure. The central design constraint is local-first control: state is encrypted on the operator machine, provider APIs and filesystem access stay in the main process, and the renderer is a client of a typed preload API.

Mermaid diagram sources live under docs/diagrams.

High-level Shape

The main runtime is composed in src/engine/runtime.ts by createLegionEngine(). Electron, the CLI, and the local daemon all use this same engine surface instead of maintaining separate business logic.

The major layers are:

src/renderer/src: Svelte 5 GUI, view state, forms, i18n, and user intents.
src/preload: the only renderer-to-main bridge.
src/shared: contracts shared by main, preload, renderer, CLI, and tests.
src/main: encrypted state, provider adapters, provisioning, deployment, observation, logs, terminals, and recovery supervisors.
src/engine: reusable engine API, CLI entry wiring, daemon protocol, paths, and daemon host.

Generated OpenAPI clients live under src/client/** and should be treated as generated code.

State Model

Legion separates planned intent from tracked and observed reality:

planned resources live in the local scheme and are manipulated by planning actions such as server/domain/DNS plan upserts;
tracked resources are the resources Legion believes it owns or manages;
actual state is refreshed from provider APIs and used for actor state, cleanup evidence, billing, and materialization decisions.

Resource actors, not the encrypted state store, are the read model for Legion's beliefs about managed resources. Lifecycle state, health, current operation, materialization substage, and last error come from ResourceSupervisor.getSnapshot(). The app snapshot assembler in src/main/app-snapshot.ts joins those actor beliefs with raw desired, tracked, and observed records plus pure field diffs. Reads do not create actors; if stored records are visible before ResourceSupervisor.syncFromState() creates the actor, the assembler emits an unknown row until the actor snapshot arrives.
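The join can be pictured with a minimal sketch. The types and the assembleRow function below are hypothetical and are not the contents of src/main/app-snapshot.ts; the sketch only illustrates a read that never creates an actor and falls back to an unknown row when no actor snapshot exists yet.

```typescript
// Hypothetical shapes; the real contracts live in src/shared and may differ.
type Lifecycle = "unknown" | "planned" | "provisioning" | "ready" | "error";

interface ActorSnapshot {
  lifecycle: Lifecycle;
  lastError: string | undefined;
}

interface StoredRecords {
  desired?: Record<string, unknown>;
  tracked?: Record<string, unknown>;
  observed?: Record<string, unknown>;
}

interface ResourceRow extends StoredRecords {
  id: string;
  lifecycle: Lifecycle;
  lastError: string | undefined;
  changedFields: string[];
}

// Pure field diff: names of desired fields whose observed value differs.
function diffFields(
  desired: Record<string, unknown> = {},
  observed: Record<string, unknown> = {},
): string[] {
  return Object.keys(desired).filter((key) => desired[key] !== observed[key]);
}

// Read-only join: a missing actor yields an "unknown" row; the map is never written.
function assembleRow(
  id: string,
  records: StoredRecords,
  actors: ReadonlyMap<string, ActorSnapshot>,
): ResourceRow {
  const actor = actors.get(id);
  return {
    id,
    ...records,
    lifecycle: actor ? actor.lifecycle : "unknown",
    lastError: actor ? actor.lastError : undefined,
    changedFields: diffFields(records.desired, records.observed),
  };
}
```

Taking the actor map as ReadonlyMap keeps the read path from creating actors as a side effect.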

The encrypted local state implementation is in src/main/state/store.ts, wrapped by PersistenceSupervisor in src/main/state/persistence-supervisor.ts. State files are encrypted with an Argon2id-derived key and authenticated encryption in src/main/state/crypto.ts. Unlocking loads decrypted state into the running process; logout drops the in-memory state and password.

Materialization is intentionally explicit. Editing a plan does not mutate cloud resources. The operator must run materialization, which reconciles the plan through the provisioning and node deployment pipeline.

Runtime Composition

createLegionEngine() wires the following long-lived services:

PersistenceSupervisor and SecureStateStore for local encrypted state.
ProviderCatalogStore for embedded and refreshed provider catalogs.
ResourceSupervisor for daemon-scoped resource actors keyed by resource type and identity.
ProviderBroker for serializing API calls to the same provider and for provider status snapshots.
ProviderPersistence for provider refresh/adoption writes into the state store.
EngineOps for the thin apply, refresh, and server-command engine API.
NodeDeploymentRuntime for managed-node install/reinstall/reconfigure/NBDE operations.
ObservationScheduler for periodic provider/catalog/telemetry/log/node observation/repair polling.
ClusterTelemetryCollector for cluster telemetry inventory and rollups.
StreamingSupervisor for Sender stream session lifecycle and endpoint monitoring.
LegionLogStore and ClusterLogBackfillCollector for local and imported cluster logs.
ServerTerminalManager for SSH terminal sessions.

Most public engine methods call assertUnlocked() before mutating or reading encrypted operational state. Catalog lookup and credential verification can run without a fully unlocked plan when explicit credentials are supplied.
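A guard of this kind might look like the sketch below. Only the name assertUnlocked() comes from this document; the class, the error type, and the state shape are assumptions made for illustration.

```typescript
// Hypothetical engine slice; only assertUnlocked() is named in the text above.
class LockedError extends Error {
  constructor() {
    super("Legion state is locked; unlock before calling this method");
    this.name = "LockedError";
  }
}

interface DecryptedState {
  servers: string[];
}

class EngineSketch {
  private state: DecryptedState | null = null;

  unlock(decrypted: DecryptedState): void {
    this.state = decrypted;
  }

  logout(): void {
    this.state = null; // drop the decrypted state from memory
  }

  // Returns the unlocked state or throws, so callers cannot skip the check.
  private assertUnlocked(): DecryptedState {
    if (this.state === null) throw new LockedError();
    return this.state;
  }

  listServers(): string[] {
    return [...this.assertUnlocked().servers];
  }
}
```

Returning the state from the guard, rather than only throwing, means every method that needs decrypted data has to pass through it.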

GUI Boundary

The GUI is a Svelte 5 renderer. It talks to main only through the window.api surface exposed in src/preload/index.ts; IPC channel and payload types live in src/shared/channels.ts.

Renderer state is centered on src/renderer/src/state/app-controller.svelte.ts. The controller reduces intents into view state and runs effects through the preload API. The renderer should not import main-process code, perform provider calls, read/write local state files, or bypass preload.

Commands flow from the GUI to the engine and return only domain results or errors; application state flows back exclusively through state:snapshot and activity:update events. bootstrap, unlock, and logout are the lifecycle exception because auth state is outside AppSnapshot and must be returned synchronously.
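The split between command results and state events can be sketched with an in-memory stand-in. All names below (ApiSketch, planServer, onSnapshot, and the snapshot fields) are hypothetical; the real channels and payload types live in src/shared/channels.ts.

```typescript
// Hypothetical API slice, not the actual window.api surface.
interface AppSnapshotSketch {
  revision: number;
  plannedServers: string[];
}

type CommandResult = { ok: true } | { ok: false; error: string };

interface ApiSketch {
  planServer(name: string): Promise<CommandResult>;
  onSnapshot(listener: (snapshot: AppSnapshotSketch) => void): () => void;
}

// In-memory stand-in for main: commands return a result, state travels as events.
function createApiSketch(): ApiSketch {
  let snapshot: AppSnapshotSketch = { revision: 0, plannedServers: [] };
  const listeners = new Set<(snapshot: AppSnapshotSketch) => void>();
  return {
    async planServer(name: string): Promise<CommandResult> {
      if (name.trim() === "") return { ok: false, error: "name is required" };
      snapshot = {
        revision: snapshot.revision + 1,
        plannedServers: [...snapshot.plannedServers, name],
      };
      listeners.forEach((listener) => listener(snapshot)); // state:snapshot analogue
      return { ok: true }; // the command result carries no application state
    },
    onSnapshot(listener: (snapshot: AppSnapshotSketch) => void): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
  };
}
```

A renderer controller built on this shape would update view state only inside the onSnapshot listener and use the command result just to surface errors.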

User-facing text is routed through the renderer i18n layer, and UI styling is DaisyUI-first with reusable component-level classes in src/renderer/src/assets/main.css when local abstraction is needed.

Providers and Catalogs

Provider-specific API behavior lives in src/main/cloud/providers/*.ts; current integrated compute providers are Hetzner, OVH, and Scaleway. Provider registration and capabilities are centralized in src/main/cloud/providers/registry.ts.

The shared orchestration code asks the registry for an adapter instead of calling provider modules directly. Provider resilience logic for ambiguous mutations lives in src/main/cloud/providers/api-resilience.ts: when a provider call fails ambiguously, Legion re-reads provider state and accepts success only when the desired resource state is visible.
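The read-back rule for ambiguous failures can be sketched as a small wrapper. The helper name, its option names, and the way ambiguity is detected are assumptions; nothing here is taken from api-resilience.ts.

```typescript
// Hypothetical helper illustrating "re-read, then accept success only if visible".
interface ReadBackOptions<T> {
  mutate: () => Promise<T>; // the provider mutation
  readBack: () => Promise<T | undefined>; // fresh read of provider state
  isDesired: (observed: T) => boolean; // does the read show the intended result?
  isAmbiguous: (error: unknown) => boolean; // e.g. timeout or dropped connection
}

async function mutateWithReadBack<T>(options: ReadBackOptions<T>): Promise<T> {
  try {
    return await options.mutate();
  } catch (error) {
    // Definite failures (validation, auth) are rethrown without a re-read.
    if (!options.isAmbiguous(error)) throw error;
    // The call may have succeeded on the provider side, so look before deciding.
    const observed = await options.readBack();
    if (observed !== undefined && options.isDesired(observed)) return observed;
    throw error;
  }
}
```

The original error is rethrown when the desired state is not visible, so callers still see the provider's own failure rather than a synthetic one.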

Default catalog choices and deployment profiles live in src/shared/provider-catalog-policy.ts and src/shared/provider-deployment-policy.ts.

Resource and XState Actors

Legion's actor model is centered on daemon-scoped resource actors owned by src/main/resources/base/supervisor.ts. The rule is one actor type per real domain entity, with shared behavior imported as libraries instead of a base machine. Current actor types are:

ServerNodeActor for cluster nodes managed by Legion;
the same node actor, flavor-tagged as satellite, for simpler externally reachable nodes that still have cloud lifecycle, SSH/bootstrap, and HTTP health;
DomainActor, DnsZoneActor, and FirewallActor for the provider resources each actually owns.

DNS hosts are not resource actors; they are Legion synthesis inputs that feed DNS zone record reconciliation. Provider SSH keys and floating IPs are server lifecycle side effects, not actors; they are handled by provider broker calls during server provisioning/deletion. Each actor owns the compact facts for that resource:

desired facts from the local scheme;
runtime/tracked facts for cluster nodes and satellites;
actual provider-observation facts for observed provider resources;
current materialization status for the resource, including server substages derived from persisted node execution phases;
node observation facts for cluster node actors.

The resource actors intentionally store summarized facts rather than full provider credentials, SSH keys, or decrypted state records. They are stopped when state is locked or the resource disappears from unlocked state. This makes them the daemon's main place to answer "what does Legion currently believe about this resource, and why?", while keeping secrets inside the existing state store and operation code. Actor snapshot changes are publication-driving: the supervisor subscribes to each actor and coalesces snapshot publication per microtask, so UI updates are not dependent on state-store writes.
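Per-microtask coalescing can be illustrated as follows. The class is a hypothetical stand-in, not the supervisor's code; it only shows how several actor notifications in one synchronous burst collapse into a single publication.

```typescript
// Hypothetical publisher; illustrates per-microtask coalescing only.
class CoalescingPublisher<T> {
  private pending = new Map<string, T>();
  private scheduled = false;
  private readonly publish: (batch: Map<string, T>) => void;

  constructor(publish: (batch: Map<string, T>) => void) {
    this.publish = publish;
  }

  // Called from each actor subscription; a later snapshot replaces an earlier one.
  notify(actorId: string, snapshot: T): void {
    this.pending.set(actorId, snapshot);
    if (this.scheduled) return;
    this.scheduled = true;
    // Promise.resolve().then(...) schedules a microtask, like queueMicrotask().
    void Promise.resolve().then(() => this.flush());
  }

  private flush(): void {
    const batch = this.pending;
    this.pending = new Map<string, T>();
    this.scheduled = false;
    this.publish(batch);
  }
}
```

Because the flush runs in a microtask, it happens after the current synchronous burst of actor transitions but before timers or I/O callbacks.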

EngineOps no longer owns a workflow state machine or a dependency graph. materializeCurrentScheme() implements apply in four steps: it prepares an operation log, asks ResourceSupervisor to ensure actors for the committed plan, sends SYNC waves to those actors, and watches their snapshots until resources settle. At most one apply may run at a time. Provider refresh and server commands do not use that mutex: provider calls are serialized by ProviderBroker, and resource actors serialize their own operations.
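A single-apply guard could be as small as the sketch below. This document does not say whether a second apply is rejected or queued; the sketch assumes rejection, and the class name is hypothetical.

```typescript
// Hypothetical guard; assumes a second concurrent apply is rejected, not queued.
class ApplyGuard {
  private running = false;

  async run<T>(apply: () => Promise<T>): Promise<T> {
    if (this.running) throw new Error("an apply is already running");
    this.running = true;
    try {
      return await apply();
    } finally {
      this.running = false; // released on success and on failure
    }
  }
}
```

Refresh and server commands would simply not go through this guard, which matches the statement that they rely on the broker and the actors for their own serialization.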

Activity snapshots are managed by a small ActivityEmitter owned by EngineOps, but activity is derived from the current apply/command and resource snapshots. The app does not push invented progress percentages from orchestration steps. Materialization progress is a view over actor snapshots: machine/system states, entity states, deployment substages persisted by the deployment runtime, and waiting details such as DNS zones waiting for server endpoints.

Derived resources use the same actor command. Firewall rules and DNS records synthesized from tracked server endpoints are derived from persisted state, so a state-store change event feeds ResourceSupervisor. The supervisor syncs actors from state, sends SYNC to affected actors, and actors no-op when prerequisite facts are not present. DNS hosts are display-only synthesis inputs; they are not actors and they do not gate reconciliation. Server endpoint discovery is persisted as tracked-server state, so DNS-zone actors read endpoints from the blackboard state store and converge on the next state-change or apply wave.

Cluster server nodes and internally provisioned satellites use the unified node actor. The actor has parallel machine and system regions; the XState region value is the source of truth, and snapshots derive machine/system from the actor state rather than from mirrored context fields. The actor owns provider observation, health checks, tracked server actions, state-read reconciliation, provision/install/delete, planned-absence teardown, and current operation labels. Cluster nodes add SSH health, node observation polling, cluster install, and NBDE/cluster repair behavior. A single managed server is a one-node cluster; there is no separate single-node mode. Satellites use the same actor with the shorter satellite setup path, but they are not exposed as a direct GUI or CLI node type. Node provider wire calls still go through ProviderBroker, which queues provider refresh/mutation work per provider and exposes plain provider status records.
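Per-provider queueing can be pictured as one promise chain per provider. This is a sketch under that assumption, not ProviderBroker's implementation; the class and method names are hypothetical.

```typescript
// Hypothetical queue: calls to the same provider run in order, different providers do not block each other.
class ProviderQueueSketch {
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(provider: string, task: () => Promise<T>): Promise<T> {
    const tail: Promise<unknown> = this.tails.get(provider) ?? Promise.resolve();
    // Start the task only after the previous task for this provider has settled.
    const result = tail.then(() => task());
    // Store a never-rejecting tail so one failure does not poison the queue.
    this.tails.set(provider, result.catch(() => undefined));
    return result;
  }
}
```

The caller still receives the task's own result or rejection; only the internal tail swallows errors.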

Domain, DNS-zone, and firewall actors are separate machines. Their SYNC events carry only optional run context; the actors resolve domain plans, DNS-zone plans and synthesized records, or firewall plans from the state store. Each actor invokes only its own behavior module: domainReconciler, dnsZoneReconciler, or firewallReconciler. EngineOps wires provider and persistence services once, but no longer routes desired resource specs through a generic dispatcher.

Managed cluster firewalls are owned by one firewall-set actor during reconciliation. The desired firewall set is still derived from cluster node state by buildClusterFirewallPlan(state). The state store emits change events after successful mutations; ResourceSupervisor subscribes to those events, syncs resource actors from state, and coalesces relevant server/provider/scheme changes into SYNC intents for FirewallActor. The actor's firewall reconciler owns provider inventory reads, provider adapter upsert/delete calls, operation logging, stale binding cleanup, and its provider-applied firewall cache. If node endpoint facts change while provider calls are in flight, the actor applies the latest full firewall set before reporting convergence. EngineOps, deployment phases, and repair sweep no longer call firewall reconciliation explicitly.

DNS zones can be reached from apply waves and from reactive state changes. DnsZoneActor owns idempotency by recording the last applied state-derived signature, so repeated events for the same zone complete as no-ops without a supervisor-level signature cache. When endpoint-based records cannot yet be resolved, the actor stays in its current state and exposes the waiting reason as a snapshot detail instead of treating the wait as an error.
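Signature-based idempotency can be sketched as follows. The record shape and the way the signature is built are assumptions; this document does not describe the real signature inputs.

```typescript
// Hypothetical record shape and signature; illustrative only.
interface DnsRecordSketch {
  name: string;
  type: string;
  value: string;
}

// Order-independent signature of the state-derived record set.
function signatureOf(records: DnsRecordSketch[]): string {
  return records
    .map((record) => `${record.type} ${record.name} ${record.value}`)
    .sort()
    .join("\n");
}

class DnsZoneSketch {
  private lastApplied: string | null = null;
  applyCount = 0;

  // Returns true when reconciliation ran, false when the event was a no-op.
  sync(records: DnsRecordSketch[]): boolean {
    const signature = signatureOf(records);
    if (signature === this.lastApplied) return false;
    this.applyCount += 1; // stand-in for the provider reconciliation call
    this.lastApplied = signature; // recorded only after the apply step
    return true;
  }
}
```

Keeping the signature inside the actor is what lets both apply waves and reactive state changes send SYNC freely without a supervisor-level cache.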

src/main/deployment/runtime.ts defines the durable deployment phase machine used by cluster node operations. It handles add, reinstall, Tribes reconfigure, NBDE promotion, and NBDE reconcile. Install-like operations route through phases such as provider provisioning, kexec, OS install, installed-system reboot, certificate activation, post-install Tribes configuration, readiness wait, cluster reconciliation, publication-barrier clear, and optional NBDE reconciliation.

Cluster node actors invoke the runtime through behavior services: cloud provisioning is split from cluster bootstrap so the actor can persist the tracked server before install; that state mutation is the source for reactive DNS and firewall convergence. Reinstall first asks supported providers to reset the tracked server to a current provider image: the runtime reuses the tracked image id only if it is still present in the latest catalog, otherwise it runs the normal catalog selector and then executes the same kexec/install phases.

NodeDeploymentRuntime owns the durable phase machine, phase services, persistence, metrics, and CLI facade. Resource snapshots derive live substages from the persisted execution.phase record, which remains the crash-recovery unit for long-running installs.

Node execution phases are persisted unless tests inject a delay hook. On startup, NodeDeploymentRuntime.initialize() marks previously running node operations as interrupted and retryable, so the operator can resume from a known phase instead of silently assuming success.
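The startup rule can be sketched as a pure function over persisted execution records. The record shape, status values, and function name are hypothetical; only the behavior (running becomes interrupted and retryable) comes from the text above.

```typescript
// Hypothetical record shape; field and status names are illustrative.
type ExecutionStatus = "running" | "interrupted" | "succeeded" | "failed";

interface NodeExecutionSketch {
  nodeId: string;
  phase: string;
  status: ExecutionStatus;
  retryable: boolean;
}

// At startup no operation can still be running in this process, so a persisted
// "running" status means the previous process stopped in the middle of that phase.
function markInterrupted(executions: NodeExecutionSketch[]): NodeExecutionSketch[] {
  return executions.map((execution): NodeExecutionSketch =>
    execution.status === "running"
      ? { ...execution, status: "interrupted", retryable: true }
      : execution,
  );
}
```

The phase field is left untouched, which is what allows a retry to resume from the last known phase.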

ObservationScheduler ticks cluster node actors for node observation, and those actors invoke resources/node/node-observation-collector.ts for their own node. Satellites do not run node observation polling; their current lightweight health path is HTTP probing. Optional XState JSONL tracing is enabled with LEGION_TRACE_XSTATE=1; traces are written under the local log directory.

Recovery and Self-healing

Legion currently favors conservative, explainable recovery over broad automatic mutation.

Observation is split by concern:

ObservationScheduler schedules cloud/provider refresh, catalog refresh, telemetry inventory, telemetry rollups, cluster log polling, node observation, and background repairs.
Cluster node actors poll managed nodes and classify local node readiness, stale metrics, service health, and admin API reachability. Satellites use HTTP health probing.
ClusterLogBackfillCollector backfills cluster logs into the local LegionLogStore. Failure and recovery state is exposed through resource actor snapshots and tracked server execution records.

ObservationScheduler runs the repair tick every five minutes after unlock and records repair status for UI snapshots. ResourceSupervisor.runRepairs() owns repair selection and dispatches per-cluster-node RUN_REPAIR events where the repair is server-scoped. Current safe repairs are:

reconcile NBDE through affected cluster node actors when local boot keys, missing reconciliation timestamps, or quorum/Tang peer state require repair;
reconfigure cluster membership when tracked cluster node public keys diverge from stored cluster membership and no node operation is running.
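The membership check in the second repair could look roughly like this. The data shapes are hypothetical, and the sketch assumes that any running node operation defers the repair, which is one reading of the condition above.

```typescript
// Hypothetical shapes; illustrates the divergence check only.
interface TrackedNodeSketch {
  id: string;
  publicKey: string;
  operationRunning: boolean;
}

// membership maps node id to the public key stored in cluster membership.
function needsMembershipReconfigure(
  nodes: TrackedNodeSketch[],
  membership: ReadonlyMap<string, string>,
): boolean {
  // Assumption: a running node operation defers the repair to a later tick.
  if (nodes.some((node) => node.operationRunning)) return false;
  if (nodes.length !== membership.size) return true;
  return nodes.some((node) => membership.get(node.id) !== node.publicKey);
}
```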

Materialization uses the same per-node repair dispatch for cluster-derived follow-up work: after the apply watcher settles, ResourceSupervisor computes affected NBDE and cluster target IDs, sends RUN_REPAIR intents to those cluster node actors, and lets each actor call the deployment runtime for its own node. The supervisor owns target selection; cluster node actors own per-node operation state.

Provider cleanup is not handled by a separate recovery sidecar. The current model keeps cleanup grounded in planned/tracked/actual resources, the billing ledger, resource actor state, and operation logs.

Managed Node Security

Managed node bootstrap uses deployment services in src/main/deployment/service.ts. The pipeline creates or derives node identity, cluster bootstrap material, certificates, system admin credentials, and NBDE policy. Internal Tribes sync is deployed on TCP 4413 using the contract in docs/sync-mtls-handoff.md: Legion provides sync URLs, certificate material, and provider firewall rules so peer sync is restricted to cluster peers.

NBDE policy is cluster-derived in src/main/cluster/derived-policy.ts and documented in docs/nbde.md. Small clusters use degraded local unlock; larger healthy managed clusters move to distributed Tang and then quorum unlock.

CLI and Daemon

The CLI entry path is in src/engine/cli-main.ts and src/main/cli.ts. Headless node operations are exposed through src/main/node-cli-service.ts.

Daemon mode is implemented in src/engine/daemon.ts. It runs a local JSON-line RPC server over a Unix domain socket, or over a local Windows named pipe. The endpoint is derived from the state file path. Daemon metadata records only version, endpoint, PID, and creation time; it deliberately does not store the token.
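JSON-line framing means one JSON document per newline-terminated line. The sketch below shows only that framing; the class and function names are hypothetical, and it makes no claim about the daemon's actual message fields.

```typescript
// Hypothetical framing helpers for a newline-delimited JSON stream.
class JsonLineDecoder {
  private buffer = "";

  // Feed a chunk of stream data; returns every complete message it contains.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line, or "" if the chunk ended on a newline.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line): unknown => JSON.parse(line));
  }
}

function encodeJsonLine(message: unknown): string {
  // JSON.stringify escapes newlines inside strings, so each message stays on one line.
  return JSON.stringify(message) + "\n";
}
```

Buffering the trailing partial line matters because socket and pipe reads can split a message at any byte boundary.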

The daemon is intended as a local same-user process accelerator. It lets CLI commands reuse one already-initialized engine instead of repeatedly starting and unlocking state.

Daemon Security Posture

The daemon does not require the unlock password on every request.

The security split is:

the unlock password protects encrypted state at rest and is needed to unlock the engine;
the daemon token authorizes requests to the already-running daemon process.

On daemon start, Legion generates a random 32-byte hex token unless one is supplied explicitly. Clients must pass the token on every daemon request: status, shutdown, calls, and subscriptions. Invalid tokens receive an error and the connection is closed.
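The per-request check can be sketched as a single gate that every method passes through. The request and reply shapes and the function names are assumptions, not the daemon protocol defined in src/engine.

```typescript
// Hypothetical request handling; shapes are illustrative.
interface DaemonRequestSketch {
  token?: string;
  method: "status" | "shutdown" | "call" | "subscribe";
}

interface DaemonReplySketch {
  ok: boolean;
  error?: string;
  closeConnection: boolean;
}

// Compares every character instead of returning at the first mismatch.
// In Node, crypto.timingSafeEqual is the usual choice for this comparison.
function tokensMatch(expected: string, supplied: string): boolean {
  if (expected.length !== supplied.length) return false;
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= expected.charCodeAt(i) ^ supplied.charCodeAt(i);
  }
  return diff === 0;
}

// Every method goes through the same check before it is dispatched.
function authorize(expectedToken: string, request: DaemonRequestSketch): DaemonReplySketch {
  if (typeof request.token !== "string" || !tokensMatch(expectedToken, request.token)) {
    return { ok: false, error: "invalid daemon token", closeConnection: true };
  }
  return { ok: true, closeConnection: false };
}
```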

Local endpoint hardening is part of the daemon boundary. On Unix-like hosts:

the daemon metadata directory is chmod 0700;
the metadata file is written chmod 0600;
the Unix socket is chmod 0600.

On Windows, named-pipe mode is intended for trusted, single-user GUI sessions and same-user CLI access. It is not installed or supervised as a Windows Service.

Holding LEGION_DAEMON_TOKEN or passing --daemon-token is therefore equivalent to controlling the live unlocked Legion process for the exposed CLI or engine methods. The token is not a replacement for state encryption, and the unlock password is not used as per-request authorization. If the daemon is unlocked, a valid token holder can perform unlocked operations until logout, daemon shutdown, or process exit.

The practical posture is suitable for local same-user automation, not for a multi-user network service. Do not expose the daemon socket or token to other users, shells, CI logs, or remote hosts.
