Legion Architecture
Status: current overview, last checked 2026-06-26. Target reading time: under 10 minutes.
Legion is an Electron desktop app and headless CLI/daemon for managing Tribes cloud infrastructure. The central design constraint is local-first control: state is encrypted on the operator machine, provider APIs and filesystem access stay in the main process, and the renderer is a client of a typed preload API.
Mermaid diagram sources live under docs/diagrams.
High-level Shape
The main runtime is composed in src/engine/runtime.ts by
createLegionEngine(). Electron, the CLI, and the local daemon all use this
same engine surface instead of maintaining separate business logic.
The major layers are:
- src/renderer/src: Svelte 5 GUI, view state, forms, i18n, and user intents.
- src/preload: the only renderer-to-main bridge.
- src/shared: contracts shared by main, preload, renderer, CLI, and tests.
- src/main: encrypted state, provider adapters, provisioning, deployment, observation, logs, terminals, and recovery supervisors.
- src/engine: reusable engine API, CLI entry wiring, daemon protocol, paths, and daemon host.
Generated OpenAPI clients live under src/client/** and should be treated as
generated code.
State Model
Legion separates planned intent from tracked and observed reality:
- planned resources live in the local scheme and are manipulated by planning actions such as server/domain/DNS plan upserts;
- tracked resources are the resources Legion believes it owns or manages;
- actual state is refreshed from provider APIs and used for actor state, cleanup evidence, billing, and materialization decisions.
Resource actors, not the encrypted state store, are the read model for Legion's
beliefs about managed resources. Lifecycle state, health, current operation,
materialization substage, and last error come from
ResourceSupervisor.getSnapshot(). The app snapshot assembler in
src/main/app-snapshot.ts joins those actor beliefs with raw desired, tracked,
and observed records plus pure field diffs. Reads do not create actors; if stored
records are visible before ResourceSupervisor.syncFromState() creates the
actor, the assembler emits an unknown row until the actor snapshot arrives.
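The join performed by the assembler can be sketched roughly as follows. This is a minimal illustration, not the code in src/main/app-snapshot.ts: the `ActorSnapshot`, `StoredRecord`, `Row`, and `assembleRows` names and all of their fields are hypothetical.

```typescript
// Hypothetical shapes; the real contracts live in src/shared and may differ.
type ActorSnapshot = { key: string; lifecycle: string; lastError: string | null };
type StoredRecord = { key: string; desired?: object; tracked?: object; observed?: object };
type Row = StoredRecord & { lifecycle: string; lastError: string | null };

// Join actor beliefs with stored records. Reading never creates an actor.
function assembleRows(records: StoredRecord[], actors: Map<string, ActorSnapshot>): Row[] {
  return records.map((record): Row => {
    const actor = actors.get(record.key);
    // The record is visible but no actor exists yet: emit an "unknown" row.
    if (actor === undefined) return { ...record, lifecycle: "unknown", lastError: null };
    return { ...record, lifecycle: actor.lifecycle, lastError: actor.lastError };
  });
}
```

The unknown row is a placeholder only; it is replaced once the actor publishes its first snapshot.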
The encrypted local state implementation is in src/main/state/store.ts, wrapped by
PersistenceSupervisor in src/main/state/persistence-supervisor.ts. State files are
encrypted with an Argon2id-derived key and authenticated encryption in
src/main/state/crypto.ts. Unlocking loads decrypted state into the running
process; logout drops the in-memory state and password.
Materialization is intentionally explicit. Editing a plan does not mutate cloud resources. The operator must run materialization, which reconciles the plan through the provisioning and node deployment pipeline.
Runtime Composition
createLegionEngine() wires the following long-lived services:
- PersistenceSupervisor and SecureStateStore for local encrypted state.
- ProviderCatalogStore for embedded and refreshed provider catalogs.
- ResourceSupervisor for daemon-scoped resource actors keyed by resource type and identity.
- ProviderBroker for per-provider serialization of provider API calls and provider status snapshots.
- ProviderPersistence for provider refresh/adoption writes into the state store.
- EngineOps for the thin apply, refresh, and server-command engine API.
- NodeDeploymentRuntime for managed-node install/reinstall/reconfigure/NBDE operations.
- ObservationScheduler for periodic provider/catalog/telemetry/log/node observation/repair polling.
- ClusterTelemetryCollector for cluster telemetry inventory and rollups.
- StreamingSupervisor for Sender stream session lifecycle and endpoint monitoring.
- LegionLogStore and ClusterLogBackfillCollector for local and imported cluster logs.
- ServerTerminalManager for SSH terminal sessions.
Most public engine methods call assertUnlocked() before mutating or reading
encrypted operational state. Catalog lookup and credential verification can run
without a fully unlocked plan when explicit credentials are supplied.
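A guard of this kind might look like the sketch below. Only `assertUnlocked()` is named in this document; the `EngineSketch` class, the `DecryptedState` shape, and the `listServers()` method are hypothetical stand-ins.

```typescript
// Hypothetical engine fragment, for illustration only.
type DecryptedState = { servers: string[] };

class EngineSketch {
  private state: DecryptedState | null = null;

  unlock(state: DecryptedState): void {
    this.state = state; // the real unlock derives a key and decrypts the state file
  }

  logout(): void {
    this.state = null; // drop the decrypted state from memory
  }

  // Guard for methods that read or mutate encrypted operational state.
  private assertUnlocked(): DecryptedState {
    if (this.state === null) throw new Error("engine is locked");
    return this.state;
  }

  listServers(): string[] {
    return this.assertUnlocked().servers.slice();
  }
}
```

Returning the state from the guard lets each public method obtain a non-null reference in one call.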
GUI Boundary
The GUI is a Svelte 5 renderer. It talks to main only through the window.api
surface exposed in src/preload/index.ts; IPC channel and payload types live in
src/shared/channels.ts.
Renderer state is centered on src/renderer/src/state/app-controller.svelte.ts.
The controller reduces intents into view state and runs effects through the
preload API. The renderer should not import main-process code, perform provider
calls, read/write local state files, or bypass preload.
Commands flow from the GUI to the engine and return only domain results or
errors; application state flows back exclusively through state:snapshot and
activity:update events. bootstrap, unlock, and logout are the lifecycle
exception because auth state is outside AppSnapshot and must be returned
synchronously.
User-facing text is routed through the renderer i18n layer, and UI styling is
DaisyUI-first with reusable component-level classes in
src/renderer/src/assets/main.css when local abstraction is needed.
Providers and Catalogs
Provider-specific API behavior lives in src/main/cloud/providers/*.ts; current
integrated compute providers are Hetzner, OVH, and Scaleway. Provider
registration and capabilities are centralized in src/main/cloud/providers/registry.ts.
The shared orchestration code asks the registry for an adapter instead of
calling provider modules directly. Provider resilience logic for ambiguous
mutations lives in src/main/cloud/providers/api-resilience.ts: when a provider call
fails ambiguously, Legion re-reads provider state and accepts success only when
the desired resource state is visible.
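The read-back rule for ambiguous failures can be sketched as a small wrapper. This is an assumption-laden illustration rather than the contents of api-resilience.ts: `mutateWithReadBack`, its parameters, and the `isAmbiguous` predicate are hypothetical.

```typescript
// Hypothetical helper; the real resilience logic may be structured differently.
async function mutateWithReadBack<T>(
  mutate: () => Promise<T>,
  readBack: () => Promise<T | undefined>,
  isAmbiguous: (error: unknown) => boolean,
): Promise<T> {
  try {
    return await mutate();
  } catch (error) {
    // Definite failures (for example a validation error) propagate unchanged.
    if (!isAmbiguous(error)) throw error;
    // Ambiguous failure: the provider may have applied the change anyway,
    // so re-read provider state before deciding.
    const visible = await readBack();
    if (visible !== undefined) return visible;
    throw error;
  }
}
```

Success is accepted only on evidence from the re-read, never inferred from the failed call itself.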
Default catalog choices and deployment profiles live in
src/shared/provider-catalog-policy.ts and
src/shared/provider-deployment-policy.ts.
Resource and XState Actors
Legion's actor model is centered on daemon-scoped resource actors owned by
src/main/resources/base/supervisor.ts. The rule is one actor type per real
domain entity, with shared behavior imported as libraries instead of a base
machine. Current actor types are:
- ServerNodeActor for cluster nodes managed by Legion;
- the same node actor, flavor-tagged as satellite, for simpler externally reachable nodes that still have cloud lifecycle, SSH/bootstrap, and HTTP health;
- DomainActor, DnsZoneActor, and FirewallActor for the provider resources each actually owns.
DNS hosts are not resource actors; they are Legion synthesis inputs that feed DNS zone record reconciliation. Provider SSH keys and floating IPs are server lifecycle side effects, not actors; they are handled by provider broker calls during server provisioning/deletion. Each actor owns the compact facts for that resource:
- desired facts from the local scheme;
- runtime/tracked facts for cluster nodes and satellites;
- actual provider-observation facts for observed provider resources;
- current materialization status for the resource, including server substages derived from persisted node execution phases;
- node observation facts for cluster node actors.
The resource actors intentionally store summarized facts rather than full provider credentials, SSH keys, or decrypted state records. They are stopped when state is locked or the resource disappears from unlocked state. This makes them the daemon's main place to answer "what does Legion currently believe about this resource, and why?", while keeping secrets inside the existing state store and operation code. Actor snapshot changes are publication-driving: the supervisor subscribes to each actor and coalesces snapshot publication per microtask, so UI updates are not dependent on state-store writes.
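Coalescing publication per microtask can be illustrated with the sketch below. The `CoalescingPublisher` class and its `notify` method are hypothetical; the supervisor's actual subscription code is not shown in this document.

```typescript
// Hypothetical publisher: batches snapshot changes raised in the same tick.
class CoalescingPublisher<T> {
  private pending = new Map<string, T>();
  private scheduled = false;
  private readonly publish: (batch: Map<string, T>) => void;

  constructor(publish: (batch: Map<string, T>) => void) {
    this.publish = publish;
  }

  notify(key: string, snapshot: T): void {
    this.pending.set(key, snapshot); // a newer snapshot replaces an older one
    if (this.scheduled) return;
    this.scheduled = true;
    // A callback on a resolved promise runs as a microtask, after the current call stack.
    void Promise.resolve().then(() => {
      const batch = this.pending;
      this.pending = new Map<string, T>();
      this.scheduled = false;
      this.publish(batch);
    });
  }
}
```

Several actor changes raised in one synchronous burst therefore produce a single publication carrying the latest snapshot per resource.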
EngineOps no longer owns a workflow state machine or a dependency graph.
materializeCurrentScheme() implements apply: it prepares an operation log, asks
ResourceSupervisor to ensure actors for the committed plan, sends SYNC waves
to those actors, and watches their snapshots until resources settle. Only one
apply may run at a time. Provider refresh and server commands are not subject to
that apply lock: provider calls are serialized by ProviderBroker, and resource
actors serialize their own operations.
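Per-provider serialization of the kind attributed to ProviderBroker can be pictured as one promise chain per provider. The sketch below is illustrative only; `SerialQueues` and `enqueue` are hypothetical names, and the real broker also tracks provider status.

```typescript
// Hypothetical per-provider serial queue.
class SerialQueues {
  private tails = new Map<string, Promise<unknown>>();

  // Runs `task` only after every earlier task for the same provider has settled.
  enqueue<T>(provider: string, task: () => Promise<T>): Promise<T> {
    const previous: Promise<unknown> = this.tails.get(provider) ?? Promise.resolve();
    const result = previous.then(() => task());
    // Store a never-rejecting tail so one failed call does not block later ones.
    this.tails.set(provider, result.catch(() => undefined));
    return result;
  }
}
```

Calls to different providers proceed independently, while calls to the same provider run strictly one after another.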
Activity snapshots are managed by a small ActivityEmitter owned by
EngineOps, but activity is derived from the current apply/command and resource
snapshots. The app does not push invented progress percentages from orchestration
steps. Materialization progress is a view over actor snapshots: machine/system
states, entity states, deployment substages persisted by the deployment runtime,
and waiting details such as DNS zones waiting for server endpoints.
Derived resources are driven by the same SYNC command. Firewall rules and DNS records
synthesized from tracked server endpoints are derived from persisted state, so a
state-store change event feeds ResourceSupervisor. The supervisor syncs actors
from state, sends SYNC to affected actors, and actors no-op when prerequisite
facts are not present. DNS hosts are display-only synthesis inputs; they are not
actors and they do not gate reconciliation. Server endpoint discovery is
persisted as tracked-server state, so DNS-zone actors read endpoints from the
blackboard state store and converge on the next state-change or apply wave.
Cluster server nodes and internally provisioned satellites use the unified node actor. The actor has
parallel machine and system regions; the XState region value is the source of
truth, and snapshots derive machine/system from the actor state rather than
from mirrored context fields. The actor owns provider observation, health checks,
tracked server actions, state-read reconciliation, provision/install/delete,
planned-absence teardown, and current operation labels. Cluster nodes add SSH
health, node observation polling, cluster install, and NBDE/cluster repair
behavior. A single managed server is a one-node cluster; there is no separate
single-node mode. Satellites use the same actor with the shorter satellite setup
path, but they are not exposed as a direct GUI or CLI node type.
Node provider wire calls still go through ProviderBroker, which queues provider
refresh/mutation work per provider and exposes plain provider status records.
Domain, DNS-zone, and firewall actors are separate machines. Their SYNC events
carry only optional run context; the actors resolve domain plans, DNS-zone plans
and synthesized records, or firewall plans from the state store. Each actor
invokes only its own behavior module: domainReconciler, dnsZoneReconciler,
or firewallReconciler. EngineOps wires provider and persistence services
once, but no longer routes desired resource specs through a generic dispatcher.
Managed cluster firewalls are owned by one firewall-set actor during reconciliation.
The desired firewall set is still derived from cluster node state by
buildClusterFirewallPlan(state). The state store emits change events after
successful mutations; ResourceSupervisor subscribes to those events, syncs
resource actors from state, and coalesces relevant server/provider/scheme
changes into SYNC intents for FirewallActor. The actor's firewall
reconciler owns provider inventory reads, provider adapter upsert/delete calls,
operation logging, stale binding cleanup, and its provider-applied firewall
cache. If node endpoint facts change while provider calls are in flight, the
actor applies the latest full firewall set before reporting convergence.
EngineOps, deployment phases, and repair sweep no longer call firewall
reconciliation explicitly.
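The "apply the latest full set before reporting convergence" behavior can be sketched as a latest-wins loop. This is not the real firewall reconciler: `LatestWinsReconciler`, `sync`, and `drain` are hypothetical, and the `apply` callback stands in for the provider upsert/delete calls.

```typescript
// Hypothetical "latest desired state wins" loop.
class LatestWinsReconciler<T> {
  private desired: T | undefined = undefined;
  private dirty = false;
  private running: Promise<void> | null = null;
  private readonly apply: (value: T) => Promise<void>;

  constructor(apply: (value: T) => Promise<void>) {
    this.apply = apply;
  }

  // Record the newest desired set. The returned promise settles only after
  // that set, or a newer one, has been applied.
  sync(value: T): Promise<void> {
    this.desired = value;
    this.dirty = true;
    if (this.running === null) this.running = this.drain();
    return this.running;
  }

  private async drain(): Promise<void> {
    await Promise.resolve(); // yield once so `running` is assigned before any work starts
    try {
      while (this.dirty) {
        this.dirty = false;
        await this.apply(this.desired as T);
      }
    } finally {
      this.running = null; // converged or failed; the next sync starts a new run
    }
  }
}
```

Intermediate sets that arrive while a provider call is in flight are skipped; only the most recent one is applied next.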
DNS zones can be reached from apply waves and from reactive state changes.
DnsZoneActor owns idempotency by recording the last applied state-derived
signature, so repeated events for the same zone complete as no-ops without a
supervisor-level signature cache. When endpoint-based records cannot yet be
resolved, the actor stays in its current state and exposes the waiting reason as
a snapshot detail instead of treating the wait as an error.
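Signature-based idempotency can be sketched as follows. The record shape, `signatureOf`, and `ZoneSyncSketch` are hypothetical; how DnsZoneActor actually computes its state-derived signature is not specified here.

```typescript
// Hypothetical DNS record shape and reconciler state.
type DnsRecord = { name: string; type: string; value: string };

// Build an order-independent signature for the desired records of one zone.
function signatureOf(records: DnsRecord[]): string {
  return records
    .map((record) => `${record.type} ${record.name} ${record.value}`)
    .sort()
    .join("\n");
}

class ZoneSyncSketch {
  private lastApplied: string | null = null;
  applyCount = 0;

  // Returns "applied" when provider work ran and "noop" when nothing changed.
  sync(desired: DnsRecord[]): "applied" | "noop" {
    const signature = signatureOf(desired);
    if (signature === this.lastApplied) return "noop";
    this.applyCount += 1; // stand-in for the real provider upsert/delete calls
    this.lastApplied = signature;
    return "applied";
  }
}
```

Because the signature lives with the zone's own state, repeated SYNC events from apply waves and state changes need no shared cache.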
src/main/deployment/runtime.ts defines the durable deployment phase machine
used by cluster node operations. It handles add, reinstall, Tribes reconfigure,
NBDE promotion, and NBDE reconcile. Install-like operations route through phases
such as provider provisioning, kexec, OS install, installed-system reboot,
certificate activation, post-install Tribes configuration, readiness wait,
cluster reconciliation, publication-barrier clear, and optional NBDE
reconciliation. Cluster node actors invoke the runtime through behavior services:
cloud provisioning is split from cluster bootstrap so the actor can persist the
tracked server before install; that state mutation is the source for reactive DNS
and firewall convergence.
Reinstall first asks supported providers to reset the tracked server to a
current provider image: the runtime reuses the tracked image id only if it is
still present in the latest catalog, otherwise it runs the normal catalog
selector and then executes the same kexec/install phases.
NodeDeploymentRuntime owns the durable phase machine, phase services,
persistence, metrics, and CLI facade. Resource snapshots derive live substages
from the persisted execution.phase record, which remains the crash-recovery
unit for long-running installs.
Node execution phases are persisted unless tests inject a delay hook. On startup,
NodeDeploymentRuntime.initialize() marks previously running node operations as
interrupted and retryable, so the operator can resume from a known phase instead
of silently assuming success.
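The startup recovery step can be illustrated with a pure function over persisted records. The `NodeExecution` shape, its status values, and `markInterrupted` are hypothetical; the real execution record carries more fields.

```typescript
// Hypothetical persisted execution record.
type NodeExecution = {
  nodeId: string;
  phase: string;
  status: "running" | "succeeded" | "failed" | "interrupted";
  retryable: boolean;
};

// At startup no operation can still be running, so a persisted "running"
// status means the process stopped in the middle of that phase.
function markInterrupted(executions: NodeExecution[]): NodeExecution[] {
  return executions.map((execution): NodeExecution =>
    execution.status === "running"
      ? { ...execution, status: "interrupted", retryable: true }
      : execution,
  );
}
```

The phase is left untouched, which is what allows a retry to resume from a known point.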
ObservationScheduler ticks cluster node actors for node observation, and those
actors invoke resources/node/node-observation-collector.ts for their own node.
Satellites do not run node observation polling; their current lightweight health
path is HTTP probing. Optional XState JSONL tracing is enabled with
LEGION_TRACE_XSTATE=1; traces are written under the local log directory.
Recovery and Self-healing
Legion currently favors conservative, explainable recovery over broad automatic mutation.
Observation is split by concern:
- ObservationScheduler schedules cloud/provider refresh, catalog refresh, telemetry inventory, telemetry rollups, cluster log polling, node observation, and background repairs.
- Cluster node actors poll managed nodes and classify local node readiness, stale metrics, service health, and admin API reachability. Satellites use HTTP health probing.
- ClusterLogBackfillCollector backfills cluster logs into the local LegionLogStore.
Failure and recovery state is exposed through resource actor snapshots and tracked server execution records.
ObservationScheduler runs the repair tick every five minutes after unlock and
records repair status for UI snapshots. ResourceSupervisor.runRepairs() owns
repair selection and dispatches per-cluster-node RUN_REPAIR events where the
repair is server-scoped. Current safe repairs are:
- reconcile NBDE through affected cluster node actors when local boot keys, missing reconciliation timestamps, or quorum/Tang peer state require repair;
- reconfigure cluster membership when tracked cluster node public keys diverge from stored cluster membership and no node operation is running.
Materialization uses the same per-node repair dispatch for cluster-derived
follow-up work: after the apply watcher settles, ResourceSupervisor computes
affected NBDE and cluster target IDs, sends RUN_REPAIR intents to those
cluster node actors, and lets each actor call the deployment runtime for its own
node. The supervisor owns target selection; cluster node actors own per-node
operation state.
Provider cleanup is not handled by a separate recovery sidecar. The current model keeps cleanup grounded in planned/tracked/actual resources, the billing ledger, resource actor state, and operation logs.
Managed Node Security
Managed node bootstrap uses deployment services in
src/main/deployment/service.ts. The pipeline creates or derives node
identity, cluster bootstrap material, certificates, system admin credentials,
and NBDE policy. Internal Tribes sync is deployed on TCP 4413 using the
contract in docs/sync-mtls-handoff.md: Legion provides sync URLs, certificate
material, and provider firewall rules so peer sync is restricted to cluster
peers.
NBDE policy is cluster-derived in src/main/cluster/derived-policy.ts and
documented in docs/nbde.md. Small clusters use degraded local unlock; larger
healthy managed clusters move to distributed Tang and then quorum unlock.
CLI and Daemon
The CLI entry path is in src/engine/cli-main.ts and src/main/cli.ts.
Headless node operations are exposed through src/main/node-cli-service.ts.
Daemon mode is implemented in src/engine/daemon.ts. It runs a local JSON-line
RPC server over a Unix domain socket, or over a local Windows named pipe. The
endpoint is derived from the state file path. Daemon metadata records only
version, endpoint, PID, and creation time; it deliberately does not store the
token.
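JSON-line framing means each message is one JSON document terminated by a newline, so a reader must buffer partial chunks. The sketch below shows that framing only; `JsonLineDecoder` and `encodeMessage` are hypothetical, and the actual request and response fields of the daemon protocol are not described here.

```typescript
// Hypothetical framing helpers for a newline-delimited JSON stream.
class JsonLineDecoder {
  private buffer = "";

  // Feed a raw chunk; returns every message completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for later
    return lines
      .filter((line) => line.trim() !== "")
      .map((line): unknown => JSON.parse(line));
  }
}

function encodeMessage(message: unknown): string {
  return JSON.stringify(message) + "\n";
}
```

JSON.stringify escapes newlines inside strings, so a newline on the wire always marks a message boundary.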
The daemon is intended as a local same-user process accelerator. It lets CLI commands reuse one already-initialized engine instead of repeatedly starting and unlocking state.
Daemon Security Posture
The daemon does not require the unlock password on every request.
The security split is:
- the unlock password protects encrypted state at rest and is needed to unlock the engine;
- the daemon token authorizes requests to the already-running daemon process.
On daemon start, Legion generates a random 32-byte hex token unless one is supplied explicitly. Clients must pass the token on every daemon request: status, shutdown, calls, and subscriptions. Invalid tokens receive an error and the connection is closed.
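Token generation and checking might look like the sketch below, which uses Node's `crypto` module. It assumes the token is 32 random bytes rendered as 64 hex characters, and it assumes a constant-time comparison; neither detail is confirmed by this document. The `generateDaemonToken`, `isValidToken`, and `authorize` names and the request shape are hypothetical.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Assumption: 32 random bytes encoded as 64 hex characters.
function generateDaemonToken(): string {
  return randomBytes(32).toString("hex");
}

// Constant-time comparison. timingSafeEqual throws on unequal lengths,
// so the lengths are checked first.
function isValidToken(expected: string, presented: string | undefined): boolean {
  if (presented === undefined) return false;
  const expectedBytes = Buffer.from(expected, "utf8");
  const presentedBytes = Buffer.from(presented, "utf8");
  return expectedBytes.length === presentedBytes.length && timingSafeEqual(expectedBytes, presentedBytes);
}

// Hypothetical request gate: every request type carries the token.
function authorize(expected: string, request: { token?: string; method: string }): void {
  if (!isValidToken(expected, request.token)) {
    throw new Error("invalid daemon token"); // the caller would then close the connection
  }
}
```

The same gate applies to status and shutdown requests as well as calls and subscriptions, so no request type is reachable without the token.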
Local endpoint hardening is part of the daemon boundary. On Unix-like hosts:
- the daemon metadata directory is chmod 0700;
- the metadata file is written chmod 0600;
- the Unix socket is chmod 0600.
On Windows, named-pipe mode is intended for trusted, single-user GUI sessions and same-user CLI access. It is not installed or supervised as a Windows Service.
Holding LEGION_DAEMON_TOKEN or passing --daemon-token is therefore
equivalent to controlling the live unlocked Legion process for the exposed CLI
or engine methods. The token is not a replacement for state encryption, and the
unlock password is not used as per-request authorization. If the daemon is
unlocked, a valid token holder can perform unlocked operations until logout,
daemon shutdown, or process exit.
The practical posture is suitable for local same-user automation, not for a multi-user network service. Do not expose the daemon socket or token to other users, shells, CI logs, or remote hosts.