Files

T

self 3b38221abe refactor: simplify node recovery actions

Remove the old restore/operator-task recovery model and make actor/execution state the source of truth. Align GUI and CLI node actions around start, stop, reboot, retry, and reinstall, with provider image reset folded into reinstall when supported.

Update statechart and architecture docs to match the reinstall/repair model.

2026-06-17 14:09:18 +02:00

5.1 KiB

Raw Permalink Blame History

Open Issues

Last checked: 2026-06-17

This file tracks current reminders and unresolved product/engineering work. Completed review notes and stale refactor plans have been removed or squashed into the relevant open item.

Renderer / GUI

Build the actual observation UI:
- cloud/catalog status lines with current operation, last-finished timestamp, and last error
- warning/error log notifications from observation:logs, wired to acknowledgement
- metric summary components on top of the renderer observation metrics view model
- node observation display for ready-divergent, provider-missing, degraded services, stale metrics, and log severity counts
- render ActivityState.deploymentEvents as structured deployment steps with severity/status styling
Continue consolidating reusable Svelte component primitives whose class/API conventions stay CSS-compatible with the Tribes tribes_ui plugin surface. Avoid Legion-only styling patterns that cannot be mirrored in plugin/HEEx UI.
Improve the common “one DNS zone plus optional domain delegation” setup flow.
Consider a reusable dialog wrapper for repeated <dialog> lifecycle patterns.
Add runtime validation for critical IPC payloads, especially large payloads such as AppSnapshot, when the API surface settles.
Add list virtualization only if managed resource lists become large enough to need it.

Map view

Add a map component for planned/active infrastructure geography.
Evaluate MapLibre GL for rendering.
Evaluate Planetiler or another offline tile generation path.
Bundle a low-zoom world map with Legion so the GUI does not depend on a live tile service for basic global/regional placement. High zoom levels are not required initially.

Observation, logs, and telemetry

Decide whether cloud/catalog observation need backend-owned stale thresholds. Logs and metrics intentionally do not: subscribers can decide whether live events have gone quiet.
Make cluster log backfill robust for late/out-of-order sync and switching queried cluster nodes:
- track cursor/watermark per queried cluster node/source
- prefer inserted_at or updated_at API filters/cursors over occurred_at
- keep an overlap window and rely on local dedupe
- add a force-full backfill trigger/CLI when needed
Keep heavier streams separate: metric rollups for time-series summaries, deployment events for selected durable install/recovery steps, logs for warning/error notifications, and plugin-specific methods for plugin details.

Auto-reconciliation and repair

Tighten repair detection and auto-heal policies beyond the current conservative safe repairs.
Cover at least:
- DNS-derived host inputs and node deployment metadata
- sync certificate/config material
- system target/channel generation
- more precise cluster membership/control-plane state
- more precise NBDE mode, Tang peer set, and local boot-key state
- more precise provider firewall/security-group rules
Decide which additional repair needs are auto-reconciled, which require operator confirmation, and which should block further materialization.

Provider cleanup

Keep provider cleanup grounded in canonical planned/tracked/actual resources, actor state, operation logs, and billing evidence; do not reintroduce a separate provider recovery sidecar without a concrete ownership gap.
Extend cleanup and adoption workflows beyond CLI actions:
- richer resource detail pages with mismatch evidence, last refresh, cleanup/adoption attempts, and billing impact
- typed confirmation for destructive GUI cleanup when the resource is especially ambiguous
- resource-specific adoption that promotes safe observed resources into normal tracked state/bindings
Model related billable resources explicitly as provider observations when needed: server/instance, SSH key, firewall/security group, volume/disk, floating IP, DNS zone, domain/order/cart where applicable.
On failure, prioritize billable provider destroy first, then best-effort ancillary cleanup. Surface structured cleanup summaries to CLI/GUI.
Extend cleanup/adoption coverage beyond Hetzner/Scaleway compute to OVH and DNS/domain provisioning, where billing semantics are more sensitive.

Pricing and cost tracking

Extend the initial provider-neutral billing ledger beyond servers, domains, and observed orphans:
- provider-specific volume/disk, floating IP, firewall/security-group, DNS zone, domain order/cart, and renewal semantics
- richer hourly-capped-monthly, annual/fixed, unattached-IP, and stopped/suspended billing rules per provider
- explicit provider invoice/API comparison when billing APIs are available
Improve persisted monthly summaries so end-of-month snapshots are finalized even if Legion is not opened exactly at month boundary.
Promote resource-specific adoption/cleanup actions into the billing ledger so cost and cleanup events share one evidence trail.

Engine / API quality

Continue migrating deeper provider/planning failures to typed LegionEngineError paths when touched.
Runtime payload validation remains intentionally limited; revisit for critical IPC/daemon boundaries once contracts stabilize.

5.1 KiB Raw Permalink Blame History