Open Issues
Last checked: 2026-06-17
This file tracks current reminders and unresolved product/engineering work. Completed review notes and stale refactor plans have been removed or squashed into the relevant open item.
Renderer / GUI
- Build the actual observation UI:
  - cloud/catalog status lines with current operation, last-finished timestamp, and last error
  - warning/error log notifications from `observation:logs`, wired to acknowledgement
  - metric summary components on top of the renderer observation metrics view model
  - node observation display for `ready-divergent`, `provider-missing`, degraded services, stale metrics, and log severity counts
  - render `ActivityState.deploymentEvents` as structured deployment steps with severity/status styling
- Continue consolidating reusable Svelte component primitives whose class/API conventions stay CSS-compatible with the Tribes `tribes_ui` plugin surface. Avoid Legion-only styling patterns that cannot be mirrored in plugin/HEEx UI.
- Improve the common “one DNS zone plus optional domain delegation” setup flow.
- Consider a reusable dialog wrapper for repeated `<dialog>` lifecycle patterns.
- Add runtime validation for critical IPC payloads, especially large payloads such as `AppSnapshot`, when the API surface settles.
- Add list virtualization only if managed resource lists become large enough to need it.
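
One possible shape for the runtime validation item above is a hand-written type guard at the IPC boundary, which avoids committing to a schema library before the API settles. This is only a sketch: `AppSnapshotLike` and its fields (`revision`, `nodes`, `id`, `state`) are hypothetical stand-ins, not the real `AppSnapshot` contract.

```typescript
// Hypothetical, minimal stand-in for the real AppSnapshot shape.
interface NodeSummaryLike {
  id: string;
  state: string;
}

interface AppSnapshotLike {
  revision: number;
  nodes: NodeSummaryLike[];
}

type ValidationResult<T> =
  | { ok: true; value: T }
  | { ok: false; errors: string[] };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate an untrusted IPC payload and collect every problem found,
// so the caller can log one structured error instead of stopping at the first.
function validateAppSnapshot(payload: unknown): ValidationResult<AppSnapshotLike> {
  if (!isRecord(payload)) {
    return { ok: false, errors: ["payload is not an object"] };
  }
  const errors: string[] = [];
  if (typeof payload.revision !== "number" || !Number.isInteger(payload.revision)) {
    errors.push("revision must be an integer");
  }
  const nodes: NodeSummaryLike[] = [];
  if (!Array.isArray(payload.nodes)) {
    errors.push("nodes must be an array");
  } else {
    payload.nodes.forEach((node: unknown, index: number) => {
      if (isRecord(node) && typeof node.id === "string" && typeof node.state === "string") {
        nodes.push({ id: node.id, state: node.state });
      } else {
        errors.push(`nodes[${index}] must have string id and state`);
      }
    });
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return { ok: true, value: { revision: payload.revision as number, nodes } };
}
```

Returning a result object rather than throwing keeps the decision about how to surface a bad payload (log, notification, or hard failure) with the caller.
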
Map view
- Add a map component for planned/active infrastructure geography.
- Evaluate MapLibre GL for rendering.
- Evaluate Planetiler or another offline tile generation path.
- Bundle a low-zoom world map with Legion so the GUI does not depend on a live tile service for basic global/regional placement. High zoom levels are not required initially.
Observation, logs, and telemetry
- Decide whether cloud/catalog observation needs backend-owned stale thresholds. Logs and metrics intentionally do not: subscribers can decide whether live events have gone quiet.
- Make cluster log backfill robust for late/out-of-order sync and switching queried cluster nodes:
  - track cursor/watermark per queried cluster node/source
  - prefer `inserted_at` or `updated_at` API filters/cursors over `occurred_at`
  - keep an overlap window and rely on local dedupe
  - add a force-full backfill trigger/CLI when needed
- Keep heavier streams separate: metric rollups for time-series summaries, deployment events for selected durable install/recovery steps, logs for warning/error notifications, and plugin-specific methods for plugin details.
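
The backfill points above (per-source watermark, insert-time cursors, overlap window, local dedupe, force-full trigger) could fit together roughly as follows. This is a sketch under assumptions: `LogEntry`, `FetchSince`, and `BackfillState` are hypothetical names, and the fetcher is synchronous here only to keep the example short.

```typescript
// Hypothetical log entry; insertedAt is the server-side insert time in ms.
interface LogEntry {
  id: string;
  source: string;
  insertedAt: number;
  occurredAt: number;
}

// Hypothetical fetcher: returns entries of one source with insertedAt >= sinceMs.
type FetchSince = (source: string, sinceMs: number) => LogEntry[];

class BackfillState {
  private overlapMs: number;
  private watermarks = new Map<string, number>(); // per-source insertedAt watermark
  private seen = new Set<string>();               // local dedupe by source + entry id

  constructor(overlapMs: number) {
    this.overlapMs = overlapMs;
  }

  // Pull entries for one source, re-reading an overlap window so late or
  // out-of-order inserts are not skipped; dedupe drops the repeats.
  backfill(source: string, fetchSince: FetchSince, forceFull = false): LogEntry[] {
    const watermark = this.watermarks.get(source);
    const since =
      forceFull || watermark === undefined ? 0 : Math.max(0, watermark - this.overlapMs);
    let next = watermark ?? 0;
    const fresh: LogEntry[] = [];
    for (const entry of fetchSince(source, since)) {
      next = Math.max(next, entry.insertedAt);
      const key = `${source}:${entry.id}`;
      if (this.seen.has(key)) continue;
      this.seen.add(key);
      fresh.push(entry);
    }
    this.watermarks.set(source, next);
    return fresh;
  }
}
```

Keying the watermark by source means switching the queried cluster node starts from that node's own cursor rather than reusing another node's position.
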
Auto-reconciliation and repair
- Tighten repair detection and auto-heal policies beyond the current conservative safe repairs.
- Cover at least:
  - DNS-derived host inputs and node deployment metadata
  - sync certificate/config material
  - system target/channel generation
  - more precise cluster membership/control-plane state
  - more precise NBDE mode, Tang peer set, and local boot-key state
  - more precise provider firewall/security-group rules
- Decide which additional repair needs are auto-reconciled, which require operator confirmation, and which should block further materialization.
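
Once that decision is made, it could be recorded as a plain lookup from repair need to policy. The sketch below is illustrative only: the `RepairNeed` identifiers are hypothetical, and the policy assigned to each one is a placeholder, since the real split is exactly what remains undecided.

```typescript
// Hypothetical repair-need identifiers, loosely following the coverage list above.
type RepairNeed =
  | "dns-host-inputs"
  | "sync-cert-material"
  | "system-target-generation"
  | "cluster-membership"
  | "nbde-state"
  | "provider-firewall-rules";

// auto: reconcile silently; confirm: ask the operator; block: stop materialization.
type RepairPolicy = "auto" | "confirm" | "block";

// Placeholder assignments, not a recommendation.
const repairPolicies: Record<RepairNeed, RepairPolicy> = {
  "dns-host-inputs": "auto",
  "sync-cert-material": "auto",
  "system-target-generation": "confirm",
  "cluster-membership": "block",
  "nbde-state": "confirm",
  "provider-firewall-rules": "confirm",
};

interface RepairPlan {
  auto: RepairNeed[];
  confirm: RepairNeed[];
  block: RepairNeed[];
}

// Group detected repair needs by the policy assigned to each.
function planRepairs(detected: RepairNeed[]): RepairPlan {
  const plan: RepairPlan = { auto: [], confirm: [], block: [] };
  for (const need of detected) {
    plan[repairPolicies[need]].push(need);
  }
  return plan;
}

// Further materialization is allowed only when nothing blocking was detected.
function canMaterialize(plan: RepairPlan): boolean {
  return plan.block.length === 0;
}
```

Using `Record<RepairNeed, RepairPolicy>` makes the compiler reject a new repair need that has no policy assigned.
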
Provider cleanup
- Keep provider cleanup grounded in canonical planned/tracked/actual resources, actor state, operation logs, and billing evidence; do not reintroduce a separate provider recovery sidecar without a concrete ownership gap.
- Extend cleanup and adoption workflows beyond CLI actions:
  - richer resource detail pages with mismatch evidence, last refresh, cleanup/adoption attempts, and billing impact
  - typed confirmation for destructive GUI cleanup when the resource is especially ambiguous
  - resource-specific adoption that promotes safe observed resources into normal tracked state/bindings
- Model related billable resources explicitly as provider observations when needed: server/instance, SSH key, firewall/security group, volume/disk, floating IP, DNS zone, domain/order/cart where applicable.
- On failure, destroy billable provider resources first, then attempt best-effort ancillary cleanup. Surface structured cleanup summaries to CLI/GUI.
- Extend cleanup/adoption coverage beyond Hetzner/Scaleway compute to OVH and DNS/domain provisioning, where billing semantics are more sensitive.
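
The billable-first ordering and structured summary described above might look like the following sketch. `CleanupTarget`, `Destroy`, and `CleanupSummary` are hypothetical names, and the destroy callback is synchronous only for brevity; a real provider call would be asynchronous.

```typescript
type ResourceKind = "server" | "volume" | "floating-ip" | "ssh-key" | "firewall" | "dns-zone";

interface CleanupTarget {
  id: string;
  kind: ResourceKind;
  billable: boolean;
}

interface CleanupStepResult extends CleanupTarget {
  ok: boolean;
  error?: string;
}

interface CleanupSummary {
  steps: CleanupStepResult[];
  billableRemaining: number; // billable resources that could not be destroyed
  ancillaryFailures: number; // best-effort failures that do not cost money
}

// Hypothetical provider call; throws on failure.
type Destroy = (target: CleanupTarget) => void;

function runCleanup(targets: CleanupTarget[], destroy: Destroy): CleanupSummary {
  // Billable resources first; the stable sort keeps the original order within each group.
  const ordered = [...targets].sort((a, b) => Number(b.billable) - Number(a.billable));
  const summary: CleanupSummary = { steps: [], billableRemaining: 0, ancillaryFailures: 0 };
  for (const target of ordered) {
    try {
      destroy(target);
      summary.steps.push({ ...target, ok: true });
    } catch (err) {
      // Record the failure and keep going so one error does not strand later resources.
      const error = err instanceof Error ? err.message : String(err);
      summary.steps.push({ ...target, ok: false, error });
      if (target.billable) summary.billableRemaining += 1;
      else summary.ancillaryFailures += 1;
    }
  }
  return summary;
}
```

A non-zero `billableRemaining` is the signal worth surfacing most prominently, since those resources keep accruing cost.
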
Pricing and cost tracking
- Extend the initial provider-neutral billing ledger beyond servers, domains, and observed orphans:
  - provider-specific volume/disk, floating IP, firewall/security-group, DNS zone, domain order/cart, and renewal semantics
  - richer hourly-capped-monthly, annual/fixed, unattached-IP, and stopped/suspended billing rules per provider
  - explicit provider invoice/API comparison when billing APIs are available
- Improve persisted monthly summaries so end-of-month snapshots are finalized even if Legion is not open exactly at the month boundary.
- Promote resource-specific adoption/cleanup actions into the billing ledger so cost and cleanup events share one evidence trail.
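
As a worked illustration of the hourly-capped-monthly rule mentioned above, the sketch below assumes that each started hour is billed in full and that the total for one month never exceeds a fixed cap. Providers differ on both points, so treat the rounding rule, the names, and the example prices as assumptions.

```typescript
// All amounts are integer minor units (for example cents) to avoid float drift.
// Prices used with this sketch are made up, not real provider rates.
interface HourlyCappedPrice {
  hourly: number;     // price per started hour
  monthlyCap: number; // maximum charged for one calendar month
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Count started hours between two timestamps; a partial hour is billed in full.
function billableHours(startMs: number, endMs: number): number {
  return Math.max(0, Math.ceil((endMs - startMs) / MS_PER_HOUR));
}

// Hourly accrual that never exceeds the monthly cap.
function hourlyCappedCost(price: HourlyCappedPrice, hours: number): number {
  return Math.min(hours * price.hourly, price.monthlyCap);
}
```

With a hypothetical price of 7 units per hour and a cap of 4000, 10 hours cost 70 units, while 720 hours would accrue 5040 and are charged at the 4000 cap.
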
Engine / API quality
- Continue migrating deeper provider/planning failures to typed `LegionEngineError` paths when touched.
- Runtime payload validation remains intentionally limited; revisit for critical IPC/daemon boundaries once contracts stabilize.
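
To illustrate what a typed failure path can look like, the sketch below wraps a provider call and converts whatever it throws into a discriminated union. It does not reflect the actual `LegionEngineError` definition: `EngineErrorSketch`, `wrapProviderCall`, and the `status` field read from the thrown value are all hypothetical.

```typescript
// Hypothetical typed error shape; not the real LegionEngineError.
type EngineErrorSketch =
  | { kind: "provider"; provider: string; message: string; retryable: boolean }
  | { kind: "planning"; message: string };

type Result<T> = { ok: true; value: T } | { ok: false; error: EngineErrorSketch };

// Read an HTTP-like numeric status from an unknown thrown value, if present.
function readStatus(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Run a provider call and turn any thrown value into a typed provider error.
function wrapProviderCall<T>(provider: string, call: () => T): Result<T> {
  try {
    return { ok: true, value: call() };
  } catch (err) {
    const status = readStatus(err);
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: rate limits (429) and server errors (5xx) are worth retrying.
    const retryable = status === 429 || (status !== undefined && status >= 500);
    return { ok: false, error: { kind: "provider", provider, message, retryable } };
  }
}

// Exhaustive handling: adding a new kind makes this switch fail to compile.
function describeError(error: EngineErrorSketch): string {
  switch (error.kind) {
    case "provider":
      return `${error.provider}: ${error.message}${error.retryable ? " (retryable)" : ""}`;
    case "planning":
      return `planning: ${error.message}`;
  }
}
```
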