diff --git a/docs/plan_2.md b/docs/plan_2.md
new file mode 100644
index 0000000..48c37c6
--- /dev/null
+++ b/docs/plan_2.md
@@ -0,0 +1,461 @@
+# Fruix lifecycle plan: node-local management and deploy
+
+This document is a follow-on to `docs/plan.md`.
+
+The build / jail / assembly / privileged-policy foundation is now in a good
+state for the currently exercised paths. The next step is to make installed
+systems feel like real Fruix-managed nodes rather than images that merely boot.
+
+The core recommendation is:
+
+1. harden node-local lifecycle first
+2. harden deploy second
+3. make the pinned Fruix-on-node contract explicit
+4. only then add a real upgrade workflow
+
+## Short version
+
+Near-term priority order:
+
+1. `status`
+2. `reconfigure`
+3. `switch`
+4. `rollback`
+5. `deploy`
+6. pinned Fruix / declaration ownership cleanup
+7. `upgrade`
+
+Reasoning:
+
+- `status` / `reconfigure` / `switch` / `rollback` are what make an installed
+  system feel like a Fruix node
+- `deploy` is the next natural operator workflow once node-local switching is
+  trustworthy
+- `upgrade` should come later because it is not just an execution feature; it
+  implies source update policy, pinning, provenance comparison, and rollback
+  semantics across revisions
+
+## Current state snapshot
+
+There is already meaningful lifecycle scaffolding in-tree.
+
+### Node-local commands already exist in the rendered CLI
+
+The in-system `fruix` CLI usage already includes:
+
+- `fruix system status`
+- `fruix system build`
+- `fruix system build-base`
+- `fruix system deploy`
+- `fruix system installer-tui`
+- `fruix system reconfigure`
+- `fruix system switch`
+- `fruix system rollback`
+
+This means the next phase is mostly about hardening, contract cleanup, and
+validation rather than inventing the entire lifecycle from scratch.
+
+### Reconfigure already has the right basic shape
+
+Current node-side `reconfigure` logic already roughly does:
+
+- determine the default declaration when no path is given
+- invoke the bundled node CLI to run `system build`
+- extract `closure_path`
+- switch to that closure
+- report that a reboot is required
+
+That is a good starting shape.
+
+### Switch / rollback / status already have generation mechanics
+
+Current node-side logic already contains machinery for:
+
+- current generation tracking
+- rollback generation tracking
+- generation directories
+- current and rollback closure links
+- metadata / provenance files per generation
+- EFI loader update during switch / rollback
+
+Again, the next work is about correctness, UX, and validation.
+
+### Deploy already has a real skeleton
+
+Current host-side deploy logic already does:
+
+- remote `fruix` presence check
+- remote store item existence checks
+- copy of only missing store items
+- remote `fruix system switch ...`
+- optional reboot
+- post-deploy remote `status` collection
+
+That is enough to justify treating deploy as a hardening phase, not as a green
+field feature.
+
+## Initial review findings
+
+A first pass over the current node-side lifecycle code in
+`modules/fruix/system/freebsd/render.scm`, together with the initial system
+layout code in `modules/fruix/system/freebsd/media.scm`, suggests the following.
+
+### Good news
+
+- the default declaration path is already explicit and closure-local:
+  - `/run/current-system/metadata/system-declaration.scm`
+  - `/run/current-system/metadata/system-declaration-info.scm`
+  - `/run/current-system/metadata/system-declaration-system`
+- node-local `reconfigure` already uses the bundled Fruix node CLI rather than
+  an external checkout
+- generation directories, current generation tracking, rollback tracking, and
+  gcroots already exist
+- `status`, `switch`, and `rollback` already have simple and readable mechanics
+
+### Gaps worth addressing first
+
+#### 1. Generation metadata schema is inconsistent
+
+The initial generation layout written at install time records:
+
+- `system-declaration-file`
+- `system-declaration-info-file`
+- `system-declaration-system-file`
+
+But the node-side generation metadata written by later `switch` / `reconfigure`
+paths currently does not record those fields.
+
+That should be aligned so generation metadata remains consistent before and
+after the first node-local transition.
+
+#### 2. `status` is still a bit too sparse
+
+Current `status` reports:
+
+- current generation
+- current link
+- current closure
+- rollback generation
+- rollback link
+- rollback closure
+
+That is a good minimum, but it does not yet surface:
+
+- declaration identity
+- generation metadata path
+- whether current / rollback metadata files are present
+- pinned Fruix/node payload identity
+
+#### 3. Switch / rollback ordering should be reviewed deliberately
+
+The current logic is straightforward, but it updates:
+
+- generation directories
+- rollback links/files
+- current links/files
+- gcroots
+- `/run/current-system`
+- EFI loader state
+
+in a stepwise order.
+
+That may be acceptable for now, but it should be reviewed as an intentional
+ordering with known invariants and failure expectations, not just as whatever
+sequence happened to emerge first.
+
+#### 4. The declaration ownership model is explicit, but still not fully
+operator-facing
+
+The current implementation already points the default declaration at the active
+closure metadata, which is good for self-containment.
+
+However, Fruix still needs to make clearer whether the intended operator model
+is:
+
+- reconfigure from the immutable current declaration by default
+- pass an explicit new declaration path when changing configuration
+- later maintain a node-local editable declaration source of truth
+
+### Immediate Phase 1 task list
+
+- align switched-generation metadata with the initial installed generation
+  schema
+- enrich `fruix system status` with a small amount of declaration / generation
+  identity information
+- review and document switch / rollback ordering invariants
+- validate `status` / `reconfigure` / `rollback` on a real booted self-hosted
+  node
+
+## Phase 1: node-local lifecycle hardening
+
+Goal: an installed self-hosted Fruix node can manage itself from its own pinned
+Fruix payload, without relying on a bootstrap checkout or ambient host paths.
+
+This is the most important near-term product milestone.
+
+### Scope
+
+Focus on:
+
+- `fruix system status`
+- `fruix system reconfigure`
+- `fruix system switch`
+- `fruix system rollback`
+
+### Questions to settle
+
+#### 1. What exactly is the default declaration?
+
+When the operator runs `fruix system reconfigure` with no explicit declaration,
+Fruix should have a crisp answer for:
+
+- where the current declaration lives
+- where the current system name lives
+- which Fruix payload on disk owns that declaration
+- whether local edits to the declaration are the intended workflow
+
+This should be explicit, not inferred from bootstrap-era conventions.
+
+#### 2. What is the reboot contract?
+
+Right now `reconfigure` reports `reboot_required=true`.
+
+That is acceptable, but Fruix should decide and document whether:
+
+- every switch requires a reboot
+- some changes can be activated live and others require a reboot
+- `switch` means “prepare next boot” or “change active system state now”
+
+For now, a simple and conservative policy is fine, but it should be explicit.
+
+#### 3. What makes a generation valid?
+
+Fruix should clearly define the invariants for a system generation:
+
+- closure exists
+- activation script exists
+- shepherd configuration exists
+- boot artifacts exist
+- generation metadata and provenance are written
+- loader / current / rollback links are updated atomically enough
+
+#### 4. What is rollback allowed to assume?
+
+Rollback should be very boring and predictable.
+
+It should clearly state:
+
+- what is recorded as the previous generation
+- when rollback metadata is updated
+- what happens if the previous generation closure is missing
+- whether rollback always updates EFI loader state
+
+### Acceptance criteria
+
+A self-hosted installed system should be able to:
+
+1. boot successfully
+2. run `fruix system status` and show coherent current / rollback information
+3. run `fruix system reconfigure` using its default declaration
+4. create a new generation with updated metadata
+5. run `fruix system rollback`
+6. report the earlier generation again via `status`
+
+### Recommended implementation tasks
+
+- make the default declaration / system-name ownership explicit
+- tighten error messages around missing declaration / missing generation links
+- verify generation metadata fields are complete and stable
+- review link update ordering for current / rollback / gcroots / `/run/current-system`
+- decide whether any metadata should include the pinned Fruix revision or node
+  CLI identity
+- add a small test plan for generation creation and rollback behavior
+
+### Likely files
+
+- `modules/fruix/system/freebsd/render.scm`
+- `modules/fruix/system/freebsd/media.scm`
+- `scripts/fruix.scm`
+- `examples/system/self-hosted-dev.scm`
+
+## Phase 2: deploy hardening
+
+Goal: build once on one Fruix-capable machine, then deploy closures to another
+Fruix node with predictable switching and verification.
+
+### Scope
+
+Focus on:
+
+- `fruix system deploy`
+- remote `fruix system switch`
+- optional reboot and post-reboot verification
+
+### Questions to settle
+
+#### 1. What are the remote preflight requirements?
+
+Before deploy, Fruix should clearly check:
+
+- remote `/usr/local/bin/fruix` exists
+- remote store directory exists and is writable enough
+- remote host has the expected node runtime payload
+- remote node can perform `system switch`
+
+#### 2. What is the transfer contract?
+
+Current behavior copies missing store items only. That is good.
+
+This should remain explicit:
+
+- which references are transferred
+- which are skipped because they already exist remotely
+- whether closure completeness is validated before switching
+
+#### 3. What is success?
+
+A successful deploy should ideally mean:
+
+- the remote switch command completed
+- the intended closure is now current on the target
+- optional reboot completed when requested
+- remote `status` confirms the expected current generation / closure
+
+### Acceptance criteria
+
+From one Fruix-capable machine, an operator should be able to:
+
+1. build a closure locally
+2. deploy it to a second Fruix node
+3. confirm only missing store items were transferred
+4. switch the remote node to the new closure
+5. optionally reboot the remote node
+6. confirm remote `status` reports the expected closure after reboot
+
+### Recommended implementation tasks
+
+- make remote preflight failures more explicit and operator-friendly
+- ensure deploy metadata is complete enough for troubleshooting
+- verify closure identity before and after remote switch
+- verify reboot / reconnect behavior is reliable enough for normal use
+- decide whether deploy should gain a `--no-switch` or `--copy-only` mode later
+  or whether that is unnecessary complexity for now
+
+### Likely files
+
+- `scripts/fruix.scm`
+- `modules/fruix/system/freebsd/render.scm`
+
+## Phase 3: make the installed Fruix contract explicit
+
+Goal: an installed system should clearly contain and own the Fruix payload it
+uses for self-management.
+
+This is where the `fruix` vs `fruix-bootstrap` boundary becomes operational,
+not just architectural.
+
+### Desired contract
+
+After installation and first boot:
+
+- the machine can manage itself using Fruix already present on disk
+- node-local commands do not depend on a checkout outside the node
+- the installed declaration source of truth is explicit
+- the node knows which Fruix payload / revision it is running against
+
+### Questions to settle
+
+- where should the installed declaration live long-term?
+- should the node preserve an editable declaration path distinct from the
+  immutable closure metadata?
+- how should the pinned Fruix revision be recorded and surfaced?
+- how much of bootstrap-era installer state should remain after first boot?
+
+### Acceptance criteria
+
+On an installed node:
+
+- `fruix system status` can report enough information to identify the current
+  declaration and active closure
+- `fruix system reconfigure` works without any external checkout
+- the installed Fruix payload can be updated later by a future explicit
+  Fruix-managed workflow rather than by manual bootstrap-era conventions
+
+## Phase 4: source / pin / lock / upgrade workflow
+
+Goal: make `fruix system upgrade` real, not just a thin alias for rebuild.
+
+This phase should come after Phases 1 through 3.
+
+### Why upgrade is later
+
+`upgrade` is not just “run build again.” It implicitly asks:
+
+- what source identity is current?
+- what source identity is the proposed next state?
+- who updated the pin / lock?
+- how are changes compared and recorded?
+- what exactly does rollback mean across source revisions?
+
+Without that, `upgrade` risks being vague and operationally surprising.
+
+### Desired workflow
+
+- inspect current pinned Fruix / source state
+- choose updated pin / lock state
+- build new closure from that explicit state
+- switch or deploy it
+- retain enough provenance for rollback and audit
+
+### Acceptance criteria
+
+A future `fruix system upgrade` should:
+
+1. tell the operator what source state is changing
+2. build from an explicit new pinned state
+3. record enough provenance to explain the transition
+4. preserve normal switch / rollback semantics
+
+## Recommended next concrete milestone
+
+The strongest next milestone is:
+
+**Self-hosted Fruix node lifecycle works end-to-end.**
+
+That means validating all of the following against a real booted self-hosted
+system:
+
+1. install it
+2. boot it
+3. run `fruix system status`
+4. run `fruix system reconfigure`
+5. run `fruix system rollback`
+6. inspect generation metadata and current / rollback links
+7. from another machine, run `fruix system deploy`
+
+This is a better next milestone than `upgrade` because it makes Fruix feel like
+an actual managed system first.
+
+## Definition of done for this phase
+
+This lifecycle phase can be considered “good for now” when:
+
+- node-local `status` / `reconfigure` / `switch` / `rollback` are reliable on a
+  real installed self-hosted node
+- remote `deploy` is reliable enough for normal manual operator use
+- the default declaration and pinned Fruix ownership model are explicit
+- no bootstrap checkout is required for normal post-install lifecycle
+- `upgrade` is intentionally deferred until source / pin / lock design is ready
+
+## Non-goals for this phase
+
+To keep this phase bounded, do not require all of the following yet:
+
+- substitutes / binary cache design
+- full publication story
+- final source lock UX
+- advanced deploy orchestration
+- live non-reboot service reconfiguration semantics
+- broad package ecosystem polish
+
+Those can come later.
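
The Phase 1 questions about switch / rollback ordering, and about links being "updated atomically enough", can be made concrete with a small sketch. The Python below is illustrative only and is not taken from `render.scm`. The `state_dir` layout, the `current` / `rollback` link names, and the helpers `atomic_symlink`, `switch_generation`, and `rollback_generation` are all hypothetical. It deliberately leaves out generation directories, metadata, gcroots, `/run/current-system`, and the EFI loader update. It shows one possible ordering: record the rollback link first, then repoint `current` with a temporary symlink plus rename, so that `current` always resolves to a complete closure.

```python
import os


def atomic_symlink(target: str, link_path: str) -> None:
    """Point link_path at target without a window where link_path is missing."""
    directory = os.path.dirname(link_path) or "."
    tmp = os.path.join(directory, f".{os.path.basename(link_path)}.tmp.{os.getpid()}")
    if os.path.lexists(tmp):
        os.unlink(tmp)  # clear a stale temp link left by an interrupted run
    os.symlink(target, tmp)
    # rename(2) replaces the old link itself and is atomic on POSIX filesystems.
    os.replace(tmp, link_path)


def switch_generation(state_dir: str, new_closure: str) -> None:
    """Hypothetical switch: record rollback first, then repoint current."""
    current = os.path.join(state_dir, "current")
    rollback = os.path.join(state_dir, "rollback")
    if not os.path.isdir(new_closure):
        raise FileNotFoundError(f"closure does not exist: {new_closure}")
    if os.path.islink(current):
        previous = os.readlink(current)
        if previous == new_closure:
            return  # already current; keep the existing rollback target
        atomic_symlink(previous, rollback)
    atomic_symlink(new_closure, current)


def rollback_generation(state_dir: str) -> str:
    """Hypothetical rollback: swap current and rollback, return the new current."""
    current = os.path.join(state_dir, "current")
    rollback = os.path.join(state_dir, "rollback")
    if not os.path.islink(rollback):
        raise RuntimeError("no rollback generation recorded")
    previous = os.readlink(rollback)
    if not os.path.isdir(previous):
        raise FileNotFoundError(f"rollback closure is missing: {previous}")
    was_current = os.readlink(current)
    atomic_symlink(previous, current)
    atomic_symlink(was_current, rollback)
    return previous
```

With this ordering, an interruption between the two link updates in `switch_generation` leaves `current` on the old closure and `rollback` pointing at that same closure, so the node stays bootable but loses its earlier rollback target. Whether that trade-off is acceptable is exactly the kind of invariant the plan asks to be written down.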