Document Fruix lifecycle follow-on plan
# Fruix lifecycle plan: node-local management and deploy

This document is a follow-on to `docs/plan.md`.

The build / jail / assembly / privileged-policy foundation is now in a good
state for the currently exercised paths. The next step is to make installed
systems feel like real Fruix-managed nodes rather than images that merely boot.

The core recommendation is:

1. harden node-local lifecycle first
2. harden deploy second
3. make the pinned Fruix-on-node contract explicit
4. only then add a real upgrade workflow

## Short version

Near-term priority order:

1. `status`
2. `reconfigure`
3. `switch`
4. `rollback`
5. `deploy`
6. pinned Fruix / declaration ownership cleanup
7. `upgrade`

Reasoning:

- `status` / `reconfigure` / `switch` / `rollback` are what make an installed
  system feel like a Fruix node
- `deploy` is the next natural operator workflow once node-local switching is
  trustworthy
- `upgrade` should come later because it is not just an execution feature; it
  implies source update policy, pinning, provenance comparison, and rollback
  semantics across revisions

## Current state snapshot

There is already meaningful lifecycle scaffolding in-tree.

### Node-local commands already exist in the rendered CLI

The in-system `fruix` CLI usage already includes:

- `fruix system status`
- `fruix system build`
- `fruix system build-base`
- `fruix system deploy`
- `fruix system installer-tui`
- `fruix system reconfigure`
- `fruix system switch`
- `fruix system rollback`

This means the next phase is mostly about hardening, contract cleanup, and
validation rather than inventing the entire lifecycle from scratch.

### Reconfigure already has the right basic shape

Current node-side `reconfigure` logic already roughly does:

- determine the default declaration when no path is given
- invoke the bundled node CLI to run `system build`
- extract `closure_path`
- switch to that closure
- report that a reboot is required

That is a good starting shape.
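
The five steps above can be sketched in a few lines. This is an illustration only, not the code in `render.scm`: the Python function, the injected `build` and `switch` callables, and the shape of the returned dictionary are assumptions made for the example. Only the default declaration path, `closure_path`, and `reboot_required` come from this document.

```python
from typing import Callable, Optional

# Closure-local default declaration path, as listed under "Good news" below.
DEFAULT_DECLARATION = "/run/current-system/metadata/system-declaration.scm"


def reconfigure(
    build: Callable[[str], dict],
    switch: Callable[[str], None],
    declaration: Optional[str] = None,
) -> dict:
    """Build a declaration, switch to the resulting closure, report the result.

    `build` and `switch` are hypothetical stand-ins for `fruix system build`
    and `fruix system switch`; they are injected so the flow can be exercised
    without a real node.
    """
    # 1. Fall back to the default declaration when no path is given.
    path = declaration or DEFAULT_DECLARATION
    # 2. Run the build, then 3. extract `closure_path` from its result.
    result = build(path)
    closure = result.get("closure_path")
    if not closure:
        raise RuntimeError(f"build of {path} did not report closure_path")
    # 4. Switch to that closure.
    switch(closure)
    # 5. Report that a reboot is required.
    return {"declaration": path, "closure_path": closure, "reboot_required": True}
```

Injecting the two callables keeps the ordering testable in isolation, which is useful if the hardening work adds validation between the build and switch steps.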

### Switch / rollback / status already have generation mechanics

Current node-side logic already contains machinery for:

- current generation tracking
- rollback generation tracking
- generation directories
- current and rollback closure links
- metadata / provenance files per generation
- EFI loader update during switch / rollback

Again, the next work is about correctness, UX, and validation.

### Deploy already has a real skeleton

Current host-side deploy logic already does:

- remote `fruix` presence check
- remote store item existence checks
- copy of only missing store items
- remote `fruix system switch ...`
- optional reboot
- post-deploy remote `status` collection

That is enough to justify treating deploy as a hardening phase, not as a
greenfield feature.

## Initial review findings

A first pass over the current node-side lifecycle code in
`modules/fruix/system/freebsd/render.scm`, together with the initial system
layout code in `modules/fruix/system/freebsd/media.scm`, suggests the following.

### Good news

- the default declaration path is already explicit and closure-local:
  - `/run/current-system/metadata/system-declaration.scm`
  - `/run/current-system/metadata/system-declaration-info.scm`
  - `/run/current-system/metadata/system-declaration-system`
- node-local `reconfigure` already uses the bundled Fruix node CLI rather than
  an external checkout
- generation directories, current generation tracking, rollback tracking, and
  gcroots already exist
- `status`, `switch`, and `rollback` already have simple and readable mechanics

### Gaps worth addressing first

#### 1. Generation metadata schema is inconsistent

The initial generation layout written at install time records:

- `system-declaration-file`
- `system-declaration-info-file`
- `system-declaration-system-file`

But the node-side generation metadata written by later `switch` / `reconfigure`
paths currently does not record those fields.

That should be aligned so generation metadata remains consistent before and
after the first node-local transition.
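
One way to keep the two writers aligned is a single check that both the install-time and node-side paths must pass. The sketch below assumes the generation metadata has already been parsed into a plain dictionary; that parsing step and the function name are hypothetical. The three field names are the ones listed above.

```python
# Field names recorded by the install-time generation layout, per this document.
DECLARATION_FIELDS = (
    "system-declaration-file",
    "system-declaration-info-file",
    "system-declaration-system-file",
)


def missing_declaration_fields(metadata: dict) -> list:
    """Return the declaration fields that are absent or empty in `metadata`.

    `metadata` is assumed to be generation metadata already parsed into a
    dict; how Fruix actually stores and reads it is not modelled here.
    """
    return [name for name in DECLARATION_FIELDS if not metadata.get(name)]
```

A generation written by `switch` that lacks the fields would return all three names, which makes the inconsistency easy to assert on in a test.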

#### 2. `status` is still a bit too sparse

Current `status` reports:

- current generation
- current link
- current closure
- rollback generation
- rollback link
- rollback closure

That is a good minimum, but it does not yet surface:

- declaration identity
- generation metadata path
- whether current / rollback metadata files are present
- pinned Fruix/node payload identity
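
To make the gap concrete, here is a rough sketch of what an enriched report could carry. Everything in it is an assumption for illustration: the `GenerationView` type, the key names, the `key=value` line format, and the `missing` / `unknown` placeholders are not taken from the existing CLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationView:
    """Hypothetical view of one generation as `status` might see it."""
    number: int
    link: str
    closure: str
    metadata_path: Optional[str] = None     # None when the metadata file is absent
    declaration_id: Optional[str] = None    # some stable identity for the declaration
    fruix_payload_id: Optional[str] = None  # pinned Fruix / node payload identity


def render_status(current: GenerationView,
                  rollback: Optional[GenerationView] = None) -> str:
    """Render current and rollback generations as key=value lines."""
    lines = []
    for role, gen in (("current", current), ("rollback", rollback)):
        if gen is None:
            lines.append(f"{role}_generation=none")
            continue
        fields = {
            "generation": gen.number,
            "link": gen.link,
            "closure": gen.closure,
            "metadata": gen.metadata_path or "missing",
            "declaration": gen.declaration_id or "unknown",
            "fruix_payload": gen.fruix_payload_id or "unknown",
        }
        lines += [f"{role}_{key}={value}" for key, value in fields.items()]
    return "\n".join(lines)
```

Reporting `missing` or `unknown` explicitly, rather than omitting the line, lets an operator tell an absent metadata file apart from a field that `status` simply does not print.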

#### 3. Switch / rollback ordering should be reviewed deliberately

The current logic is straightforward, but it updates:

- generation directories
- rollback links/files
- current links/files
- gcroots
- `/run/current-system`
- EFI loader state

in a stepwise order.

That may be acceptable for now, but it should be reviewed as an intentional
ordering with known invariants and failure expectations, not just as whatever
sequence happened to emerge first.
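
A small harness can make the ordering and its failure expectations explicit. The sketch below is not how `render.scm` is structured: the step names simply mirror the list above, and `run_in_order` and `SwitchFailed` are hypothetical names. The order shown is a placeholder until the intended ordering has been reviewed.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

# Step names mirror the list above; the order is a placeholder, not a decision.
SWITCH_STEPS = [
    "generation-directory",
    "rollback-links",
    "current-links",
    "gcroots",
    "run-current-system",
    "efi-loader",
]


class SwitchFailed(Exception):
    """Raised when a step fails; records what had already been applied."""

    def __init__(self, failed: str, completed: List[str], cause: Exception):
        super().__init__(f"step {failed!r} failed after {completed}: {cause}")
        self.failed = failed
        self.completed = completed


def run_in_order(steps: List[Step]) -> List[str]:
    """Run steps in the declared order and stop at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise SwitchFailed(name, list(completed), exc) from exc
        completed.append(name)
    return completed
```

Because a failure reports exactly which steps completed, each prefix of the list becomes a state whose invariants can be written down and tested, for example "if `gcroots` fails, the current links already point at the new generation".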

#### 4. The declaration ownership model is explicit, but still not fully operator-facing

The current implementation already points the default declaration at the active
closure metadata, which is good for self-containment.

However, Fruix still needs to make clearer whether the intended operator model
is:

- reconfigure from the immutable current declaration by default
- pass an explicit new declaration path when changing configuration
- later maintain a node-local editable declaration source of truth

### Immediate Phase 1 task list

- align switched-generation metadata with the initial installed generation
  schema
- enrich `fruix system status` with a small amount of declaration / generation
  identity information
- review and document switch / rollback ordering invariants
- validate `status` / `reconfigure` / `rollback` on a real booted self-hosted
  node

## Phase 1: node-local lifecycle hardening

Goal: an installed self-hosted Fruix node can manage itself from its own pinned
Fruix payload, without relying on a bootstrap checkout or ambient host paths.

This is the most important near-term product milestone.

### Scope

Focus on:

- `fruix system status`
- `fruix system reconfigure`
- `fruix system switch`
- `fruix system rollback`

### Questions to settle

#### 1. What exactly is the default declaration?

When the operator runs `fruix system reconfigure` with no explicit declaration,
Fruix should have a crisp answer for:

- where the current declaration lives
- where the current system name lives
- which Fruix payload on disk owns that declaration
- whether local edits to the declaration are the intended workflow

This should be explicit, not inferred from bootstrap-era conventions.

#### 2. What is the reboot contract?

Right now `reconfigure` reports `reboot_required=true`.

That is acceptable, but Fruix should decide and document whether:

- every switch requires a reboot
- some changes can be activated live and others require a reboot
- `switch` means “prepare next boot” or “change active system state now”

For now, a simple and conservative policy is fine, but it should be explicit.

#### 3. What makes a generation valid?

Fruix should clearly define the invariants for a system generation:

- closure exists
- activation script exists
- shepherd configuration exists
- boot artifacts exist
- generation metadata and provenance are written
- loader / current / rollback links are updated atomically enough
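
The file-level invariants in this list lend themselves to a single validation function. The sketch below is a guess at the shape of such a check: the closure-relative paths (`activate`, `shepherd/init.scm`, `boot`) and the per-generation file names (`metadata.scm`, `provenance.scm`) are placeholders, not the real Fruix layout. Link atomicity is a separate concern and is not covered.

```python
from pathlib import Path
from typing import List

# Hypothetical closure-relative paths; the real layout is defined by Fruix.
REQUIRED_IN_CLOSURE = {
    "activation script": "activate",
    "shepherd configuration": "shepherd/init.scm",
    "boot artifacts": "boot",
}

# Hypothetical per-generation file names.
REQUIRED_IN_GENERATION = ("metadata.scm", "provenance.scm")


def generation_problems(closure: Path, generation_dir: Path) -> List[str]:
    """Return a list of human-readable problems; empty means the checks passed."""
    if not closure.is_dir():
        # Nothing else can be checked without the closure itself.
        return [f"closure missing: {closure}"]
    problems = []
    for label, rel in REQUIRED_IN_CLOSURE.items():
        if not (closure / rel).exists():
            problems.append(f"{label} missing: {closure / rel}")
    for name in REQUIRED_IN_GENERATION:
        if not (generation_dir / name).is_file():
            problems.append(f"generation file missing: {generation_dir / name}")
    return problems
```

Returning every problem rather than stopping at the first also feeds directly into the tighter error messages recommended later in this phase.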

#### 4. What is rollback allowed to assume?

Rollback should be very boring and predictable.

It should clearly state:

- what is recorded as the previous generation
- when rollback metadata is updated
- what happens if the previous generation closure is missing
- whether rollback always updates EFI loader state
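
One candidate answer to these questions is a pure state transition that either swaps current and rollback or leaves the state untouched. The sketch below illustrates that policy only; it is not the existing behaviour. `NodeState`, the swap rule, and the `closure_exists` callback are assumptions, and EFI loader handling is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class NodeState:
    """Generation numbers tracked by the node (hypothetical model)."""
    current: int
    rollback: Optional[int]


def rollback(state: NodeState, closure_exists: Callable[[int], bool]) -> NodeState:
    """Return the state after a rollback, or raise without changing anything."""
    if state.rollback is None:
        raise RuntimeError("no rollback generation recorded")
    if not closure_exists(state.rollback):
        raise RuntimeError(
            f"closure for generation {state.rollback} is missing; state left unchanged"
        )
    # Swap: the generation being left becomes the new rollback target.
    return NodeState(current=state.rollback, rollback=state.current)
```

Under this policy two rollbacks in a row return to the starting generation, which is one simple way to make rollback "boring"; a different policy, such as walking further back in history, would need its own explicit rule.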

### Acceptance criteria

A self-hosted installed system should be able to:

1. boot successfully
2. run `fruix system status` and show coherent current / rollback information
3. run `fruix system reconfigure` using its default declaration
4. create a new generation with updated metadata
5. run `fruix system rollback`
6. report the earlier generation again via `status`

### Recommended implementation tasks

- make the default declaration / system-name ownership explicit
- tighten error messages around missing declaration / missing generation links
- verify generation metadata fields are complete and stable
- review link update ordering for current / rollback / gcroots / `/run/current-system`
- decide whether any metadata should include the pinned Fruix revision or node
  CLI identity
- add a small test plan for generation creation and rollback behavior

### Likely files

- `modules/fruix/system/freebsd/render.scm`
- `modules/fruix/system/freebsd/media.scm`
- `scripts/fruix.scm`
- `examples/system/self-hosted-dev.scm`

## Phase 2: deploy hardening

Goal: build once on one Fruix-capable machine, then deploy closures to another
Fruix node with predictable switching and verification.

### Scope

Focus on:

- `fruix system deploy`
- remote `fruix system switch`
- optional reboot and post-reboot verification

### Questions to settle

#### 1. What are the remote preflight requirements?

Before deploy, Fruix should clearly check:

- remote `/usr/local/bin/fruix` exists
- remote store directory exists and is writable enough
- remote host has the expected node runtime payload
- remote node can perform `system switch`

#### 2. What is the transfer contract?

Current behavior copies missing store items only. That is good.

This should remain explicit:

- which references are transferred
- which are skipped because they already exist remotely
- whether closure completeness is validated before switching
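
The transfer contract reduces to a set computation, sketched below. The function names and the idea of representing store items as plain strings are assumptions for illustration; how deploy actually enumerates closure references and remote store contents is not modelled.

```python
from typing import Iterable, List, NamedTuple, Set


class TransferPlan(NamedTuple):
    to_copy: List[str]  # references missing on the remote
    skipped: List[str]  # references already present remotely


def plan_transfer(closure_refs: Iterable[str], remote_items: Set[str]) -> TransferPlan:
    """Split a closure's references into items to copy and items to skip."""
    refs = sorted(set(closure_refs))
    return TransferPlan(
        to_copy=[ref for ref in refs if ref not in remote_items],
        skipped=[ref for ref in refs if ref in remote_items],
    )


def closure_complete(closure_refs: Iterable[str], remote_items: Set[str]) -> bool:
    """True when every reference of the closure exists on the remote."""
    return set(closure_refs) <= remote_items
```

Running `closure_complete` against a fresh listing of the remote store after the copy, and before `switch`, is one way to answer the third question in the list with a hard check rather than an assumption.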

#### 3. What is success?

A successful deploy should ideally mean:

- the remote switch command completed
- the intended closure is now current on the target
- optional reboot completed when requested
- remote `status` confirms the expected current generation / closure
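
The last check can be automated by comparing the remote `status` output with the closure that was deployed. The sketch below assumes `status` prints `key=value` lines and that one key is named `current_closure`; both are assumptions, since the actual output format and key names are not specified in this document.

```python
def parse_status(text: str) -> dict:
    """Parse key=value lines; lines without '=' are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def deploy_succeeded(status_text: str, expected_closure: str) -> bool:
    """True when the remote reports the expected closure as current.

    `current_closure` is a hypothetical key name.
    """
    return parse_status(status_text).get("current_closure") == expected_closure
```

Comparing against the closure path the host intended to deploy, rather than only checking the exit code of the remote switch, also catches the case where the node reboots into a different generation than expected.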

### Acceptance criteria

From one Fruix-capable machine, an operator should be able to:

1. build a closure locally
2. deploy it to a second Fruix node
3. confirm only missing store items were transferred
4. switch the remote node to the new closure
5. optionally reboot the remote node
6. confirm remote `status` reports the expected closure after reboot

### Recommended implementation tasks

- make remote preflight failures more explicit and operator-friendly
- ensure deploy metadata is complete enough for troubleshooting
- verify closure identity before and after remote switch
- verify reboot / reconnect behavior is reliable enough for normal use
- decide whether deploy should gain a `--no-switch` or `--copy-only` mode later
  or whether that is unnecessary complexity for now

### Likely files

- `scripts/fruix.scm`
- `modules/fruix/system/freebsd/render.scm`

## Phase 3: make the installed Fruix contract explicit

Goal: an installed system should clearly contain and own the Fruix payload it
uses for self-management.

This is where the `fruix` vs `fruix-bootstrap` boundary becomes operational,
not just architectural.

### Desired contract

After installation and first boot:

- the machine can manage itself using Fruix already present on disk
- node-local commands do not depend on a checkout outside the node
- the installed declaration source of truth is explicit
- the node knows which Fruix payload / revision it is running against

### Questions to settle

- where should the installed declaration live long-term?
- should the node preserve an editable declaration path distinct from the
  immutable closure metadata?
- how should the pinned Fruix revision be recorded and surfaced?
- how much of bootstrap-era installer state should remain after first boot?

### Acceptance criteria

On an installed node:

- `fruix system status` can report enough information to identify the current
  declaration and active closure
- `fruix system reconfigure` works without any external checkout
- the installed Fruix payload can be updated later by a future explicit
  Fruix-managed workflow rather than by manual bootstrap-era conventions

## Phase 4: source / pin / lock / upgrade workflow

Goal: make `fruix system upgrade` real, not just a thin alias for rebuild.

This phase should come after Phases 1 through 3.

### Why upgrade is later

`upgrade` is not just “run build again.” It implicitly asks:

- what source identity is current?
- what source identity is the proposed next state?
- who updated the pin / lock?
- how are changes compared and recorded?
- what exactly does rollback mean across source revisions?

Without that, `upgrade` risks being vague and operationally surprising.

### Desired workflow

- inspect current pinned Fruix / source state
- choose updated pin / lock state
- build new closure from that explicit state
- switch or deploy it
- retain enough provenance for rollback and audit

### Acceptance criteria

A future `fruix system upgrade` should:

1. tell the operator what source state is changing
2. build from an explicit new pinned state
3. record enough provenance to explain the transition
4. preserve normal switch / rollback semantics

## Recommended next concrete milestone

The strongest next milestone is:

**Self-hosted Fruix node lifecycle works end-to-end.**

That means validating all of the following against a real booted self-hosted
system:

1. install it
2. boot it
3. run `fruix system status`
4. run `fruix system reconfigure`
5. run `fruix system rollback`
6. inspect generation metadata and current / rollback links
7. from another machine, run `fruix system deploy`

This is a better next milestone than `upgrade` because it makes Fruix feel like
an actual managed system first.

## Definition of done for this phase

This lifecycle phase can be considered “good for now” when:

- node-local `status` / `reconfigure` / `switch` / `rollback` are reliable on a
  real installed self-hosted node
- remote `deploy` is reliable enough for normal manual operator use
- the default declaration and pinned Fruix ownership model are explicit
- no bootstrap checkout is required for normal post-install lifecycle
- `upgrade` is intentionally deferred until source / pin / lock design is ready

## Non-goals for this phase

To keep this phase bounded, do not require all of the following yet:

- substitutes / binary cache design
- full publication story
- final source lock UX
- advanced deploy orchestration
- live non-reboot service reconfiguration semantics
- broad package ecosystem polish

Those can come later.