# Fruix lifecycle plan: node-local management and deploy

This document is a follow-on to `docs/plan.md`.

The build / jail / assembly / privileged-policy foundation is now in a good
state for the currently exercised paths. The next step is to make installed
systems feel like real Fruix-managed nodes rather than images that merely boot.

The core recommendation is:

1. harden node-local lifecycle first
2. harden deploy second
3. make the pinned Fruix-on-node contract explicit
4. only then add a real upgrade workflow

## Short version

Near-term priority order:

1. `status`
2. `reconfigure`
3. `switch`
4. `rollback`
5. `deploy`
6. pinned Fruix / declaration ownership cleanup
7. `upgrade`

Reasoning:

- `status` / `reconfigure` / `switch` / `rollback` are what make an installed
  system feel like a Fruix node
- `deploy` is the next natural operator workflow once node-local switching is
  trustworthy
- `upgrade` should come later because it is not just an execution feature; it
  implies source update policy, pinning, provenance comparison, and rollback
  semantics across revisions

## Current state snapshot

There is already meaningful lifecycle scaffolding in-tree.

### Node-local commands already exist in the rendered CLI

The in-system `fruix` CLI usage already includes:

- `fruix system status`
- `fruix system build`
- `fruix system build-base`
- `fruix system deploy`
- `fruix system installer-tui`
- `fruix system reconfigure`
- `fruix system switch`
- `fruix system rollback`

This means the next phase is mostly about hardening, contract cleanup, and
validation rather than inventing the entire lifecycle from scratch.

### Reconfigure already has the right basic shape

Current node-side `reconfigure` logic already roughly does:

- determine the default declaration when no path is given
- invoke the bundled node CLI to run `system build`
- extract `closure_path`
- switch to that closure
- report that a reboot is required

That is a good starting shape.

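The reconfigure shape listed above can be sketched roughly as follows. This is an illustration only, not the rendered node CLI: the `key=value` build output format, the `run_build` and `switch` callables, and the function names are assumptions made for the sketch. Only `closure_path`, the reboot-required report, and the default declaration path are taken from this plan.

```python
from typing import Callable, Optional

# Default declaration path as named in this plan's review findings.
DEFAULT_DECLARATION = "/run/current-system/metadata/system-declaration.scm"


def parse_fields(output: str) -> dict:
    """Parse `key=value` lines (an assumed output format) into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def reconfigure(run_build: Callable[[str], str],
                switch: Callable[[str], None],
                declaration: Optional[str] = None) -> dict:
    """Build a declaration, switch to the resulting closure, report reboot."""
    # Step 1: fall back to the default declaration when no path is given.
    declaration = declaration or DEFAULT_DECLARATION
    # Steps 2 and 3: run the build and extract closure_path from its output.
    closure = parse_fields(run_build(declaration)).get("closure_path")
    if not closure:
        raise RuntimeError("build output did not contain closure_path")
    # Step 4: switch to that closure.
    switch(closure)
    # Step 5: report that a reboot is required.
    return {"declaration": declaration,
            "closure_path": closure,
            "reboot_required": "true"}
```

Passing the build and switch steps in as callables keeps the sketch testable without a real node; a real implementation would invoke the bundled node CLI instead.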
### Switch / rollback / status already have generation mechanics

Current node-side logic already contains machinery for:

- current generation tracking
- rollback generation tracking
- generation directories
- current and rollback closure links
- metadata / provenance files per generation
- EFI loader update during switch / rollback

Again, the next work is about correctness, UX, and validation.

### Deploy already has a real skeleton

Current host-side deploy logic already does:

- remote `fruix` presence check
- remote store item existence checks
- copy of only missing store items
- remote `fruix system switch ...`
- optional reboot
- post-deploy remote `status` collection

That is enough to justify treating deploy as a hardening phase, not as a
greenfield feature.

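The "copy only missing store items" step can be illustrated with a small planning function. This is a sketch under assumptions: `plan_transfer` and the `exists_remotely` callable are hypothetical names, and the real deploy code performs its existence checks on the remote host rather than against an in-memory set.

```python
from typing import Callable, Iterable, List


def plan_transfer(closure_items: Iterable[str],
                  exists_remotely: Callable[[str], bool]) -> dict:
    """Split a closure's store items into ones to copy and ones to skip."""
    to_copy: List[str] = []
    skipped: List[str] = []
    for item in closure_items:
        # Items already present on the target are skipped, not re-sent.
        (skipped if exists_remotely(item) else to_copy).append(item)
    return {"copy": to_copy, "skip": skipped}
```

Returning both lists, rather than only the items to copy, gives the operator something concrete to confirm against when checking that only missing items were transferred.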
## Initial review findings

A first pass over the current node-side lifecycle code in
`modules/fruix/system/freebsd/render.scm`, together with the initial system
layout code in `modules/fruix/system/freebsd/media.scm`, suggests the following.

### Good news

- the default declaration path is already explicit and closure-local:
  - `/run/current-system/metadata/system-declaration.scm`
  - `/run/current-system/metadata/system-declaration-info.scm`
  - `/run/current-system/metadata/system-declaration-system`
- node-local `reconfigure` already uses the bundled Fruix node CLI rather than
  an external checkout
- generation directories, current generation tracking, rollback tracking, and
  gcroots already exist
- `status`, `switch`, and `rollback` already have simple and readable mechanics

### Gaps worth addressing first

#### 1. Generation metadata schema is inconsistent

The initial generation layout written at install time records:

- `system-declaration-file`
- `system-declaration-info-file`
- `system-declaration-system-file`

But the node-side generation metadata written by later `switch` / `reconfigure`
paths currently does not record those fields.

That should be aligned so generation metadata remains consistent before and
after the first node-local transition.

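A schema check along these lines would catch the drift between install-time and switched generations. It is a sketch only: it assumes the generation metadata has already been read into a Python dict, which is not how the Scheme code stores it, and the helper name is hypothetical. The three field names are the ones listed above.

```python
# Declaration fields recorded by the install-time generation layout.
REQUIRED_DECLARATION_FIELDS = (
    "system-declaration-file",
    "system-declaration-info-file",
    "system-declaration-system-file",
)


def missing_declaration_fields(metadata: dict) -> list:
    """Return the declaration fields that are absent or empty in metadata."""
    return [name for name in REQUIRED_DECLARATION_FIELDS
            if not metadata.get(name)]
```

Running the same check against both the installed generation and any later switched generation makes the "consistent before and after the first transition" requirement testable.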
#### 2. `status` is still a bit too sparse

Current `status` reports:

- current generation
- current link
- current closure
- rollback generation
- rollback link
- rollback closure

That is a good minimum, but it does not yet surface:

- declaration identity
- generation metadata path
- whether current / rollback metadata files are present
- pinned Fruix/node payload identity

#### 3. Switch / rollback ordering should be reviewed deliberately

The current logic is straightforward, but it updates:

- generation directories
- rollback links/files
- current links/files
- gcroots
- `/run/current-system`
- EFI loader state

in a stepwise order.

That may be acceptable for now, but it should be reviewed as an intentional
ordering with known invariants and failure expectations, not just as whatever
sequence happened to emerge first.

#### 4. The declaration ownership model is explicit, but still not fully operator-facing

The current implementation already points the default declaration at the active
closure metadata, which is good for self-containment.

However, Fruix still needs to make clearer whether the intended operator model
is:

- reconfigure from the immutable current declaration by default
- pass an explicit new declaration path when changing configuration
- later maintain a node-local editable declaration source of truth

### Immediate Phase 1 task list

Completed:

- aligned switched-generation metadata and provenance with the installed
  generation schema for declaration metadata fields
- enriched `fruix system status` with declaration and generation identity
  fields
- added a focused render-level lifecycle test covering those outputs
- added a manual remote integration harness,
  `tests/run-self-hosted-node-lifecycle.sh`, for exercising `status` /
  `reconfigure` / `rollback` against a real booted self-hosted node over SSH
- added an XCP-ng bring-up harness,
  `tests/run-self-hosted-node-lifecycle-xcpng.sh`, that reuses the established
  test VM path to build/import/boot a self-hosted image and then run the SSH
  lifecycle harness
- got the XCP-ng lifecycle path green again after tightening several real-host
  assumptions:
  - keep the XCP-ng guest on `ifconfig_xn0=SYNCDHCP`
  - use a root/dev-only filesystem layout in the harness-generated guest so a
    temporary-host store path is not hidden by guest tmpfs policy
  - disable `clear_tmp` in that harness so closure/store paths staged under
    `/tmp` survive boot
  - use a fresh per-run host store in the harness to avoid dragging unrelated
    historical store items into the guest image
  - prepare a narrow node-local reuse store for reconfigure/rollback so the
    guest can reuse its already-staged host-staged package stores
  - harden the staged runtime for live-node reuse by:
    - removing the materialized Guile/shepherd host-prefix leakage
    - fixing root home ownership for SSH strict-modes
    - avoiding external `cmp`/`tail` assumptions in node-local operations
    - seeding `/usr/src/sys/sys` from the staged profile for the transitional
      `freebsd-kernel-headers` path

Remaining:

- execute the real-node validation harness against a booted self-hosted node
  and record the outcome
- decide whether generation-local `install.scm` should keep its current
  deployment-oriented schema or move closer to the initial install-generation
  shape

### Reviewed switch / rollback ordering invariants

The current implementation should be treated as intentionally following these
rules:

- prepare the new generation directory and metadata before changing any current
  or rollback pointers
- record rollback pointers from the previously current generation before moving
  current pointers to the new generation
- update generation links / files and gcroots before changing
  `/run/current-system`
- update EFI loader state after the current closure link has moved

That gives Fruix a simple current contract:

- generation data exists before it becomes current
- rollback points at the previously current generation
- `/run/current-system` moves only after generation metadata and gcroots are in
  place

Known caveat for later refinement:

- EFI loader update happens after current-state links move, so a failure during
  loader copy could leave the running/current generation advanced while boot
  media state still reflects the previous loader payload

That is acceptable for now, but it should remain an explicit reviewed tradeoff.

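The ordering rules above can be modelled with plain symlinks in a scratch directory. This is a toy model, not the code in `render.scm`: the directory layout (`generations/N`, `current`, `rollback`, `run-current-system`), the metadata file contents, and the `update_loader` callable are all hypothetical, and gcroots are left out for brevity.

```python
import os


def replace_symlink(target: str, link: str) -> None:
    """Repoint `link` at `target` via a temporary link and a rename."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old link; atomic on POSIX


def switch(root: str, generation: int, closure: str, update_loader) -> None:
    """Apply one switch in the reviewed order (toy layout under `root`)."""
    gen_dir = os.path.join(root, "generations", str(generation))
    current = os.path.join(root, "current")
    rollback = os.path.join(root, "rollback")
    run_current = os.path.join(root, "run-current-system")

    # 1. Prepare the generation directory and metadata before any pointer moves.
    os.makedirs(gen_dir, exist_ok=True)
    with open(os.path.join(gen_dir, "metadata"), "w") as handle:
        handle.write(f"closure_path={closure}\n")
    replace_symlink(closure, os.path.join(gen_dir, "closure"))

    # 2. Record rollback from the previously current generation.
    if os.path.islink(current):
        replace_symlink(os.readlink(current), rollback)

    # 3. Move the current generation pointer (gcroots would be updated here).
    replace_symlink(gen_dir, current)

    # 4. Only now move the stand-in for /run/current-system.
    replace_symlink(closure, run_current)

    # 5. Update loader state last; a failure here leaves steps 1-4 applied.
    update_loader(closure)
```

If `update_loader` raises, the model ends up in exactly the state described by the known caveat: the current pointers have advanced while loader state has not.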
## Phase 1: node-local lifecycle hardening

Goal: an installed self-hosted Fruix node can manage itself from its own pinned
Fruix payload, without relying on a bootstrap checkout or ambient host paths.

This is the most important near-term product milestone.

### Scope

Focus on:

- `fruix system status`
- `fruix system reconfigure`
- `fruix system switch`
- `fruix system rollback`

### Questions to settle

#### 1. What exactly is the default declaration?

When the operator runs `fruix system reconfigure` with no explicit declaration,
Fruix should have a crisp answer for:

- where the current declaration lives
- where the current system name lives
- which Fruix payload on disk owns that declaration
- whether local edits to the declaration are the intended workflow

This should be explicit, not inferred from bootstrap-era conventions.

#### 2. What is the reboot contract?

Right now `reconfigure` reports `reboot_required=true`.

That is acceptable, but Fruix should decide and document whether:

- every switch requires a reboot
- some changes can be activated live and others require a reboot
- `switch` means “prepare next boot” or “change active system state now”

For now, a simple and conservative policy is fine, but it should be explicit.

#### 3. What makes a generation valid?

Fruix should clearly define the invariants for a system generation:

- closure exists
- activation script exists
- shepherd configuration exists
- boot artifacts exist
- generation metadata and provenance are written
- loader / current / rollback links are updated atomically enough

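Most of these invariants are existence checks, so they could be collected into one validation helper. The sketch below is illustrative: the relative paths in `REQUIRED_CLOSURE_ENTRIES` are hypothetical placeholders rather than the real closure layout, and the link-atomicity invariant is deliberately not covered because it concerns how updates happen, not what exists.

```python
import os

# Hypothetical relative paths inside a closure; the real layout may differ.
REQUIRED_CLOSURE_ENTRIES = {
    "activation script": "activate",
    "shepherd configuration": os.path.join("shepherd", "init.scm"),
    "boot artifacts": "boot",
}


def generation_problems(closure: str, metadata_file: str,
                        provenance_file: str) -> list:
    """Return human-readable problems; an empty list means the checks pass."""
    if not os.path.isdir(closure):
        # Nothing else can be checked without the closure itself.
        return [f"closure missing: {closure}"]
    problems = []
    for label, relative in REQUIRED_CLOSURE_ENTRIES.items():
        if not os.path.exists(os.path.join(closure, relative)):
            problems.append(f"{label} missing: {relative}")
    for label, path in (("metadata", metadata_file),
                        ("provenance", provenance_file)):
        if not os.path.isfile(path):
            problems.append(f"{label} missing: {path}")
    return problems
```

Returning a list of problems instead of a boolean fits the goal of tighter operator-facing error messages: `switch` and `rollback` could refuse to proceed and print every failed check at once.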
#### 4. What is rollback allowed to assume?

Rollback should be very boring and predictable.

It should clearly state:

- what is recorded as the previous generation
- when rollback metadata is updated
- what happens if the previous generation closure is missing
- whether rollback always updates EFI loader state

### Acceptance criteria

A self-hosted installed system should be able to:

1. boot successfully
2. run `fruix system status` and show coherent current / rollback information
3. run `fruix system reconfigure` using its default declaration
4. create a new generation with updated metadata
5. run `fruix system rollback`
6. report the earlier generation again via `status`

### Recommended implementation tasks

- make the default declaration / system-name ownership explicit
- tighten error messages around missing declaration / missing generation links
- verify generation metadata fields are complete and stable
- review link update ordering for current / rollback / gcroots / `/run/current-system`
- decide whether any metadata should include the pinned Fruix revision or node
  CLI identity
- add a small test plan for generation creation and rollback behavior

### Likely files

- `modules/fruix/system/freebsd/render.scm`
- `modules/fruix/system/freebsd/media.scm`
- `scripts/fruix.scm`
- `examples/system/self-hosted-dev.scm`

## Phase 2: deploy hardening

Goal: build once on one Fruix-capable machine, then deploy closures to another
Fruix node with predictable switching and verification.

### Scope

Focus on:

- `fruix system deploy`
- remote `fruix system switch`
- optional reboot and post-reboot verification

### Questions to settle

#### 1. What are the remote preflight requirements?

Before deploy, Fruix should clearly check:

- remote `/usr/local/bin/fruix` exists
- remote store directory exists and is writable enough
- remote host has the expected node runtime payload
- remote node can perform `system switch`

#### 2. What is the transfer contract?

Current behavior copies missing store items only. That is good.

This should remain explicit:

- which references are transferred
- which are skipped because they already exist remotely
- whether closure completeness is validated before switching

#### 3. What is success?

A successful deploy should ideally mean:

- the remote switch command completed
- the intended closure is now current on the target
- optional reboot completed when requested
- remote `status` confirms the expected current generation / closure

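The last success condition can be expressed as a comparison between the closure that was deployed and what remote `status` reports. The sketch below assumes `status` prints `key=value` lines and that the current closure appears under a field named `current_closure`; both the format and the field name are assumptions, not the confirmed output of `fruix system status`.

```python
def parse_status(output: str) -> dict:
    """Parse `key=value` status lines (an assumed format) into a dict."""
    return dict(line.split("=", 1)
                for line in output.splitlines() if "=" in line)


def deploy_succeeded(status_output: str, expected_closure: str,
                     field: str = "current_closure") -> bool:
    """True when remote status reports the expected closure as current."""
    return parse_status(status_output).get(field) == expected_closure
```

Running the same check once after the remote switch and again after an optional reboot would cover both the "switch completed" and "still current after reboot" cases.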
### Acceptance criteria

From one Fruix-capable machine, an operator should be able to:

1. build a closure locally
2. deploy it to a second Fruix node
3. confirm only missing store items were transferred
4. switch the remote node to the new closure
5. optionally reboot the remote node
6. confirm remote `status` reports the expected closure after reboot

### Recommended implementation tasks

- make remote preflight failures more explicit and operator-friendly
- ensure deploy metadata is complete enough for troubleshooting
- verify closure identity before and after remote switch
- verify reboot / reconnect behavior is reliable enough for normal use
- decide whether deploy should gain a `--no-switch` or `--copy-only` mode later
  or whether that is unnecessary complexity for now

### Likely files

- `scripts/fruix.scm`
- `modules/fruix/system/freebsd/render.scm`

## Phase 3: make the installed Fruix contract explicit

Goal: an installed system should clearly contain and own the Fruix payload it
uses for self-management.

This is where the `fruix` vs `fruix-bootstrap` boundary becomes operational,
not just architectural.

### Desired contract

After installation and first boot:

- the machine can manage itself using Fruix already present on disk
- node-local commands do not depend on a checkout outside the node
- the installed declaration source of truth is explicit
- the node knows which Fruix payload / revision it is running against

### Questions to settle

- where should the installed declaration live long-term?
- should the node preserve an editable declaration path distinct from the
  immutable closure metadata?
- how should the pinned Fruix revision be recorded and surfaced?
- how much of bootstrap-era installer state should remain after first boot?

### Acceptance criteria

On an installed node:

- `fruix system status` can report enough information to identify the current
  declaration and active closure
- `fruix system reconfigure` works without any external checkout
- the installed Fruix payload can be updated later by a future explicit
  Fruix-managed workflow rather than by manual bootstrap-era conventions

## Phase 4: source / pin / lock / upgrade workflow

Goal: make `fruix system upgrade` real, not just a thin alias for rebuild.

This phase should come after Phases 1 through 3.

### Why upgrade is later

`upgrade` is not just “run build again.” It implicitly asks:

- what source identity is current?
- what source identity is the proposed next state?
- who updated the pin / lock?
- how are changes compared and recorded?
- what exactly does rollback mean across source revisions?

Without that, `upgrade` risks being vague and operationally surprising.

### Desired workflow

- inspect current pinned Fruix / source state
- choose updated pin / lock state
- build new closure from that explicit state
- switch or deploy it
- retain enough provenance for rollback and audit

### Acceptance criteria

A future `fruix system upgrade` should:

1. tell the operator what source state is changing
2. build from an explicit new pinned state
3. record enough provenance to explain the transition
4. preserve normal switch / rollback semantics

## Recommended next concrete milestone

The strongest next milestone is:

**Self-hosted Fruix node lifecycle works end-to-end.**

That means validating all of the following against a real booted self-hosted
system:

1. install it
2. boot it
3. run `fruix system status`
4. run `fruix system reconfigure`
5. run `fruix system rollback`
6. inspect generation metadata and current / rollback links
7. from another machine, run `fruix system deploy`

This is a better next milestone than `upgrade` because it makes Fruix feel like
an actual managed system first.

## Definition of done for this phase

This lifecycle phase can be considered “good for now” when:

- node-local `status` / `reconfigure` / `switch` / `rollback` are reliable on a
  real installed self-hosted node
- remote `deploy` is reliable enough for normal manual operator use
- the default declaration and pinned Fruix ownership model are explicit
- no bootstrap checkout is required for normal post-install lifecycle
- `upgrade` is intentionally deferred until source / pin / lock design is ready

## Non-goals for this phase

To keep this phase bounded, do not require all of the following yet:

- substitutes / binary cache design
- full publication story
- final source lock UX
- advanced deploy orchestration
- live non-reboot service reconfiguration semantics
- broad package ecosystem polish

Those can come later.