Fruix lifecycle plan: node-local management and deploy
This document is a follow-on to docs/plan.md.
The build / jail / assembly / privileged-policy foundation is now in a good state for the currently exercised paths. The next step is to make installed systems feel like real Fruix-managed nodes rather than images that merely boot.
The core recommendation is:
- harden node-local lifecycle first
- harden deploy second
- make the pinned Fruix-on-node contract explicit
- only then add a real upgrade workflow
Short version
Near-term priority order:
- status
- reconfigure
- switch
- rollback
- deploy
- pinned Fruix / declaration ownership cleanup
- upgrade
Reasoning:
- status / reconfigure / switch / rollback are what make an installed system feel like a Fruix node
- deploy is the next natural operator workflow once node-local switching is trustworthy
- upgrade should come later because it is not just an execution feature; it implies source update policy, pinning, provenance comparison, and rollback semantics across revisions
Current state snapshot
There is already meaningful lifecycle scaffolding in-tree.
Node-local commands already exist in the rendered CLI
The in-system fruix CLI usage already includes:
- fruix system status
- fruix system build
- fruix system build-base
- fruix system deploy
- fruix system installer-tui
- fruix system reconfigure
- fruix system switch
- fruix system rollback
This means the next phase is mostly about hardening, contract cleanup, and validation rather than inventing the entire lifecycle from scratch.
Reconfigure already has the right basic shape
Current node-side reconfigure logic already roughly does:
- determine the default declaration when no path is given
- invoke the bundled node CLI to run system build
- extract closure_path
- switch to that closure
- report that a reboot is required
That is a good starting shape.
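The steps above can be sketched as follows. This is a minimal illustration, not the actual node-side code: the argument order passed to system build and system switch, and the key=value output format, are assumptions (the latter suggested only by the reboot_required=true report). The run callable is a hypothetical hook that executes a command and returns its output.

```python
from typing import Callable, Dict, List, Optional

# Closure-local default declaration path, as named in this plan.
DEFAULT_DECLARATION = "/run/current-system/metadata/system-declaration.scm"


def parse_key_values(output: str) -> Dict[str, str]:
    """Parse key=value lines; lines without '=' are ignored (assumed format)."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def reconfigure(run: Callable[[List[str]], str],
                declaration: Optional[str] = None) -> Dict[str, str]:
    """Build from the declaration, switch to the result, report reboot need."""
    # 1. fall back to the default declaration when no path is given
    declaration = declaration or DEFAULT_DECLARATION
    # 2. run system build through the bundled node CLI (argv shape is assumed)
    build_output = run(["fruix", "system", "build", declaration])
    # 3. extract closure_path from the build output
    closure = parse_key_values(build_output).get("closure_path")
    if not closure:
        raise RuntimeError("system build did not report closure_path")
    # 4. switch to that closure
    run(["fruix", "system", "switch", closure])
    # 5. report that a reboot is required
    return {"closure_path": closure, "reboot_required": "true"}
```

Passing run in as a parameter keeps the flow testable without a real node; a production version would also need to surface build failures to the operator.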
Switch / rollback / status already have generation mechanics
Current node-side logic already contains machinery for:
- current generation tracking
- rollback generation tracking
- generation directories
- current and rollback closure links
- metadata / provenance files per generation
- EFI loader update during switch / rollback
Again, the next work is about correctness, UX, and validation.
Deploy already has a real skeleton
Current host-side deploy logic already does:
- remote fruix presence check
- remote store item existence checks
- copy of only missing store items
- remote fruix system switch ...
- optional reboot
- post-deploy remote status collection
That is enough to justify treating deploy as a hardening phase, not as a green field feature.
Initial review findings
A first pass over the current node-side lifecycle code in
modules/fruix/system/freebsd/render.scm, together with the initial system
layout code in modules/fruix/system/freebsd/media.scm, suggests the following.
Good news
- the default declaration path is already explicit and closure-local:
  - /run/current-system/metadata/system-declaration.scm
  - /run/current-system/metadata/system-declaration-info.scm
  - /run/current-system/metadata/system-declaration-system
- node-local reconfigure already uses the bundled Fruix node CLI rather than an external checkout
- generation directories, current generation tracking, rollback tracking, and gcroots already exist
- status, switch, and rollback already have simple and readable mechanics
Gaps worth addressing first
1. Generation metadata schema is inconsistent
The initial generation layout written at install time records:
- system-declaration-file
- system-declaration-info-file
- system-declaration-system-file
But the node-side generation metadata written by later switch / reconfigure
paths currently does not record those fields.
That should be aligned so generation metadata remains consistent before and after the first node-local transition.
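A small consistency check makes this kind of drift easy to catch. The sketch below assumes generation metadata has already been read into a plain dictionary keyed by field name; how the metadata file is actually parsed is outside its scope, and the function name is hypothetical.

```python
from typing import Dict, List

# Declaration fields recorded by the install-time generation layout.
DECLARATION_FIELDS = (
    "system-declaration-file",
    "system-declaration-info-file",
    "system-declaration-system-file",
)


def missing_declaration_fields(metadata: Dict[str, str]) -> List[str]:
    """Return the declaration fields absent from one generation's metadata."""
    return [name for name in DECLARATION_FIELDS if name not in metadata]
```

Running the same check against both an installed generation and a switched generation gives a direct signal that the two code paths write the same schema.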
2. status is still a bit too sparse
Current status reports:
- current generation
- current link
- current closure
- rollback generation
- rollback link
- rollback closure
That is a good minimum, but it does not yet surface:
- declaration identity
- generation metadata path
- whether current / rollback metadata files are present
- pinned Fruix/node payload identity
3. Switch / rollback ordering should be reviewed deliberately
The current logic is straightforward, but it updates:
- generation directories
- rollback links/files
- current links/files
- gcroots
- /run/current-system
- EFI loader state
in a stepwise order.
That may be acceptable for now, but it should be reviewed as an intentional ordering with known invariants and failure expectations, not just as whatever sequence happened to emerge first.
4. The declaration ownership model is explicit, but still not fully
operator-facing
The current implementation already points the default declaration at the active closure metadata, which is good for self-containment.
However, Fruix still needs to make clearer whether the intended operator model is:
- reconfigure from the immutable current declaration by default
- pass an explicit new declaration path when changing configuration
- later maintain a node-local editable declaration source of truth
Immediate Phase 1 task list
Completed:
- aligned switched-generation metadata and provenance with the installed generation schema for declaration metadata fields
- enriched fruix system status with declaration and generation identity fields
- added a focused render-level lifecycle test covering those outputs
- added a manual remote integration harness:
  - tests/run-self-hosted-node-lifecycle.sh for exercising status / reconfigure / rollback against a real booted self-hosted node over SSH
- added an XCP-ng bring-up harness:
  - tests/run-self-hosted-node-lifecycle-xcpng.sh that reuses the established test VM path to build/import/boot a self-hosted image and then run the SSH lifecycle harness
- got the XCP-ng lifecycle path green again after tightening several real-host assumptions:
  - keep the XCP-ng guest on ifconfig_xn0=SYNCDHCP
  - use a root/dev-only filesystem layout in the harness-generated guest so a temporary-host store path is not hidden by guest tmpfs policy
  - disable clear_tmp in that harness so closure/store paths staged under /tmp survive boot
  - use a fresh per-run host store in the harness to avoid dragging unrelated historical store items into the guest image
  - prepare a narrow node-local reuse store for reconfigure/rollback so the guest can reuse the host-staged package stores it already holds
  - harden the staged runtime for live-node reuse by:
    - removing the materialized Guile/shepherd host-prefix leakage
    - fixing root home ownership for SSH strict-modes
    - avoiding external cmp / tail assumptions in node-local operations
    - seeding /usr/src/sys/sys from the staged profile for the transitional freebsd-kernel-headers path
Remaining:
- execute the real-node validation harness against a booted self-hosted node and record the outcome
- decide whether generation-local install.scm should keep its current deployment-oriented schema or move closer to the initial install-generation shape
Reviewed switch / rollback ordering invariants
The current implementation should be treated as intentionally following these rules:
- prepare the new generation directory and metadata before changing any current or rollback pointers
- record rollback pointers from the previously current generation before moving current pointers to the new generation
- update generation links / files and gcroots before changing /run/current-system
- update EFI loader state after the current closure link has moved
That gives Fruix a simple current contract:
- generation data exists before it becomes current
- rollback points at the previously current generation
- /run/current-system moves only after generation metadata and gcroots are in place
Known caveat for later refinement:
- EFI loader update happens after current-state links move, so a failure during loader copy could leave the running/current generation advanced while boot media state still reflects the previous loader payload
That is acceptable for now, but it should remain an explicit reviewed tradeoff.
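The ordering rules and the loader caveat can be modelled with an in-memory state. This is only a sketch: NodeState, its field names, and the update_loader hook are hypothetical stand-ins for the real links, files, and EFI copy step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class NodeState:
    generations: Dict[int, str] = field(default_factory=dict)  # number -> closure
    current: Optional[int] = None
    rollback: Optional[int] = None
    gcroots: Set[str] = field(default_factory=set)
    run_current_system: Optional[str] = None
    loader_closure: Optional[str] = None


def switch(state: NodeState, closure: str,
           update_loader: Callable[[str], None]) -> int:
    # 1. prepare the new generation before touching any pointer
    new_generation = max(state.generations, default=0) + 1
    state.generations[new_generation] = closure
    # 2. record rollback from the previously current generation
    state.rollback = state.current
    # 3. move the current pointer and register the gcroot
    state.current = new_generation
    state.gcroots.add(closure)
    # 4. only then move /run/current-system
    state.run_current_system = closure
    # 5. update the EFI loader last; a failure here leaves steps 1-4 applied
    update_loader(closure)
    state.loader_closure = closure
    return new_generation
```

If update_loader raises, the model ends with current and run_current_system pointing at the new closure while loader_closure still names the old one, which is exactly the reviewed tradeoff described above.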
Phase 1: node-local lifecycle hardening
Goal: an installed self-hosted Fruix node can manage itself from its own pinned Fruix payload, without relying on a bootstrap checkout or ambient host paths.
This is the most important near-term product milestone.
Scope
Focus on:
- fruix system status
- fruix system reconfigure
- fruix system switch
- fruix system rollback
Questions to settle
1. What exactly is the default declaration?
When the operator runs fruix system reconfigure with no explicit declaration,
Fruix should have a crisp answer for:
- where the current declaration lives
- where the current system name lives
- which Fruix payload on disk owns that declaration
- whether local edits to the declaration are the intended workflow
This should be explicit, not inferred from bootstrap-era conventions.
2. What is the reboot contract?
Right now reconfigure reports reboot_required=true.
That is acceptable, but Fruix should decide and document whether:
- every switch requires a reboot
- some changes can be activated live and others require a reboot
- switch means “prepare next boot” or “change active system state now”
For now, a simple and conservative policy is fine, but it should be explicit.
3. What makes a generation valid?
Fruix should clearly define the invariants for a system generation:
- closure exists
- activation script exists
- shepherd configuration exists
- boot artifacts exist
- generation metadata and provenance are written
- loader / current / rollback links are updated atomically enough
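Most of these invariants can be checked mechanically before a generation is allowed to become current. The sketch below is illustrative only: every entry name in the two tables is a hypothetical placeholder, since the real closure and generation layouts are defined elsewhere, and the link-update invariant is not covered.

```python
from pathlib import Path
from typing import List

# Hypothetical entry names, for illustration only; the real layout may differ.
CLOSURE_ENTRIES = {
    "activation script": "activate",
    "shepherd configuration": "shepherd.scm",
    "boot artifacts": "boot",
}
GENERATION_ENTRIES = {
    "generation metadata": "metadata.scm",
    "provenance": "provenance.scm",
}


def generation_problems(closure: Path, generation_dir: Path) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    if not closure.is_dir():
        return [f"closure does not exist: {closure}"]
    problems = []
    for label, name in CLOSURE_ENTRIES.items():
        if not (closure / name).exists():
            problems.append(f"{label} missing: {closure / name}")
    for label, name in GENERATION_ENTRIES.items():
        if not (generation_dir / name).exists():
            problems.append(f"{label} missing: {generation_dir / name}")
    return problems
```

Returning a list of problems rather than a boolean lets switch and status report every missing piece at once instead of failing on the first.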
4. What is rollback allowed to assume?
Rollback should be very boring and predictable.
It should clearly state:
- what is recorded as the previous generation
- when rollback metadata is updated
- what happens if the previous generation closure is missing
- whether rollback always updates EFI loader state
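One possible set of answers is sketched below, purely to make the questions concrete. It assumes that rollback swaps the current and rollback pointers, refuses to act when the previous closure is missing, and always updates the loader; none of these choices is confirmed as the intended contract. The state dictionary and both callables are hypothetical.

```python
from typing import Callable, Dict, Optional


def rollback(state: Dict[str, Optional[str]],
             closure_exists: Callable[[str], bool],
             update_loader: Callable[[str], None]) -> Dict[str, Optional[str]]:
    """Swap current and rollback closures, validating before any change."""
    target = state.get("rollback")
    if target is None:
        raise RuntimeError("no rollback generation recorded")
    if not closure_exists(target):
        # fail before touching any pointer so the node stays as it was
        raise RuntimeError(f"rollback closure missing: {target}")
    previous = state.get("current")
    state["current"] = target
    state["rollback"] = previous
    # loader update comes last, mirroring the reviewed switch ordering
    update_loader(target)
    return state
```

Validating up front is what keeps the failure case boring: a missing closure produces an error and leaves every pointer untouched.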
Acceptance criteria
A self-hosted installed system should be able to:
- boot successfully
- run fruix system status and show coherent current / rollback information
- run fruix system reconfigure using its default declaration
- create a new generation with updated metadata
- run fruix system rollback
- report the earlier generation again via status
Recommended implementation tasks
- make the default declaration / system-name ownership explicit
- tighten error messages around missing declaration / missing generation links
- verify generation metadata fields are complete and stable
- review link update ordering for current / rollback / gcroots / /run/current-system
- decide whether any metadata should include the pinned Fruix revision or node CLI identity
- add a small test plan for generation creation and rollback behavior
Likely files
- modules/fruix/system/freebsd/render.scm
- modules/fruix/system/freebsd/media.scm
- scripts/fruix.scm
- examples/system/self-hosted-dev.scm
Phase 2: deploy hardening
Goal: build once on one Fruix-capable machine, then deploy closures to another Fruix node with predictable switching and verification.
Scope
Focus on:
- fruix system deploy
- remote fruix system switch
- optional reboot and post-reboot verification
Questions to settle
1. What are the remote preflight requirements?
Before deploy, Fruix should clearly check:
- remote /usr/local/bin/fruix exists
- remote store directory exists and is writable enough
- remote host has the expected node runtime payload
- remote node can perform system switch
2. What is the transfer contract?
Current behavior copies missing store items only. That is good.
This should remain explicit:
- which references are transferred
- which are skipped because they already exist remotely
- whether closure completeness is validated before switching
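The transfer contract reduces to a partition of the closure's references plus a completeness check before switching. In this sketch remote_has is a hypothetical predicate standing in for the remote store existence check; the function names are likewise illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def plan_transfer(closure_refs: Sequence[str],
                  remote_has: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Split references into those to copy and those already on the target."""
    to_copy: List[str] = []
    skipped: List[str] = []
    for ref in closure_refs:
        if remote_has(ref):
            skipped.append(ref)
        else:
            to_copy.append(ref)
    return to_copy, skipped


def closure_complete(closure_refs: Sequence[str],
                     remote_has: Callable[[str], bool]) -> bool:
    """True when every reference of the closure exists on the target."""
    return all(remote_has(ref) for ref in closure_refs)
```

Recording both lists in deploy metadata would make it straightforward to confirm afterwards that only missing items were transferred.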
3. What is success?
A successful deploy should ideally mean:
- the remote switch command completed
- the intended closure is now current on the target
- optional reboot completed when requested
- remote status confirms the expected current generation / closure
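The final check can be expressed as a comparison between the closure that was deployed and what the remote status output reports. This sketch assumes status prints key=value lines and that the current closure appears under a field named current_closure; both the format and the field name are assumptions.

```python
from typing import List


def verify_deploy(status_output: str, expected_closure: str) -> List[str]:
    """Return problems found in remote status; empty means deploy verified."""
    # Assumed format: one key=value pair per line; other lines are ignored.
    fields = dict(
        line.split("=", 1) for line in status_output.splitlines() if "=" in line
    )
    problems = []
    actual = fields.get("current_closure")  # hypothetical field name
    if actual != expected_closure:
        problems.append(
            f"current closure is {actual!r}, expected {expected_closure!r}"
        )
    return problems
```

Running the same check once after the switch and again after the optional reboot covers both success conditions with a single code path.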
Acceptance criteria
From one Fruix-capable machine, an operator should be able to:
- build a closure locally
- deploy it to a second Fruix node
- confirm only missing store items were transferred
- switch the remote node to the new closure
- optionally reboot the remote node
- confirm remote status reports the expected closure after reboot
Recommended implementation tasks
- make remote preflight failures more explicit and operator-friendly
- ensure deploy metadata is complete enough for troubleshooting
- verify closure identity before and after remote switch
- verify reboot / reconnect behavior is reliable enough for normal use
- decide whether deploy should gain a --no-switch or --copy-only mode later or whether that is unnecessary complexity for now
Likely files
- scripts/fruix.scm
- modules/fruix/system/freebsd/render.scm
Phase 3: make the installed Fruix contract explicit
Goal: an installed system should clearly contain and own the Fruix payload it uses for self-management.
This is where the fruix vs fruix-bootstrap boundary becomes operational,
not just architectural.
Desired contract
After installation and first boot:
- the machine can manage itself using Fruix already present on disk
- node-local commands do not depend on a checkout outside the node
- the installed declaration source of truth is explicit
- the node knows which Fruix payload / revision it is running against
Questions to settle
- where should the installed declaration live long-term?
- should the node preserve an editable declaration path distinct from the immutable closure metadata?
- how should the pinned Fruix revision be recorded and surfaced?
- how much of bootstrap-era installer state should remain after first boot?
Acceptance criteria
On an installed node:
- fruix system status can report enough information to identify the current declaration and active closure
- fruix system reconfigure works without any external checkout
- the installed Fruix payload can be updated later by a future explicit Fruix-managed workflow rather than by manual bootstrap-era conventions
Phase 4: source / pin / lock / upgrade workflow
Goal: make fruix system upgrade real, not just a thin alias for rebuild.
This phase should come after Phases 1 through 3.
Why upgrade is later
upgrade is not just “run build again.” It implicitly asks:
- what source identity is current?
- what source identity is the proposed next state?
- who updated the pin / lock?
- how are changes compared and recorded?
- what exactly does rollback mean across source revisions?
Without that, upgrade risks being vague and operationally surprising.
Desired workflow
- inspect current pinned Fruix / source state
- choose updated pin / lock state
- build new closure from that explicit state
- switch or deploy it
- retain enough provenance for rollback and audit
Acceptance criteria
A future fruix system upgrade should:
- tell the operator what source state is changing
- build from an explicit new pinned state
- record enough provenance to explain the transition
- preserve normal switch / rollback semantics
Recommended next concrete milestone
The strongest next milestone is:
Self-hosted Fruix node lifecycle works end-to-end.
That means validating all of the following against a real booted self-hosted system:
- install it
- boot it
- run fruix system status
- run fruix system reconfigure
- run fruix system rollback
- inspect generation metadata and current / rollback links
- from another machine, run fruix system deploy
This is a better next milestone than upgrade because it makes Fruix feel like
an actual managed system first.
Definition of done for this phase
This lifecycle phase can be considered “good for now” when:
- node-local status / reconfigure / switch / rollback are reliable on a real installed self-hosted node
- remote deploy is reliable enough for normal manual operator use
- the default declaration and pinned Fruix ownership model are explicit
- no bootstrap checkout is required for normal post-install lifecycle
- upgrade is intentionally deferred until source / pin / lock design is ready
Non-goals for this phase
To keep this phase bounded, do not require all of the following yet:
- substitutes / binary cache design
- full publication story
- final source lock UX
- advanced deploy orchestration
- live non-reboot service reconfiguration semantics
- broad package ecosystem polish
Those can come later.