Fruix lifecycle plan: node-local management and deploy

This document is a follow-on to docs/plan.md.

The build / jail / assembly / privileged-policy foundation is now in a good state for the currently exercised paths. The next step is to make installed systems feel like real Fruix-managed nodes rather than images that merely boot.

The core recommendation is:

  1. harden node-local lifecycle first
  2. harden deploy second
  3. make the pinned Fruix-on-node contract explicit
  4. only then add a real upgrade workflow

Short version

Near-term priority order:

  1. status
  2. reconfigure
  3. switch
  4. rollback
  5. deploy
  6. pinned Fruix / declaration ownership cleanup
  7. upgrade

Reasoning:

  • status / reconfigure / switch / rollback are what make an installed system feel like a Fruix node
  • deploy is the next natural operator workflow once node-local switching is trustworthy
  • upgrade should come later because it is not just an execution feature; it implies source update policy, pinning, provenance comparison, and rollback semantics across revisions

Current state snapshot

There is already meaningful lifecycle scaffolding in-tree.

Node-local commands already exist in the rendered CLI

The in-system fruix CLI usage already includes:

  • fruix system status
  • fruix system build
  • fruix system build-base
  • fruix system deploy
  • fruix system installer-tui
  • fruix system reconfigure
  • fruix system switch
  • fruix system rollback

This means the next phase is mostly about hardening, contract cleanup, and validation rather than inventing the entire lifecycle from scratch.

Reconfigure already has the right basic shape

Current node-side reconfigure logic already roughly does:

  • determine the default declaration when no path is given
  • invoke the bundled node CLI to run system build
  • extract closure_path
  • switch to that closure
  • report that a reboot is required

That is a good starting shape.
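
The steps above can be sketched as follows. This is a minimal illustration, not the in-tree implementation: the `run_build` and `run_switch` callables are hypothetical stand-ins for the bundled node CLI, and the assumption that the build reports `closure_path` as a `key=value` line is made only for this example.

```python
from typing import Callable, Dict, Optional

# Default declaration path as listed later in this document.
DEFAULT_DECLARATION = "/run/current-system/metadata/system-declaration.scm"


def parse_key_values(text: str) -> Dict[str, str]:
    """Parse 'key=value' lines. This output format is an assumption."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def reconfigure(
    run_build: Callable[[str], str],
    run_switch: Callable[[str], None],
    declaration: Optional[str] = None,
) -> Dict[str, str]:
    """Build the declaration, switch to its closure, report reboot."""
    # 1. Fall back to the default declaration when no path is given.
    declaration = declaration or DEFAULT_DECLARATION
    # 2. Run the build (hypothetical callable wrapping the node CLI).
    build_output = run_build(declaration)
    # 3. Extract closure_path from the build output.
    closure_path = parse_key_values(build_output).get("closure_path")
    if not closure_path:
        raise RuntimeError("build did not report closure_path")
    # 4. Switch to that closure.
    run_switch(closure_path)
    # 5. Report that a reboot is required.
    return {
        "declaration": declaration,
        "closure_path": closure_path,
        "reboot_required": "true",
    }
```

Passing the build and switch steps in as callables keeps the flow testable without a real node; failing early when `closure_path` is absent avoids switching to an undefined closure.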

Switch / rollback / status already have generation mechanics

Current node-side logic already contains machinery for:

  • current generation tracking
  • rollback generation tracking
  • generation directories
  • current and rollback closure links
  • metadata / provenance files per generation
  • EFI loader update during switch / rollback

Again, the next work is about correctness, UX, and validation.

Deploy already has a real skeleton

Current host-side deploy logic already does:

  • remote fruix presence check
  • remote store item existence checks
  • copy of only missing store items
  • remote fruix system switch ...
  • optional reboot
  • post-deploy remote status collection

That is enough to justify treating deploy as a hardening phase, not as a greenfield feature.
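
As a rough sketch of that skeleton, the deploy sequence can be modelled like this. The `remote` object and all of its method names (`has_fruix`, `has_item`, `copy_item`, `switch`, `reboot`, `status`) are hypothetical and exist only for illustration; the real host-side code talks to the node over SSH.

```python
from typing import Callable, Iterable, List


def plan_transfer(
    closure_items: Iterable[str],
    exists_on_remote: Callable[[str], bool],
) -> List[str]:
    """Return the store items missing on the remote, preserving order."""
    missing = []
    seen = set()
    for item in closure_items:
        if item in seen:
            continue  # skip duplicate references
        seen.add(item)
        if not exists_on_remote(item):
            missing.append(item)
    return missing


def deploy(remote, closure_path, closure_items, reboot=False):
    """Copy only missing items, switch, optionally reboot, collect status."""
    # Preflight: the remote must already have fruix.
    if not remote.has_fruix():
        raise RuntimeError("remote fruix not found")
    # Transfer only what the remote store lacks.
    missing = plan_transfer(closure_items, remote.has_item)
    for item in missing:
        remote.copy_item(item)
    # Switch the remote node to the new closure.
    remote.switch(closure_path)
    if reboot:
        remote.reboot()
    # Post-deploy status collection.
    return {"copied": missing, "status": remote.status()}
```

Returning the list of copied items alongside the remote status gives the operator a record of what was actually transferred.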

Initial review findings

A first pass over the current node-side lifecycle code in modules/fruix/system/freebsd/render.scm, together with the initial system layout code in modules/fruix/system/freebsd/media.scm, suggests the following.

Good news

  • the default declaration path is already explicit and closure-local:
    • /run/current-system/metadata/system-declaration.scm
    • /run/current-system/metadata/system-declaration-info.scm
    • /run/current-system/metadata/system-declaration-system
  • node-local reconfigure already uses the bundled Fruix node CLI rather than an external checkout
  • generation directories, current generation tracking, rollback tracking, and gcroots already exist
  • status, switch, and rollback already have simple and readable mechanics

Gaps worth addressing first

1. Generation metadata schema is inconsistent

The initial generation layout written at install time records:

  • system-declaration-file
  • system-declaration-info-file
  • system-declaration-system-file

But the node-side generation metadata written by later switch / reconfigure paths currently does not record those fields.

That should be aligned so generation metadata remains consistent before and after the first node-local transition.
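
A consistency check of this kind could look like the sketch below. It assumes generation metadata has already been read into a plain dictionary; the on-disk format and the loading step are out of scope here, and the function name is hypothetical.

```python
from typing import Dict, List

# Declaration fields recorded by the install-time generation layout.
DECLARATION_FIELDS = (
    "system-declaration-file",
    "system-declaration-info-file",
    "system-declaration-system-file",
)


def missing_declaration_fields(metadata: Dict[str, str]) -> List[str]:
    """Return declaration fields that are absent or empty in metadata."""
    return [name for name in DECLARATION_FIELDS if not metadata.get(name)]
```

Running the same check against both the installed generation and a switched generation makes any schema drift visible as a non-empty list.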

2. status is still a bit too sparse

Current status reports:

  • current generation
  • current link
  • current closure
  • rollback generation
  • rollback link
  • rollback closure

That is a good minimum, but it does not yet surface:

  • declaration identity
  • generation metadata path
  • whether current / rollback metadata files are present
  • pinned Fruix/node payload identity

3. Switch / rollback ordering should be reviewed deliberately

The current logic is straightforward, but it updates:

  • generation directories
  • rollback links/files
  • current links/files
  • gcroots
  • /run/current-system
  • EFI loader state

in a stepwise order.

That may be acceptable for now, but it should be reviewed as an intentional ordering with known invariants and failure expectations, not just as whatever sequence happened to emerge first.

4. The declaration ownership model is explicit, but still not fully operator-facing

The current implementation already points the default declaration at the active closure metadata, which is good for self-containment.

However, Fruix still needs to make clearer whether the intended operator model is:

  • reconfigure from the immutable current declaration by default
  • pass an explicit new declaration path when changing configuration
  • later maintain a node-local editable declaration source of truth

Immediate Phase 1 task list

Completed:

  • aligned switched-generation metadata and provenance with the installed generation schema for declaration metadata fields
  • enriched fruix system status with declaration and generation identity fields
  • added a focused render-level lifecycle test covering those outputs
  • added a manual remote integration harness:
    • tests/run-self-hosted-node-lifecycle.sh for exercising status / reconfigure / rollback against a real booted self-hosted node over SSH
  • added an XCP-ng bring-up harness:
    • tests/run-self-hosted-node-lifecycle-xcpng.sh that reuses the established test VM path to build/import/boot a self-hosted image and then run the SSH lifecycle harness
  • got the XCP-ng lifecycle path green again after tightening several real-host assumptions:
    • keep the XCP-ng guest on ifconfig_xn0=SYNCDHCP
    • use a root/dev-only filesystem layout in the harness-generated guest so a temporary-host store path is not hidden by guest tmpfs policy
    • disable clear_tmp in that harness so closure/store paths staged under /tmp survive boot
    • use a fresh per-run host store in the harness to avoid dragging unrelated historical store items into the guest image
    • prepare a narrow node-local reuse store for reconfigure/rollback so the guest can reuse the package stores already staged from the host
    • harden the staged runtime for live-node reuse by:
      • removing the materialized Guile/shepherd host-prefix leakage
      • fixing root home ownership for SSH strict-modes
      • avoiding external cmp/tail assumptions in node-local operations
      • seeding /usr/src/sys/sys from the staged profile for the transitional freebsd-kernel-headers path

Remaining:

  • execute the real-node validation harness against a booted self-hosted node and record the outcome
  • decide whether generation-local install.scm should keep its current deployment-oriented schema or move closer to the initial install-generation shape

Reviewed switch / rollback ordering invariants

The current implementation should be treated as intentionally following these rules:

  • prepare the new generation directory and metadata before changing any current or rollback pointers
  • record rollback pointers from the previously current generation before moving current pointers to the new generation
  • update generation links / files and gcroots before changing /run/current-system
  • update EFI loader state after the current closure link has moved

That gives Fruix a simple current contract:

  • generation data exists before it becomes current
  • rollback points at the previously current generation
  • /run/current-system moves only after generation metadata and gcroots are in place

Known caveat for later refinement:

  • EFI loader update happens after current-state links move, so a failure during loader copy could leave the running/current generation advanced while boot media state still reflects the previous loader payload

That is acceptable for now, but it should remain an explicit reviewed tradeoff.
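
The ordering rules and the loader caveat can be illustrated with a small simulation on a scratch directory. All directory and link names under `root` are hypothetical, and the rename-based link replacement is a choice made for this sketch rather than something the current implementation is known to do.

```python
import os


def replace_symlink(target: str, link: str) -> None:
    """Point link at target by creating a temporary link and renaming it."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)


def switch(root: str, generation: int, closure: str, update_loader) -> None:
    """Apply the reviewed ordering against a scratch root directory."""
    gen_dir = os.path.join(root, "generations", str(generation))
    current = os.path.join(root, "current")
    rollback = os.path.join(root, "rollback")
    gcroots = os.path.join(root, "gcroots")

    # 1. Prepare the generation directory and metadata first.
    os.makedirs(gen_dir, exist_ok=True)
    with open(os.path.join(gen_dir, "metadata"), "w") as handle:
        handle.write(f"closure_path={closure}\n")

    # 2. Record rollback from the previously current generation.
    if os.path.lexists(current):
        replace_symlink(os.readlink(current), rollback)

    # 3. Move the current generation link and the gcroot.
    replace_symlink(gen_dir, current)
    os.makedirs(gcroots, exist_ok=True)
    replace_symlink(closure, os.path.join(gcroots, "current-system"))

    # 4. Only now move the stand-in for /run/current-system.
    replace_symlink(closure, os.path.join(root, "run-current-system"))

    # 5. Update loader state last; a failure here leaves steps 1-4 applied.
    update_loader(closure)
```

If `update_loader` raises, the current links have already advanced while the loader still reflects the previous payload, which is exactly the caveat recorded above.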

Phase 1: node-local lifecycle hardening

Goal: an installed self-hosted Fruix node can manage itself from its own pinned Fruix payload, without relying on a bootstrap checkout or ambient host paths.

This is the most important near-term product milestone.

Scope

Focus on:

  • fruix system status
  • fruix system reconfigure
  • fruix system switch
  • fruix system rollback

Questions to settle

1. What exactly is the default declaration?

When the operator runs fruix system reconfigure with no explicit declaration, Fruix should have a crisp answer for:

  • where the current declaration lives
  • where the current system name lives
  • which Fruix payload on disk owns that declaration
  • whether local edits to the declaration are the intended workflow

This should be explicit, not inferred from bootstrap-era conventions.

2. What is the reboot contract?

Right now reconfigure reports reboot_required=true.

That is acceptable, but Fruix should decide and document whether:

  • every switch requires a reboot
  • some changes can be activated live and others require a reboot
  • switch means “prepare next boot” or “change active system state now”

For now, a simple and conservative policy is fine, but it should be explicit.

3. What makes a generation valid?

Fruix should clearly define the invariants for a system generation:

  • closure exists
  • activation script exists
  • shepherd configuration exists
  • boot artifacts exist
  • generation metadata and provenance are written
  • loader / current / rollback links are updated atomically enough
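
One way to make these invariants checkable is a function that lists what is missing. The relative paths and metadata file names below are hypothetical placeholders, since the real closure layout is not specified in this plan; they are passed as parameters so they can be replaced with the actual names.

```python
import os
from typing import Dict, List, Sequence

# Hypothetical relative paths inside a closure, for illustration only.
EXAMPLE_REQUIRED = {
    "activation script": "activate",
    "shepherd configuration": "shepherd/init.scm",
    "boot artifacts": "boot",
}


def generation_problems(
    closure: str,
    generation_dir: str,
    required: Dict[str, str] = EXAMPLE_REQUIRED,
    metadata_files: Sequence[str] = ("metadata.scm", "provenance.scm"),
) -> List[str]:
    """Return human-readable problems; an empty list means valid."""
    if not os.path.isdir(closure):
        # Nothing else can be checked without the closure itself.
        return [f"closure missing: {closure}"]
    problems = []
    for label, relative in required.items():
        if not os.path.exists(os.path.join(closure, relative)):
            problems.append(f"{label} missing: {relative}")
    for name in metadata_files:
        if not os.path.isfile(os.path.join(generation_dir, name)):
            problems.append(f"generation file missing: {name}")
    return problems
```

Returning a list of problems rather than a boolean lets status and error messages say precisely which invariant failed.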

4. What is rollback allowed to assume?

Rollback should be very boring and predictable.

It should clearly state:

  • what is recorded as the previous generation
  • when rollback metadata is updated
  • what happens if the previous generation closure is missing
  • whether rollback always updates EFI loader state

Acceptance criteria

A self-hosted installed system should be able to:

  1. boot successfully
  2. run fruix system status and show coherent current / rollback information
  3. run fruix system reconfigure using its default declaration
  4. create a new generation with updated metadata
  5. run fruix system rollback
  6. report the earlier generation again via status

Tasks

  • make the default declaration / system-name ownership explicit
  • tighten error messages around missing declaration / missing generation links
  • verify generation metadata fields are complete and stable
  • review link update ordering for current / rollback / gcroots / /run/current-system
  • decide whether any metadata should include the pinned Fruix revision or node CLI identity
  • add a small test plan for generation creation and rollback behavior

Likely files

  • modules/fruix/system/freebsd/render.scm
  • modules/fruix/system/freebsd/media.scm
  • scripts/fruix.scm
  • examples/system/self-hosted-dev.scm

Phase 2: deploy hardening

Goal: build once on one Fruix-capable machine, then deploy closures to another Fruix node with predictable switching and verification.

Scope

Focus on:

  • fruix system deploy
  • remote fruix system switch
  • optional reboot and post-reboot verification

Questions to settle

1. What are the remote preflight requirements?

Before deploy, Fruix should clearly check:

  • remote /usr/local/bin/fruix exists
  • remote store directory exists and is writable enough
  • remote host has the expected node runtime payload
  • remote node can perform system switch

2. What is the transfer contract?

Current behavior copies missing store items only. That is good.

This should remain explicit:

  • which references are transferred
  • which are skipped because they already exist remotely
  • whether closure completeness is validated before switching

3. What is success?

A successful deploy should ideally mean:

  • the remote switch command completed
  • the intended closure is now current on the target
  • optional reboot completed when requested
  • remote status confirms the expected current generation / closure
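
These success conditions can be expressed as a single check, sketched below. The `current_closure` status key and the function signature are assumptions for illustration; the actual field names in the remote status output may differ.

```python
from typing import Dict, List, Optional


def deploy_problems(
    expected_closure: str,
    switch_exit_code: int,
    remote_status: Dict[str, str],
    reboot_requested: bool = False,
    reboot_completed: Optional[bool] = None,
) -> List[str]:
    """Return reasons the deploy is not yet a success; empty means success."""
    problems = []
    # The remote switch command must have completed.
    if switch_exit_code != 0:
        problems.append(f"remote switch exited with {switch_exit_code}")
    # A requested reboot must be confirmed.
    if reboot_requested and not reboot_completed:
        problems.append("reboot was requested but not confirmed")
    # Remote status must report the intended closure as current.
    actual = remote_status.get("current_closure")
    if actual != expected_closure:
        problems.append(
            f"remote reports {actual!r}, expected {expected_closure!r}"
        )
    return problems
```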

Acceptance criteria

From one Fruix-capable machine, an operator should be able to:

  1. build a closure locally
  2. deploy it to a second Fruix node
  3. confirm only missing store items were transferred
  4. switch the remote node to the new closure
  5. optionally reboot the remote node
  6. confirm remote status reports the expected closure after reboot

Tasks

  • make remote preflight failures more explicit and operator-friendly
  • ensure deploy metadata is complete enough for troubleshooting
  • verify closure identity before and after remote switch
  • verify reboot / reconnect behavior is reliable enough for normal use
  • decide whether deploy should gain a --no-switch or --copy-only mode later or whether that is unnecessary complexity for now

Likely files

  • scripts/fruix.scm
  • modules/fruix/system/freebsd/render.scm

Phase 3: make the installed Fruix contract explicit

Goal: an installed system should clearly contain and own the Fruix payload it uses for self-management.

This is where the fruix vs fruix-bootstrap boundary becomes operational, not just architectural.

Desired contract

After installation and first boot:

  • the machine can manage itself using Fruix already present on disk
  • node-local commands do not depend on a checkout outside the node
  • the installed declaration source of truth is explicit
  • the node knows which Fruix payload / revision it is running against

Questions to settle

  • where should the installed declaration live long-term?
  • should the node preserve an editable declaration path distinct from the immutable closure metadata?
  • how should the pinned Fruix revision be recorded and surfaced?
  • how much of bootstrap-era installer state should remain after first boot?

Acceptance criteria

On an installed node:

  • fruix system status can report enough information to identify the current declaration and active closure
  • fruix system reconfigure works without any external checkout
  • the installed Fruix payload can be updated later by a future explicit Fruix-managed workflow rather than by manual bootstrap-era conventions

Phase 4: source / pin / lock / upgrade workflow

Goal: make fruix system upgrade real, not just a thin alias for rebuild.

This phase should come after Phases 1 through 3.

Why upgrade is later

upgrade is not just “run build again.” It implicitly asks:

  • what source identity is current?
  • what source identity is the proposed next state?
  • who updated the pin / lock?
  • how are changes compared and recorded?
  • what exactly does rollback mean across source revisions?

Without that, upgrade risks being vague and operationally surprising.

Desired workflow

  • inspect current pinned Fruix / source state
  • choose updated pin / lock state
  • build new closure from that explicit state
  • switch or deploy it
  • retain enough provenance for rollback and audit

Acceptance criteria

A future fruix system upgrade should:

  1. tell the operator what source state is changing
  2. build from an explicit new pinned state
  3. record enough provenance to explain the transition
  4. preserve normal switch / rollback semantics

Recommended next milestone

The strongest next milestone is:

Self-hosted Fruix node lifecycle works end-to-end.

That means validating all of the following against a real booted self-hosted system:

  1. install it
  2. boot it
  3. run fruix system status
  4. run fruix system reconfigure
  5. run fruix system rollback
  6. inspect generation metadata and current / rollback links
  7. from another machine, run fruix system deploy

This is a better next milestone than upgrade because it makes Fruix feel like an actual managed system first.

Definition of done for this phase

This lifecycle phase can be considered “good for now” when:

  • node-local status / reconfigure / switch / rollback are reliable on a real installed self-hosted node
  • remote deploy is reliable enough for normal manual operator use
  • the default declaration and pinned Fruix ownership model are explicit
  • no bootstrap checkout is required for normal post-install lifecycle
  • upgrade is intentionally deferred until source / pin / lock design is ready

Non-goals for this phase

To keep this phase bounded, do not require all of the following yet:

  • substitutes / binary cache design
  • full publication story
  • final source lock UX
  • advanced deploy orchestration
  • live non-reboot service reconfiguration semantics
  • broad package ecosystem polish

Those can come later.