self/fruix

self 808c280232 Stabilize XCP-ng self-hosted lifecycle validation

2026-04-14 19:00:00 +02:00

Fruix lifecycle plan: node-local management and deploy

This document is a follow-on to docs/plan.md.

The build / jail / assembly / privileged-policy foundation is now in a good state for the currently exercised paths. The next step is to make installed systems feel like real Fruix-managed nodes rather than images that merely boot.

The core recommendation is:

harden node-local lifecycle first
harden deploy second
make the pinned Fruix-on-node contract explicit
only then add a real upgrade workflow

Short version

Near-term priority order:

1. status
2. reconfigure
3. switch
4. rollback
5. deploy
6. pinned Fruix / declaration ownership cleanup
7. upgrade

Reasoning:

status / reconfigure / switch / rollback are what make an installed system feel like a Fruix node
deploy is the next natural operator workflow once node-local switching is trustworthy
upgrade should come later because it is not just an execution feature; it implies source update policy, pinning, provenance comparison, and rollback semantics across revisions

Current state snapshot

There is already meaningful lifecycle scaffolding in-tree.

Node-local commands already exist in the rendered CLI

The in-system fruix CLI usage already includes:

fruix system status
fruix system build
fruix system build-base
fruix system deploy
fruix system installer-tui
fruix system reconfigure
fruix system switch
fruix system rollback

This means the next phase is mostly about hardening, contract cleanup, and validation rather than inventing the entire lifecycle from scratch.

Reconfigure already has the right basic shape

Current node-side reconfigure logic already roughly does:

determine the default declaration when no path is given
invoke the bundled node CLI to run system build
extract closure_path
switch to that closure
report that a reboot is required

That is a good starting shape.
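The steps above can be sketched as follows. This is an illustration only, not the in-tree implementation: the `run` callable, the `key=value` output format, and the argument order passed to `fruix system build` and `fruix system switch` are assumptions. The default declaration path is the closure-local one named elsewhere in this plan.

```python
from typing import Callable, Dict, List, Optional

# Closure-local default declaration path, as named in this plan.
DEFAULT_DECLARATION = "/run/current-system/metadata/system-declaration.scm"


def parse_key_values(output: str) -> Dict[str, str]:
    """Parse `key=value` lines. The output format is an assumption."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def reconfigure(run: Callable[[List[str]], str],
                declaration: Optional[str] = None) -> Dict[str, str]:
    """Build the declaration, switch to its closure, and report the reboot need.

    `run` is a hypothetical helper that executes a command and returns stdout.
    """
    declaration = declaration or DEFAULT_DECLARATION
    # Argument order for build/switch is assumed for illustration.
    build = parse_key_values(run(["fruix", "system", "build", declaration]))
    closure = build.get("closure_path")
    if not closure:
        raise RuntimeError("system build did not report closure_path")
    run(["fruix", "system", "switch", closure])
    return {"closure_path": closure, "reboot_required": "true"}
```

Passing the command runner in as a parameter keeps the flow testable without a booted node, which matches the render-level testing approach described later in this plan.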

Switch / rollback / status already have generation mechanics

Current node-side logic already contains machinery for:

current generation tracking
rollback generation tracking
generation directories
current and rollback closure links
metadata / provenance files per generation
EFI loader update during switch / rollback

Again, the next work is about correctness, UX, and validation.

Deploy already has a real skeleton

Current host-side deploy logic already does:

remote fruix presence check
remote store item existence checks
copy of only missing store items
remote fruix system switch ...
optional reboot
post-deploy remote status collection

That is enough to justify treating deploy as a hardening phase, not as a green field feature.
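A minimal sketch of that deploy sequence, assuming a hypothetical `Remote` transport interface (none of these method names come from Fruix, and the `switch` argument form is assumed):

```python
from typing import Dict, List, Protocol


class Remote(Protocol):
    """Hypothetical transport interface, for illustration only."""
    def has_fruix(self) -> bool: ...
    def has_store_item(self, item: str) -> bool: ...
    def copy_store_item(self, item: str) -> None: ...
    def run(self, argv: List[str]) -> str: ...
    def reboot_and_wait(self) -> None: ...


def deploy(remote: Remote, closure: str, references: List[str],
           reboot: bool = False) -> Dict[str, object]:
    """Copy only missing store items, switch remotely, then collect status."""
    if not remote.has_fruix():
        raise RuntimeError("remote fruix not found")
    # Existence checks come first so that only missing items are transferred.
    missing = [item for item in references if not remote.has_store_item(item)]
    for item in missing:
        remote.copy_store_item(item)
    remote.run(["fruix", "system", "switch", closure])
    if reboot:
        remote.reboot_and_wait()
    status = remote.run(["fruix", "system", "status"])
    return {"copied": missing, "status": status}
```

Returning the list of copied items alongside the final status gives the operator a direct way to confirm that only missing store items were transferred.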

Initial review findings

A first pass over the current node-side lifecycle code in modules/fruix/system/freebsd/render.scm, together with the initial system layout code in modules/fruix/system/freebsd/media.scm, suggests the following.

Good news

the default declaration path is already explicit and closure-local:
- /run/current-system/metadata/system-declaration.scm
- /run/current-system/metadata/system-declaration-info.scm
- /run/current-system/metadata/system-declaration-system
node-local reconfigure already uses the bundled Fruix node CLI rather than an external checkout
generation directories, current generation tracking, rollback tracking, and gcroots already exist
status, switch, and rollback already have simple and readable mechanics

Gaps worth addressing first

1. Generation metadata schema is inconsistent

The initial generation layout written at install time records:

system-declaration-file
system-declaration-info-file
system-declaration-system-file

But the node-side generation metadata written by later switch / reconfigure paths currently does not record those fields.

That should be aligned so generation metadata remains consistent before and after the first node-local transition.
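One way to guard this alignment is a small schema check that could run against both install-time and switch-time metadata. The sketch below assumes the metadata has already been loaded into a dictionary. The loader and the dictionary shape are hypothetical. The field names are the three listed above.

```python
from typing import Dict, List

# Declaration fields recorded by the initial install-time generation layout.
DECLARATION_FIELDS = (
    "system-declaration-file",
    "system-declaration-info-file",
    "system-declaration-system-file",
)


def missing_declaration_fields(metadata: Dict[str, str]) -> List[str]:
    """Return the declaration fields absent from one generation's metadata."""
    return [field for field in DECLARATION_FIELDS if field not in metadata]
```

An empty result for every generation, regardless of which path wrote it, is the consistency property this gap is about.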

2. `status` is still a bit too sparse

Current status reports:

current generation
current link
current closure
rollback generation
rollback link
rollback closure

That is a good minimum, but it does not yet surface:

declaration identity
generation metadata path
whether current / rollback metadata files are present
pinned Fruix/node payload identity

3. Switch / rollback ordering should be reviewed deliberately

The current logic is straightforward, but it updates:

generation directories
rollback links/files
current links/files
gcroots
/run/current-system
EFI loader state

in a stepwise order.

That may be acceptable for now, but it should be reviewed as an intentional ordering with known invariants and failure expectations, not just as whatever sequence happened to emerge first.

4. The declaration ownership model is explicit, but still not fully operator-facing

The current implementation already points the default declaration at the active closure metadata, which is good for self-containment.

However, Fruix still needs to make clearer whether the intended operator model is:

reconfigure from the immutable current declaration by default
pass an explicit new declaration path when changing configuration
later maintain a node-local editable declaration source of truth

Immediate Phase 1 task list

Completed:

aligned switched-generation metadata and provenance with the installed generation schema for declaration metadata fields
enriched fruix system status with declaration and generation identity fields
added a focused render-level lifecycle test covering those outputs
added a manual remote integration harness:
- tests/run-self-hosted-node-lifecycle.sh for exercising status / reconfigure / rollback against a real booted self-hosted node over SSH
added an XCP-ng bring-up harness:
- tests/run-self-hosted-node-lifecycle-xcpng.sh that reuses the established test VM path to build/import/boot a self-hosted image and then run the SSH lifecycle harness
got the XCP-ng lifecycle path green again after tightening several real-host assumptions:
- keep the XCP-ng guest on ifconfig_xn0=SYNCDHCP
- use a root/dev-only filesystem layout in the harness-generated guest so a temporary-host store path is not hidden by guest tmpfs policy
- disable clear_tmp in that harness so closure/store paths staged under /tmp survive boot
- use a fresh per-run host store in the harness to avoid dragging unrelated historical store items into the guest image
- prepare a narrow node-local reuse store for reconfigure/rollback so the guest can reuse the package stores already staged from the host
- harden the staged runtime for live-node reuse by:
  - removing the materialized Guile/shepherd host-prefix leakage
  - fixing root home ownership for SSH strict-modes
  - avoiding external cmp/tail assumptions in node-local operations
  - seeding /usr/src/sys/sys from the staged profile for the transitional freebsd-kernel-headers path

Remaining:

execute the real-node validation harness against a booted self-hosted node and record the outcome
decide whether generation-local install.scm should keep its current deployment-oriented schema or move closer to the initial install-generation shape

Reviewed switch / rollback ordering invariants

The current implementation should be treated as intentionally following these rules:

prepare the new generation directory and metadata before changing any current or rollback pointers
record rollback pointers from the previously current generation before moving current pointers to the new generation
update generation links / files and gcroots before changing /run/current-system
update EFI loader state after the current closure link has moved

That gives Fruix a simple current contract:

generation data exists before it becomes current
rollback points at the previously current generation
/run/current-system moves only after generation metadata and gcroots are in place

Known caveat for later refinement:

EFI loader update happens after current-state links move, so a failure during loader copy could leave the running/current generation advanced while boot media state still reflects the previous loader payload

That is acceptable for now, but it should remain an explicit reviewed tradeoff.
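The ordering rules can be illustrated with a small filesystem sketch. The directory layout under `root` (`state/generations`, `state/current`, `state/rollback`, `state/gcroots`) and the `update_loader` callback are assumptions made for the example and do not mirror the real on-disk layout. It assumes a POSIX system, where renaming a symlink over an existing one replaces it in a single step.

```python
import os
from pathlib import Path
from typing import Callable


def replace_symlink(link: Path, target: str) -> None:
    """Repoint `link` by creating a temporary symlink and renaming it over."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    os.replace(tmp, link)


def switch(root: Path, number: int, closure: str,
           update_loader: Callable[[str], None]) -> None:
    """Apply the reviewed ordering; the layout below is hypothetical."""
    state = root / "state"
    gen_dir = state / "generations" / str(number)
    # 1. Generation directory and metadata exist before any pointer moves.
    gen_dir.mkdir(parents=True)
    (gen_dir / "closure").write_text(closure + "\n")
    # 2. Rollback records the previously current generation.
    current = state / "current"
    if current.is_symlink():
        replace_symlink(state / "rollback", os.readlink(current))
    # 3. Generation links and gcroots move before /run/current-system.
    replace_symlink(current, str(gen_dir))
    (state / "gcroots").mkdir(exist_ok=True)
    replace_symlink(state / "gcroots" / "current-system", closure)
    # 4. /run/current-system moves only after metadata and gcroots are in place.
    (root / "run").mkdir(exist_ok=True)
    replace_symlink(root / "run" / "current-system", closure)
    # 5. Loader state is updated last, which is the known caveat above.
    update_loader(closure)
```

If `update_loader` raises, the links written in steps 1 through 4 stay in place, which reproduces the caveat: the current generation has advanced while the loader still reflects the previous payload.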

Phase 1: node-local lifecycle hardening

Goal: an installed self-hosted Fruix node can manage itself from its own pinned Fruix payload, without relying on a bootstrap checkout or ambient host paths.

This is the most important near-term product milestone.

Scope

Focus on:

fruix system status
fruix system reconfigure
fruix system switch
fruix system rollback

Questions to settle

1. What exactly is the default declaration?

When the operator runs fruix system reconfigure with no explicit declaration, Fruix should have a crisp answer for:

where the current declaration lives
where the current system name lives
which Fruix payload on disk owns that declaration
whether local edits to the declaration are the intended workflow

This should be explicit, not inferred from bootstrap-era conventions.

2. What is the reboot contract?

Right now reconfigure reports reboot_required=true.

That is acceptable, but Fruix should decide and document whether:

every switch requires a reboot
some changes can be activated live and others require a reboot
switch means “prepare next boot” or “change active system state now”

For now, a simple and conservative policy is fine, but it should be explicit.

3. What makes a generation valid?

Fruix should clearly define the invariants for a system generation:

closure exists
activation script exists
shepherd configuration exists
boot artifacts exist
generation metadata and provenance are written
loader / current / rollback links are updated in a defined order, with known behavior if a step fails
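The existence checks in this list lend themselves to a single validation pass that runs before a closure is allowed to become current. A minimal sketch follows. The relative paths inside the closure are hypothetical placeholders, apart from the `metadata` directory, which this plan names under /run/current-system.

```python
from pathlib import Path
from typing import List

# Hypothetical closure-relative paths, for illustration only.
REQUIRED_PATHS = {
    "activation script": "activate",
    "shepherd configuration": "shepherd/init.scm",
    "boot artifacts": "boot",
    "closure metadata": "metadata",
}


def generation_problems(closure: Path) -> List[str]:
    """Return one message per violated invariant (empty list means valid)."""
    if not closure.is_dir():
        return [f"closure missing: {closure}"]
    return [
        f"{label} missing: {closure / relative}"
        for label, relative in REQUIRED_PATHS.items()
        if not (closure / relative).exists()
    ]
```

Reporting every problem at once, rather than stopping at the first, tends to give better error messages for the "missing declaration / missing generation links" cases listed in the implementation tasks.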

4. What is rollback allowed to assume?

Rollback should be very boring and predictable.

It should clearly state:

what is recorded as the previous generation
when rollback metadata is updated
what happens if the previous generation closure is missing
whether rollback always updates EFI loader state
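To make these questions concrete, here is one possible answer expressed as a pure function. It assumes that rollback swaps the current and rollback generations and refuses to proceed when the recorded closure is missing. Both choices are assumptions for illustration, not settled Fruix behavior, and the `state` dictionary shape is hypothetical.

```python
from typing import Callable, Dict, Optional

Generation = Dict[str, str]  # hypothetical shape: {"number": ..., "closure": ...}
State = Dict[str, Optional[Generation]]


def rollback(state: State, closure_exists: Callable[[str], bool]) -> State:
    """Return the new state after a rollback, or raise with a clear reason."""
    previous = state.get("rollback")
    if previous is None:
        raise RuntimeError("no rollback generation is recorded")
    if not closure_exists(previous["closure"]):
        raise RuntimeError(f"rollback closure is missing: {previous['closure']}")
    # Assumed policy: the generation being left becomes the new rollback target.
    return {"current": previous, "rollback": state.get("current")}
```

Under this policy a second rollback returns to the generation that was just left, which is simple to explain but is only one of the options the questions above leave open.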

Acceptance criteria

A self-hosted installed system should be able to:

boot successfully
run fruix system status and show coherent current / rollback information
run fruix system reconfigure using its default declaration
create a new generation with updated metadata
run fruix system rollback
report the earlier generation again via status

Recommended implementation tasks

make the default declaration / system-name ownership explicit
tighten error messages around missing declaration / missing generation links
verify generation metadata fields are complete and stable
review link update ordering for current / rollback / gcroots / /run/current-system
decide whether any metadata should include the pinned Fruix revision or node CLI identity
add a small test plan for generation creation and rollback behavior

Likely files

modules/fruix/system/freebsd/render.scm
modules/fruix/system/freebsd/media.scm
scripts/fruix.scm
examples/system/self-hosted-dev.scm

Phase 2: deploy hardening

Goal: build once on one Fruix-capable machine, then deploy closures to another Fruix node with predictable switching and verification.

Scope

Focus on:

fruix system deploy
remote fruix system switch
optional reboot and post-reboot verification

Questions to settle

1. What are the remote preflight requirements?

Before deploy, Fruix should clearly check:

remote /usr/local/bin/fruix exists
remote store directory exists and is writable for store transfers
remote host has the expected node runtime payload
remote node can perform system switch
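A sketch of how such preflight checks might be collected and reported together, so the operator sees every failure in one pass. The check callables are hypothetical placeholders and would wrap whatever remote probes the deploy code actually uses.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_preflight(checks: List[Check]) -> List[str]:
    """Run every check and return one message per failure (empty means ready)."""
    failures = []
    for description, check in checks:
        try:
            ok = check()
        except OSError as error:
            # A probe that cannot run at all is reported as a failed check.
            ok = False
            description = f"{description} ({error})"
        if not ok:
            failures.append(f"preflight failed: {description}")
    return failures
```

Each entry would pair an operator-readable description, such as "remote /usr/local/bin/fruix exists", with the probe that verifies it. Deploy would proceed only when the returned list is empty.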

2. What is the transfer contract?

Current behavior copies missing store items only. That is good.

This should remain explicit:

which references are transferred
which are skipped because they already exist remotely
whether closure completeness is validated before switching

3. What is success?

A successful deploy should ideally mean:

the remote switch command completed
the intended closure is now current on the target
optional reboot completed when requested
remote status confirms the expected current generation / closure

Acceptance criteria

From one Fruix-capable machine, an operator should be able to:

build a closure locally
deploy it to a second Fruix node
confirm only missing store items were transferred
switch the remote node to the new closure
optionally reboot the remote node
confirm remote status reports the expected closure after reboot

Recommended implementation tasks

make remote preflight failures more explicit and operator-friendly
ensure deploy metadata is complete enough for troubleshooting
verify closure identity before and after remote switch
verify reboot / reconnect behavior is reliable enough for normal use
decide whether deploy should gain a --no-switch or --copy-only mode later or whether that is unnecessary complexity for now

Likely files

scripts/fruix.scm
modules/fruix/system/freebsd/render.scm

Phase 3: make the installed Fruix contract explicit

Goal: an installed system should clearly contain and own the Fruix payload it uses for self-management.

This is where the fruix vs fruix-bootstrap boundary becomes operational, not just architectural.

Desired contract

After installation and first boot:

the machine can manage itself using Fruix already present on disk
node-local commands do not depend on a checkout outside the node
the installed declaration source of truth is explicit
the node knows which Fruix payload / revision it is running against

Questions to settle

where should the installed declaration live long-term?
should the node preserve an editable declaration path distinct from the immutable closure metadata?
how should the pinned Fruix revision be recorded and surfaced?
how much of bootstrap-era installer state should remain after first boot?

Acceptance criteria

On an installed node:

fruix system status can report enough information to identify the current declaration and active closure
fruix system reconfigure works without any external checkout
the installed Fruix payload can be updated later by a future explicit Fruix-managed workflow rather than by manual bootstrap-era conventions

Phase 4: source / pin / lock / upgrade workflow

Goal: make fruix system upgrade real, not just a thin alias for rebuild.

This phase should come after Phases 1 through 3.

Why upgrade is later

upgrade is not just “run build again.” It implicitly asks:

what source identity is current?
what source identity is the proposed next state?
who updated the pin / lock?
how are changes compared and recorded?
what exactly does rollback mean across source revisions?

Without that, upgrade risks being vague and operationally surprising.

Desired workflow

inspect current pinned Fruix / source state
choose updated pin / lock state
build new closure from that explicit state
switch or deploy it
retain enough provenance for rollback and audit
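As a rough illustration of what "explicit state" and "enough provenance" could mean, the sketch below records a source transition as data. The `SourcePin` fields and the record layout are purely hypothetical. The actual pin / lock design is deferred to this phase.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class SourcePin:
    """Hypothetical source identity: where the source lives and which revision."""
    url: str
    revision: str


def describe_upgrade(current: SourcePin, proposed: SourcePin) -> Dict[str, object]:
    """Build a provenance record that explains the proposed transition."""
    return {
        "from": asdict(current),
        "to": asdict(proposed),
        "changed": current != proposed,
    }
```

A record of this kind could be shown to the operator before the build and stored with the resulting generation, so that rollback across source revisions has something concrete to refer to.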

Acceptance criteria

A future fruix system upgrade should:

tell the operator what source state is changing
build from an explicit new pinned state
record enough provenance to explain the transition
preserve normal switch / rollback semantics

Recommended next concrete milestone

The strongest next milestone is:

Self-hosted Fruix node lifecycle works end-to-end.

That means validating all of the following against a real booted self-hosted system:

install it
boot it
run fruix system status
run fruix system reconfigure
run fruix system rollback
inspect generation metadata and current / rollback links
from another machine, run fruix system deploy

This is a better next milestone than upgrade because it makes Fruix feel like an actual managed system first.

Definition of done for this phase

This lifecycle phase can be considered “good for now” when:

node-local status / reconfigure / switch / rollback are reliable on a real installed self-hosted node
remote deploy is reliable enough for normal manual operator use
the default declaration and pinned Fruix ownership model are explicit
no bootstrap checkout is required for normal post-install lifecycle
upgrade is intentionally deferred until source / pin / lock design is ready

Non-goals for this phase

To keep this phase bounded, do not require all of the following yet:

substitutes / binary cache design
full publication story
final source lock UX
advanced deploy orchestration
live non-reboot service reconfiguration semantics
broad package ecosystem polish

Those can come later.
