Validate Shepherd PID 1 boot on XCP-ng

2026-04-02 13:44:45 +02:00
parent f5ffd111ee
commit 377a6e49ff
4 changed files with 388 additions and 0 deletions

@@ -2486,3 +2486,58 @@ Next recommended step:
1. try the `shepherd-pid1` image on the real XCP-ng VM
2. if it boots there too, decide whether to keep `shepherd-pid1` as an experimental selectable boot mode or advance it further toward the main Fruix boot path
3. continue reducing the remaining Guile / Shepherd compatibility-prefix shims now that the broader `rc.d` boot-manager dependency has been locally bypassed
## 2026-04-02 — Post-Phase-10: Shepherd-as-PID-1 boot also passed on the real XCP-ng VM
Completed work:
- took the locally validated `shepherd-pid1` boot mode and tested it on the real XCP-ng deployment path
- wrote the follow-up report:
  - `docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md`
- expanded the Shepherd-PID-1 operating-system template so the generated guest remains compatible with both local virtio and the real Xen NIC path:
  - `tests/system/phase11-shepherd-pid1-operating-system.scm.in`
  - now includes:
    - `ifconfig_xn0=SYNCDHCP`
    - `ifconfig_em0=SYNCDHCP`
    - `ifconfig_vtnet0=SYNCDHCP`
- added a dedicated real-VM Shepherd-PID-1 deployment/validation harness:
  - `tests/system/run-phase11-shepherd-pid1-xcpng.sh`
Validation:
- `tests/system/run-phase11-shepherd-pid1-xcpng.sh` now passes on the operator-approved VM and existing VDI:
  - VM `90490f2e-e8fc-4b7a-388e-5c26f0157289`
  - VDI `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`
- passing run workdir:
  - `/tmp/pid1-xcpng-1775129768`
- passing real-guest metadata confirmed:
  - `ready_marker=ready`
  - `run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd`
  - `pid1_command=[guile]`
  - `shepherd_pid=1`
  - `shepherd_socket=present`
  - `shepherd_status=running`
  - `sshd_status=running`
  - `init_mode=shepherd-pid1`
Important findings:
- the local QEMU PID 1 prototype was not an emulator-only artifact; the same general boot design also works on the real XCP-ng/Xen guest
- as expected for a Guile-script entry point, the PID 1 process image shows up as Guile, but the meaningful architectural check is that:
  - `/var/run/shepherd.pid` contains `1`
- this means Fruix has now validated two distinct real-VM boot architectures on FreeBSD:
  - `freebsd-init+rc.d-shepherd`
  - `shepherd-pid1`
- however, this still does not remove the current Guile / Shepherd compatibility-prefix shims; those remain a separate runtime-artifact issue rather than an init-manager issue
Current assessment:
- Shepherd-as-PID-1 is no longer merely a local prototype; it is now validated on the real XCP-ng VM as well
- this significantly strengthens the path toward a more Guix-like Fruix system architecture on FreeBSD
- the main remaining native-runtime gap is now the baked-prefix / compatibility-shim problem, not whether Fruix can boot with Shepherd as PID 1
Next recommended step:
1. focus directly on eliminating the remaining Guile / Shepherd compatibility-prefix shims from the guest runtime
2. preserve `shepherd-pid1` as an experimental selectable boot mode while that cleanup proceeds
3. once the runtime-prefix issue is reduced, reassess whether `shepherd-pid1` should replace the older `freebsd-init+rc.d-shepherd` path as the preferred Fruix boot architecture

@@ -0,0 +1,114 @@
# Post-Phase-10: Shepherd-as-PID-1 boot validated on the real XCP-ng FreeBSD VM
Date: 2026-04-02
## Goal
Take the locally validated Shepherd-as-PID-1 Fruix boot prototype and test it on the real operator-approved XCP-ng VM.
The target objects were the same ones used on the constrained Phase 9 deployment path:
- VM: `90490f2e-e8fc-4b7a-388e-5c26f0157289`
- VDI: `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`
The concrete goal for this subphase was to confirm that the new `shepherd-pid1` init mode was not merely a local QEMU curiosity, but could also:
- boot on the real Xen guest,
- reach DHCP and SSH,
- keep Shepherd running as PID 1,
- and still reach the Fruix ready marker.
## Result
The real XCP-ng boot succeeded.
A new deployment/validation harness was added:
- `tests/system/run-phase11-shepherd-pid1-xcpng.sh`
This harness reuses the existing real-VM deployment method:
- build a full-size image matching the existing VDI
- convert it to dynamic VHD
- overwrite the existing VDI
- boot the real VM
- rediscover the guest by MAC/IP
- validate the booted guest over SSH
The new Shepherd-PID-1 image passes that full path.
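The step ordering above can be sketched as a small fail-fast driver. This is only an illustration of the sequencing, not the contents of `run-phase11-shepherd-pid1-xcpng.sh`: every `step_*` function is a hypothetical stub, and the real image-build, VHD-conversion, and XCP-ng commands are deliberately not reproduced.

```sh
#!/bin/sh
# Sketch of the deployment/validation ordering only.
# All step_* functions are hypothetical stubs standing in for the real work.

step_build_image()    { echo "build full-size image matching the existing VDI"; }
step_convert_vhd()    { echo "convert image to dynamic VHD"; }
step_overwrite_vdi()  { echo "overwrite the existing VDI"; }
step_boot_vm()        { echo "boot the real VM"; }
step_discover_guest() { echo "rediscover the guest by MAC/IP"; }
step_validate_ssh()   { echo "validate the booted guest over SSH"; }

# Run each step in order and stop at the first failure.
run_pipeline() {
    for step in step_build_image step_convert_vhd step_overwrite_vdi \
                step_boot_vm step_discover_guest step_validate_ssh
    do
        if ! "$step"; then
            echo "FAIL $step" >&2
            return 1
        fi
    done
    echo "PASS phase11-shepherd-pid1-xcpng"
}
```

Calling `run_pipeline` prints one line per step and a final `PASS` line only when every step returned success, so a failure at any stage (for example, the guest never appearing on the network) is reported against the step that failed.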
## Validation
Passing real-VM run:
- `PASS phase11-shepherd-pid1-xcpng`
- workdir: `/tmp/pid1-xcpng-1775129768`
Validated metadata from the real guest:
```text
ready_marker=ready
run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd
pid1_command=[guile]
shepherd_pid=1
shepherd_socket=present
shepherd_status=running
sshd_status=running
guest_ip=192.168.213.62
boot_backend=xcp-ng-xo-cli
init_mode=shepherd-pid1
```
The key architectural confirmation is:
- `shepherd_pid=1`
That shows the running Shepherd instance in the real guest is PID 1.
As in the local QEMU prototype, the PID 1 process image is Guile because Shepherd is launched as a Guile script. The service manager itself is nevertheless the PID 1 process, according to Shepherd's own pidfile and control-socket state.
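A check of this kind can be sketched as a small shell helper that reads `key=value` metadata and compares it with the values a passing run is expected to report. The function names and the idea of a single metadata file are assumptions made for illustration; the actual harness may collect and compare these values differently.

```sh
#!/bin/sh
# Sketch: validate key=value guest metadata against expected values.
# metadata_get and validate_metadata are hypothetical helper names.

# Print the value recorded for a key ($2) in a key=value file ($1).
metadata_get() {
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# Return 0 only if every expected key carries the expected value.
validate_metadata() {
    file=$1
    status=0
    for pair in \
        ready_marker=ready \
        shepherd_pid=1 \
        shepherd_socket=present \
        shepherd_status=running \
        sshd_status=running \
        init_mode=shepherd-pid1
    do
        key=${pair%%=*}    # text before the first '='
        want=${pair#*=}    # text after the first '='
        got=$(metadata_get "$file" "$key")
        if [ "$got" != "$want" ]; then
            echo "FAIL $key: expected '$want', got '$got'" >&2
            status=1
        fi
    done
    return $status
}
```

The sketch deliberately does not assert on `pid1_command`: as noted above, the process image is expected to be Guile, so `shepherd_pid=1` is the meaningful check.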
## What changed to make the real VM pass
The most important refinement after the first local PID 1 work was making the generated activation path more tolerant of immutable store-backed configuration files during very early boot.
Specifically, the generated activation script now treats these as best-effort:
- `cap_mkdb /etc/login.conf`
- `pwd_mkdb -p /etc/master.passwd`
That matters because on the PID 1 path these commands run earlier in boot, and a failure should not abort startup when the current `/etc` representation is not suitable for in-place database regeneration.
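One way to express "best-effort" in a shell activation script is a wrapper that logs a failure and still returns success. The sketch below is an assumption about the general pattern, not the generated script itself; `best_effort` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: run a command, warn if it fails, and always return 0 so that
# a caller running under `set -e` is not aborted by the failure.
best_effort() {
    if ! "$@"; then
        echo "warning: '$*' failed; continuing" >&2
    fi
    return 0
}

# Hypothetical use in an activation script:
#   best_effort cap_mkdb /etc/login.conf
#   best_effort pwd_mkdb -p /etc/master.passwd
```

The trade-off is that a genuinely broken login or password database is reported only as a warning, so later validation (such as the SSH check) has to catch any resulting problem.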
The Shepherd-PID-1 operating-system template was also expanded to keep the NIC configuration broad enough for both local virtio and the real Xen path:
- `ifconfig_xn0=SYNCDHCP`
- `ifconfig_em0=SYNCDHCP`
- `ifconfig_vtnet0=SYNCDHCP`
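Written as `rc.conf`-style assignments, the three settings would look roughly like the fragment below. This assumes the template ultimately emits standard FreeBSD `ifconfig_<interface>` variables; the driver annotations are added here for orientation and are not taken from the template.

```sh
# rc.conf-style fragment (sketch). Only the interface that actually exists
# in a given guest is relevant; entries for absent interfaces are unused.
ifconfig_xn0="SYNCDHCP"     # Xen paravirtualized NIC (real XCP-ng guest)
ifconfig_em0="SYNCDHCP"     # emulated Intel NIC
ifconfig_vtnet0="SYNCDHCP"  # virtio NIC (local QEMU guest)
```

Listing all three keeps a single image usable in both environments, at the cost of carrying entries that are unused in any one guest.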
## Assessment
This is a stronger result than the earlier local-only prototype.
Fruix now has a real deployment-validated FreeBSD boot mode where:
- FreeBSD `init(8)` hands off immediately via `init_exec`
- the generated Fruix launcher performs the minimal bootstrap
- Shepherd becomes PID 1
- networking and SSH still work on the real XCP-ng VM
- and the system still reaches the Fruix ready marker
That means the project has now validated both of these boot architectures on the real VM:
1. `freebsd-init+rc.d-shepherd`
2. `shepherd-pid1`
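For the `shepherd-pid1` mode, the `init_exec` handoff relies on the kernel environment variable of that name described in FreeBSD's `init(8)`, which is typically set from `/boot/loader.conf`. The fragment below is a minimal sketch; the launcher path is a hypothetical placeholder, since the real path generated by Fruix is not shown in this report.

```sh
# /boot/loader.conf fragment (sketch).
# init_exec tells init(8) to exec the named file as its first action,
# so that file replaces init as PID 1.
# The path below is a hypothetical placeholder, not the real Fruix launcher.
init_exec="/path/to/fruix-shepherd-pid1-launcher"
```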
## Remaining limitations
This does not yet eliminate the current locally built Guile/Shepherd compatibility-prefix shims.
Those shims are still needed because the locally staged runtime artifacts continue to embed historical build prefixes. The current result proves that the broader init/boot-manager dependency can be removed, but it does not yet fully solve the store-native runtime-prefix problem.
## Conclusion
The Shepherd-as-PID-1 Fruix boot mode now works not only under local QEMU/UEFI, but also on the real operator-approved XCP-ng VM.
This substantially strengthens the case that Fruix can move beyond the transitional `rc.d` bridge design and toward a more Guix-like PID-1-centered system architecture on FreeBSD.