diff --git a/docs/PROGRESS.md b/docs/PROGRESS.md
index d608d1f..4989ce3 100644
--- a/docs/PROGRESS.md
+++ b/docs/PROGRESS.md
@@ -2486,3 +2486,58 @@ Next recommended step:
 1. try the `shepherd-pid1` image on the real XCP-ng VM
 2. if it boots there too, decide whether to keep `shepherd-pid1` as an experimental selectable boot mode or advance it further toward the main Fruix boot path
 3. continue reducing the remaining Guile / Shepherd compatibility-prefix shims now that the broader `rc.d` boot-manager dependency has been locally bypassed
+
+## 2026-04-02 — Post-Phase-10: Shepherd-as-PID-1 boot also passed on the real XCP-ng VM
+
+Completed work:
+
+- took the locally validated `shepherd-pid1` boot mode and tested it on the real XCP-ng deployment path
+- wrote the follow-up report:
+  - `docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md`
+- expanded the Shepherd-PID-1 operating-system template so the generated guest remains compatible with both local virtio and the real Xen NIC path:
+  - `tests/system/phase11-shepherd-pid1-operating-system.scm.in`
+  - now includes:
+    - `ifconfig_xn0=SYNCDHCP`
+    - `ifconfig_em0=SYNCDHCP`
+    - `ifconfig_vtnet0=SYNCDHCP`
+- added a dedicated real-VM Shepherd-PID-1 deployment/validation harness:
+  - `tests/system/run-phase11-shepherd-pid1-xcpng.sh`
+
+Validation:
+
+- `tests/system/run-phase11-shepherd-pid1-xcpng.sh` now passes on the operator-approved VM and existing VDI:
+  - VM `90490f2e-e8fc-4b7a-388e-5c26f0157289`
+  - VDI `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`
+- passing run workdir:
+  - `/tmp/pid1-xcpng-1775129768`
+- passing real-guest metadata confirmed:
+  - `ready_marker=ready`
+  - `run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd`
+  - `pid1_command=[guile]`
+  - `shepherd_pid=1`
+  - `shepherd_socket=present`
+  - `shepherd_status=running`
+  - `sshd_status=running`
+  - `init_mode=shepherd-pid1`
+
+Important findings:
+
+- the local QEMU PID 1 prototype was not a simulator-only artifact; the same general boot design also works on the real XCP-ng/Xen guest
+- as expected for a Guile-script entry point, the PID 1 process image shows up as Guile, but the meaningful architectural check is that:
+  - `/var/run/shepherd.pid` contains `1`
+- this means Fruix has now validated two distinct real-VM boot architectures on FreeBSD:
+  - `freebsd-init+rc.d-shepherd`
+  - `shepherd-pid1`
+- however, this still does not remove the current Guile / Shepherd compatibility-prefix shims; those remain a separate runtime-artifact issue rather than an init-manager issue
+
+Current assessment:
+
+- Shepherd-as-PID-1 is now no longer merely a local prototype; it is validated on the real XCP-ng VM as well
+- this significantly strengthens the path toward a more Guix-like Fruix system architecture on FreeBSD
+- the main remaining native-runtime gap is now the baked-prefix / compatibility-shim problem, not whether Fruix can boot with Shepherd as PID 1
+
+Next recommended step:
+
+1. focus directly on eliminating the remaining Guile / Shepherd compatibility-prefix shims from the guest runtime
+2. preserve `shepherd-pid1` as an experimental selectable boot mode while that cleanup proceeds
+3. once the runtime-prefix issue is reduced, reassess whether `shepherd-pid1` should replace the older `freebsd-init+rc.d-shepherd` path as the preferred Fruix boot architecture
diff --git a/docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md b/docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md
new file mode 100644
index 0000000..5109e2c
--- /dev/null
+++ b/docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md
@@ -0,0 +1,114 @@
+# Post-Phase-10: Shepherd-as-PID-1 boot validated on the real XCP-ng FreeBSD VM
+
+Date: 2026-04-02
+
+## Goal
+
+Take the locally validated Shepherd-as-PID-1 Fruix boot prototype and test it on the real operator-approved XCP-ng VM.
+
+Target objects remained the same constrained deployment path used for Phase 9:
+
+- VM: `90490f2e-e8fc-4b7a-388e-5c26f0157289`
+- VDI: `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`
+
+The concrete goal for this subphase was to confirm that the new `shepherd-pid1` init mode was not merely a local QEMU curiosity, but could also:
+
+- boot on the real Xen guest,
+- reach DHCP and SSH,
+- keep Shepherd running as PID 1,
+- and still reach the Fruix ready marker.
+
+## Result
+
+The real XCP-ng boot succeeded.
+
+A new deployment/validation harness was added:
+
+- `tests/system/run-phase11-shepherd-pid1-xcpng.sh`
+
+This harness reuses the existing real-VM deployment method:
+
+- build a full-size image matching the existing VDI
+- convert it to dynamic VHD
+- overwrite the existing VDI
+- boot the real VM
+- rediscover the guest by MAC/IP
+- validate the booted guest over SSH
+
+The new Shepherd-PID-1 image passes that full path.
+
+## Validation
+
+Passing real-VM run:
+
+- `PASS phase11-shepherd-pid1-xcpng`
+- workdir: `/tmp/pid1-xcpng-1775129768`
+
+Validated metadata from the real guest:
+
+```text
+ready_marker=ready
+run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd
+pid1_command=[guile]
+shepherd_pid=1
+shepherd_socket=present
+shepherd_status=running
+sshd_status=running
+guest_ip=192.168.213.62
+boot_backend=xcp-ng-xo-cli
+init_mode=shepherd-pid1
+```
+
+The key architectural confirmation is:
+
+- `shepherd_pid=1`
+
+That shows the running Shepherd instance in the real guest is PID 1.
+
+As in the local QEMU prototype, the process image is Guile because Shepherd is launched as a Guile script; however, the service manager itself is the PID 1 process according to Shepherd's own pidfile and control socket state.
+
+## What changed to make the real VM pass
+
+The most important refinement after the first local PID 1 work was making the generated activation path more tolerant of immutable store-backed configuration files during very early boot.
+
+Specifically, the generated activation script now treats these as best-effort:
+
+- `cap_mkdb /etc/login.conf`
+- `pwd_mkdb -p /etc/master.passwd`
+
+That matters because on the PID 1 path they happen earlier and should not abort the system if the current `/etc` representation is not suitable for in-place database regeneration.
+
+The Shepherd-PID-1 operating-system template was also expanded to keep the NIC configuration broad enough for both local virtio and the real Xen path:
+
+- `ifconfig_xn0=SYNCDHCP`
+- `ifconfig_em0=SYNCDHCP`
+- `ifconfig_vtnet0=SYNCDHCP`
+
+## Assessment
+
+This is a stronger result than the earlier local-only prototype.
+
+Fruix now has a real deployment-validated FreeBSD boot mode where:
+
+- FreeBSD `init(8)` hands off immediately via `init_exec`
+- the generated Fruix launcher performs the minimal bootstrap
+- Shepherd becomes PID 1
+- networking and SSH still work on the real XCP-ng VM
+- and the system still reaches the Fruix ready marker
+
+That means the project has now validated both of these boot architectures on the real VM:
+
+1. `freebsd-init+rc.d-shepherd`
+2. `shepherd-pid1`
+
+## Remaining limitations
+
+This does not yet eliminate the current locally built Guile/Shepherd compatibility-prefix shims.
+
+Those shims are still needed because the locally staged runtime artifacts continue to embed historical build prefixes. The current result proves that the broader init/boot-manager dependency can be removed, but it does not yet fully solve the store-native runtime-prefix problem.
+
+## Conclusion
+
+The Shepherd-as-PID-1 Fruix boot mode now works not only under local QEMU/UEFI, but also on the real operator-approved XCP-ng VM.
+
+This substantially strengthens the case that Fruix can move beyond the transitional `rc.d` bridge design and toward a more Guix-like PID-1-centered system architecture on FreeBSD.
diff --git a/tests/system/phase11-shepherd-pid1-operating-system.scm.in b/tests/system/phase11-shepherd-pid1-operating-system.scm.in
index 65ca18c..aefa371 100644
--- a/tests/system/phase11-shepherd-pid1-operating-system.scm.in
+++ b/tests/system/phase11-shepherd-pid1-operating-system.scm.in
@@ -70,6 +70,8 @@
     ("hostid_enable" . "NO")
     ("sendmail_enable" . "NONE")
     ("sshd_enable" . "YES")
+    ("ifconfig_xn0" . "SYNCDHCP")
+    ("ifconfig_em0" . "SYNCDHCP")
     ("ifconfig_vtnet0" . "SYNCDHCP"))
   #:init-mode 'shepherd-pid1
   #:ready-marker "/var/lib/fruix/ready"
diff --git a/tests/system/run-phase11-shepherd-pid1-xcpng.sh b/tests/system/run-phase11-shepherd-pid1-xcpng.sh
new file mode 100755
index 0000000..bc2c9c4
--- /dev/null
+++ b/tests/system/run-phase11-shepherd-pid1-xcpng.sh
@@ -0,0 +1,217 @@
+#!/bin/sh
+set -eu
+
+repo_root=$(CDPATH= cd -- "$(dirname "$0")/../.." && pwd)
+vm_id=90490f2e-e8fc-4b7a-388e-5c26f0157289
+os_template=${OS_TEMPLATE:-$repo_root/tests/system/phase11-shepherd-pid1-operating-system.scm.in}
+system_name=${SYSTEM_NAME:-phase11-operating-system}
+metadata_target=${METADATA_OUT:-}
+root_authorized_key_file=${ROOT_AUTHORIZED_KEY_FILE:-$HOME/.ssh/id_ed25519.pub}
+root_ssh_private_key_file=${ROOT_SSH_PRIVATE_KEY_FILE:-$HOME/.ssh/id_ed25519}
+requested_disk_capacity=${DISK_CAPACITY:-}
+
+cleanup=0
+if [ -n "${WORKDIR:-}" ]; then
+    workdir=$WORKDIR
+    mkdir -p "$workdir"
+else
+    workdir=$(mktemp -d /tmp/fruix-phase11-xcpng.XXXXXX)
+    cleanup=1
+fi
+if [ "${KEEP_WORKDIR:-0}" -eq 1 ]; then
+    cleanup=0
+fi
+
+phase11_os_file=$workdir/phase11-shepherd-pid1-operating-system.scm
+phase8_log=$workdir/phase8-system-image.log
+phase8_metadata=$workdir/phase8-system-image-metadata.txt
+arp_scan_log=$workdir/arp-scan.log
+ssh_stdout=$workdir/ssh.out
+ssh_stderr=$workdir/ssh.err
+metadata_file=$workdir/phase11-shepherd-pid1-xcpng-metadata.txt
+vdi_info_json=$workdir/vdi-info.json
+vm_info_json=$workdir/vm-info.json
+upload_image=$workdir/disk.vhd
+
+cleanup_workdir() {
+    if [ "$cleanup" -eq 1 ]; then
+        rm -rf "$workdir"
+    fi
+}
+trap cleanup_workdir EXIT INT TERM
+
+[ -f "$root_authorized_key_file" ] || {
+    echo "missing root authorized key file: $root_authorized_key_file" >&2
+    exit 1
+}
+[ -f "$root_ssh_private_key_file" ] || {
+    echo "missing root SSH private key file: $root_ssh_private_key_file" >&2
+    exit 1
+}
+root_authorized_key=$(tr -d '\n' < "$root_authorized_key_file")
+
+xo-cli list-objects id=$vm_id >"$vm_info_json"
+vdi_id=$(xo-cli list-objects type=VBD | jq -r '.[] | select(.VM=="'$vm_id'" and .is_cd_drive==false and .position=="0") | .VDI' | head -n 1)
+[ -n "$vdi_id" ] || { echo "failed to discover target VDI for VM $vm_id" >&2; exit 1; }
+xo-cli list-objects type=VDI | jq '[.[] | select(.id=="'$vdi_id'")]' >"$vdi_info_json"
+vdi_size=$(jq -r '.[0].size' "$vdi_info_json")
+[ -n "$vdi_size" ] || { echo "failed to discover VDI size for $vdi_id" >&2; exit 1; }
+
+if [ -n "$requested_disk_capacity" ] && [ "$requested_disk_capacity" != "$vdi_size" ]; then
+    echo "existing XCP-ng import path requires an image that matches the target VDI size; use DISK_CAPACITY=$vdi_size or leave it unset" >&2
+    exit 1
+fi
+disk_capacity=$vdi_size
+requested_disk_bytes=$vdi_size
+
+sed "s|__ROOT_AUTHORIZED_KEY__|$root_authorized_key|g" "$os_template" > "$phase11_os_file"
+
+KEEP_WORKDIR=1 WORKDIR=$workdir/phase8-build OS_FILE=$phase11_os_file SYSTEM_NAME=$system_name DISK_CAPACITY=$disk_capacity \
+    METADATA_OUT=$phase8_metadata "$repo_root/tests/system/run-phase8-system-image.sh" \
+    >"$phase8_log" 2>&1
+
+disk_image=$(sed -n 's/^disk_image=//p' "$phase8_metadata")
+closure_path=$(sed -n 's/^closure_path=//p' "$phase8_metadata")
+closure_base=$(basename "$closure_path")
+raw_sha256=$(sed -n 's/^raw_sha256=//p' "$phase8_metadata")
+image_store_path=$(sed -n 's/^image_store_path=//p' "$phase8_metadata")
+
+command -v qemu-img >/dev/null 2>&1 || {
+    echo "qemu-img is required to convert the raw Fruix image to XCP-ng-compatible VHD" >&2
+    exit 1
+}
+qemu-img convert -f raw -O vpc -o subformat=dynamic,force_size=on "$disk_image" "$upload_image"
+upload_sha256=$(sha256 -q "$upload_image")
+upload_size_bytes=$(stat -f '%z' "$upload_image")
+
+xo-cli vm.stop id=$vm_id force=true >/dev/null 2>&1 || true
+xo-cli disk.importContent id=$vdi_id @=$upload_image >"$workdir/disk-import.out"
+xo-cli vm.setBootOrder vm=$vm_id order=dcn >"$workdir/set-boot-order.out"
+xo-cli vm.start id=$vm_id >"$workdir/vm-start.out"
+
+vm_mac=$(jq -r '.[0].VIFs[0]' "$vm_info_json")
+if [ -n "$vm_mac" ] && [ "$vm_mac" != null ]; then
+    vm_mac=$(xo-cli list-objects type=VIF | jq -r '.[] | select(.id=="'$vm_mac'") | .MAC' | tr 'A-Z' 'a-z')
+else
+    vm_mac=
+fi
+host_interface=$(route -n get default | awk '/interface:/{print $2; exit}')
+host_ip=$(ifconfig "$host_interface" | awk '/inet /{print $2; exit}')
+subnet_prefix=${host_ip%.*}
+
+ssh_guest() {
+    ssh -i "$root_ssh_private_key_file" \
+        -o BatchMode=yes \
+        -o StrictHostKeyChecking=no \
+        -o UserKnownHostsFile=/dev/null \
+        -o ConnectTimeout=5 \
+        root@"$guest_ip" "$@"
+}
+
+guest_ip=
+for attempt in $(jot 90 1 90); do
+    : >"$arp_scan_log"
+    for host in $(jot 254 1 254); do
+        ip=$subnet_prefix.$host
+        (
+            ping -c 1 -W 1000 "$ip" >/dev/null 2>&1 && echo "$ip" >>"$arp_scan_log"
+        ) &
+    done
+    wait
+    if [ -n "$vm_mac" ]; then
+        guest_ip=$(arp -an | awk -v mac="$vm_mac" 'tolower($4)==mac {gsub(/[()]/,"",$2); print $2; exit}')
+    fi
+    if [ -n "$guest_ip" ]; then
+        if ssh -i "$root_ssh_private_key_file" \
+            -o BatchMode=yes \
+            -o StrictHostKeyChecking=no \
+            -o UserKnownHostsFile=/dev/null \
+            -o ConnectTimeout=3 \
+            root@"$guest_ip" 'test -f /var/lib/fruix/ready' >"$ssh_stdout" 2>"$ssh_stderr"; then
+            break
+        fi
+    fi
+    sleep 5
+done
+
+[ -n "$guest_ip" ] || {
+    echo "guest IP was not discovered; manual console inspection is likely required" >&2
+    exit 1
+}
+
+ready_marker=$(ssh_guest 'cat /var/lib/fruix/ready')
+run_current_system_target=$(ssh_guest 'readlink /run/current-system')
+pid1_command=$(ssh_guest 'ps -p 1 -o command= | sed "s/^ *//"')
+shepherd_pid=$(ssh_guest 'cat /var/run/shepherd.pid')
+shepherd_socket=$(ssh_guest 'test -S /var/run/shepherd.sock && echo present || echo missing')
+shepherd_status=$(ssh_guest 'test -f /var/run/shepherd.pid && kill -0 "$(cat /var/run/shepherd.pid)" >/dev/null 2>&1 && echo running || echo stopped')
+logger_log=$(ssh_guest 'cat /var/log/fruix-shepherd.log' | tr '\n' ' ')
+sshd_status=$(ssh_guest 'service sshd onestatus >/dev/null 2>&1 && echo running || echo stopped')
+uname_output=$(ssh_guest 'uname -sr')
+operator_home_listing=$(ssh_guest 'ls -d /home/operator')
+activate_preview=$(ssh_guest 'head -n 5 /run/current-system/activate' | tr '\n' ' ')
+
+[ "$ready_marker" = ready ] || { echo "unexpected ready marker contents: $ready_marker" >&2; exit 1; }
+[ "$shepherd_pid" = 1 ] || { echo "shepherd is not PID 1: pid=$shepherd_pid command=$pid1_command" >&2; exit 1; }
+[ "$shepherd_socket" = present ] || { echo "shepherd socket is missing" >&2; exit 1; }
+[ "$shepherd_status" = running ] || { echo "shepherd is not running" >&2; exit 1; }
+[ "$sshd_status" = running ] || { echo "sshd is not running" >&2; exit 1; }
+[ "$run_current_system_target" = "/frx/store/$closure_base" ] || {
+    echo "unexpected /run/current-system target in guest: $run_current_system_target" >&2
+    exit 1
+}
+[ "$operator_home_listing" = /home/operator ] || { echo "operator home missing" >&2; exit 1; }
+
+cat >"$metadata_file" <}
+requested_disk_bytes=$requested_disk_bytes
+phase11_os_file=$phase11_os_file
+phase8_log=$phase8_log
+phase8_metadata=$phase8_metadata
+image_store_path=$image_store_path
+disk_image=$disk_image
+upload_image=$upload_image
+upload_format=vhd-dynamic
+upload_sha256=$upload_sha256
+upload_size_bytes=$upload_size_bytes
+closure_path=$closure_path
+closure_base=$closure_base
+raw_sha256=$raw_sha256
+guest_ip=$guest_ip
+vm_mac=$vm_mac
+ready_marker=$ready_marker
+run_current_system_target=$run_current_system_target
+pid1_command=$pid1_command
+shepherd_pid=$shepherd_pid
+shepherd_socket=$shepherd_socket
+shepherd_status=$shepherd_status
+sshd_status=$sshd_status
+logger_log=$logger_log
+uname_output=$uname_output
+operator_home_listing=$operator_home_listing
+activate_preview=$activate_preview
+boot_backend=xcp-ng-xo-cli
+init_mode=shepherd-pid1
+operator_access=ssh-root-key
+root_authorized_key_file=$root_authorized_key_file
+root_ssh_private_key_file=$root_ssh_private_key_file
+EOF
+
+if [ -n "$metadata_target" ]; then
+    mkdir -p "$(dirname "$metadata_target")"
+    cp "$metadata_file" "$metadata_target"
+fi
+
+printf 'PASS phase11-shepherd-pid1-xcpng\n'
+printf 'Work directory: %s\n' "$workdir"
+printf 'Metadata file: %s\n' "$metadata_file"
+if [ -n "$metadata_target" ]; then
+    printf 'Copied metadata to: %s\n' "$metadata_target"
+fi
+printf '%s\n' '--- metadata ---'
+cat "$metadata_file"
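
Review note: the harness above writes its results as a `key=value` metadata file, and it already reads such files with `sed -n 's/^key=//p'`. The same approach could be used to re-check the PID 1 assertions offline against a saved metadata file, without SSH access to the guest. The sketch below is only an illustration under that assumption; the function names `metadata_get` and `check_shepherd_pid1` are hypothetical and are not part of this change.

```sh
#!/bin/sh
# Hypothetical helper (not in the harness): print the value stored for
# key $1 in the key=value metadata file $2. Only the first match is used.
metadata_get() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

# Hypothetical check: succeed only when the metadata file records the
# same PID 1 facts that the harness asserts over SSH.
check_shepherd_pid1() {
    file=$1
    [ "$(metadata_get shepherd_pid "$file")" = 1 ] || return 1
    [ "$(metadata_get shepherd_socket "$file")" = present ] || return 1
    [ "$(metadata_get shepherd_status "$file")" = running ] || return 1
    [ "$(metadata_get init_mode "$file")" = shepherd-pid1 ] || return 1
}
```

With these definitions loaded, a call such as `check_shepherd_pid1 "$METADATA_OUT" && echo ok` would confirm a saved run. Note that `pid1_command` is deliberately not checked: as the report explains, it shows `[guile]` even when Shepherd is PID 1.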