Validate Shepherd PID 1 boot on XCP-ng
@@ -2486,3 +2486,58 @@ Next recommended step:
1. try the `shepherd-pid1` image on the real XCP-ng VM
2. if it boots there too, decide whether to keep `shepherd-pid1` as an experimental selectable boot mode or advance it further toward the main Fruix boot path
3. continue reducing the remaining Guile / Shepherd compatibility-prefix shims now that the broader `rc.d` boot-manager dependency has been locally bypassed

## 2026-04-02 — Post-Phase-10: Shepherd-as-PID-1 boot also passed on the real XCP-ng VM

Completed work:

- took the locally validated `shepherd-pid1` boot mode and tested it on the real XCP-ng deployment path
- wrote the follow-up report:
  - `docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md`
- expanded the Shepherd-PID-1 operating-system template so the generated guest remains compatible with both local virtio and the real Xen NIC path:
  - `tests/system/phase11-shepherd-pid1-operating-system.scm.in`
  - now includes:
    - `ifconfig_xn0=SYNCDHCP`
    - `ifconfig_em0=SYNCDHCP`
    - `ifconfig_vtnet0=SYNCDHCP`
- added a dedicated real-VM Shepherd-PID-1 deployment/validation harness:
  - `tests/system/run-phase11-shepherd-pid1-xcpng.sh`

Validation:

- `tests/system/run-phase11-shepherd-pid1-xcpng.sh` now passes on the operator-approved VM and existing VDI:
  - VM `90490f2e-e8fc-4b7a-388e-5c26f0157289`
  - VDI `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`
- passing run workdir:
  - `/tmp/pid1-xcpng-1775129768`
- passing real-guest metadata confirmed:
  - `ready_marker=ready`
  - `run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd`
  - `pid1_command=[guile]`
  - `shepherd_pid=1`
  - `shepherd_socket=present`
  - `shepherd_status=running`
  - `sshd_status=running`
  - `init_mode=shepherd-pid1`

Important findings:

- the local QEMU PID 1 prototype was not a simulator-only artifact; the same general boot design also works on the real XCP-ng/Xen guest
- as expected for a Guile-script entry point, the PID 1 process image shows up as Guile, but the meaningful architectural check is that:
  - `/var/run/shepherd.pid` contains `1`
- this means Fruix has now validated two distinct real-VM boot architectures on FreeBSD:
  - `freebsd-init+rc.d-shepherd`
  - `shepherd-pid1`
- however, this still does not remove the current Guile / Shepherd compatibility-prefix shims; those remain a separate runtime-artifact issue rather than an init-manager issue

Current assessment:

- Shepherd-as-PID-1 is no longer merely a local prototype; it is validated on the real XCP-ng VM as well
- this significantly strengthens the path toward a more Guix-like Fruix system architecture on FreeBSD
- the main remaining native-runtime gap is now the baked-prefix / compatibility-shim problem, not whether Fruix can boot with Shepherd as PID 1

Next recommended step:

1. focus directly on eliminating the remaining Guile / Shepherd compatibility-prefix shims from the guest runtime
2. preserve `shepherd-pid1` as an experimental selectable boot mode while that cleanup proceeds
3. once the runtime-prefix issue is reduced, reassess whether `shepherd-pid1` should replace the older `freebsd-init+rc.d-shepherd` path as the preferred Fruix boot architecture
114
docs/reports/postphase10-shepherd-pid1-xcpng-freebsd.md
Normal file
@@ -0,0 +1,114 @@
# Post-Phase-10: Shepherd-as-PID-1 boot validated on the real XCP-ng FreeBSD VM

Date: 2026-04-02

## Goal

Take the locally validated Shepherd-as-PID-1 Fruix boot prototype and test it on the real operator-approved XCP-ng VM.

Target objects remained the same constrained deployment path used for Phase 9:

- VM: `90490f2e-e8fc-4b7a-388e-5c26f0157289`
- VDI: `0f1f90d3-48ca-4fa2-91d8-fc6339b95743`

The concrete goal for this subphase was to confirm that the new `shepherd-pid1` init mode was not merely a local QEMU curiosity, but could also:

- boot on the real Xen guest,
- reach DHCP and SSH,
- keep Shepherd running as PID 1,
- and still reach the Fruix ready marker.

## Result

The real XCP-ng boot succeeded.

A new deployment/validation harness was added:

- `tests/system/run-phase11-shepherd-pid1-xcpng.sh`

This harness reuses the existing real-VM deployment method:

- build a full-size image matching the existing VDI
- convert it to dynamic VHD
- overwrite the existing VDI
- boot the real VM
- rediscover the guest by MAC/IP
- validate the booted guest over SSH

The new Shepherd-PID-1 image passes that full path.
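
The "rediscover the guest by MAC/IP" step can be sketched as a small filter over ARP-table text. This is a minimal sketch, assuming FreeBSD-style `arp -an` output of the form `? (IP) at MAC on IFACE ...`; the helper name `ip_for_mac` and the sample addresses are hypothetical and are not taken from the harness.

```sh
#!/bin/sh
# Hypothetical helper: print the first IP whose ARP entry matches the given MAC.
# Reads ARP-table text on stdin; assumes lines shaped like
#   ? (192.0.2.62) at aa:bb:cc:dd:ee:ff on xn0 expires in 900 seconds [ethernet]
ip_for_mac() {
  mac=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
  awk -v mac="$mac" 'tolower($4) == mac { gsub(/[()]/, "", $2); print $2; exit }'
}

# Made-up sample table, used instead of calling arp so the sketch runs anywhere.
sample='? (192.0.2.1) at 00:11:22:33:44:55 on xn0 expires in 1200 seconds [ethernet]
? (192.0.2.62) at AA:BB:CC:DD:EE:FF on xn0 expires in 900 seconds [ethernet]'

printf '%s\n' "$sample" | ip_for_mac 'aa:bb:cc:dd:ee:ff'
```

On a real host the input would come from `arp -an` after the subnet has been pinged, so that the guest's entry is present in the ARP cache.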

## Validation

Passing real-VM run:

- `PASS phase11-shepherd-pid1-xcpng`
- workdir: `/tmp/pid1-xcpng-1775129768`

Validated metadata from the real guest:

```text
ready_marker=ready
run_current_system_target=/frx/store/2940c952e9d35e47f98fe62f296be2b6ab4fceb3eee8248d6a7823decd42a305-fruix-system-fruix-freebsd
pid1_command=[guile]
shepherd_pid=1
shepherd_socket=present
shepherd_status=running
sshd_status=running
guest_ip=192.168.213.62
boot_backend=xcp-ng-xo-cli
init_mode=shepherd-pid1
```

The key architectural confirmation is:

- `shepherd_pid=1`

That shows the running Shepherd instance in the real guest is PID 1.

As in the local QEMU prototype, the process image is Guile because Shepherd is launched as a Guile script; however, the service manager itself is the PID 1 process according to Shepherd's own pidfile and control socket state.
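
The distinction between the process image and the pidfile can be checked mechanically against a metadata file like the one above. The following is a minimal sketch; the helper name `meta_get` and the temporary file are hypothetical, and only the keys and values are taken from the passing run.

```sh
#!/bin/sh
# Hypothetical helper: read one value from a key=value metadata file.
meta_get() {
  sed -n "s/^$1=//p" "$2" | head -n 1
}

# Sample metadata, using values reported by the passing run.
meta=$(mktemp)
cat >"$meta" <<'EOF'
pid1_command=[guile]
shepherd_pid=1
shepherd_socket=present
shepherd_status=running
EOF

# The process image may be Guile; the check that matters is Shepherd's own
# pidfile value together with the presence of its control socket.
if [ "$(meta_get shepherd_pid "$meta")" = 1 ] &&
   [ "$(meta_get shepherd_socket "$meta")" = present ]; then
  echo "shepherd is PID 1"
else
  echo "shepherd is not PID 1" >&2
fi
rm -f "$meta"
```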

## What changed to make the real VM pass

The most important refinement after the first local PID 1 work was making the generated activation path more tolerant of immutable store-backed configuration files during very early boot.

Specifically, the generated activation script now treats these as best-effort:

- `cap_mkdb /etc/login.conf`
- `pwd_mkdb -p /etc/master.passwd`

That matters because on the PID 1 path these commands run earlier in boot, and a failure should not abort the boot when the current `/etc` representation is not suitable for in-place database regeneration.
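
The generated activation script itself is not part of this commit, so the following is only a sketch of one way to express "best-effort" in a script that otherwise runs under `set -e`. The helper name `best_effort` is hypothetical, and `false` / `true` stand in for the two database commands so the sketch runs on any host.

```sh
#!/bin/sh
set -eu

# Hypothetical helper: run a command, warn on failure, never abort the caller.
# Testing the command inside `if` keeps `set -e` from terminating the script.
best_effort() {
  if ! "$@"; then
    echo "warning: best-effort step failed: $*" >&2
  fi
}

# Stand-ins for the real steps; on the guest these would be
#   best_effort cap_mkdb /etc/login.conf
#   best_effort pwd_mkdb -p /etc/master.passwd
best_effort false
best_effort true
echo "activation continued"
```

A plain `command || true` would have the same effect on control flow; the helper only adds a warning so that a skipped regeneration remains visible in the boot log.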

The Shepherd-PID-1 operating-system template was also expanded to keep the NIC configuration broad enough for both local virtio and the real Xen path:

- `ifconfig_xn0=SYNCDHCP`
- `ifconfig_em0=SYNCDHCP`
- `ifconfig_vtnet0=SYNCDHCP`
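
As an illustration of how one entry per candidate NIC keeps the same image usable on several hypervisors, the entries above can be rendered from a list of interface names. This is a sketch only; `render_nic_entries` is a hypothetical helper, and the template in this commit lists the three entries literally rather than generating them.

```sh
#!/bin/sh
# Hypothetical helper: emit one rc.conf-style DHCP line per candidate NIC name.
render_nic_entries() {
  for nic in "$@"; do
    printf 'ifconfig_%s="SYNCDHCP"\n' "$nic"
  done
}

# xn0 is the Xen netfront NIC, em0 an emulated Intel NIC, vtnet0 a VirtIO NIC.
render_nic_entries xn0 em0 vtnet0
```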

## Assessment

This is a stronger result than the earlier local-only prototype.

Fruix now has a real deployment-validated FreeBSD boot mode where:

- FreeBSD `init(8)` hands off immediately via `init_exec`
- the generated Fruix launcher performs the minimal bootstrap
- Shepherd becomes PID 1
- networking and SSH still work on the real XCP-ng VM
- and the system still reaches the Fruix ready marker

That means the project has now validated both of these boot architectures on the real VM:

1. `freebsd-init+rc.d-shepherd`
2. `shepherd-pid1`

## Remaining limitations

This does not yet eliminate the current locally built Guile/Shepherd compatibility-prefix shims.

Those shims are still needed because the locally staged runtime artifacts continue to embed historical build prefixes. The current result proves that the broader init/boot-manager dependency can be removed, but it does not yet fully solve the store-native runtime-prefix problem.

## Conclusion

The Shepherd-as-PID-1 Fruix boot mode now works not only under local QEMU/UEFI, but also on the real operator-approved XCP-ng VM.

This substantially strengthens the case that Fruix can move beyond the transitional `rc.d` bridge design and toward a more Guix-like PID-1-centered system architecture on FreeBSD.
@@ -70,6 +70,8 @@
    ("hostid_enable" . "NO")
    ("sendmail_enable" . "NONE")
    ("sshd_enable" . "YES")
    ("ifconfig_xn0" . "SYNCDHCP")
    ("ifconfig_em0" . "SYNCDHCP")
    ("ifconfig_vtnet0" . "SYNCDHCP"))
  #:init-mode 'shepherd-pid1
  #:ready-marker "/var/lib/fruix/ready"
217
tests/system/run-phase11-shepherd-pid1-xcpng.sh
Executable file
@@ -0,0 +1,217 @@
#!/bin/sh
set -eu

repo_root=$(CDPATH= cd -- "$(dirname "$0")/../.." && pwd)
vm_id=90490f2e-e8fc-4b7a-388e-5c26f0157289
os_template=${OS_TEMPLATE:-$repo_root/tests/system/phase11-shepherd-pid1-operating-system.scm.in}
system_name=${SYSTEM_NAME:-phase11-operating-system}
metadata_target=${METADATA_OUT:-}
root_authorized_key_file=${ROOT_AUTHORIZED_KEY_FILE:-$HOME/.ssh/id_ed25519.pub}
root_ssh_private_key_file=${ROOT_SSH_PRIVATE_KEY_FILE:-$HOME/.ssh/id_ed25519}
requested_disk_capacity=${DISK_CAPACITY:-}

cleanup=0
if [ -n "${WORKDIR:-}" ]; then
  workdir=$WORKDIR
  mkdir -p "$workdir"
else
  workdir=$(mktemp -d /tmp/fruix-phase11-xcpng.XXXXXX)
  cleanup=1
fi
if [ "${KEEP_WORKDIR:-0}" -eq 1 ]; then
  cleanup=0
fi

phase11_os_file=$workdir/phase11-shepherd-pid1-operating-system.scm
phase8_log=$workdir/phase8-system-image.log
phase8_metadata=$workdir/phase8-system-image-metadata.txt
arp_scan_log=$workdir/arp-scan.log
ssh_stdout=$workdir/ssh.out
ssh_stderr=$workdir/ssh.err
metadata_file=$workdir/phase11-shepherd-pid1-xcpng-metadata.txt
vdi_info_json=$workdir/vdi-info.json
vm_info_json=$workdir/vm-info.json
upload_image=$workdir/disk.vhd

cleanup_workdir() {
  if [ "$cleanup" -eq 1 ]; then
    rm -rf "$workdir"
  fi
}
trap cleanup_workdir EXIT INT TERM

[ -f "$root_authorized_key_file" ] || {
  echo "missing root authorized key file: $root_authorized_key_file" >&2
  exit 1
}
[ -f "$root_ssh_private_key_file" ] || {
  echo "missing root SSH private key file: $root_ssh_private_key_file" >&2
  exit 1
}
root_authorized_key=$(tr -d '\n' < "$root_authorized_key_file")

xo-cli list-objects id=$vm_id >"$vm_info_json"
vdi_id=$(xo-cli list-objects type=VBD | jq -r --arg vm "$vm_id" '.[] | select(.VM==$vm and .is_cd_drive==false and .position=="0") | .VDI' | head -n 1)
[ -n "$vdi_id" ] && [ "$vdi_id" != null ] || { echo "failed to discover target VDI for VM $vm_id" >&2; exit 1; }
xo-cli list-objects type=VDI | jq --arg id "$vdi_id" '[.[] | select(.id==$id)]' >"$vdi_info_json"
vdi_size=$(jq -r '.[0].size' "$vdi_info_json")
[ -n "$vdi_size" ] && [ "$vdi_size" != null ] || { echo "failed to discover VDI size for $vdi_id" >&2; exit 1; }

if [ -n "$requested_disk_capacity" ] && [ "$requested_disk_capacity" != "$vdi_size" ]; then
  echo "existing XCP-ng import path requires an image that matches the target VDI size; use DISK_CAPACITY=$vdi_size or leave it unset" >&2
  exit 1
fi
disk_capacity=$vdi_size
requested_disk_bytes=$vdi_size

sed "s|__ROOT_AUTHORIZED_KEY__|$root_authorized_key|g" "$os_template" > "$phase11_os_file"

KEEP_WORKDIR=1 WORKDIR=$workdir/phase8-build OS_FILE=$phase11_os_file SYSTEM_NAME=$system_name DISK_CAPACITY=$disk_capacity \
  METADATA_OUT=$phase8_metadata "$repo_root/tests/system/run-phase8-system-image.sh" \
  >"$phase8_log" 2>&1

disk_image=$(sed -n 's/^disk_image=//p' "$phase8_metadata")
closure_path=$(sed -n 's/^closure_path=//p' "$phase8_metadata")
closure_base=$(basename "$closure_path")
raw_sha256=$(sed -n 's/^raw_sha256=//p' "$phase8_metadata")
image_store_path=$(sed -n 's/^image_store_path=//p' "$phase8_metadata")

command -v qemu-img >/dev/null 2>&1 || {
  echo "qemu-img is required to convert the raw Fruix image to XCP-ng-compatible VHD" >&2
  exit 1
}
qemu-img convert -f raw -O vpc -o subformat=dynamic,force_size=on "$disk_image" "$upload_image"
upload_sha256=$(sha256 -q "$upload_image")
upload_size_bytes=$(stat -f '%z' "$upload_image")

xo-cli vm.stop id=$vm_id force=true >/dev/null 2>&1 || true
xo-cli disk.importContent id=$vdi_id @=$upload_image >"$workdir/disk-import.out"
xo-cli vm.setBootOrder vm=$vm_id order=dcn >"$workdir/set-boot-order.out"
xo-cli vm.start id=$vm_id >"$workdir/vm-start.out"

vm_mac=$(jq -r '.[0].VIFs[0]' "$vm_info_json")
if [ -n "$vm_mac" ] && [ "$vm_mac" != null ]; then
  vm_mac=$(xo-cli list-objects type=VIF | jq -r '.[] | select(.id=="'$vm_mac'") | .MAC' | tr 'A-Z' 'a-z')
else
  vm_mac=
fi
host_interface=$(route -n get default | awk '/interface:/{print $2; exit}')
host_ip=$(ifconfig "$host_interface" | awk '/inet /{print $2; exit}')
subnet_prefix=${host_ip%.*}

ssh_guest() {
  ssh -i "$root_ssh_private_key_file" \
    -o BatchMode=yes \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    -o ConnectTimeout=5 \
    root@"$guest_ip" "$@"
}

guest_ip=
for attempt in $(jot 90 1 90); do
  : >"$arp_scan_log"
  for host in $(jot 254 1 254); do
    ip=$subnet_prefix.$host
    (
      ping -c 1 -W 1000 "$ip" >/dev/null 2>&1 && echo "$ip" >>"$arp_scan_log"
    ) &
  done
  wait
  if [ -n "$vm_mac" ]; then
    guest_ip=$(arp -an | awk -v mac="$vm_mac" 'tolower($4)==mac {gsub(/[()]/,"",$2); print $2; exit}')
  fi
  if [ -n "$guest_ip" ]; then
    if ssh -i "$root_ssh_private_key_file" \
      -o BatchMode=yes \
      -o StrictHostKeyChecking=no \
      -o UserKnownHostsFile=/dev/null \
      -o ConnectTimeout=3 \
      root@"$guest_ip" 'test -f /var/lib/fruix/ready' >"$ssh_stdout" 2>"$ssh_stderr"; then
      break
    fi
  fi
  sleep 5
done

[ -n "$guest_ip" ] || {
  echo "guest IP was not discovered; manual console inspection is likely required" >&2
  exit 1
}

ready_marker=$(ssh_guest 'cat /var/lib/fruix/ready')
run_current_system_target=$(ssh_guest 'readlink /run/current-system')
pid1_command=$(ssh_guest 'ps -p 1 -o command= | sed "s/^ *//"')
shepherd_pid=$(ssh_guest 'cat /var/run/shepherd.pid')
shepherd_socket=$(ssh_guest 'test -S /var/run/shepherd.sock && echo present || echo missing')
shepherd_status=$(ssh_guest 'test -f /var/run/shepherd.pid && kill -0 "$(cat /var/run/shepherd.pid)" >/dev/null 2>&1 && echo running || echo stopped')
logger_log=$(ssh_guest 'cat /var/log/fruix-shepherd.log' | tr '\n' ' ')
sshd_status=$(ssh_guest 'service sshd onestatus >/dev/null 2>&1 && echo running || echo stopped')
uname_output=$(ssh_guest 'uname -sr')
operator_home_listing=$(ssh_guest 'ls -d /home/operator')
activate_preview=$(ssh_guest 'head -n 5 /run/current-system/activate' | tr '\n' ' ')

[ "$ready_marker" = ready ] || { echo "unexpected ready marker contents: $ready_marker" >&2; exit 1; }
[ "$shepherd_pid" = 1 ] || { echo "shepherd is not PID 1: pid=$shepherd_pid command=$pid1_command" >&2; exit 1; }
[ "$shepherd_socket" = present ] || { echo "shepherd socket is missing" >&2; exit 1; }
[ "$shepherd_status" = running ] || { echo "shepherd is not running" >&2; exit 1; }
[ "$sshd_status" = running ] || { echo "sshd is not running" >&2; exit 1; }
[ "$run_current_system_target" = "/frx/store/$closure_base" ] || {
  echo "unexpected /run/current-system target in guest: $run_current_system_target" >&2
  exit 1
}
[ "$operator_home_listing" = /home/operator ] || { echo "operator home missing" >&2; exit 1; }

cat >"$metadata_file" <<EOF
workdir=$workdir
vm_id=$vm_id
vdi_id=$vdi_id
vdi_size=$vdi_size
disk_capacity=$disk_capacity
requested_disk_capacity=${requested_disk_capacity:-<auto>}
requested_disk_bytes=$requested_disk_bytes
phase11_os_file=$phase11_os_file
phase8_log=$phase8_log
phase8_metadata=$phase8_metadata
image_store_path=$image_store_path
disk_image=$disk_image
upload_image=$upload_image
upload_format=vhd-dynamic
upload_sha256=$upload_sha256
upload_size_bytes=$upload_size_bytes
closure_path=$closure_path
closure_base=$closure_base
raw_sha256=$raw_sha256
guest_ip=$guest_ip
vm_mac=$vm_mac
ready_marker=$ready_marker
run_current_system_target=$run_current_system_target
pid1_command=$pid1_command
shepherd_pid=$shepherd_pid
shepherd_socket=$shepherd_socket
shepherd_status=$shepherd_status
sshd_status=$sshd_status
logger_log=$logger_log
uname_output=$uname_output
operator_home_listing=$operator_home_listing
activate_preview=$activate_preview
boot_backend=xcp-ng-xo-cli
init_mode=shepherd-pid1
operator_access=ssh-root-key
root_authorized_key_file=$root_authorized_key_file
root_ssh_private_key_file=$root_ssh_private_key_file
EOF

if [ -n "$metadata_target" ]; then
  mkdir -p "$(dirname "$metadata_target")"
  cp "$metadata_file" "$metadata_target"
fi

printf 'PASS phase11-shepherd-pid1-xcpng\n'
printf 'Work directory: %s\n' "$workdir"
printf 'Metadata file: %s\n' "$metadata_file"
if [ -n "$metadata_target" ]; then
  printf 'Copied metadata to: %s\n' "$metadata_target"
fi
printf '%s\n' '--- metadata ---'
cat "$metadata_file"