# Phase 1.1 follow-up: Guile subprocess crash on FreeBSD
Date: 2026-04-01
## Summary
`guile3` on this `FreeBSD 15.0-STABLE` amd64 host crashes when Guile tries to create subprocesses through:
- `system*`
- `spawn`
- `open-pipe*`

The crash is reproducible and is **not** caused by FreeBSD's native `posix_spawn(3)` implementation by itself. The evidence points to an **upstream Guile/gnulib integration bug on FreeBSD**:
- gnulib decides to replace `posix_spawn`/`posix_spawnp` on this platform
- Guile still calls the native FreeBSD extension `posix_spawn_file_actions_addclosefrom_np`
- that function receives a gnulib replacement `posix_spawn_file_actions_t` object with an incompatible ABI
- the process crashes inside libc when `addclosefrom_np` interprets gnulib's struct header as a native pointer
## Repro artifacts added
- `tests/guile/posix-spawn-freebsd-diagnostics.c`
- `tests/guile/run-subprocess-diagnostics.sh`

Run with:
```sh
./tests/guile/run-subprocess-diagnostics.sh
```
Expected output on the current host includes:
```text
native-spawn-closefrom=ok
adddup2-invalid-fd-accepted=yes
addopen-invalid-fd-accepted=yes
posix_spawn-secure-exec-result=0
posix_spawnp-secure-exec-result=3
issue-profile-match=yes
system-star exit=139
spawn exit=139
open-pipe-star exit=139
```
## Minimal Guile reproducers
```sh
guile3 -c '(system* "/usr/bin/true")'
guile3 -c '(spawn "/usr/bin/true" (list "/usr/bin/true"))'
guile3 -c '(use-modules (ice-9 popen)) (open-pipe* OPEN_READ "/usr/bin/true")'
```
All three terminate with `SIGSEGV` on this machine. The shell reports this as exit status 139, i.e. 128 + 11, where 11 is the signal number of `SIGSEGV`.
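The `139` is a shell convention rather than the process's own exit code: shells conventionally report a child killed by signal N as 128 + N, and `SIGSEGV` is signal 11 on FreeBSD (and Linux). The minimal C sketch below is not part of the repro scripts; `shell_style_status` and `status_of_child_killed_by` are illustrative names chosen for this example.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a waitpid() status to the number a POSIX shell would put in $?:
   the exit code for a normal exit, 128 + signal number for a child
   killed by a signal, and -1 for anything else. */
int shell_style_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Fork a child that kills itself with `sig` and return its raw wait
   status (or -1 on failure). Core dumps are disabled in the child so
   the demonstration does not leave a core file behind. */
int status_of_child_killed_by(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = { .rlim_cur = 0, .rlim_max = 0 };
        setrlimit(RLIMIT_CORE, &no_core);
        signal(sig, SIG_DFL);
        raise(sig);
        _exit(0); /* only reached if the signal was not fatal */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Passing `SIGSEGV` to `status_of_child_killed_by` and feeding the result to `shell_style_status` yields 139, the value the three reproducers report.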
## Native FreeBSD `posix_spawn` is not the direct problem
A standalone C test using FreeBSD's native APIs works correctly:
- `posix_spawn_file_actions_init`
- `posix_spawn_file_actions_adddup2`
- `posix_spawn_file_actions_addclosefrom_np`
- `posix_spawn`

The diagnostic program in `tests/guile/posix-spawn-freebsd-diagnostics.c` confirms this with:
```text
native-spawn-closefrom=ok
```
So although the fault surfaces inside libc, its cause lies above libc: in how Guile/gnulib prepares the file-actions object that is passed in.
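The native call sequence listed above can be sketched roughly as follows. This is a hedged, minimal stand-in for the kind of check the diagnostics program performs, not that file's actual contents; `run_with_native_closefrom` is an illustrative name, and the sketch assumes a libc that provides `posix_spawn_file_actions_addclosefrom_np` (recent FreeBSD and glibc releases both do).

```c
#define _GNU_SOURCE /* glibc: make sure the *_np extension is declared */
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Spawn argv[0] with fds 0-2 kept (dup2'ed onto themselves) and every
   descriptor >= 3 closed in the child. Returns the child's exit status,
   or -1 if any step fails or the child did not exit normally. */
int run_with_native_closefrom(char *const argv[])
{
    posix_spawn_file_actions_t actions; /* native opaque-pointer ABI */
    pid_t pid;
    int status, err;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;

    /* Some dup2 actions first, then closefrom: the same ordering this
       report describes for Guile's do_spawn. */
    for (int fd = 0; fd <= 2; fd++) {
        if (posix_spawn_file_actions_adddup2(&actions, fd, fd) != 0)
            goto fail;
    }
    if (posix_spawn_file_actions_addclosefrom_np(&actions, 3) != 0)
        goto fail;

    err = posix_spawn(&pid, argv[0], &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);
    if (err != 0 || waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;

fail:
    posix_spawn_file_actions_destroy(&actions);
    return -1;
}
```

For example, with `argv = { "/bin/sh", "-c", "exit 7", NULL }` the function returns 7. Because the file-actions object here is the native type throughout, `addclosefrom_np` receives exactly what it expects.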
## Why gnulib replaces `posix_spawn` on this host
Upstream Guile 3.0.10 vendors gnulib logic in `m4/posix_spawn.m4`.
Two FreeBSD-relevant observations from the local diagnostics match gnulib's replacement logic:
1. `posix_spawnp` is considered insecure by gnulib's test because it accepts a script without a shebang and ends up running it successfully instead of rejecting it with `ENOEXEC`.
2. FreeBSD's `posix_spawn_file_actions_adddup2` and `posix_spawn_file_actions_addopen` accept obviously invalid file descriptors in gnulib's probe cases, so gnulib also wants its wrapper/replacement behavior for those functions.

Observed locally:
```text
adddup2-invalid-fd-accepted=yes
addopen-invalid-fd-accepted=yes
posix_spawnp-secure-exec-result=3
```
That strongly indicates `REPLACE_POSIX_SPAWN=1` in the Guile build on this system.
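Both observations can be reproduced in spirit with a short C sketch. It is modeled loosely on the kind of checks `m4/posix_spawn.m4` performs and is **not** gnulib's actual probe code: the function names, the `/tmp/spawnp-probe-XXXXXX` template (assuming `/tmp` is writable and not mounted `noexec`), and the deliberately out-of-range descriptor `10000000` are all assumptions of this sketch.

```c
#include <errno.h>
#include <spawn.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Assumed to be far above any realistic descriptor limit (OPEN_MAX). */
#define ABSURD_FD 10000000

/* Returns 1 if adddup2 accepts ABSURD_FD as its source descriptor,
   0 if it rejects it (POSIX specifies EBADF for fd >= OPEN_MAX),
   -1 if the file-actions object could not be initialised. */
int adddup2_accepts_absurd_fd(void)
{
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;
    int err = posix_spawn_file_actions_adddup2(&fa, ABSURD_FD, 2);
    posix_spawn_file_actions_destroy(&fa);
    return err == 0 ? 1 : 0;
}

/* Returns 1 if posix_spawnp ran a script that has no "#!" line,
   0 if the script was refused (ENOEXEC, or the child exited with 127),
   -1 if the probe itself could not be carried out. */
int spawnp_runs_shebangless_script(void)
{
    char path[] = "/tmp/spawnp-probe-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* ":" is a shell no-op; there is deliberately no "#!" line. */
    int ok = write(fd, ":\n", 2) == 2 && fchmod(fd, 0700) == 0;
    close(fd); /* close before exec to avoid ETXTBSY */
    if (!ok) {
        unlink(path);
        return -1;
    }

    char *argv[] = { path, NULL };
    pid_t pid;
    int status, result;
    int err = posix_spawnp(&pid, path, NULL, NULL, argv, environ);

    if (err == ENOEXEC)
        result = 0;
    else if (err != 0 || waitpid(pid, &status, 0) != pid)
        result = -1;
    else if (WIFEXITED(status) && WEXITSTATUS(status) == 127)
        result = 0;
    else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = 1; /* the shell fallback ran the script */
    else
        result = -1;

    unlink(path);
    return result;
}
```

On a host that behaves as this report describes, both functions would be expected to return 1; a libc that validates descriptors in `adddup2` and refuses shebang-less scripts in `posix_spawnp` should return 0 from both.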
## Root cause hypothesis
### 1. Guile uses `addclosefrom_np` when the symbol exists
In upstream Guile 3.0.10, `libguile/posix.c` contains:
- `#ifdef HAVE_POSIX_SPAWN_FILE_ACTIONS_ADDCLOSEFROM_NP`
- `#define HAVE_ADDCLOSEFROM 1`
- later in `do_spawn(...)`:
```c
#ifdef HAVE_ADDCLOSEFROM
posix_spawn_file_actions_addclosefrom_np (&actions, 3);
#else
close_inherited_fds (&actions, max_fd);
#endif
```
### 2. But gnulib can replace the `posix_spawn` ABI
In upstream gnulib's `lib/spawn.in.h`, when `REPLACE_POSIX_SPAWN=1`, `posix_spawn_file_actions_t` becomes a gnulib-defined struct instead of the native FreeBSD opaque-pointer type.
FreeBSD's native `/usr/include/spawn.h` defines:
```c
typedef struct __posix_spawn_file_actions *posix_spawn_file_actions_t;
```
So native FreeBSD expects `posix_spawn_file_actions_t` to be a single opaque pointer, while gnulib's replacement mode makes it a multi-field struct stored directly in the caller's object.
### 3. The crash signature matches that ABI mismatch exactly
The lldb backtrace from the core file shows the crash in:
```text
libc.so.7`posix_spawn_file_actions_addclosefrom_np
```
with:
```text
*fa = 0x0000000600000008
```
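That value is exactly what results from reading the first two 32-bit fields of gnulib's replacement file-actions struct as one little-endian 64-bit pointer, with `_allocated` in the low half and `_used` in the high half:
- `_allocated = 8`
- `_used = 6`

These values are consistent with Guile having scheduled six `dup2` actions in `do_spawn(...)`: six slots used out of eight allocated.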
In other words, libc is reading gnulib's struct header as though it were a native pointer to `struct __posix_spawn_file_actions`, which explains the segmentation fault.
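The arithmetic above can be checked with a small C sketch that fills in the two header fields of a struct and reads its first eight bytes back as a 64-bit word, which is effectively what libc does when it dereferences a `posix_spawn_file_actions_t *` expecting a pointer. The struct is only modeled on gnulib's replacement type: `_allocated` and `_used` come from this report, while the remaining members (`_actions`, `_pad`) and all type names are assumptions. The expected value holds on little-endian targets with 32-bit `int`, such as amd64.

```c
#include <stdint.h>
#include <string.h>

/* Native FreeBSD shape: the public type is just a pointer to an opaque
   struct (renamed here to avoid clashing with <spawn.h>). */
struct native_file_actions_impl;
typedef struct native_file_actions_impl *native_file_actions_t;

/* Replacement shape, modeled on gnulib's: a struct whose first two
   members are 32-bit counters. Members after _used are assumptions. */
struct spawn_action_sketch;
typedef struct {
    int _allocated;
    int _used;
    struct spawn_action_sketch *_actions;
    int _pad[16];
} replacement_file_actions_t;

/* Return the 64-bit value a callee would load if it treated the start
   of a replacement object as a native_file_actions_t. */
uint64_t header_as_pointer(int allocated, int used)
{
    replacement_file_actions_t fa;
    memset(&fa, 0, sizeof fa);
    fa._allocated = allocated;
    fa._used = used;

    uint64_t word;
    memcpy(&word, &fa, sizeof word); /* first 8 bytes: _allocated, _used */
    return word;
}
```

On amd64, `header_as_pointer(8, 6)` evaluates to `0x0000000600000008`, the `*fa` value in the backtrace. The size difference between the two types (one pointer versus a struct of several dozen bytes) is the ABI mismatch in miniature.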
## Assessment
This looks like an **upstream Guile bug on FreeBSD-family systems where**:
- gnulib decides `REPLACE_POSIX_SPAWN=1`, **and**
- the platform exposes native `posix_spawn_file_actions_addclosefrom_np`

It does **not** look like a Guix-specific bug, nor primarily a local packaging mistake.
## Recommended fix direction
The safest fix is in Guile's `libguile/posix.c`:
- only use `posix_spawn_file_actions_addclosefrom_np` when Guile is using the **native** `posix_spawn` / `posix_spawn_file_actions_t` ABI
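- if gnulib replacement `posix_spawn` is active, fall back to `close_inherited_fds(&actions, max_fd)` instead

In practice that likely means guarding the `HAVE_ADDCLOSEFROM` path with an additional condition equivalent to:
```c
#if defined(HAVE_POSIX_SPAWN_FILE_ACTIONS_ADDCLOSEFROM_NP) && !defined(REPLACE_POSIX_SPAWN)
```
or another build-time condition that guarantees ABI compatibility. Note that in gnulib, `REPLACE_POSIX_SPAWN` is a configure substitution variable (used as `@REPLACE_POSIX_SPAWN@` in `lib/spawn.in.h`) rather than a C preprocessor macro. Unless Guile's build exports that decision to `config.h` (for example via `AC_DEFINE`), `!defined(REPLACE_POSIX_SPAWN)` is always true and the guard above would change nothing.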
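A minimal sketch of such a guard, assuming a hypothetical `HYPOTHETICAL_NATIVE_POSIX_SPAWN_ABI` macro that the build system would define only when gnulib is not replacing `posix_spawn`. `schedule_close_inherited` is an illustrative name and is not Guile's `close_inherited_fds`; its fallback simply adds one `posix_spawn_file_actions_addclose` action per descriptor that `fcntl(F_GETFD)` reports as open in the parent.

```c
#include <fcntl.h>
#include <spawn.h>

/* HAVE_POSIX_SPAWN_FILE_ACTIONS_ADDCLOSEFROM_NP is Guile's configure
   result; HYPOTHETICAL_NATIVE_POSIX_SPAWN_ABI is an assumed macro that
   would mean "posix_spawn_file_actions_t has the native libc ABI". */
#if defined(HAVE_POSIX_SPAWN_FILE_ACTIONS_ADDCLOSEFROM_NP) \
    && defined(HYPOTHETICAL_NATIVE_POSIX_SPAWN_ABI)
# define USE_NATIVE_ADDCLOSEFROM 1
#endif

/* Schedule closing of inherited descriptors >= 3. Returns 0 or an
   errno value, like the posix_spawn_file_actions_* functions. */
int schedule_close_inherited(posix_spawn_file_actions_t *actions, int max_fd)
{
#ifdef USE_NATIVE_ADDCLOSEFROM
    (void) max_fd;
    /* Safe only because the object is known to have the native ABI. */
    return posix_spawn_file_actions_addclosefrom_np(actions, 3);
#else
    /* Portable fallback: one close action per descriptor below max_fd
       that is currently open in the parent. */
    for (int fd = 3; fd < max_fd; fd++) {
        if (fcntl(fd, F_GETFD) == -1)
            continue; /* not open in the parent: nothing to close */
        int err = posix_spawn_file_actions_addclose(actions, fd);
        if (err != 0)
            return err;
    }
    return 0;
#endif
}
```

Compiled on its own, neither macro is defined, so the portable fallback is what runs. The point of the design is that the `_np` call is reachable only when the ABI is known to be native; otherwise the code stays with the standard `posix_spawn_file_actions_*` calls, assuming (as the recommended fix does) that these are available for gnulib's replacement type.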
## Impact on the Guix-on-FreeBSD port
This is an important blocker because Guix and Guile code frequently depend on subprocess creation helpers.
However, the investigation also confirms:
- lower-level process primitives still work (`primitive-fork`, `waitpid`)
- sockets, file I/O, and FFI still work
- the problem is narrow enough to patch or work around

So the Guix port remains viable, but robust subprocess handling on FreeBSD will likely require one of:
1. a local Guile patch, or
2. an upstream fix to Guile/gnulib integration, or
3. temporary Guix-side avoidance of the crashing subprocess helpers while bootstrapping the port