# Tribes local-control API
The local-control broker is a small Guile daemon listening on a Unix-domain
socket. It fronts every operator action that a Tribes deployment can take on
its own host:
- **resolve** a `SystemTarget` into a build plan.
- **prepare** a build (pull channels + `guix system build`) without
activating it.
- **commit** a previously-prepared generation (`guix system
switch-generation`).
- **rollback** to a retained store path or, failing that, rebuild from a
plan and switch.
- **abort** an in-flight job.
- inspect **status** and **generations**.
This document specifies the wire schema. The BEAM client at
`tribes/lib/tribes/local_control.ex` should be updated to match it.
## Transport
- HTTP/1.1 over a Unix-domain socket. The path is configurable via
`TRIBES_LOCAL_CONTROL_SOCKET` (default `/var/run/tribes/local-control.sock`).
- Permissions: socket owned by `root:tribes`, mode `0660`.
- Request bodies are JSON (`Content-Type: application/json`).
- Responses are JSON.
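
As an illustration of the transport, a client can reach the broker using only the Python standard library. This is a minimal sketch, not the BEAM client: `UnixHTTPConnection`, `socket_path`, and `call` are hypothetical names, and the 10 s timeout is an assumption rather than something this document specifies.

```python
import http.client
import json
import os
import socket

DEFAULT_SOCKET = "/var/run/tribes/local-control.sock"


def socket_path(environ=os.environ):
    """Resolve the socket path from TRIBES_LOCAL_CONTROL_SOCKET, else the default."""
    return environ.get("TRIBES_LOCAL_CONTROL_SOCKET", DEFAULT_SOCKET)


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP/1.1 connection that dials a Unix-domain socket instead of TCP."""

    def __init__(self, path, timeout=10.0):  # timeout value is an assumption
        super().__init__("localhost", timeout=timeout)
        self._path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._path)
        self.sock = sock


def call(method, url, body=None, path=None):
    """Send one JSON request and return (http_status, decoded_json_or_None)."""
    conn = UnixHTTPConnection(path or socket_path())
    try:
        payload = None if body is None else json.dumps(body).encode("utf-8")
        headers = {"Content-Type": "application/json"} if payload else {}
        conn.request(method, url, body=payload, headers=headers)
        resp = conn.getresponse()
        raw = resp.read()
        return resp.status, (json.loads(raw) if raw else None)
    finally:
        conn.close()
```

With a broker running, `call("GET", "/v1/deployment/status")` would return the HTTP status and the decoded snapshot. The calling process needs to be `root` or a member of the `tribes` group to open the socket.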
## Concurrency model
The broker runs a single POSIX worker thread. The HTTP request thread is
never blocked on a long-running Guix call: any operation that may exceed
about a second (`prepare`, `commit`, `rollback`) is enqueued on the worker
and returns `202 Accepted` immediately. The caller then polls
`GET /v1/deployment/status` for completion.
There is at most one job in flight at any time. A new submission with the
same `plan_hash` as the running job is **idempotent**: the broker returns
the in-flight snapshot rather than queuing a duplicate. A submission with a
different `plan_hash` while another job runs returns `409 busy`.
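
The admission rule above can be sketched as a small pure function. This is an illustration in Python, not the broker's Guile code; `Job` and `admit` are hypothetical names, and the job counter is passed in by the caller as an assumption about how identifiers are allocated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Job:
    job_id: str
    plan_hash: str


def admit(running: Optional[Job], plan_hash: str, next_id: int) -> Tuple[int, Job, bool]:
    """Return (http_status, job_to_report, enqueued).

    - no job in flight      -> 202, a new job, enqueued
    - same plan_hash        -> 202, the in-flight job, nothing enqueued
    - different plan_hash   -> 409 (busy), the in-flight job, nothing enqueued
    """
    if running is None:
        return 202, Job(f"job-{next_id}", plan_hash), True
    if running.plan_hash == plan_hash:
        return 202, running, False
    return 409, running, False
```

Returning the in-flight job in both the idempotent and the busy case lets the caller see which `job_id` and `plan_hash` currently hold the worker.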
## Endpoints
### `GET /v1/deployment` and `GET /v1/deployment/status`
Returns a status snapshot. Recommended polling interval: 1 s during an
active job, backing off linearly to 5 s after the first minute of polling.
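
One possible reading of this recommendation is sketched below in Python. It assumes the interval stays at 1 s for the first minute and then grows linearly until it reaches 5 s; the 60 s ramp length and the name `poll_interval` are assumptions, since the recommendation does not state how fast the back-off should be.

```python
def poll_interval(elapsed_s, base=1.0, cap=5.0, hold_s=60.0, ramp_s=60.0):
    """Seconds to wait before the next status poll.

    elapsed_s: seconds since polling started.
    hold_s:    how long to stay at the base interval (first minute).
    ramp_s:    assumed duration of the linear ramp from base to cap.
    """
    if elapsed_s <= hold_s:
        return base
    fraction = min((elapsed_s - hold_s) / ramp_s, 1.0)
    return base + (cap - base) * fraction
```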
Snapshot fields:
- `schemaVersion` — string, currently `"2"`.
- `ok` — boolean.
- `status` — high-level state. One of:
`idle | queued | running | pulling | building | switching | completed |
failed | aborted`.
- `phase` — fine-grained phase. Identical to `status` for in-flight jobs;
`ready` after a successful `prepare`; `active` after a successful
`commit`/`rollback`.
- `job_id` — identifier of the in-flight or last-completed job, of the form
`"job-N"`, where N increases monotonically for the lifetime of the broker
process. Clients should treat it as opaque.
- `plan_hash` — the plan hash this job is operating on.
- `started_at`, `last_event_at` — RFC 3339 timestamps.
- `store_path` — the deployment target's `/gnu/store/...-system` path:
the prepared store path after `prepare`, or the selected profile store path
after `commit`/`rollback`.
- `selectedSystem` — canonical `/gnu/store/...-system` path currently selected
by `/var/guix/profiles/system`.
- `runningSystem` — canonical `/gnu/store/...-system` path currently exposed by
`/run/current-system`.
- `generation_number` — the system profile generation number.
- `gc_pinned` — boolean. `true` when the broker holds a GC root via
`--root=` so the prepared system is not collected before a `commit`.
- `built_at`, `activated_at` — RFC 3339 timestamps when present.
- `code` — typed error code on failure (see *Error taxonomy*).
- `reason` — human-readable error message on failure.
- `plugins` — array of plugin names in the deployed plan.
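
A caller typically needs only a few of these fields to decide what to do next. The Python sketch below classifies a snapshot for a given job; `job_state` and `selected_differs_from_running` are hypothetical helpers, and treating `completed`, `failed`, and `aborted` as the terminal statuses is an inference from the status list above.

```python
IN_FLIGHT = frozenset({"queued", "running", "pulling", "building", "switching"})
TERMINAL = frozenset({"completed", "failed", "aborted"})


def job_state(snapshot, job_id):
    """Classify a status snapshot relative to the job the caller submitted."""
    if snapshot.get("job_id") != job_id:
        # The snapshot describes a different (for example, earlier) job.
        return "other_job"
    status = snapshot.get("status")
    if status in IN_FLIGHT:
        return "in_flight"
    if status == "completed":
        return "succeeded"
    if status in TERMINAL:
        return "failed"
    return "unknown"


def selected_differs_from_running(snapshot):
    """True when both system paths are present and point at different stores."""
    selected = snapshot.get("selectedSystem")
    running = snapshot.get("runningSystem")
    return bool(selected and running and selected != running)
```

Checking `job_id` before `status` avoids mistaking the final snapshot of a previous job for the outcome of the one just submitted.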
### `GET /v1/deployment/generations`
Returns the list of recorded generations in newest-first order. Each entry
has the shape below; `status` is one of `active`, `ready`, or `superseded`.
```json
{
  "store_path": "/gnu/store/...-system",
  "generation_number": 42,
  "plan_hash": "plan-abcd...",
  "status": "active",
  "gc_pinned": true,
  "built_at": "2026-04-25T13:01:02Z",
  "activated_at": "2026-04-25T13:01:42Z"
}
```
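
Because the list is newest-first, the first match for a filter is also the most recent one. The following Python sketch shows two lookups a client might want; `first_match`, `active_generation`, and `prepared_for` are hypothetical helper names, and the input is assumed to be the already-decoded list of entries.

```python
def first_match(generations, **wanted):
    """Return the newest entry whose fields equal all of `wanted`, else None."""
    for entry in generations:
        if all(entry.get(key) == value for key, value in wanted.items()):
            return entry
    return None


def active_generation(generations):
    """The entry currently marked active, if any."""
    return first_match(generations, status="active")


def prepared_for(generations, plan_hash):
    """The newest ready (prepared, not yet committed) entry for a plan hash."""
    return first_match(generations, plan_hash=plan_hash, status="ready")
```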
### `POST /v1/deployment/resolve`
Synchronous. Body: a `SystemTarget` JSON object. Response:
- `200` with `{ "schemaVersion": "2", "ok": true, "plan": { ... } }` on
success. The `plan` object includes a `plan_hash` and is suitable for
feeding into `prepare`.
- `409` with the resolver error envelope on capability/manifest/trust
failures.
### `POST /v1/deployment/prepare`
Asynchronous. Body: a plan object containing `plan_hash` and
`resolved_plugins`.
- `202` with `{ "schemaVersion": "2", "status": "queued", "job_id": "...",
"plan_hash": "...", "started_at": "..." }` on accept (or on idempotent
re-submit of the running job).
- `409` with `{ "ok": false, "status": "busy", "reason": "deployment already in
progress", "job_id": "...", "plan_hash": "...", ... }` when another
plan is already in flight.
- `400` on validation error.
The job pulls channels, runs `guix system build --root=...`, registers the
resulting GC root, and records a `ready` generation. The final snapshot is
visible at `GET /v1/deployment/status`.
### `POST /v1/deployment/commit`
Asynchronous. Body: `{ "plan_hash": "..." }`.
- `202` on accept. The job switches the system profile to the
previously-prepared generation, then re-runs activation and Guix's normal
Shepherd service-definition upgrade step inside the pulled/current Guix
profile used for the prepare build. Activation runs with `GUIX_NEW_SYSTEM`
set to the selected generation so `/run/current-system` follows the
profile. Like upstream `guix system reconfigure`, this does not imply
that every already-running service process was restarted. Tribes may then
schedule an asynchronous `tribes` service restart as part of higher-level
rollout convergence, while `tribes-local-control` self-update remains a
separate deferred concern. On success the snapshot reaches `phase:
"active"` with `status: "completed"`.
- `409` if no generation is prepared for that `plan_hash`. The snapshot's
error code is `generation_not_prepared`.
- `409 busy` if another job is in flight.
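
Taken together, `prepare`, status polling, and `commit` form a two-step rollout. The Python sketch below wires them up around an injected `request(method, path, body)` callable that returns `(http_status, decoded_json)`. All function names are hypothetical, the fixed 1 s polling interval is a simplification, and the sketch assumes that the `202` reply to `commit` carries a `job_id` in the same way the `prepare` reply does.

```python
import time

TERMINAL = frozenset({"completed", "failed", "aborted"})


class DeploymentError(Exception):
    def __init__(self, code, reason):
        super().__init__(f"{code}: {reason}")
        self.code = code
        self.reason = reason


def submit(request, path, body):
    """POST an asynchronous operation; return its job_id or raise."""
    status, reply = request("POST", path, body)
    if status != 202:
        code = reply.get("code") or reply.get("status") or str(status)
        raise DeploymentError(code, reply.get("reason", ""))
    return reply["job_id"]  # assumed to be present on every 202 reply


def wait_for(request, job_id, sleep=time.sleep, interval=1.0, max_polls=3600):
    """Poll the status endpoint until the given job reaches a terminal status."""
    for _ in range(max_polls):
        _, snap = request("GET", "/v1/deployment/status", None)
        if snap.get("job_id") == job_id and snap.get("status") in TERMINAL:
            if snap["status"] != "completed":
                raise DeploymentError(snap.get("code") or snap["status"],
                                      snap.get("reason", ""))
            return snap
        sleep(interval)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


def prepare_and_commit(request, plan, sleep=time.sleep):
    """Prepare a plan, wait for it, then commit it and wait again."""
    wait_for(request, submit(request, "/v1/deployment/prepare", plan), sleep)
    commit_body = {"plan_hash": plan["plan_hash"]}
    return wait_for(request, submit(request, "/v1/deployment/commit", commit_body), sleep)
```

Injecting `request` and `sleep` keeps the flow independent of the transport and makes it straightforward to exercise against a fake broker.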
### `POST /v1/deployment/rollback`
Asynchronous. Body:
```json
{
  "store_path": "/gnu/store/...-system",
  "plan": { ...optional fallback plan... }
}
```
The broker walks these cases in order:
1. The requested `store_path` is the selected system → just record the
activation, no build, no switch.
2. The broker has a recorded local-control generation number for that
`store_path` → switch to it directly.
3. The `store_path` appears in Guix's system profile links
(`/var/guix/profiles/system-*-link`), even if local-control did not record
it → switch to that profile generation directly. This covers the installed
baseline generation used by emergency/public rollback.
4. The store path is gone but `plan` is supplied → re-prepare and commit.
If none apply the snapshot reports `code: "rollback_infeasible"`.
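
The ordered cases above can be sketched as a single decision function. This Python illustration is not the broker's implementation: `choose_rollback` is a hypothetical name, and the two dictionaries (store path to generation number) stand in for however the broker actually tracks its own records and Guix's profile links.

```python
def choose_rollback(store_path, selected_system, recorded, profile_links, plan=None):
    """Pick the rollback strategy for `store_path`, checking cases in order.

    recorded:      {store_path: generation_number} known to local-control.
    profile_links: {store_path: generation_number} found in Guix's system
                   profile links, including ones local-control did not record.
    Returns (action, argument).
    """
    if store_path == selected_system:
        return ("record_only", None)              # case 1: already selected
    if store_path in recorded:
        return ("switch_generation", recorded[store_path])       # case 2
    if store_path in profile_links:
        return ("switch_generation", profile_links[store_path])  # case 3
    if plan is not None:
        return ("prepare_and_commit", plan)       # case 4: rebuild from plan
    return ("rollback_infeasible", None)
```
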
Current limitation: rollback does not run core/plugin down migrations. The
public Tribes admin rollback flow currently omits the fallback `plan` on
purpose so explicit rollback to a baseline generation cannot replay the rollout
being rolled back.
### `POST /v1/deployment/abort`
Synchronous. Marks the in-flight job as aborted and writes a snapshot with
`status: "aborted"`. Limitation in v1: the broker does not yet send SIGTERM to
a running helper subprocess, so the operation completes only when the helper
next checks back in.
## Error taxonomy
Every failed operation returns a `code` matching one of these tokens:
- `channel_untrusted` — channel references a signer not in the
`TrustedSigner` table.
- `signature_invalid` — a channel's commit signature failed verification.
- `channel_commit_unreachable` — the configured commit cannot be fetched
from the channel URL.
- `missing_capability` — a plugin requires a capability that no other
plugin provides.
- `capability_cycle` — the plugin capability graph contains a cycle.
- `duplicate_plugin` — the system target lists the same plugin twice.
- `manifest_invalid` — a requested plugin name is unknown to the channel
registry.
- `host_api_mismatch` — the resolved plan needs a host API version the
node cannot honour.
- `migration_target_conflict` — two plugins disagree about a migration
target version.
- `build_failed` — `guix system build` returned non-zero.
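- `switch_failed` — `guix system switch-generation` returned non-zero.
- `generation_not_prepared` — `commit` was requested for a `plan_hash` that
has no prepared generation.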
- `rollback_infeasible` — the broker cannot reach the requested store
path by either retained generation or rebuild.
- `helper_crashed` — `tribes-guix-helper` exited without emitting a
structured terminal frame.
- `busy` — another job is in flight; the request was rejected.
- `invalid_request` — the payload is missing a required field or violates a limit.
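
A client can mirror these tokens as an enumeration so that typos surface early. The Python sketch below is one way to do that; `ErrorCode` and `parse_code` are hypothetical names. It also includes `generation_not_prepared`, which this document describes under `commit`, and it passes unrecognised codes through as plain strings on the assumption that a newer broker may add tokens.

```python
from enum import Enum


class ErrorCode(str, Enum):
    CHANNEL_UNTRUSTED = "channel_untrusted"
    SIGNATURE_INVALID = "signature_invalid"
    CHANNEL_COMMIT_UNREACHABLE = "channel_commit_unreachable"
    MISSING_CAPABILITY = "missing_capability"
    CAPABILITY_CYCLE = "capability_cycle"
    DUPLICATE_PLUGIN = "duplicate_plugin"
    MANIFEST_INVALID = "manifest_invalid"
    HOST_API_MISMATCH = "host_api_mismatch"
    MIGRATION_TARGET_CONFLICT = "migration_target_conflict"
    BUILD_FAILED = "build_failed"
    SWITCH_FAILED = "switch_failed"
    ROLLBACK_INFEASIBLE = "rollback_infeasible"
    HELPER_CRASHED = "helper_crashed"
    BUSY = "busy"
    INVALID_REQUEST = "invalid_request"
    # Documented under POST /v1/deployment/commit.
    GENERATION_NOT_PREPARED = "generation_not_prepared"


def parse_code(snapshot):
    """Return an ErrorCode, the raw string for unknown tokens, or None."""
    raw = snapshot.get("code")
    if raw is None:
        return None
    try:
        return ErrorCode(raw)
    except ValueError:
        return raw  # unknown token: keep it rather than fail
```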
## Helper protocol (internal)
The broker spawns `tribes-guix-helper` for every long operation and parses
its stdout as NDJSON. Each line is one frame of one of the following kinds:
```json
{"event":"phase","phase":"pulling","ts":"..."}
{"event":"phase","phase":"building","ts":"...","derivation":"/gnu/store/..."}
{"event":"done","store_path":"/gnu/store/...","generation_number":42,"ts":"..."}
{"event":"error","code":"channel_commit_unreachable","message":"...","details":{...},"ts":"..."}
```
The broker uses the last `event: "phase"` frame to update its snapshot in
real time, and the final `done` or `error` frame to compute the operation
result. If the helper exits without a terminal frame the broker synthesizes
`{ "code": "helper_crashed", "details": { "exit_status": N, "signal": S } }`.
This protocol is not part of the public API; it exists so the broker can
stay small while still surfacing typed errors instead of regex-parsing
`guix` stderr.
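
The frame handling described in this section can be sketched as a fold over the helper's output lines. This Python illustration is not the broker's Guile code: `fold_frames` is a hypothetical name, the shape of the returned dictionary is invented for the example, and mapping the frame's `message` to a `reason` field is an assumption.

```python
import json


def fold_frames(lines, exit_status=None, signal=None):
    """Reduce NDJSON helper output to one result dictionary.

    Tracks the most recent phase frame, stops at the first terminal frame
    (done or error), and synthesizes helper_crashed if none arrives.
    """
    phase = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        frame = json.loads(line)
        event = frame.get("event")
        if event == "phase":
            phase = frame.get("phase")
        elif event == "done":
            return {"ok": True, "phase": phase,
                    "store_path": frame.get("store_path"),
                    "generation_number": frame.get("generation_number")}
        elif event == "error":
            return {"ok": False, "phase": phase,
                    "code": frame.get("code"),
                    "reason": frame.get("message"),  # assumed mapping
                    "details": frame.get("details", {})}
    # The helper exited without emitting a terminal frame.
    return {"ok": False, "phase": phase, "code": "helper_crashed",
            "details": {"exit_status": exit_status, "signal": signal}}
```

In a real broker the phase updates would be published to the snapshot as they arrive rather than only returned at the end; the fold keeps the example short.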