Plugin Rollout/Rollback Cross-Repo Progress
Status: single-node supertest plugin rollout/rollback is green on signed guix-tribes master; next target is clustered plugin sync validation.
Last updated: 2026-05-03
Scope repos: tribes-supertest, tribes-plugin-supertest, legion_kk, tribes, guix-tribes
1) Current Baseline
Green live validation
single-node-plugin-rollout-rollback has passed on fresh infra through the public admin API using the real pinned supertest plugin:
- run: 2026-05-01t133844554z-single-node-plugin-rollout-rollback
- node: st-20260501-820549-hx-a
- consumed guix-tribes commit: 5a348e7c5427b99c84755aa12c30c37c2de7a4ca
- baseline selected/running: /gnu/store/39hsfwaf39h5h91gwd80ildrsjsdr85b-system
- plugin selected/running: /gnu/store/w51f5b2a5ynrpfky6qyawbnjhmgx0mxs-system
- rollback selected/running: /gnu/store/39hsfwaf39h5h91gwd80ildrsjsdr85b-system
- validation: plugin API health/schema/write/read passed
- rollback validation: supertest_events and supertest_cases were removed by down migrations
- cleanup: Hetzner server destroyed; final node list []
Current guix-tribes master baseline
guix-tribes master has moved beyond the first green run and now includes the follow-up fixes and plugin substitute baseline:
- 39b1ed8 fix: skip no-op pulls and stabilize generation diagnostics
- 8849107 fix: resolve herd for rollback migrations
- e13c136 test: harden local-control worker state
- 2950278 chore: Bump tribes
- 05c493b test: avoid running guile suites on import
- 0b4d3a7 fix: compile bundled tribes_ui plugin
- fa4753a feat: build channel plugins in substitute baseline
fa4753a is the current intended guix-tribes master baseline for the next test runs. The substitute manifest now covers the channel plugin registry packages, including tribes-plugin-supertest, tribes-plugin-aether, and tribes-plugin-sender plus ffmpeg for sender.
Current plugin fixture
tribes-plugin-supertest commit e042f3265db7a40d4d558132800238c6d466e8dd provides:
- Supertest.Case and Supertest.Event
- reversible migrations for supertest_cases and supertest_events
- AshNostrSync registration for both resources
- a local-only JSON API under /plugins-api/supertest
The API is enough for black-box clustered checks because the runner can call it over SSH on each node's 127.0.0.1:4000.
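As an illustration of that black-box access path, the sketch below builds an ssh command line that curls the node-local API. It is only a sketch under stated assumptions: the function name plugin_state_command is hypothetical, the use of curl on the node is an assumption, and the /state?run_id=... path is taken from the polling step described in section 4.

```python
import shlex
from typing import List
from urllib.parse import urlencode

# Assumption: the plugin API listens on the node-local address named above.
LOCAL_API = "http://127.0.0.1:4000/plugins-api/supertest"


def plugin_state_command(ssh_target: str, run_id: str) -> List[str]:
    """Return an argv that fetches plugin state on the node via ssh + curl."""
    query = urlencode({"run_id": run_id})
    url = f"{LOCAL_API}/state?{query}"
    # ssh hands the remote command to a shell, so quote it as one string.
    # curl -f fails on HTTP errors; -sS is quiet but still reports errors.
    remote = shlex.join(["curl", "-fsS", url])
    return ["ssh", ssh_target, remote]
```

The returned list can be passed to subprocess.run without shell=True, so only the remote side needs shell quoting.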
2) Locked Decisions
- Scenario assertions use the normal operator/automation surface: POST /api/admin/management with NIP-98 auth.
- tribes-deploy-exec remains diagnostic-only for these scenarios.
- supertest is the lean migration-bearing rollout fixture; aether and sender remain real packaged plugins for broader package/substitute coverage.
- Selected generation is /var/guix/profiles/system; running generation is /run/current-system.
- Guix-sensitive work must run through the pulled/current Guix profile/module universe.
- Direct explicit generation rollback must stay async/restart-tolerant and must not replay the current rollout plan.
- tribes-local-control service-definition updates should be treated as restart/reboot-required work, not restarted in-band while local-control is serving the switch/rollback request.
- Plugin disable is lower priority than clustered install/sync/rollback validation.
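The selected/running distinction can be checked mechanically by resolving both links and comparing the store paths. The following is a minimal sketch, not the runner's actual implementation; GenerationState and read_generation_state are hypothetical names, and it assumes both paths are symlinks that ultimately resolve to a store item.

```python
import os
from dataclasses import dataclass

SELECTED_LINK = "/var/guix/profiles/system"  # selected generation
RUNNING_LINK = "/run/current-system"         # running generation


@dataclass(frozen=True)
class GenerationState:
    selected: str
    running: str

    @property
    def converged(self) -> bool:
        # Converged means the booted system is the selected generation.
        return self.selected == self.running


def read_generation_state(selected_link: str = SELECTED_LINK,
                          running_link: str = RUNNING_LINK) -> GenerationState:
    """Resolve both links to their final targets and pair them up."""
    return GenerationState(
        selected=os.path.realpath(selected_link),
        running=os.path.realpath(running_link),
    )
```

Taking the link paths as parameters keeps the check testable against a temporary directory instead of a live node.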
3) What Works Now
tribes-supertest
- Public-admin-API rollout flow for single-node-plugin-rollout-rollback.
- Rollout assertions for preview/start/status, selected/running convergence, plugin API health/schema/write/read, rollback convergence, table removal, and provider cleanup.
- Strong diagnostics around local-control readiness, socket/listener readiness, migrations, Shepherd logs, certificates, system generation comparison, Guix build logs, and daemon logs.
guix-tribes
- External plugin pinning and registry-backed plugin packaging.
- Target-node plugin builds with the lean Mix/OTP baseline.
- Runtime plugin loading in release mode.
- tribes-migrations wrapper logging with real output/exit status.
- Current-Guix delegation for service upgrade, generation comparison, config evaluation, and derivation realization.
- Rollback to installed baseline profile generations.
- Plugin rollback down migrations for direct rollback from a plugin generation to a no-plugin baseline.
- Target-generation plugin state restore before rollback.
- No-op guix pull skip when rollout plans explicitly resolve no channel delta.
- Generation diagnostic JSON stability.
- Channel plugin substitute manifest coverage.
tribes
- Public admin-management rollout methods are the scenario contract.
- Startup readiness returns 503 while Tribes.Repo is unavailable.
- Explicit rollback generation is async and does not replay the current rollout plan.
- AshNostrSync publishes and consumes host and plugin resources through registered namespaces.
legion_kk
- Generic admin management calls are available through Legion-owned CLI/service/API wrappers.
- Provider cleanup is materially more stable across Hetzner, Scaleway, and OVH.
- Installed-system Guix checkout-cache transplant is in place.
- Admin API retry behavior handles explicit 503 readiness responses.
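The interaction between the 503 readiness response and client-side retry can be illustrated with a small sketch. This is not the legion_kk implementation; call_with_readiness_retry, its parameters, and the (status, body) return shape are assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # assumed shape: (HTTP status, response body)


def call_with_readiness_retry(send: Callable[[], Response],
                              attempts: int = 5,
                              delay_s: float = 2.0,
                              sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() and retry only while the server answers 503 (not ready)."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        status, body = send()
        if status != 503:
            # Any other status, success or failure, goes straight to the caller.
            return status, body
        if attempt < attempts:
            sleep(delay_s)
    # Still not ready after the final attempt: return the last 503 response.
    return status, body
```

Retrying only on 503 keeps real failures such as 4xx responses visible immediately instead of hiding them behind a retry loop.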
4) Remaining Gaps Before Clustered Plugin Sync Test
The next clustered test should not need plugin disable/uninstall semantics. The important pre-flight items are:
- Use the current published guix-tribes master baseline. The cluster nodes should consume fa4753a or a signed equivalent that includes the plugin substitute manifest and the tribes_ui Guix build fix.
- Confirm substitute availability before provisioning. The builder should already have substitutes for tribes, tribes-plugin-supertest, tribes-plugin-aether, tribes-plugin-sender, and ffmpeg. Missing substitutes should abort cleanly or fail early; they should not leave nodes half-switched after a long silent build.
- Handle the guix-fork channel-auth rewrite on the substitute builder path. The pinned guix-fork commit 906f6b2d3a4f9f80c5ad6f9e5f6369706a1a301d is not a descendant of the old introduction commit 6f9c3cd1761f0a3f8b70223cb0e0f47e29582d90. For the builder sync job that consumes explicit pins/base-channels.sexp, either allow downgrades there or refresh the channel introduction to the new signed lineage. Keep that scoped to the builder/pinned-channel path.
- Run the new clustered plugin scenario. cluster-plugin-rollout-sync-split-brain now builds one Hetzner init node, plans Scaleway and OVH join nodes, materializes the two joins in one Legion graph run, rolls out supertest, partitions one node from the sync mesh, heals it, forces full resync, validates plugin table convergence, and rolls back all nodes.
- Assert control-plane convergence before plugin writes. Before testing plugin data sync, capture and compare per-node:
  - cluster_nodes.list
  - cluster_status
  - sync server/runtime state
  - active node transport addresses
- Exercise bidirectional plugin table sync. Minimal useful check:
  - roll out supertest to all active nodes
  - wait for selected/running convergence and plugin schema readiness on every node
  - create a Supertest.Case on node A
  - poll /plugins-api/supertest/state?run_id=... on node B until it appears
  - create a linked Supertest.Event on node B
  - poll node A until the event appears
  - update/increment the case on one node and verify the other node sees the latest value
- Run rollback after clustered sync. Roll back to the no-plugin baseline and verify on every active node:
  - selected/running convergence back to baseline
  - plugin API is gone
  - supertest_cases and supertest_events are absent
  - cluster status returns healthy after services restart
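The "poll until it appears" steps in the bidirectional sync check could be driven by a generic helper along these lines. This is a hedged sketch: poll_until and has_case are hypothetical names, and the assumed response shape (a "cases" list of objects with an "id" field) is not confirmed by the plugin API.

```python
import time
from typing import Any, Callable


def poll_until(fetch: Callable[[], Any],
               predicate: Callable[[Any], bool],
               timeout_s: float = 120.0,
               interval_s: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fetch() until predicate(value) is true or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        value = fetch()
        if predicate(value):
            return value
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)


def has_case(state: dict, case_id: str) -> bool:
    # Assumed response shape: {"cases": [{"id": ...}, ...]}.
    return any(c.get("id") == case_id for c in state.get("cases", []))
```

A runner would pass a fetch callable that reads node B's state for the current run_id, then repeat the same call against node A for the linked event. Injecting clock and sleep lets the timeout path be tested without waiting.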
5) Open Implementation Work
Migration lifecycle
- Run target-generation up migrations before commit, instead of relying only on the post-switch tribes-migrations Shepherd startup path.
- Broaden rollback/down migration orchestration for core schema rollback, plugin version rollback, and plugin uninstall beyond the direct no-plugin baseline path.
- Add automated ordering tests for core up, plugin up dependency order, plugin down reverse dependency order, and migration failure handling.
- Model disable vs uninstall semantics when it becomes a product priority:
  - disable: remove/disable runtime, keep data, no down migrations
  - uninstall: remove runtime and run down migrations, with destructive data loss surfaced before confirmation
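The ordering property that the proposed tests would pin down (up migrations in dependency order, down migrations in the reverse) can be expressed with a topological sort. The sketch below uses Python's standard graphlib for illustration; up_order, down_order, and the example dependency map are hypothetical and do not describe how tribes-migrations actually orders work.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def up_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Order migration units so each one follows everything it depends on.

    deps maps a unit (core or a plugin) to the units it depends on.
    Raises graphlib.CycleError when the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


def down_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Reverse of up_order: dependents are rolled back before dependencies."""
    return list(reversed(up_order(deps)))


# Hypothetical dependency map, for illustration only.
EXAMPLE_DEPS = {
    "core": set(),
    "supertest": {"core"},
    "sender": {"core"},
}
```

An ordering test would then assert relative positions (for example, that core precedes every plugin on the way up) rather than one exact sequence, since independent plugins may be ordered either way.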
Local-control / rollout hardening
- Add dedicated local-control integration tests for socket bind/readiness, status endpoint behavior, responsiveness during long jobs, and abort behavior under load.
- Treat tribes-local-control service-definition updates as restart/reboot-required work in planning and status output.
- Add or extend tribes tests around admin-management rollout response shapes and machine-readable failures.
- Keep a fresh-node regression check for generation comparison after the current guix-tribes baseline is published.
Cluster scenario coverage
- Add management probes and snapshot capture for cluster_nodes.list, cluster_status, and sync runtime/server state.
- Upgrade cluster-lifecycle to assert membership convergence, not only node existence.
- Add clustered supertest plugin rollout/sync/rollback scenario wiring.
- Run cluster-plugin-rollout-sync-split-brain on fresh infra and fix discovered cluster rollout/sync issues.
- Add trust-expansion replay coverage for newly joined or reactivated nodes.
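One way to turn per-node snapshots into a membership-convergence assertion is to compare each node's view against the set of probed nodes. This is a sketch under assumptions: membership_disagreements is a hypothetical name, the input is assumed to be already reduced to node ids (the real cluster_nodes.list response shape is not specified here), and the expected membership is assumed to equal the set of nodes that were probed.

```python
from typing import Dict, Iterable, List


def membership_disagreements(
        snapshots: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Report, per observer node, members it is missing or should not see.

    snapshots maps an observer node id to the node ids it lists as members.
    An empty result means every probed node sees exactly the probed set.
    """
    views = {node: set(members) for node, members in snapshots.items()}
    expected = set(views)  # assumption: probed nodes == intended membership
    report: Dict[str, Dict[str, List[str]]] = {}
    for node, seen in views.items():
        missing = sorted(expected - seen)
        unexpected = sorted(seen - expected)
        if missing or unexpected:
            report[node] = {"missing": missing, "unexpected": unexpected}
    return report
```

Returning a structured report rather than a boolean gives a failed run an artifact that names the diverging node, which fits the goal of diagnosing clustered failures after the nodes are gone.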
Future metadata
- Define plugin ABI/dependency compatibility metadata:
  - host framework deps are part of the versioned Tribes plugin API/ABI
  - requires/provides remain for plugin-to-plugin or optional service capabilities
  - Guix plugin packaging validates plugin lock compatibility against the host lock/API metadata
  - compiled plugin artifacts carry compatibility metadata
  - runtime checks remain defense-in-depth for manually installed artifacts
- Optional typed Legion rollout convenience wrappers.
- Future reboot-planning metadata such as rebootRequired and rebootReasons.
6) Working Notes
- Run per-repo commands from devenv shell where applicable.
- Keep supertest scenarios black-box and contract-oriented.
- Prefer published guix-tribes commits and fresh provisioned nodes over local source-only validation.
- Capture enough per-node artifacts that a failed clustered run can be diagnosed without SSH state still being live.
- The next high-value scenario is clustered supertest plugin rollout with bidirectional synced table R/W, followed by rollback on every active node.