Files
tribes-supertest/docs/ROLLOUT_CROSS_REPO_PLAN_PROGRESS.md

11 KiB

Plugin Rollout/Rollback Cross-Repo Progress

Status: single-node supertest plugin rollout/rollback is green on signed guix-tribes master; next target is clustered plugin sync validation. Last updated: 2026-05-03 Scope repos: tribes-supertest, tribes-plugin-supertest, legion_kk, tribes, guix-tribes


1) Current Baseline

Green live validation

single-node-plugin-rollout-rollback has passed on fresh infra through the public admin API using the real pinned supertest plugin:

  • run: 2026-05-01t133844554z-single-node-plugin-rollout-rollback
  • node: st-20260501-820549-hx-a
  • consumed guix-tribes commit: 5a348e7c5427b99c84755aa12c30c37c2de7a4ca
  • baseline selected/running: /gnu/store/39hsfwaf39h5h91gwd80ildrsjsdr85b-system
  • plugin selected/running: /gnu/store/w51f5b2a5ynrpfky6qyawbnjhmgx0mxs-system
  • rollback selected/running: /gnu/store/39hsfwaf39h5h91gwd80ildrsjsdr85b-system
  • validation: plugin API health/schema/write/read passed
  • rollback validation: supertest_events and supertest_cases were removed by down migrations
  • cleanup: Hetzner server destroyed; final node list []

Current guix-tribes master baseline

guix-tribes master has moved beyond the first green run and now includes the follow-up fixes and plugin substitute baseline:

  • 39b1ed8 fix: skip no-op pulls and stabilize generation diagnostics
  • 8849107 fix: resolve herd for rollback migrations
  • e13c136 test: harden local-control worker state
  • 2950278 chore: Bump tribes
  • 05c493b test: avoid running guile suites on import
  • 0b4d3a7 fix: compile bundled tribes_ui plugin
  • fa4753a feat: build channel plugins in substitute baseline

fa4753a is the current intended guix-tribes master baseline for the next test runs. The substitute manifest now covers the channel plugin registry packages, including tribes-plugin-supertest, tribes-plugin-aether, and tribes-plugin-sender plus ffmpeg for sender.

Current plugin fixture

tribes-plugin-supertest commit e042f3265db7a40d4d558132800238c6d466e8dd provides:

  • Supertest.Case and Supertest.Event
  • reversible migrations for supertest_cases and supertest_events
  • AshNostrSync registration for both resources
  • a local-only JSON API under /plugins-api/supertest

The API is enough for black-box clustered checks because the runner can call it over SSH on each node's 127.0.0.1:4000.


2) Locked Decisions

  • Scenario assertions use the normal operator/automation surface: POST /api/admin/management with NIP-98 auth.
  • tribes-deploy-exec remains diagnostic-only for these scenarios.
  • supertest is the lean migration-bearing rollout fixture; aether and sender remain real packaged plugins for broader package/substitute coverage.
  • Selected generation is /var/guix/profiles/system; running generation is /run/current-system.
  • Guix-sensitive work must run through the pulled/current Guix profile/module universe.
  • Direct explicit generation rollback must stay async/restart-tolerant and must not replay the current rollout plan.
  • tribes-local-control service-definition updates should be treated as restart/reboot-required work, not restarted in-band while local-control is serving the switch/rollback request.
  • Plugin disable is lower priority than clustered install/sync/rollback validation.

3) What Works Now

tribes-supertest

  • Public-admin-API rollout flow for single-node-plugin-rollout-rollback.
  • Rollout assertions for preview/start/status, selected/running convergence, plugin API health/schema/write/read, rollback convergence, table removal, and provider cleanup.
  • Strong diagnostics around local-control readiness, socket/listener readiness, migrations, Shepherd logs, certificates, system generation comparison, Guix build logs, and daemon logs.

guix-tribes

  • External plugin pinning and registry-backed plugin packaging.
  • Target-node plugin builds with the lean Mix/OTP baseline.
  • Runtime plugin loading in release mode.
  • tribes-migrations wrapper logging with real output/exit status.
  • Current-Guix delegation for service upgrade, generation comparison, config evaluation, and derivation realization.
  • Rollback to installed baseline profile generations.
  • Plugin rollback down migrations for direct rollback from a plugin generation to a no-plugin baseline.
  • Target-generation plugin state restore before rollback.
  • No-op guix pull skip when rollout plans explicitly resolve no channel delta.
  • Generation diagnostic JSON stability.
  • Channel plugin substitute manifest coverage.

tribes

  • Public admin-management rollout methods are the scenario contract.
  • Startup readiness returns 503 while Tribes.Repo is unavailable.
  • Explicit rollback generation is async and does not replay the current rollout plan.
  • AshNostrSync publishes and consumes host and plugin resources through registered namespaces.

legion_kk

  • Generic admin management calls are available through Legion-owned CLI/service/API wrappers.
  • Provider cleanup is materially more stable across Hetzner, Scaleway, and OVH.
  • Installed-system Guix checkout-cache transplant is in place.
  • Admin API retry behavior handles explicit 503 readiness responses.

4) Remaining Gaps Before Clustered Plugin Sync Test

The next clustered test should not need plugin disable/uninstall semantics. The important pre-flight items are:

  1. Use the current published guix-tribes master baseline. The cluster nodes should consume fa4753a or a signed equivalent that includes the plugin substitute manifest and the tribes_ui Guix build fix.

  2. Confirm substitute availability before provisioning. The builder should already have substitutes for tribes, tribes-plugin-supertest, tribes-plugin-aether, tribes-plugin-sender, and ffmpeg. Missing substitutes should abort cleanly or fail early; they should not leave nodes half-switched after a long silent build.

  3. Handle the guix-fork channel-auth rewrite on the substitute builder path. The pinned guix-fork commit 906f6b2d3a4f9f80c5ad6f9e5f6369706a1a301d is not a descendant of the old introduction commit 6f9c3cd1761f0a3f8b70223cb0e0f47e29582d90. For the builder sync job that consumes explicit pins/base-channels.sexp, either allow downgrades there or refresh the channel introduction to the new signed lineage. Keep that scoped to the builder/pinned-channel path.

  4. Run the new clustered plugin scenario. cluster-plugin-rollout-sync-split-brain now builds one Hetzner init node, plans Scaleway and OVH join nodes, materializes the two joins in one Legion graph run, rolls out supertest, partitions one node from the sync mesh, heals it, forces full resync, validates plugin table convergence, and rolls back all nodes.

  5. Assert control-plane convergence before plugin writes. Before testing plugin data sync, capture and compare per-node:

    • cluster_nodes.list
    • cluster_status
    • sync server/runtime state
    • active node transport addresses
  6. Exercise bidirectional plugin table sync. Minimal useful check:

    • roll out supertest to all active nodes
    • wait for selected/running convergence and plugin schema readiness on every node
    • create a Supertest.Case on node A
    • poll /plugins-api/supertest/state?run_id=... on node B until it appears
    • create a linked Supertest.Event on node B
    • poll node A until the event appears
    • update/increment the case on one node and verify the other node sees the latest value
  7. Run rollback after clustered sync. Roll back to the no-plugin baseline and verify on every active node:

    • selected/running convergence back to baseline
    • plugin API is gone
    • supertest_cases and supertest_events are absent
    • cluster status returns healthy after services restart

5) Open Implementation Work

Migration lifecycle

  • Run target-generation up migrations before commit, instead of relying only on the post-switch tribes-migrations Shepherd startup path.
  • Broaden rollback/down migration orchestration for core schema rollback, plugin version rollback, and plugin uninstall beyond the direct no-plugin baseline path.
  • Add automated ordering tests for core up, plugin up dependency order, plugin down reverse dependency order, and migration failure handling.
  • Model disable vs uninstall semantics when it becomes a product priority:
    • disable: remove/disable runtime, keep data, no down migrations
    • uninstall: remove runtime and run down migrations, with destructive data loss surfaced before confirmation

Local-control / rollout hardening

  • Add dedicated local-control integration tests for socket bind/readiness, status endpoint behavior, responsiveness during long jobs, and abort behavior under load.
  • Treat tribes-local-control service-definition updates as restart/reboot-required work in planning and status output.
  • Add or extend tribes tests around admin-management rollout response shapes and machine-readable failures.
  • Keep a fresh-node regression check for generation comparison after the current guix-tribes baseline is published.

Cluster scenario coverage

  • Add management probes and snapshot capture for cluster_nodes.list, cluster_status, and sync runtime/server state.
  • Upgrade cluster-lifecycle to assert membership convergence, not only node existence.
  • Add clustered supertest plugin rollout/sync/rollback scenario wiring.
  • Run cluster-plugin-rollout-sync-split-brain on fresh infra and fix discovered cluster rollout/sync issues.
  • Add trust-expansion replay coverage for newly joined or reactivated nodes.

Future metadata

  • Define plugin ABI/dependency compatibility metadata:
    • host framework deps are part of the versioned Tribes plugin API/ABI
    • requires / provides remain for plugin-to-plugin or optional service capabilities
    • Guix plugin packaging validates plugin lock compatibility against the host lock/API metadata
    • compiled plugin artifacts carry compatibility metadata
    • runtime checks remain defense-in-depth for manually installed artifacts
  • Optional typed Legion rollout convenience wrappers.
  • Future reboot-planning metadata such as rebootRequired and rebootReasons.

6) Working Notes

  • Run per-repo commands from devenv shell where applicable.
  • Keep supertest scenarios black-box and contract-oriented.
  • Prefer published guix-tribes commits and fresh provisioned nodes over local source-only validation.
  • Capture enough per-node artifacts that a failed clustered run can be diagnosed without SSH state still being live.
  • The next high-value scenario is clustered supertest plugin rollout with bidirectional synced table R/W, followed by rollback on every active node.