Allow sanitized Legion state snapshots to represent syncing cluster members and document the expected syncing-to-active convergence model.
Refactor Cluster Plan
This plan aligns tribes-supertest with the new cluster/control-plane contract in TRIBES-NOSTRSYNC.
Current Mismatches
- The current scenarios prove that nodes provision, bootstrap, and expose expected services, but they do not verify the new orchestrator-managed `cluster_nodes` invariants on each node.
- The local Legion state model now includes `clusterMembership`, so snapshots can preserve transient `syncing` rows and archived `decommissioned` rows, but the runner still does not compare that state to node-reported control-plane views.
- The cross-provider lifecycle scenario destroys and replaces nodes, but it does not assert the spec’s required outcomes:
  - every healthy node converges on the same `cluster_nodes` view, with new peers moving from `syncing` to `active` before orchestration proceeds
  - decommissioned nodes remain archived with the expected status and timestamps
  - replacement/reactivation updates activation timestamps and derived sync transport state
- There is no scenario coverage for the new trust-expansion replay rule or for mesh-derived sync server configuration.
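The convergence outcome in the list above can be written as a small predicate over the views that healthy nodes report. The sketch below is illustrative only: the `ClusterNodeRow` shape, its `nodeId` field, and the `isConverged` helper are assumptions made for this example, not the actual `cluster_nodes` schema or runner API. It treats the cluster as converged when every reporting node returns the same rows and no row is still `syncing`; `decommissioned` rows are allowed and must match like any other row.

```typescript
// Hypothetical row shape. Only transport_address, scope, and status are named
// in this plan; nodeId is an assumed key used for illustration.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  nodeId: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

// Reporting node id -> the rows that node returned for cluster_nodes.
type ClusterViews = Record<string, ClusterNodeRow[]>;

// Order-independent string form of one view, so row order never causes a mismatch.
function canonicalView(rows: ClusterNodeRow[]): string {
  return rows
    .map((r) => JSON.stringify([r.nodeId, r.transport_address, r.scope, r.status]))
    .sort()
    .join("\n");
}

// True only when at least one node reported, all views are identical,
// and no member is still in the transient syncing state.
function isConverged(views: ClusterViews): boolean {
  const reporters = Object.keys(views);
  if (reporters.length === 0) return false;

  const canon = reporters.map((id) => canonicalView(views[id] ?? []));
  const allAgree = canon.every((c) => c === canon[0]);
  const noneSyncing = reporters.every((id) =>
    (views[id] ?? []).every((row) => row.status !== "syncing"),
  );
  return allAgree && noneSyncing;
}
```

A scenario step could evaluate this predicate after each join or replacement and refuse to continue until it holds, which mirrors the "before orchestration proceeds" requirement.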
Refactoring Steps
- Add management/API probes for the new control-plane surface. Files: `src/runner.ts`, supporting helpers. Capture from every node:
  - `cluster_nodes.list()`
  - cluster/sync status
  - derived sync server list or equivalent runtime health payload
- Extend artifact capture with cluster-control snapshots. Files: `src/runner.ts`, snapshot writers, README. For each snapshot, persist:
  - per-node `cluster_nodes` views
  - per-node sync server/runtime summaries
  - any membership diff detected across nodes
- Update scenario assertions to reflect the new control-plane contract. Files: `src/runner.ts`. Replace “node exists and is clustered” as the only signal with assertions that:
  - all active nodes agree on membership
  - each membership row has the expected `transport_address`, `scope`, and `status` (`syncing`, `active`, or `decommissioned`)
  - removed nodes are present as `decommissioned` if that is the intended lifecycle
- Strengthen the cross-provider lifecycle scenario around membership transitions. Files: `src/runner.ts`, `src/scenarios.ts`. The scenario should explicitly check:
  - initial init node membership
  - membership convergence after each join
  - membership convergence after removal/decommission
  - membership convergence after replacement join
  - derived sync connectivity after each transition
- Add checks for trust-expansion and replay behavior. Files: new scenario steps or a dedicated scenario. At minimum, validate that a newly joined or re-activated node eventually catches up to the expected singleton tribe/membership state after the orchestrator updates control-plane trust.
- Verify transport-address and sync endpoint assumptions from the outside. Files: `src/runner.ts`. For each tracked node, assert that the sync endpoint exposed by Legion matches the `cluster_nodes.transport_address` values reported by peers and that the endpoint is reachable using the expected scheme/port.
- Consider a dedicated failure-oriented scenario. Files: `src/scenarios.ts`, `src/runner.ts`. A useful follow-up scenario would intentionally exercise a partial membership change or publish outage and confirm the system either repairs itself or refuses the unsafe operation, depending on the layer being tested.
- Extend the docs and snapshots now that local state includes membership archives. Files: `README.md`, tests. Keep `clusterMembership` visible in sanitized snapshots so scenario failures can be inspected against the syncing/active/decommissioned membership shape.
- Expand tests for the runner itself. Files: `tests/*.test.ts`. Add unit coverage for:
  - control-plane snapshot parsing
  - membership-diff reporting
  - new scenario assertions around syncing vs active vs decommissioned nodes
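The membership-diff reporting mentioned in the steps above could look roughly like the following. This is a sketch under stated assumptions: the `ClusterNodeRow` shape (in particular `nodeId`), the `MembershipDiff` record, and the `diffMembership` function are hypothetical names for illustration and do not describe existing runner code. For each member, it records which reporting nodes disagree about that member's presence, `transport_address`, `scope`, or `status`.

```typescript
// Hypothetical shapes for illustration; nodeId is an assumed key.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  nodeId: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

interface MembershipDiff {
  memberId: string; // the member the reporters disagree about
  field: "presence" | "transport_address" | "scope" | "status";
  valuesByReporter: Record<string, string>; // reporter id -> value it reported
}

function diffMembership(views: Record<string, ClusterNodeRow[]>): MembershipDiff[] {
  const reporters = Object.keys(views).sort();
  const index: Record<string, Record<string, ClusterNodeRow>> = {};
  const members: Record<string, true> = {};

  // Index every reporter's rows by member id and collect all member ids seen.
  for (const reporter of reporters) {
    const byMember: Record<string, ClusterNodeRow> = {};
    for (const row of views[reporter] ?? []) {
      byMember[row.nodeId] = row;
      members[row.nodeId] = true;
    }
    index[reporter] = byMember;
  }

  const diffs: MembershipDiff[] = [];
  const fields = ["transport_address", "scope", "status"] as const;

  for (const memberId of Object.keys(members).sort()) {
    const present = reporters.filter((r) => index[r]?.[memberId] !== undefined);

    // Some reporters know this member and others do not.
    if (present.length !== reporters.length) {
      const valuesByReporter: Record<string, string> = {};
      for (const r of reporters) {
        valuesByReporter[r] = index[r]?.[memberId] ? "present" : "missing";
      }
      diffs.push({ memberId, field: "presence", valuesByReporter });
    }

    // Among reporters that have the row, compare each tracked field.
    for (const field of fields) {
      const valuesByReporter: Record<string, string> = {};
      for (const r of present) {
        const row = index[r]?.[memberId];
        if (row) valuesByReporter[r] = row[field];
      }
      const distinct = Object.keys(valuesByReporter)
        .map((r) => valuesByReporter[r])
        .filter((v, i, all) => all.indexOf(v) === i);
      if (distinct.length > 1) diffs.push({ memberId, field, valuesByReporter });
    }
  }
  return diffs;
}
```

An empty result means all reporters agree. Because the output is a plain JSON-serializable array with sorted iteration order, it could be written into a snapshot artifact as-is and compared across runs.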
Suggested Delivery Order
- Add management probes and snapshot capture for `cluster_nodes` and sync status.
- Upgrade the existing cluster lifecycle scenario to assert membership convergence.
- Add a follow-up scenario for trust-expansion replay or decommission/reactivation edge cases.
Notes For Coordination
- `tribes-supertest` should stay black-box: prefer asserting on Legion-managed behavior and node-reported admin/status surfaces, not internal database queries.
- Legion now uses the `cluster_nodes` control plane and decommission archives, so the remaining work here is mostly assertion/probe coverage rather than adapting to a moving API surface.
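In the same black-box spirit, the transport-address check from the refactoring steps can be done purely on values observed from outside. The sketch below assumes that `transport_address` is a URL-style string such as `wss://host:port`; this plan does not specify the format, so treat that as an assumption. The `checkSyncEndpoint` helper and its parameters are likewise hypothetical. It compares the endpoint Legion exposes against what each peer reports and validates scheme and port; it does not attempt a network connection, so reachability would still need a separate probe.

```typescript
interface EndpointCheck {
  ok: boolean;
  problems: string[];
}

// legionEndpoint: sync endpoint Legion exposes for one node (assumed URL-style).
// reportedByPeers: peer id -> transport_address that peer reports for the same node.
// expected: scheme and port the deployment is supposed to use.
function checkSyncEndpoint(
  legionEndpoint: string,
  reportedByPeers: Record<string, string>,
  expected: { protocol: string; port: string },
): EndpointCheck {
  const problems: string[] = [];

  // Every peer should report exactly the address Legion exposes.
  for (const peer of Object.keys(reportedByPeers).sort()) {
    const reported = reportedByPeers[peer];
    if (reported !== legionEndpoint) {
      problems.push(`${peer} reports ${String(reported)} but Legion exposes ${legionEndpoint}`);
    }
  }

  let parsed: URL;
  try {
    parsed = new URL(legionEndpoint);
  } catch {
    problems.push(`${legionEndpoint} is not a parseable URL`);
    return { ok: false, problems };
  }

  // URL.protocol includes the trailing colon (for example "wss:").
  if (parsed.protocol !== expected.protocol) {
    problems.push(`scheme is ${parsed.protocol}, expected ${expected.protocol}`);
  }
  // URL.port is "" when the URL uses the scheme's default port.
  if (parsed.port !== expected.port) {
    problems.push(`port is "${parsed.port}", expected "${expected.port}"`);
  }

  return { ok: problems.length === 0, problems };
}
```

Returning a list of problems rather than throwing on the first mismatch keeps the result usable as snapshot data, so a failing scenario can show every disagreement at once.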