tribes-supertest/docs/REFACTOR_CLUSTER.md
self f3ae495c0f docs: align supertest with syncing membership
Allow sanitized Legion state snapshots to represent syncing cluster members and document the expected syncing-to-active convergence model.
2026-05-23 14:16:28 +02:00

# Refactor Cluster Plan
This plan aligns `tribes-supertest` with the new cluster/control-plane contract in [TRIBES-NOSTRSYNC](../../tribes/docs/TRIBES-NOSTRSYNC.md).
## Current Mismatches
- The current scenarios prove that nodes provision, bootstrap, and expose expected services, but they do not verify the new orchestrator-managed `cluster_nodes` invariants on each node.
- The local Legion state model now includes `clusterMembership`, so snapshots can preserve transient `syncing` rows and archived `decommissioned` rows, but the runner still does not compare that state to node-reported control-plane views.
- The cross-provider lifecycle scenario destroys and replaces nodes, but it does not assert the spec's required outcomes:
  - every healthy node converges on the same `cluster_nodes` view, with new peers moving from `syncing` to `active` before orchestration proceeds
  - decommissioned nodes remain archived with the expected status and timestamps
  - replacement/reactivation updates activation timestamps and derived sync transport state
- There is no scenario coverage for the new trust-expansion replay rule or for mesh-derived sync server configuration.
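
The convergence outcome above can be expressed as a small pure check. The following is a minimal sketch, not the runner's actual code: the `ClusterNodeRow` and `ReportedView` shapes, the `node_id` field, and the function names are assumptions made for illustration. Only the three status values come from this plan.

```typescript
// Assumed row shape: the status values come from this plan; `node_id` is a
// hypothetical identifier field.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string;
  status: MemberStatus;
}

// One healthy node's reported view of cluster membership.
interface ReportedView {
  reporter: string;
  rows: ClusterNodeRow[];
}

// Order-independent string form of a view, used for equality comparison.
function fingerprint(rows: ClusterNodeRow[]): string {
  return rows
    .map((row) => `${row.node_id}=${row.status}`)
    .sort()
    .join(",");
}

// Converged means: at least one view, all views identical, no row still syncing.
// Archived `decommissioned` rows are allowed to remain in a converged view.
function hasConverged(views: ReportedView[]): boolean {
  if (views.length === 0) return false;
  const first = fingerprint(views[0].rows);
  const allAgree = views.every((view) => fingerprint(view.rows) === first);
  const noneSyncing = views[0].rows.every((row) => row.status !== "syncing");
  return allAgree && noneSyncing;
}
```

A scenario step could poll the healthy nodes and proceed only once `hasConverged` returns `true`, which matches the "`syncing` to `active` before orchestration proceeds" requirement.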
## Refactoring Steps
1. Add management/API probes for the new control-plane surface.
Files: `src/runner.ts`, supporting helpers.
Capture from every node:
- `cluster_nodes.list()`
- cluster/sync status
- derived sync server list or equivalent runtime health payload
2. Extend artifact capture with cluster-control snapshots.
Files: `src/runner.ts`, snapshot writers, README.
For each snapshot, persist:
- per-node `cluster_nodes` views
- per-node sync server/runtime summaries
- any membership diff detected across nodes
3. Update scenario assertions to reflect the new control-plane contract.
Files: `src/runner.ts`.
Stop treating “node exists and is clustered” as the only signal, and add assertions that:
- all active nodes agree on membership
- each membership row has the expected `transport_address`, `scope`, and `status` (`syncing`, `active`, or `decommissioned`)
- removed nodes are present as `decommissioned` if that is the intended lifecycle
4. Strengthen the cross-provider lifecycle scenario around membership transitions.
Files: `src/runner.ts`, `src/scenarios.ts`.
The scenario should explicitly check:
- initial init node membership
- membership convergence after each join
- membership convergence after removal/decommission
- membership convergence after replacement join
- derived sync connectivity after each transition
5. Add checks for trust-expansion and replay behavior.
Files: new scenario steps or a dedicated scenario.
At minimum, validate that a newly joined or reactivated node eventually catches up to the expected singleton tribe/membership state after the orchestrator updates control-plane trust.
6. Verify transport-address and sync endpoint assumptions from the outside.
Files: `src/runner.ts`.
For each tracked node, assert that the sync endpoint exposed by Legion matches the `cluster_nodes.transport_address` values reported by peers and that the endpoint is reachable using the expected scheme/port.
7. Consider a dedicated failure-oriented scenario.
Files: `src/scenarios.ts`, `src/runner.ts`.
A useful follow-up scenario would intentionally exercise a partial membership change or publish outage and confirm the system either repairs itself or refuses the unsafe operation, depending on the layer being tested.
8. Extend the docs and snapshots now that local state includes membership archives.
Files: `README.md`, tests.
Keep `clusterMembership` visible in sanitized snapshots so scenario failures can be inspected against the syncing/active/decommissioned membership shape.
9. Expand tests for the runner itself.
Files: `tests/*.test.ts`.
Add unit coverage for:
- control-plane snapshot parsing
- membership-diff reporting
- new scenario assertions around syncing vs active vs decommissioned nodes
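
The membership-diff reporting mentioned in the steps above could look roughly like the sketch below. It compares each node-reported view against an expected membership list, for example one derived from the local `clusterMembership` state. The type names, the `node_id` field, the `MembershipDiff` shape, and `diffMembership` itself are all hypothetical. `transport_address`, `scope`, and `status` are the fields named in this plan.

```typescript
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string; // hypothetical identifier field
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

interface ReportedView {
  reporter: string; // the node that returned this view
  rows: ClusterNodeRow[];
}

// One discrepancy between the expected membership and a reported view.
interface MembershipDiff {
  reporter: string;
  node_id: string;
  kind: "missing" | "unexpected" | "transport_address" | "scope" | "status";
  expected?: string;
  actual?: string;
}

const COMPARED_FIELDS = ["transport_address", "scope", "status"] as const;

function diffMembership(
  expected: ClusterNodeRow[],
  views: ReportedView[],
): MembershipDiff[] {
  const diffs: MembershipDiff[] = [];
  for (const view of views) {
    // Rows the reporter should have, checked field by field.
    for (const want of expected) {
      const got = view.rows.find((row) => row.node_id === want.node_id);
      if (got === undefined) {
        diffs.push({ reporter: view.reporter, node_id: want.node_id, kind: "missing" });
        continue;
      }
      for (const field of COMPARED_FIELDS) {
        if (got[field] !== want[field]) {
          diffs.push({
            reporter: view.reporter,
            node_id: want.node_id,
            kind: field,
            expected: want[field],
            actual: got[field],
          });
        }
      }
    }
    // Rows the reporter has that the expected membership does not contain.
    for (const row of view.rows) {
      if (!expected.some((want) => want.node_id === row.node_id)) {
        diffs.push({ reporter: view.reporter, node_id: row.node_id, kind: "unexpected" });
      }
    }
  }
  return diffs;
}
```

An empty result means every reporter matches the expected membership. A non-empty result can be written into the snapshot artifacts so a failed scenario shows which node disagreed and on which field.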
## Suggested Delivery Order
1. Add management probes and snapshot capture for `cluster_nodes` and sync status.
2. Upgrade the existing cluster lifecycle scenario to assert membership convergence.
3. Add a follow-up scenario for trust-expansion replay or decommission/reactivation edge cases.
## Notes For Coordination
- `tribes-supertest` should stay black-box: prefer asserting on Legion-managed behavior and node-reported admin/status surfaces, not internal database queries.
- Legion now uses the `cluster_nodes` control plane and decommission archives, so the remaining work here is mostly assertion/probe coverage rather than adapting to a moving API surface.
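
Because a black-box runner only sees what nodes report, it helps to validate those payloads before asserting on them. The sketch below shows one way to parse an untyped `cluster_nodes` response into typed rows. The payload layout (a JSON array of objects), the `node_id` field, and the name `parseClusterNodes` are assumptions, not the documented admin API.

```typescript
const STATUSES = ["syncing", "active", "decommissioned"] as const;
type MemberStatus = (typeof STATUSES)[number];

interface ClusterNodeRow {
  node_id: string; // hypothetical identifier field
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

// Type guard: narrows an unknown value to one of the three known statuses.
function isStatus(value: unknown): value is MemberStatus {
  return typeof value === "string" && (STATUSES as readonly string[]).indexOf(value) !== -1;
}

// Parses a node-reported payload, throwing on anything that is not a
// well-formed list of membership rows.
function parseClusterNodes(payload: unknown): ClusterNodeRow[] {
  if (!Array.isArray(payload)) {
    throw new Error("cluster_nodes payload is not an array");
  }
  return payload.map((raw: unknown, index: number): ClusterNodeRow => {
    if (typeof raw !== "object" || raw === null) {
      throw new Error(`row ${index} is not an object`);
    }
    const { node_id, transport_address, scope, status } = raw as Record<string, unknown>;
    if (
      typeof node_id !== "string" ||
      typeof transport_address !== "string" ||
      typeof scope !== "string"
    ) {
      throw new Error(`row ${index} has a missing or non-string field`);
    }
    if (!isStatus(status)) {
      throw new Error(`row ${index} has unknown status: ${String(status)}`);
    }
    return { node_id, transport_address, scope, status };
  });
}
```

Failing loudly on an unknown status keeps a changed or malformed admin response from being mistaken for a membership mismatch.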