# Refactor Cluster Plan

This plan aligns `tribes-supertest` with the new cluster/control-plane contract in [TRIBES-NOSTRSYNC](../../tribes/docs/TRIBES-NOSTRSYNC.md).

## Current Mismatches

- The current scenarios prove that nodes provision, bootstrap, and expose expected services, but they do not verify the new orchestrator-managed `cluster_nodes` invariants on each node.
- The local Legion state model now includes `clusterMembership`, so snapshots can preserve transient `syncing` rows and archived `decommissioned` rows, but the runner still does not compare that state to node-reported control-plane views.
- The cross-provider lifecycle scenario destroys and replaces nodes, but it does not assert the spec’s required outcomes:
  - every healthy node converges on the same `cluster_nodes` view, with new peers moving from `syncing` to `active` before orchestration proceeds
  - decommissioned nodes remain archived with the expected status and timestamps
  - replacement/reactivation updates activation timestamps and derived sync transport state
- There is no scenario coverage for the new trust-expansion replay rule or for mesh-derived sync server configuration.
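
The convergence outcome listed above can be expressed as a small predicate over node-reported views. The sketch below is illustrative only: the row shape, the `node_id` field, and the helper names `canonicalize` and `hasConverged` are assumptions, not part of the Legion or `tribes-supertest` API. Only `transport_address`, `scope`, and `status` come from this plan.

```typescript
// Hypothetical row shape. `node_id` is an assumed identifier field;
// the real `cluster_nodes` schema may differ.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

// Each key is a reporting node; each value is that node's `cluster_nodes` view.
type ViewsByNode = Record<string, ClusterNodeRow[]>;

// Order-independent serialization so two views can be compared as strings.
function canonicalize(rows: ClusterNodeRow[]): string {
  return rows
    .map((r) => JSON.stringify([r.node_id, r.transport_address, r.scope, r.status]))
    .sort()
    .join("\n");
}

// True when every reporting node returns the same view and no row is still `syncing`.
function hasConverged(views: ViewsByNode): boolean {
  const reporters = Object.keys(views);
  if (reporters.length === 0) return false; // nothing reported, nothing proven
  const canon = reporters.map((id) => canonicalize(views[id] ?? []));
  const sameView = canon.every((c) => c === canon[0]);
  const noneSyncing = reporters.every((id) =>
    (views[id] ?? []).every((row) => row.status !== "syncing"),
  );
  return sameView && noneSyncing;
}
```

Decommissioned rows are compared like any other row here, so an archive entry that is missing on one node also counts as non-convergence.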

## Refactoring Steps

1. Add management/API probes for the new control-plane surface.
   Files: `src/runner.ts`, supporting helpers.
   Capture from every node:
   - `cluster_nodes.list()`
   - cluster/sync status
   - derived sync server list or equivalent runtime health payload

2. Extend artifact capture with cluster-control snapshots.
   Files: `src/runner.ts`, snapshot writers, README.
   For each snapshot, persist:
   - per-node `cluster_nodes` views
   - per-node sync server/runtime summaries
   - any membership diff detected across nodes

3. Update scenario assertions to reflect the new control-plane contract.
   Files: `src/runner.ts`.
   Stop treating “node exists and is clustered” as the only signal, and add assertions that:
   - all active nodes agree on membership
   - each membership row has the expected `transport_address`, `scope`, and `status` (`syncing`, `active`, or `decommissioned`)
   - removed nodes are present as `decommissioned` if that is the intended lifecycle

4. Strengthen the cross-provider lifecycle scenario around membership transitions.
   Files: `src/runner.ts`, `src/scenarios.ts`.
   The scenario should explicitly check:
   - initial membership on the init node
   - membership convergence after each join
   - membership convergence after removal/decommission
   - membership convergence after replacement join
   - derived sync connectivity after each transition

5. Add checks for trust-expansion and replay behavior.
   Files: new scenario steps or a dedicated scenario.
   At minimum, validate that a newly joined or re-activated node eventually catches up to the expected singleton tribe/membership state after the orchestrator updates control-plane trust.

6. Verify transport-address and sync endpoint assumptions from the outside.
   Files: `src/runner.ts`.
   For each tracked node, assert that the sync endpoint exposed by Legion matches the `cluster_nodes.transport_address` values reported by peers and that the endpoint is reachable using the expected scheme/port.

7. Consider a dedicated failure-oriented scenario.
   Files: `src/scenarios.ts`, `src/runner.ts`.
   A useful follow-up scenario would intentionally exercise a partial membership change or publish outage and confirm the system either repairs itself or refuses the unsafe operation, depending on the layer being tested.

8. Extend the docs and snapshots now that local state includes membership archives.
   Files: `README.md`, tests.
   Keep `clusterMembership` visible in sanitized snapshots so scenario failures can be inspected against the syncing/active/decommissioned membership shape.

9. Expand tests for the runner itself.
   Files: `tests/*.test.ts`.
   Add unit coverage for:
   - control-plane snapshot parsing
   - membership-diff reporting
   - new scenario assertions around syncing vs active vs decommissioned nodes
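
For the membership-diff reporting mentioned in steps 2 and 9, one possible shape is a pure function that compares each node's reported view with an expected view. This is a sketch under assumptions: `MembershipDiff`, `diffMembership`, `indexRows`, and the `node_id` field are hypothetical names, and the real runner may key rows differently.

```typescript
// Hypothetical row shape. `node_id` is an assumed identifier field.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

// One entry per reporting node whose view disagrees with the expected view.
interface MembershipDiff {
  reporter: string;
  missing: string[]; // expected node ids the reporter does not list
  extra: string[];   // node ids the reporter lists that were not expected
  changed: string[]; // node ids whose address, scope, or status differs
}

// Map node id -> serialized comparable fields.
function indexRows(rows: ClusterNodeRow[]): Record<string, string> {
  const index: Record<string, string> = {};
  for (const r of rows) {
    index[r.node_id] = JSON.stringify([r.transport_address, r.scope, r.status]);
  }
  return index;
}

const has = (o: Record<string, string>, k: string): boolean =>
  Object.prototype.hasOwnProperty.call(o, k);

// Returns an empty array when every reporter matches the expected view.
function diffMembership(
  expected: ClusterNodeRow[],
  views: Record<string, ClusterNodeRow[]>,
): MembershipDiff[] {
  const want = indexRows(expected);
  const diffs: MembershipDiff[] = [];
  for (const reporter of Object.keys(views).sort()) {
    const got = indexRows(views[reporter] ?? []);
    const missing = Object.keys(want).filter((id) => !has(got, id)).sort();
    const extra = Object.keys(got).filter((id) => !has(want, id)).sort();
    const changed = Object.keys(want)
      .filter((id) => has(got, id) && got[id] !== want[id])
      .sort();
    if (missing.length + extra.length + changed.length > 0) {
      diffs.push({ reporter, missing, extra, changed });
    }
  }
  return diffs;
}
```

Because the function is pure, it can be unit tested without provisioning nodes, and an empty result could double as the "all active nodes agree on membership" assertion from step 3.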

## Suggested Delivery Order

1. Add management probes and snapshot capture for `cluster_nodes` and sync status.
2. Upgrade the existing cluster lifecycle scenario to assert membership convergence.
3. Add a follow-up scenario for trust-expansion replay or decommission/reactivation edge cases.
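
The first two items both depend on waiting for node-reported state to settle rather than sampling it once. A generic polling helper along these lines could back the convergence assertions. `waitFor` and `PollOptions` are hypothetical names, and the probe passed in would wrap whatever admin/status call the runner ends up using.

```typescript
// Hypothetical polling options; real timeouts would be tuned per scenario.
interface PollOptions {
  timeoutMs: number;  // give up after this long
  intervalMs: number; // pause between probe attempts
}

// Calls `probe` repeatedly until `done` accepts its result or the timeout elapses.
// Resolves with the first accepted value and rejects with an Error on timeout.
async function waitFor<T>(
  probe: () => Promise<T>,
  done: (value: T) => boolean,
  opts: PollOptions,
): Promise<T> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const value = await probe();
    if (done(value)) return value;
    if (Date.now() >= deadline) {
      throw new Error("condition not met within " + opts.timeoutMs + " ms");
    }
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
}
```

Throwing on timeout keeps a stalled `syncing` row visible as a scenario failure instead of letting orchestration proceed.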

## Notes For Coordination

- `tribes-supertest` should stay black-box: prefer asserting on Legion-managed behavior and node-reported admin/status surfaces, not internal database queries.
- Legion now uses the `cluster_nodes` control plane and decommission archives, so the remaining work here is mostly assertion/probe coverage rather than adapting to a moving API surface.