tribes-supertest/docs/REFACTOR_CLUSTER.md
self f3ae495c0f docs: align supertest with syncing membership
Allow sanitized Legion state snapshots to represent syncing cluster members and document the expected syncing-to-active convergence model.
2026-05-23 14:16:28 +02:00


Refactor Cluster Plan

This plan aligns tribes-supertest with the new cluster/control-plane contract in TRIBES-NOSTRSYNC.

Current Mismatches

  • The current scenarios prove that nodes provision, bootstrap, and expose expected services, but they do not verify the new orchestrator-managed cluster_nodes invariants on each node.
  • The local Legion state model now includes clusterMembership, so snapshots can preserve transient syncing rows and archived decommissioned rows, but the runner still does not compare that state to node-reported control-plane views.
  • The cross-provider lifecycle scenario destroys and replaces nodes, but it does not assert the spec's required outcomes:
    • every healthy node converges on the same cluster_nodes view, with new peers moving from syncing to active before orchestration proceeds
    • decommissioned nodes remain archived with the expected status and timestamps
    • replacement/reactivation updates activation timestamps and derived sync transport state
  • There is no scenario coverage for the new trust-expansion replay rule or for mesh-derived sync server configuration.
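
The convergence requirement above (every healthy node reporting the same cluster_nodes view) can be checked mechanically once each node's view has been captured. The following is a minimal sketch under stated assumptions, not the runner's actual implementation: the `ClusterNodeRow` shape, the `node_id` field, and the `diffMembershipViews` helper are hypothetical, and the real cluster_nodes payload may carry different fields.

```typescript
// Hypothetical row shape. transport_address, scope, and status are named in
// this plan; node_id is an assumed identifier field.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

interface MembershipDiff {
  reporter: string; // node whose view disagrees with the reference view
  node_id: string;  // membership row that differs
  expected: string; // fingerprint in the reference view, or "<missing>"
  actual: string;   // fingerprint in the reporter's view, or "<missing>"
}

// Reduce a row to the fields that must match across nodes.
function fingerprint(row: ClusterNodeRow): string {
  return `${row.transport_address}|${row.scope}|${row.status}`;
}

function indexRows(rows: ClusterNodeRow[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const row of rows) index.set(row.node_id, fingerprint(row));
  return index;
}

// Compare every node's view against the first reporter (in sorted order)
// and list each row that is missing or different.
function diffMembershipViews(
  views: Record<string, ClusterNodeRow[]>,
): MembershipDiff[] {
  const reporters = Object.keys(views).sort();
  if (reporters.length < 2) return [];
  const reference = indexRows(views[reporters[0]]);
  const diffs: MembershipDiff[] = [];
  for (const reporter of reporters.slice(1)) {
    const seen = indexRows(views[reporter]);
    const ids = Array.from(reference.keys()).concat(Array.from(seen.keys()));
    for (const node_id of Array.from(new Set(ids)).sort()) {
      const expected = reference.get(node_id) ?? "<missing>";
      const actual = seen.get(node_id) ?? "<missing>";
      if (expected !== actual) {
        diffs.push({ reporter, node_id, expected, actual });
      }
    }
  }
  return diffs;
}
```

An empty result means all captured views agree. A non-empty result is the kind of membership diff that could be persisted alongside a snapshot for later inspection.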

Refactoring Steps

  1. Add management/API probes for the new control-plane surface. Files: src/runner.ts, supporting helpers. Capture from every node:

    • cluster_nodes.list()
    • cluster/sync status
    • derived sync server list or equivalent runtime health payload
  2. Extend artifact capture with cluster-control snapshots. Files: src/runner.ts, snapshot writers, README. For each snapshot, persist:

    • per-node cluster_nodes views
    • per-node sync server/runtime summaries
    • any membership diff detected across nodes
  3. Update scenario assertions to reflect the new control-plane contract. Files: src/runner.ts. “Node exists and is clustered” should no longer be the only signal; add assertions that:

    • all active nodes agree on membership
    • each membership row has the expected transport_address, scope, and status (syncing, active, or decommissioned)
    • removed nodes are present as decommissioned if that is the intended lifecycle
  4. Strengthen the cross-provider lifecycle scenario around membership transitions. Files: src/runner.ts, src/scenarios.ts. The scenario should explicitly check:

    • initial membership of the init node
    • membership convergence after each join
    • membership convergence after removal/decommission
    • membership convergence after replacement join
    • derived sync connectivity after each transition
  5. Add checks for trust-expansion and replay behavior. Files: new scenario steps or a dedicated scenario. At minimum, validate that a newly joined or re-activated node eventually catches up to the expected singleton tribe/membership state after the orchestrator updates control-plane trust.

  6. Verify transport-address and sync endpoint assumptions from the outside. Files: src/runner.ts. For each tracked node, assert that the sync endpoint exposed by Legion matches the cluster_nodes.transport_address values reported by peers and that the endpoint is reachable using the expected scheme/port.

  7. Consider a dedicated failure-oriented scenario. Files: src/scenarios.ts, src/runner.ts. A useful follow-up scenario would intentionally exercise a partial membership change or publish outage and confirm the system either repairs itself or refuses the unsafe operation, depending on the layer being tested.

  8. Extend the docs and snapshots now that local state includes membership archives. Files: README.md, tests. Keep clusterMembership visible in sanitized snapshots so scenario failures can be inspected against the syncing/active/decommissioned membership shape.

  9. Expand tests for the runner itself. Files: tests/*.test.ts. Add unit coverage for:

    • control-plane snapshot parsing
    • membership-diff reporting
    • new scenario assertions around syncing vs active vs decommissioned nodes
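
As an illustration of the control-plane snapshot parsing mentioned in the steps above, here is a rough sketch of a strict parser that turns an untyped probe response into typed membership rows. The `parseClusterNodes` name, the `node_id` field, and the assumption that the payload is a flat JSON array are all hypothetical; only `transport_address`, `scope`, and the three status values come from this plan.

```typescript
// Hypothetical row shape; node_id is an assumed identifier field.
type MemberStatus = "syncing" | "active" | "decommissioned";

interface ClusterNodeRow {
  node_id: string;
  transport_address: string;
  scope: string;
  status: MemberStatus;
}

function isStatus(value: unknown): value is MemberStatus {
  return value === "syncing" || value === "active" || value === "decommissioned";
}

// Validate an untyped payload and fail loudly on anything unexpected, so a
// malformed probe response surfaces as a parse error rather than as a
// misleading membership diff later on.
function parseClusterNodes(payload: unknown): ClusterNodeRow[] {
  if (!Array.isArray(payload)) {
    throw new Error("cluster_nodes payload must be an array");
  }
  return payload.map((raw: unknown, i: number): ClusterNodeRow => {
    if (typeof raw !== "object" || raw === null) {
      throw new Error(`row ${i}: not an object`);
    }
    const { node_id, transport_address, scope, status } =
      raw as Record<string, unknown>;
    if (typeof node_id !== "string" || node_id === "") {
      throw new Error(`row ${i}: missing node_id`);
    }
    if (typeof transport_address !== "string") {
      throw new Error(`row ${i}: missing transport_address`);
    }
    if (typeof scope !== "string") {
      throw new Error(`row ${i}: missing scope`);
    }
    if (!isStatus(status)) {
      throw new Error(`row ${i}: unknown status ${String(status)}`);
    }
    return { node_id, transport_address, scope, status };
  });
}
```

Unit tests for the runner could feed this kind of parser both well-formed rows and rows with an unrecognised status, and assert that the latter are rejected.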

Suggested Delivery Order

  1. Add management probes and snapshot capture for cluster_nodes and sync status.
  2. Upgrade the existing cluster lifecycle scenario to assert membership convergence.
  3. Add a follow-up scenario for trust-expansion replay or decommission/reactivation edge cases.
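
For the lifecycle upgrade in particular, one possible approach is to evaluate each checkpoint (after init, join, removal, replacement) against an expected status per node, treating a peer that is still syncing as "not yet converged" rather than as a failure. The sketch below assumes this interpretation; `checkCheckpoint`, the `Verdict` values, and the node names in the usage note are hypothetical and not taken from the runner.

```typescript
type MemberStatus = "syncing" | "active" | "decommissioned";
type Verdict = "converged" | "pending" | "failed";

interface CheckResult {
  verdict: Verdict;
  problems: string[];
}

// Compare one node's observed membership (node id -> status) with the
// statuses a scenario expects at a given checkpoint.
function checkCheckpoint(
  expected: Record<string, MemberStatus>,
  observed: Record<string, MemberStatus>,
): CheckResult {
  const problems: string[] = [];
  let pending = false;
  for (const nodeId of Object.keys(expected).sort()) {
    const want = expected[nodeId];
    const got: MemberStatus | undefined = observed[nodeId];
    if (got === want) continue;
    if (want === "active" && got === "syncing") {
      // Assumed to be transient: the peer is listed but has not yet
      // moved from syncing to active, so the caller should poll again.
      pending = true;
      continue;
    }
    problems.push(`${nodeId}: expected ${want}, observed ${got ?? "<missing>"}`);
  }
  for (const nodeId of Object.keys(observed).sort()) {
    if (!(nodeId in expected)) {
      problems.push(`${nodeId}: unexpected row (${observed[nodeId]})`);
    }
  }
  if (problems.length > 0) return { verdict: "failed", problems };
  return { verdict: pending ? "pending" : "converged", problems };
}
```

For example, after a replacement join a scenario might expect `node-a` active, `node-b` decommissioned, and `node-c` active. While `node-c` is still syncing the verdict is `pending`, and orchestration would wait. A missing `node-b` row would yield `failed`, because the decommissioned node is expected to remain archived.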

Notes For Coordination

  • tribes-supertest should stay black-box: prefer asserting on Legion-managed behavior and node-reported admin/status surfaces, not internal database queries.
  • Legion now uses the cluster_nodes control plane and decommission archives, so the remaining work here is mostly assertion/probe coverage rather than adapting to a moving API surface.