Adding a Provider

This document explains how to add a new provider to Legion's current architecture.

It is written against the current codebase, not an idealized future state. That means it includes the boring synchronization work too: shared types, provider-module wiring, catalog policy, headless CLI, optional renderer support, tests, and docs.

The fastest way to get this right is:

  1. Treat OVH as the current reference for a "full integrated provider".
  2. Treat Hetzner as the simpler reference for a "well-exercised provider".
  3. Reuse the shared deployment/runtime path. Do not create a separate deployment implementation for the new provider.

Scope

Legion has provider modules with different surfaces:

  • API-backed providers expose cloud surfaces such as compute, DNS, domain, or firewall adapters.
  • manual is also a provider module. Its implementation is user-mediated instead of API-mediated, so it currently has metadata, manual server import, and user-task firewall capabilities but no cloud adapter.

This guide is mostly for adding a new API-backed cloud provider, meaning:

  • real credentials
  • compute catalog support
  • compute provisioning
  • optional DNS service integration
  • shared node deployment via the existing NBDE path

Current architecture

The important boundary is:

  • provider-specific API logic lives in src/main/cloud/providers/*.ts
  • each provider registers through a ProviderModule from src/main/cloud/providers/*-module.ts
  • shared provider mutation recovery lives in src/main/cloud/providers/api-resilience.ts
  • provider lookup, ordering, and enable/disable filtering live in src/main/cloud/providers/registry.ts
  • shared orchestration stays generic in:
    • src/main/ops/engine-ops.ts
    • src/main/resources/*/reconcile.ts
    • src/main/cloud/provider-broker.ts
    • src/main/cloud/provider-persistence.ts
    • src/main/deployment/runtime.ts
    • src/main/deployment/service.ts
  • shared types and provider unions live in src/shared/app.ts
  • catalog policy and deployment defaults are attached to the provider module
  • catalog policy types live in src/shared/provider-catalog-policy.ts
  • deployment profile types live in src/shared/provider-deployment-policy.ts
  • the headless engine surface lives in src/engine/runtime.ts
  • CLI wiring currently goes through:
    • src/main/cli/cli.ts
    • src/main/cli/node-cli-service.ts

That means adding a provider should mostly be:

  • one new provider API implementation file, if it talks to an external API
  • one new provider module file
  • adding that module to the registry's module list
  • provider credential surface wiring
  • tests

The deployment flow after the VM exists should remain shared.

Step 1. Extend the shared provider model

Start in src/shared/app.ts.

Update the provider unions first:

  • ProviderKind
  • IntegratedProviderKind
  • any derived unions that should include the new provider:
    • ServerProviderKind
    • StandardDnsProviderKind

Then add the new provider credential interface and include it in ProviderCredentials.

Use OVH and Hetzner as the shape references:

  • HetznerProviderCredentials
  • OvhProviderCredentials

You will also want to review any request types that implicitly assume the existing kinds, especially:

  • ProductCatalogRequest
  • ProviderConfig
  • ProviderDescriptor

If the provider needs additional project-level data like projectId, keep it explicit and stable in the shared type.
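
As a rough illustration of this step, the sketch below adds a hypothetical provider kind, acme, to simplified stand-ins for the shared unions. The union members, the credential field names (apiToken, projectId), and the helpers isIntegrated and describeCredentials are assumptions made for the example; the real definitions in src/shared/app.ts will differ.

```typescript
// Simplified stand-ins for the unions in src/shared/app.ts (assumed shapes, not copied from the codebase).
type IntegratedProviderKind = "hetzner" | "ovh" | "scaleway" | "acme"; // "acme" is the hypothetical new provider
type ProviderKind = IntegratedProviderKind | "manual" | "mock";

// Credential field names below are illustrative assumptions.
interface HetznerProviderCredentials {
  kind: "hetzner";
  apiToken: string;
}

interface AcmeProviderCredentials {
  kind: "acme";
  apiToken: string;
  projectId: string; // project-level data kept explicit in the shared type
}

type ProviderCredentials = HetznerProviderCredentials | AcmeProviderCredentials;

// Hypothetical helper: narrows a ProviderKind to the integrated subset.
function isIntegrated(kind: ProviderKind): kind is IntegratedProviderKind {
  return kind !== "manual" && kind !== "mock";
}

// The never check makes the compiler flag every switch that forgets the new kind.
function describeCredentials(credentials: ProviderCredentials): string {
  switch (credentials.kind) {
    case "hetzner":
      return "hetzner";
    case "acme":
      return `acme project ${credentials.projectId}`;
    default: {
      const unreachable: never = credentials;
      return unreachable;
    }
  }
}
```

Keeping kind as the discriminant means that a union missing the new provider tends to surface as a compile error instead of a runtime surprise.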

Step 2. Add the provider module contract implementation

Create a provider-module file in src/main/cloud/providers.

Current examples:

  • src/main/cloud/providers/hetzner-module.ts
  • src/main/cloud/providers/ovh-module.ts
  • src/main/cloud/providers/scaleway-module.ts
  • src/main/cloud/providers/mock-module.ts
  • src/main/cloud/providers/manual-module.ts

The module implements ProviderModule from src/main/cloud/providers/provider-module.ts.

It owns:

  • kind
  • descriptor
  • capabilities
  • credential defaults, normalization, resolution, and configured checks
  • optional catalogPolicy
  • optional deploymentProfile
  • optional API-backed adapter surfaces
  • optional enabled(env) gate

If the provider is disabled by enabled(env), Legion hides its persisted configuration and resources from app snapshots without deleting them.
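
A minimal sketch of what such a module object might look like, assuming a heavily reduced contract. The interface ProviderModuleSketch, the acme provider, its descriptor values, and the LEGION_ENABLE_ACME variable are all hypothetical; the real ProviderModule interface in provider-module.ts is the authority.

```typescript
// Assumed, reduced shape of the ProviderModule contract; the real interface lives in
// src/main/cloud/providers/provider-module.ts and has more members.
interface AcmeCredentials {
  kind: "acme";
  apiToken: string;
}

interface ProviderModuleSketch<C> {
  kind: string;
  descriptor: { label: string; defaultProviderId: string };
  capabilities: readonly string[];
  defaultCredentials(): C;
  normalizeCredentials(input: Partial<C>): C;
  isConfigured(credentials: C): boolean;
  enabled?(env: Record<string, string | undefined>): boolean;
}

const acmeModule: ProviderModuleSketch<AcmeCredentials> = {
  kind: "acme",
  descriptor: { label: "Acme Cloud", defaultProviderId: "provider-acme" },
  capabilities: ["computeCatalog"],
  defaultCredentials: () => ({ kind: "acme", apiToken: "" }),
  // Trim whitespace so a pasted token survives a round trip through persisted state unchanged.
  normalizeCredentials: (input) => ({ kind: "acme", apiToken: (input.apiToken ?? "").trim() }),
  isConfigured: (credentials) => credentials.apiToken.length > 0,
  // Hypothetical gate: the provider stays hidden unless an opt-in variable is set.
  enabled: (env) => env["LEGION_ENABLE_ACME"] === "1",
};
```

Passing env as a parameter, rather than reading a global, keeps the gate easy to exercise in unit tests.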

Step 3. Implement provider API logic in src/main/cloud/providers

For API-backed providers, create a provider implementation file following the current pattern:

  • src/main/cloud/providers/hetzner.ts
  • src/main/cloud/providers/ovh.ts
  • src/main/cloud/providers/scaleway.ts

At minimum, the provider implementation should export functions covering the operations used by its module adapter:

  • descriptor
  • credential validation
  • compute catalog loading
  • DNS catalog loading if supported
  • compute inventory
  • compute provisioning
  • compute start/stop/reboot/destroy if supported
  • reinstall from image if supported
  • DNS zone and record operations if supported

The existing provider modules currently wire up functions named in the same style as:

  • getHetznerProviderDescriptor
  • validateHetznerCredentials
  • getHetznerCatalog
  • provisionHetznerServer
  • destroyHetznerServer
  • getOvhProviderDescriptor
  • validateOvhCredentials
  • getOvhCatalog
  • provisionOvhServer
  • destroyOvhServer

Important rule:

  • Normalize the provider's API output into Legion's shared catalog and observed-resource contracts in the provider module.
  • Do not leak provider SDK shapes into the rest of the app.

In practice that means your provider module is responsible for mapping upstream data into:

  • ProductCatalog
  • DnsProductCatalog
  • ProviderObservedServer
  • ProviderObservedZone
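
To make the mapping rule concrete, here is a small sketch that converts an invented upstream payload into a reduced observed-server shape. Both AcmeApiServer and ObservedServerSketch are assumptions for illustration; the real ProviderObservedServer contract has its own fields.

```typescript
// Hypothetical upstream payload; real provider APIs use their own field names.
interface AcmeApiServer {
  id: number;
  hostname: string;
  state: "running" | "stopped" | "building";
  ipv4: string | null;
  labels: Record<string, string>;
}

// Assumed, reduced version of ProviderObservedServer.
interface ObservedServerSketch {
  providerServerId: string;
  name: string;
  status: "running" | "stopped" | "pending";
  publicIpv4?: string;
  tags: Record<string, string>;
}

// All provider vocabulary ("building", numeric ids, null addresses) stops at this function.
function toObservedServer(server: AcmeApiServer): ObservedServerSketch {
  return {
    providerServerId: String(server.id),
    name: server.hostname,
    status: server.state === "building" ? "pending" : server.state,
    // Omit the address entirely when the provider has not assigned one yet.
    ...(server.ipv4 ? { publicIpv4: server.ipv4 } : {}),
    tags: { ...server.labels },
  };
}
```

The point of the sketch is the boundary: callers only ever see the normalized shape, so the orchestration code does not need to know what the upstream API calls anything.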

Provider API writes and ambiguous failures

Each provider module should keep a small provider-specific write wrapper, following the current pattern:

  • hetznerWriteRequest
  • ovhWriteRequest
  • scalewayWriteRequest

Every provider mutation should pass through that wrapper:

  • POST
  • PUT
  • PATCH
  • DELETE

The wrapper is responsible for:

  • logging the write through OperationLogger.logApiWrite
  • normalizing provider errors into ApiErrorDetails
  • using recoverAmbiguousProviderMutation from src/main/cloud/providers/api-resilience.ts
  • returning the recovered value when a read-back proves that the provider already reached the desired state

Do not blindly retry non-idempotent writes after network errors, timeouts, "fetch failed" errors, or 5xx responses. Those errors are ambiguous: the provider may have accepted the request and then failed while returning the response. Prefer read-after-failure verification.
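
One way to express that classification is a small predicate like the one below. It is only a sketch: ApiErrorSketch is a reduced, assumed error shape (the real ApiErrorDetails may differ), and the list of transport error codes is illustrative rather than exhaustive.

```typescript
// Assumed, reduced error shape; the real ApiErrorDetails type may carry different fields.
interface ApiErrorSketch {
  status?: number; // HTTP status, present only when a response was received
  code?: string; // transport-level code such as "ETIMEDOUT"
  message: string;
}

// Illustrative set of transport failures where the request may still have reached the provider.
const AMBIGUOUS_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "EPIPE", "UND_ERR_SOCKET"]);

function isAmbiguousWriteError(error: ApiErrorSketch): boolean {
  if (error.status !== undefined) {
    // A 4xx response is a definite rejection; a 5xx response may hide an applied write.
    return error.status >= 500;
  }
  if (error.code !== undefined && AMBIGUOUS_CODES.has(error.code)) {
    return true;
  }
  // Fall back to message matching for "fetch failed" and timeout wording.
  return /fetch failed|timed? ?out/i.test(error.message);
}
```

A definite 4xx rejection is deliberately not treated as ambiguous, so validation and authentication errors still surface immediately.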

The verifier passed as verifyAfterAmbiguousError should prove the desired final state, not just prove that a similar resource exists:

  • create SSH key: find a key with the expected name and public key
  • create server: find a server with the expected local Legion identity, name, and tags
  • update firewall: read the firewall and compare the expected metadata/rules
  • delete resource: treat already-gone/not-found as success

If a write cannot be verified with provider state, leave verifyAfterAmbiguousError out and let the original error bubble. Add a short code comment only when the lack of verification is non-obvious.
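
The read-after-failure flow described in this section can be sketched as follows. The helper recoverAmbiguousMutation is a hypothetical stand-in written for this example; the real recoverAmbiguousProviderMutation in api-resilience.ts may have a different signature, and the in-memory providerKeys array merely simulates provider state.

```typescript
// Hypothetical stand-in for recoverAmbiguousProviderMutation; not the real signature.
interface MutationOptions<T> {
  write: () => Promise<T>;
  isAmbiguous: (error: unknown) => boolean;
  // Resolves to the desired final state if the provider already reached it, otherwise undefined.
  verifyAfterAmbiguousError?: () => Promise<T | undefined>;
}

async function recoverAmbiguousMutation<T>(options: MutationOptions<T>): Promise<T> {
  try {
    return await options.write();
  } catch (error) {
    if (!options.verifyAfterAmbiguousError || !options.isAmbiguous(error)) {
      throw error; // unverifiable or unambiguous: let the original error bubble
    }
    const recovered = await options.verifyAfterAmbiguousError();
    if (recovered === undefined) {
      throw error; // the read-back did not prove success
    }
    return recovered;
  }
}

// Simulated provider state: the create call is applied, but the response is lost.
interface SshKey {
  name: string;
  publicKey: string;
}
const providerKeys: SshKey[] = [];
const wanted: SshKey = { name: "legion-managed", publicKey: "ssh-ed25519 AAAA" };

const createKey = (): Promise<SshKey> =>
  recoverAmbiguousMutation<SshKey>({
    write: async () => {
      providerKeys.push(wanted);
      throw new Error("fetch failed");
    },
    isAmbiguous: (error) => error instanceof Error && /fetch failed/.test(error.message),
    // Match on name and public key, not just "some key exists".
    verifyAfterAmbiguousError: async () =>
      providerKeys.find((key) => key.name === wanted.name && key.publicKey === wanted.publicKey),
  });
```

For a delete, the same pattern would apply with a verifier that reports success when the read-back returns not-found.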

Cleanup and delete paths should be absence-driven and tolerant of resources that are already gone. They should not wait for cluster decommission when local provisioning state shows that the node never successfully joined the cluster.

Step 4. Register the provider module

Update src/main/cloud/providers/registry.ts.

The registry should stay small. Do not add provider-specific branches there.

You need to:

  1. Import the new provider module.
  2. Add it to PROVIDER_MODULES.

Provider-specific capabilities, credentials, catalog policy, deployment profile, and adapter methods belong in the provider module, not in the registry.
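
A sketch of how small the registry change can stay, assuming a reduced module shape. RegisteredModule, enabledModules, getProviderModule, and the acme entry are hypothetical; only PROVIDER_MODULES and the LEGION_GUI_MOCK_NETWORK gate are names taken from this guide.

```typescript
// Reduced stand-in for ProviderModule; only the members this sketch needs.
interface RegisteredModule {
  kind: string;
  enabled?(env: Record<string, string | undefined>): boolean;
}

const hetznerEntry: RegisteredModule = { kind: "hetzner" };
const acmeEntry: RegisteredModule = { kind: "acme" }; // hypothetical new provider
const mockEntry: RegisteredModule = {
  kind: "mock",
  enabled: (env) => env["LEGION_GUI_MOCK_NETWORK"] === "1",
};

// Adding a provider is one import plus one entry; no provider-specific branches.
const PROVIDER_MODULES: readonly RegisteredModule[] = [hetznerEntry, acmeEntry, mockEntry];

// Modules without a gate are always enabled.
function enabledModules(env: Record<string, string | undefined>): RegisteredModule[] {
  return PROVIDER_MODULES.filter((candidate) => candidate.enabled?.(env) ?? true);
}

function getProviderModule(kind: string, env: Record<string, string | undefined>): RegisteredModule {
  const found = enabledModules(env).find((candidate) => candidate.kind === kind);
  if (!found) {
    throw new Error(`Unknown or disabled provider: ${kind}`);
  }
  return found;
}
```

Because lookup and filtering only touch the generic module surface, the registry does not change shape as providers are added.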

If the provider talks to an API, its adapter must implement CloudProviderAdapter from src/main/cloud/providers/contracts.ts.

That means providing:

  • verifyCredentials
  • compute
  • dns

Read the contract in src/main/cloud/providers/contracts.ts carefully. It is the actual integration boundary.

Development/test-only providers

If the provider is intentionally fake, keep that explicit instead of making it impersonate a real provider.

The current example is mock, displayed as MockKing24. Its provider module is enabled only when the GUI mock network is active:

  • LEGION_GUI_MOCK_NETWORK=1

For this kind of provider:

  • still add shared provider types, a provider module, catalog policy, deployment profile, and renderer support if the UI should show it
  • do not add live catalog or billed E2E coverage
  • keep credentials local and deterministic if they configure failure modes such as error rate, latency, or seed
  • make the provider module disabled by default
  • keep stored mock resources hidden, not deleted, when the mock gate is inactive
  • keep real-provider tests for real-provider semantics instead of treating the mock as a substitute for Hetzner/OVH/Scaleway coverage
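
For the deterministic failure modes mentioned above, one possible approach is a seeded pseudo-random generator, sketched below. The MockFailureConfig shape and createFailureInjector are assumptions for illustration; this guide does not describe how the mock provider implements its error rate. mulberry32 is a well-known small PRNG, used here only because the same seed reproduces the same sequence.

```typescript
// Hypothetical mock credentials that configure failure modes; field names are assumptions.
interface MockFailureConfig {
  seed: number;
  errorRate: number; // 0 = never fail, 1 = always fail
}

// mulberry32: a small seeded PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns a function that answers "should this call fail?" reproducibly for a given seed.
function createFailureInjector(config: MockFailureConfig): () => boolean {
  const random = mulberry32(config.seed);
  return () => random() < config.errorRate;
}
```

With a fixed seed, a failing GUI scenario can be replayed exactly, which is the main reason to avoid Math.random() in a mock like this.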

Step 5. Add provider catalog policy

Add the provider's ProviderCatalogPolicy and attach it to the provider module.

Every integrated provider needs a ProviderCatalogPolicy entry:

  • defaultImage
  • optional defaultRegion
  • defaultProductSelector
  • any catalog metadata enrichment rules

This drives:

  • default offer selection
  • default image choice
  • region defaults
  • quote generation expectations

Do not leave the new provider out here. selectProvisioningOffer() relies on the normalized catalog shape, but the app also expects a sane default offer for provider-driven flows and tests.

If your provider needs a preferred starter offer, encode it here.

Legion's compute model now treats all instance lines as offers:

  • fixed Hetzner-style instances are offers with zero variables
  • configurable providers expose the same offer shape, but may add variables such as diskGb
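
The sketch below shows how a policy with a preferred starter offer might look, using reduced stand-in types. Treating defaultProductSelector as a function, the offer ids, the region, and the image name are all assumptions for illustration; check src/shared/provider-catalog-policy.ts for the real shape.

```typescript
// Assumed, reduced shapes; the real types live in the shared policy and app modules.
interface OfferVariable {
  id: string;
  min: number;
  max: number;
  default: number;
}

interface Offer {
  id: string;
  region: string;
  variables: OfferVariable[];
}

interface CatalogPolicySketch {
  defaultImage: string;
  defaultRegion?: string;
  defaultProductSelector: (offers: readonly Offer[]) => Offer | undefined;
}

const acmePolicy: CatalogPolicySketch = {
  defaultImage: "ubuntu-24.04",
  defaultRegion: "eu-central",
  // Prefer a named starter offer, else fall back to the first offer in the default region.
  defaultProductSelector: (offers) =>
    offers.find((offer) => offer.id === "acme-small") ??
    offers.find((offer) => offer.region === "eu-central"),
};

const sampleOffers: Offer[] = [
  // Configurable offer: same shape, plus a diskGb variable.
  { id: "acme-flex", region: "eu-central", variables: [{ id: "diskGb", min: 20, max: 500, default: 40 }] },
  // Fixed instance: an offer with zero variables.
  { id: "acme-small", region: "eu-central", variables: [] },
];
```

Both fixed and configurable instances share one offer shape here, which is what lets generic selection code ignore the difference.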

Step 6. Add provider deployment profile

Add the provider's ProviderDeploymentProfile and attach it to the provider module.

This is intentionally small right now. Today it carries:

  • defaultBootMode

NodeDeploymentRuntime uses this profile when nodes add is run without an explicit --boot-mode.

That means a new compute provider must define its deployment profile before the headless node path is usable.

Current examples:

  • Hetzner: bios
  • OVH: bios
  • Scaleway: efi

Keep this provider-specific and minimal. The actual deploy pipeline stays shared.
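
The fallback from an explicit flag to the profile default can be sketched like this. The lookup table and resolveBootMode are hypothetical helpers; only the bios/efi defaults for Hetzner, OVH, and Scaleway come from the examples above, and acme is an invented provider.

```typescript
type BootMode = "bios" | "efi";

interface DeploymentProfileSketch {
  defaultBootMode: BootMode;
}

// Defaults for the three existing providers are taken from this guide; "acme" is hypothetical.
const DEPLOYMENT_PROFILES: Record<string, DeploymentProfileSketch | undefined> = {
  hetzner: { defaultBootMode: "bios" },
  ovh: { defaultBootMode: "bios" },
  scaleway: { defaultBootMode: "efi" },
  acme: { defaultBootMode: "efi" },
};

function resolveBootMode(providerKind: string, explicit?: BootMode): BootMode {
  if (explicit) {
    return explicit; // an explicit --boot-mode always wins
  }
  const profile = DEPLOYMENT_PROFILES[providerKind];
  if (!profile) {
    // Without a profile the headless node path has nothing to fall back to.
    throw new Error(`Provider ${providerKind} has no deployment profile`);
  }
  return profile.defaultBootMode;
}
```

Failing loudly on a missing profile, instead of guessing a boot mode, matches the requirement that a compute provider defines its profile before the headless path is usable.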

Step 7. Make the headless engine accept the provider

Most of the engine is already generic once the registry knows about the provider.

Still review these files for hardcoded provider assumptions:

  • src/engine/runtime.ts
  • src/main/product-catalog.ts
  • src/main/ops/engine-ops.ts
  • src/main/cloud/provider-broker.ts
  • src/main/cloud/provider-persistence.ts
  • src/main/deployment/runtime.ts

Things to verify:

  • getProviderAdapter(providerKind) works with the new kind
  • catalog fetches work through ProviderCatalogStore
  • provisioning can resolve the new provider configuration and offer
  • the node runtime can choose a deployment profile for the provider

The main rule here is:

  • if you find a provider-specific branch in orchestration code, remove it if possible
  • if it must stay, make the new provider explicit there

Step 8. Add CLI support for provider configure

The headless CLI still contains provider-specific credential parsing.

Update:

  • src/main/cli/node-cli-service.ts
  • src/main/cli/cli.ts

Specifically:

  • extend ProviderConfigureRequest handling
  • add CLI flags and env var resolution for the new provider's credentials
  • extend resolveProviderUpsert

Keep the CLI contract explicit and simple. Follow the current style:

  • direct flag support
  • --...-env support
  • stable default env var names

If the provider is intended to support headless node deployment, this step is mandatory.

For development/test-only UI providers, CLI support is optional. If you skip it, make sure CLI parsing rejects the provider clearly instead of accepting a half-configured provider.
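
A minimal sketch of the three-level credential resolution described in this step: direct flag, then an env-name flag, then a stable default variable. The flag names (--acme-token, --acme-token-env), the default variable ACME_API_TOKEN, and resolveAcmeToken are invented for a hypothetical acme provider and are not part of the current CLI.

```typescript
// Hypothetical parsed flags for an invented "acme" provider.
interface AcmeConfigureFlags {
  acmeToken?: string; // --acme-token <value>
  acmeTokenEnv?: string; // --acme-token-env <VAR_NAME>
}

// Assumed stable default env var name.
const DEFAULT_ACME_TOKEN_ENV = "ACME_API_TOKEN";

function resolveAcmeToken(
  flags: AcmeConfigureFlags,
  env: Record<string, string | undefined>,
): string {
  if (flags.acmeToken) {
    return flags.acmeToken; // 1. direct flag
  }
  // 2. explicit env var name, else 3. the default env var name
  const envName = flags.acmeTokenEnv ?? DEFAULT_ACME_TOKEN_ENV;
  const value = env[envName];
  if (!value) {
    // Reject clearly instead of persisting a half-configured provider.
    throw new Error(`Missing Acme API token: pass --acme-token or set ${envName}`);
  }
  return value;
}
```

The error message names the variable that was actually consulted, which tends to save a round of debugging when a custom env name is mistyped.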

Step 9. Update renderer support if the provider should appear in the app UI

If the provider is headless-only for now, you can skip this section.

If it should be visible and editable in the Electron UI, the provider descriptor should do most of the work.

The renderer currently uses:

  • ProviderDescriptor.credentialFields for the settings provider form
  • ProviderDescriptor.capabilities for provider lists such as server creation
  • ProviderDescriptor.defaultProviderId for the provider's stable persisted id
  • renderer translations keyed by provider kind and credential field id

For a provider with ordinary credential fields, you should not add a provider-specific settings card.

Descriptor-driven provider forms

Review:

  • src/renderer/src/forms/provider-forms.ts
  • src/renderer/src/state/settings-actions.ts
  • src/renderer/src/components/SettingsDialog.svelte
  • src/renderer/src/state/model.ts
  • src/renderer/src/state/view-flows.ts

These should remain generic:

  • settings drafts are stored as ProviderConfigDrafts
  • credential values are keyed by descriptor field id
  • the renderer sends { kind, ...fieldValues } as credentials
  • the main-process provider module normalizes credentials and decides whether they are configured

Only add custom renderer code if the provider needs unusual credential UX that cannot be represented by descriptor fields.
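
To illustrate why no provider-specific form code is needed, the sketch below builds the { kind, ...fieldValues } payload purely from descriptor fields. DescriptorSketch, its secret flag, and buildCredentialsPayload are assumptions for the example; the real ProviderDescriptor and draft types carry more information.

```typescript
// Assumed, reduced descriptor shape.
interface CredentialField {
  id: string;
  secret: boolean;
}

interface DescriptorSketch {
  kind: string;
  credentialFields: CredentialField[];
}

const acmeDescriptor: DescriptorSketch = {
  kind: "acme", // hypothetical provider
  credentialFields: [
    { id: "apiToken", secret: true },
    { id: "projectId", secret: false },
  ],
};

// Build the { kind, ...fieldValues } payload from a draft keyed by descriptor field id.
function buildCredentialsPayload(
  descriptor: DescriptorSketch,
  draft: Record<string, string>,
): Record<string, string> {
  const fieldValues: Record<string, string> = {};
  for (const field of descriptor.credentialFields) {
    // Only descriptor fields are copied, so stray draft keys never reach the main process.
    fieldValues[field.id] = draft[field.id] ?? "";
  }
  return { kind: descriptor.kind, ...fieldValues };
}
```

Normalization and the configured check stay in the main-process provider module; the renderer side only collects values.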

Provider selection UI

Review provider selection surfaces if the provider has new capabilities:

  • src/renderer/src/components/workspace/ServerDialog.svelte
  • src/renderer/src/components/workspace/DomainDialog.svelte
  • src/renderer/src/components/workspace/ZoneDialog.svelte

Server creation is descriptor-driven for providers with computeCatalog.

Domain and DNS flows still use narrower domain/DNS form unions where provider semantics are more specific. Extend those deliberately if the new provider supports standard DNS or domain registration.

Manual provider UX

The manual provider does not have credentials. It exposes a static compute catalog and captures server connection details as offer variables:

  • public IP or hostname
  • SSH username
  • password for the initial managed-key bootstrap

Manual server plans copy those offer variables into manualConfig at the main-process planning boundary and assume an Ubuntu-compatible host. The deployment path auto-detects privilege mode from the SSH user: root runs directly, other users are expected to have passwordless sudo.
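
The privilege rule above can be pictured with a sketch like the following. The function names are hypothetical, and using sudo -n (which makes sudo fail rather than prompt for a password) is an assumption about how the passwordless expectation could be enforced, not a description of Legion's actual deployment code.

```typescript
type PrivilegeMode = "root" | "sudo";

// root runs directly; every other SSH user is expected to have passwordless sudo.
function detectPrivilegeMode(sshUser: string): PrivilegeMode {
  return sshUser === "root" ? "root" : "sudo";
}

// Assumption: -n makes sudo fail immediately instead of waiting on a password prompt.
function wrapCommand(sshUser: string, command: string): string {
  return detectPrivilegeMode(sshUser) === "root" ? command : `sudo -n ${command}`;
}
```

For example, wrapCommand("ubuntu", "apt-get update") yields a sudo-prefixed command, while the same call for root returns the command unchanged.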

Manual firewall handling is also user-mediated. The firewall planner keeps manual firewall reconciliation out of provider mutation paths; it does not silently pretend provider firewall reconciliation happened.

Translations

Update renderer i18n files for:

  • provider labels in src/renderer/src/i18n/locales/
  • any provider-specific credential field labels that are not already covered by generic field ids
  • any UI copy you add

All user-facing strings must be added for:

  • en-GB
  • de-DE
  • es-ES
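
A small check along these lines can catch a locale that was forgotten. It is a sketch under assumptions: the locale tables are modelled as flat key/value maps and the key providers.acme.label is invented, whereas the real files under src/renderer/src/i18n/locales/ may be structured differently.

```typescript
const REQUIRED_LOCALES = ["en-GB", "de-DE", "es-ES"] as const;
type Locale = (typeof REQUIRED_LOCALES)[number];

// Returns "locale:key" entries for every required key that is missing or empty.
function findMissingTranslations(
  tables: Record<Locale, Record<string, string>>,
  requiredKeys: readonly string[],
): string[] {
  const missing: string[] = [];
  for (const locale of REQUIRED_LOCALES) {
    for (const key of requiredKeys) {
      if (!tables[locale][key]) {
        missing.push(`${locale}:${key}`);
      }
    }
  }
  return missing;
}

// Hypothetical tables: the Spanish label was forgotten.
const sampleTables: Record<Locale, Record<string, string>> = {
  "en-GB": { "providers.acme.label": "Acme Cloud" },
  "de-DE": { "providers.acme.label": "Acme Cloud" },
  "es-ES": {},
};
```

Run against sampleTables with the single key providers.acme.label, the function reports the es-ES entry as missing.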

Step 10. Add live catalog smoke coverage

Every integrated provider should have a read-only live test before destructive node deployment coverage.

Current examples:

  • tests/integration/hcloud.catalog.test.ts
  • tests/integration/ovh.smoke.test.ts
  • tests/integration/scaleway.smoke.test.ts

Add a provider-specific integration smoke test that validates at least:

  1. authentication works
  2. the compute catalog loads
  3. the DNS catalog loads, if supported
  4. the normalized catalog has sensible defaults
  5. selectProvisioningOffer() works with the provider's IDs

If the provider depends on a generated client or provider SDK schema surface, check the API endpoints you rely on directly in this test, similar to OVH's schema assertions.

Then wire it into package.json:

  • add a dedicated script
  • include it in test:integration:live if appropriate

Also update TESTING.md.

Step 11. Add destructive headless node deployment coverage

Do not create a new test harness per provider.

Use the shared live harness:

  • tests/billed/e2e/cli-live-test-harness.ts

Create only a thin provider wrapper, following:

  • tests/billed/e2e/hetzner-cli.test.ts
  • tests/billed/e2e/ovh-cli.test.ts
  • tests/billed/e2e/scaleway-cli.test.ts

Your wrapper should provide:

  • required credential env vars
  • optional provider-specific env vars such as instance/offer selection or boot mode
  • providerConfigureArgs
  • providerEnv
  • provider label/kind
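
A thin wrapper might reduce to a configuration object like the sketch below. The option shape LiveCliProviderSketch, the env var names, and the CLI flags are assumptions for a hypothetical acme provider; the existing Hetzner, OVH, and Scaleway wrappers and the harness itself define the real interface.

```typescript
// Assumed option shape; check tests/billed/e2e/cli-live-test-harness.ts for the real one.
interface LiveCliProviderSketch {
  label: string;
  kind: string;
  requiredEnv: readonly string[];
  optionalEnv: readonly string[];
  providerConfigureArgs: readonly string[];
  providerEnv: (env: Record<string, string | undefined>) => Record<string, string>;
}

const acmeLiveTest: LiveCliProviderSketch = {
  label: "Acme",
  kind: "acme",
  requiredEnv: ["ACME_API_TOKEN"],
  optionalEnv: ["LEGION_ACME_E2E_OFFER"],
  // Pass the variable name, not the secret, so the token never appears in argv.
  providerConfigureArgs: ["--provider", "acme", "--acme-token-env", "ACME_API_TOKEN"],
  providerEnv: (env) => ({ ACME_API_TOKEN: env["ACME_API_TOKEN"] ?? "" }),
};

// Lets the wrapper skip the billed test cleanly when credentials are absent.
function missingRequiredEnv(
  config: LiveCliProviderSketch,
  env: Record<string, string | undefined>,
): string[] {
  return config.requiredEnv.filter((name) => !env[name]);
}
```

Everything else, from the isolated HOME to nodes destroy, stays in the shared harness, so the wrapper has no lifecycle logic of its own.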

The harness already handles:

  • isolated HOME
  • config init
  • provider configure
  • nodes add
  • nodes list --json
  • nodes destroy
  • live stdout/stderr streaming
  • LEGION_CC_E2E_KEEP_INSTANCE=1
  • exporting the managed SSH key and printing an SSH command

Also update:

  • package.json
  • TESTING.md

Step 12. Validate the provider step by step

Do not jump straight to the destructive E2E.

Use this order.

1. Static verification

Run:

npm run lint
npm run typecheck:node

If the renderer was touched:

npm run typecheck

2. Mocked/local tests

Run the most relevant existing suites:

npm run test:unit
npm run test:integration:mock

At minimum, make sure any provider-related unit tests still pass:

  • tests/unit/provider-broker.test.ts
  • tests/unit/cli.test.ts
  • tests/unit/node-deployment-runtime.test.ts

Add targeted unit tests if the new provider introduces special selection or credential logic.

For provider mutations, also add tests around ambiguous errors for the high-risk paths:

  • create server
  • delete server
  • create managed SSH key
  • delete managed SSH key
  • firewall/security-group create, update, and delete when supported

The expected behaviour is not "retry until it works"; it is "read provider state and accept success only when the desired final state is visible".

3. Live catalog smoke test

Run the new provider's read-only live integration test first.

This should confirm:

  • credentials are valid
  • catalog normalization works
  • a default offer is selectable

Only after this passes should you attempt real deployment.

4. Destructive headless E2E

Run the provider's live CLI test.

For a new provider, prefer doing the first debug run with:

export LEGION_CC_E2E_KEEP_INSTANCE=1

That keeps the VM and prints the exported managed SSH key path and a ready-to-use ssh command.

This is the final validation target for the provider addition.

Full checklist

Use this as the compact review list before calling the provider done.

  • src/shared/app.ts
  • src/main/state/store.ts
  • src/main/state/schema.ts
  • src/main/cloud/providers/<provider>.ts
  • src/main/cloud/providers/<provider>-module.ts
  • src/main/cloud/providers/registry.ts
  • src/shared/provider-catalog-policy.ts
  • src/main/cli/cli.ts
  • src/main/cli/node-cli-service.ts
  • renderer form/model/i18n files if the provider should appear in the UI
  • tests/integration/<provider>.smoke.test.ts
  • tests/billed/e2e/<provider>-cli.test.ts
  • package.json
  • TESTING.md

Current gotchas

These are easy to miss in the current codebase.

  • ProviderKind and related unions are duplicated by design across several helper types. Update all affected unions, not just the top-level one.
  • Provider defaults and credential normalization live in the provider module. If persisted state behaves strangely, check the module's credential contract first.
  • node-cli-service.ts still hardcodes credential/env handling for supported providers.
  • The renderer still contains explicit provider branches, especially where DNS/domain support differs by provider. A provider can be fully usable headlessly before the UI knows about it.
  • The shared deployment path is generic, but boot defaults are provider-specific via the provider module's deployment profile.
  • New provider tests should reuse cli-live-test-harness.ts, not fork it.

Suggested development order

If you want the shortest path to a working integration:

  1. shared types
  2. provider module
  3. registry
  4. state normalization/defaults
  5. catalog policy
  6. deployment profile
  7. CLI provider configure
  8. read-only live smoke test
  9. destructive live CLI E2E
  10. renderer support

That gets the infrastructure path working first and keeps UI work separate.