From 9e88927e5bdc030ef34360fb0b6abf74e23f46b6 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Tue, 2 Jun 2026 05:06:30 +0000
Subject: [PATCH] =?UTF-8?q?ideas:=20Co-op=20Cloud=20NixOS=20modules=20?=
 =?UTF-8?q?=E2=80=94=20mkCcApp=20factory=20+=20health-gated=20rollback?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Sonnet 4.6
---
 ideas/coop-cloud-nixos-modules.md | 223 ++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 ideas/coop-cloud-nixos-modules.md

diff --git a/ideas/coop-cloud-nixos-modules.md b/ideas/coop-cloud-nixos-modules.md
new file mode 100644
index 0000000..ba82343
--- /dev/null
+++ b/ideas/coop-cloud-nixos-modules.md
@@ -0,0 +1,223 @@
+# Idea: Co-op Cloud NixOS modules
+
+**Status:** research / pre-design. Not started.
+**Origin:** conversation 2026-06-02 between mfowler and the assistant.
+
+---
+
+## The idea
+
+A public Nix flake that lets NixOS operators deploy Co-op Cloud apps declaratively (tracked in git,
+applied with `nixos-rebuild switch`) instead of through imperative `abra` commands. Each app is a
+thin NixOS module backed by a shared `mkCcApp` factory. Docker Swarm still does the actual container
+work, so the container-isolation story is unchanged; Nix manages *what* is deployed and *at which
+version*.
+
+```nix
+# In a user's NixOS configuration.nix:
+services.coop-cloud.ghost = {
+  enable = true;
+  domain = "blog.example.org";
+  version = "1.3.0+6.42.0-alpine";
+  autoUpdate = true;
+};
+```
+
+---
+
+## Why this over native NixOS modules
+
+NixOS already has native service modules for ~14 of the maintained recipes (matrix-synapse,
+keycloak, nextcloud, jitsi-meet, hedgedoc, immich, etc.). The argument for wrapping the recipes
+instead:
+
+**Container isolation is an advantage, not a legacy burden.** Native NixOS modules run as systemd
+units that, unless individually hardened, share the host's namespaces. Containers give hard network
+and filesystem isolation between apps: a compromised Ghost instance cannot see other services'
+sockets. On a single-node, multi-app host that matters.
+
+**The recipes already exist and are maintained.** The Co-op Cloud recipe ecosystem (compose.yml,
+tested version combinations, backup hooks, Traefik wiring) is the real value. Wrapping it in Nix
+preserves that investment instead of reimplementing it as native modules.
+
+**It fits the target user.** Most Co-op Cloud operators run a single node. They want the isolation
+and curation of Co-op Cloud recipes, but with a declarative, git-tracked, idempotent deployment
+model rather than `abra`'s imperative one. This gives them that without giving up containers.
+
+---
+
+## Existing art
+
+- **Arion** (hercules-ci/arion): runs docker-compose projects from NixOS modules. It could serve as
+  an implementation backend, but it drives docker-compose rather than Swarm stacks, still needs the
+  Docker daemon, and adds another layer.
+- **compose2nix** (aksiksi/compose2nix): converts a compose.yml into NixOS
+  `virtualisation.oci-containers` definitions run as systemd units. That drops the Docker Swarm
+  deployment model (stacks, overlay networks) the recipes are written for.
+- **No existing Co-op Cloud + Nix bridge** was found. The space is open.
+
+---
+
+## Proof of concept already in cc-ci
+
+The cc-ci NixOS config (`cc-ci/nix/modules/`) already implements this pattern for its own internal
+services. The key modules:
+
+| Module | Pattern |
+|---|---|
+| `swarm.nix` | Enables Docker; initialises a single-node Swarm and the `proxy` overlay network in a systemd oneshot |
+| `proxy.nix` | Deploys the Co-op Cloud `traefik` recipe via `abra app deploy`, health-gated |
+| `warm-keycloak.nix` | Deploys keycloak via `abra app deploy`, with snapshot → upgrade → health gate → rollback |
+| `nightly-sweep.nix` | systemd timer + oneshot that runs nightly upgrades across all warm apps |
+
+These modules are bespoke (hard-coded app names, cc-ci-specific reconcile scripts), but their
+structure is exactly what a general `mkCcApp` would produce. The flake idea is to extract and
+parameterise this pattern, with one thin wrapper per recipe.
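+
+The `swarm.nix` row above is the simplest piece to picture. Below is a minimal sketch of that
+pattern as a standalone NixOS module. It was written for this note and is not the cc-ci file. It
+assumes a host with a single usable network interface; multi-homed hosts need
+`docker swarm init --advertise-addr`. The unit name `swarm-init` matches the one the per-app
+services are meant to order after.
+
+```nix
+{ pkgs, ... }:
+{
+  virtualisation.docker.enable = true;
+
+  # Oneshot that puts the host into single-node Swarm mode and creates the
+  # `proxy` overlay network that Co-op Cloud recipes attach to.
+  systemd.services.swarm-init = {
+    description = "Initialise single-node Docker Swarm and the proxy network";
+    after = [ "docker.service" ];
+    requires = [ "docker.service" ];
+    wantedBy = [ "multi-user.target" ];
+    path = [ pkgs.docker ];
+    serviceConfig = {
+      Type = "oneshot";
+      RemainAfterExit = true;
+    };
+    # Idempotent: each step checks current state first, so re-running the
+    # unit (on boot, or when a rebuild changes it) is harmless.
+    script = ''
+      if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
+        docker swarm init
+      fi
+      if ! docker network inspect proxy >/dev/null 2>&1; then
+        docker network create --driver overlay proxy
+      fi
+    '';
+  };
+}
+```
+
+Because every step checks before it acts, Nix can own this unit outright. There is no need for a
+"first boot only" special case.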
+
+---
+
+## Proposed design
+
+### `mkCcApp`: the shared factory
+
+A function in `lib/mkCcApp.nix` that takes per-recipe parameters and returns a NixOS module (an
+attrset of `systemd.services`, `systemd.timers`, `sops.secrets`, etc.):
+
+```nix
+mkCcApp {
+  recipe = "ghost";                     # Co-op Cloud recipe name
+  appName = "ghost";                    # abra app name (often == recipe)
+  domain = "blog.example.org";
+  version = "1.3.0+6.42.0-alpine";
+  env = { MAIL_TRANSPORT = "SMTP"; };   # extra .env vars
+  healthPath = "/ghost/api/admin/site"; # HTTP path for the health gate
+  healthOk = [ 200 ];                   # HTTP status codes that count as healthy
+  healthTimeout = 120;                  # max wait for the health gate, seconds
+  stateful = true;                      # snapshot data volumes before upgrade
+  autoUpdate = true;                    # add a nightly reconcile timer
+  updateSchedule = "03:00:00";          # systemd OnCalendar expression
+  after = [ ];                          # extra systemd ordering deps
+  timeout = 600;                        # deploy timeout in seconds
+}
+```
+
+This emits:
+
+1. **`systemd.services.cc-app-`**: a oneshot that:
+   - creates the abra app if it doesn't exist (`abra app new `),
+   - writes env vars into the abra `.env` file,
+   - runs `abra app deploy --no-input`,
+   - orders after `swarm-init.service` and `deploy-proxy.service`.
+
+2. **`systemd.services.cc-app--reconcile`** (if `autoUpdate = true`): a timer-driven oneshot that
+   implements the health-gated upgrade/rollback loop described below.
+
+3. **`systemd.timers.cc-app--reconcile`** (if `autoUpdate = true`): fires the reconcile service on
+   `updateSchedule`.
+
+4. **`sops.secrets.*`** entries for any declared secrets, wired to the paths the abra env file
+   references.
+
+### Per-recipe module (thin wrapper)
+
+Each recipe becomes a file like `apps/ghost.nix`:
+
+```nix
+# apps/ghost.nix. mkCcApp is not a standard NixOS module argument: it has to
+# come from specialArgs, or from the flake applying this file to it directly.
+{ mkCcApp, ... }:
+mkCcApp {
+  recipe = "ghost";
+  healthPath = "/ghost/api/admin/site";
+  stateful = true;
+}
+```
+
+An operator's NixOS config imports the recipe module and sets their domain and version; nothing
+else is required.
+
+### Health-gated upgrade/rollback
+
+Modelled directly on `cc-ci/runner/warm_reconcile.py`. The reconcile oneshot:
+
+```
+read running version (last-good)
+fetch latest available version via abra
+if running == latest → health-check → update last-good if healthy → exit
+if major-version jump → hold + alert, no deploy → exit
+record last-good = running
+if stateful:
+  abra app undeploy
+  abra app snapshot (data volumes)
+abra app upgrade → latest
+wait for healthPath to return healthOk (up to healthTimeout)
+if healthy:
+  write last-good = latest
+if unhealthy:
+  if stateful:
+    abra app undeploy
+    abra app restore snapshot
+  abra app deploy last-good version
+  write alert sentinel to /var/lib/coop-cloud/alerts/.json
+```
+
+For stateless apps (e.g. traefik, custom-html) the snapshot/restore steps are skipped; only the
+version is rolled back.
+
+### Swarm bootstrap
+
+A `coop-cloud-base.nix` module (imported once per host, not per app) handles:
+
+- `virtualisation.docker.enable = true`
+- the `swarm-init` oneshot (identical to `cc-ci/nix/modules/swarm.nix`)
+- the `deploy-proxy` oneshot for the traefik recipe
+
+All per-app services order after `deploy-proxy.service`.
+
+---
+
+## Secrets model
+
+The cc-ci approach is sops-nix. Secrets live in a git-tracked, encrypted `secrets.yaml` and are
+decrypted at activation using the host's SSH key (as an age identity). That is the right model for
+operators too: no out-of-band secret drops. Each `mkCcApp` call can declare its secrets:
+
+```nix
+secrets = {
+  db_password = { sopsPath = "ghost_db_password"; };
+  smtp_password = { sopsPath = "ghost_smtp_password"; };
+};
+```
+
+The factory generates the `sops.secrets.ghost_db_password` entry and wires the decrypted path into
+the abra `.env` file (or into a Swarm secret, depending on how the recipe reads it).
+
+---
+
+## Open questions
+
+1. **Abra state vs Nix store.** Abra manages its own state in `~/.abra/apps/`. The Nix module
+   writes `.env` files there at deploy time. This is slightly un-Nix (mutable state outside the
+   store), but it's how `cc-ci` works today and it's fine for single-node operators.
+
+2. **Version pinning vs autoUpdate.** If `autoUpdate = false`, the operator pins a version in their
+   NixOS config and upgrades by bumping the string and running `nixos-rebuild switch`. Clean model.
+   If `autoUpdate = true`, the reconciler diverges from the declared version. The Nix config then
+   becomes a floor ("at least this version") rather than an exact pin. This tension needs to be
+   documented.
+
+3. **Recipe flake vs per-operator flake.** Two distribution models:
+   - a single public `coop-cloud-nix` flake with all the maintained recipes, which operators add as
+     an input;
+   - operators fork and extend.
+
+   Probably start with the single public flake; per-recipe modules stay thin enough that forks are
+   easy.
+
+4. **Recipes without a clean health endpoint.** Some apps (mumble, mailu) don't have a simple HTTP
+   health path. The `healthPath = null` case would skip the HTTP gate and only wait for the Swarm
+   service to stabilise. That is weaker, but still useful.
+
+5. **Relationship to Co-op Cloud upstream.** This is a parallel deployment interface for the same
+   recipes, not a fork. Recipe compose.yml files stay upstream; the flake only wraps them. Worth
+   coordinating with the Co-op Cloud maintainers rather than building in isolation.
+
+---
+
+## Recipes to cover (the maintained set)
+
+bluesky-pds, cryptpad, custom-html, custom-html-tiny, discourse, ghost, hedgedoc, immich,
+keycloak, lasuite-docs, lasuite-drive, lasuite-meet, mailu, matrix-synapse, mattermost-lts,
+mumble, n8n, plausible, uptime-kuma.
+
+Notable gaps vs nixpkgs native modules: ghost and mailu (neither has a nixpkgs module). Most of the
+rest have native modules, but the container-isolation argument still applies.
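+
+For the single-flake option in open question 3, here is one possible shape for the flake. It is a
+sketch under assumptions, not a settled design:
+
+- `mkCcApp` is inlined for brevity; the design keeps it in `lib/mkCcApp.nix`.
+- Only three recipes are listed, and an `apps/${name}.nix` wrapper is assumed to exist for each.
+- The operator-facing values (`enable`, `domain`, `version`, `autoUpdate`) are exposed as options
+  under `services.coop-cloud.${recipe}` to match the opening `configuration.nix` example.
+- The deploy script is a placeholder, because the exact abra invocation is not pinned down here.
+- The reconcile service, the timer and the secrets are omitted.
+
+```nix
+{
+  description = "Co-op Cloud recipes as NixOS modules (sketch)";
+
+  outputs = { self, ... }:
+    let
+      # Hypothetical factory: recipe-level defaults in, NixOS module out.
+      # Other parameters (healthPath, stateful, env, ...) are accepted via
+      # `...` but would only feed the reconcile service, omitted here.
+      mkCcApp = { recipe, appName ? recipe, ... }:
+        { config, lib, ... }:
+        let
+          cfg = config.services.coop-cloud.${recipe};
+        in
+        {
+          options.services.coop-cloud.${recipe} = {
+            enable = lib.mkEnableOption "the Co-op Cloud ${recipe} recipe";
+            domain = lib.mkOption {
+              type = lib.types.str;
+              description = "Public domain the app is served on.";
+            };
+            version = lib.mkOption {
+              type = lib.types.str;
+              description = "Recipe version to deploy (a floor when autoUpdate is on).";
+            };
+            autoUpdate = lib.mkEnableOption "the nightly health-gated reconcile";
+          };
+
+          config = lib.mkIf cfg.enable {
+            systemd.services."cc-app-${appName}" = {
+              description = "Deploy Co-op Cloud app ${appName} (${recipe} recipe)";
+              after = [ "swarm-init.service" "deploy-proxy.service" ];
+              requires = [ "swarm-init.service" "deploy-proxy.service" ];
+              wantedBy = [ "multi-user.target" ];
+              serviceConfig = {
+                Type = "oneshot";
+                RemainAfterExit = true;
+              };
+              # Placeholder: the real script would run `abra app new`, write
+              # the .env file and call `abra app deploy`.
+              script = ''
+                echo "deploy ${appName} ${cfg.version} at ${cfg.domain}"
+              '';
+            };
+          };
+        };
+
+      # Hypothetical subset of the maintained recipes listed above.
+      recipes = [ "ghost" "hedgedoc" "mailu" ];
+    in
+    {
+      # Each apps/<name>.nix is a thin `{ mkCcApp, ... }: mkCcApp { ... }`
+      # wrapper. Applying it here yields a plain NixOS module, so operators
+      # need no specialArgs.
+      nixosModules = builtins.listToAttrs (map
+        (name: {
+          inherit name;
+          value = import (./apps + "/${name}.nix") { inherit mkCcApp; };
+        })
+        recipes);
+    };
+}
+```
+
+With this shape, an operator would add the flake as an input, import `nixosModules.ghost` alongside
+the host-wide base module, and set `services.coop-cloud.ghost` as in the opening example. Applying
+the wrappers inside the flake keeps `mkCcApp` out of the NixOS module arguments entirely.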