ideas: Co-op Cloud NixOS modules — mkCcApp factory + health-gated rollback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autonomic-bot
2026-06-02 05:06:30 +00:00
parent 5c691cdb66
commit 9e88927e5b


@ -0,0 +1,223 @@
# Idea: Co-op Cloud NixOS modules
**Status:** research / pre-design. Not started.
**Origin:** conversation 2026-06-02 between mfowler and the assistant.
---
## The idea
A public Nix flake that lets NixOS operators deploy Co-op Cloud apps declaratively — via git, via
`nixos-rebuild switch` — instead of via `abra` imperative commands. Each app is a thin NixOS module
backed by a shared `mkCcApp` factory. Docker Swarm still does the actual container work, so the
container-isolation story is unchanged; Nix manages *what* is deployed and *at what version*.
```nix
# In a user's NixOS configuration.nix:
services.coop-cloud.ghost = {
  enable = true;
  domain = "blog.example.org";
  version = "1.3.0+6.42.0-alpine";
  autoUpdate = true;
};
```
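For context, a host might pull the modules in roughly as follows. This is only a sketch: no such flake is published yet, so the input URL and the `nixosModules.base` and `nixosModules.ghost` output names are hypothetical.

```nix
# flake.nix on the operator's host (illustrative sketch)
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # Hypothetical location; replace with wherever the flake ends up living.
    coop-cloud.url = "git+https://example.org/coop-cloud-nix";
  };

  outputs = { self, nixpkgs, coop-cloud, ... }: {
    nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        coop-cloud.nixosModules.base    # hypothetical: Swarm + proxy bootstrap
        coop-cloud.nixosModules.ghost   # hypothetical: the ghost recipe wrapper
      ];
    };
  };
}
```

With something like this in place, the `services.coop-cloud.ghost` block above would live in `configuration.nix` and be applied by `nixos-rebuild switch --flake .#myhost`.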
---
## Why this over native NixOS modules
NixOS already has native service modules for ~14 of the 18 maintained recipes (matrix-synapse,
keycloak, nextcloud, jitsi-meet, hedgedoc, immich, etc.). The argument for doing it this way
instead:
**Container isolation is an advantage, not a legacy.** Native NixOS modules run as systemd units
that share the host's network and filesystem namespaces unless a unit is explicitly sandboxed.
Containers isolate each app's network and filesystem by default, so a compromised Ghost instance
cannot see other services' sockets. On a single-node multi-app host, that matters.

**The recipes already exist and are maintained.** The Co-op Cloud recipe ecosystem (compose.yml,
tested version combinations, backup hooks, Traefik wiring) is the real value. Wrapping it in Nix
preserves that investment rather than reimplementing it as native modules.

**The target user.** Most Co-op Cloud operators run a single node. They want the isolation and
curation of Co-op Cloud recipes, but with a declarative, git-tracked, idempotent deployment model
rather than `abra`'s imperative one. This gives them that without giving up containers.
---
## Existing art
- **Arion** (hercules-ci/arion): runs docker-compose via NixOS modules. Could serve as an
  implementation backend, but it requires a Docker daemon and adds another layer.
- **compose2nix** (aksiksi/compose2nix): converts compose.yml into NixOS
  `virtualisation.oci-containers` configuration (systemd-managed containers on Podman or Docker).
  Loses the Docker Swarm model.
- **No existing Co-op Cloud + Nix bridge** was found. The space is open.
---
## Proof of concept already in cc-ci
The cc-ci NixOS config (`cc-ci/nix/modules/`) already implements this pattern for its own internal
services. The key modules:
| Module | Pattern |
|---|---|
| `swarm.nix` | Enables Docker, initialises single-node Swarm + `proxy` overlay network as a systemd oneshot |
| `proxy.nix` | Deploys the Co-op Cloud `traefik` recipe via `abra app deploy`, health-gated |
| `warm-keycloak.nix` | Deploys keycloak via `abra app deploy`, with snapshot→upgrade→health-gate→rollback |
| `nightly-sweep.nix` | systemd timer + oneshot that runs nightly upgrades across all warm apps |
These are bespoke (hard-coded app names, cc-ci-specific reconcile scripts) but the structure is
exactly what a general `mkCcApp` would produce. The flake idea is: extract and parameterise this
pattern, one thin wrapper per recipe.
---
## Proposed design
### `mkCcApp` — the shared factory
A function in `lib/mkCcApp.nix` that takes per-recipe parameters and returns a NixOS module
(attrset of `systemd.services`, `systemd.timers`, `sops.secrets`, etc.):
```nix
mkCcApp {
  recipe = "ghost";                       # Co-op Cloud recipe name
  appName = "ghost";                      # abra app name (often == recipe)
  domain = "blog.example.org";
  version = "1.3.0+6.42.0-alpine";
  env = { MAIL_TRANSPORT = "SMTP"; };     # extra .env vars
  healthPath = "/ghost/api/admin/site";   # HTTP path for health gate
  healthOk = [ 200 ];
  healthTimeout = 120;
  stateful = true;                        # snapshot data volumes before upgrade
  autoUpdate = true;                      # add a nightly timer
  updateSchedule = "03:00:00";            # systemd OnCalendar time
  after = [];                             # extra systemd ordering deps
  timeout = 600;                          # deploy timeout in seconds
}
```
This emits:
1. **`systemd.services.cc-app-<appName>`** — a oneshot that:
   - Creates the abra app if it doesn't exist (`abra app new <recipe> <appName>`)
   - Writes env vars into the abra `.env` file
   - Runs `abra app deploy <appName> --no-input`
   - Orders after `swarm-init.service` and `deploy-proxy.service`
2. **`systemd.services.cc-app-<appName>-reconcile`** (if `autoUpdate = true`) — a oneshot that
implements the health-gated upgrade/rollback loop (see below), driven by a timer.
3. **`systemd.timers.cc-app-<appName>-reconcile`** (if `autoUpdate = true`) — fires the reconcile
service on `updateSchedule`.
4. **`sops.secrets.*`** entries for any declared secrets, wired to paths the abra env file
references.
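
A minimal sketch of the first and third outputs, to make the shape concrete. It is not the cc-ci implementation. The `abraPackage` and `envFile` arguments are assumptions (the document does not say how `abra` is packaged or where the `.env` file lives), the "app already exists" test is a guess, and the `abra` invocations are copied from the list above rather than checked against the CLI. The reconcile service the timer would trigger is omitted.

```nix
# lib/mkCcApp.nix -- illustrative sketch only
{ lib, pkgs }:

{ recipe
, appName ? recipe
, env ? { }
, autoUpdate ? false
, updateSchedule ? "03:00:00"
, after ? [ ]
, timeout ? 600
  # Assumptions, not taken from the source document:
, abraPackage   # a derivation that provides the `abra` binary
, envFile       # path of this app's abra .env file
, ...
}:

let
  # For each variable: drop any existing line, then append the new value.
  # Values written by `abra app new` that are not overridden stay untouched.
  setEnv = lib.concatStringsSep "\n" (lib.mapAttrsToList (k: v: ''
    sed -i '/^${k}=/d' ${lib.escapeShellArg envFile}
    printf '%s\n' ${lib.escapeShellArg "${k}=${toString v}"} >> ${lib.escapeShellArg envFile}
  '') env);
in
{
  systemd.services."cc-app-${appName}" = {
    description = "Deploy Co-op Cloud recipe ${recipe} as ${appName}";
    wantedBy = [ "multi-user.target" ];
    after = [ "swarm-init.service" "deploy-proxy.service" ] ++ after;
    requires = [ "swarm-init.service" "deploy-proxy.service" ];
    path = [ abraPackage pkgs.docker ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      TimeoutStartSec = timeout;
    };
    script = ''
      # Assumption: a present .env file means the app was already created.
      if [ ! -e ${lib.escapeShellArg envFile} ]; then
        abra app new ${recipe} ${appName}
      fi
      ${setEnv}
      abra app deploy ${appName} --no-input
    '';
  };

  # A timer named X activates X.service by default; that service is not shown.
  systemd.timers."cc-app-${appName}-reconcile" = lib.mkIf autoUpdate {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = updateSchedule;
      Persistent = true;
    };
  };
}
```

Editing the env file line by line, rather than overwriting it, is a deliberate choice in this sketch so that recipe defaults survive a redeploy.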
### Per-recipe module (thin wrapper)
Each recipe becomes a file like `apps/ghost.nix`:
```nix
{ mkCcApp, ... }:
mkCcApp {
  recipe = "ghost";
  healthPath = "/ghost/api/admin/site";
  stateful = true;
}
```
An operator's NixOS config imports the recipe module, sets their domain/version, done.
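
One way the wrapper could expose the `services.coop-cloud.ghost.*` options used in the opening example is the usual NixOS `options`/`config` split. The sketch below shows only that option surface. The option descriptions and the `autoUpdate` default of `false` are assumptions, and the place where the factory's units would be merged in is marked with a comment.

```nix
# apps/ghost.nix -- illustrative sketch of the option surface only
{ lib, config, ... }:

let
  cfg = config.services.coop-cloud.ghost;
in
{
  options.services.coop-cloud.ghost = {
    enable = lib.mkEnableOption "the Co-op Cloud ghost recipe";
    domain = lib.mkOption {
      type = lib.types.str;
      example = "blog.example.org";
      description = "Public domain the app is served on.";
    };
    version = lib.mkOption {
      type = lib.types.str;
      example = "1.3.0+6.42.0-alpine";
      description = "Recipe version to deploy.";
    };
    autoUpdate = lib.mkOption {
      type = lib.types.bool;
      default = false; # assumption: opt-in rather than opt-out
      description = "Whether to enable the nightly reconcile timer.";
    };
  };

  config = lib.mkIf cfg.enable {
    # The units produced by the factory would be merged in here.
    assertions = [{
      assertion = cfg.domain != "";
      message = "services.coop-cloud.ghost.domain must not be empty";
    }];
  };
}
```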
### Health-gated upgrade/rollback
Modelled directly on `cc-ci/runner/warm_reconcile.py`. The reconcile oneshot:
```
read running version (last-good)
fetch latest available version via abra
if running == latest → health-check → update last-good if healthy → exit
if major-version jump → hold + alert, no deploy
record last-good = current
if stateful:
    abra app undeploy
    abra app snapshot (data volumes)
abra app upgrade → latest
wait for healthPath to return healthOk (up to healthTimeout)
if healthy:
    write last-good = latest
if unhealthy:
    if stateful: abra app restore snapshot
    abra app deploy last-good version
    write alert sentinel to /var/lib/coop-cloud/alerts/<appName>.json
```
For stateless apps (e.g. traefik, custom-html) the snapshot/restore steps are skipped — only the
version is rolled back.
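
The "wait for healthPath" step could be built by Nix as a small polling script, roughly as below. This is a sketch under assumptions: the function itself is hypothetical, it assumes the app is reachable over HTTPS at `domain` (via Traefik), that `date` and `sleep` are on `PATH` (true for NixOS systemd services by default), and the 5-second poll interval and 10-second per-request timeout are arbitrary.

```nix
# lib/mkHealthGate.nix -- illustrative sketch only
{ pkgs }:

{ domain, healthPath, healthOk ? [ 200 ], healthTimeout ? 120 }:

let
  url = "https://${domain}${healthPath}"; # assumption: served over HTTPS
  accepted = toString healthOk;           # e.g. [ 200 204 ] -> "200 204"
in
pkgs.writeShellScript "cc-health-gate" ''
  deadline=$(( $(date +%s) + ${toString healthTimeout} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # curl prints only the status code; "000" means the request itself failed.
    code=$(${pkgs.curl}/bin/curl -s -o /dev/null -w '%{http_code}' \
      --max-time 10 ${pkgs.lib.escapeShellArg url} || true)
    # Succeed as soon as the code is one of the accepted values.
    case " ${accepted} " in
      *" $code "*) exit 0 ;;
    esac
    sleep 5
  done
  # Timed out without seeing an accepted status code.
  exit 1
''
```

The reconcile oneshot would run this after `abra app upgrade` and take the rollback branch on a non-zero exit status.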
### Swarm bootstrap
A `coop-cloud-base.nix` module (imported once by the host, not per-app) handles:
- `virtualisation.docker.enable = true`
- `swarm-init` oneshot (identical to `cc-ci/nix/modules/swarm.nix`)
- `deploy-proxy` oneshot for the traefik recipe
All per-app services order after `deploy-proxy.service`.
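
As a sketch of what the base module might contain (not a copy of `cc-ci/nix/modules/swarm.nix`, which is not reproduced in this document), the Swarm bootstrap can be written as an idempotent oneshot. Advertising on `127.0.0.1` is an assumption that only makes sense for a single-node Swarm; the `deploy-proxy` unit is omitted.

```nix
# coop-cloud-base.nix -- illustrative sketch of the swarm-init part only
{ pkgs, ... }:
{
  virtualisation.docker.enable = true;

  systemd.services.swarm-init = {
    description = "Initialise single-node Docker Swarm and the proxy overlay network";
    wantedBy = [ "multi-user.target" ];
    after = [ "docker.service" ];
    requires = [ "docker.service" ];
    path = [ pkgs.docker ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };
    script = ''
      # Only initialise the Swarm if this node is not already part of one.
      state=$(docker info --format '{{.Swarm.LocalNodeState}}')
      if [ "$state" != "active" ]; then
        # Assumption: single node, so the loopback address is sufficient.
        docker swarm init --advertise-addr 127.0.0.1
      fi

      # Create the shared overlay network once; later runs are a no-op.
      if ! docker network inspect proxy >/dev/null 2>&1; then
        docker network create --driver overlay proxy
      fi
    '';
  };
}
```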
---
## Secrets model
The cc-ci approach is sops-nix: secrets live in a git-tracked encrypted `secrets.yaml`, decrypted
at activation by the host's SSH key (age identity). That's the right model for operator use too —
no out-of-band secret drops. Each `mkCcApp` call can declare its secrets:
```nix
secrets = {
  db_password = { sopsPath = "ghost_db_password"; };
  smtp_password = { sopsPath = "ghost_smtp_password"; };
};
```
The factory generates the `sops.secrets.ghost_db_password` entry and wires the decrypted path into
the abra `.env` file (or a swarm secret, depending on how the recipe reads it).
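
The mapping from that `secrets` attrset to sops-nix entries could be as small as the following. This is a sketch: the helper and its file name are hypothetical, and it only generates the `sops.secrets` declarations; wiring the decrypted paths into the `.env` file is left out.

```nix
# lib/mkSecrets.nix -- illustrative sketch only
{ lib }:

# secrets: { <name> = { sopsPath = "<key in secrets.yaml>"; }; ... }
secrets:
{
  # One sops.secrets entry per declared secret, keyed by its sopsPath.
  sops.secrets =
    lib.mapAttrs' (_name: s: lib.nameValuePair s.sopsPath { }) secrets;
}
```

Applied to the example above, this evaluates to `{ sops.secrets = { ghost_db_password = { }; ghost_smtp_password = { }; }; }`. At activation, sops-nix exposes each decrypted file's location as `config.sops.secrets.<key>.path`, which is what the factory would reference.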
---
## Open questions
1. **Abra state vs Nix store.** Abra keeps its own mutable state under `~/.abra/`, and the Nix
   module writes `.env` files there at deploy time. Mutable state outside the store is slightly
   un-Nix, but it is how `cc-ci` works today and is acceptable for single-node operators.
2. **Version pinning vs autoUpdate.** If `autoUpdate = false`, the operator pins a version in their
NixOS config and upgrades by bumping the string and running `nixos-rebuild switch`. Clean model.
If `autoUpdate = true`, the reconciler diverges from the declared version — the Nix config
becomes the floor ("at least this version") rather than the exact pin. Worth documenting this
tension.
3. **Recipe flake vs per-operator flake.** Two distribution models:
   - A single public `coop-cloud-nix` flake with all 18 recipes, which operators add as an input.
   - Operators fork/extend.

   Probably start with the single public flake; per-recipe modules stay thin enough that forks
   are easy.
4. **Recipes without a clean health endpoint.** Some apps (mumble, mailu) don't have a simple
HTTP health path. The `healthPath = null` case would skip the gate and just wait for the swarm
service to stabilise — weaker but still useful.
5. **Relationship to Co-op Cloud upstream.** This is a parallel deployment interface for the same
recipes, not a fork. Recipe compose.yml files stay upstream. The flake just wraps them. Worth
coordinating with the Co-op Cloud maintainers rather than building in isolation.
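
For the `healthPath = null` case in question 4, "wait for the swarm service to stabilise" might look like the sketch below, which polls `docker stack services` until every service reports running replicas equal to desired replicas. The helper is hypothetical, `stackName` is an assumed parameter (how abra names the stack for an app is not covered here), and replicated jobs or other unusual replica formats are not handled.

```nix
# lib/mkSwarmGate.nix -- illustrative sketch only
{ pkgs }:

{ stackName, timeout ? 120 }:

pkgs.writeShellScript "cc-swarm-settled" ''
  deadline=$(( $(date +%s) + ${toString timeout} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # Each output line looks like "1/1" (running/desired replicas).
    # awk fails if any line has running != desired, or if there are no lines.
    if ${pkgs.docker}/bin/docker stack services --format '{{.Replicas}}' \
         ${pkgs.lib.escapeShellArg stackName} \
       | ${pkgs.gawk}/bin/awk -F/ '$1 != $2 { bad = 1 } END { exit (bad || NR == 0) }'; then
      exit 0
    fi
    sleep 5
  done
  # Timed out before all services reached their desired replica count.
  exit 1
''
```

This only shows that containers are running, not that the app answers requests, which is why the document calls it the weaker gate.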
---
## Recipes to cover (the 18 maintained)
bluesky-pds, cryptpad, custom-html, custom-html-tiny, discourse, ghost, hedgedoc, immich,
keycloak, lasuite-docs, lasuite-drive, lasuite-meet, mailu, matrix-synapse, mattermost-lts,
mumble, n8n, plausible, uptime-kuma.
Notable gaps vs nixpkgs native modules: ghost (no nixpkgs module), mailu (no nixpkgs module).
Most of the others have native modules, but the container-isolation argument still applies.