ideas: Co-op Cloud NixOS modules — mkCcApp factory + health-gated rollback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
223
ideas/coop-cloud-nixos-modules.md
Normal file
@@ -0,0 +1,223 @@
# Idea: Co-op Cloud NixOS modules

**Status:** research / pre-design. Not started.
**Origin:** conversation 2026-06-02 between mfowler and the assistant.

---

## The idea

A public Nix flake that lets NixOS operators deploy Co-op Cloud apps declaratively (tracked in git,
applied with `nixos-rebuild switch`) instead of through imperative `abra` commands. Each app is a
thin NixOS module backed by a shared `mkCcApp` factory. Docker Swarm still does the actual container
work, so the container-isolation story is unchanged; Nix manages *what* is deployed and *at what
version*.

```nix
# In a user's NixOS configuration.nix:
services.coop-cloud.ghost = {
  enable = true;
  domain = "blog.example.org";
  version = "1.3.0+6.42.0-alpine";
  autoUpdate = true;
};
```

---

## Why this over native NixOS modules

NixOS already has native service modules for ~14 of the 18 maintained recipes (matrix-synapse,
keycloak, nextcloud, jitsi-meet, hedgedoc, immich, etc.). The argument for doing it this way
instead:

**Container isolation is an advantage, not a legacy.** Native NixOS modules run as systemd units
that, unless explicitly hardened, share the host's namespaces. Containers give hard network and
filesystem isolation between apps: a compromised Ghost instance cannot see other services' sockets.
For a single-node multi-app host that matters.

**The recipes already exist and are maintained.** The Co-op Cloud recipe ecosystem (compose.yml,
tested version combinations, backup hooks, Traefik wiring) is the real value. Wrapping it in Nix
preserves that investment rather than reimplementing it as native modules.

**The target user.** Most Co-op Cloud operators run a single node. They want the isolation and
curation of Co-op Cloud recipes but a declarative, git-tracked, idempotent deployment model rather
than `abra`'s imperative one. This gives them that without giving up containers.

---

## Existing art

- **Arion** (hercules-ci/arion): runs docker-compose via NixOS modules. Could be an
  implementation backend but requires a Docker daemon; adds a layer.
- **compose2nix** (aksiksi/compose2nix): converts compose.yml into NixOS
  `virtualisation.oci-containers` configs, with each container run as a systemd service.
  Loses the Docker Swarm isolation model.
- **No existing Co-op Cloud + Nix bridge** was found. The space is open.

---

## Proof of concept already in cc-ci

The cc-ci NixOS config (`cc-ci/nix/modules/`) already implements this pattern for its own internal
services. The key modules:

| Module | Pattern |
|---|---|
| `swarm.nix` | Enables Docker, initialises single-node Swarm + `proxy` overlay network as a systemd oneshot |
| `proxy.nix` | Deploys the Co-op Cloud `traefik` recipe via `abra app deploy`, health-gated |
| `warm-keycloak.nix` | Deploys keycloak via `abra app deploy`, with snapshot→upgrade→health-gate→rollback |
| `nightly-sweep.nix` | systemd timer + oneshot that runs nightly upgrades across all warm apps |

These are bespoke (hard-coded app names, cc-ci-specific reconcile scripts) but the structure is
exactly what a general `mkCcApp` would produce. The flake idea is: extract and parameterise this
pattern, one thin wrapper per recipe.

---

## Proposed design

### `mkCcApp`: the shared factory

A function in `lib/mkCcApp.nix` that takes per-recipe parameters and returns a NixOS module
(attrset of `systemd.services`, `systemd.timers`, `sops.secrets`, etc.):

```nix
mkCcApp {
  recipe = "ghost";                      # Co-op Cloud recipe name
  appName = "ghost";                     # abra app name (often == recipe)
  domain = "blog.example.org";
  version = "1.3.0+6.42.0-alpine";
  env = { MAIL_TRANSPORT = "SMTP"; };    # extra .env vars
  healthPath = "/ghost/api/admin/site";  # HTTP path for health gate
  healthOk = [ 200 ];
  healthTimeout = 120;
  stateful = true;                       # snapshot data volumes before upgrade
  autoUpdate = true;                     # add a nightly timer
  updateSchedule = "03:00:00";           # systemd OnCalendar time
  after = [];                            # extra systemd ordering deps
  timeout = 600;                         # deploy timeout in seconds
}
```

This emits:

1. **`systemd.services.cc-app-<appName>`**, a oneshot that:
   - Creates the abra app if it doesn't exist (`abra app new <recipe> <appName>`)
   - Writes env vars into the abra `.env` file
   - Runs `abra app deploy <appName> --no-input`
   - Orders after `swarm-init.service` and `deploy-proxy.service`

2. **`systemd.services.cc-app-<appName>-reconcile`** (if `autoUpdate = true`), a oneshot that
   implements the health-gated upgrade/rollback loop (see below), driven by a timer.

3. **`systemd.timers.cc-app-<appName>-reconcile`** (if `autoUpdate = true`), which fires the
   reconcile service on `updateSchedule`.

4. **`sops.secrets.*`** entries for any declared secrets, wired to paths the abra env file
   references.
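
As a rough illustration of the first three outputs listed above, a minimal sketch of
`lib/mkCcApp.nix` could look like the following. It is not the cc-ci code. It assumes, following
the `services.coop-cloud.ghost` example, that `domain`, `version`, `autoUpdate` and
`updateSchedule` are exposed as module options rather than factory arguments. `abraPackage` is a
hypothetical parameter, since this note does not say how `abra` reaches the host. App creation,
`.env` rendering, the reconcile logic and the `sops.secrets` entries are deliberately elided.

```nix
# lib/mkCcApp.nix: illustrative sketch, not the cc-ci implementation.
# Takes per-recipe parameters and returns a NixOS module.
{ recipe
, appName ? recipe
, after ? [ ]
, timeout ? 600
, abraPackage ? null   # hypothetical: a derivation providing the `abra` binary
}:
{ config, lib, pkgs, ... }:
let
  cfg = config.services.coop-cloud.${appName};
  unit = "cc-app-${appName}";
  extraAfter = after;
  toolPath = [ pkgs.docker ] ++ lib.optional (abraPackage != null) abraPackage;
in
{
  options.services.coop-cloud.${appName} = {
    enable = lib.mkEnableOption "the ${recipe} Co-op Cloud recipe";
    domain = lib.mkOption { type = lib.types.str; };
    version = lib.mkOption { type = lib.types.str; };
    autoUpdate = lib.mkOption { type = lib.types.bool; default = false; };
    updateSchedule = lib.mkOption { type = lib.types.str; default = "03:00:00"; };
  };

  config = lib.mkIf cfg.enable {
    # 1. Deploy oneshot, ordered after the Swarm and proxy bootstrap units.
    systemd.services.${unit} = {
      description = "Deploy Co-op Cloud app ${appName}";
      wantedBy = [ "multi-user.target" ];
      after = [ "swarm-init.service" "deploy-proxy.service" ] ++ extraAfter;
      path = toolPath;
      serviceConfig = {
        Type = "oneshot";
        RemainAfterExit = true;
        TimeoutStartSec = timeout;
      };
      # App creation and .env rendering are elided in this sketch.
      script = ''
        abra app deploy ${lib.escapeShellArg appName} --no-input
      '';
    };

    # 2. Reconcile oneshot; the upgrade/rollback logic itself is elided.
    systemd.services."${unit}-reconcile" = lib.mkIf cfg.autoUpdate {
      description = "Health-gated upgrade of ${appName}";
      after = [ "${unit}.service" ];
      path = toolPath;
      serviceConfig.Type = "oneshot";
      script = ''
        echo "reconcile ${appName}: not implemented in this sketch"
      '';
    };

    # 3. Timer that fires the reconcile service on the configured schedule.
    systemd.timers."${unit}-reconcile" = lib.mkIf cfg.autoUpdate {
      wantedBy = [ "timers.target" ];
      timerConfig = {
        OnCalendar = cfg.updateSchedule;
        Persistent = true;
      };
    };
  };
}
```

Gating the reconcile service and timer with `lib.mkIf cfg.autoUpdate` keeps pinned apps free of
any update machinery.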

### Per-recipe module (thin wrapper)

Each recipe becomes a file like `apps/ghost.nix`:

```nix
{ mkCcApp, ... }:
mkCcApp {
  recipe = "ghost";
  healthPath = "/ghost/api/admin/site";
  stateful = true;
}
```

An operator's NixOS config imports the recipe module, sets their domain/version, done.
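
On the operator side, consuming such a flake might look like the sketch below. The flake URL and
the `nixosModules.base` and `nixosModules.ghost` output names are hypothetical, since the flake
does not exist yet; `myhost` and `./configuration.nix` stand in for the operator's own host.

```nix
# flake.nix in the operator's repository (sketch under the assumptions above)
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # Hypothetical URL: no such flake has been published.
    coop-cloud-nix.url = "github:example/coop-cloud-nix";
  };

  outputs = { self, nixpkgs, coop-cloud-nix, ... }: {
    nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        coop-cloud-nix.nixosModules.base    # hypothetical: Swarm + proxy bootstrap
        coop-cloud-nix.nixosModules.ghost   # hypothetical: the apps/ghost.nix wrapper
        {
          services.coop-cloud.ghost = {
            enable = true;
            domain = "blog.example.org";
            version = "1.3.0+6.42.0-alpine";
          };
        }
      ];
    };
  };
}
```

Upgrading a pinned app then amounts to changing the `version` string, committing, and running
`nixos-rebuild switch`.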

### Health-gated upgrade/rollback

Modelled directly on `cc-ci/runner/warm_reconcile.py`. The reconcile oneshot:

```
read running version (last-good)
fetch latest available version via abra
if running == latest → health-check → update last-good if healthy → exit
if major-version jump → hold + alert, no deploy
record last-good = current
if stateful:
  abra app undeploy
  abra app snapshot (data volumes)
abra app upgrade → latest
wait for healthPath to return healthOk (up to healthTimeout)
if healthy:
  write last-good = latest
if unhealthy:
  if stateful: abra app restore snapshot
  abra app deploy last-good version
  write alert sentinel to /var/lib/coop-cloud/alerts/<appName>.json
```

For stateless apps (e.g. traefik, custom-html) the snapshot/restore steps are skipped; only the
version is rolled back.
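
The "wait for healthPath" step could be sketched as a small polling script built with
`pkgs.writeShellScript`. This covers the health gate only, not the surrounding upgrade or rollback.
The script name, the unit name `cc-app-ghost-healthcheck`, the 5-second poll interval and the
10-second per-request limit are illustrative assumptions, not taken from `warm_reconcile.py`.

```nix
# Fragment of a NixOS module: only the health gate is shown.
{ pkgs, ... }:
let
  # Usage: cc-wait-healthy URL TIMEOUT_SECONDS OK_CODE...
  # Exits 0 as soon as URL answers with one of the OK codes, 1 on timeout.
  waitHealthy = pkgs.writeShellScript "cc-wait-healthy" ''
    set -eu
    url="$1"
    timeout="$2"
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
      # curl prints 000 when the connection itself fails.
      code=$(${pkgs.curl}/bin/curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)
      for ok in "$@"; do
        if [ "$code" = "$ok" ]; then
          exit 0
        fi
      done
      sleep 5
    done
    exit 1
  '';
in
{
  # Hypothetical unit that runs the gate once for the ghost example.
  systemd.services.cc-app-ghost-healthcheck = {
    description = "Wait for ghost to report healthy";
    path = [ pkgs.coreutils ];
    serviceConfig = {
      Type = "oneshot";
      # Arguments: domain + healthPath, healthTimeout, healthOk codes.
      ExecStart = "${waitHealthy} https://blog.example.org/ghost/api/admin/site 120 200";
    };
  };
}
```

A non-zero exit is what would send the reconciler down the "if unhealthy" branch.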

### Swarm bootstrap

A `coop-cloud-base.nix` module (imported once by the host, not per-app) handles:

- `virtualisation.docker.enable = true`
- `swarm-init` oneshot (identical to `cc-ci/nix/modules/swarm.nix`)
- `deploy-proxy` oneshot for the traefik recipe

All per-app services order after `deploy-proxy.service`.
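
A minimal sketch of the first two items is shown below. It is an assumption about what such a
module could contain, not a copy of `cc-ci/nix/modules/swarm.nix`, and the `deploy-proxy` unit is
left out. On hosts with several network interfaces, `docker swarm init` may additionally need
`--advertise-addr`.

```nix
# coop-cloud-base.nix (sketch): Docker plus an idempotent single-node Swarm bootstrap.
{ pkgs, ... }:
{
  virtualisation.docker.enable = true;

  systemd.services.swarm-init = {
    description = "Initialise single-node Docker Swarm and the proxy overlay network";
    wantedBy = [ "multi-user.target" ];
    after = [ "docker.service" ];
    requires = [ "docker.service" ];
    path = [ pkgs.docker ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };
    script = ''
      # Only initialise the Swarm if this node is not already part of one.
      state=$(docker info --format '{{.Swarm.LocalNodeState}}')
      if [ "$state" != "active" ]; then
        docker swarm init
      fi
      # Create the shared overlay network on first run only.
      if ! docker network inspect proxy >/dev/null 2>&1; then
        docker network create --driver overlay proxy
      fi
    '';
  };
}
```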

---

## Secrets model

The cc-ci approach is sops-nix: secrets live in a git-tracked encrypted `secrets.yaml`, decrypted
at activation using the host's SSH key as the age identity. That's the right model for operator use
too: no out-of-band secret drops. Each `mkCcApp` call can declare its secrets:

```nix
secrets = {
  db_password = { sopsPath = "ghost_db_password"; };
  smtp_password = { sopsPath = "ghost_smtp_password"; };
};
```

The factory generates the `sops.secrets.ghost_db_password` entry and wires the decrypted path into
the abra `.env` file (or a swarm secret, depending on how the recipe reads it).
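
One way the factory might turn that attrset into `sops.secrets` entries is sketched below. It
assumes the sops-nix NixOS module is already imported and that `./secrets.yaml` sits next to the
module; both are assumptions, and the `.env` wiring is not shown. The decrypted file for each entry
would then be available as `config.sops.secrets.<name>.path`.

```nix
# Fragment of a NixOS module (requires the sops-nix module to be imported).
{ lib, ... }:
let
  # Same shape as the `secrets` attrset declared on a mkCcApp call.
  secrets = {
    db_password = { sopsPath = "ghost_db_password"; };
    smtp_password = { sopsPath = "ghost_smtp_password"; };
  };
in
{
  sops.defaultSopsFile = ./secrets.yaml;  # assumed location of the encrypted file

  # Produces sops.secrets.ghost_db_password and sops.secrets.ghost_smtp_password,
  # each with sops-nix defaults (the lookup key defaults to the attribute name).
  sops.secrets = lib.mapAttrs'
    (_name: secret: lib.nameValuePair secret.sopsPath { })
    secrets;
}
```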

---

## Open questions

1. **Abra state vs Nix store.** Abra manages its own state under `~/.abra/`. The Nix module
   writes `.env` files there at deploy time. This is slightly un-Nix (mutable state outside the
   store), but it's how `cc-ci` works today and it's fine for single-node operators.

2. **Version pinning vs autoUpdate.** If `autoUpdate = false`, the operator pins a version in their
   NixOS config and upgrades by bumping the string and running `nixos-rebuild switch`. Clean model.
   If `autoUpdate = true`, the reconciler diverges from the declared version: the Nix config
   becomes the floor ("at least this version") rather than the exact pin. Worth documenting this
   tension.

3. **Recipe flake vs per-operator flake.** Two distribution models:
   - (A) A single public `coop-cloud-nix` flake with all 18 recipes; operators add it as an input.
   - (B) Operators fork/extend.

   Probably start with option A; per-recipe modules stay thin enough that forks are easy.

4. **Recipes without a clean health endpoint.** Some apps (mumble, mailu) don't have a simple
   HTTP health path. The `healthPath = null` case would skip the gate and just wait for the swarm
   service to stabilise, which is weaker but still useful.

5. **Relationship to Co-op Cloud upstream.** This is a parallel deployment interface for the same
   recipes, not a fork. Recipe compose.yml files stay upstream. The flake just wraps them. Worth
   coordinating with the Co-op Cloud maintainers rather than building in isolation.

---

## Recipes to cover (the 18 maintained)

bluesky-pds, cryptpad, custom-html, custom-html-tiny, discourse, ghost, hedgedoc, immich,
keycloak, lasuite-docs, lasuite-drive, lasuite-meet, mailu, matrix-synapse, mattermost-lts,
mumble, n8n, plausible, uptime-kuma.

Notable gaps vs nixpkgs native modules: ghost (no nixpkgs module), mailu (no nixpkgs module).
Most of the rest have native modules, but the container-isolation argument still applies.