From b700cd2fda9e8eec5a26c637317e2481726d40f2 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Wed, 27 May 2026 19:52:03 +0100 Subject: [PATCH] =?UTF-8?q?1c/C7:=20docs=20=E2=80=94=20secrets.md=20+=20ar?= =?UTF-8?q?chitecture.md=20updated=20to=20the=201c=20model=20(cc-ci-secret?= =?UTF-8?q?s=20submodule,=20cert-in-git,=20bootstrap=20age=20key,=20Drone-?= =?UTF-8?q?token=20injection,=20verified=20D8)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/architecture.md | 17 +++++++---- docs/secrets.md | 70 ++++++++++++++++++++++++++++---------------- 2 files changed, 55 insertions(+), 32 deletions(-) diff --git a/docs/architecture.md b/docs/architecture.md index 943a65c..85855f2 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -11,14 +11,18 @@ reports the result back. Everything on the `cc-ci` host is declared in this repo | **Drone server** | `modules/drone.nix` — coop-cloud `drone` recipe via abra (`drone.ci.commoninternet.net`, Gitea SSO) | CI engine. Holds the `recipe-ci` (custom-event) and `self-test` (push) pipelines (`.drone.yml`). | | **Drone exec runner** | `modules/drone-runner.nix` — host systemd service | Runs pipeline steps **on the host** so they can drive `abra`/Docker. `DRONE_RUNNER_CAPACITY=1` (MAX_TESTS) caps concurrent builds; the rest queue natively. | | **harness** | `runner/run_recipe_ci.py` + `runner/harness/` + `tests/` | Orchestrates per run: fetch recipe at the PR head → install → upgrade → backup/restore → recipe-local (D4) → guaranteed teardown. pytest + Playwright via the Nix `cc-ci-run` env. | -| **swarm + traefik** | `modules/swarm.nix`, `modules/proxy.nix` — coop-cloud `traefik` recipe via abra | Single-node Docker Swarm + `proxy` overlay; traefik terminates TLS with the pre-issued wildcard cert (file provider, **no ACME**). The real deploy target for recipes-under-test. 
| +| **swarm + traefik** | `modules/swarm.nix`, `modules/proxy.nix` — coop-cloud `traefik` recipe via abra | Single-node Docker Swarm + `proxy` overlay; traefik terminates TLS with the wildcard cert (**sops-decrypted from git** to `/var/lib/ci-certs/live`, file provider, **no ACME**). The real deploy target for recipes-under-test. | | **backup-bot-two** | `modules/backupbot.nix` | restic-based volume/DB backups; `abra app backup/restore` drive it. | | **dashboard** | `dashboard/dashboard.py`, `modules/dashboard.nix` (`ci.commoninternet.net`) | YunoHost-CI-like overview: latest run per recipe + status badges + run links; `/badge/.svg`. | -| **secrets** | `modules/secrets.nix` + `secrets/secrets.yaml` (sops-nix) | Infra secrets, decrypted at activation via the host SSH key as the age identity. See `secrets.md`. | +| **secrets** | `modules/secrets.nix` + `secrets/` = **`cc-ci-secrets` submodule** (sops-nix) | ALL secrets incl. the **wildcard cert** are sops-encrypted in the private `cc-ci-secrets` repo (a submodule); decrypted at activation via the bootstrap age key (`sops.age.keyFile` + host SSH key). The base repo holds no secrets. See `secrets.md`. | All swarm infra (traefik, drone, bridge, dashboard, backupbot) is brought up by **idempotent-reconcile -systemd oneshots** that converge on every activation/boot (no run-once sentinels) — so a from-scratch -install is `git clone` + `nixos-rebuild switch` + the operator preconditions (`install.md`). +systemd oneshots** that converge on every activation/boot (no run-once sentinels), **serialized** +(proxy→drone→bridge→dashboard→backupbot) so a single switch converges on a blank host — so a +from-scratch install is `git clone --recursive` + provision the one bootstrap age key + +`nixos-rebuild switch` + the external DNS/gateway (`install.md`). 
**Phase-1c verified this on a real +throwaway VM (D8): blank host + the two repos + the age key → a fully-converged cc-ci that serves a +real `!testme` run end-to-end over the public domain.** ## The `!testme` flow @@ -37,8 +41,9 @@ PR comment "!testme" ## Network & TLS (see install.md §domain) `*.ci.commoninternet.net` (and bare `ci.commoninternet.net`) resolve to an operator **gateway** that -**TLS-passthroughs** by SNI to cc-ci. cc-ci's traefik terminates TLS with the **pre-issued wildcard -cert** at `/var/lib/ci-certs/live/` (no ACME, no DNS token on the box). Each run gets a unique short +**TLS-passthroughs** by SNI to cc-ci. cc-ci's traefik terminates TLS with the **wildcard cert +sops-decrypted from git** (`cc-ci-secrets`) to `/var/lib/ci-certs/live/` (no ACME, no DNS token on the +box; operator re-issues + re-commits to rotate). Each run gets a unique short subdomain `-<6hex>.ci.commoninternet.net` (covered by the wildcard) so concurrent/serial runs never collide; it's torn down at run end. diff --git a/docs/secrets.md b/docs/secrets.md index 8841fa9..4b758b0 100644 --- a/docs/secrets.md +++ b/docs/secrets.md @@ -5,18 +5,31 @@ secret ever lives in git, logs, or the results UI** — only sops-encrypted ciph references-by-location. The Adversary's leak test greps published Drone logs + the dashboard for known secret patterns and any generated app password; it must find nothing. -## Decryption chain (sops-nix) +## Where secrets live (Phase-1c: a private companion repo) -- Infra secrets live sops-encrypted in `secrets/secrets.yaml` (committed). `/.sops.yaml` lists two - age recipients: the **host key** (`age1h90ut…`, derived from cc-ci's SSH host key via ssh-to-age) - and an off-box **master recovery key** (`age1cmk26t…`; its private half is kept only at - `/srv/cc-ci/.sops/master-age.txt` on the build host, never in the repo). 
-- On cc-ci, `sops-nix` decrypts at activation using the host's ed25519 SSH host key as the age - identity (`sops.age.sshKeyPaths = ["/etc/ssh/ssh_host_ed25519_key"]`), materialising each secret at - `/run/secrets/` (mode 0400 root). No extra key file to manage on the box. +All sops-encrypted secret material — including the **wildcard TLS cert+key** — lives in a **separate +private repo `recipe-maintainers/cc-ci-secrets`**, mounted into this repo as a **git submodule at +`secrets/`** (so the base resolves `secrets/secrets.yaml`). The base `cc-ci` repo holds **no secrets**, +only code/config + instance parameters; `secrets/.sops.yaml` (in the submodule) lists the two age +recipients: the **host key** (`age1h90ut…`, cc-ci's SSH host key via ssh-to-age) and the off-box +**master/recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on the +build host / provisioned to a fresh host — never in either repo). Clone with `git clone --recursive` +(bot/deploy creds for the private submodule); build with `?submodules=1` (see docs/install.md). + +## Decryption chain (sops-nix) — the ONE out-of-band secret + +- **Bootstrap age key (the only secret not in git):** provisioned to `/var/lib/sops-nix/key.txt` + (0600) before the first rebuild. `sops.age.keyFile` points there; `sops.age.sshKeyPaths` also offers + cc-ci's SSH host key. On the canonical cc-ci the keyFile holds the host-derived age identity + (`ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key`, == the `host` recipient); on a + fresh/cloned host whose SSH key is NOT a recipient (e.g. the throwaway rebuild), it holds the + **recovery key** — so any host decrypts every secret. (sops-install-secrets aborts if a configured + keyFile is missing, so it must exist before `nixos-rebuild`.) +- `sops-nix` decrypts at activation into `/run/secrets/` (ramfs, mode 0400 root). 
The wildcard + cert/key are placed at `/var/lib/ci-certs/live/{fullchain,privkey}.pem` (symlinks → /run/secrets) via + `sops.secrets..path` — the path traefik reads (no out-of-band cert file). - Swarm services don't read `/run/secrets` directly; the reconcile oneshots copy each into a **docker - swarm secret** (`docker secret create … /run/secrets/`) which the service mounts. abra-managed - apps use `abra app secret …`. + swarm secret** which the service mounts. abra-managed apps use `abra app secret …`. ## Class A1 — external inputs (operator-provided; the loop CANNOT create them) @@ -25,19 +38,23 @@ known secret patterns and any generated app password; it must find nothing. | Tailscale auth key | `/srv/cc-ci/.testenv` (sandbox) | operator re-issues; re-run `tailscale up` | | cc-ci SSH root key | `~/.ssh/cc-ci-root-ed25519` (sandbox) | operator re-keys `authorized_keys` | | Gitea bot creds | `/srv/cc-ci/.testenv` (`GITEA_USERNAME/PASSWORD`) | operator resets; update `.testenv` | -| **Wildcard TLS cert** | cc-ci `/var/lib/ci-certs/live/{fullchain,privkey}.pem` | **operator** re-issues (LE DNS-01/Gandi, ~90d, next ~2026-08-24) — see below | -| Registry pull creds (if needed) | sops `secrets/secrets.yaml` | operator-provided | +| **Bootstrap age key** | host `/var/lib/sops-nix/key.txt` (0600) — **the one out-of-band secret** | host-derived (cc-ci) or recovery key (clone); re-provision on host re-key | +| **Wildcard TLS cert+key** | sops in **`cc-ci-secrets`** → decrypted to `/var/lib/ci-certs/live/` | operator re-issues then **commits the new cert into `cc-ci-secrets`** (see below) | +| Registry pull creds (if needed) | sops `cc-ci-secrets/secrets.yaml` | operator-provided | A missing/invalid A1 secret is a `## Blocked` condition — the agent never invents or works around it, -and **never** runs ACME/DNS-01 for commoninternet.net. +and **never** runs ACME/DNS-01 for commoninternet.net. 
(Phase-1c: the cert is now *committed encrypted* +in `cc-ci-secrets`, not dropped as a file — but issuance is still operator-only; the Gandi token never +touches the repo or the box.) -**Wildcard cert rotation (manual, operator + agent):** +**Wildcard cert rotation (operator; the cert now lives in git):** 1. Operator re-issues the SAN cert (`*.ci.commoninternet.net` + `ci.commoninternet.net`) out-of-band - and writes it to `/var/lib/ci-certs/live/{fullchain,privkey}.pem` on cc-ci. -2. Bump `SECRET_WILDCARD_CERT_VERSION` / `SECRET_WILDCARD_KEY_VERSION` on the traefik app env - (modules/proxy.nix) so the next reconcile inserts the new cert as a fresh swarm secret version. -3. `nixos-rebuild switch` (re-runs the proxy reconcile → re-inserts + redeploys traefik). One cert - covers every per-run subdomain (SNI), so no per-app cert work. + (LE DNS-01/Gandi, ~90d, next ~2026-08-24). +2. Re-encrypt it into the secrets repo: `sops cc-ci-secrets/secrets.yaml` and replace + `wildcard_cert` / `wildcard_key` (each a PEM block scalar); commit + push `cc-ci-secrets`, bump the + base submodule pointer. +3. `nixos-rebuild switch`: sops re-writes `/var/lib/ci-certs/live/*` from git; the proxy reconcile + re-inserts the swarm secret + redeploys traefik. One cert covers every per-run subdomain (SNI). 
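As a quick illustration of why one SAN cert covers every per-run subdomain, the sketch below applies the usual TLS wildcard rule: a leading `*.` matches exactly one DNS label. It is not part of the repo. The helper name `san_covers` and the example recipe name `gitea` are hypothetical, and the SAN list is copied from rotation step 1 above.

```python
def san_covers(hostname: str, sans: list[str]) -> bool:
    """Return True if any SAN entry matches the hostname.

    A leading "*." matches exactly one DNS label, so "*.ci.example"
    covers "a.ci.example" but neither "ci.example" nor "a.b.ci.example".
    """
    host = hostname.lower().rstrip(".")
    for san in sans:
        san = san.lower().rstrip(".")
        if san.startswith("*."):
            suffix = san[1:]  # e.g. ".ci.commoninternet.net"
            if host.endswith(suffix):
                label = host[: -len(suffix)]
                # The wildcard must stand for one non-empty label only.
                if label and "." not in label:
                    return True
        elif host == san:
            return True
    return False


# SAN list taken from the rotation steps; the hostnames are illustrative.
SANS = ["*.ci.commoninternet.net", "ci.commoninternet.net"]

if __name__ == "__main__":
    for name in (
        "gitea-a1b2c3.ci.commoninternet.net",  # hypothetical per-run subdomain
        "ci.commoninternet.net",               # bare domain (dashboard)
        "a.b.ci.commoninternet.net",           # two labels deep: not covered
    ):
        print(name, san_covers(name, SANS))
```

The bare `ci.commoninternet.net` entry is needed because the wildcard alone does not match the apex name, which is presumably why the cert is issued with both SANs.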
## Class A2 — internal infra secrets (the loop GENERATES + manages; never a blocker) @@ -48,14 +65,15 @@ All sops-encrypted in `secrets/secrets.yaml`, decrypted to `/run/secrets/` | `drone_rpc_secret` | Drone server ↔ exec runner RPC | `openssl rand -hex 32` | | `drone_gitea_client_secret` | Drone↔Gitea OAuth app | from the Gitea OAuth app creation | | `bridge_webhook_hmac` | comment-bridge webhook HMAC | `openssl rand -hex 32` | -| `bridge_drone_token` | bridge + dashboard → Drone API | minted Drone user token | +| `bridge_drone_token` | bridge + dashboard → Drone API | hex token; **injected as the bot's Drone machine token** via `DRONE_USER_CREATE=…,token:$(cat /run/secrets/bridge_drone_token)` (modules/drone.nix) so it's reproducible on a fresh Drone DB (else the bridge gets 401 on a clean-room rebuild) | | `bridge_gitea_token` | bridge → Gitea API (poll/comment) | minted Gitea token (bot) | | `restic_password` | backup-bot-two restic repo | **abra-generated** (`abra app secret generate`, kept stable across reconciles) | **Rotate an A2 secret** (e.g. `bridge_webhook_hmac`): -1. `set -a; . /srv/cc-ci/.testenv; set +a` (for the editor key, not echoed). -2. In the repo: `sops secrets/secrets.yaml` → replace the value (or `openssl rand -hex 32 | …`), - save. (Re-encrypts to both recipients automatically per `.sops.yaml`.) +1. Have an age identity that is a recipient (the host key via ssh-to-age, or the recovery key). +2. In the **`cc-ci-secrets`** submodule: `sops secrets.yaml` → replace the value (or + `openssl rand -hex 32`), save (re-encrypts to both recipients per its `.sops.yaml`); commit + push + `cc-ci-secrets`, then bump the base repo's submodule pointer (`git add secrets && git commit`). 3. For swarm-secret-backed values, **bump the consuming app's secret version** so the reconcile re-creates the swarm secret (docker swarm secrets are immutable): e.g. drone `RPC_SECRET_VERSION` v1→v2 (modules/drone.nix), bridge `cc_ci_bridge_*_v` (modules/bridge.nix). 
Update both ends @@ -64,9 +82,9 @@ All sops-encrypted in `secrets/secrets.yaml`, decrypted to `/run/secrets/` 5. Verify: the consuming service is healthy and re-auth works (e.g. a fresh build triggers). **Re-key sops recipients** (e.g. cc-ci host re-provisioned → new host age key): add the new -`age1…` to `/.sops.yaml`, `sops updatekeys secrets/secrets.yaml` (run from the build host, which -holds the master identity), commit. The master recovery key lets you re-encrypt even if the host key -is lost. +`age1…` to `cc-ci-secrets/.sops.yaml`, `sops updatekeys secrets.yaml` (run with the master identity), +commit `cc-ci-secrets` + bump the submodule pointer. The master/recovery key lets you re-encrypt even +if the host key is lost — and is itself the bootstrap key a fresh host uses (`/var/lib/sops-nix/key.txt`). ## Class B — recipe app secrets (the harness generates per run; NEVER a blocker)