Secrets model & rotation (D6)

cc-ci handles three classes of secret in deliberately different ways (plan §4.4). No plaintext secret ever lives in git, logs, or the results UI — only sops-encrypted ciphertext and references-by-location. The Adversary's leak test greps published Drone logs + the dashboard for known secret patterns and any generated app password; it must find nothing.

Where secrets live (Phase-1c: a private companion repo)

All sops-encrypted secret material, including the wildcard TLS cert+key, lives in a separate private repo, recipe-maintainers/cc-ci-secrets, mounted into this repo as a git submodule at secrets/ (so the base resolves secrets/secrets.yaml). The base cc-ci repo holds no secrets, only code/config + instance parameters. secrets/.sops.yaml (in the submodule) lists the two age recipients: the host key (age1h90ut…, cc-ci's SSH host key converted via ssh-to-age) and the off-box master/recovery key (age1cmk26t…). The recovery key's private half exists only at /srv/cc-ci/.sops/master-age.txt on the build host, or provisioned to a fresh host; it is never in either repo. Clone with git clone --recursive (using bot/deploy creds for the private submodule); build with ?submodules=1 (see docs/install.md).

Decryption chain (sops-nix) — the ONE out-of-band secret

  • Bootstrap age key (the only secret not in git): provisioned to /var/lib/sops-nix/key.txt (0600) before the first rebuild. sops.age.keyFile points there; sops.age.sshKeyPaths also offers cc-ci's SSH host key. On the canonical cc-ci the keyFile holds the host-derived age identity (ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key, which matches the host recipient). On a fresh or cloned host whose SSH key is NOT a recipient (e.g. the throwaway rebuild), it holds the recovery key instead, so any host can decrypt every secret. sops-install-secrets aborts if a configured keyFile is missing, so the file must exist before nixos-rebuild.
  • sops-nix decrypts at activation into /run/secrets/<name> (ramfs, mode 0400 root). The wildcard cert/key are placed at /var/lib/ci-certs/live/{fullchain,privkey}.pem (symlinks → /run/secrets) via sops.secrets.<name>.path — the path traefik reads (no out-of-band cert file).
  • Swarm services don't read /run/secrets directly; the reconcile oneshots copy each into a docker swarm secret which the service mounts. abra-managed apps use abra app secret ….
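The requirement that the key file exist (with mode 0600) before nixos-rebuild can be checked up front. The following is a minimal pre-flight sketch, not part of cc-ci: the function name check_bootstrap_key is hypothetical, and reading the mode from ls -ld output is an illustrative choice made for portability.

```shell
# Hypothetical pre-flight helper (not part of cc-ci): verify that the
# bootstrap age key is in place before running nixos-rebuild.
check_bootstrap_key() {
    key_file=${1:-/var/lib/sops-nix/key.txt}

    # -s: the file exists and is non-empty.
    if [ ! -s "$key_file" ]; then
        echo "bootstrap key missing or empty: $key_file" >&2
        return 1
    fi

    # The first 10 characters of `ls -ld` are the type and permission bits;
    # a regular file with mode 0600 reads "-rw-------".
    perms=$(ls -ld "$key_file" | cut -c1-10)
    if [ "$perms" != "-rw-------" ]; then
        echo "bootstrap key has permissions $perms, expected -rw-------: $key_file" >&2
        return 1
    fi
    return 0
}
```

Called with no argument it checks the default path, so a wrapper could run check_bootstrap_key and proceed to nixos-rebuild switch only when it succeeds.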

Class A1 — external inputs (operator-provided; the loop CANNOT create them)

| Secret | Location | Rotation |
|---|---|---|
| Tailscale auth key | /srv/cc-ci/.testenv (sandbox) | operator re-issues; re-run tailscale up |
| cc-ci SSH root key | ~/.ssh/cc-ci-root-ed25519 (sandbox) | operator re-keys authorized_keys |
| Gitea bot creds | /srv/cc-ci/.testenv (GITEA_USERNAME/PASSWORD) | operator resets; update .testenv |
| Bootstrap age key | host /var/lib/sops-nix/key.txt (0600); the one out-of-band secret | host-derived (cc-ci) or recovery key (clone); re-provision on host re-key |
| Wildcard TLS cert+key | sops in cc-ci-secrets → decrypted to /var/lib/ci-certs/live/ | operator re-issues then commits the new cert into cc-ci-secrets (see below) |
| Registry pull creds (if needed) | sops cc-ci-secrets/secrets.yaml | operator-provided |

A missing/invalid A1 secret is a ## Blocked condition — the agent never invents or works around it, and never runs ACME/DNS-01 for commoninternet.net. (Phase-1c: the cert is now committed encrypted in cc-ci-secrets, not dropped as a file — but issuance is still operator-only; the Gandi token never touches the repo or the box.)

Wildcard cert rotation (operator; the cert now lives in git):

  1. Operator re-issues the SAN cert (*.ci.commoninternet.net + ci.commoninternet.net) out-of-band (Let's Encrypt DNS-01 via Gandi; ~90-day lifetime; next renewal due ~2026-08-24).
  2. Re-encrypt it into the secrets repo: sops cc-ci-secrets/secrets.yaml and replace wildcard_cert / wildcard_key (each a PEM block scalar); commit + push cc-ci-secrets, bump the base submodule pointer.
  3. nixos-rebuild switch: sops re-writes /var/lib/ci-certs/live/* from git; the proxy reconcile re-inserts the swarm secret + redeploys traefik. One cert covers every per-run subdomain (SNI).

Class A2 — internal infra secrets (the loop GENERATES + manages; never a blocker)

All sops-encrypted in secrets/secrets.yaml, decrypted to /run/secrets/<name>:

| Secret | Used by | Generate |
|---|---|---|
| drone_rpc_secret | Drone server ↔ exec runner RPC | openssl rand -hex 32 |
| drone_gitea_client_secret | Drone↔Gitea OAuth app | from the Gitea OAuth app creation |
| bridge_webhook_hmac | comment-bridge webhook HMAC | openssl rand -hex 32 |
| bridge_drone_token | bridge + dashboard → Drone API | hex token; injected as the bot's Drone machine token via DRONE_USER_CREATE=…,token:$(cat /run/secrets/bridge_drone_token) (modules/drone.nix) so it's reproducible on a fresh Drone DB (else the bridge gets 401 on a clean-room rebuild) |
| bridge_gitea_token | bridge → Gitea API (poll/comment) | minted Gitea token (bot) |
| restic_password | backup-bot-two restic repo | abra-generated (abra app secret generate, kept stable across reconciles) |

Rotate an A2 secret (e.g. bridge_webhook_hmac):

  1. Have an age identity that is a recipient (the host key via ssh-to-age, or the recovery key).
  2. In the cc-ci-secrets submodule: sops secrets.yaml → replace the value (e.g. with fresh output from openssl rand -hex 32) and save, which re-encrypts to both recipients per its .sops.yaml. Commit + push cc-ci-secrets, then bump the base repo's submodule pointer (git add secrets && git commit).
  3. For swarm-secret-backed values, bump the consuming app's secret version so the reconcile re-creates the swarm secret (docker swarm secrets are immutable): e.g. drone RPC_SECRET_VERSION v1→v2 (modules/drone.nix), bridge cc_ci_bridge_*_v<n> (modules/bridge.nix). Update both ends (server + runner share drone_rpc_secret).
  4. git commit + push, sync to host, nixos-rebuild switch → reconcile re-inserts + redeploys.
  5. Verify: the consuming service is healthy and re-auth works (e.g. a fresh build triggers).
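Because docker swarm secrets are immutable, step 3 comes down to incrementing a version label such as v1 → v2. A minimal sketch of that bump, assuming labels always have the form v<n> with no leading zeros; the helper name next_secret_version is hypothetical and does not appear in modules/drone.nix or modules/bridge.nix:

```shell
# Hypothetical helper (not part of cc-ci): given a version label such as
# "v1", print the next label ("v2"). Rejects anything that is not v<digits>.
next_secret_version() {
    label=$1
    n=${label#v}

    # The label must start with "v" (so stripping it changed the string) and
    # the remainder must be digits only, without a leading zero.
    if [ "$n" = "$label" ]; then
        echo "expected a label like v1, got: $label" >&2
        return 1
    fi
    case $n in
        '' | *[!0-9]* | 0?*)
            echo "expected a label like v1, got: $label" >&2
            return 1
            ;;
    esac

    echo "v$((n + 1))"
}
```

For example, next_secret_version v1 prints v2; the new label would then be written into the consuming module by hand before the rebuild.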

Re-key sops recipients (e.g. cc-ci host re-provisioned → new host age key): add the new age1… to cc-ci-secrets/.sops.yaml, sops updatekeys secrets.yaml (run with the master identity), commit cc-ci-secrets + bump the submodule pointer. The master/recovery key lets you re-encrypt even if the host key is lost — and is itself the bootstrap key a fresh host uses (/var/lib/sops-nix/key.txt).

Class B — recipe app secrets (the harness generates per run; NEVER a blocker)

  • Generated at install: abra app secret generate <app> --all (+ any deterministic test fixtures the harness chooses) when the recipe deploys.
  • Persisted for the run: the same generated values survive install → upgrade → backup/restore because abra/swarm holds them keyed by the per-run app name (<recipe[:4]>-<6hex>); the harness re-reads them between stages. Concurrent runs are isolated by the unique per-run app name (and MAX_TESTS=1 means no concurrency anyway).
  • Destroyed at teardown: the same teardown that removes the app/volumes runs abra app secret remove <app> --all (+ docker-secret cleanup by stack name as a fallback). Nothing generated for a run outlives it.
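The per-run app name format <recipe[:4]>-<6hex> can be illustrated with a short sketch. The function name run_app_name is hypothetical, and drawing the hex suffix from /dev/urandom is an assumption: the text above states only the name format, not how the harness produces it.

```shell
# Hypothetical helper (not the harness's actual code): build a per-run app
# name of the form <first 4 chars of recipe>-<6 hex chars>.
run_app_name() {
    recipe=$1
    if [ -z "$recipe" ]; then
        echo "usage: run_app_name <recipe>" >&2
        return 1
    fi

    # %.4s truncates the string to at most 4 characters.
    prefix=$(printf '%.4s' "$recipe")

    # 3 random bytes -> 6 lowercase hex characters (assumed randomness source).
    suffix=$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n')

    printf '%s-%s\n' "$prefix" "$suffix"
}
```

For a recipe named nextcloud this prints a name that starts with next- followed by six hex characters; the suffix differs on every call, which is what keeps the secrets of separate runs apart.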

No-plaintext guarantees

  • Secrets are referenced by /run/secrets/<name> path or read inline (e.g. PGPASSWORD=$(cat /run/secrets/…) inside the app container), never printed by the harness.
  • abra does not echo generated secret values; reconciles redirect secret-generate stdout to /dev/null.
  • The results dashboard renders run status only (no log bodies); per-run logs live in Drone's UI.
  • Adversary leak test: greps published Drone logs + the dashboard for the known infra-secret values and any generated app password; the match count must be zero. (Baseline + recipe-CI log scans: clean.)
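The leak test amounts to a fixed-string search over exported output. A minimal sketch, assuming the Drone logs and dashboard pages have already been saved under a local directory and the known secret values are listed one per line in a file; scan_for_leaks and both arguments are hypothetical, not the Adversary's actual implementation:

```shell
# Hypothetical leak scan (not the Adversary's actual implementation).
# $1: file listing known secret values, one per line
# $2: directory holding exported Drone logs and dashboard pages
# Returns 0 when nothing matches, 1 when any file contains a known value,
# 2 when the inputs cannot be read.
scan_for_leaks() {
    values_file=$1
    target_dir=$2
    [ -r "$values_file" ] || return 2
    [ -d "$target_dir" ] || return 2
    patterns=$(mktemp) || return 2

    # Drop blank lines: an empty fixed-string pattern would match every line.
    grep -v '^[[:space:]]*$' "$values_file" > "$patterns" || true

    # -r recurse, -F fixed strings (no regex), -f read patterns from a file,
    # -l print only matching file names so the secret itself is never echoed.
    hits=$(grep -rFl -f "$patterns" "$target_dir") || true
    rm -f "$patterns"

    if [ -n "$hits" ]; then
        printf 'leak detected in:\n%s\n' "$hits" >&2
        return 1
    fi
    return 0
}
```

Printing only file names is deliberate: a scan that echoed the matching lines would itself write the secret into whatever log captures its output.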