diff --git a/docs/secrets.md b/docs/secrets.md
new file mode 100644
index 0000000..8841fa9
--- /dev/null
+++ b/docs/secrets.md
@@ -0,0 +1,91 @@
+# Secrets model & rotation (D6)
+
+cc-ci handles three classes of secret in deliberately different ways (plan §4.4). **No plaintext
+secret ever lives in git, logs, or the results UI** — only sops-encrypted ciphertext and
+references-by-location. The Adversary's leak test greps published Drone logs + the dashboard for
+known secret patterns and any generated app password; it must find nothing.
+
+## Decryption chain (sops-nix)
+
+- Infra secrets live sops-encrypted in `secrets/secrets.yaml` (committed). `/.sops.yaml` lists two
+  age recipients: the **host key** (`age1h90ut…`, derived from cc-ci's SSH host key via ssh-to-age)
+  and an off-box **master recovery key** (`age1cmk26t…`; its private half is kept only at
+  `/srv/cc-ci/.sops/master-age.txt` on the build host, never in the repo).
+- On cc-ci, `sops-nix` decrypts at activation using the host's ed25519 SSH host key as the age
+  identity (`sops.age.sshKeyPaths = ["/etc/ssh/ssh_host_ed25519_key"]`), materialising each secret at
+  `/run/secrets/` (mode 0400 root). No extra key file to manage on the box.
+- Swarm services don't read `/run/secrets` directly; the reconcile oneshots copy each into a **docker
+  swarm secret** (`docker secret create … /run/secrets/`) which the service mounts. abra-managed
+  apps use `abra app secret …`.
+
+## Class A1 — external inputs (operator-provided; the loop CANNOT create them)
+
+| Secret | Location | Rotation |
+|---|---|---|
+| Tailscale auth key | `/srv/cc-ci/.testenv` (sandbox) | operator re-issues; re-run `tailscale up` |
+| cc-ci SSH root key | `~/.ssh/cc-ci-root-ed25519` (sandbox) | operator re-keys `authorized_keys` |
+| Gitea bot creds | `/srv/cc-ci/.testenv` (`GITEA_USERNAME/PASSWORD`) | operator resets; update `.testenv` |
+| **Wildcard TLS cert** | cc-ci `/var/lib/ci-certs/live/{fullchain,privkey}.pem` | **operator** re-issues (LE DNS-01/Gandi, ~90d, next ~2026-08-24) — see below |
+| Registry pull creds (if needed) | sops `secrets/secrets.yaml` | operator-provided |
+
+A missing/invalid A1 secret is a `## Blocked` condition — the agent never invents or works around it,
+and **never** runs ACME/DNS-01 for commoninternet.net.
+
+**Wildcard cert rotation (manual, operator + agent):**
+1. Operator re-issues the SAN cert (`*.ci.commoninternet.net` + `ci.commoninternet.net`) out-of-band
+   and writes it to `/var/lib/ci-certs/live/{fullchain,privkey}.pem` on cc-ci.
+2. Bump `SECRET_WILDCARD_CERT_VERSION` / `SECRET_WILDCARD_KEY_VERSION` on the traefik app env
+   (modules/proxy.nix) so the next reconcile inserts the new cert as a fresh swarm secret version.
+3. `nixos-rebuild switch` (re-runs the proxy reconcile → re-inserts + redeploys traefik). One cert
+   covers every per-run subdomain (SNI), so no per-app cert work.
+
+## Class A2 — internal infra secrets (the loop GENERATES + manages; never a blocker)
+
+All sops-encrypted in `secrets/secrets.yaml`, decrypted to `/run/secrets/`:
+
+| Secret | Used by | Generate |
+|---|---|---|
+| `drone_rpc_secret` | Drone server ↔ exec runner RPC | `openssl rand -hex 32` |
+| `drone_gitea_client_secret` | Drone↔Gitea OAuth app | from the Gitea OAuth app creation |
+| `bridge_webhook_hmac` | comment-bridge webhook HMAC | `openssl rand -hex 32` |
+| `bridge_drone_token` | bridge + dashboard → Drone API | minted Drone user token |
+| `bridge_gitea_token` | bridge → Gitea API (poll/comment) | minted Gitea token (bot) |
+| `restic_password` | backup-bot-two restic repo | **abra-generated** (`abra app secret generate`, kept stable across reconciles) |
+
+**Rotate an A2 secret** (e.g. `bridge_webhook_hmac`):
+1. `set -a; . /srv/cc-ci/.testenv; set +a` (for the editor key, not echoed).
+2. In the repo: `sops secrets/secrets.yaml` → replace the value (or `openssl rand -hex 32 | …`),
+   save. (Re-encrypts to both recipients automatically per `.sops.yaml`.)
+3. For swarm-secret-backed values, **bump the consuming app's secret version** so the reconcile
+   re-creates the swarm secret (docker swarm secrets are immutable): e.g. drone `RPC_SECRET_VERSION`
+   v1→v2 (modules/drone.nix), bridge `cc_ci_bridge_*_v` (modules/bridge.nix). Update both ends
+   (server + runner share `drone_rpc_secret`).
+4. `git commit` + push, sync to host, `nixos-rebuild switch` → reconcile re-inserts + redeploys.
+5. Verify: the consuming service is healthy and re-auth works (e.g. a fresh build triggers).
+
+**Re-key sops recipients** (e.g. cc-ci host re-provisioned → new host age key): add the new
+`age1…` to `/.sops.yaml`, `sops updatekeys secrets/secrets.yaml` (run from the build host, which
+holds the master identity), commit. The master recovery key lets you re-encrypt even if the host key
+is lost.
+
+## Class B — recipe app secrets (the harness generates per run; NEVER a blocker)
+
+- **Generated at install:** `abra app secret generate --all` (+ any deterministic test fixtures
+  the harness chooses) when the recipe deploys.
+- **Persisted for the run:** the same generated values survive install → upgrade → backup/restore
+  because abra/swarm holds them keyed by the per-run app name (`-<6hex>`); the harness
+  re-reads them between stages. Concurrent runs are isolated by the unique per-run app name (and
+  MAX_TESTS=1 means no concurrency anyway).
+- **Destroyed at teardown:** the same teardown that removes the app/volumes runs
+  `abra app secret remove --all` (+ docker-secret cleanup by stack name as a fallback). Nothing
+  generated for a run outlives it.
+
+## No-plaintext guarantees
+
+- Secrets are referenced by `/run/secrets/` path or read inline (e.g.
+  `PGPASSWORD=$(cat /run/secrets/…)` *inside* the app container), never printed by the harness.
+- abra does not echo generated secret values; reconciles redirect secret-generate stdout to
+  `/dev/null`.
+- The results dashboard renders run status only (no log bodies); per-run logs live in Drone's UI.
+- Adversary leak test: greps published Drone logs + the dashboard for the known infra-secret values
+  and any generated app password → must be zero. (Baseline + recipe-CI log scans: clean.)
diff --git a/runner/run_recipe_ci.py b/runner/run_recipe_ci.py
index 89851b2..67c52dd 100644
--- a/runner/run_recipe_ci.py
+++ b/runner/run_recipe_ci.py
@@ -33,6 +33,40 @@ STAGE_FILES = {
 }
 
 
+def _redact_values() -> list[str]:
+    """Values to scrub from published logs (D6 redaction filter, plan §4.4). The infra secrets
+    materialised at /run/secrets/* — if any subprocess ever echoes one, mask it. Only >=8-char
+    values, so it never false-positives on short strings / SHAs."""
+    vals = set()
+    for p in glob.glob("/run/secrets/*"):
+        try:
+            v = open(p).read().strip()
+        except OSError:
+            continue
+        if len(v) >= 8:
+            vals.add(v)
+    return sorted(vals, key=len, reverse=True)
+
+
+_REDACT = _redact_values()
+
+
+def run_stage_redacted(cmd: list[str], env: dict | None = None) -> int:
+    """Run a stage subprocess, streaming its output live (so Drone logs stay tail-able) but masking
+    any known infra-secret value first. Belt-and-suspenders: the harness already never prints
+    secrets and abra doesn't echo generated ones."""
+    proc = subprocess.Popen(cmd, cwd=ROOT, env=env, stdout=subprocess.PIPE,
+                            stderr=subprocess.STDOUT, text=True, bufsize=1)
+    assert proc.stdout is not None
+    for line in proc.stdout:
+        for v in _REDACT:
+            if v in line:
+                line = line.replace(v, "***REDACTED***")
+        sys.stdout.write(line)
+        sys.stdout.flush()
+    return proc.wait()
+
+
 def _gitea_token() -> str | None:
     tok = os.environ.get("GITEA_TOKEN")
     if not tok and os.path.exists("/run/secrets/bridge_gitea_token"):
@@ -94,7 +128,7 @@ def main() -> int:
             continue
         print(f"\n===== STAGE: {stage} =====", flush=True)
         # each stage is its own pytest invocation => its own reported result (D2 separate stages)
-        rc = subprocess.call([sys.executable, "-m", "pytest", "-v", "-rA", path], cwd=ROOT)
+        rc = run_stage_redacted([sys.executable, "-m", "pytest", "-v", "-rA", path])
         ran += 1
         if rc != 0:
             overall = rc
@@ -135,8 +169,7 @@ def run_recipe_local(recipe: str, local_tests: str | None) -> int | None:
         lifecycle.deploy_app(recipe, domain, version=os.environ.get("VERSION") or None)
         lifecycle.wait_healthy(domain)
         env = dict(os.environ, CCCI_APP_DOMAIN=domain, CCCI_BASE_URL=f"https://{domain}")
-        return subprocess.call([sys.executable, "-m", "pytest", "-v", "-rA", local_tests],
-                               cwd=ROOT, env=env)
+        return run_stage_redacted([sys.executable, "-m", "pytest", "-v", "-rA", local_tests], env=env)
     finally:
         lifecycle.teardown_app(domain, verify=False)
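
Review note: the masking logic in `_redact_values` / `run_stage_redacted` can be exercised away from a real host with a small standalone sketch like the one below. It is not part of the patch. The helper names `load_redact_values`, `redact_line` and `demo`, the sample secret strings, and the temporary directory standing in for `/run/secrets` are all hypothetical. The sketch mainly illustrates why the values are sorted longest first: if one secret happens to be a prefix of another, masking the shorter one first would leave the tail of the longer one in the log.

```python
import tempfile
from pathlib import Path

MASK = "***REDACTED***"
MIN_LEN = 8  # same threshold as the patch: shorter values are ignored


def load_redact_values(secrets_dir: str) -> list[str]:
    """Collect secret values from a directory, longest first (hypothetical helper)."""
    vals = set()
    for p in Path(secrets_dir).iterdir():
        try:
            v = p.read_text().strip()
        except (OSError, UnicodeDecodeError):
            continue  # unreadable or non-text entry: skip it
        if len(v) >= MIN_LEN:
            vals.add(v)
    return sorted(vals, key=len, reverse=True)


def redact_line(line: str, values: list[str]) -> str:
    """Mask every known value in one log line, in the order given."""
    for v in values:
        line = line.replace(v, MASK)
    return line


def demo() -> list[str]:
    # A temp dir stands in for /run/secrets; the values are made up for illustration.
    with tempfile.TemporaryDirectory() as d:
        Path(d, "short").write_text("abc123\n")  # under MIN_LEN, so never masked
        Path(d, "token").write_text("s3cr3t-token-value\n")
        Path(d, "token_long").write_text("s3cr3t-token-value-extended\n")
        values = load_redact_values(d)
    lines = [
        "auth ok with s3cr3t-token-value-extended\n",
        "retrying with s3cr3t-token-value\n",
        "commit abc123 built\n",
    ]
    return [redact_line(line, values) for line in lines]


if __name__ == "__main__":
    for out in demo():
        print(out, end="")
```

Like the loop in the patch, this masks one line at a time, so a secret whose value spans several lines would not match; whether any file under `/run/secrets` is multi-line is not something this diff shows.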