diff --git a/cc-ci-plan/README.md b/cc-ci-plan/README.md
index 129d86f..553d29d 100644
--- a/cc-ci-plan/README.md
+++ b/cc-ci-plan/README.md
@@ -16,7 +16,7 @@ autonomous Claude loops (a Builder and an adversarial Reviewer) running over day
 |---|---|
 | `plan.md` | The Phase-1 plan (build the CI server). Agents treat it as their single source of truth. |
 | `plan-phase1b-review-lint.md` | **Phase 1b** (bounded pass at the end of Phase 1): deterministic linting/formatting in CI + a white-box review checklist (real tests, DRY harness, idempotent Nix, no footguns/secrets). |
-| `plan-phase1c-full-reproducibility.md` | **Phase 1c**: make the VM fully reproducible from git (all secrets incl. the wildcard cert in sops; generic base + private instance flake input) and do the **genuine throwaway-VM live rebuild** to close D8 honestly (the "infeasible by design" was overstated). |
+| `plan-phase1c-full-reproducibility.md` | **Phase 1c**: make the VM fully reproducible from git (all secrets incl. the wildcard cert in sops, in a separate private `cc-ci-secrets` repo as a flake input; base stays well-parameterized) and do the **genuine throwaway-VM live rebuild** to close D8 honestly (the "infeasible by design" was overstated). |
 | `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase 1b): author comprehensive per-recipe tests — port every recipe-maintainer test + ≥2 recipe-specific tests per app. |
 | `plan-phase2b-test-performance.md` | **Phase 2b** (after Phase 2, before Phase 3): empirically measure where test time goes and reduce it (image cache, readiness tuning, dedup deploys, warm infra, concurrency) — no weakened tests. |
 | `plan-phase3-results-ux.md` | **Phase 3** (after Phase 2b): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
diff --git a/cc-ci-plan/plan-phase1c-full-reproducibility.md b/cc-ci-plan/plan-phase1c-full-reproducibility.md
index 53e29bc..bca8e69 100644
--- a/cc-ci-plan/plan-phase1c-full-reproducibility.md
+++ b/cc-ci-plan/plan-phase1c-full-reproducibility.md
@@ -21,8 +21,8 @@ Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infe
 refused to self-certify; the bar then slipped to "infeasible."
 
 This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
-all secrets** (move the wildcard cert and everything else into sops-in-git; split generic base from
-private instance), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
+all secrets** (move the wildcard cert and everything else into sops-in-git, in a separate private
+`cc-ci-secrets` repo), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
 D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
 this adds the *live* half it was missing.
 
@@ -30,7 +30,7 @@ this adds the *live* half it was missing.
 
 ## 1. Mission
 
-A blank NixOS VM, given only **(1)** the two git repos (generic base + private instance), **(2)** the
+A blank NixOS VM, given only **(1)** the two git repos (base `cc-ci` + private `cc-ci-secrets`), **(2)** the
 single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
 working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
 wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.
@@ -39,32 +39,36 @@ wildcard cert included, decrypted from git. Proven on a real throwaway VM by the
 
 ## 2. The reproducibility model (target architecture)
 
-**Two repos, composed via a flake input (default) — generic base + private instance.**
+**Split only the *secrets* into their own repo. The base stays one well-parameterized repo.** The
+boundary is *secrecy*, not modularity: secrets get a separate private repo (an extra access-control
+layer); instance-specific **non-secret** vars (domain, gateway/DNS facts) stay in the base as plain,
+changeable parameters — another admin can repoint cc-ci by editing them, no second config repo needed.
 
-- **`cc-ci` (base, instance-agnostic):** `flake.nix` exposing a parameterized `nixosModules.cc-ci`,
-  plus `runner/`, `tests/`, `docs/`. **No hardcoded domain, no instance secrets.**
-- **`cc-ci-instance` (private, e.g. `recipe-maintainers/cc-ci-instance`):** `instance.nix`
-  (DOMAIN=`ci.commoninternet.net`, gateway/DNS facts, sops recipients), `secrets/secrets.yaml`
-  (sops-encrypted: **wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC,
-  registry creds if any, app/infra secrets), and `.sops.yaml`.
+- **`cc-ci` (base — one repo, well-parameterized):** `flake.nix`, modules, `runner/`, `tests/`,
+  `docs/`, **and** the instance config as parameters/defaults (e.g. `DOMAIN = "ci.commoninternet.net"`,
+  gateway/DNS facts, sops recipients). Instance *config* lives here; only *secret material* is external.
+  Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
+- **`cc-ci-secrets` (private, `recipe-maintainers/cc-ci-secrets`):** holds **only** the sops-encrypted
+  secret material — `secrets/secrets.yaml` (**wildcard cert + key**, Drone OAuth client_id/secret +
+  RPC secret, webhook HMAC, registry creds if any, app/infra secrets) + `.sops.yaml`. **No code, no
+  config logic** — just encrypted secrets, as a separate security layer with its own access control.
 - **Linkage (default = flake input):** base `flake.nix` has
-  `inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance"`
-  (private; fetched via the bot token / a read deploy key), and
-  `nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }`.
-  *Alternative:* a git **submodule** at `cc-ci/instance/` (simpler single checkout; submodule
-  footguns). Record the choice in `DECISIONS.md`; flake input is the recommended default.
-- **sops-nix wiring:** `sops.defaultSopsFile = instance secrets.yaml`; `sops.age.sshKeyPaths` = host
-  key + the recovery recipient. The **wildcard cert/key are sops secrets** decrypted at activation to
-  `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's
-  `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
-- **The one irreducible out-of-band secret:** the **age private key** that unlocks the repo's sops
-  secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts.
-  This is the *only* permitted "not in git" secret, and it's provisioned to the host at creation.
+  `inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets"`
+  (private; fetched via the bot token / a read deploy key); sops-nix reads `secrets/secrets.yaml` from
+  it. *Alternative:* a git **submodule** at `cc-ci/secrets/`. Record the choice in `DECISIONS.md`;
+  flake input is the recommended default.
+- **sops-nix wiring:** `sops.defaultSopsFile` → the `cc-ci-secrets` `secrets.yaml`;
+  `sops.age.sshKeyPaths` = host key + the recovery recipient. The **wildcard cert/key are sops
+  secrets** decrypted at activation to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed
+  into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
+- **The one irreducible out-of-band secret:** the **age private key** that unlocks the secrets (the
+  host key, or the provisioned recovery key) — it cannot live in the repo it decrypts. This is the
+  *only* permitted "not in git" secret, provisioned to the host at creation.
 - **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
   (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
-- **Token discipline preserved:** only the cert *artifact* enters git (encrypted); the **Gandi DNS
-  token never enters the repo or the agent**. Renewal = operator re-issues the cert out-of-band, then
-  commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).
+- **Token discipline preserved:** only the cert *artifact* enters git (encrypted, in `cc-ci-secrets`);
+  the **Gandi DNS token never enters any repo or the agent**. Renewal = operator re-issues the cert
+  out-of-band, then commits the new sops-encrypted cert to `cc-ci-secrets` (versioned, reproducible).
 
 ---
 
@@ -73,10 +77,12 @@ wildcard cert included, decrypted from git. Proven on a real throwaway VM by the
 Terminates only when every item holds **and the Adversary has independently re-verified each within
 24h, from a cold start** (logged in `REVIEW.md`):
 
-- [ ] **C1 — Repo split.** Generic base + private instance repo, composed (flake input by default).
-      The base builds with no instance secrets/domain baked in; the instance carries all instance
-      specifics. `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
-- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in the instance repo, decrypted
+- [ ] **C1 — Secrets-repo split.** A separate private `cc-ci-secrets` repo holds **only** the
+      sops-encrypted secrets (+ `.sops.yaml`), consumed by the base via a flake input. The base
+      `cc-ci` stays one well-parameterized repo — instance vars (domain, gateway, recipients) remain
+      changeable parameters in the base, **not** moved out (only secrets are external).
+      `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
+- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in `cc-ci-secrets`, decrypted
       at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
       gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
 - [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
@@ -126,9 +132,10 @@ out — see memory).
 1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM
    fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room;
    cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
-2. **W2 — Repo split + secrets into git.** Create the private `cc-ci-instance` repo; move instance
-   specifics + all secrets (incl. the **wildcard cert+key**, read from `/var/lib/ci-certs/live`) into
-   sops there; wire the base flake to consume it (flake input). *Accept:* `nixos-rebuild build` of the
+2. **W2 — Secrets repo + cert into git.** Create the private `cc-ci-secrets` repo; move **all secrets**
+   into sops there — including the **wildcard cert+key** (read from the current `/var/lib/ci-certs/live`)
+   and the existing `secrets/secrets.yaml` contents; keep instance vars parameterized in the base;
+   wire the base flake to consume `cc-ci-secrets` (flake input). *Accept:* `nixos-rebuild build` of the
    restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
    `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
 3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
@@ -155,8 +162,10 @@ out — see memory).
   accept.
 - **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
   is operator-issues-then-commits.
-- **Base repo stays generic** — no instance domain/secret leakage into the base; the Adversary checks
-  the base builds/clones clean of instance specifics.
+- **No plaintext secret leaks into the base (or the store).** Instance *vars* (domain, gateway) may
+  live in the base as parameters — that's fine; what must NOT leak is any *secret* (cert/keys/tokens):
+  those stay encrypted in `cc-ci-secrets`. The Adversary greps the base + the Nix store for plaintext
+  secret material.
 - **Incus guardrails** (§4): terraform-ci only, respect the RAM cap, destroy the throwaway VM, don't
   touch other instances.
 - **No weakened tests / no drift** — the restructured config must remain byte-identical to running
@@ -165,7 +174,7 @@ out — see memory).
 ---
 
 ## 7. Open decisions (log in DECISIONS.md)
-- **Flake input vs git submodule** for the instance repo (default: flake input).
+- **Flake input vs git submodule** for the `cc-ci-secrets` repo (default: flake input).
 - **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
   (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler
   for a clone; per-host re-encrypt is cleaner long-term.)
@@ -173,5 +182,5 @@ out — see memory).
   reproducible VM** to be the canonical cc-ci and retire the old one.
 - **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
   item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
-- Whether the instance repo is private under `recipe-maintainers` (bot is admin) and how the loops
-  fetch it during rebuild (token-in-URL vs read deploy key).
+- How the loops fetch the private `cc-ci-secrets` repo during rebuild (bot token-in-URL vs a read
+  deploy key) — it's private under `recipe-maintainers` (bot is admin).
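
The linkage and sops-nix wiring described in §2 of the plan could look roughly like the fragment below in the base `flake.nix`. This is a minimal sketch under stated assumptions, not the repo's actual flake. The `secrets` input URL, `sops.defaultSopsFile`, `sops.age.sshKeyPaths`, and the two cert paths come from the plan. The nixpkgs branch, `flake = false`, the `system` value, the `./modules/cc-ci.nix` path, the secret names `wildcard_fullchain` and `wildcard_privkey`, and the host-key path are hypothetical and chosen only for illustration.

```nix
{
  inputs = {
    # Assumption: the plan does not say which nixpkgs branch is pinned.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";

    # Private secrets repo (URL from the plan). `flake = false` is an
    # assumption: the repo holds only encrypted files, no flake.nix.
    secrets = {
      url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets";
      flake = false;
    };
  };

  outputs = { self, nixpkgs, sops-nix, secrets, ... }: {
    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumption
      modules = [
        sops-nix.nixosModules.sops
        ./modules/cc-ci.nix # hypothetical path to the base module
        {
          # Encrypted file is read from the fetched secrets input.
          sops.defaultSopsFile = "${secrets}/secrets/secrets.yaml";

          # Assumption: the standard host SSH key is the bootstrap age
          # identity; the recovery key would be wired separately.
          sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];

          # Hypothetical secret names; decrypted at activation to the
          # cert paths named in the plan.
          sops.secrets.wildcard_fullchain.path =
            "/var/lib/ci-certs/live/fullchain.pem";
          sops.secrets.wildcard_privkey.path =
            "/var/lib/ci-certs/live/privkey.pem";
        }
      ];
    };
  };
}
```

With a layout like this, only the encrypted `secrets.yaml` enters the Nix store. The plaintext cert and key exist only on the host after activation, which is the property the Adversary's plaintext grep in §6 is meant to confirm.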