diff --git a/cc-ci-plan/README.md b/cc-ci-plan/README.md
index eb2d931..129d86f 100644
--- a/cc-ci-plan/README.md
+++ b/cc-ci-plan/README.md
@@ -16,6 +16,7 @@ autonomous Claude loops (a Builder and an adversarial Reviewer) running over day
 |---|---|
 | `plan.md` | The Phase-1 plan (build the CI server). Agents treat it as their single source of truth. |
 | `plan-phase1b-review-lint.md` | **Phase 1b** (bounded pass at the end of Phase 1): deterministic linting/formatting in CI + a white-box review checklist (real tests, DRY harness, idempotent Nix, no footguns/secrets). |
+| `plan-phase1c-full-reproducibility.md` | **Phase 1c**: make the VM fully reproducible from git (all secrets incl. the wildcard cert in sops; generic base + private instance flake input) and do the **genuine throwaway-VM live rebuild** to close D8 honestly (the "infeasible by design" was overstated). |
 | `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase 1b): author comprehensive per-recipe tests — port every recipe-maintainer test + ≥2 recipe-specific tests per app. |
 | `plan-phase2b-test-performance.md` | **Phase 2b** (after Phase 2, before Phase 3): empirically measure where test time goes and reduce it (image cache, readiness tuning, dedup deploys, warm infra, concurrency) — no weakened tests. |
 | `plan-phase3-results-ux.md` | **Phase 3** (after Phase 2b): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
diff --git a/cc-ci-plan/plan-phase1c-full-reproducibility.md b/cc-ci-plan/plan-phase1c-full-reproducibility.md
new file mode 100644
index 0000000..25f87a7
--- /dev/null
+++ b/cc-ci-plan/plan-phase1c-full-reproducibility.md
@@ -0,0 +1,173 @@
+# cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)
+
+**Status:** QUEUED — runs after Phase 1 (`plan.md`); pairs with Phase 1b (review/lint). **Manual**
+transition. **Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7) —
+the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.
+**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
+
+---
+
+## 0. Why this phase
+
+Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design
+(sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:
+- **sops host-key binding** is defeated by the project's **own master recovery age key**
+  (`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*) —
+  a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
+- **operator DNS/cert** is a *precondition*, not a rebuild blocker — it only gates the full
+  end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
+- Incus is available, and the rate-limit premise that originally deferred the test was obsolete
+  (D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and
+  refused to self-certify; the bar then slipped to "infeasible."
+
+This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
+all secrets** (move the wildcard cert and everything else into sops-in-git; split generic base from
+private instance), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
+D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
+this adds the *live* half it was missing.
+
+---
+
+## 1. Mission
+
+A blank NixOS VM, given only **(1)** the two git repos (generic base + private instance), **(2)** the
+single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
+working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
+wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.
+
+---
+
+## 2. The reproducibility model (target architecture)
+
+**Two repos, composed via a flake input (default) — generic base + private instance.**
+
+- **`cc-ci` (base, instance-agnostic):** `flake.nix` exposing a parameterized `nixosModules.cc-ci`,
+  plus `runner/`, `tests/`, `docs/`. **No hardcoded domain, no instance secrets.**
+- **`cc-ci-instance` (private, e.g. `recipe-maintainers/cc-ci-instance`):** `instance.nix`
+  (DOMAIN=`ci.commoninternet.net`, gateway/DNS facts, sops recipients), `secrets/secrets.yaml`
+  (sops-encrypted: **wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC,
+  registry creds if any, app/infra secrets), and `.sops.yaml`.
+- **Linkage (default = flake input):** base `flake.nix` has
+  `inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance"`
+  (private; fetched via the bot token / a read deploy key), and
+  `nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }`.
+  *Alternative:* a git **submodule** at `cc-ci/instance/` (simpler single checkout; submodule
+  footguns). Record the choice in `DECISIONS.md`; flake input is the recommended default.
+- **sops-nix wiring:** `sops.defaultSopsFile = instance secrets.yaml`; `sops.age.sshKeyPaths` = host
+  key + the recovery recipient. The **wildcard cert/key are sops secrets** decrypted at activation to
+  `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's
+  `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
+- **The one irreducible out-of-band secret:** the **age private key** that unlocks the repo's sops
+  secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts.
+  This is the *only* permitted "not in git" secret, and it's provisioned to the host at creation.
+- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
+  (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
+- **Token discipline preserved:** only the cert *artifact* enters git (encrypted); the **Gandi DNS
+  token never enters the repo or the agent**. Renewal = operator re-issues the cert out-of-band, then
+  commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).
+
+---
+
+## 3. Definition of Done
+
+Terminates only when every item holds **and the Adversary has independently re-verified each within
+24h, from a cold start** (logged in `REVIEW.md`):
+
+- [ ] **C1 — Repo split.** Generic base + private instance repo, composed (flake input by default).
+      The base builds with no instance secrets/domain baked in; the instance carries all instance
+      specifics. `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
+- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in the instance repo, decrypted
+      at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
+      gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
+- [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
+      webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only**
+      out-of-band secret is the bootstrap age key — documented precisely, nothing else.
+- [ ] **C4 — Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`),
+      provisioned with *only* the bootstrap age key, the loops `git clone` base+instance and run
+      `nixos-rebuild switch`; the system activates and the reconcile oneshots converge
+      (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step
+      not in `docs/install.md`**. The Adversary performs this **cold** and logs evidence.
+- [ ] **C5 — Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the
+      live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any
+      single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off
+      limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a
+      blanket "infeasible."
+- [ ] **C6 — Resource fit + cleanup.** `cc-nix-test` resized to **2 GB** to free b1 headroom for a
+      properly-sized throwaway VM (§5 step 1); the throwaway VM is **destroyed** after the test (no
+      leftover, respect the `terraform-ci` <10 GB-running cap); final `cc-nix-test` sizing decided and
+      applied (restore to 6 GB, or promote the rebuilt VM — record in `DECISIONS.md`).
+- [ ] **C7 — Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's
+      cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+instance + provision
+      the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a
+      fresh instance from the docs.
+
+When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.
+
+---
+
+## 4. Incus capability (granted for this phase only)
+
+The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`;
+create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at
+`/srv/incus-terraform-nix-vm-creator/terraform-secrets/` through the existing SOCKS proxy
+(`127.0.0.1:1055`) — see the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`)
+and [[cc-ci-vm-incus]]. Guardrails: only `terraform-ci`; **respect the <10 GB running-RAM cap**
+(that's why `cc-nix-test`→2 GB first); **destroy the throwaway VM when done**; never touch other
+projects/instances; live-memory changes need stop→set→start (hotplug times out — see memory).
+
+---
+
+## 5. Method (ordered; each milestone ends with an Adversary gate)
+
+1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**2 GB** (stop→set→start) to fit a ~6 GB throwaway VM
+   under b1's budget. *Accept:* b1 has room; cc-nix-test still healthy at 2 GB (no heavy recipe CI
+   runs during 1c). *(Note: restore sizing in W6.)*
+2. **W2 — Repo split + secrets into git.** Create the private `cc-ci-instance` repo; move instance
+   specifics + all secrets (incl. the **wildcard cert+key**, read from `/var/lib/ci-certs/live`) into
+   sops there; wire the base flake to consume it (flake input). *Accept:* `nixos-rebuild build` of the
+   restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
+   `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
+3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
+   ~6 GB. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
+4. **W4 — Reproducible live rebuild.** On the throwaway VM: clone base+instance, `nixos-rebuild
+   switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step
+   outside `docs/install.md`**; capture evidence.
+5. **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 from scratch independently and
+   rewrites the D8 evidence (static + live), removing "infeasible by design." *Accept:* Adversary
+   logs a real D8 live-rebuild PASS (or a narrow, signed-off limitation per §3 C5).
+6. **W6 — Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and
+   apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c
+   `STATUS.md` to `## DONE`.
+
+---
+
+## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)
+
+- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true,
+  narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off** — the host-key
+  and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
+- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If
+  the loops find another secret that "has to" be out-of-band, that's a finding to design away, not
+  accept.
+- **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
+  is operator-issues-then-commits.
+- **Base repo stays generic** — no instance domain/secret leakage into the base; the Adversary checks
+  the base builds/clones clean of instance specifics.
+- **Incus guardrails** (§4): terraform-ci only, respect the RAM cap, destroy the throwaway VM, don't
+  touch other instances.
+- **No weakened tests / no drift** — the restructured config must remain byte-identical to running
+  (zero drift) and all of D1–D10 must still hold after the refactor.
+
+---
+
+## 7. Open decisions (log in DECISIONS.md)
+- **Flake input vs git submodule** for the instance repo (default: flake input).
+- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
+  (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is
+  simpler for a clone; per-host re-encrypt is cleaner long-term.)
+- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt
+  reproducible VM** to be the canonical cc-ci and retire the old one.
+- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
+  item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
+- Whether the instance repo is private under `recipe-maintainers` (bot is admin) and how the loops
+  fetch it during rebuild (token-in-URL vs read deploy key).
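
Review note on §2 of the new plan: the flake-input linkage and sops-nix wiring it describes could look roughly like the sketch below. This is only an illustration under stated assumptions, not the project's actual `flake.nix`. The nixpkgs channel, the `./modules/cc-ci.nix` path, the `x86_64-linux` system, the `wildcard/...` secret key names, and the `/var/lib/sops-nix/key.txt` location are all hypothetical. It also assumes a provisioned recovery age key would be supplied through `sops.age.keyFile`, since `sops.age.sshKeyPaths` takes SSH host keys.

```nix
# flake.nix in the base repo: illustrative sketch only (see assumptions above).
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable"; # assumed channel
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
    # Private instance repo named in §2; fetch credentials live outside the flake.
    instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance";
  };

  outputs = { self, nixpkgs, sops-nix, instance, ... }: {
    # Generic, instance-agnostic module (hypothetical file path).
    nixosModules.cc-ci = import ./modules/cc-ci.nix;

    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumed architecture
      modules = [
        sops-nix.nixosModules.sops
        self.nixosModules.cc-ci
        instance.nixosModules.instance
        {
          # Encrypted secrets come from the instance repo checkout in the store.
          sops.defaultSopsFile = "${instance}/secrets/secrets.yaml";
          # Host SSH key, converted to an age identity by sops-nix.
          sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];
          # Assumed location of the provisioned bootstrap/recovery age key.
          sops.age.keyFile = "/var/lib/sops-nix/key.txt";
          # Hypothetical secret names; paths are the ones given in §2.
          sops.secrets."wildcard/fullchain".path = "/var/lib/ci-certs/live/fullchain.pem";
          sops.secrets."wildcard/privkey".path = "/var/lib/ci-certs/live/privkey.pem";
        }
      ];
    };
  };
}
```

With a layout like this, the W4 step reduces to provisioning the age key and running `nixos-rebuild switch --flake .#cc-ci` from the base checkout. Whether the instance module or the base sets `sops.defaultSopsFile` is a design choice the plan leaves open.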