# cc-ci Phase 1c: Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

**Status:** QUEUED. Runs after Phase 1 (`plan.md`); pairs with Phase 1b (review/lint). **Manual** transition.

**Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7): the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.

**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`

---

## 0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design (sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:

- **sops host-key binding** is defeated by the project's **own master recovery age key** (`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*): a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker: it only gates the full end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete (D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and refused to self-certify; the bar then slipped to "infeasible."

This phase does two connected things: **(A)** make the VM **fully reproducible from git, including all secrets** (move the wildcard cert and everything else into sops-in-git; split generic base from private instance), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8; this adds the *live* half it was missing.

---

## 1. Mission

A blank NixOS VM, given only **(1)** the two git repos (generic base + private instance), **(2)** the single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps**, secrets and the wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.

---

## 2. The reproducibility model (target architecture)

**Two repos, composed via a flake input (default): generic base + private instance.**

- **`cc-ci` (base, instance-agnostic):** `flake.nix` exposing a parameterized `nixosModules.cc-ci`, plus `runner/`, `tests/`, `docs/`. **No hardcoded domain, no instance secrets.**
- **`cc-ci-instance` (private, e.g. `recipe-maintainers/cc-ci-instance`):** `instance.nix` (DOMAIN=`ci.commoninternet.net`, gateway/DNS facts, sops recipients), `secrets/secrets.yaml` (sops-encrypted: **wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC, registry creds if any, app/infra secrets), and `.sops.yaml`.
- **Linkage (default = flake input):** base `flake.nix` has `inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance"` (private; fetched via the bot token / a read deploy key), and `nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }`. *Alternative:* a git **submodule** at `cc-ci/instance/` (simpler single checkout; submodule footguns). Record the choice in `DECISIONS.md`; flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile` = the instance `secrets.yaml`; `sops.age.sshKeyPaths` = the host key, with the recovery key (an age key, not an SSH key) supplied via `sops.age.keyFile` when it is the provisioned bootstrap key. The **wildcard cert/key are sops secrets** decrypted at activation to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets: **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the repo's sops secrets (the host key, or the provisioned recovery key). It cannot live in the repo it decrypts. This is the *only* permitted "not in git" secret, and it's provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway (network infra), documented as preconditions. (IaC for those is out of scope; see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted); the **Gandi DNS token never enters the repo or the agent**. Renewal = operator re-issues the cert out-of-band, then commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).

---

## 3. Definition of Done

Terminates only when every item holds **and the Adversary has independently re-verified each within 24h, from a cold start** (logged in `REVIEW.md`):

- [ ] **C1: Repo split.** Generic base + private instance repo, composed (flake input by default). The base builds with no instance secrets/domain baked in; the instance carries all instance specifics. `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2: Cert in git.** The wildcard cert+key are sops secrets in the instance repo, decrypted at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3: All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC, webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only** out-of-band secret is the bootstrap age key, documented precisely; nothing else.
- [ ] **C4: Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`), provisioned with *only* the bootstrap age key, the loops `git clone` base+instance and run `nixos-rebuild switch`; the system activates and the reconcile oneshots converge (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step not in `docs/install.md`**. The Adversary performs this **cold** and logs evidence.
- [ ] **C5: Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate), not a blanket "infeasible."
- [ ] **C6: Resource fit + cleanup.** `cc-nix-test` resized **6 GB→4 GB** and the throwaway VM created at **4 GB**, within the **~12 GB running-RAM guideline** (cc-nix-test 4 + lichen-staging 4 + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project limit). The throwaway VM is **destroyed** after the test (no leftover). Final `cc-nix-test` sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM; record in `DECISIONS.md`).
- [ ] **C7: Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+instance + provision the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a fresh instance from the docs.

When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.

---

## 4. Incus capability (granted for this phase only)

The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`; create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at `/srv/incus-terraform-nix-vm-creator/terraform-secrets/` through the existing SOCKS proxy (`127.0.0.1:1055`). See the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`) and [[cc-ci-vm-incus]].

Guardrails:

- only `terraform-ci`;
- keep total running RAM within the **~12 GB guideline** (doc-only: terraform-ci has no enforced `limits.memory`; b1 is 16 GB physical), hence `cc-nix-test`→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB;
- **destroy the throwaway VM when done**;
- never touch other projects/instances;
- live-memory changes need stop→set→start (hotplug times out; see memory).

---

## 5. Method (ordered; each milestone ends with an Adversary gate)

1. **W1: Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room; cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
2. **W2: Repo split + secrets into git.** Create the private `cc-ci-instance` repo; move instance specifics + all secrets (incl. the **wildcard cert+key**, read from `/var/lib/ci-certs/live`) into sops there; wire the base flake to consume it (flake input). *Accept:* `nixos-rebuild build` of the restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test` `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3: Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized **4 GB**. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4: Reproducible live rebuild.** On the throwaway VM: clone base+instance, `nixos-rebuild switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step outside `docs/install.md`**; capture evidence.
5. **W5: Adversary cold proof + honest D8.** Adversary repeats W4 from scratch independently and rewrites the D8 evidence (static + live), removing "infeasible by design." *Accept:* Adversary logs a real D8 live-rebuild PASS (or a narrow, signed-off limitation per §3 C5).
6. **W6: Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c `STATUS.md` to `## DONE`.

---

## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true, narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off**. The host-key and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If the loops find another secret that "has to" be out-of-band, that's a finding to design away, not accept.
- **Gandi token stays out of repo/agent.** Only the cert artifact is committed (encrypted). Renewal is operator-issues-then-commits.
- **Base repo stays generic.** No instance domain/secret leakage into the base; the Adversary checks the base builds/clones clean of instance specifics.
- **Incus guardrails** (§4): terraform-ci only, respect the ~12 GB RAM guideline, destroy the throwaway VM, don't touch other instances.
- **No weakened tests / no drift.** The restructured config must remain byte-identical to running (zero drift) and all of D1–D10 must still hold after the refactor.

---

## 7. Open decisions (log in DECISIONS.md)

- **Flake input vs git submodule** for the instance repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway): likely a separate future item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- Whether the instance repo is private under `recipe-maintainers` (bot is admin) and how the loops fetch it during rebuild (token-in-URL vs read deploy key).
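
To make the first open decision concrete, here is a minimal sketch of the flake-input option as §2 describes it. It is an illustration under stated assumptions, not the repo's actual `flake.nix`. The instance URL, `nixosModules.cc-ci`, `instance.nixosModules.instance`, `secrets/secrets.yaml`, and the cert paths come from this plan. The nixpkgs and sops-nix pins, `system`, the `./modules/cc-ci.nix` path, the secret key names, and the age key file location are hypothetical.

```nix
{
  description = "cc-ci base (sketch): generic, no instance specifics";

  inputs = {
    # Assumption: illustrative pins, not taken from the real repo.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
    # Private instance repo, URL as given in §2.
    instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance";
  };

  outputs = { self, nixpkgs, sops-nix, instance, ... }: {
    # Parameterized base module; the file path is hypothetical.
    nixosModules.cc-ci = import ./modules/cc-ci.nix;

    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumption
      modules = [
        sops-nix.nixosModules.sops
        self.nixosModules.cc-ci
        instance.nixosModules.instance
        {
          # Secrets file lives in the instance repo, not the base.
          sops.defaultSopsFile = "${instance}/secrets/secrets.yaml";
          # Host SSH key (converted to an age identity by sops-nix).
          sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];
          # Bootstrap/recovery age key provisioned at VM creation.
          # Hypothetical location; this is the one out-of-band secret.
          sops.age.keyFile = "/var/lib/sops-nix/key.txt";

          # Hypothetical key names inside secrets.yaml; target paths per §2.
          sops.secrets."wildcard/fullchain".path = "/var/lib/ci-certs/live/fullchain.pem";
          sops.secrets."wildcard/privkey".path = "/var/lib/ci-certs/live/privkey.pem";
        }
      ];
    };
  };
}
```

In a real split the `sops.*` settings would more likely sit in the instance module, so the base stays free of instance specifics; they are inlined here only to keep the sketch in one file. The submodule alternative would replace the `instance` input with a local path import from `cc-ci/instance/`.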