# cc-ci Phase 1c: Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

**Status:** QUEUED. Runs after Phase 1 (`plan.md`) and **before Phase 1b** (review/lint), so the review/lint pass covers this refactor and its final cold re-verification proves the genuine (post-1c) D8. **Manual** transition.

**Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7): the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.

**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`

---

## 0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design (sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:

- **sops host-key binding** is defeated by the project's **own master recovery age key** (`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*): a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker: it only gates the full end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test is obsolete (D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and refused to self-certify; the bar then slipped to "infeasible."

This phase does two connected things: **(A)** make the VM **fully reproducible from git, including all secrets** (move the wildcard cert and everything else into sops-in-git, in a separate private `cc-ci-secrets` repo), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8; this adds the *live* half it was missing.

---

## 1. Mission

A blank NixOS VM, given only **(1)** the two git repos (base `cc-ci` + private `cc-ci-secrets`), **(2)** the single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps**, secrets and the wildcard cert included (decrypted from git). Proven on a real throwaway VM by the loops.

---

## 2. The reproducibility model (target architecture)

**Split only the *secrets* into their own repo. The base stays one well-parameterized repo.** The boundary is *secrecy*, not modularity: secrets get a separate private repo (an extra access-control layer); instance-specific **non-secret** vars (domain, gateway/DNS facts) stay in the base as plain, changeable parameters. Another admin can repoint cc-ci by editing them; no second config repo is needed.

- **`cc-ci` (base; one repo, well-parameterized):** `flake.nix`, modules, `runner/`, `tests/`, `docs/`, **and** the instance config as parameters/defaults (e.g. `DOMAIN = "ci.commoninternet.net"`, gateway/DNS facts, sops recipients). Instance *config* lives here; only *secret material* is external. Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
- **`cc-ci-secrets` (private, `recipe-maintainers/cc-ci-secrets`):** holds **only** the sops-encrypted secret material: `secrets/secrets.yaml` (**wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC, registry creds if any, app/infra secrets) + `.sops.yaml`. **No code, no config logic**, just encrypted secrets, as a separate security layer with its own access control.
- **Linkage (default = flake input):** base `flake.nix` has `inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets"` (private; fetched via the bot token / a read deploy key); sops-nix reads `secrets/secrets.yaml` from it. *Alternative:* a git **submodule** at `cc-ci/secrets/`. Record the choice in `DECISIONS.md`; flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile` → the `cc-ci-secrets` `secrets.yaml`; `sops.age.sshKeyPaths` = the host SSH key, with the recovery age key (when that is what gets provisioned) supplied via `sops.age.keyFile`, since an age identity is not an SSH key. The **wildcard cert/key are sops secrets** decrypted at activation to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets, so there is **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the secrets (the host key, or the provisioned recovery key). It cannot live in the repo it decrypts. This is the *only* permitted "not in git" secret, provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway (network infra), documented as preconditions. (IaC for those is out of scope; see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted, in `cc-ci-secrets`); the **Gandi DNS token never enters any repo or the agent**. Renewal = operator re-issues the cert out-of-band, then commits the new sops-encrypted cert to `cc-ci-secrets` (versioned, reproducible).

---

## 3. Definition of Done

Terminates only when every item holds **and the Adversary has independently re-verified each within 24h, from a cold start** (logged in `REVIEW.md`):

- [ ] **C1: Secrets-repo split.** A separate private `cc-ci-secrets` repo holds **only** the sops-encrypted secrets (+ `.sops.yaml`), consumed by the base via a flake input. The base `cc-ci` stays one well-parameterized repo: instance vars (domain, gateway, recipients) remain changeable parameters in the base, **not** moved out (only secrets are external). `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2: Cert in git.** The wildcard cert+key are sops secrets in `cc-ci-secrets`, decrypted at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3: All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC, webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only** out-of-band secret is the bootstrap age key, documented precisely; nothing else.
- [ ] **C4: Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`), provisioned with *only* the bootstrap age key, the loops `git clone` base+secrets and run `nixos-rebuild switch`; the system activates and the reconcile oneshots converge (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step not in `docs/install.md`**. **The true proof is a clean-room repeat (C4 done right):** the Adversary **deletes** any existing throwaway VM, **creates a brand-new blank VM via Incus**, and runs the *entire* install from scratch (clone base+secrets → provision age key → `nixos-rebuild switch` → everything comes up), proving reproducibility on a genuinely fresh machine, with **no residue** from the Builder's setup attempt masking a gap. Done **cold** by the Adversary, with logged evidence (VM id, the exact commands from `docs/install.md`, convergence + TLS-from-git-cert proof).
- [ ] **C5: Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate), not a blanket "infeasible."
- [ ] **C6: Resource fit + cleanup.** `cc-nix-test` resized **6 GB→4 GB** and the throwaway VM created at **4 GB**, within the **~12 GB running-RAM guideline** (cc-nix-test 4 + lichen-staging 4 + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project limit). The throwaway VM is **destroyed** after the test (no leftover). Final `cc-nix-test` sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM; record in `DECISIONS.md`).
- [ ] **C7: Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+secrets + provision the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a fresh instance from the docs.

When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.

---

## 4. Incus capability (granted for this phase only)

The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`; create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at `/srv/incus-terraform-nix-vm-creator/terraform-secrets/` (b1 is reachable directly from the VM: direct tailnet peer, no proxy). See the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`) and [[cc-ci-vm-incus]].

Guardrails: only `terraform-ci`; keep total running RAM within the **~12 GB guideline** (doc-only: terraform-ci has no enforced `limits.memory`; b1 is 16 GB physical), hence `cc-nix-test`→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; **destroy the throwaway VM when done**; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times out; see memory).

---

## 5. Method (ordered; each milestone ends with an Adversary gate)

1. **W1: Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room; cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
2. **W2: Secrets repo + cert into git.** Create the private `cc-ci-secrets` repo; move **all secrets** into sops there, including the **wildcard cert+key** (read from the current `/var/lib/ci-certs/live`) and the existing `secrets/secrets.yaml` contents; keep instance vars parameterized in the base; wire the base flake to consume `cc-ci-secrets` (flake input). *Accept:* `nixos-rebuild build` of the restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test` `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3: Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized **4 GB**. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4: Reproducible live rebuild.** On the throwaway VM: clone base+secrets, `nixos-rebuild switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step outside `docs/install.md`**; capture evidence.
5. **W5: Adversary clean-room proof + honest D8.** The Adversary **deletes** the Builder's throwaway VM, **creates a brand-new blank VM**, and runs the full install from scratch per `docs/install.md` (clone base+secrets → provision age key → `nixos-rebuild switch` → all up) on a genuinely fresh machine, no residue. Then rewrites the D8 evidence (static byte-identical + this live clean-room rebuild), removing "infeasible by design." *Accept:* Adversary logs a real D8 live-rebuild PASS on a freshly-created VM (or a narrow, signed-off limitation per §3 C5).
6. **W6: Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c `STATUS.md` to `## DONE`.

---

## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true, narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off**; the host-key and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If the loops find another secret that "has to" be out-of-band, that's a finding to design away, not accept.
- **Gandi token stays out of repo/agent.** Only the cert artifact is committed (encrypted). Renewal is operator-issues-then-commits.
- **No plaintext secret leaks into the base (or the store).** Instance *vars* (domain, gateway) may live in the base as parameters, which is fine; what must NOT leak is any *secret* (cert/keys/tokens): those stay encrypted in `cc-ci-secrets`. The Adversary greps the base + the Nix store for plaintext secret material.
- **Incus guardrails** (§4): terraform-ci only, respect the ~12 GB RAM guideline, destroy the throwaway VM, don't touch other instances.
- **No weakened tests / no drift.** The restructured config must remain byte-identical to running (zero drift) and all of D1–D10 must still hold after the refactor.

---

## 7. Open decisions (log in DECISIONS.md)

- **Flake input vs git submodule** for the `cc-ci-secrets` repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway): likely a separate future item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- How the loops fetch the private `cc-ci-secrets` repo during rebuild (bot token-in-URL vs a read deploy key); it is private under `recipe-maintainers` (bot is admin).
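
The §6 guardrail "the Adversary greps the base + the Nix store for plaintext secret material" could be sketched roughly as below. This is a minimal illustration, not the project's tooling: the names `scan_tree` and `PLAINTEXT_MARKERS` are hypothetical, the marker list is an assumption (PEM private-key headers and age identity prefixes only), and it reads each file whole, which would need streaming before pointing it at a full Nix store.

```python
import os
import re
import sys

# Byte patterns that suggest unencrypted key material.
# Assumption: this list is illustrative, not exhaustive.
PLAINTEXT_MARKERS = [
    # PEM private keys, e.g. "-----BEGIN PRIVATE KEY-----" or "-----BEGIN EC PRIVATE KEY-----"
    re.compile(rb"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----"),
    # age identities (private keys) start with this prefix
    re.compile(rb"AGE-SECRET-KEY-1[0-9A-Z]+"),
]


def scan_tree(root):
    """Return sorted paths (relative to root) of files containing a marker."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata; only the checked-out tree is of interest here.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore symlinks and special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file: skipped in this sketch
            if any(p.search(data) for p in PLAINTEXT_MARKERS):
                hits.append(os.path.relpath(path, root))
    return sorted(hits)


if __name__ == "__main__":
    # Usage: python scan.py <dir> [<dir> ...]
    for root in sys.argv[1:]:
        for hit in scan_tree(root):
            print(f"{root}: plaintext key material in {hit}")
```

A sops-encrypted `secrets.yaml` should come back clean, since its values are stored as `ENC[...]` blobs with no PEM header; any hit in the base checkout would be a finding under the "exactly one out-of-band secret" rule.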