cc-ci-orchestrator/cc-ci-plan/plan-phase1c-full-reproducibility.md
autonomic-bot 769dfd0c62 Phase-1c: resource plan -> 4GB/4GB under a 12GB guideline (not 2GB)
Per operator: don't downsize cc-nix-test to 2GB. Instead raise the terraform-ci running-RAM
guideline to ~12GB (it's doc-only — the project has no enforced limits.memory; b1 is 16GB),
resize cc-nix-test 6->4GB, and create the throwaway VM at 4GB (4+4+lichen 4 = 12 <= 16).
Updated W1/W3/C6/§4 and the incus memory note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:29:37 +01:00


cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

Status: QUEUED. Runs after Phase 1 (plan.md); pairs with Phase 1b (review/lint). Manual transition. Driven by the Builder + Adversary loops (same protocol as plan.md §6/§6.1/§7): the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.

This file's path: /srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md


0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM live rebuild "documented infeasible by design (sops host-key binding + operator DNS/cert)." That justification doesn't hold up:

  • sops host-key binding is defeated by the project's own master recovery age key (/srv/cc-ci/.sops/master-age.txt, a sops recipient created "for re-keying if cc-ci is lost") — a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is not infeasible.
  • operator DNS/cert is a precondition, not a rebuild blocker — it only gates the full end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
  • Incus is available, and the rate-limit premise that originally deferred the test was obsolete (D10 passed without registry creds). The Builder itself flagged the rebuild as feasible now and refused to self-certify; the bar then slipped to "infeasible."

This phase does two connected things: (A) make the VM fully reproducible from git, including all secrets (move the wildcard cert and everything else into sops-in-git; split generic base from private instance), and (B) actually perform and verify the throwaway-VM live rebuild, closing D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the static half of D8; this adds the live half it was missing.


1. Mission

A blank NixOS VM, given only (1) the two git repos (generic base + private instance), (2) the single bootstrap age key, and (3) the external DNS/gateway already pointing at it, becomes a working cc-ci via nixos-rebuild switch with no undocumented manual steps — secrets and the wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.


2. The reproducibility model (target architecture)

Two repos, composed via a flake input (default) — generic base + private instance.

  • cc-ci (base, instance-agnostic): flake.nix exposing a parameterized nixosModules.cc-ci, plus runner/, tests/, docs/. No hardcoded domain, no instance secrets.
  • cc-ci-instance (private, e.g. recipe-maintainers/cc-ci-instance): instance.nix (DOMAIN=ci.commoninternet.net, gateway/DNS facts, sops recipients), secrets/secrets.yaml (sops-encrypted: wildcard cert + key, Drone OAuth client_id/secret + RPC secret, webhook HMAC, registry creds if any, app/infra secrets), and .sops.yaml.
  • Linkage (default = flake input): base flake.nix has inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance" (private; fetched via the bot token / a read deploy key), and nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }. Alternative: a git submodule at cc-ci/instance/ (simpler single checkout; submodule footguns). Record the choice in DECISIONS.md; flake input is the recommended default.
  • sops-nix wiring: sops.defaultSopsFile = the instance repo's secrets/secrets.yaml; sops.age.sshKeyPaths = the host SSH key, and sops.age.keyFile = the provisioned recovery age key (sshKeyPaths takes SSH keys only, so an age key file has to go through keyFile). The wildcard cert/key are sops secrets decrypted at activation to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} and fed into the Traefik recipe's ssl_cert/ssl_key swarm secrets, so there is no out-of-band cert file.
  • The one irreducible out-of-band secret: the age private key that unlocks the repo's sops secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts. This is the only permitted "not in git" secret, and it's provisioned to the host at creation.
  • Still external (not the VM's git, by nature): the DNS records + the TLS-passthrough gateway (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
  • Token discipline preserved: only the cert artifact enters git (encrypted); the Gandi DNS token never enters the repo or the agent. Renewal = operator re-issues the cert out-of-band, then commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).
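
The flake-input linkage described above could look roughly like the following base flake.nix. This is a minimal sketch, not the project's actual flake: the nixpkgs branch, the sops-nix input, the system string, and the option name ccCi.domain are all assumptions introduced for illustration. Only the instance URL and the two module names come from this plan.

```nix
# flake.nix in the base repo (illustrative sketch; see the assumptions noted inline)
{
  description = "cc-ci base (instance-agnostic)";

  inputs = {
    # Assumption: pin to whichever nixpkgs revision the running system uses.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # Assumption: sops-nix is consumed as a flake input.
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
    # Private instance repo (URL from this plan); fetched with the bot token
    # or a read deploy key.
    instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance";
  };

  outputs = { self, nixpkgs, sops-nix, instance, ... }: {
    # Parameterized base module: it declares options but hardcodes no domain
    # and no secret. The option name below is hypothetical.
    nixosModules.cc-ci = { lib, ... }: {
      options.ccCi.domain = lib.mkOption {
        type = lib.types.str;
        # No default: reading this option fails unless an instance defines it.
        description = "Public domain of this cc-ci instance.";
      };
    };

    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumption
      modules = [
        sops-nix.nixosModules.sops
        self.nixosModules.cc-ci
        instance.nixosModules.instance
      ];
    };
  };
}
```

Leaving the option without a default is one way to keep the base generic: the base carries no instance value of its own, and any consumer that reads the domain without an instance module defining it gets an evaluation error instead of a silently baked-in value.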

3. Definition of Done

Terminates only when every item holds and the Adversary has independently re-verified each within 24h, from a cold start (logged in REVIEW.md):

  • C1 — Repo split. Generic base + private instance repo, composed (flake input by default). The base builds with no instance secrets/domain baked in; the instance carries all instance specifics. nixosConfigurations.cc-ci still builds byte-identically to the running system.
  • C2 — Cert in git. The wildcard cert+key are sops secrets in the instance repo, decrypted at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
  • C3 — All secrets in git (one exception). Every infra/app secret (cert, Drone OAuth/RPC, webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The only out-of-band secret is the bootstrap age key — documented precisely, nothing else.
  • C4 — Genuine throwaway-VM live rebuild. On a blank NixOS VM (Incus, terraform-ci), provisioned with only the bootstrap age key, the loops git clone base+instance and run nixos-rebuild switch; the system activates and the reconcile oneshots converge (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with no manual step not in docs/install.md. The Adversary performs this cold and logs evidence.
  • C5 — Honest D8. The D8 evidence is rewritten: byte-identical closure (static) plus the live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a blanket "infeasible."
  • C6 — Resource fit + cleanup. cc-nix-test resized 6 GB→4 GB and the throwaway VM created at 4 GB, within the ~12 GB running-RAM guideline (cc-nix-test 4 + lichen-staging 4 + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project limit). The throwaway VM is destroyed after the test (no leftover). Final cc-nix-test sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM — record in DECISIONS.md).
  • C7 — Docs. docs/install.md, docs/secrets.md, architecture.md, and the main plan's cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+instance + provision the age key + (external) DNS/gateway → one nixos-rebuild switch. A new engineer can stand up a fresh instance from the docs.
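
To make C2 and C3 more concrete, here is a sketch of what the private instance repo's flake might export. It assumes the base already imports the sops-nix module, and the secret names (wildcard/fullchain, wildcard/privkey), the key file location, and the file modes are hypothetical. The cert target paths are the ones named in §2.

```nix
# flake.nix in the private instance repo (illustrative sketch)
{
  description = "cc-ci-instance (private)";

  outputs = { self, ... }: {
    nixosModules.instance = { ... }: {
      # All instance secrets live in one sops-encrypted file in git.
      sops.defaultSopsFile = ./secrets/secrets.yaml;

      # The one out-of-band secret: the bootstrap age key, provisioned to the
      # host at creation. The path is an assumption.
      sops.age.keyFile = "/var/lib/sops-nix/key.txt";
      # Additionally accept the host's SSH ed25519 key as an identity.
      sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];

      # Wildcard cert and key, decrypted at activation and exposed at the
      # paths the Traefik recipe expects. "a/b" names map to nested YAML keys.
      sops.secrets."wildcard/fullchain" = {
        path = "/var/lib/ci-certs/live/fullchain.pem";
        mode = "0440";
      };
      sops.secrets."wildcard/privkey" = {
        path = "/var/lib/ci-certs/live/privkey.pem";
        mode = "0400";
      };
    };
  };
}
```

With a layout like this, renewal stays a git operation: the operator re-encrypts the new cert into secrets/secrets.yaml and commits it, and the next nixos-rebuild switch picks it up.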

When C1–C7 hold and are Adversary-verified, write ## DONE to Phase-1c STATUS.md.


4. Incus capability (granted for this phase only)

The loops normally only ssh cc-ci. For 1c they MAY drive Incus on b1 (resize cc-nix-test; create/destroy ONE throwaway VM in terraform-ci), using the mTLS certs at /srv/incus-terraform-nix-vm-creator/terraform-secrets/ through the existing SOCKS proxy (127.0.0.1:1055) — see the incus skill (/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md) and cc-ci-vm-incus. Guardrails: only terraform-ci; keep total running RAM within the ~12 GB guideline (doc-only — terraform-ci has no enforced limits.memory; b1 is 16 GB physical) — hence cc-nix-test→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; destroy the throwaway VM when done; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times out — see memory).


5. Method (ordered; each milestone ends with an Adversary gate)

  1. W1 — Headroom. Resize cc-nix-test 6 GB→4 GB (stop→set→start) so a 4 GB throwaway VM fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). Accept: b1 has room; cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). (Final sizing decided in W6.)
  2. W2 — Repo split + secrets into git. Create the private cc-ci-instance repo; move instance specifics + all secrets (incl. the wildcard cert+key, read from /var/lib/ci-certs/live) into sops there; wire the base flake to consume it (flake input). Accept: nixos-rebuild build of the restructured config is byte-identical to the running system (zero drift), and cc-nix-test nixos-rebuild switches cleanly onto the new structure with TLS still served from the git cert.
  3. W3 — Throwaway VM. Create a blank NixOS VM in terraform-ci (the incus-base image), sized 4 GB. Accept: VM reachable; bootstrap age key provisioned by the documented mechanism only.
  4. W4 — Reproducible live rebuild. On the throwaway VM: clone base+instance, nixos-rebuild switch, watch oneshots converge, secrets+cert decrypt. Accept: system fully up with no step outside docs/install.md; capture evidence.
  5. W5 — Adversary cold proof + honest D8. Adversary repeats W4 from scratch independently and rewrites the D8 evidence (static + live), removing "infeasible by design." Accept: Adversary logs a real D8 live-rebuild PASS (or a narrow, signed-off limitation per §3 C5).
  6. W6 — Cleanup + docs + final sizing. Destroy the throwaway VM; update all docs (C7); decide and apply final cc-nix-test sizing. Accept: no leftover VM/secret leak; docs match; flip Phase-1c STATUS.md to ## DONE.

6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

  • Don't fake the rebuild. "Infeasible/can't reproduce" is allowed only for a true, narrowly-scoped blocker with the maximal tested subset and Adversary sign-off — the host-key and DNS/cert reasons are explicitly not valid (the recovery key + the cert-in-git fix remove them).
  • Exactly one out-of-band secret. The bootstrap age key. Everything else in git, encrypted. If the loops find another secret that "has to" be out-of-band, that's a finding to design away, not accept.
  • Gandi token stays out of repo/agent — only the cert artifact is committed (encrypted). Renewal is operator-issues-then-commits.
  • Base repo stays generic — no instance domain/secret leakage into the base; the Adversary checks the base builds/clones clean of instance specifics.
  • Incus guardrails (§4): terraform-ci only, keep total running RAM within the ~12 GB guideline, destroy the throwaway VM, don't touch other projects/instances.
  • No weakened tests / no drift: the restructured config must remain byte-identical to the running system (zero drift), and all of D1–D10 must still hold after the refactor.

7. Open decisions (log in DECISIONS.md)

  • Flake input vs git submodule for the instance repo (default: flake input).
  • Bootstrap-key provisioning for a new VM: provision the off-box recovery age key to the host (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler for a clone; per-host re-encrypt is cleaner long-term.)
  • Final cc-nix-test sizing after the test: keep 4 GB, restore to 6 GB, or promote the freshly-rebuilt reproducible VM to be the canonical cc-ci and retire the old one.
  • DNS/gateway as IaC (terraform for the Gandi records + the gateway) — likely a separate future item (IDEAS), out of 1c scope; 1c keeps them as documented external preconditions.
  • Whether the instance repo is private under recipe-maintainers (bot is admin) and how the loops fetch it during rebuild (token-in-URL vs read deploy key).