cc-ci-orchestrator/cc-ci-plan/plan-phase1c-full-reproducibility.md
autonomic-bot 769dfd0c62 Phase-1c: resource plan -> 4GB/4GB under a 12GB guideline (not 2GB)
Per operator: don't downsize cc-nix-test to 2GB. Instead raise the terraform-ci running-RAM
guideline to ~12GB (it's doc-only — the project has no enforced limits.memory; b1 is 16GB),
resize cc-nix-test 6->4GB, and create the throwaway VM at 4GB (4+4+lichen 4 = 12 <= 16).
Updated W1/W3/C6/§4 and the incus memory note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:29:37 +01:00

# cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)
**Status:** QUEUED — runs after Phase 1 (`plan.md`); pairs with Phase 1b (review/lint). **Manual**
transition. **Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7) —
the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
---
## 0. Why this phase
Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design
(sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:
- **sops host-key binding** is defeated by the project's **own master recovery age key**
(`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*) —
a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker — it only gates the full
end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete
(D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and
refused to self-certify; the bar then slipped to "infeasible."
This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
all secrets** (move the wildcard cert and everything else into sops-in-git; split generic base from
private instance), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
this adds the *live* half it was missing.
---
## 1. Mission
A blank NixOS VM, given only **(1)** the two git repos (generic base + private instance), **(2)** the
single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.
---
## 2. The reproducibility model (target architecture)
**Two repos, composed via a flake input (default) — generic base + private instance.**
- **`cc-ci` (base, instance-agnostic):** `flake.nix` exposing a parameterized `nixosModules.cc-ci`,
plus `runner/`, `tests/`, `docs/`. **No hardcoded domain, no instance secrets.**
- **`cc-ci-instance` (private, e.g. `recipe-maintainers/cc-ci-instance`):** `instance.nix`
(DOMAIN=`ci.commoninternet.net`, gateway/DNS facts, sops recipients), `secrets/secrets.yaml`
(sops-encrypted: **wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC,
registry creds if any, app/infra secrets), and `.sops.yaml`.
- **Linkage (default = flake input):** base `flake.nix` has
`inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance"`
(private; fetched via the bot token / a read deploy key), and
`nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }`.
*Alternative:* a git **submodule** at `cc-ci/instance/` (simpler single checkout; submodule
footguns). Record the choice in `DECISIONS.md`; flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile = instance secrets.yaml`; `sops.age.sshKeyPaths` = host
key, with the recovery key (when provisioned as an age key file) supplied via `sops.age.keyFile`. The **wildcard cert/key are sops secrets** decrypted at activation to
`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's
`ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the repo's sops
secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts.
This is the *only* permitted "not in git" secret, and it's provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
(network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted); the **Gandi DNS
token never enters the repo or the agent**. Renewal = operator re-issues the cert out-of-band, then
commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).
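
The flake-input linkage above could look roughly like the following base `flake.nix`. This is a minimal sketch, not the project's actual flake: the `nixpkgs` branch, the way `sops-nix` is pulled in, the `system` value, and the `ccCi.domain` option name are all assumptions made for illustration. Only the `instance` URL, `nixosModules.cc-ci`, and `instance.nixosModules.instance` come from the plan.

```nix
# cc-ci/flake.nix (base repo): illustrative sketch only.
{
  description = "cc-ci generic base";

  inputs = {
    # Assumption: placeholder branch; pin whatever the project already uses.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # Assumption: sops-nix consumed as a flake input.
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
    # Private instance repo, as named in the Linkage bullet.
    instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance";
  };

  outputs = { self, nixpkgs, sops-nix, instance, ... }: {
    # Instance-agnostic module: declares what an instance must supply,
    # but sets no domain and carries no secrets itself.
    nixosModules.cc-ci = { lib, ... }: {
      imports = [ sops-nix.nixosModules.sops ];
      options.ccCi.domain = lib.mkOption {  # hypothetical option name
        type = lib.types.str;
        description = "Public domain of this cc-ci instance, set by the instance repo.";
      };
    };

    # Composition of the generic base with the private instance module.
    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";  # assumption
      modules = [
        self.nixosModules.cc-ci
        instance.nixosModules.instance
      ];
    };
  };
}
```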
---
## 3. Definition of Done
Terminates only when every item holds **and the Adversary has independently re-verified each within
24h, from a cold start** (logged in `REVIEW.md`):
- [ ] **C1 — Repo split.** Generic base + private instance repo, composed (flake input by default).
The base builds with no instance secrets/domain baked in; the instance carries all instance
specifics. `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in the instance repo, decrypted
at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only**
out-of-band secret is the bootstrap age key — documented precisely, nothing else.
- [ ] **C4 — Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`),
provisioned with *only* the bootstrap age key, the loops `git clone` base+instance and run
`nixos-rebuild switch`; the system activates and the reconcile oneshots converge
(swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step
not in `docs/install.md`**. The Adversary performs this **cold** and logs evidence.
- [ ] **C5 — Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the
live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any
single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off
limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a
blanket "infeasible."
- [ ] **C6 — Resource fit + cleanup.** `cc-nix-test` resized **6 GB→4 GB** and the throwaway VM
created at **4 GB**, within the **~12 GB running-RAM guideline** (cc-nix-test 4 + lichen-staging 4
+ throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project
limit). The throwaway VM is **destroyed** after the test (no leftover). Final `cc-nix-test`
sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM — record in
`DECISIONS.md`).
- [ ] **C7 — Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's
cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+instance + provision
the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a
fresh instance from the docs.
When C1-C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.
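
For C2 and C3, the instance repo might declare the wildcard cert and key as sops secrets along these lines. This is a hedged sketch: the secret names `wildcard_fullchain` and `wildcard_privkey` are hypothetical and must match keys in `secrets/secrets.yaml`, and it assumes the base module already imports the sops-nix NixOS module. The target paths are the ones named in §2. The `path` option asks sops-nix to link the decrypted file at that location during activation.

```nix
# cc-ci-instance/flake.nix (private repo): illustrative sketch only.
{
  description = "cc-ci private instance";

  outputs = { self, ... }: {
    nixosModules.instance = { ... }: {
      # Encrypted secrets file committed next to this flake.
      sops.defaultSopsFile = ./secrets/secrets.yaml;

      # Hypothetical key names; each must exist in secrets.yaml.
      sops.secrets.wildcard_fullchain = {
        path = "/var/lib/ci-certs/live/fullchain.pem";
      };
      sops.secrets.wildcard_privkey = {
        path = "/var/lib/ci-certs/live/privkey.pem";
        mode = "0400";  # private key readable by its owner only
      };
    };
  };
}
```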
---
## 4. Incus capability (granted for this phase only)
The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`;
create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at
`/srv/incus-terraform-nix-vm-creator/terraform-secrets/` through the existing SOCKS proxy
(`127.0.0.1:1055`) — see the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`)
and [[cc-ci-vm-incus]]. Guardrails: only `terraform-ci`; keep total running RAM within the **~12 GB
guideline** (doc-only — terraform-ci has no enforced `limits.memory`; b1 is 16 GB physical) — hence
`cc-nix-test`→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; **destroy the throwaway VM when
done**; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times
out — see memory).
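
The RAM arithmetic behind this guardrail can be written down as a small Nix expression that refuses to evaluate if the running total exceeds the guideline. It is only a worked check of the numbers in this section; the file name and attribute names are made up for illustration, and nothing in Incus enforces it.

```nix
# ram-budget.nix (hypothetical file name).
# Evaluate with: nix-instantiate --eval --strict ram-budget.nix
let
  guidelineGB = 12;  # doc-only running-RAM guideline for terraform-ci
  physicalGB = 16;   # physical RAM on b1

  # Planned running VMs during Phase 1c, in GB.
  runningGB = {
    cc-nix-test = 4;
    lichen-staging = 4;
    throwaway = 4;
  };

  # Sum of all planned VM sizes.
  totalGB = builtins.foldl' (acc: n: acc + n) 0 (builtins.attrValues runningGB);
in
# Evaluation aborts if either inequality is violated.
assert totalGB <= guidelineGB;
assert guidelineGB <= physicalGB;
{ inherit totalGB guidelineGB physicalGB; }
```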
---
## 5. Method (ordered; each milestone ends with an Adversary gate)
1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM
fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room;
cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
2. **W2 — Repo split + secrets into git.** Create the private `cc-ci-instance` repo; move instance
specifics + all secrets (incl. the **wildcard cert+key**, read from `/var/lib/ci-certs/live`) into
sops there; wire the base flake to consume it (flake input). *Accept:* `nixos-rebuild build` of the
restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
`nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
**4 GB**. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4 — Reproducible live rebuild.** On the throwaway VM: clone base+instance, `nixos-rebuild
switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step
outside `docs/install.md`**; capture evidence.
5. **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 from scratch independently and
rewrites the D8 evidence (static + live), removing "infeasible by design." *Accept:* Adversary
logs a real D8 live-rebuild PASS (or a narrow, signed-off limitation per §3 C5).
6. **W6 — Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and
apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c
`STATUS.md` to `## DONE`.
---
## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)
- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true,
narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off** — the host-key
and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If
the loops find another secret that "has to" be out-of-band, that's a finding to design away, not
accept.
- **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
is operator-issues-then-commits.
- **Base repo stays generic** — no instance domain/secret leakage into the base; the Adversary checks
the base builds/clones clean of instance specifics.
- **Incus guardrails** (§4): terraform-ci only, respect the ~12 GB running-RAM guideline, destroy the throwaway VM, don't
touch other instances.
- **No weakened tests / no drift** — the restructured config must remain byte-identical to running
(zero drift) and all of D1-D10 must still hold after the refactor.
---
## 7. Open decisions (log in DECISIONS.md)
- **Flake input vs git submodule** for the instance repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
(decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is
simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt
reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- Whether the instance repo is private under `recipe-maintainers` (bot is admin) and how the loops
fetch it during rebuild (token-in-URL vs read deploy key).
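
The two bootstrap-key options above might translate into sops-nix settings roughly as follows. This is a sketch under stated assumptions: the `useRecoveryKey` toggle is hypothetical and exists only to show both branches side by side, and the key file path is a placeholder for wherever the documented provisioning mechanism puts the key. With the per-host option, the secrets would additionally need re-encrypting to the new host's recipient (for example with `sops updatekeys`) before the rebuild can decrypt them.

```nix
# Illustrative NixOS module contrasting the two bootstrap-key options.
{ ... }:
let
  useRecoveryKey = true;  # hypothetical toggle, for illustration only
in
{
  sops.age =
    if useRecoveryKey then {
      # Option 1: the off-box recovery age key is copied to the host at
      # creation and used as-is. Placeholder path (assumption).
      keyFile = "/var/lib/sops-nix/key.txt";
      sshKeyPaths = [ ];  # do not also derive keys from SSH host keys
    } else {
      # Option 2: decrypt with the new host's own SSH host key, after the
      # instance repo's secrets have been re-encrypted to that host.
      sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];
    };
}
```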