Add Phase-1c plan: full git reproducibility (secrets+cert in sops) + genuine D8 live rebuild
D8's throwaway-VM live rebuild was wrongly marked "infeasible by design": the master recovery age key defeats the sops-host-key reason, DNS/cert is a precondition not a rebuild blocker, and Incus was available.

Phase 1c (loop-driven):
(A) make the VM fully reproducible from git including ALL secrets: move the wildcard cert + every secret into sops-in-git, and split the generic base repo from a private instance repo composed via a flake input (the only out-of-band secret is the bootstrap age key);
(B) actually perform + cold-verify a blank-VM nixos-rebuild and rewrite D8 honestly.

Resize cc-nix-test to 2GB first to free b1 headroom for a sized throwaway VM; destroy it after; restore/promote sizing. Gandi token stays out of repo/agent (only the cert artifact is committed). Linked README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -16,6 +16,7 @@ autonomous Claude loops (a Builder and an adversarial Reviewer) running over day
|---|---|
| `plan.md` | The Phase-1 plan (build the CI server). Agents treat it as their single source of truth. |
| `plan-phase1b-review-lint.md` | **Phase 1b** (bounded pass at the end of Phase 1): deterministic linting/formatting in CI + a white-box review checklist (real tests, DRY harness, idempotent Nix, no footguns/secrets). |
| `plan-phase1c-full-reproducibility.md` | **Phase 1c**: make the VM fully reproducible from git (all secrets incl. the wildcard cert in sops; generic base + private instance flake input) and do the **genuine throwaway-VM live rebuild** to close D8 honestly (the "infeasible by design" was overstated). |
| `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase 1b): author comprehensive per-recipe tests — port every recipe-maintainer test + ≥2 recipe-specific tests per app. |
| `plan-phase2b-test-performance.md` | **Phase 2b** (after Phase 2, before Phase 3): empirically measure where test time goes and reduce it (image cache, readiness tuning, dedup deploys, warm infra, concurrency) — no weakened tests. |
| `plan-phase3-results-ux.md` | **Phase 3** (after Phase 2b): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
173
cc-ci-plan/plan-phase1c-full-reproducibility.md
Normal file
@@ -0,0 +1,173 @@
# cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

**Status:** QUEUED — runs after Phase 1 (`plan.md`); pairs with Phase 1b (review/lint). **Manual**
transition. **Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7) —
the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`

---
## 0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design
(sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:
- **sops host-key binding** is defeated by the project's **own master recovery age key**
  (`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*) —
  a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker — it only gates the full
  end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete
  (D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and
  refused to self-certify; the bar then slipped to "infeasible."

This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
all secrets** (move the wildcard cert and everything else into sops-in-git; split generic base from
private instance), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
this adds the *live* half it was missing.

---
## 1. Mission

A blank NixOS VM, given only **(1)** the two git repos (generic base + private instance), **(2)** the
single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.

---
## 2. The reproducibility model (target architecture)

**Two repos, composed via a flake input (default) — generic base + private instance.**

- **`cc-ci` (base, instance-agnostic):** `flake.nix` exposing a parameterized `nixosModules.cc-ci`,
  plus `runner/`, `tests/`, `docs/`. **No hardcoded domain, no instance secrets.**
- **`cc-ci-instance` (private, e.g. `recipe-maintainers/cc-ci-instance`):** `instance.nix`
  (DOMAIN=`ci.commoninternet.net`, gateway/DNS facts, sops recipients), `secrets/secrets.yaml`
  (sops-encrypted: **wildcard cert + key**, Drone OAuth client_id/secret + RPC secret, webhook HMAC,
  registry creds if any, app/infra secrets), and `.sops.yaml`.
- **Linkage (default = flake input):** base `flake.nix` has
  `inputs.instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance"`
  (private; fetched via the bot token / a read deploy key), and
  `nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem { modules = [ self.nixosModules.cc-ci instance.nixosModules.instance ]; }`.
  *Alternative:* a git **submodule** at `cc-ci/instance/` (simpler single checkout; submodule
  footguns). Record the choice in `DECISIONS.md`; flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile` = the instance repo's `secrets/secrets.yaml`;
  `sops.age.sshKeyPaths` = the host SSH key, with the recovery age key (when provisioned) supplied
  via `sops.age.keyFile`, since it is an age key rather than an SSH key. Both are listed as
  recipients in `.sops.yaml`. The **wildcard cert/key are sops secrets** decrypted at activation to
  `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed into the Traefik recipe's
  `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the repo's sops
  secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts.
  This is the *only* permitted "not in git" secret, and it's provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
  (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted); the **Gandi DNS
  token never enters the repo or the agent**. Renewal = operator re-issues the cert out-of-band, then
  commits the new sops-encrypted cert to the instance repo (a versioned, reproducible renewal).

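
The base-side composition described in the linkage bullet can be sketched as a `flake.nix`. This is a minimal illustration under stated assumptions, not the project's actual file: the `nixpkgs` branch, the `sops-nix` input, the `x86_64-linux` system, and the `services.cc-ci.domain` option name are all hypothetical, and hardware/boot configuration is omitted. Only the `instance` input URL and the two module names come from the plan. Because the instance is an ordinary flake input, a different instance checkout can be substituted at build time with Nix's `--override-input instance <flake-url>` flag, which keeps the base testable without the private repo.

```nix
# flake.nix in the base repo (sketch; see assumptions above)
{
  description = "cc-ci base: instance-agnostic CI server";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable"; # assumption: any pinned branch works
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
    # Private instance repo; fetching it needs the bot token or a read deploy key.
    instance.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-instance";
  };

  outputs = { self, nixpkgs, sops-nix, instance, ... }: {
    # Parameterized module: declares options, hardcodes no domain or secret.
    nixosModules.cc-ci = { lib, ... }: {
      options.services.cc-ci.domain = lib.mkOption { # hypothetical option name
        type = lib.types.str;
        description = "Public domain of this instance (set by the instance repo).";
      };
    };

    nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumption
      modules = [
        sops-nix.nixosModules.sops
        self.nixosModules.cc-ci
        instance.nixosModules.instance
      ];
    };
  };
}
```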
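
On the instance side, a matching sketch of what `cc-ci-instance` might expose as `nixosModules.instance`. The flake layout and the `services.cc-ci.domain` option name are assumptions made for illustration (the option would have to be declared by the base module); the domain value and the `secrets/secrets.yaml` path are the ones named in the plan.

```nix
# flake.nix in the private instance repo (sketch)
{
  description = "cc-ci-instance: instance facts + sops-encrypted secrets";

  outputs = { self, ... }: {
    nixosModules.instance = { ... }: {
      # Instance facts; in the plan these live in instance.nix.
      services.cc-ci.domain = "ci.commoninternet.net"; # hypothetical option name

      # The file is referenced in its encrypted form; decryption happens
      # on the host at activation, using the bootstrap age key.
      sops.defaultSopsFile = ./secrets/secrets.yaml;
    };
  };
}
```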
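
The cert-in-sops wiring could look roughly like the module fragment below. The secret key names (`wildcard/fullchain`, `wildcard/privkey`), the file modes, and the host-key path are assumptions; `sops.secrets.<name>.path` is the sops-nix option that exposes the decrypted file at a chosen location. Loading the files into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets is left to the existing reconcile units and is not shown.

```nix
# Module fragment (sketch): wildcard cert + key as sops-nix secrets
{ ... }:
{
  # Host SSH key, used by sops-nix as an age identity (assumed default location).
  sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];

  # "a/b" addresses the nested YAML key a -> b in the default sops file.
  sops.secrets."wildcard/fullchain" = {
    path = "/var/lib/ci-certs/live/fullchain.pem";
    mode = "0444"; # assumption: the chain is public material
  };
  sops.secrets."wildcard/privkey" = {
    path = "/var/lib/ci-certs/live/privkey.pem";
    mode = "0400"; # assumption: readable by the owner only
  };
}
```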
---
## 3. Definition of Done

Terminates only when every item holds **and the Adversary has independently re-verified each within
24h, from a cold start** (logged in `REVIEW.md`):

- [ ] **C1 — Repo split.** Generic base + private instance repo, composed (flake input by default).
  The base builds with no instance secrets/domain baked in; the instance carries all instance
  specifics. `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in the instance repo, decrypted
  at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
  gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
  webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only**
  out-of-band secret is the bootstrap age key — documented precisely, nothing else.
- [ ] **C4 — Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`),
  provisioned with *only* the bootstrap age key, the loops `git clone` base+instance and run
  `nixos-rebuild switch`; the system activates and the reconcile oneshots converge
  (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step
  not in `docs/install.md`**. The Adversary performs this **cold** and logs evidence.
- [ ] **C5 — Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the
  live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any
  single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off
  limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a
  blanket "infeasible."
- [ ] **C6 — Resource fit + cleanup.** `cc-nix-test` resized to **2 GB** to free b1 headroom for a
  properly-sized throwaway VM (§5 step 1); the throwaway VM is **destroyed** after the test (no
  leftover, respect the `terraform-ci` <10 GB-running cap); final `cc-nix-test` sizing decided and
  applied (restore to 6 GB, or promote the rebuilt VM — record in `DECISIONS.md`).
- [ ] **C7 — Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's
  cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+instance + provision
  the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a
  fresh instance from the docs.

When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.

---
## 4. Incus capability (granted for this phase only)

The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`;
create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at
`/srv/incus-terraform-nix-vm-creator/terraform-secrets/` through the existing SOCKS proxy
(`127.0.0.1:1055`) — see the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`)
and [[cc-ci-vm-incus]]. Guardrails: only `terraform-ci`; **respect the <10 GB running-RAM cap**
(that's why `cc-nix-test`→2 GB first); **destroy the throwaway VM when done**; never touch other
projects/instances; live-memory changes need stop→set→start (hotplug times out — see memory).

---
## 5. Method (ordered; each milestone ends with an Adversary gate)

1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**2 GB** (stop→set→start) to fit a ~6 GB throwaway VM
   under b1's budget. *Accept:* b1 has room; cc-nix-test still healthy at 2 GB (no heavy recipe CI
   runs during 1c). *(Note: restore sizing in W6.)*
2. **W2 — Repo split + secrets into git.** Create the private `cc-ci-instance` repo; move instance
   specifics + all secrets (incl. the **wildcard cert+key**, read from `/var/lib/ci-certs/live`) into
   sops there; wire the base flake to consume it (flake input). *Accept:* `nixos-rebuild build` of the
   restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
   `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
   ~6 GB. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4 — Reproducible live rebuild.** On the throwaway VM: clone base+instance,
   `nixos-rebuild switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up
   with **no step outside `docs/install.md`**; capture evidence.
5. **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 from scratch independently and
   rewrites the D8 evidence (static + live), removing "infeasible by design." *Accept:* Adversary
   logs a real D8 live-rebuild PASS (or a narrow, signed-off limitation per §3 C5).
6. **W6 — Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and
   apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c
   `STATUS.md` to `## DONE`.

---
## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true,
  narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off** — the host-key
  and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If
  the loops find another secret that "has to" be out-of-band, that's a finding to design away, not
  accept.
- **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
  is operator-issues-then-commits.
- **Base repo stays generic** — no instance domain/secret leakage into the base; the Adversary checks
  the base builds/clones clean of instance specifics.
- **Incus guardrails** (§4): terraform-ci only, respect the RAM cap, destroy the throwaway VM, don't
  touch other instances.
- **No weakened tests / no drift** — the restructured config must remain byte-identical to running
  (zero drift) and all of D1–D10 must still hold after the refactor.

---
## 7. Open decisions (log in DECISIONS.md)
- **Flake input vs git submodule** for the instance repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
  (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is
  simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt
  reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
  item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- Whether the instance repo is private under `recipe-maintainers` (bot is admin) and how the loops
  fetch it during rebuild (token-in-URL vs read deploy key).

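
For the bootstrap-key decision, the two options differ mainly in which sops-nix key source the new host uses. A hedged sketch follows; the `useRecoveryKey` toggle and both file paths are hypothetical, since the plan does not fix where the key is provisioned. With the second option, the secrets file would first need re-encrypting to the new host's recipient (for example with `sops updatekeys`) before the rebuild can decrypt anything.

```nix
# Module fragment (sketch): choosing the key source for a new host
{ ... }:
let
  useRecoveryKey = true; # hypothetical toggle, for illustration only
in
{
  sops.age =
    if useRecoveryKey then {
      # Option 1: provision the off-box recovery age key; secrets decrypt as-is.
      keyFile = "/var/lib/sops-nix/key.txt"; # assumption: provisioning path
      sshKeyPaths = [ ];
    } else {
      # Option 2: rely on the new host's own SSH key as the age identity.
      sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ]; # assumed default location
    };
}
```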