cc-ci-orchestrator/cc-ci-plan/plan-phase1c-full-reproducibility.md
autonomic-bot 01874821f2 decommission Pi: update all docs for VM-only setup
The orchestrator Pi is retired (2026-05-31). All agents now run on the
cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a
direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled,
no ProxyCommand. Updated across all affected files:

AGENTS.md
  - Remove Pi from reboot description; migration complete (not "parked")
  - cc-ci access: direct ssh, not via proxy

kickoff.md
  - Prerequisites: direct tailnet peer, not proxy
  - Host deps: NixOS (not apt)
  - Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
  - §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
  - §1.5 intro: "VM" not "sandbox host"; no proxy
  - Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
  - Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
  - Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
  - Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
  - cc-ci access language only: direct ssh, no proxy restart instructions
  - adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
  - Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
  - User/Group/HOME/PATH: notplants → loops
  - Remove cc-ci-tailscaled.service dependency (no proxy on VM)
  - Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
  - tailscale status: no --socket flag
  - ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 00:16:37 +00:00


cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

Status: QUEUED — runs after Phase 1 (plan.md) and before Phase 1b (review/lint), so the review/lint pass covers this refactor and its final cold re-verification proves the genuine (post-1c) D8. Manual transition. Driven by the Builder + Adversary loops (same protocol as plan.md §6/§6.1/§7) — the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold. This file's path: /srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md


0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM live rebuild "documented infeasible by design (sops host-key binding + operator DNS/cert)." That justification doesn't hold up:

  • sops host-key binding is defeated by the project's own master recovery age key (/srv/cc-ci/.sops/master-age.txt, a sops recipient created "for re-keying if cc-ci is lost") — a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is not infeasible.
  • operator DNS/cert is a precondition, not a rebuild blocker — it only gates the full end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
  • Incus is available, and the rate-limit premise that originally deferred the test is obsolete (D10 passed without registry creds). The Builder itself flagged the rebuild as feasible now and refused to self-certify; the bar then slipped to "infeasible."

This phase does two connected things: (A) make the VM fully reproducible from git, including all secrets (move the wildcard cert and everything else into sops-in-git, in a separate private cc-ci-secrets repo), and (B) actually perform and verify the throwaway-VM live rebuild, closing D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the static half of D8; this adds the live half it was missing.


1. Mission

A blank NixOS VM, given only (1) the two git repos (base cc-ci + private cc-ci-secrets), (2) the single bootstrap age key, and (3) the external DNS/gateway already pointing at it, becomes a working cc-ci via nixos-rebuild switch with no undocumented manual steps — secrets and the wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.


2. The reproducibility model (target architecture)

Split only the secrets into their own repo. The base stays one well-parameterized repo. The boundary is secrecy, not modularity: secrets get a separate private repo (an extra access-control layer); instance-specific non-secret vars (domain, gateway/DNS facts) stay in the base as plain, changeable parameters — another admin can repoint cc-ci by editing them, no second config repo needed.

  • cc-ci (base — one repo, well-parameterized): flake.nix, modules, runner/, tests/, docs/, and the instance config as parameters/defaults (e.g. DOMAIN = "ci.commoninternet.net", gateway/DNS facts, sops recipients). Instance config lives here; only secret material is external. Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
  • cc-ci-secrets (private, recipe-maintainers/cc-ci-secrets): holds only the sops-encrypted secret material — secrets/secrets.yaml (wildcard cert + key, Drone OAuth client_id/secret + RPC secret, webhook HMAC, registry creds if any, app/infra secrets) + .sops.yaml. No code, no config logic — just encrypted secrets, as a separate security layer with its own access control.
  • Linkage (default = flake input): base flake.nix has inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets" (private; fetched via the bot token / a read deploy key); sops-nix reads secrets/secrets.yaml from it. Alternative: a git submodule at cc-ci/secrets/. Record the choice in DECISIONS.md; flake input is the recommended default.
  • sops-nix wiring: sops.defaultSopsFile → the cc-ci-secrets secrets.yaml; decryption keys = the host SSH key via sops.age.sshKeyPaths, plus the recovery age key via sops.age.keyFile when that key is the one provisioned. The wildcard cert/key are sops secrets decrypted at activation to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} and fed into the Traefik recipe's ssl_cert/ssl_key swarm secrets, so no out-of-band cert file is needed.
  • The one irreducible out-of-band secret: the age private key that unlocks the secrets (the host key, or the provisioned recovery key) — it cannot live in the repo it decrypts. This is the only permitted "not in git" secret, provisioned to the host at creation.
  • Still external (not the VM's git, by nature): the DNS records + the TLS-passthrough gateway (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
  • Token discipline preserved: only the cert artifact enters git (encrypted, in cc-ci-secrets); the Gandi DNS token never enters any repo or the agent. Renewal = operator re-issues the cert out-of-band, then commits the new sops-encrypted cert to cc-ci-secrets (versioned, reproducible).

3. Definition of Done

Terminates only when every item holds and the Adversary has independently re-verified each within 24h, from a cold start (logged in REVIEW.md):

  • C1 — Secrets-repo split. A separate private cc-ci-secrets repo holds only the sops-encrypted secrets (+ .sops.yaml), consumed by the base via a flake input. The base cc-ci stays one well-parameterized repo — instance vars (domain, gateway, recipients) remain changeable parameters in the base, not moved out (only secrets are external). nixosConfigurations.cc-ci still builds byte-identically to the running system.
  • C2 — Cert in git. The wildcard cert+key are sops secrets in cc-ci-secrets, decrypted at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
  • C3 — All secrets in git (one exception). Every infra/app secret (cert, Drone OAuth/RPC, webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The only out-of-band secret is the bootstrap age key — documented precisely, nothing else.
  • C4 — Genuine throwaway-VM live rebuild. On a blank NixOS VM (Incus, terraform-ci), provisioned with only the bootstrap age key, the loops git clone base+secrets and run nixos-rebuild switch; the system activates and the reconcile oneshots converge (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with no manual step not in docs/install.md. The true proof is a clean-room repeat (C4 done right): the Adversary deletes any existing throwaway VM, creates a brand-new blank VM via Incus, and runs the entire install from scratch (clone base+secrets → provision age key → nixos-rebuild switch → everything comes up) — proving reproducibility on a genuinely fresh machine, with no residue from the Builder's setup attempt masking a gap. Done cold by the Adversary, with logged evidence (VM id, the exact commands from docs/install.md, convergence + TLS-from-git-cert proof).
  • C5 — Honest D8. The D8 evidence is rewritten: byte-identical closure (static) plus the live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a blanket "infeasible."
  • C6 — Resource fit + cleanup. cc-nix-test resized 6 GB→4 GB and the throwaway VM created at 4 GB, within the ~12 GB running-RAM guideline (cc-nix-test 4 + lichen-staging 4 + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project limit). The throwaway VM is destroyed after the test (no leftover). Final cc-nix-test sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM — record in DECISIONS.md).
  • C7 — Docs. docs/install.md, docs/secrets.md, architecture.md, and the main plan's cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+secrets + provision the age key + (external) DNS/gateway → one nixos-rebuild switch. A new engineer can stand up a fresh instance from the docs alone.

When C1 through C7 hold and are Adversary-verified, write ## DONE to Phase-1c STATUS.md.
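The termination rule above can be sketched as a small gate check. This is a minimal illustration only: the record fields (`passed`, `cold_start`, `verified_at`) are hypothetical and do not reflect the actual REVIEW.md format, and "within 24h" is read here as "no Adversary verification is older than 24 hours at the moment of the check".

```python
from datetime import datetime, timedelta, timezone

# Criteria named in the Definition of Done.
CRITERIA = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
MAX_AGE = timedelta(hours=24)


def phase_done(verifications, now):
    """Return (done, blocking) for a mapping of criterion -> record.

    Each record is a hypothetical dict with:
      "passed":      bool, the Adversary's verdict
      "cold_start":  bool, whether the check was run from a cold start
      "verified_at": timezone-aware datetime of the Adversary's check
    """
    blocking = []
    for criterion in CRITERIA:
        record = verifications.get(criterion)
        if record is None:
            blocking.append((criterion, "not verified"))
        elif not record["passed"]:
            blocking.append((criterion, "failed"))
        elif not record["cold_start"]:
            blocking.append((criterion, "not a cold-start check"))
        elif now - record["verified_at"] > MAX_AGE:
            blocking.append((criterion, "verification older than 24h"))
    return (not blocking, blocking)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {
        c: {"passed": True, "cold_start": True, "verified_at": now - timedelta(hours=1)}
        for c in CRITERIA
    }
    print(phase_done(example, now))
```

Only when `done` is true would the loops write ## DONE; otherwise `blocking` names the criteria that still need a fresh cold-start check.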


4. Incus capability (granted for this phase only)

The loops normally only ssh cc-ci. For 1c they MAY drive Incus on b1 (resize cc-nix-test; create/destroy ONE throwaway VM in terraform-ci), using the mTLS certs at /srv/incus-terraform-nix-vm-creator/terraform-secrets/ (b1 is reachable directly from the VM — direct tailnet peer, no proxy) — see the incus skill (/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md) and cc-ci-vm-incus. Guardrails: only terraform-ci; keep total running RAM within the ~12 GB guideline (doc-only — terraform-ci has no enforced limits.memory; b1 is 16 GB physical) — hence cc-nix-test→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; destroy the throwaway VM when done; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times out — see memory).
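The RAM arithmetic behind this guardrail can be checked with a few lines. This is a sketch under the figures stated above (12 GB guideline, 16 GB physical on b1); the function and variable names are illustrative, and nothing here queries Incus.

```python
GUIDELINE_GB = 12   # documented running-RAM guideline (not enforced by Incus)
PHYSICAL_GB = 16    # physical RAM on b1


def ram_budget(running_gb):
    """Summarise a mapping of instance name -> running RAM in GB."""
    total = sum(running_gb.values())
    return {
        "total_gb": total,
        "within_guideline": total <= GUIDELINE_GB,
        "headroom_to_physical_gb": PHYSICAL_GB - total,
    }


# Planned layout for 1c, after cc-nix-test is resized to 4 GB.
planned = {"cc-nix-test": 4, "lichen-staging": 4, "throwaway": 4}
# What would happen if the throwaway VM were created before the resize.
unresized = {"cc-nix-test": 6, "lichen-staging": 4, "throwaway": 4}

print(ram_budget(planned))
print(ram_budget(unresized))
```

The second case lands at 14 GB: still under physical RAM, but over the guideline, which is why the resize comes first.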


5. Method (ordered; each milestone ends with an Adversary gate)

  1. W1 — Headroom. Resize cc-nix-test 6 GB→4 GB (stop→set→start) so a 4 GB throwaway VM fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). Accept: b1 has room; cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). (Final sizing decided in W6.)
  2. W2 — Secrets repo + cert into git. Create the private cc-ci-secrets repo; move all secrets into sops there — including the wildcard cert+key (read from the current /var/lib/ci-certs/live) and the existing secrets/secrets.yaml contents; keep instance vars parameterized in the base; wire the base flake to consume cc-ci-secrets (flake input). Accept: nixos-rebuild build of the restructured config is byte-identical to the running system (zero drift), and cc-nix-test nixos-rebuild switches cleanly onto the new structure with TLS still served from the git cert.
  3. W3 — Throwaway VM. Create a blank NixOS VM in terraform-ci (the incus-base image), sized 4 GB. Accept: VM reachable; bootstrap age key provisioned by the documented mechanism only.
  4. W4 — Reproducible live rebuild. On the throwaway VM: clone base+secrets, nixos-rebuild switch, watch oneshots converge, secrets+cert decrypt. Accept: system fully up with no step outside docs/install.md; capture evidence.
  5. W5 — Adversary clean-room proof + honest D8. The Adversary deletes the Builder's throwaway VM, creates a brand-new blank VM, and runs the full install from scratch per docs/install.md (clone base+secrets → provision age key → nixos-rebuild switch → all up) — a genuinely fresh machine, no residue. Then rewrites the D8 evidence (static byte-identical + this live clean-room rebuild), removing "infeasible by design." Accept: Adversary logs a real D8 live-rebuild PASS on a freshly-created VM (or a narrow, signed-off limitation per §3 C5).
  6. W6 — Cleanup + docs + final sizing. Destroy the throwaway VM; update all docs (C7); decide and apply final cc-nix-test sizing. Accept: no leftover VM/secret leak; docs match; flip Phase-1c STATUS.md to ## DONE.
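The install sequence used in W4 and repeated clean-room in W5 (clone base+secrets → provision age key → nixos-rebuild switch) can be written down as a dry-run command plan, which is also a convenient shape for the logged evidence. This is a sketch, not the contents of docs/install.md: the base repo URL, both key paths, and the use of `install` to place the key are placeholders, and the flake attribute `cc-ci` is taken from nixosConfigurations.cc-ci. Nothing is executed.

```python
import shlex


def install_plan(base_url, secrets_url, age_key_src, age_key_dest, flake_attr="cc-ci"):
    """Return the ordered argv lists for a clean-room install (dry run only)."""
    return [
        ["git", "clone", base_url, "cc-ci"],
        ["git", "clone", secrets_url, "cc-ci-secrets"],
        # Placeholder provisioning step: copy the bootstrap age key into place.
        ["install", "-m", "0600", age_key_src, age_key_dest],
        ["nixos-rebuild", "switch", "--flake", f"./cc-ci#{flake_attr}"],
    ]


def render(plan):
    """Format each argv list as one shell-quoted line for the evidence log."""
    return [shlex.join(step) for step in plan]


plan = install_plan(
    base_url="https://example.invalid/cc-ci.git",            # placeholder
    secrets_url="https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets",
    age_key_src="/run/bootstrap/age.key",                     # placeholder
    age_key_dest="/var/lib/sops-nix/key.txt",                 # placeholder
)
for line in render(plan):
    print(line)
```

Keeping the plan as data makes it easy for the Adversary to diff the commands actually run against the ones documented.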

6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

  • Don't fake the rebuild. "Infeasible/can't reproduce" is allowed only for a true, narrowly-scoped blocker with the maximal tested subset and Adversary sign-off — the host-key and DNS/cert reasons are explicitly not valid (the recovery key + the cert-in-git fix remove them).
  • Exactly one out-of-band secret. The bootstrap age key. Everything else in git, encrypted. If the loops find another secret that "has to" be out-of-band, that's a finding to design away, not accept.
  • Gandi token stays out of repo/agent — only the cert artifact is committed (encrypted). Renewal is operator-issues-then-commits.
  • No plaintext secret leaks into the base (or the store). Instance vars (domain, gateway) may live in the base as parameters — that's fine; what must NOT leak is any secret (cert/keys/tokens): those stay encrypted in cc-ci-secrets. The Adversary greps the base + the Nix store for plaintext secret material.
  • Incus guardrails (§4): terraform-ci only, stay within the ~12 GB running-RAM guideline, destroy the throwaway VM, don't touch other instances.
  • No weakened tests / no drift: the restructured config must remain byte-identical to running (zero drift) and all of D1 through D10 must still hold after the refactor.
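The Adversary's grep for plaintext secret material could look roughly like the following. It is a minimal sketch that assumes PEM private-key headers and age secret-key strings are the material of interest. It reads whole files into memory and skips symlinks, so a real pass over the Nix store would need more care. Tokens and HMAC secrets have no fixed marker and would have to be searched for by value.

```python
import os
import re

# Markers that indicate unencrypted key material. Public certificates
# ("BEGIN CERTIFICATE") are deliberately not flagged, and sops ciphertext
# (ENC[...]) does not match either pattern.
PLAINTEXT_MARKERS = re.compile(
    rb"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----|AGE-SECRET-KEY-1[0-9A-Z]+"
)


def scan_tree(root):
    """Return sorted relative paths under root containing a plaintext marker."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skipped, not treated as clean
            if PLAINTEXT_MARKERS.search(data):
                hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Calling `scan_tree` on a checkout of the base repo should return an empty list; any path it returns is a finding to investigate.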

7. Open decisions (log in DECISIONS.md)

  • Flake input vs git submodule for the cc-ci-secrets repo (default: flake input).
  • Bootstrap-key provisioning for a new VM: provision the off-box recovery age key to the host (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler for a clone; per-host re-encrypt is cleaner long-term.)
  • Final cc-nix-test sizing after the test: keep 4 GB, restore to 6 GB, or promote the freshly-rebuilt reproducible VM to be the canonical cc-ci and retire the old one.
  • DNS/gateway as IaC (terraform for the Gandi records + the gateway) — likely a separate future item (IDEAS), out of 1c scope; 1c keeps them as documented external preconditions.
  • How the loops fetch the private cc-ci-secrets repo during rebuild (bot token-in-URL vs a read deploy key) — it's private under recipe-maintainers (bot is admin).