cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)
Status: QUEUED — runs after Phase 1 (plan.md) and before Phase 1b (review/lint), so the
review/lint pass covers this refactor and its final cold re-verification proves the genuine
(post-1c) D8. Manual transition. Driven by the Builder + Adversary loops (same protocol as plan.md §6/§6.1/§7) —
the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.
This file's path: /srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md
0. Why this phase
Phase-1 D8 was marked PASS with the throwaway-VM live rebuild "documented infeasible by design (sops host-key binding + operator DNS/cert)." That justification doesn't hold up:
- sops host-key binding is defeated by the project's own master recovery age key (/srv/cc-ci/.sops/master-age.txt, a sops recipient created "for re-keying if cc-ci is lost"): a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is not infeasible.
- operator DNS/cert is a precondition, not a rebuild blocker: it only gates the full end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete (D10 passed without registry creds). The Builder itself flagged the rebuild as feasible now and refused to self-certify; the bar then slipped to "infeasible."
This phase does two connected things: (A) make the VM fully reproducible from git, including
all secrets (move the wildcard cert and everything else into sops-in-git, in a separate private
cc-ci-secrets repo), and (B) actually perform and verify the throwaway-VM live rebuild, closing
D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the static half of D8;
this adds the live half it was missing.
1. Mission
A blank NixOS VM, given only (1) the two git repos (base cc-ci + private cc-ci-secrets), (2) the
single bootstrap age key, and (3) the external DNS/gateway already pointing at it, becomes a
working cc-ci via nixos-rebuild switch with no undocumented manual steps — secrets and the
wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.
2. The reproducibility model (target architecture)
Split only the secrets into their own repo. The base stays one well-parameterized repo. The boundary is secrecy, not modularity: secrets get a separate private repo (an extra access-control layer); instance-specific non-secret vars (domain, gateway/DNS facts) stay in the base as plain, changeable parameters — another admin can repoint cc-ci by editing them, no second config repo needed.
- cc-ci (base; one repo, well-parameterized): flake.nix, modules, runner/, tests/, docs/, and the instance config as parameters/defaults (e.g. DOMAIN = "ci.commoninternet.net", gateway/DNS facts, sops recipients). Instance config lives here; only secret material is external. Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
- cc-ci-secrets (private, recipe-maintainers/cc-ci-secrets): holds only the sops-encrypted secret material, i.e. secrets/secrets.yaml (wildcard cert + key, Drone OAuth client_id/secret + RPC secret, webhook HMAC, registry creds if any, app/infra secrets) plus .sops.yaml. No code, no config logic: just encrypted secrets, as a separate security layer with its own access control.
- Linkage (default = flake input): the base flake.nix has inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets" (private; fetched via the bot token / a read deploy key); sops-nix reads secrets/secrets.yaml from it. Alternative: a git submodule at cc-ci/secrets/. Record the choice in DECISIONS.md; flake input is the recommended default.
- sops-nix wiring: sops.defaultSopsFile → the cc-ci-secrets secrets.yaml; sops.age.sshKeyPaths = host key + the recovery recipient. The wildcard cert/key are sops secrets decrypted at activation to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} and fed into the Traefik recipe's ssl_cert/ssl_key swarm secrets, so there is no out-of-band cert file.
- The one irreducible out-of-band secret: the age private key that unlocks the secrets (the host key, or the provisioned recovery key). It cannot live in the repo it decrypts. This is the only permitted "not in git" secret, provisioned to the host at creation.
- Still external (not the VM's git, by nature): the DNS records + the TLS-passthrough gateway (network infra), documented as preconditions. (IaC for those is out of scope; see §7.)
- Token discipline preserved: only the cert artifact enters git (encrypted, in cc-ci-secrets); the Gandi DNS token never enters any repo or the agent. Renewal = operator re-issues the cert out-of-band, then commits the new sops-encrypted cert to cc-ci-secrets (versioned, reproducible).
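The "only encrypted material" property of cc-ci-secrets can be spot-checked mechanically. The following is a minimal sketch, not the project's tooling: it assumes secrets.yaml has already been parsed into a Python dict (by any YAML loader), that sops wraps every encrypted value as ENC[...], and that sops keeps its own metadata under a top-level "sops" key. The key names in the example (wildcard_cert, drone, client_id) are hypothetical, and a repo that uses sops options to leave selected keys unencrypted on purpose would need an allow-list.

```python
from typing import Any, Iterator

SOPS_METADATA_KEY = "sops"  # top-level block sops adds to every encrypted file
ENC_PREFIX = "ENC["         # sops stores each encrypted value as ENC[...]


def plaintext_leaves(doc: Any, path: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf value that is not sops-encrypted."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            if path == "" and key == SOPS_METADATA_KEY:
                continue  # sops metadata (recipients, MAC) is unencrypted by design
            yield from plaintext_leaves(value, f"{path}.{key}" if path else str(key))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from plaintext_leaves(value, f"{path}[{index}]")
    elif not (isinstance(doc, str) and doc.startswith(ENC_PREFIX)):
        yield path


# Hypothetical parsed secrets.yaml with one value accidentally left in plaintext.
example = {
    "wildcard_cert": "ENC[AES256_GCM,data:...,type:str]",
    "drone": {
        "rpc_secret": "ENC[AES256_GCM,data:...,type:str]",
        "client_id": "plain-oops",
    },
    "sops": {"age": [{"recipient": "age1..."}]},
}
print(list(plaintext_leaves(example)))  # ['drone.client_id']
```

An empty result is what the Adversary would expect to see for the real file; any path it returns is a candidate finding.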
3. Definition of Done
Terminates only when every item holds and the Adversary has independently re-verified each within
24h, from a cold start (logged in REVIEW.md):
- C1: Secrets-repo split. A separate private cc-ci-secrets repo holds only the sops-encrypted secrets (+ .sops.yaml), consumed by the base via a flake input. The base cc-ci stays one well-parameterized repo: instance vars (domain, gateway, recipients) remain changeable parameters in the base, not moved out (only secrets are external). nixosConfigurations.cc-ci still builds byte-identically to the running system.
- C2: Cert in git. The wildcard cert+key are sops secrets in cc-ci-secrets, decrypted at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- C3: All secrets in git (one exception). Every infra/app secret (cert, Drone OAuth/RPC, webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The only out-of-band secret is the bootstrap age key, documented precisely; nothing else.
- C4: Genuine throwaway-VM live rebuild. On a blank NixOS VM (Incus, terraform-ci), provisioned with only the bootstrap age key, the loops git clone base+secrets and run nixos-rebuild switch; the system activates and the reconcile oneshots converge (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with no manual step not in docs/install.md. The true proof is a clean-room repeat (C4 done right): the Adversary deletes any existing throwaway VM, creates a brand-new blank VM via Incus, and runs the entire install from scratch (clone base+secrets → provision age key → nixos-rebuild switch → everything comes up), proving reproducibility on a genuinely fresh machine, with no residue from the Builder's setup attempt masking a gap. Done cold by the Adversary, with logged evidence (VM id, the exact commands from docs/install.md, convergence + TLS-from-git-cert proof).
- C5: Honest D8. The D8 evidence is rewritten: byte-identical closure (static) plus the live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate), not a blanket "infeasible."
- C6: Resource fit + cleanup. cc-nix-test resized 6 GB→4 GB and the throwaway VM created at 4 GB, within the ~12 GB running-RAM guideline (cc-nix-test 4 + lichen-staging 4 + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project limit). The throwaway VM is destroyed after the test (no leftover). Final cc-nix-test sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM; record in DECISIONS.md).
- C7: Docs. docs/install.md, docs/secrets.md, architecture.md, and the main plan's cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+secrets + provision the age key + (external) DNS/gateway → one nixos-rebuild switch. A new engineer can stand up a fresh instance from the docs.
When C1–C7 hold and are Adversary-verified, write ## DONE to Phase-1c STATUS.md.
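The termination rule (every item holds, each re-verified by the Adversary within 24h) can be expressed as a small check. This is only a sketch under stated assumptions: the format of REVIEW.md is not specified here, so the input is a hypothetical in-memory mapping from criterion id to the timestamp of its latest cold Adversary verification, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

CRITERIA = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
MAX_VERIFICATION_AGE = timedelta(hours=24)


def unmet_criteria(
    verified_at: Dict[str, Optional[datetime]], now: datetime
) -> List[str]:
    """Return criteria with no Adversary verification inside the last 24h."""
    unmet = []
    for criterion in CRITERIA:
        stamp = verified_at.get(criterion)
        # Missing, stale, or future-dated entries all count as unverified.
        if stamp is None or stamp > now or now - stamp > MAX_VERIFICATION_AGE:
            unmet.append(criterion)
    return unmet


# Hypothetical log: everything verified two hours ago except a stale C4.
now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
log = {c: now - timedelta(hours=2) for c in CRITERIA}
log["C4"] = now - timedelta(hours=30)
print(unmet_criteria(log, now))  # ['C4']
```

Under this reading, ## DONE is written only when the returned list is empty; a stale C4 means the clean-room rebuild must be re-proven before the phase can close.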
4. Incus capability (granted for this phase only)
The loops normally only ssh cc-ci. For 1c they MAY drive Incus on b1 (resize cc-nix-test;
create/destroy ONE throwaway VM in terraform-ci), using the mTLS certs at
/srv/incus-terraform-nix-vm-creator/terraform-secrets/ (b1 is reachable directly from the VM —
direct tailnet peer, no proxy) — see the incus skill (/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md)
and cc-ci-vm-incus. Guardrails: only terraform-ci; keep total running RAM within the ~12 GB
guideline (doc-only — terraform-ci has no enforced limits.memory; b1 is 16 GB physical) — hence
cc-nix-test→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; destroy the throwaway VM when
done; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times
out — see memory).
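The RAM arithmetic behind this guardrail is simple enough to check before any VM is started. A minimal sketch, using the sizes stated above; because the guideline is doc-only, a check like this is the only thing that would enforce it, and the function name is hypothetical.

```python
from typing import Dict

RAM_GUIDELINE_GB = 12  # doc-only running-RAM guideline, not an Incus limit
B1_PHYSICAL_GB = 16    # physical RAM on b1


def check_ram_plan(running_gb: Dict[str, int]) -> int:
    """Return total running RAM in GB, or raise if it exceeds the guideline."""
    total = sum(running_gb.values())
    if total > RAM_GUIDELINE_GB:
        raise ValueError(
            f"{total} GB running exceeds the {RAM_GUIDELINE_GB} GB guideline "
            f"(b1 has {B1_PHYSICAL_GB} GB physical)"
        )
    return total


# After the resize: 4 + 4 + 4 = 12 GB, exactly at the guideline.
planned = {"cc-nix-test": 4, "lichen-staging": 4, "throwaway": 4}
print(check_ram_plan(planned))  # 12
```

With cc-nix-test still at 6 GB the same plan totals 14 GB and the check raises, which is why the resize has to happen before the throwaway VM is created.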
5. Method (ordered; each milestone ends with an Adversary gate)
- W1: Headroom. Resize cc-nix-test 6 GB→4 GB (stop→set→start) so a 4 GB throwaway VM fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). Accept: b1 has room; cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). (Final sizing decided in W6.)
- W2: Secrets repo + cert into git. Create the private cc-ci-secrets repo; move all secrets into sops there, including the wildcard cert+key (read from the current /var/lib/ci-certs/live) and the existing secrets/secrets.yaml contents; keep instance vars parameterized in the base; wire the base flake to consume cc-ci-secrets (flake input). Accept: nixos-rebuild build of the restructured config is byte-identical to the running system (zero drift), and cc-nix-test nixos-rebuild switches cleanly onto the new structure with TLS still served from the git cert.
- W3: Throwaway VM. Create a blank NixOS VM in terraform-ci (the incus-base image), sized 4 GB. Accept: VM reachable; bootstrap age key provisioned by the documented mechanism only.
- W4: Reproducible live rebuild. On the throwaway VM: clone base+secrets, nixos-rebuild switch, watch oneshots converge, secrets+cert decrypt. Accept: system fully up with no step outside docs/install.md; capture evidence.
- W5: Adversary clean-room proof + honest D8. The Adversary deletes the Builder's throwaway VM, creates a brand-new blank VM, and runs the full install from scratch per docs/install.md (clone base+secrets → provision age key → nixos-rebuild switch → all up) on a genuinely fresh machine, no residue. Then rewrites the D8 evidence (static byte-identical + this live clean-room rebuild), removing "infeasible by design." Accept: Adversary logs a real D8 live-rebuild PASS on a freshly-created VM (or a narrow, signed-off limitation per §3 C5).
- W6: Cleanup + docs + final sizing. Destroy the throwaway VM; update all docs (C7); decide and apply final cc-nix-test sizing. Accept: no leftover VM/secret leak; docs match; flip Phase-1c STATUS.md to ## DONE.
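The stop→set→start sequence from W1 can be laid out as a dry run before anything touches b1. This sketch only builds the argument lists and runs nothing. It assumes cc-nix-test lives in the terraform-ci project, that the Incus client is already configured to reach b1 (a remote prefix may be needed in practice), and that "4GiB" is the intended unit for the 4 GB figure; resize_commands is a hypothetical helper.

```python
from typing import List

PROJECT = "terraform-ci"  # the only project the loops may touch (§4)


def resize_commands(instance: str, memory: str) -> List[List[str]]:
    """Build the stop -> set -> start argv sequence for a memory resize."""
    base = ["incus", "--project", PROJECT]
    return [
        base + ["stop", instance],
        # Set while stopped: the plan notes that live memory hotplug times out.
        base + ["config", "set", instance, f"limits.memory={memory}"],
        base + ["start", instance],
    ]


for argv in resize_commands("cc-nix-test", "4GiB"):
    print(" ".join(argv))
```

Keeping the sequence as data makes it easy to log the exact commands as evidence and to review them against the guardrails before execution.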
6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)
- Don't fake the rebuild. "Infeasible/can't reproduce" is allowed only for a true, narrowly-scoped blocker with the maximal tested subset and Adversary sign-off — the host-key and DNS/cert reasons are explicitly not valid (the recovery key + the cert-in-git fix remove them).
- Exactly one out-of-band secret. The bootstrap age key. Everything else in git, encrypted. If the loops find another secret that "has to" be out-of-band, that's a finding to design away, not accept.
- Gandi token stays out of repo/agent — only the cert artifact is committed (encrypted). Renewal is operator-issues-then-commits.
- No plaintext secret leaks into the base (or the store). Instance vars (domain, gateway) may live in the base as parameters, and that's fine; what must NOT leak is any secret (cert/keys/tokens): those stay encrypted in cc-ci-secrets. The Adversary greps the base + the Nix store for plaintext secret material.
- Incus guardrails (§4): terraform-ci only, respect the ~12 GB running-RAM guideline, destroy the throwaway VM, don't touch other instances.
- No weakened tests / no drift — the restructured config must remain byte-identical to running (zero drift) and all of D1–D10 must still hold after the refactor.
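The Adversary's grep for plaintext secret material could look roughly like the following. It is a sketch under assumptions, not the project's actual check: it only recognises secrets with a well-known textual form (PEM private-key headers and age identities), so values such as the webhook HMAC or OAuth secrets would need a separate search for their known decrypted values. It also reads whole files into memory, which is fine for a repo checkout but would need streaming for a full Nix store scan.

```python
import re
from pathlib import Path
from typing import List, Tuple

SECRET_PATTERNS = [
    # PEM private keys, e.g. "BEGIN PRIVATE KEY", "BEGIN RSA PRIVATE KEY"
    re.compile(rb"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    # age identities (the private half of an age key pair)
    re.compile(rb"AGE-SECRET-KEY-1[0-9A-Z]+"),
]


def find_plaintext_secrets(root: Path) -> List[Tuple[Path, str]]:
    """Return (file, matched marker) for every file containing a secret marker."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skips directories and broken symlinks
        data = path.read_bytes()
        for pattern in SECRET_PATTERNS:
            match = pattern.search(data)
            if match:
                hits.append((path, match.group(0).decode("ascii")))
    return hits
```

Run against the base checkout (and, with care, the relevant store paths), an empty list supports the guardrail; a sops-encrypted cert does not match because its PEM header is inside the ciphertext.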
7. Open decisions (log in DECISIONS.md)
- Flake input vs git submodule for the cc-ci-secrets repo (default: flake input).
- Bootstrap-key provisioning for a new VM: provision the off-box recovery age key to the host (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is simpler for a clone; per-host re-encrypt is cleaner long-term.)
- Final cc-nix-test sizing after the test: restore to 6 GB, or promote the freshly-rebuilt reproducible VM to be the canonical cc-ci and retire the old one.
- DNS/gateway as IaC (terraform for the Gandi records + the gateway): likely a separate future item (IDEAS), out of 1c scope; 1c keeps them as documented external preconditions.
- How the loops fetch the private cc-ci-secrets repo during rebuild (bot token-in-URL vs a read deploy key): it's private under recipe-maintainers (bot is admin).