cc-ci-orchestrator/cc-ci-plan/plan-phase1c-full-reproducibility.md
autonomic-bot 01874821f2 decommission Pi: update all docs for VM-only setup
The orchestrator Pi is retired (2026-05-31). All agents now run on the
cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a
direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled,
no ProxyCommand. Updated across all affected files:

AGENTS.md
  - Remove Pi from reboot description; migration complete (not "parked")
  - cc-ci access: direct ssh, not via proxy

kickoff.md
  - Prerequisites: direct tailnet peer, not proxy
  - Host deps: NixOS (not apt)
  - Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
  - §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
  - §1.5 intro: "VM" not "sandbox host"; no proxy
  - Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
  - Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
  - Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
  - Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
  - cc-ci access language only: direct ssh, no proxy restart instructions
  - adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
  - Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
  - User/Group/HOME/PATH: notplants → loops
  - Remove cc-ci-tailscaled.service dependency (no proxy on VM)
  - Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
  - tailscale status: no --socket flag
  - ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 00:16:37 +00:00

# cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)
**Status:** QUEUED — runs after Phase 1 (`plan.md`) and **before Phase 1b** (review/lint), so the
review/lint pass covers this refactor and its final cold re-verification proves the genuine
(post-1c) D8. **Manual** transition. **Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7) —
the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`

---

## 0. Why this phase
Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design
(sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:
- **sops host-key binding** is defeated by the project's **own master recovery age key**
(`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*) —
a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker — it only gates the full
end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete
(D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and
refused to self-certify; the bar then slipped to "infeasible."
This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
all secrets** (move the wildcard cert and everything else into sops-in-git, in a separate private
`cc-ci-secrets` repo), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
this adds the *live* half it was missing.

---

## 1. Mission
A blank NixOS VM, given only **(1)** the two git repos (base `cc-ci` + private `cc-ci-secrets`), **(2)** the
single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.

---

## 2. The reproducibility model (target architecture)
**Split only the *secrets* into their own repo. The base stays one well-parameterized repo.** The
boundary is *secrecy*, not modularity: secrets get a separate private repo (an extra access-control
layer); instance-specific **non-secret** vars (domain, gateway/DNS facts) stay in the base as plain,
changeable parameters — another admin can repoint cc-ci by editing them, no second config repo needed.
- **`cc-ci` (base — one repo, well-parameterized):** `flake.nix`, modules, `runner/`, `tests/`,
`docs/`, **and** the instance config as parameters/defaults (e.g. `DOMAIN = "ci.commoninternet.net"`,
gateway/DNS facts, sops recipients). Instance *config* lives here; only *secret material* is external.
Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
- **`cc-ci-secrets` (private, `recipe-maintainers/cc-ci-secrets`):** holds **only** the sops-encrypted
secret material — `secrets/secrets.yaml` (**wildcard cert + key**, Drone OAuth client_id/secret +
RPC secret, webhook HMAC, registry creds if any, app/infra secrets) + `.sops.yaml`. **No code, no
config logic** — just encrypted secrets, as a separate security layer with its own access control.
- **Linkage (default = flake input):** base `flake.nix` has
`inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets"`
(private; fetched via the bot token / a read deploy key); sops-nix reads `secrets/secrets.yaml` from
it. *Alternative:* a git **submodule** at `cc-ci/secrets/`. Record the choice in `DECISIONS.md`;
flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile` → the `cc-ci-secrets` `secrets.yaml`;
`sops.age.sshKeyPaths` = the host SSH key, plus `sops.age.keyFile` = the recovery age key where it is provisioned. The **wildcard cert/key are sops
secrets** decrypted at activation to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed
into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the secrets (the
host key, or the provisioned recovery key) — it cannot live in the repo it decrypts. This is the
*only* permitted "not in git" secret, provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
(network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted, in `cc-ci-secrets`);
the **Gandi DNS token never enters any repo or the agent**. Renewal = operator re-issues the cert
out-of-band, then commits the new sops-encrypted cert to `cc-ci-secrets` (versioned, reproducible).
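
The "only sops-encrypted material" rule for `cc-ci-secrets` can be spot-checked mechanically. The following is a minimal sketch, not the project's actual check: the function name `looks_sops_encrypted` is hypothetical, and it assumes the usual sops YAML layout (a top-level `sops:` metadata key and `ENC[AES256_GCM,...]` values). The armored `-----BEGIN AGE ENCRYPTED FILE-----` block that sops stores per age recipient is expected and is deliberately not treated as a leak.

```shell
# looks_sops_encrypted FILE
# Succeeds only if FILE has sops metadata, at least one encrypted value,
# and no cleartext PEM certificate or private-key header.
# (Hypothetical helper; the name and the heuristics are assumptions.)
looks_sops_encrypted() {
  file=$1
  # sops writes its metadata under a top-level `sops:` key.
  grep -q '^sops:' "$file" || return 1
  # Encrypted scalar values are wrapped as ENC[AES256_GCM,...].
  grep -q 'ENC\[AES256_GCM,' "$file" || return 1
  # A cleartext cert or key header means something was committed unencrypted.
  if grep -qE -e '-----BEGIN ([A-Z]+ )?(PRIVATE KEY|CERTIFICATE)-----' "$file"; then
    return 1
  fi
  return 0
}

# Example use from a checkout of cc-ci-secrets:
#   looks_sops_encrypted secrets/secrets.yaml || echo "secrets.yaml is not fully encrypted"
```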

---

## 3. Definition of Done
Terminates only when every item holds **and the Adversary has independently re-verified each within
24h, from a cold start** (logged in `REVIEW.md`):
- [ ] **C1 — Secrets-repo split.** A separate private `cc-ci-secrets` repo holds **only** the
sops-encrypted secrets (+ `.sops.yaml`), consumed by the base via a flake input. The base
`cc-ci` stays one well-parameterized repo — instance vars (domain, gateway, recipients) remain
changeable parameters in the base, **not** moved out (only secrets are external).
`nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in `cc-ci-secrets`, decrypted
at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only**
out-of-band secret is the bootstrap age key — documented precisely, nothing else.
- [ ] **C4 — Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`),
provisioned with *only* the bootstrap age key, the loops `git clone` base+secrets and run
`nixos-rebuild switch`; the system activates and the reconcile oneshots converge
(swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step
not in `docs/install.md`**.
**The true proof is a clean-room repeat (C4 done right):** the Adversary **deletes** any
existing throwaway VM, **creates a brand-new blank VM via Incus**, and runs the *entire* install
from scratch (clone base+secrets → provision age key → `nixos-rebuild switch` → everything comes
up) — proving reproducibility on a genuinely fresh machine, with **no residue** from the
Builder's setup attempt masking a gap. Done **cold** by the Adversary, with logged evidence
(VM id, the exact commands from `docs/install.md`, convergence + TLS-from-git-cert proof).
- [ ] **C5 — Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the
live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any
single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off
limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a
blanket "infeasible."
- [ ] **C6 — Resource fit + cleanup.** `cc-nix-test` resized **6 GB→4 GB** and the throwaway VM
created at **4 GB**, within the **~12 GB running-RAM guideline** (cc-nix-test 4 + lichen-staging 4
+ throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project
limit). The throwaway VM is **destroyed** after the test (no leftover). Final `cc-nix-test`
sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM — record in
`DECISIONS.md`).
- [ ] **C7 — Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's
cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+secrets + provision
the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a
fresh instance from the docs.
When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.

---

## 4. Incus capability (granted for this phase only)
The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`;
create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at
`/srv/incus-terraform-nix-vm-creator/terraform-secrets/` (b1 is reachable directly from the VM —
direct tailnet peer, no proxy) — see the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`)
and [[cc-ci-vm-incus]]. Guardrails: only `terraform-ci`; keep total running RAM within the **~12 GB
guideline** (doc-only — terraform-ci has no enforced `limits.memory`; b1 is 16 GB physical) — hence
`cc-nix-test`→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; **destroy the throwaway VM when
done**; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times
out — see memory).
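
Because the ~12 GB guideline is not enforced by Incus, the arithmetic is worth checking before any resize or VM creation. The sketch below only restates the sum from this section (4 + 4 + 4 = 12); the function names are hypothetical and the sizes are passed in by hand rather than read from Incus.

```shell
# Advisory check of planned VM sizes against the doc-only RAM guideline.
GUIDELINE_GB=12   # running-RAM guideline stated in this plan

# ram_total_gb SIZE... : print the sum of the given per-VM sizes (whole GB).
ram_total_gb() {
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  echo "$total"
}

# ram_fits SIZE... : succeed if the sum stays within the guideline.
ram_fits() {
  [ "$(ram_total_gb "$@")" -le "$GUIDELINE_GB" ]
}

# Example: cc-nix-test 4 + lichen-staging 4 + throwaway 4
#   ram_fits 4 4 4 && echo "within guideline"
# Leaving cc-nix-test at 6 GB would not fit:
#   ram_fits 6 4 4 || echo "over guideline"
```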

---

## 5. Method (ordered; each milestone ends with an Adversary gate)
1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM
fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room;
cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
2. **W2 — Secrets repo + cert into git.** Create the private `cc-ci-secrets` repo; move **all secrets**
into sops there — including the **wildcard cert+key** (read from the current `/var/lib/ci-certs/live`)
and the existing `secrets/secrets.yaml` contents; keep instance vars parameterized in the base;
wire the base flake to consume `cc-ci-secrets` (flake input). *Accept:* `nixos-rebuild build` of the
restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
`nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
**4 GB**. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4 — Reproducible live rebuild.** On the throwaway VM: clone base+secrets, `nixos-rebuild
switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step
outside `docs/install.md`**; capture evidence.
5. **W5 — Adversary clean-room proof + honest D8.** The Adversary **deletes** the Builder's throwaway
VM, **creates a brand-new blank VM**, and runs the full install from scratch per `docs/install.md`
(clone base+secrets → provision age key → `nixos-rebuild switch` → all up) — a genuinely fresh
machine, no residue. Then rewrites the D8 evidence (static byte-identical + this live clean-room
rebuild), removing "infeasible by design." *Accept:* Adversary logs a real D8 live-rebuild PASS on
a freshly-created VM (or a narrow, signed-off limitation per §3 C5).
6. **W6 — Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and
apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c
`STATUS.md` to `## DONE`.
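
The W4/W5 sequence (clone base+secrets, provision the age key, `nixos-rebuild switch`) could be scripted roughly as below. This is a sketch under stated assumptions, not the content of `docs/install.md`: the base repo URL is left as a placeholder, the secrets URL is the flake-input host and path rewritten as a plain HTTPS clone URL, the key source and destination paths and `WORKDIR` are hypothetical, and the flake attribute `cc-ci` is taken from `nixosConfigurations.cc-ci`. With the flake-input linkage the secrets repo may be fetched implicitly by Nix, in which case the second clone is only for inspection. The script defaults to a dry run that prints the commands without executing them.

```shell
# All values below are placeholders/assumptions; override via environment.
BASE_URL=${BASE_URL:-"<base-repo-url>"}
SECRETS_URL=${SECRETS_URL:-"https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets"}
WORKDIR=${WORKDIR:-/root}
AGE_KEY_SRC=${AGE_KEY_SRC:-./bootstrap-age-key.txt}       # hypothetical
AGE_KEY_DEST=${AGE_KEY_DEST:-/var/lib/sops-nix/key.txt}   # hypothetical
DRY_RUN=${DRY_RUN:-1}

# run CMD... : print the command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# live_rebuild : the four steps in order, stopping at the first failure.
live_rebuild() {
  run git clone "$BASE_URL" "$WORKDIR/cc-ci" &&
  run git clone "$SECRETS_URL" "$WORKDIR/cc-ci-secrets" &&
  run install -m 0600 "$AGE_KEY_SRC" "$AGE_KEY_DEST" &&
  run nixos-rebuild switch --flake "$WORKDIR/cc-ci#cc-ci"
}

# Example: print the planned commands, then run for real on the throwaway VM.
#   live_rebuild
#   DRY_RUN=0 live_rebuild
```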

---

## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)
- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true,
narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off** — the host-key
and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If
the loops find another secret that "has to" be out-of-band, that's a finding to design away, not
accept.
- **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
is operator-issues-then-commits.
- **No plaintext secret leaks into the base (or the store).** Instance *vars* (domain, gateway) may
live in the base as parameters — that's fine; what must NOT leak is any *secret* (cert/keys/tokens):
those stay encrypted in `cc-ci-secrets`. The Adversary greps the base + the Nix store for plaintext
secret material.
- **Incus guardrails** (§4): terraform-ci only, respect the ~12 GB RAM guideline, destroy the throwaway VM, don't
touch other instances.
- **No weakened tests / no drift** — the restructured config must remain byte-identical to running
(zero drift) and all of D1–D10 must still hold after the refactor.
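
The Adversary's grep for plaintext secret material might look like the sketch below. It is an assumption about how such a scan could be done, not the Adversary's actual procedure: the function name is hypothetical, and it only looks for two unambiguous markers (PEM private-key headers and age secret keys), so a clean result does not prove that tokens or passwords are absent.

```shell
# scan_plaintext_secrets DIR
# Prints every file under DIR containing a plaintext key marker.
# Exit status: 0 = no hits, 1 = at least one hit.
# (Hypothetical helper; the marker list is deliberately narrow.)
scan_plaintext_secrets() {
  dir=$1
  # PEM private keys (RSA/EC/OPENSSH/PKCS#8) and age identities.
  # `|| true` keeps grep's "no match" status from aborting under `set -e`.
  hits=$(grep -rlE -e '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----|AGE-SECRET-KEY-1' "$dir" 2>/dev/null || true)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}

# Example: scan a checkout of the base repo, then the Nix store.
#   scan_plaintext_secrets ./cc-ci
#   scan_plaintext_secrets /nix/store
```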

---

## 7. Open decisions (log in DECISIONS.md)
- **Flake input vs git submodule** for the `cc-ci-secrets` repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
(decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is
simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: restore to 6 GB, or **promote the freshly-rebuilt
reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- **Secrets-repo fetch during rebuild:** how the loops fetch the private `cc-ci-secrets` repo (bot
token-in-URL vs a read deploy key). The repo is private under `recipe-maintainers` (bot is admin).