The orchestrator Pi is retired (2026-05-31). All agents now run on the cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled, no ProxyCommand.

Updated across all affected files:

AGENTS.md
- Remove Pi from reboot description; migration complete (not "parked")
- cc-ci access: direct ssh, not via proxy

kickoff.md
- Prerequisites: direct tailnet peer, not proxy
- Host deps: NixOS (not apt)
- Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
- §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
- §1.5 intro: "VM" not "sandbox host"; no proxy
- Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
- Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
- Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
- Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
- cc-ci access language only: direct ssh, no proxy restart instructions
- adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
- Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
- User/Group/HOME/PATH: notplants → loops
- Remove cc-ci-tailscaled.service dependency (no proxy on VM)
- Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
- tailscale status: no --socket flag
- ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# cc-ci Phase 1c — Full git reproducibility + genuine D8 live rebuild (Autonomous Build Plan)

**Status:** QUEUED — runs after Phase 1 (`plan.md`) and **before Phase 1b** (review/lint), so the
review/lint pass covers this refactor and its final cold re-verification proves the genuine
(post-1c) D8. **Manual** transition. **Driven by the Builder + Adversary loops** (same protocol as `plan.md` §6/§6.1/§7) —
the orchestrator does NOT do this; the loops do, and the Adversary independently re-proves it cold.

**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`

---

## 0. Why this phase

Phase-1 D8 was marked PASS with the throwaway-VM **live rebuild "documented infeasible by design
(sops host-key binding + operator DNS/cert)."** That justification doesn't hold up:

- **sops host-key binding** is defeated by the project's **own master recovery age key**
  (`/srv/cc-ci/.sops/master-age.txt`, a sops recipient created *"for re-keying if cc-ci is lost"*) —
  a fresh host can decrypt the repo's secrets with it. So a new-host rebuild is *not* infeasible.
- **operator DNS/cert** is a *precondition*, not a rebuild blocker — it only gates the full
  end-to-end HTTPS path, not "a blank host + the repo boots into the declared system."
- Incus is available, and the rate-limit premise that originally deferred the test was obsolete
  (D10 passed without registry creds). The Builder itself flagged the rebuild as *feasible now* and
  refused to self-certify; the bar then slipped to "infeasible."
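
The first point can be checked directly. What follows is a minimal sketch, assuming `sops` is installed and the command runs from a checkout that contains `secrets/secrets.yaml`. The helper name `can_decrypt_with_key` is hypothetical; `SOPS_AGE_KEY_FILE` is the environment variable sops reads for an age identity file.

```shell
# Hypothetical helper: exit 0 only if the given age key file can decrypt the sops file.
# SOPS_AGE_KEY_FILE points sops at an age identity; plaintext is discarded, never printed.
can_decrypt_with_key() {
  key_file="$1"
  sops_file="$2"
  SOPS_AGE_KEY_FILE="$key_file" sops --decrypt "$sops_file" >/dev/null 2>&1
}

# Example (key path from this plan; the working directory is an assumption):
#   can_decrypt_with_key /srv/cc-ci/.sops/master-age.txt secrets/secrets.yaml \
#     && echo "recovery key decrypts the repo secrets"
```

If this succeeds on a machine that does not hold the cc-ci host key, the "host-key binding" objection is empirically refuted for that secrets file.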

This phase does two connected things: **(A)** make the VM **fully reproducible from git, including
all secrets** (move the wildcard cert and everything else into sops-in-git, in a separate private
`cc-ci-secrets` repo), and **(B)** actually perform and verify the **throwaway-VM live rebuild**, closing
D8 honestly. The byte-identical-closure evidence from Phase 1 stays valid as the *static* half of D8;
this adds the *live* half it was missing.

---

## 1. Mission

A blank NixOS VM, given only **(1)** the two git repos (base `cc-ci` + private `cc-ci-secrets`), **(2)** the
single bootstrap age key, and **(3)** the external DNS/gateway already pointing at it, becomes a
working cc-ci via **`nixos-rebuild switch`** with **no undocumented manual steps** — secrets and the
wildcard cert included, decrypted from git. Proven on a real throwaway VM by the loops.

---

## 2. The reproducibility model (target architecture)

**Split only the *secrets* into their own repo. The base stays one well-parameterized repo.** The
boundary is *secrecy*, not modularity: secrets get a separate private repo (an extra access-control
layer); instance-specific **non-secret** vars (domain, gateway/DNS facts) stay in the base as plain,
changeable parameters — another admin can repoint cc-ci by editing them, no second config repo needed.

- **`cc-ci` (base — one repo, well-parameterized):** `flake.nix`, modules, `runner/`, `tests/`,
  `docs/`, **and** the instance config as parameters/defaults (e.g. `DOMAIN = "ci.commoninternet.net"`,
  gateway/DNS facts, sops recipients). Instance *config* lives here; only *secret material* is external.
  Keep it well-parameterized so changing the domain/recipients is a one-line edit, not a fork.
- **`cc-ci-secrets` (private, `recipe-maintainers/cc-ci-secrets`):** holds **only** the sops-encrypted
  secret material — `secrets/secrets.yaml` (**wildcard cert + key**, Drone OAuth client_id/secret +
  RPC secret, webhook HMAC, registry creds if any, app/infra secrets) + `.sops.yaml`. **No code, no
  config logic** — just encrypted secrets, as a separate security layer with its own access control.
- **Linkage (default = flake input):** base `flake.nix` has
  `inputs.secrets.url = "git+https://git.autonomic.zone/recipe-maintainers/cc-ci-secrets"`
  (private; fetched via the bot token / a read deploy key); sops-nix reads `secrets/secrets.yaml` from
  it. *Alternative:* a git **submodule** at `cc-ci/secrets/`. Record the choice in `DECISIONS.md`;
  flake input is the recommended default.
- **sops-nix wiring:** `sops.defaultSopsFile` → the `cc-ci-secrets` `secrets.yaml`;
  `sops.age.sshKeyPaths` = host key + the recovery recipient. The **wildcard cert/key are sops
  secrets** decrypted at activation to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` and fed
  into the Traefik recipe's `ssl_cert`/`ssl_key` swarm secrets — **no out-of-band cert file.**
- **The one irreducible out-of-band secret:** the **age private key** that unlocks the secrets (the
  host key, or the provisioned recovery key) — it cannot live in the repo it decrypts. This is the
  *only* permitted "not in git" secret, provisioned to the host at creation.
- **Still external (not the VM's git, by nature):** the DNS records + the TLS-passthrough gateway
  (network infra) — documented as preconditions. (IaC for those is out of scope — see §7.)
- **Token discipline preserved:** only the cert *artifact* enters git (encrypted, in `cc-ci-secrets`);
  the **Gandi DNS token never enters any repo or the agent**. Renewal = operator re-issues the cert
  out-of-band, then commits the new sops-encrypted cert to `cc-ci-secrets` (versioned, reproducible).

---

## 3. Definition of Done

Terminates only when every item holds **and the Adversary has independently re-verified each within
24h, from a cold start** (logged in `REVIEW.md`):

- [ ] **C1 — Secrets-repo split.** A separate private `cc-ci-secrets` repo holds **only** the
  sops-encrypted secrets (+ `.sops.yaml`), consumed by the base via a flake input. The base
  `cc-ci` stays one well-parameterized repo — instance vars (domain, gateway, recipients) remain
  changeable parameters in the base, **not** moved out (only secrets are external).
  `nixosConfigurations.cc-ci` still builds byte-identically to the running system.
- [ ] **C2 — Cert in git.** The wildcard cert+key are sops secrets in `cc-ci-secrets`, decrypted
  at activation to the cert path + Traefik secret; the prior "operator drops a cert file" step is
  gone. Verified: a rebuild serves valid TLS from the git-sourced cert.
- [ ] **C3 — All secrets in git (one exception).** Every infra/app secret (cert, Drone OAuth/RPC,
  webhook HMAC, registry creds, host age recipients) is sops-encrypted in git. The **only**
  out-of-band secret is the bootstrap age key — documented precisely, nothing else.
- [ ] **C4 — Genuine throwaway-VM live rebuild.** On a blank NixOS VM (Incus, `terraform-ci`),
  provisioned with *only* the bootstrap age key, the loops `git clone` base+secrets and run
  `nixos-rebuild switch`; the system activates and the reconcile oneshots converge
  (swarm/proxy/drone/bridge/dashboard), all secrets incl. the cert decrypt, with **no manual step
  not in `docs/install.md`**.
  **The true proof is a clean-room repeat (C4 done right):** the Adversary **deletes** any
  existing throwaway VM, **creates a brand-new blank VM via Incus**, and runs the *entire* install
  from scratch (clone base+secrets → provision age key → `nixos-rebuild switch` → everything comes
  up) — proving reproducibility on a genuinely fresh machine, with **no residue** from the
  Builder's setup attempt masking a gap. Done **cold** by the Adversary, with logged evidence
  (VM id, the exact commands from `docs/install.md`, convergence + TLS-from-git-cert proof).
- [ ] **C5 — Honest D8.** The D8 evidence is rewritten: byte-identical closure (static) **plus** the
  live throwaway-VM rebuild (dynamic). The "infeasible by design" wording is removed. If any
  single aspect genuinely can't be reproduced, it is a narrowly-scoped, Adversary-signed-off
  limitation with the maximal tested subset (bar per Phase-1b §7.1 / Adversary mandate) — not a
  blanket "infeasible."
- [ ] **C6 — Resource fit + cleanup.** `cc-nix-test` resized **6 GB→4 GB** and the throwaway VM
  created at **4 GB**, within the **~12 GB running-RAM guideline** (cc-nix-test 4 + lichen-staging 4
  + throwaway 4 = 12 ≤ 16 GB physical on b1; the guideline is doc-only, not an enforced project
  limit). The throwaway VM is **destroyed** after the test (no leftover). Final `cc-nix-test`
  sizing decided and applied (keep 4 GB, restore to 6 GB, or promote the rebuilt VM — record in
  `DECISIONS.md`).
- [ ] **C7 — Docs.** `docs/install.md`, `docs/secrets.md`, `architecture.md`, and the main plan's
  cert/secret references (§1.5/§4.0/§4.4) updated to the new model: clone base+secrets + provision
  the age key + (external) DNS/gateway → one `nixos-rebuild switch`. A new engineer can stand up a
  fresh instance from the docs.

When C1–C7 hold and are Adversary-verified, write `## DONE` to Phase-1c `STATUS.md`.
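
One way to produce the "TLS-from-git-cert" evidence for C2 and C4 is to compare the fingerprint of the served leaf certificate with the one decrypted to the cert path. This is a sketch only, assuming `openssl` is available and that `fullchain.pem` starts with the leaf certificate; the function names are hypothetical and the hostname to probe is left to the caller.

```shell
# Print the SHA-256 fingerprint of the first certificate in a PEM file.
cert_fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256
}

# Print the SHA-256 fingerprint of the leaf certificate a host serves on port 443.
served_fingerprint() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256
}

# Exit 0 only if the served leaf certificate equals the one decrypted from git.
# Usage: served_cert_is_git_cert <hostname> [path-to-fullchain.pem]
served_cert_is_git_cert() {
  host="$1"
  pem="${2:-/var/lib/ci-certs/live/fullchain.pem}"
  file_fp="$(cert_fingerprint "$pem")" || return 1
  live_fp="$(served_fingerprint "$host")" || return 1
  [ -n "$file_fp" ] && [ "$file_fp" = "$live_fp" ]
}
```

A matching fingerprint shows which certificate is being served; it does not by itself prove chain validity, so a plain `curl` against the same hostname remains a useful second check.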

---

## 4. Incus capability (granted for this phase only)

The loops normally only `ssh cc-ci`. For 1c they MAY drive Incus on **b1** (resize `cc-nix-test`;
create/destroy ONE throwaway VM in `terraform-ci`), using the mTLS certs at
`/srv/incus-terraform-nix-vm-creator/terraform-secrets/` (b1 is reachable directly from the VM —
direct tailnet peer, no proxy) — see the incus skill (`/srv/incus-terraform-nix-vm-creator/skills/incus-terraform/SKILL.md`)
and [[cc-ci-vm-incus]]. Guardrails: only `terraform-ci`; keep total running RAM within the **~12 GB
guideline** (doc-only — terraform-ci has no enforced `limits.memory`; b1 is 16 GB physical) — hence
`cc-nix-test`→4 GB + throwaway 4 GB + lichen-staging 4 GB = 12 GB; **destroy the throwaway VM when
done**; never touch other projects/instances; live-memory changes need stop→set→start (hotplug times
out — see memory).
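
The RAM arithmetic and the stop→set→start sequence can be sketched as below. This assumes the `incus` client's current remote already points at b1 (the remote setup with the mTLS certs is not shown) and that sizes are whole numbers passed to Incus as GiB; all function names are hypothetical.

```shell
# Sum a list of per-VM RAM sizes given in whole GB.
ram_total_gb() {
  total=0
  for gb in "$@"; do
    total=$((total + gb))
  done
  echo "$total"
}

# Exit 0 if the VMs fit the guideline: fits_guideline <limit_gb> <vm_gb>...
fits_guideline() {
  limit="$1"
  shift
  [ "$(ram_total_gb "$@")" -le "$limit" ]
}

# Resize a VM's memory via stop -> set -> start (live hotplug times out).
# Assumption: the incus client's current remote is b1; project is terraform-ci only.
resize_vm_memory() {
  vm="$1"
  gb="$2"
  incus stop "$vm" --project terraform-ci &&
    incus config set "$vm" limits.memory="${gb}GiB" --project terraform-ci &&
    incus start "$vm" --project terraform-ci
}

# Example: check 4 + 4 + 4 against the 12 GB guideline before resizing.
#   fits_guideline 12 4 4 4 && resize_vm_memory cc-nix-test 4
```

Because the guideline is not enforced by Incus, a check like `fits_guideline` is the only thing standing between a mistyped size and an overcommitted b1.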

---

## 5. Method (ordered; each milestone ends with an Adversary gate)

1. **W1 — Headroom.** Resize `cc-nix-test` 6 GB→**4 GB** (stop→set→start) so a **4 GB** throwaway VM
   fits within the ~12 GB running guideline (4 + lichen 4 + throwaway 4). *Accept:* b1 has room;
   cc-nix-test healthy at 4 GB (avoid heavy recipe CI during 1c). *(Final sizing decided in W6.)*
2. **W2 — Secrets repo + cert into git.** Create the private `cc-ci-secrets` repo; move **all secrets**
   into sops there — including the **wildcard cert+key** (read from the current `/var/lib/ci-certs/live`)
   and the existing `secrets/secrets.yaml` contents; keep instance vars parameterized in the base;
   wire the base flake to consume `cc-ci-secrets` (flake input). *Accept:* `nixos-rebuild build` of the
   restructured config is **byte-identical** to the running system (zero drift), and `cc-nix-test`
   `nixos-rebuild switch`es cleanly onto the new structure with TLS still served from the git cert.
3. **W3 — Throwaway VM.** Create a blank NixOS VM in `terraform-ci` (the incus-base image), sized
   **4 GB**. *Accept:* VM reachable; bootstrap age key provisioned by the documented mechanism only.
4. **W4 — Reproducible live rebuild.** On the throwaway VM: clone base+secrets, `nixos-rebuild
   switch`, watch oneshots converge, secrets+cert decrypt. *Accept:* system fully up with **no step
   outside `docs/install.md`**; capture evidence.
5. **W5 — Adversary clean-room proof + honest D8.** The Adversary **deletes** the Builder's throwaway
   VM, **creates a brand-new blank VM**, and runs the full install from scratch per `docs/install.md`
   (clone base+secrets → provision age key → `nixos-rebuild switch` → all up) — a genuinely fresh
   machine, no residue. Then rewrites the D8 evidence (static byte-identical + this live clean-room
   rebuild), removing "infeasible by design." *Accept:* Adversary logs a real D8 live-rebuild PASS on
   a freshly-created VM (or a narrow, signed-off limitation per §3 C5).
6. **W6 — Cleanup + docs + final sizing.** Destroy the throwaway VM; update all docs (C7); decide and
   apply final `cc-nix-test` sizing. *Accept:* no leftover VM/secret leak; docs match; flip Phase-1c
   `STATUS.md` to `## DONE`.
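
The install sequence exercised in W4 and W5 can be sketched as a single function with a dry-run mode, so the exact commands can be logged as evidence before anything runs. This is illustrative, not the content of `docs/install.md`: the repo URLs are parameters, the key destination `/var/lib/sops-nix/key.txt` is an assumption about how sops-nix is configured, and the flake attribute `cc-ci` is taken from `nixosConfigurations.cc-ci` in §3.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# install_cc_ci <base_url> <secrets_url> <age_key_src> <workdir>
# Hypothetical wrapper: clone base+secrets, provision the age key, rebuild.
install_cc_ci() {
  base_url="$1"
  secrets_url="$2"
  age_key_src="$3"
  workdir="$4"
  run git clone "$base_url" "$workdir/cc-ci" || return 1
  run git clone "$secrets_url" "$workdir/cc-ci-secrets" || return 1
  # Assumed key location; the real path is whatever the sops-nix config declares.
  run install -D -m 0600 "$age_key_src" /var/lib/sops-nix/key.txt || return 1
  run nixos-rebuild switch --flake "$workdir/cc-ci#cc-ci"
}
```

Running it first with `DRY_RUN=1` yields a four-line command list that can be diffed against `docs/install.md`; any command present in one but not the other is a candidate "undocumented manual step."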

---

## 6. Guardrails (inherit Phase 1 §9 + Phase 1b §7.1 / Adversary mandate)

- **Don't fake the rebuild.** "Infeasible/can't reproduce" is allowed only for a true,
  narrowly-scoped blocker with the maximal tested subset and **Adversary sign-off** — the host-key
  and DNS/cert reasons are explicitly *not* valid (the recovery key + the cert-in-git fix remove them).
- **Exactly one out-of-band secret.** The bootstrap age key. Everything else in git, encrypted. If
  the loops find another secret that "has to" be out-of-band, that's a finding to design away, not
  accept.
- **Gandi token stays out of repo/agent** — only the cert artifact is committed (encrypted). Renewal
  is operator-issues-then-commits.
- **No plaintext secret leaks into the base (or the store).** Instance *vars* (domain, gateway) may
  live in the base as parameters — that's fine; what must NOT leak is any *secret* (cert/keys/tokens):
  those stay encrypted in `cc-ci-secrets`. The Adversary greps the base + the Nix store for plaintext
  secret material.
- **Incus guardrails** (§4): terraform-ci only, respect the ~12 GB RAM guideline, destroy the
  throwaway VM, don't touch other instances.
- **No weakened tests / no drift** — the restructured config must remain byte-identical to running
  (zero drift) and all of D1–D10 must still hold after the refactor.
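
The Adversary's plaintext grep could look like the sketch below. It only catches two well-known markers, PEM private-key headers and age secret keys (which start with `AGE-SECRET-KEY-1`); tokens such as the Drone OAuth secret or webhook HMAC have no fixed prefix and would need their own patterns. The function name is hypothetical.

```shell
# List files under the given paths that contain plaintext key material.
# -r recurse, -I skip binary files, -l print matching file names, -E extended regex.
# Exit status is 0 when at least one file matches, non-zero otherwise.
scan_plaintext_secrets() {
  grep -rIlE \
    -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
    -e 'AGE-SECRET-KEY-1' \
    "$@"
}

# Example (the base checkout path is an assumption):
#   scan_plaintext_secrets "$BASE_CHECKOUT" /nix/store && echo "LEAK FOUND"
```

Any hit in the base or in `/nix/store` is a failure of this guardrail, since sops-nix is expected to decrypt at activation time rather than into the store.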

---

## 7. Open decisions (log in DECISIONS.md)

- **Flake input vs git submodule** for the `cc-ci-secrets` repo (default: flake input).
- **Bootstrap-key provisioning** for a new VM: provision the off-box recovery age key to the host
  (decrypt as-is) vs generate the new host's key + re-encrypt secrets to it. (Recovery key is
  simpler for a clone; per-host re-encrypt is cleaner long-term.)
- **Final `cc-nix-test` sizing** after the test: keep 4 GB, restore to 6 GB, or **promote the
  freshly-rebuilt reproducible VM** to be the canonical cc-ci and retire the old one.
- **DNS/gateway as IaC** (terraform for the Gandi records + the gateway) — likely a separate future
  item ([[IDEAS]]), out of 1c scope; 1c keeps them as documented external preconditions.
- **Secrets-repo fetch during rebuild:** how the loops fetch the private `cc-ci-secrets` repo (bot
  token-in-URL vs a read deploy key) — it's private under `recipe-maintainers` (bot is admin).