From 70f108d2fac368ee9ae56b4a8e360edb5da8e7b0 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Wed, 27 May 2026 18:37:02 +0100
Subject: [PATCH] 1c/W4 DONE: genuine throwaway-VM live rebuild (single switch, 0 failed, byte-identical, TLS leaf==git cert); Gate W4 CLAIMED + install.md updated

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 BACKLOG-1c.md   |  7 +++--
 JOURNAL-1c.md   | 24 +++++++++++++++
 STATUS-1c.md    | 34 +++++++++++++--------
 docs/install.md | 79 +++++++++++++++++++++++++++++++++----------------
 4 files changed, 103 insertions(+), 41 deletions(-)

diff --git a/BACKLOG-1c.md b/BACKLOG-1c.md
index 0edecb8..6cb3e33 100644
--- a/BACKLOG-1c.md
+++ b/BACKLOG-1c.md
@@ -19,9 +19,10 @@ Method W1–W6 from the phase plan §5. Each milestone ends with an Adversary ga
   0 failed units, all stacks 1/1, cert survived reboot via sops, TLS 200. Running RAM 8 GB.
 - [x] **W3 — Throwaway VM.** `ccci-throwaway` (incus-base, 4 GB/20 GB) reachable at 100.126.124.86
   (used live TS_AUTH_KEY; workspace key stale). Bootstrap age key provisioned in W4.
-- [ ] **W4 — Reproducible live rebuild.** On throwaway VM: clone base+secrets, `nixos-rebuild switch`,
-  watch oneshots converge, secrets+cert decrypt. Accept: fully up, no step outside docs/install.md;
-  capture evidence. **Gate W4 CLAIMED.**
+- [x] **W4 — Reproducible live rebuild.** Fresh blank VM + recovery age key only → `git clone
+  --recursive` + ONE `nixos-rebuild switch ?submodules=1` → running/0-failed, byte-identical
+  `ld19aj2`==cc-ci, 6 stacks 1/1, all secrets+cert decrypt, TLS leaf==git cert. Found+fixed a
+  concurrent-abra race (serialized reconcilers). **Gate W4 CLAIMED** (awaiting Adversary W5).
 - [ ] **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 independently; rewrites D8
   evidence (static+live), removes "infeasible by design". Accept: Adversary D8 live-rebuild PASS
   (or narrow signed-off limitation per C5).
diff --git a/JOURNAL-1c.md b/JOURNAL-1c.md
index 5c16401..2dded8f 100644
--- a/JOURNAL-1c.md
+++ b/JOURNAL-1c.md
@@ -266,3 +266,27 @@ This is the LAST planned config change before W4 completes (config stable ld19aj
 live → traefik swarm secret → served leaf. The rebuilt host serves the git-sourced cert.
 
 Next: recreate throwaway FRESH with fixed config to prove SINGLE nixos-rebuild switch converges (0 failed).
+
+## 2026-05-27 — W4 DONE: genuine throwaway-VM live rebuild, SINGLE switch converges (Gate W4 CLAIMED)
+
+**Authoritative C4 proof on a FRESH blank VM** (destroyed the pre-fix VM, recreated clean; cloud-init
+used the LIVE TS_AUTH_KEY so it auto-joined the tailnet — no manual tailscale step):
+- Provisioned ONLY `/var/lib/sops-nix/key.txt` = recovery age key (pub == `age1cmk26…` == &master) —
+  the single out-of-band secret. `git clone --recursive` base+secrets (submodule 2312f1c, secrets ENC).
+- **One** `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (detached
+  --no-block) → `ccci-rebuild` Result=**success** (~15 min, 2.8 G peak < 4 GB).
+- **`systemctl is-system-running` → running, 0 failed units** (the serialization fix works: single
+  switch converges, no manual re-runs). Toplevel **`ld19aj2…` == cc-ci** (byte-identical).
+- **All 6 stacks 1/1**: traefik app+socket-proxy, drone, ccci-bridge, ccci-dashboard, backups.
+- **All secrets decrypted via the recovery key**; wildcard cert sops-decrypted from git →
+  `/var/lib/ci-certs/live/fullchain.pem` (symlink→/run/secrets, sha256 `c1d96d61…`).
+- **TLS from git cert (local, per C4 standard):** `curl --resolve probe.ci.commoninternet.net:443:
+  127.0.0.1` → http=404 (no route, expected) **ssl_verify=0**; served leaf sha256 fingerprint
+  **== git-cert leaf == `57:8D:67:9E:FE:89:…:B8:A6`** (Adversary reference). Full chain of custody.
+
+So: blank NixOS host + the two git repos + the one bootstrap age key + external DNS/gateway → one
+`nixos-rebuild switch` → working cc-ci. No undocumented manual step. This closes D8 honestly (static
+byte-identical closure + live throwaway rebuild). install.md updated to this validated procedure.
+
+Destroying the throwaway now (frees RAM for the Adversary's independent W5 cold rebuild; C6 no-leftover).
+Gate W4 CLAIMED — awaiting Adversary cold W5 (their own fresh VM).
diff --git a/STATUS-1c.md b/STATUS-1c.md
index 4e8a289..bbf6e40 100644
--- a/STATUS-1c.md
+++ b/STATUS-1c.md
@@ -9,14 +9,15 @@ The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this p
 Now: make the VM fully reproducible from git (secrets+cert in a private `cc-ci-secrets` repo) and
 perform a genuine throwaway-VM live rebuild to close D8 honestly.
 
-## In flight — W4 (throwaway live rebuild)
-- W1 DONE (cc-nix-test 6→4 GB, healthy). W2 PASS (Adversary cold). W3 DONE (VM reachable).
-- W4 Step A DONE: cc-ci on final config with `sops.age.keyFile` + serialized abra reconcilers →
-  byte-identical **`ld19aj2…`** (zero drift). (config evolved vh6vwxbl→izsmiajw→ld19aj2; ld19aj2 is final.)
-- W4 Step B (1st run, pre-fix): blank VM built **izsmiajw==cc-ci byte-identical** from git + recovery
-  key; cert+secrets decrypted; TLS leaf == git cert (`57:8D:…:B8:A6`). Found+fixed concurrent-abra
-  race (serialized reconcilers). **Now: fresh throwaway booting → prove SINGLE switch converges (0 failed).**
-- Then claim **Gate W4**.
+## In flight — W4 DONE, Gate W4 CLAIMED
+- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
+- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only `/var/lib/sops-nix/
+  key.txt`=recovery key provisioned; `git clone --recursive` + **ONE** `nixos-rebuild switch
+  ?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all
+  secrets+cert decrypted via recovery key, **TLS leaf == git cert** (`57:8D:…:B8:A6`), no manual step.
+  (Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
+- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
+- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
 
 W2 detail (PASS)
 ## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
@@ -31,10 +32,19 @@ perform a genuine throwaway-VM live rebuild to close D8 honestly.
 
 ## Gate
-**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified: byte-identical
-`vh6vwxbl`==running from a fresh recursive clone (zero drift), cert sops-decrypted from git + live TLS
-served from git cert (leaf fingerprint match), no plaintext leak in base/store. No regression, no VETO.
-Now proceeding: **W1 (resize) → W3 (throwaway VM) → W4 (live rebuild).**
+**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
+(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
+throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
+private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
+creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
+Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
+`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
+the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
+JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
+
+**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
+from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
+(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.
 
 prior
 **Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
diff --git a/docs/install.md b/docs/install.md
index ea6c24c..a185eb2 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -1,20 +1,37 @@
 # Installing cc-ci from scratch
 
-> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8).
+> The full from-scratch rebuild is **verified** (Phase-1c / D8): a blank NixOS Incus VM, given the two
+> repos + the single bootstrap age key, becomes a fully-converged cc-ci via one `nixos-rebuild switch`.
 
-cc-ci is declared **entirely** as a NixOS flake (this repo). Bringing up the box is just
-**clone + `nixos-rebuild switch`** + the operator preconditions — no manual post-steps. The proxy
-(traefik) and Drone server are deployed by **idempotent-reconcile systemd oneshots** (`modules/
-proxy.nix`, `modules/drone.nix`) that converge the swarm to the desired state on every activation
-and boot (and self-heal drift), mirroring `swarm-init`. Target: a NixOS 24.11 host reachable as
-`cc-ci` over SSH (root).
+cc-ci is declared **entirely** as a NixOS flake — base config in this repo (`cc-ci`) and **all
+secrets (incl. the wildcard TLS cert) sops-encrypted in a private companion repo `cc-ci-secrets`,
+mounted as a git submodule at `secrets/`**. Bringing up the box is: **clone `--recursive` + provision
+the one bootstrap age key + `nixos-rebuild switch`** + the external DNS/gateway — no manual
+post-steps. The proxy (traefik), Drone, comment-bridge, dashboard and backupbot are deployed by
+**idempotent-reconcile systemd oneshots** that converge the swarm on every activation/boot (and
+self-heal drift), mirroring `swarm-init`; they are **serialized** (proxy→drone→bridge→dashboard→
+backupbot) so a single switch converges on a blank host. Target: a NixOS 24.11 host reachable over SSH (root).
+*(Verified on a throwaway Incus VM: blank host + the two repos + the age key → one `nixos-rebuild
+switch` → fully converged cc-ci, 0 failed units — see DECISIONS.md Phase-1c / D8.)*
 
-## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md)
+## Preconditions
 
-- Wildcard TLS cert at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
-  (`*.ci.commoninternet.net` + `ci.commoninternet.net`). **Renewed out-of-band; never ACME here.**
+**The one out-of-band secret (provision before the first rebuild):**
+- The **bootstrap age key** at `/var/lib/sops-nix/key.txt` (mode 0600). It must be a sops recipient
+  of `cc-ci-secrets/secrets.yaml`. Two cases:
+  - **Canonical cc-ci:** its SSH host key is already a recipient — also works via `age.sshKeyPaths`;
+    the keyFile holds the host-derived age identity (`ssh-to-age -private-key -i
+    /etc/ssh/ssh_host_ed25519_key`).
+  - **A fresh/cloned host** (different SSH host key, not a recipient): provision the **off-box
+    recovery age key** (`age1cmk26…`'s private half) there — it decrypts every secret incl. the cert.
+  Everything else (cert, Drone OAuth/RPC, webhook HMAC) is sops-encrypted **in git** — nothing else
+  is provisioned out-of-band.
+
+**External infra (operator-owned, not on the box — class-A1):**
 - DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci.
 - Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
+- The wildcard cert is **renewed out-of-band** by the operator, who then re-encrypts it into
+  `cc-ci-secrets` (sops) and rebuilds — the Gandi DNS token never touches the box; **never ACME here.**
 
 ## 1. Apply the NixOS flake (this is the whole install)
 
@@ -25,29 +42,39 @@ host SSH key), Docker + single-node Swarm + the `proxy` overlay + firewall 80/44
 **Drone exec runner** (`modules/drone-runner.nix`).
 
 ```sh
-# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech)
-# e.g. git clone /root/cc-ci (or sync it)
-nixos-rebuild switch --flake /root/cc-ci#cc-ci
+# 1. Clone base + the private secrets submodule (bot/deploy creds for cc-ci-secrets).
+#    The submodule provides secrets/secrets.yaml (sops). Use a credential that can read
+#    recipe-maintainers/cc-ci-secrets, e.g. a per-command header (never persisted):
+git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /root/cc-ci
+# (if cloned non-recursively: git -C /root/cc-ci submodule update --init)
+
+# 2. Provision the bootstrap age key (see Preconditions) — the ONE out-of-band secret:
+install -m700 -d /var/lib/sops-nix
+install -m600 /path/to/bootstrap-age-key /var/lib/sops-nix/key.txt
+
+# 3. One nixos-rebuild switch. NOTE: ?submodules=1 so the git flake includes secrets/.
+nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'
 ```
 
-On activation, the reconcile oneshots (`deploy-proxy`, `deploy-drone`) run automatically and converge
-the swarm. Verify:
+On activation sops-nix decrypts every secret (incl. the wildcard cert → `/var/lib/ci-certs/live/`),
+then the serialized reconcile oneshots converge the swarm. Verify:
 
 ```sh
-systemctl is-system-running # -> running
-docker info --format '{{.Swarm.LocalNodeState}}' # -> active
-docker service ls # traefik (app+socket-proxy) + drone, all 1/1
-systemctl is-active deploy-proxy deploy-drone drone-runner-exec # -> active x3
-# wildcard cert served end-to-end via the gateway:
-curl -ksv --resolve probe.ci.commoninternet.net:443: https://probe.ci.commoninternet.net/ \
-  2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
-curl -ks --resolve drone.ci.commoninternet.net:443: \
-  -o /dev/null -w '%{http_code}\n' https://drone.ci.commoninternet.net/healthz # -> 200
+systemctl is-system-running # -> running (0 failed units)
+docker service ls # traefik app+socket-proxy, drone, bridge, dashboard, backups — all 1/1
+# cert is sops-decrypted FROM GIT to the path traefik serves:
+sha256sum /var/lib/ci-certs/live/fullchain.pem # symlink -> /run/secrets/wildcard_cert
+# TLS served from the git cert, verified locally on the host (SNI ci.commoninternet.net):
+curl -s --resolve probe.ci.commoninternet.net:443:127.0.0.1 \
+  -o /dev/null -w 'ssl_verify=%{ssl_verify_result}\n' https://probe.ci.commoninternet.net/ # -> 0
+# (the served leaf fingerprint == the cert in cc-ci-secrets)
 ```
 
 > Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
-> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`):
-> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci`
+> it survives the tailscale restart during activation, and use the absolute flake ref:
+> `systemd-run --no-block --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`
+> *(On the canonical cc-ci the build source is synced from the admin's clone via `tar | ssh` and built
+> as a `path:` flake — no submodule fetch needed there; the `?submodules=1` form is for a git clone.)*
 
 ## 2. One-time: link Drone ↔ Gitea (OAuth grant)