1c/W4 DONE: genuine throwaway-VM live rebuild (single switch, 0 failed, byte-identical, TLS leaf==git cert); Gate W4 CLAIMED + install.md updated
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -19,9 +19,10 @@ Method W1–W6 from the phase plan §5. Each milestone ends with an Adversary ga
0 failed units, all stacks 1/1, cert survived reboot via sops, TLS 200. Running RAM 8 GB.
- [x] **W3 — Throwaway VM.** `ccci-throwaway` (incus-base, 4 GB/20 GB) reachable at 100.126.124.86
(used live TS_AUTH_KEY; workspace key stale). Bootstrap age key provisioned in W4.
- [ ] **W4 — Reproducible live rebuild.** On throwaway VM: clone base+secrets, `nixos-rebuild switch`,
watch oneshots converge, secrets+cert decrypt. Accept: fully up, no step outside docs/install.md;
capture evidence. **Gate W4 CLAIMED.**
- [x] **W4 — Reproducible live rebuild.** Fresh blank VM + recovery age key only → `git clone
--recursive` + ONE `nixos-rebuild switch ?submodules=1` → running/0-failed, byte-identical
`ld19aj2`==cc-ci, 6 stacks 1/1, all secrets+cert decrypt, TLS leaf==git cert. Found+fixed a
concurrent-abra race (serialized reconcilers). **Gate W4 CLAIMED** (awaiting Adversary W5).
- [ ] **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 independently; rewrites D8
evidence (static+live), removes "infeasible by design". Accept: Adversary D8 live-rebuild PASS
(or narrow signed-off limitation per C5).

@@ -266,3 +266,27 @@ This is the LAST planned config change before W4 completes (config stable ld19aj
live → traefik swarm secret → served leaf. The rebuilt host serves the git-sourced cert.

Next: recreate throwaway FRESH with fixed config to prove SINGLE nixos-rebuild switch converges (0 failed).

## 2026-05-27 — W4 DONE: genuine throwaway-VM live rebuild, SINGLE switch converges (Gate W4 CLAIMED)

**Authoritative C4 proof on a FRESH blank VM** (destroyed the pre-fix VM, recreated clean; cloud-init
used the LIVE TS_AUTH_KEY so it auto-joined the tailnet — no manual tailscale step):
- Provisioned ONLY `/var/lib/sops-nix/key.txt` = recovery age key (pub == `age1cmk26…` == &master) —
the single out-of-band secret. `git clone --recursive` base+secrets (submodule 2312f1c, secrets ENC).
- **One** `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (detached
--no-block) → `ccci-rebuild` Result=**success** (~15 min, 2.8 GB peak < 4 GB).
- **`systemctl is-system-running` → running, 0 failed units** (the serialization fix works: single
switch converges, no manual re-runs). Toplevel **`ld19aj2…` == cc-ci** (byte-identical).
- **All 6 stacks 1/1**: traefik app+socket-proxy, drone, ccci-bridge, ccci-dashboard, backups.
- **All secrets decrypted via the recovery key**; wildcard cert sops-decrypted from git →
`/var/lib/ci-certs/live/fullchain.pem` (symlink→/run/secrets, sha256 `c1d96d61…`).
- **TLS from git cert (local, per C4 standard):** `curl --resolve probe.ci.commoninternet.net:443:
127.0.0.1` → http=404 (no route, expected) **ssl_verify=0**; served leaf sha256 fingerprint
**== git-cert leaf == `57:8D:67:9E:FE:89:…:B8:A6`** (Adversary reference). Full chain of custody.

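The chain of custody in the last two bullets (cert decrypted from git to `/var/lib/ci-certs/live/fullchain.pem`, then served by traefik) reduces to two comparisons: the file's sha256 against the recorded `c1d96d61…` prefix, and the served leaf's fingerprint against the git cert's. A minimal POSIX sh sketch, assuming GNU `sha256sum` and `openssl` on the host; all helper names are hypothetical. `normalize_fp` strips the `Fingerprint=` label (OpenSSL 1.1 and 3.x spell it with different case) so the two values compare as plain upper-case hex.

```sh
# Hypothetical helpers for the W4 chain-of-custody check (not part of cc-ci).

# digest_has_prefix FILE PREFIX: succeed if sha256(FILE) starts with PREFIX.
# sha256sum reads through symlinks, so the /var/lib/ci-certs/live symlink is fine.
digest_has_prefix() {
  digest=$(sha256sum "$1" | cut -d' ' -f1)
  case "$digest" in
    "$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# normalize_fp: read an `openssl x509 -fingerprint -sha256` line on stdin,
# print only the colon-separated hex, upper-cased.
normalize_fp() {
  sed 's/^.*Fingerprint=//' | tr 'a-f' 'A-F'
}

# served_leaf_fp HOST: fingerprint of the leaf served on 127.0.0.1:443 for SNI HOST.
served_leaf_fp() {
  openssl s_client -connect 127.0.0.1:443 -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256 | normalize_fp
}

# git_leaf_fp FILE: fingerprint of the first (leaf) certificate in a PEM chain.
git_leaf_fp() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 | normalize_fp
}

# Usage on the host (c1d96d61 is the recorded digest prefix):
#   digest_has_prefix /var/lib/ci-certs/live/fullchain.pem c1d96d61 && echo cert-from-git
#   [ "$(served_leaf_fp probe.ci.commoninternet.net)" = \
#     "$(git_leaf_fp /var/lib/ci-certs/live/fullchain.pem)" ] && echo leaf-matches
```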
So: blank NixOS host + the two git repos + the one bootstrap age key + external DNS/gateway → one
`nixos-rebuild switch` → working cc-ci. No undocumented manual step. This closes D8 honestly (static
byte-identical closure + live throwaway rebuild). install.md updated to this validated procedure.

Destroying the throwaway now (frees RAM for the Adversary's independent W5 cold rebuild; C6 no-leftover).
Gate W4 CLAIMED — awaiting Adversary cold W5 (their own fresh VM).

34
STATUS-1c.md
@@ -9,14 +9,15 @@ The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this p
Now: make the VM fully reproducible from git (secrets+cert in a private `cc-ci-secrets` repo) and
perform a genuine throwaway-VM live rebuild to close D8 honestly.

## In flight — W4 (throwaway live rebuild)
- W1 DONE (cc-nix-test 6→4 GB, healthy). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 Step A DONE: cc-ci on final config with `sops.age.keyFile` + serialized abra reconcilers →
byte-identical **`ld19aj2…`** (zero drift). (config evolved vh6vwxbl→izsmiajw→ld19aj2; ld19aj2 is final.)
- W4 Step B (1st run, pre-fix): blank VM built **izsmiajw==cc-ci byte-identical** from git + recovery
key; cert+secrets decrypted; TLS leaf == git cert (`57:8D:…:B8:A6`). Found+fixed concurrent-abra
race (serialized reconcilers). **Now: fresh throwaway booting → prove SINGLE switch converges (0 failed).**
- Then claim **Gate W4**.
## In flight — W4 DONE, Gate W4 CLAIMED
- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only `/var/lib/sops-nix/
key.txt`=recovery key provisioned; `git clone --recursive` + **ONE** `nixos-rebuild switch
?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all
secrets+cert decrypted via recovery key, **TLS leaf == git cert** (`57:8D:…:B8:A6`), no manual step.
(Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).

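"Byte-identical" in these notes means both hosts' system toplevels are the same Nix store path, so their hash part (`ld19aj2…`) matches. A minimal sketch of that comparison, assuming each host's toplevel is read with `readlink -f /run/current-system`; `store_hash` and `same_toplevel` are hypothetical helpers, and `<throwaway>` in the usage comment is a placeholder.

```sh
# store_hash PATH: print the 32-character hash part of a /nix/store path.
store_hash() {
  basename "$1" | cut -c1-32
}

# same_toplevel A B: succeed if two toplevel store paths share the same hash.
same_toplevel() {
  [ "$(store_hash "$1")" = "$(store_hash "$2")" ]
}

# Usage (root SSH to both hosts assumed):
#   a=$(ssh root@cc-ci readlink -f /run/current-system)
#   b=$(ssh root@<throwaway> readlink -f /run/current-system)
#   same_toplevel "$a" "$b" && echo byte-identical
```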
<details><summary>W2 detail (PASS)</summary>
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
@@ -31,10 +32,19 @@ perform a genuine throwaway-VM live rebuild to close D8 honestly.
</details>

## Gate
**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified: byte-identical
`vh6vwxbl`==running from a fresh recursive clone (zero drift), cert sops-decrypted from git + live TLS
served from git cert (leaf fingerprint match), no plaintext leak in base/store. No regression, no VETO.
Now proceeding: **W1 (resize) → W3 (throwaway VM) → W4 (live rebuild).**
**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)

**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.

<details><summary>prior</summary>
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**

@@ -1,20 +1,37 @@
# Installing cc-ci from scratch

> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8).
> The full from-scratch rebuild is **verified** (Phase-1c / D8): a blank NixOS Incus VM, given the two
> repos + the single bootstrap age key, becomes a fully-converged cc-ci via one `nixos-rebuild switch`.

cc-ci is declared **entirely** as a NixOS flake (this repo). Bringing up the box is just
**clone + `nixos-rebuild switch`** + the operator preconditions — no manual post-steps. The proxy
(traefik) and Drone server are deployed by **idempotent-reconcile systemd oneshots** (`modules/
proxy.nix`, `modules/drone.nix`) that converge the swarm to the desired state on every activation
and boot (and self-heal drift), mirroring `swarm-init`. Target: a NixOS 24.11 host reachable as
`cc-ci` over SSH (root).
cc-ci is declared **entirely** as a NixOS flake — base config in this repo (`cc-ci`) and **all
secrets (incl. the wildcard TLS cert) sops-encrypted in a private companion repo `cc-ci-secrets`,
mounted as a git submodule at `secrets/`**. Bringing up the box is: **clone `--recursive` + provision
the one bootstrap age key + `nixos-rebuild switch`** + the external DNS/gateway — no manual
post-steps. The proxy (traefik), Drone, comment-bridge, dashboard and backupbot are deployed by
**idempotent-reconcile systemd oneshots** that converge the swarm on every activation/boot (and
self-heal drift), mirroring `swarm-init`; they are **serialized** (proxy→drone→bridge→dashboard→
backupbot) so a single switch converges on a blank host. Target: a NixOS 24.11 host reachable over SSH (root).
*(Verified on a throwaway Incus VM: blank host + the two repos + the age key → one `nixos-rebuild
switch` → fully converged cc-ci, 0 failed units — see DECISIONS.md Phase-1c / D8.)*

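One way to confirm the serialization on a running host is to read each oneshot's direct `After=` list from systemd. This is a sketch under an assumption: the text says only that the oneshots are serialized, not how, so it supposes plain `After=` ordering between consecutive units. `deploy-proxy` and `deploy-drone` are unit names used in this doc; the third name in the usage comment is hypothetical. `after_list` wraps `systemctl show -p After --value` so it can be stubbed.

```sh
# after_list UNIT: print UNIT's direct After= dependencies (space-separated).
after_list() {
  systemctl show -p After --value "$1"
}

# ordered_after UNIT PRED: succeed if PRED appears in UNIT's After= list.
ordered_after() {
  for u in $(after_list "$1"); do
    [ "$u" = "$2" ] && return 0
  done
  return 1
}

# check_chain U1 U2 ...: each unit must be ordered directly after the one before it.
check_chain() {
  prev=$1; shift
  for unit in "$@"; do
    if ! ordered_after "$unit" "$prev"; then
      echo "not ordered after $prev: $unit" >&2
      return 1
    fi
    prev=$unit
  done
}

# Usage (deploy-bridge.service is a hypothetical name):
#   check_chain deploy-proxy.service deploy-drone.service deploy-bridge.service
```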
## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md)
## Preconditions

- Wildcard TLS cert at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
(`*.ci.commoninternet.net` + `ci.commoninternet.net`). **Renewed out-of-band; never ACME here.**
**The one out-of-band secret (provision before the first rebuild):**
- The **bootstrap age key** at `/var/lib/sops-nix/key.txt` (mode 0600). It must be a sops recipient
of `cc-ci-secrets/secrets.yaml`. Two cases:
  - **Canonical cc-ci:** its SSH host key is already a recipient — also works via `age.sshKeyPaths`;
the keyFile holds the host-derived age identity (`ssh-to-age -private-key -i
/etc/ssh/ssh_host_ed25519_key`).
  - **A fresh/cloned host** (different SSH host key, not a recipient): provision the **off-box
recovery age key** (`age1cmk26…`'s private half) there — it decrypts every secret incl. the cert.
Everything else (cert, Drone OAuth/RPC, webhook HMAC) is sops-encrypted **in git** — nothing else
is provisioned out-of-band.

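A quick pre-flight check for the bootstrap key before the first switch: confirm the file exists with mode 0600, then print its public recipient with `age-keygen -y` and compare it against the recipients of `cc-ci-secrets/secrets.yaml`. A minimal sketch; `check_key_file` is a hypothetical helper and GNU `stat` is assumed.

```sh
# check_key_file PATH: succeed if PATH is a regular file with mode 600.
check_key_file() {
  if [ ! -f "$1" ]; then
    echo "missing: $1" >&2
    return 1
  fi
  mode=$(stat -c '%a' "$1")
  if [ "$mode" != 600 ]; then
    echo "mode is $mode, want 600: $1" >&2
    return 1
  fi
}

# Usage on the host:
#   check_key_file /var/lib/sops-nix/key.txt
#   age-keygen -y /var/lib/sops-nix/key.txt   # prints the public recipient; it must be
#                                              # a sops recipient of cc-ci-secrets/secrets.yaml
```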
**External infra (operator-owned, not on the box — class-A1):**
- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which passes TLS through to cc-ci unterminated (SNI routing).
- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
- The wildcard cert is **renewed out-of-band** by the operator, who then re-encrypts it into
`cc-ci-secrets` (sops) and rebuilds — the Gandi DNS token never touches the box; **never ACME here.**

## 1. Apply the NixOS flake (this is the whole install)

@@ -25,29 +42,39 @@ host SSH key), Docker + single-node Swarm + the `proxy` overlay + firewall 80/44
**Drone exec runner** (`modules/drone-runner.nix`).

```sh
# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech)
# e.g. git clone <repo> /root/cc-ci (or sync it)
nixos-rebuild switch --flake /root/cc-ci#cc-ci
# 1. Clone base + the private secrets submodule (bot/deploy creds for cc-ci-secrets).
# The submodule provides secrets/secrets.yaml (sops). Use a credential that can read
# recipe-maintainers/cc-ci-secrets, e.g. a per-command header (never persisted):
git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /root/cc-ci
# (if cloned non-recursively: git -C /root/cc-ci submodule update --init)

# 2. Provision the bootstrap age key (see Preconditions) — the ONE out-of-band secret:
install -m700 -d /var/lib/sops-nix
install -m600 /path/to/bootstrap-age-key /var/lib/sops-nix/key.txt

# 3. One nixos-rebuild switch. NOTE: ?submodules=1 so the git flake includes secrets/.
nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'
```

On activation, the reconcile oneshots (`deploy-proxy`, `deploy-drone`) run automatically and converge
the swarm. Verify:
On activation sops-nix decrypts every secret (incl. the wildcard cert → `/var/lib/ci-certs/live/`),
then the serialized reconcile oneshots converge the swarm. Verify:

```sh
systemctl is-system-running # -> running
docker info --format '{{.Swarm.LocalNodeState}}' # -> active
docker service ls # traefik (app+socket-proxy) + drone, all 1/1
systemctl is-active deploy-proxy deploy-drone drone-runner-exec # -> active x3
# wildcard cert served end-to-end via the gateway:
curl -ksv --resolve probe.ci.commoninternet.net:443:<gateway-ip> https://probe.ci.commoninternet.net/ \
2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
curl -ks --resolve drone.ci.commoninternet.net:443:<gateway-ip> \
-o /dev/null -w '%{http_code}\n' https://drone.ci.commoninternet.net/healthz # -> 200
systemctl is-system-running # -> running (0 failed units)
docker service ls # traefik app+socket-proxy, drone, bridge, dashboard, backups — all 1/1
# cert is sops-decrypted FROM GIT to the path traefik serves:
sha256sum /var/lib/ci-certs/live/fullchain.pem # symlink -> /run/secrets/wildcard_cert
# TLS served from the git cert, verified locally on the host (SNI probe.ci.commoninternet.net, covered by the wildcard):
curl -s --resolve probe.ci.commoninternet.net:443:127.0.0.1 \
-o /dev/null -w 'ssl_verify=%{ssl_verify_result}\n' https://probe.ci.commoninternet.net/ # -> 0
# (the served leaf fingerprint == the cert in cc-ci-secrets)
```

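The "all 1/1" checks above can be made mechanical. `docker service ls --format '{{.Name}} {{.Replicas}}'` prints one `NAME running/desired` line per service; some services may show a trailing note such as `(max 1 per node)` after the count, which this ignores. A minimal sketch with a hypothetical helper that fails if any service is short of its desired count, or if no services are listed at all.

```sh
# replicas_converged: read "NAME running/desired [...]" lines on stdin;
# succeed only if at least one service is listed and every one has running == desired.
replicas_converged() {
  seen=0; bad=0
  while read -r name replicas _rest; do
    [ -n "$name" ] || continue
    seen=1
    running=${replicas%%/*}   # text before the slash
    desired=${replicas#*/}    # text after the slash
    if [ "$running" != "$desired" ]; then
      echo "not converged: $name $replicas" >&2
      bad=1
    fi
  done
  [ "$seen" = 1 ] && [ "$bad" = 0 ]
}

# Usage on the host:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | replicas_converged && echo all-converged
```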
> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`):
> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci`
> it survives the tailscale restart during activation, and use the absolute flake ref:
> `systemd-run --no-block --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`
> *(On the canonical cc-ci the build source is synced from the admin's clone via `tar | ssh` and built
> as a `path:` flake — no submodule fetch needed there; the `?submodules=1` form is for a git clone.)*

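With `--no-block`, `systemd-run` returns as soon as the job is queued, so the outcome has to be polled afterwards. A minimal sketch, assuming the `ccci-sw` unit name from the tip; `unit_outcome` and `wait_unit` are hypothetical helpers that classify the `ActiveState=`/`Result=` pair printed by `systemctl show`. One caveat: systemd garbage-collects a transient unit once it finishes successfully, so `inactive` with `Result=success` is treated as done, and `journalctl -u ccci-sw` shows the actual run log.

```sh
# unit_outcome: read `systemctl show -p ActiveState -p Result UNIT` output on stdin
# and print one of: pending, success, failed.
unit_outcome() {
  state=; result=
  while IFS='=' read -r key value; do
    case "$key" in
      ActiveState) state=$value ;;
      Result) result=$value ;;
    esac
  done
  case "$state" in
    activating|deactivating|reloading) echo pending ;;
    failed) echo failed ;;
    *) if [ "$result" = success ]; then echo success; else echo failed; fi ;;
  esac
}

# wait_unit UNIT [INTERVAL]: poll until UNIT is no longer pending; succeed on success.
wait_unit() {
  while :; do
    sleep "${2:-30}"   # sleep first: a just-queued job can still read as inactive/success
    outcome=$(systemctl show -p ActiveState -p Result "$1" | unit_outcome)
    [ "$outcome" = pending ] || break
  done
  echo "$1: $outcome"
  [ "$outcome" = success ]
}

# Usage, after the systemd-run shown in the tip:
#   wait_unit ccci-sw
```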
## 2. One-time: link Drone ↔ Gitea (OAuth grant)