1c/W4 DONE: genuine throwaway-VM live rebuild (single switch, 0 failed, byte-identical, TLS leaf==git cert); Gate W4 CLAIMED + install.md updated
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -19,9 +19,10 @@ Method W1–W6 from the phase plan §5. Each milestone ends with an Adversary ga
|
|||||||
0 failed units, all stacks 1/1, cert survived reboot via sops, TLS 200. Running RAM 8 GB.
|
0 failed units, all stacks 1/1, cert survived reboot via sops, TLS 200. Running RAM 8 GB.
|
||||||
- [x] **W3 — Throwaway VM.** `ccci-throwaway` (incus-base, 4 GB/20 GB) reachable at 100.126.124.86
|
- [x] **W3 — Throwaway VM.** `ccci-throwaway` (incus-base, 4 GB/20 GB) reachable at 100.126.124.86
|
||||||
(used live TS_AUTH_KEY; workspace key stale). Bootstrap age key provisioned in W4.
|
(used live TS_AUTH_KEY; workspace key stale). Bootstrap age key provisioned in W4.
|
||||||
- [ ] **W4 — Reproducible live rebuild.** On throwaway VM: clone base+secrets, `nixos-rebuild switch`,
|
- [x] **W4 — Reproducible live rebuild.** Fresh blank VM + recovery age key only → `git clone
|
||||||
watch oneshots converge, secrets+cert decrypt. Accept: fully up, no step outside docs/install.md;
|
--recursive` + ONE `nixos-rebuild switch ?submodules=1` → running/0-failed, byte-identical
|
||||||
capture evidence. **Gate W4 CLAIMED.**
|
`ld19aj2`==cc-ci, 6 stacks 1/1, all secrets+cert decrypt, TLS leaf==git cert. Found+fixed a
|
||||||
|
concurrent-abra race (serialized reconcilers). **Gate W4 CLAIMED** (awaiting Adversary W5).
|
||||||
- [ ] **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 independently; rewrites D8
|
- [ ] **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 independently; rewrites D8
|
||||||
evidence (static+live), removes "infeasible by design". Accept: Adversary D8 live-rebuild PASS
|
evidence (static+live), removes "infeasible by design". Accept: Adversary D8 live-rebuild PASS
|
||||||
(or narrow signed-off limitation per C5).
|
(or narrow signed-off limitation per C5).
|
||||||
|
|||||||
@@ -266,3 +266,27 @@ This is the LAST planned config change before W4 completes (config stable ld19aj
|
|||||||
live → traefik swarm secret → served leaf. The rebuilt host serves the git-sourced cert.
|
live → traefik swarm secret → served leaf. The rebuilt host serves the git-sourced cert.
|
||||||
|
|
||||||
Next: recreate throwaway FRESH with fixed config to prove SINGLE nixos-rebuild switch converges (0 failed).
|
Next: recreate throwaway FRESH with fixed config to prove SINGLE nixos-rebuild switch converges (0 failed).
|
||||||
|
|
||||||
|
## 2026-05-27 — W4 DONE: genuine throwaway-VM live rebuild, SINGLE switch converges (Gate W4 CLAIMED)
|
||||||
|
|
||||||
|
**Authoritative C4 proof on a FRESH blank VM** (destroyed the pre-fix VM, recreated clean; cloud-init
|
||||||
|
used the LIVE TS_AUTH_KEY so it auto-joined the tailnet — no manual tailscale step):
|
||||||
|
- Provisioned ONLY `/var/lib/sops-nix/key.txt` = recovery age key (pub == `age1cmk26…` == &master) —
|
||||||
|
the single out-of-band secret. `git clone --recursive` base+secrets (submodule 2312f1c, secrets ENC).
|
||||||
|
- **One** `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (detached
|
||||||
|
--no-block) → `ccci-rebuild` Result=**success** (~15 min, 2.8 G peak < 4 GB).
|
||||||
|
- **`systemctl is-system-running` → running, 0 failed units** (the serialization fix works: single
|
||||||
|
switch converges, no manual re-runs). Toplevel **`ld19aj2…` == cc-ci** (byte-identical).
|
||||||
|
- **All 6 stacks 1/1**: traefik app+socket-proxy, drone, ccci-bridge, ccci-dashboard, backups.
|
||||||
|
- **All secrets decrypted via the recovery key**; wildcard cert sops-decrypted from git →
|
||||||
|
`/var/lib/ci-certs/live/fullchain.pem` (symlink→/run/secrets, sha256 `c1d96d61…`).
|
||||||
|
- **TLS from git cert (local, per C4 standard):** `curl --resolve probe.ci.commoninternet.net:443:
|
||||||
|
127.0.0.1` → http=404 (no route, expected) **ssl_verify=0**; served leaf sha256 fingerprint
|
||||||
|
**== git-cert leaf == `57:8D:67:9E:FE:89:…:B8:A6`** (Adversary reference). Full chain of custody.
|
||||||
|
|
||||||
|
So: blank NixOS host + the two git repos + the one bootstrap age key + external DNS/gateway → one
|
||||||
|
`nixos-rebuild switch` → working cc-ci. No undocumented manual step. This closes D8 honestly (static
|
||||||
|
byte-identical closure + live throwaway rebuild). install.md updated to this validated procedure.
|
||||||
|
|
||||||
|
Destroying the throwaway now (frees RAM for the Adversary's independent W5 cold rebuild; C6 no-leftover).
|
||||||
|
Gate W4 CLAIMED — awaiting Adversary cold W5 (their own fresh VM).
|
||||||
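The chain-of-custody claim in this entry (served leaf fingerprint == git-cert leaf) comes down to comparing two SHA-256 fingerprints that may be formatted differently. A minimal sketch of that comparison follows. `normalize_fp` and `same_leaf` are hypothetical helper names, not taken from the cc-ci repo, and this is not necessarily how the check was run during W4.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; not part of the cc-ci repo.

# Reduce an `openssl x509 -fingerprint`-style line to bare lowercase hex,
# so "sha256 Fingerprint=AB:CD" and "abcd" compare equal.
normalize_fp() {
    printf '%s' "$1" | sed 's/^.*=//' | tr -d ':' | tr 'A-F' 'a-f'
}

# Succeed only when both arguments denote the same digest.
same_leaf() {
    [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}
```

On the rebuilt host the two inputs could plausibly come from `openssl x509 -noout -fingerprint -sha256 -in /var/lib/ci-certs/live/fullchain.pem` (which reads the first certificate in the file, i.e. the leaf) and from `openssl s_client -connect 127.0.0.1:443 -servername probe.ci.commoninternet.net </dev/null | openssl x509 -noout -fingerprint -sha256`. Treat both invocations as an assumption about the procedure rather than a record of it.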
|
|||||||
34
STATUS-1c.md
@@ -9,14 +9,15 @@ The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this p
|
|||||||
Now: make the VM fully reproducible from git (secrets+cert in a private `cc-ci-secrets` repo) and
|
Now: make the VM fully reproducible from git (secrets+cert in a private `cc-ci-secrets` repo) and
|
||||||
perform a genuine throwaway-VM live rebuild to close D8 honestly.
|
perform a genuine throwaway-VM live rebuild to close D8 honestly.
|
||||||
|
|
||||||
## In flight — W4 (throwaway live rebuild)
|
## In flight — W4 DONE, Gate W4 CLAIMED
|
||||||
- W1 DONE (cc-nix-test 6→4 GB, healthy). W2 PASS (Adversary cold). W3 DONE (VM reachable).
|
- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
|
||||||
- W4 Step A DONE: cc-ci on final config with `sops.age.keyFile` + serialized abra reconcilers →
|
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only `/var/lib/sops-nix/
|
||||||
byte-identical **`ld19aj2…`** (zero drift). (config evolved vh6vwxbl→izsmiajw→ld19aj2; ld19aj2 is final.)
|
key.txt`=recovery key provisioned; `git clone --recursive` + **ONE** `nixos-rebuild switch
|
||||||
- W4 Step B (1st run, pre-fix): blank VM built **izsmiajw==cc-ci byte-identical** from git + recovery
|
?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all
|
||||||
key; cert+secrets decrypted; TLS leaf == git cert (`57:8D:…:B8:A6`). Found+fixed concurrent-abra
|
secrets+cert decrypted via recovery key, **TLS leaf == git cert** (`57:8D:…:B8:A6`), no manual step.
|
||||||
race (serialized reconcilers). **Now: fresh throwaway booting → prove SINGLE switch converges (0 failed).**
|
(Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
|
||||||
- Then claim **Gate W4**.
|
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
|
||||||
|
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
|
||||||
|
|
||||||
<details><summary>W2 detail (PASS)</summary>
|
<details><summary>W2 detail (PASS)</summary>
|
||||||
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
|
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
|
||||||
@@ -31,10 +32,19 @@ perform a genuine throwaway-VM live rebuild to close D8 honestly.
|
|||||||
</details>
|
</details>
|
||||||
|
|
||||||
## Gate
|
## Gate
|
||||||
**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified: byte-identical
|
**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
|
||||||
`vh6vwxbl`==running from a fresh recursive clone (zero drift), cert sops-decrypted from git + live TLS
|
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
|
||||||
served from git cert (leaf fingerprint match), no plaintext leak in base/store. No regression, no VETO.
|
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
|
||||||
Now proceeding: **W1 (resize) → W3 (throwaway VM) → W4 (live rebuild).**
|
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
|
||||||
|
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
|
||||||
|
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
|
||||||
|
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
|
||||||
|
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
|
||||||
|
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
|
||||||
|
|
||||||
|
**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
|
||||||
|
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
|
||||||
|
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.
|
||||||
|
|
||||||
<details><summary>prior</summary>
|
<details><summary>prior</summary>
|
||||||
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
|
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
|
||||||
|
|||||||
@@ -1,20 +1,37 @@
|
|||||||
# Installing cc-ci from scratch
|
# Installing cc-ci from scratch
|
||||||
|
|
||||||
> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8).
|
> The full from-scratch rebuild is **verified** (Phase-1c / D8): a blank NixOS Incus VM, given the two
|
||||||
|
> repos + the single bootstrap age key, becomes a fully-converged cc-ci via one `nixos-rebuild switch`.
|
||||||
|
|
||||||
cc-ci is declared **entirely** as a NixOS flake (this repo). Bringing up the box is just
|
cc-ci is declared **entirely** as a NixOS flake — base config in this repo (`cc-ci`) and **all
|
||||||
**clone + `nixos-rebuild switch`** + the operator preconditions — no manual post-steps. The proxy
|
secrets (incl. the wildcard TLS cert) sops-encrypted in a private companion repo `cc-ci-secrets`,
|
||||||
(traefik) and Drone server are deployed by **idempotent-reconcile systemd oneshots** (`modules/
|
mounted as a git submodule at `secrets/`**. Bringing up the box is: **clone `--recursive` + provision
|
||||||
proxy.nix`, `modules/drone.nix`) that converge the swarm to the desired state on every activation
|
the one bootstrap age key + `nixos-rebuild switch`** + the external DNS/gateway — no manual
|
||||||
and boot (and self-heal drift), mirroring `swarm-init`. Target: a NixOS 24.11 host reachable as
|
post-steps. The proxy (traefik), Drone, comment-bridge, dashboard and backupbot are deployed by
|
||||||
`cc-ci` over SSH (root).
|
**idempotent-reconcile systemd oneshots** that converge the swarm on every activation/boot (and
|
||||||
|
self-heal drift), mirroring `swarm-init`; they are **serialized** (proxy→drone→bridge→dashboard→
|
||||||
|
backupbot) so a single switch converges on a blank host. Target: a NixOS 24.11 host reachable over SSH (root).
|
||||||
|
*(Verified on a throwaway Incus VM: blank host + the two repos + the age key → one `nixos-rebuild
|
||||||
|
switch` → fully converged cc-ci, 0 failed units — see DECISIONS.md Phase-1c / D8.)*
|
||||||
|
|
||||||
## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md)
|
## Preconditions
|
||||||
|
|
||||||
- Wildcard TLS cert at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
|
**The one out-of-band secret (provision before the first rebuild):**
|
||||||
(`*.ci.commoninternet.net` + `ci.commoninternet.net`). **Renewed out-of-band; never ACME here.**
|
- The **bootstrap age key** at `/var/lib/sops-nix/key.txt` (mode 0600). It must be a sops recipient
|
||||||
|
of `cc-ci-secrets/secrets.yaml`. Two cases:
|
||||||
|
- **Canonical cc-ci:** its SSH host key is already a recipient — also works via `age.sshKeyPaths`;
|
||||||
|
the keyFile holds the host-derived age identity (`ssh-to-age -private-key -i
|
||||||
|
/etc/ssh/ssh_host_ed25519_key`).
|
||||||
|
- **A fresh/cloned host** (different SSH host key, not a recipient): provision the **off-box
|
||||||
|
recovery age key** (`age1cmk26…`'s private half) there — it decrypts every secret incl. the cert.
|
||||||
|
Everything else (cert, Drone OAuth/RPC, webhook HMAC) is sops-encrypted **in git** — nothing else
|
||||||
|
is provisioned out-of-band.
|
||||||
|
|
||||||
|
**External infra (operator-owned, not on the box — class-A1):**
|
||||||
- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci.
|
- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci.
|
||||||
- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
|
- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
|
||||||
|
- The wildcard cert is **renewed out-of-band** by the operator, who then re-encrypts it into
|
||||||
|
`cc-ci-secrets` (sops) and rebuilds — the Gandi DNS token never touches the box; **never ACME here.**
|
||||||
|
|
||||||
## 1. Apply the NixOS flake (this is the whole install)
|
## 1. Apply the NixOS flake (this is the whole install)
|
||||||
|
|
||||||
@@ -25,29 +42,39 @@ host SSH key), Docker + single-node Swarm + the `proxy` overlay + firewall 80/44
|
|||||||
**Drone exec runner** (`modules/drone-runner.nix`).
|
**Drone exec runner** (`modules/drone-runner.nix`).
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech)
|
# 1. Clone base + the private secrets submodule (bot/deploy creds for cc-ci-secrets).
|
||||||
# e.g. git clone <repo> /root/cc-ci (or sync it)
|
# The submodule provides secrets/secrets.yaml (sops). Use a credential that can read
|
||||||
nixos-rebuild switch --flake /root/cc-ci#cc-ci
|
# recipe-maintainers/cc-ci-secrets, e.g. a per-command header (never persisted):
|
||||||
|
git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /root/cc-ci
|
||||||
|
# (if cloned non-recursively: git -C /root/cc-ci submodule update --init)
|
||||||
|
|
||||||
|
# 2. Provision the bootstrap age key (see Preconditions) — the ONE out-of-band secret:
|
||||||
|
install -m700 -d /var/lib/sops-nix
|
||||||
|
install -m600 /path/to/bootstrap-age-key /var/lib/sops-nix/key.txt
|
||||||
|
|
||||||
|
# 3. One nixos-rebuild switch. NOTE: ?submodules=1 so the git flake includes secrets/.
|
||||||
|
nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'
|
||||||
```
|
```
|
||||||
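Since the bootstrap age key is the only out-of-band input, a quick preflight before step 3 can catch a missing or mis-permissioned key early. The sketch below is an illustration under stated assumptions: `check_age_key` is a hypothetical helper (not part of the repo), it relies on GNU `stat -c`, and it only checks that the file looks like an age identity, not that the key is actually a sops recipient.

```sh
#!/bin/sh
# Hypothetical preflight helper; not part of the cc-ci repo. Assumes GNU stat.
check_age_key() {
    key=$1
    # The key file must exist as a regular file.
    [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
    # The install procedure provisions it with mode 0600.
    mode=$(stat -c '%a' "$key") || return 1
    [ "$mode" = "600" ] || { echo "mode $mode, expected 600: $key" >&2; return 1; }
    # age identities are lines beginning with AGE-SECRET-KEY-1.
    grep -q '^AGE-SECRET-KEY-1' "$key" || { echo "no age identity in $key" >&2; return 1; }
}
```

A call such as `check_age_key /var/lib/sops-nix/key.txt && nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` would then refuse to start the rebuild when the key is absent.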
|
|
||||||
On activation, the reconcile oneshots (`deploy-proxy`, `deploy-drone`) run automatically and converge
|
On activation sops-nix decrypts every secret (incl. the wildcard cert → `/var/lib/ci-certs/live/`),
|
||||||
the swarm. Verify:
|
then the serialized reconcile oneshots converge the swarm. Verify:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
systemctl is-system-running # -> running
|
systemctl is-system-running # -> running (0 failed units)
|
||||||
docker info --format '{{.Swarm.LocalNodeState}}' # -> active
|
docker service ls # traefik app+socket-proxy, drone, bridge, dashboard, backups — all 1/1
|
||||||
docker service ls # traefik (app+socket-proxy) + drone, all 1/1
|
# cert is sops-decrypted FROM GIT to the path traefik serves:
|
||||||
systemctl is-active deploy-proxy deploy-drone drone-runner-exec # -> active x3
|
sha256sum /var/lib/ci-certs/live/fullchain.pem # symlink -> /run/secrets/wildcard_cert
|
||||||
# wildcard cert served end-to-end via the gateway:
|
# TLS served from the git cert, verified locally on the host (SNI ci.commoninternet.net):
|
||||||
curl -ksv --resolve probe.ci.commoninternet.net:443:<gateway-ip> https://probe.ci.commoninternet.net/ \
|
curl -s --resolve probe.ci.commoninternet.net:443:127.0.0.1 \
|
||||||
2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
|
-o /dev/null -w 'ssl_verify=%{ssl_verify_result}\n' https://probe.ci.commoninternet.net/ # -> 0
|
||||||
curl -ks --resolve drone.ci.commoninternet.net:443:<gateway-ip> \
|
# (the served leaf fingerprint == the cert in cc-ci-secrets)
|
||||||
-o /dev/null -w '%{http_code}\n' https://drone.ci.commoninternet.net/healthz # -> 200
|
|
||||||
```
|
```
|
||||||
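The "all 1/1" check above is read by eye from `docker service ls`. For a scripted gate, one possible approach is to parse the replica column and fail when any service has not reached its desired count. `unconverged_services` below is a hypothetical helper, not something the repo ships, and it treats an empty service list as a failure on the assumption that a converged cc-ci always runs at least one stack.

```sh
#!/bin/sh
# Hypothetical helper; not part of the cc-ci repo.
# Reads "NAME REPLICAS" lines on stdin (e.g. "drone 1/1") and prints every
# service whose running count differs from its desired count.
# Exit status is 0 only if at least one service was listed and all converged.
unconverged_services() {
    awk '{
        split($2, r, "/")              # "1/1" -> r[1]=running, r[2]=desired
        if (r[1] != r[2]) { print $1; bad = 1 }
    } END { exit (bad || NR == 0) }'
}
```

Fed from the real CLI this would look like `docker service ls --format '{{.Name}} {{.Replicas}}' | unconverged_services`, combined with `systemctl is-system-running` for the unit-level check.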
|
|
||||||
> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
|
> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
|
||||||
> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`):
|
> it survives the Tailscale restart during activation, and use the absolute flake ref:
|
||||||
> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci`
|
> `systemd-run --no-block --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`
|
||||||
|
> *(On the canonical cc-ci the build source is synced from the admin's clone via `tar | ssh` and built
|
||||||
|
> as a `path:` flake — no submodule fetch needed there; the `?submodules=1` form is for a git clone.)*
|
||||||
|
|
||||||
## 2. One-time: link Drone ↔ Gitea (OAuth grant)
|
## 2. One-time: link Drone ↔ Gitea (OAuth grant)
|
||||||
|
|
||||||
|
|||||||