cc-ci/STATUS-1c.md
autonomic-bot 3d86e31730
All checks were successful
continuous-integration/drone/push Build is passing
1c/E2E-TESTME: PASS (E1-E6) — clean-room VM serves a real !testme run end-to-end over the public domain
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 19:43:08 +01:00


STATUS — Phase 1c (full git reproducibility + genuine D8 live rebuild)

Phase plan (SSOT): /srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md

Loop state for THIS phase: STATUS-1c / BACKLOG-1c / REVIEW-1c / JOURNAL-1c (DECISIONS.md shared). The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this phase's state.

Phase

1c kickoff — Phase 1 is DONE & Adversary-signed-off (1c10fa5; all D1-D10 PASS, no VETO). Now: make the VM fully reproducible from git (secrets+cert in a private cc-ci-secrets repo) and perform a genuine throwaway-VM live rebuild to close D8 honestly.

In flight — W4 DONE, Gate W4 CLAIMED

  • W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
  • W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only /var/lib/sops-nix/key.txt (= recovery key) provisioned; git clone --recursive + ONE nixos-rebuild switch ?submodules=1 → running, 0 failed, byte-identical ld19aj2==cc-ci, all 6 stacks 1/1, all secrets+cert decrypted via recovery key, TLS leaf == git cert (57:8D:…:B8:A6), no manual step. (Final config = ld19aj2: sops.age.keyFile + serialized abra reconcilers fixing a fresh-host race.)
  • Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
  • Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
W2 detail (PASS): In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed

  • [x] W2 step 1: private recipe-maintainers/cc-ci-secrets created + populated (6 infra secrets + wildcard cert/key, sops, both recipients; sha256 byte-perfect) + pushed.
  • [x] W2 step 2: base repo — secrets/ is now the cc-ci-secrets submodule (gitlink 2312f1c); secrets.nix adds wildcard_cert/wildcard_key → /var/lib/ci-certs/live/*; proxy.nix reframed. Pushed f79e542. Switched live cc-ci (toplevel vh6vwxbl…). Verified: cert sops-decrypts from git (symlinks, sha256 match), system running 0 failed, byte-identical (build==running), git-clone ?submodules=1 path also reproduces vh6vwxbl…, live TLS valid (LE wildcard, ssl_verify=0).
  • (Recovery-key sops.age.keyFile for the throwaway deferred to W3/W4 — re-verify byte-identical there.)

Gate

Gate: W4 — PASS @2026-05-27 18:55Z (Adversary, cold independent rebuild). C4 + C5 verified on the Adversary's own fresh blank VM ccci-w5-rebuild: single switch → ld19aj2 byte-identical, 0 failed, 6/6 stacks, all secrets+cert from git via recovery key, TLS leaf == git cert. C1-C5 all Adversary-PASS, no VETO. D8 honest (infeasible superseded). Narrow signed-off limitation: Drone↔Gitea OAuth grant (install.md §2 manual post-step) — validated functionally by E2E-TESTME next. Now (Builder): swap (ccci-w5-rebuild @ 100.97.167.73 → cc-nix-test) + run E2E-TESTME (E1-E6).

prior W4 CLAIMED

Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z. Genuine throwaway-VM live rebuild (C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my throwaway destroyed):

  1. Provision ONLY /var/lib/sops-nix/key.txt = recovery age key (age1cmk26… private half, from /srv/cc-ci/.sops/master-age.txt).
  2. git clone --recursive base+secrets (bot creds).
  3. nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' (per docs/install.md).

Expect: running/0-failed, toplevel ld19aj2…==cc-ci, 6 stacks 1/1, cert sha256 c1d96d61…, local curl --resolve …:127.0.0.1 ssl_verify=0 with served leaf == git cert 57:8D:…:B8:A6.

Then rewrite the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence: JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
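The provisioning, clone and switch steps in the W4 claim above can be sketched as a small dry-run script. This is only an illustration under assumptions: the base repo URL is not stated here, so `base_url` is a hypothetical parameter, and the `run` wrapper and the `install -D` call for placing the key are choices of this sketch rather than the procedure in docs/install.md.

```shell
#!/bin/sh
# Dry-run sketch of the cold rebuild. With DRY_RUN=1 (the default)
# every command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# rebuild_from_git BASE_URL KEY_SRC
#   BASE_URL: hypothetical clone URL of the base repo (assumption)
#   KEY_SRC:  path to the recovery age private key
rebuild_from_git() {
    base_url="$1"
    key_src="$2"
    # 1. The only out-of-band input: the recovery age key.
    run install -D -m 0600 "$key_src" /var/lib/sops-nix/key.txt
    # 2. Base repo plus the secrets submodule in one clone.
    run git clone --recursive "$base_url" /root/cc-ci
    # 3. One switch; ?submodules=1 lets the flake see secrets/.
    run nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'
}
```

Calling `rebuild_from_git <url> <key>` with the default `DRY_RUN=1` prints the three commands in order, which makes the sequence reviewable before anything touches a host.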

Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold). C1/C2/C3 verified (byte-identical, cert from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→ld19aj2 (keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.
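The "TLS leaf-match" check compares the certificate decrypted from git with the leaf actually served. A minimal sketch with `openssl` follows. It assumes SHA-256 fingerprints (the digest behind the `57:8D:…:B8:A6` value is not stated here), and the function names are hypothetical.

```shell
# fp_value: drop the "<algo> Fingerprint=" label printed by
# `openssl x509 -fingerprint` and upper-case the hex digits.
fp_value() {
    printf '%s\n' "${1#*=}" | tr 'a-f' 'A-F'
}

# file_fp PEM: fingerprint of the first certificate (the leaf) in PEM.
file_fp() {
    fp_value "$(openssl x509 -in "$1" -noout -fingerprint -sha256)"
}

# served_fp HOST:PORT SNI: fingerprint of the leaf served for SNI.
served_fp() {
    fp_value "$(openssl s_client -connect "$1" -servername "$2" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256)"
}

# leaf_matches PEM HOST:PORT SNI: succeed only if both fingerprints agree.
leaf_matches() {
    [ "$(file_fp "$1")" = "$(served_fp "$2" "$3")" ]
}
```

On the host this might be invoked as `leaf_matches /var/lib/ci-certs/live/fullchain.pem 127.0.0.1:443 ci.commoninternet.net`, where the port is an assumption.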

prior

Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z. Acceptance to verify (cold):

  1. Byte-identical nixos-rebuild build .#cc-ci == /run/current-system (vh6vwxbl4qr9whzpwgjimhf9gn4329p8). Must init the submodule (git clone --recursive / git submodule update --init, bot creds), then build --flake 'git+file://?submodules=1#cc-ci', else secrets/ is empty.
  2. Cert sops-decrypted from git to /var/lib/ci-certs/live/ (symlinks → /run/secrets, sha256 c1d96d61…/9ec25d00…) + live TLS served (https://ci.commoninternet.net).
  3. No plaintext secret in base repo or Nix store (all 8 secrets ENC in cc-ci-secrets; cert decrypts to tmpfs, not store).

See JOURNAL-1c 2026-05-27 W2a entry for full evidence.
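The byte-identical criterion comes down to the built system and the running system resolving to the same Nix store path. A hedged sketch of that comparison is below. The helper names are hypothetical, and it relies on `nixos-rebuild build` leaving a `result` symlink in the working directory.

```shell
# store_hash PATH: the hash component of a /nix/store path,
# i.e. everything between "/nix/store/" and the first "-".
store_hash() {
    base="${1#/nix/store/}"
    printf '%s\n' "${base%%-*}"
}

# same_system BUILT RUNNING: succeed only when both arguments
# resolve (through symlinks) to the same non-empty path.
same_system() {
    built="$(readlink -f "$1")"
    running="$(readlink -f "$2")"
    [ -n "$built" ] && [ "$built" = "$running" ]
}
```

After the build command quoted above, `same_system ./result /run/current-system` returns success only on a match, and `store_hash "$(readlink -f ./result)"` prints the value to compare against the recorded toplevel hash.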

Definition of Done (C1-C7 — see phase plan §3)

  • C1 — Secrets-repo split (Adversary-PASS 16:55Z; re-exercised cold on blank host at C4)
  • C2 — Cert in git (Adversary-PASS 16:55Z; re-exercised at C4)
  • C3 — All secrets in git, one exception = bootstrap age key (Adversary-PASS 16:55Z; keyFile-on-throwaway at W4)
  • C4 — Genuine throwaway-VM live rebuild (Incus terraform-ci, only age key provisioned)
  • C5 — Honest D8 (static byte-identical + live rebuild; "infeasible by design" removed)
  • C6 — Resource fit + cleanup (cc-nix-test 6→4 GB, throwaway 4 GB, destroyed after; final sizing decided)
  • C7 — Docs (install.md/secrets.md/architecture.md + main plan refs updated to new model)

E2E-TESTME — PASS @2026-05-27 (functional acceptance of D8/clean-room)

Real !testme on the rebuilt-from-git VM (swapped in as cc-nix-test) over the PUBLIC domain: E1 public 200/ssl_verify=0; E2 bridge→new Drone build #4 (>baseline #3, not manual); E3 app cust-bdddd9.ci.commoninternet.net EXTERNAL via gateway → HTTP/2 200, ssl_verify=0, real nginx body, CN=*.ci.commoninternet.net cert; E4 build #4 success, log shows real install/upgrade/backup (Playwright incl.) all passed, no softening; E5 clean undeploy (0 residual); E6 bridge PR comment " passed →…/cc-ci/4" + dashboard custom-html/success/#4. Evidence: JOURNAL-1c. Caught+fixed the Drone-bot-token reproducibility gap (af46aca) en route. Adversary independently verifies E1-E6. Remaining: swap-back; re-deploy af46aca to cc-ci (byte-identical at new toplevel cqym8knj…).

🔴 SWAP ACTIVE (2026-05-27 ~19:25Z) — public gateway points at the REBUILT VM (reversible)

State: cc-nix-test (MagicDNS) → 100.97.167.73 (rebuilt ccci-w5-rebuild); original cc-ci renamed cc-nix-test-orig @ 100.90.116.4, still running (swap-back target). Public ci.commoninternet.net now served by the rebuilt VM (P2 verified 200/ssl_verify=0). Doing E2E-TESTME.

E2E progress (2026-05-27 ~19:45Z): E1 PASS (public 200/ssl_verify=0). Original's bridge PAUSED (ccci-bridge_app 1/0 on cc-nix-test-orig). Rebuilt VM Drone OAuth done (admin=true, cc-ci active) — needed a script fix (auto-approve, committed ee585ef). Clean-room finding (committed af46aca): DRONE_USER_CREATE lacked token: → rebuilt Drone's bot token ≠ sops bridge_drone_token → bridge 401. Fix injects the sops token. NOT yet applied to the rebuilt VM (a no-op rebuild ran with old config first).

NEXT:

  1. git pull af46aca on rebuilt VM + nixos-rebuild switch (applies token).
  2. Verify bot token == sops (else docker volume rm Drone DB + redeploy so DRONE_USER_CREATE recreates the bot w/ token; then re-run OAuth bootstrap).
  3. Run !testme on custom-html#2 (head db9a9502) → verify E2-E6.
  4. Swap-back.
  5. Re-deploy af46aca to cc-ci + re-verify byte-identical (Adversary re-checks C1).

Access: ssh cc-ci (pinned 100.90.116.4) = the ORIGINAL (cc-nix-test-orig); reach the rebuilt VM via 100.97.167.73 or cc-nix-test MagicDNS.

SWAP-BACK when e2e done: rebuilt VM → tailscale set --hostname=ccci-w5-rebuild; then ssh cc-ci 'tailscale set --hostname=cc-nix-test'; restore original's bridge (docker service scale ccci-bridge_app=1 on the original — paused during e2e to avoid dual-trigger). Keep both VMs running.
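The swap-back order above can be written down as a dry-run script so the sequence is unambiguous. This is a sketch only: `REBUILT_SSH` is a hypothetical stand-in for however the rebuilt VM is actually reached (the real invocation uses a key and proxy not reproduced here), and `run` is a helper of this sketch.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical: how to reach the rebuilt VM (assumption, adjust as needed).
REBUILT_SSH="${REBUILT_SSH:-ssh root@100.97.167.73}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

swap_back() {
    # 1. Rebuilt VM releases the cc-nix-test name first.
    #    $REBUILT_SSH is intentionally unquoted so it splits into words.
    run $REBUILT_SSH 'tailscale set --hostname=ccci-w5-rebuild'
    # 2. Original (ssh cc-ci is pinned to its IP) takes the name back.
    run ssh cc-ci 'tailscale set --hostname=cc-nix-test'
    # 3. Resume the original's bridge, scaled to 0 during the e2e.
    run ssh cc-ci 'docker service scale ccci-bridge_app=1'
}
```

Releasing the name before reclaiming it is presumably what keeps two nodes from holding `cc-nix-test` at the same time. Both VMs stay running throughout.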

⚠️ Operator override — do NOT destroy the FINAL throwaway VM (read before W5/W6 cleanup)

The operator (2026-05-27) will repurpose the final W5/C4-C5 clean-room throwaway VM as the new cc-nix-test for a live real-traffic test. So: KEEP that VM running after W5 PASSes — do NOT tear it down in C6/W6. Defer its teardown until the operator explicitly says otherwise. This overrides the plan's "destroy the throwaway" for that one VM. (Adversary: please do not destroy your W5 VM on PASS.) This also settles C6 final sizing = promote the rebuilt VM. All other cleanup is normal (Builder's first throwaway already destroyed). See DECISIONS.md Phase-1c.

Pending functional-acceptance e2e — E2E-TESTME (BUILDER owns swap+test; gated on C4/C5 PASS)

Authority: /srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md (supersedes inline wording). MY test to execute end-to-end (incl. the tailnet swap — no orchestrator signal); Adversary independently verifies but does NOT rename nodes (actor/critic split — only ONE loop renames). Target VM = the ADVERSARY's kept-running W5 VM (Incus instance ccci-w5-rebuild; mine was destroyed). WAIT for: (i) Adversary W5 PASS in REVIEW-1c, AND (ii) the Adversary records that VM's Incus instance + CURRENT tailscale IP in REVIEW-1c/STATUS — only then swap. Sequence:

  1. Swap (Builder, 2 reversible tailscale set --hostname, ORDER MATTERS): (a) ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig' — original aside, keep running (swap-back); ssh cc-ci (pinned IP 100.90.116.4) keeps hitting the ORIGINAL. (b) Adversary's W5 VM (ccci-w5-rebuild) → cc-nix-test, using the IP the Adversary recorded (re-confirm online via tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status), then ssh -i …/vm_ssh_key -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<ip> 'tailscale set --hostname=cc-nix-test'. After swap, cc-nix-test.taila4a0bf.ts.net → that VM tailnet-wide (gateway auto-follows ~10s); target !testme/deploys by MagicDNS name, NOT raw IP (raw IP = original).
  2. Verify P1+P2: tailscale … status | grep cc-nix-test → throwaway IP; curl https://ci.commoninternet.net/ → 200, ssl_verify=0.
  3. Run E2E-TESTME (spec §2; E1-E6 below).
  4. Swap-back when done (reversible): rebuilt VM → its old name, then ssh cc-ci 'tailscale set --hostname=cc-nix-test' (restores original; gateway re-follows).

Watch-out (handle at execution): the ORIGINAL (cc-nix-test-orig) stays up with its bridge polling Gitea — to avoid duplicate builds/PR-comments, pause its bridge during the e2e (docker service scale ccci-bridge_app=0 on the original, restore after); and the rebuilt VM's Drone needs the one-time OAuth bootstrap (install.md §2) before it can clone/build.

Then: !testme as the bot on one fast enrolled recipe (e.g. custom-html) and verify the real path. Pass criteria (all):

  • E1: self-check 200/valid cert on rebuilt VM.
  • E2: new Drone build via the bridge (run# > baseline, not a manual trigger).
  • E3: app answers an EXTERNAL request at <app>.ci.commoninternet.net through the gateway (real 200 + valid cert + app content, NOT localhost, NOT a Traefik 404).
  • E4: real test assertions pass, build success (no softening).
  • E5: clean undeploy (no residual stack).
  • E6: result reported back + dashboard updated.

Evidence → JOURNAL-1c, verdict → STATUS-1c/REVIEW-1c as E2E-TESTME PASS. On failure: it's a clean-room finding — fix in git source (base / cc-ci-secrets), NOT the live VM, then re-run.
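The "200 + valid cert" part of E1 and E3 matches curl's `http_code` and `ssl_verify_result` write-out variables, which is presumably where the `ssl_verify=0` shorthand comes from. A minimal sketch, with hypothetical function names:

```shell
# probe URL: print "<http_code> <ssl_verify_result>" for one request.
# ssl_verify_result is 0 when certificate verification succeeded.
probe() {
    curl -sS -o /dev/null -w '%{http_code} %{ssl_verify_result}\n' "$1"
}

# probe_ok "<code> <verify>": pass only on HTTP 200 with a verified cert.
probe_ok() {
    [ "$1" = "200 0" ]
}
```

For example, `probe_ok "$(probe https://ci.commoninternet.net/)"` run from outside the VM covers the status and certificate parts of the criterion. Checking that the body is real app content rather than a proxy error page still needs a separate look at the response.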

Blocked

(none)

Notes

  • Current secret layout: secrets/secrets.yaml (6 infra secrets), recipients = host age key (ssh-to-age of cc-ci's ed25519 host key) + off-box master recovery key (/srv/cc-ci/.sops/master-age.txt, sandbox-only). .sops.yaml at repo root.
  • Wildcard cert currently out-of-band at /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} (operator-provided, LE, next renewal ~2026-08-24); proxy.nix reads it from there. 1c moves it into sops-in-git, decrypted back to that path at activation.
  • Sandbox host has NO sops/nix/age — sops ops run on cc-ci (has nix + host age key) or via the master key with a sops binary fetched on cc-ci.
  • cc-nix-test == the live cc-ci server (100.90.116.4); resizing it (W1) briefly stops it.