cc-ci/STATUS-1c.md
autonomic-bot 3d86e31730
1c/E2E-TESTME: PASS (E1-E6) — clean-room VM serves a real !testme run end-to-end over the public domain
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 19:43:08 +01:00

# STATUS — Phase 1c (full git reproducibility + genuine D8 live rebuild)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
**Loop state for THIS phase:** STATUS-1c / BACKLOG-1c / REVIEW-1c / JOURNAL-1c (DECISIONS.md shared).
The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this phase's state.
## Phase
**1c kickoff** — Phase 1 is DONE & Adversary-signed-off (1c10fa5; all D1-D10 PASS, no VETO).
Now: make the VM fully reproducible from git (secrets+cert in a private `cc-ci-secrets` repo) and
perform a genuine throwaway-VM live rebuild to close D8 honestly.
## In flight — W4 DONE, Gate W4 CLAIMED
- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only
`/var/lib/sops-nix/key.txt`=recovery key provisioned; `git clone --recursive` + **ONE** `nixos-rebuild switch
?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all
secrets+cert decrypted via recovery key, **TLS leaf == git cert** (`57:8D:…:B8:A6`), no manual step.
(Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
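
The single-switch rebuild described above can be sketched as a small shell helper. This is a minimal sketch, not the project's actual script: the function names `flake_ref` and `rebuild_from_clone` are hypothetical, and the clone path and host attribute are passed in as parameters (the gate notes use `/root/cc-ci` and `cc-ci`).

```shell
# Hypothetical helpers; the names are illustrative, not from the cc-ci repo.

# Build the flake reference for a local clone. "?submodules=1" tells Nix to
# include git submodules (here: secrets/) in the flake source.
flake_ref() {
  printf 'git+file://%s?submodules=1#%s\n' "$1" "$2"
}

# Refuse to rebuild when secrets/ is missing or empty, which is the symptom of
# a clone made without --recursive; otherwise run the one switch.
rebuild_from_clone() (
  clone_dir=$1
  host_attr=$2
  if [ -z "$(ls -A "$clone_dir/secrets" 2>/dev/null)" ]; then
    echo "secrets/ is empty in $clone_dir; run: git submodule update --init" >&2
    exit 1
  fi
  nixos-rebuild switch --flake "$(flake_ref "$clone_dir" "$host_attr")"
)
```

Usage would be `rebuild_from_clone /root/cc-ci cc-ci` on the fresh VM, after the recovery key is in place. The early check exists because an uninitialised submodule leaves `secrets/` empty.
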
<details><summary>W2 detail (PASS)</summary>
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
- [x] **W2 step 1:** private `recipe-maintainers/cc-ci-secrets` created + populated (6 infra secrets
+ wildcard cert/key, sops, both recipients; sha256 byte-perfect) + pushed.
- [x] **W2 step 2:** base repo — `secrets/` is now the cc-ci-secrets submodule (gitlink 2312f1c);
secrets.nix adds `wildcard_cert`/`wildcard_key` → `/var/lib/ci-certs/live/*`; proxy.nix reframed.
Pushed f79e542. Switched live cc-ci (toplevel `vh6vwxbl…`). **Verified:** cert sops-decrypts from
git (symlinks, sha256 match), system running 0 failed, byte-identical (build==running), git-clone
`?submodules=1` path also reproduces `vh6vwxbl…`, live TLS valid (LE wildcard, ssl_verify=0).
- (Recovery-key `sops.age.keyFile` for the throwaway deferred to W3/W4 — re-verify byte-identical there.)
</details>
## Gate
**Gate: W4 — PASS @2026-05-27 18:55Z (Adversary, cold independent rebuild).** C4 + C5 verified on the
Adversary's own fresh blank VM `ccci-w5-rebuild`: single switch → `ld19aj2` byte-identical, 0 failed,
6/6 stacks, all secrets+cert from git via recovery key, TLS leaf == git cert. **C1-C5 all
Adversary-PASS, no VETO.** D8 honest (infeasible superseded). Narrow signed-off limitation: Drone↔Gitea
OAuth grant (install.md §2 manual post-step) — validated functionally by E2E-TESTME next.
**Now (Builder): swap (`ccci-w5-rebuild @ 100.97.167.73` → cc-nix-test) + run E2E-TESTME (E1-E6).**
<details><summary>prior W4 CLAIMED</summary>
**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
</details>
**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.
<details><summary>prior</summary>
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
Acceptance to verify (cold): (1) byte-identical `nixos-rebuild build .#cc-ci` == `/run/current-system`
(`vh6vwxbl4qr9whzpwgjimhf9gn4329p8`) — **must init the submodule** (`git clone --recursive` / `git
submodule update --init`, bot creds) then build `--flake 'git+file://<clone>?submodules=1#cc-ci'`, else
`secrets/` is empty; (2) cert sops-decrypted from git to `/var/lib/ci-certs/live/` (symlinks → /run/secrets,
sha256 `c1d96d61…`/`9ec25d00…`) + live TLS served (`https://ci.commoninternet.net`); (3) no plaintext
secret in base repo or Nix store (all 8 secrets ENC in cc-ci-secrets; cert decrypts to tmpfs, not store).
See JOURNAL-1c 2026-05-27 W2a entry for full evidence.
</details>
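
The two comparisons in the W2 acceptance list (byte-identical toplevel, cert sha256 match) reduce to checks like the following. A minimal sketch assuming GNU coreutils (`readlink -f`, `sha256sum`); the function names are hypothetical and the example paths in the usage note are assumptions, not taken from the repo.

```shell
# Hypothetical helpers for acceptance items (1) and (2).

# True when both paths exist and resolve to the same real path, e.g. a build
# result symlink and /run/current-system pointing at the same store path.
same_target() {
  [ -e "$1" ] && [ -e "$2" ] &&
    [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# True when both files exist and have the same sha256 digest.
same_sha256() {
  [ -f "$1" ] && [ -f "$2" ] &&
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(sha256sum < "$2" | cut -d' ' -f1)" ]
}
```

For example, `same_target ./result /run/current-system` after a `nixos-rebuild build`, which leaves a `result` symlink in the working directory. The existence checks matter: without them, two missing files would both hash to an empty string and compare as equal.
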
## Definition of Done (C1-C7 — see phase plan §3)
- [x] C1 — Secrets-repo split (Adversary-PASS 16:55Z; re-exercised cold on blank host at C4)
- [x] C2 — Cert in git (Adversary-PASS 16:55Z; re-exercised at C4)
- [x] C3 — All secrets in git, one exception = bootstrap age key (Adversary-PASS 16:55Z; keyFile-on-throwaway at W4)
- [x] C4 — Genuine throwaway-VM live rebuild (Incus terraform-ci, only age key provisioned) (Adversary-PASS 18:55Z)
- [x] C5 — Honest D8 (static byte-identical + live rebuild; "infeasible by design" removed) (Adversary-PASS 18:55Z)
- [ ] C6 — Resource fit + cleanup (cc-nix-test 6→4 GB, throwaway 4 GB, destroyed after; final sizing decided)
- [ ] C7 — Docs (install.md/secrets.md/architecture.md + main plan refs updated to new model)
## ✅ E2E-TESTME — PASS @2026-05-27 (functional acceptance of D8/clean-room)
Real `!testme` on the rebuilt-from-git VM (swapped in as cc-nix-test) over the PUBLIC domain:
**E1** public 200/ssl_verify=0; **E2** bridge→new Drone build #4 (>baseline #3, not manual); **E3**
app `cust-bdddd9.ci.commoninternet.net` EXTERNAL via gateway → HTTP/2 200, ssl_verify=0, real nginx
body, `CN=*.ci.commoninternet.net` cert; **E4** build #4 success, log shows real install/upgrade/backup
(Playwright incl.) all passed, no softening; **E5** clean undeploy (0 residual); **E6** bridge PR
comment "✅ passed →…/cc-ci/4" + dashboard custom-html/success/#4. Evidence: JOURNAL-1c. Caught+fixed
the Drone-bot-token reproducibility gap (af46aca) en route. **Adversary independently verifies E1-E6.**
Remaining: swap-back; re-deploy af46aca to cc-ci (byte-identical at new toplevel `cqym8knj…`).
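
A check of the E1 kind (public 200 with `ssl_verify=0`) can be sketched with curl's write-out variables. This is an illustrative sketch only: the function names are hypothetical, and the `"<code> <verify>"` output format is an assumption chosen for easy comparison, not necessarily the format used for the recorded evidence.

```shell
# Hypothetical helpers for an E1-style probe.

# Probe a URL and print "<http_code> <ssl_verify_result>". Both are standard
# curl --write-out variables; ssl_verify_result is 0 when verification passed.
probe() {
  curl -sS -o /dev/null -w '%{http_code} %{ssl_verify_result}' "$1"
}

# Accept only the exact pair "200 0"; a 404 or a non-zero verify result fails.
probe_ok() {
  [ "$1" = "200 0" ]
}
```

Typical use would be `probe_ok "$(probe https://ci.commoninternet.net/)"`, run from outside the tailnet so the request goes through the public gateway.
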
## 🔴 SWAP ACTIVE (2026-05-27 ~19:25Z) — public gateway points at the REBUILT VM (reversible)
**State:** `cc-nix-test` (MagicDNS) → **`100.97.167.73`** (rebuilt `ccci-w5-rebuild`); original cc-ci
renamed `cc-nix-test-orig` @ `100.90.116.4`, **still running** (swap-back target). Public
`ci.commoninternet.net` now served by the rebuilt VM (P2 verified 200/ssl_verify=0). Doing E2E-TESTME.
**E2E progress (2026-05-27 ~19:45Z):** E1 PASS (public 200/ssl_verify=0). Original's bridge PAUSED
(`ccci-bridge_app` 1/0 on cc-nix-test-orig). Rebuilt VM Drone OAuth done (admin=true, cc-ci active) —
needed a script fix (auto-approve, committed ee585ef). **Clean-room finding (committed af46aca):**
`DRONE_USER_CREATE` lacked `token:` → rebuilt Drone's bot token ≠ sops `bridge_drone_token` → bridge
401. Fix injects the sops token. **NOT yet applied to the rebuilt VM** (a no-op rebuild ran with old
config first). **NEXT:** (1) git pull af46aca on rebuilt VM + `nixos-rebuild switch` (applies token);
(2) verify bot token == sops (else `docker volume rm` Drone DB + redeploy so DRONE_USER_CREATE recreates
the bot w/ token; then re-run OAuth bootstrap); (3) run `!testme` on custom-html#2 (head db9a9502) →
verify E2-E6; (4) swap-back; (5) re-deploy af46aca to cc-ci + re-verify byte-identical (Adversary re-checks C1).
**`ssh cc-ci` (pinned 100.90.116.4) = the ORIGINAL** (cc-nix-test-orig); reach the rebuilt VM via
`100.97.167.73` or `cc-nix-test` MagicDNS.
**SWAP-BACK when e2e done:** rebuilt VM → `tailscale set --hostname=ccci-w5-rebuild`; then
`ssh cc-ci 'tailscale set --hostname=cc-nix-test'`; restore original's bridge (`docker service scale
ccci-bridge_app=1` on the original — paused during e2e to avoid dual-trigger). Keep both VMs running.
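
The swap-back order can be written down as a dry-run plan before anything is executed. A minimal sketch: `swap_back_plan` is a hypothetical helper that only prints the steps from the paragraph above in order and runs nothing; how to reach the rebuilt VM is left out on purpose.

```shell
# Hypothetical dry-run helper: prints the swap-back steps in order, runs nothing.
# $1 = the name the rebuilt VM returns to (ccci-w5-rebuild in the text above).
swap_back_plan() {
  printf '%s\n' \
    "rebuilt VM:  tailscale set --hostname=$1" \
    "original:    ssh cc-ci 'tailscale set --hostname=cc-nix-test'" \
    "original:    docker service scale ccci-bridge_app=1"
}
```

Running `swap_back_plan ccci-w5-rebuild` prints three lines. The order follows the text: the rebuilt VM gives up the `cc-nix-test` name before the original reclaims it, and the bridge is restored last.
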
## ⚠️ Operator override — do NOT destroy the FINAL throwaway VM (read before W5/W6 cleanup)
The operator (2026-05-27) will **repurpose the final W5/C4-C5 clean-room throwaway VM as the new
cc-nix-test** for a live real-traffic test. So: **KEEP that VM running after W5 PASSes — do NOT tear
it down in C6/W6.** Defer its teardown until the operator explicitly says otherwise. This overrides the
plan's "destroy the throwaway" for that one VM. (Adversary: please do not destroy your W5 VM on PASS.)
This also settles C6 final sizing = **promote the rebuilt VM**. All other cleanup is normal (Builder's
first throwaway already destroyed). See DECISIONS.md Phase-1c.
### Pending functional-acceptance e2e — E2E-TESTME (BUILDER owns swap+test; gated on C4/C5 PASS)
**Authority: `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md`** (supersedes inline wording).
MY test to execute end-to-end (incl. the tailnet swap — **no orchestrator signal**); Adversary
independently verifies but does **NOT** rename nodes (actor/critic split — only ONE loop renames).
**Target VM = the ADVERSARY's kept-running W5 VM** (Incus instance `ccci-w5-rebuild`; mine was
destroyed). **WAIT for: (i) Adversary W5 PASS in REVIEW-1c, AND (ii) the Adversary records that VM's
Incus instance + CURRENT tailscale IP** in REVIEW-1c/STATUS — only then swap. Sequence:
1. **Swap (Builder, 2 reversible `tailscale set --hostname`, ORDER MATTERS):**
(a) `ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'` — original aside, **keep running** (swap-back);
`ssh cc-ci` (pinned IP 100.90.116.4) keeps hitting the ORIGINAL.
(b) Adversary's W5 VM (`ccci-w5-rebuild`) → `cc-nix-test`, using the IP the Adversary recorded
(re-confirm online via `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status`), then
`ssh -i …/vm_ssh_key -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<ip> 'tailscale set --hostname=cc-nix-test'`.
After swap, `cc-nix-test.taila4a0bf.ts.net` → that VM tailnet-wide (gateway auto-follows ~10s);
target !testme/deploys by MagicDNS name, NOT raw IP (raw IP = original).
2. **Verify P1+P2:** `tailscale … status | grep cc-nix-test` → throwaway IP; `curl https://ci.commoninternet.net/` → `200 ssl_verify=0`.
3. **Run E2E-TESTME** (spec §2; E1-E6 below).
4. **Swap-back when done** (reversible): rebuilt VM → its old name, then `ssh cc-ci 'tailscale set --hostname=cc-nix-test'` (restores original; gateway re-follows).
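
For the P1 check in step 2, note that a plain `grep cc-nix-test` also matches `cc-nix-test-orig` once the original has been renamed. An exact-match lookup avoids that. This is a sketch under one stated assumption: that `tailscale status` prints the tailnet IP in the first column and the hostname in the second. `ip_for_host` is a hypothetical helper name.

```shell
# Hypothetical helper for the P1 check. Reads `tailscale status` text on stdin
# and prints the IP of the node whose name is exactly $1.
# Assumption: column 1 is the tailnet IP and column 2 is the hostname.
ip_for_host() {
  awk -v host="$1" '$2 == host { print $1 }'
}
```

Usage would be the status command from step 1(b) piped into `ip_for_host cc-nix-test`; the result should be the rebuilt VM's IP, not the original's.
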
Watch-out (handle at execution): the ORIGINAL (cc-nix-test-orig) stays up with its bridge polling
Gitea — to avoid duplicate builds/PR-comments, pause its bridge during the e2e (`docker service
scale ccci-bridge_app=0` on the original, restore after); and the rebuilt VM's Drone needs the
one-time OAuth bootstrap (install.md §2) before it can clone/build.
Then: `!testme` as the bot on one fast enrolled recipe (e.g. `custom-html`) and verify the real path.
Pass criteria (all): **E1** self-check 200/valid cert on rebuilt VM; **E2** new Drone build via the
bridge (run# > baseline, not a manual trigger); **E3** app answers an **EXTERNAL** request at
`<app>.ci.commoninternet.net` through the gateway (real 200 + valid cert + app content, NOT localhost,
NOT a Traefik 404); **E4** real test assertions pass, build success (no softening); **E5** clean
undeploy (no residual stack); **E6** result reported back + dashboard updated. Evidence → JOURNAL-1c,
verdict → STATUS-1c/REVIEW-1c as **E2E-TESTME PASS**. On failure: it's a clean-room finding — fix in
**git source** (base / cc-ci-secrets), NOT the live VM, then re-run.
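
Two of the pass criteria are mechanical enough to sketch: E2 (run number above baseline) and the "all" rule for the overall verdict. These helpers are hypothetical and take plain values as arguments; how the run numbers and per-criterion results are collected is outside this sketch.

```shell
# Hypothetical helpers for the pass criteria.

# E2: the bridge-triggered build must have a run number above the baseline.
# Both arguments must be non-empty decimal integers.
is_new_build() (
  case "$1" in ''|*[!0-9]*) exit 1 ;; esac
  case "$2" in ''|*[!0-9]*) exit 1 ;; esac
  [ "$2" -gt "$1" ]
)

# Overall verdict: PASS only when exactly six results are given (E1..E6) and
# every one of them is the string "pass"; anything else prints FAIL.
verdict() (
  [ "$#" -eq 6 ] || { echo FAIL; exit 1; }
  for r in "$@"; do
    [ "$r" = "pass" ] || { echo FAIL; exit 1; }
  done
  echo PASS
)
```

For instance, `is_new_build 3 4` succeeds for baseline #3 and new build #4, while a re-run of an existing build number would not.
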
## Blocked
(none)
## Notes
- Current secret layout: `secrets/secrets.yaml` (6 infra secrets), recipients = host age key
(ssh-to-age of cc-ci's ed25519 host key) + off-box master recovery key
(`/srv/cc-ci/.sops/master-age.txt`, sandbox-only). `.sops.yaml` at repo root.
- Wildcard cert currently out-of-band at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
(operator-provided, LE, next renewal ~2026-08-24); proxy.nix reads it from there. 1c moves it
into sops-in-git, decrypted back to that path at activation.
- Sandbox host has NO sops/nix/age — sops ops run on cc-ci (has nix + host age key) or via the master
key with a sops binary fetched on cc-ci.
- cc-nix-test == the live cc-ci server (100.90.116.4); resizing it (W1) briefly stops it.