Files
cc-ci/machine-docs/STATUS-1c.md
autonomic-bot 992d87cfcd refactor(1b): RL6 — move Builder protocol files into machine-docs/ (README stays root)
git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator
decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and
docs/install.md -> machine-docs/...

Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by
every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so
it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 22:35:30 +01:00

196 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# STATUS — Phase 1c (full git reproducibility + genuine D8 live rebuild)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
**Loop state for THIS phase:** STATUS-1c / BACKLOG-1c / REVIEW-1c / JOURNAL-1c (DECISIONS.md shared).
The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this phase's state.
## DONE
**Phase 1c COMPLETE @2026-05-27.** All Definition-of-Done items **C1C7 + E2E-TESTME** are
Adversary-PASS within 24h (REVIEW-1c: W2 16:55Z, W5/C4/C5 18:55Z, E2E + C1C6 b301b03, C7 9e0f72a),
**no standing VETO, no open `[adversary]` findings** (ADV-1c-1 closed). Final Builder health check:
cc-ci `running`/0-failed, **byte-identical build==running==`cqym8knjg7nkly1wdgwkyr873fm8scfl` (ZERO
DRIFT)**, 6 stacks, cert sops-from-git `c1d96d61…`, public TLS `ci.commoninternet.net` 200/ssl_verify=0.
The VM is now fully reproducible from git: blank NixOS host + the two repos (`cc-ci` +
`cc-ci-secrets` submodule) + the one bootstrap age key → a single `nixos-rebuild switch` → a
working cc-ci that serves a real `!testme` run end-to-end over the public domain (proven on a
throwaway VM, cold, by both loops). D8 closed honestly (static byte-identical closure + live rebuild;
"infeasible by design" withdrawn). Found+fixed two real reproducibility gaps en route: the
concurrent-`abra` reconcile race (serialized) and the non-deterministic Drone bot token
(`DRONE_USER_CREATE token:`).
- [x] C1 secrets-repo split · [x] C2 cert-in-git · [x] C3 all-secrets-in-git (1 bootstrap key) ·
[x] C4 throwaway live rebuild · [x] C5 honest D8 · [x] C6 resize+sizing (promote rebuilt VM) ·
[x] C7 docs · [x] E2E-TESTME (E1E6).
Open items handed to the operator (not 1c-gating): physical promotion of `ccci-w5-rebuild` → cc-nix-test
(its bridge paused, stack up — restore at promotion); plan.md §4.0/§4.4 still carry pre-1c cert wording
(out-of-repo; superseding note added at §1.5). Adversary will append its final cold sign-off.
<details><summary>pre-DONE phase note</summary>
**1c — Builder COMPLETE; only ADV-1c-1 (C7 re-verify) between here and DONE.** All addressed.</details>
## In flight — W4 DONE, Gate W4 CLAIMED
- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only `/var/lib/sops-nix/
key.txt`=recovery key provisioned; `git clone --recursive` + **ONE** `nixos-rebuild switch
?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all
secrets+cert decrypted via recovery key, **TLS leaf == git cert** (`57:8D:…:B8:A6`), no manual step.
(Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
<details><summary>W2 detail (PASS)</summary>
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
- [x] **W2 step 1:** private `recipe-maintainers/cc-ci-secrets` created + populated (6 infra secrets
+ wildcard cert/key, sops, both recipients; sha256 byte-perfect) + pushed.
- [x] **W2 step 2:** base repo — `secrets/` is now the cc-ci-secrets submodule (gitlink 2312f1c);
secrets.nix adds `wildcard_cert`/`wildcard_key` → `/var/lib/ci-certs/live/*`; proxy.nix reframed.
Pushed f79e542. Switched live cc-ci (toplevel `vh6vwxbl…`). **Verified:** cert sops-decrypts from
git (symlinks, sha256 match), system running 0 failed, byte-identical (build==running), git-clone
`?submodules=1` path also reproduces `vh6vwxbl…`, live TLS valid (LE wildcard, ssl_verify=0).
- (Recovery-key `sops.age.keyFile` for the throwaway deferred to W3/W4 — re-verify byte-identical there.)
</details>
## 🟢 CONFIG FINAL @2026-05-27 ~20:05Z — toplevel `cqym8knjg7nkly1wdgwkyr873fm8scfl`
cc-ci switched to the FINAL config (secrets-split + cert-in-git + `sops.age.keyFile` + serialized abra
reconcilers + Drone-token fix). **Byte-identical: build==running==`cqym8knj…` (ZERO DRIFT)**, system
running 0 failed, bridge→Drone token OK. **No more config changes planned.**
**For the Adversary's final DONE verification:** (a) re-confirm **C1 byte-identical at `cqym8knj`**
(supersedes the ld19aj2 18:00Z / 18:55Z clocks — the only delta is the Drone-token fix af46aca);
(b) independently verify **E1E6** (E2E-TESTME — real `!testme`; note: requires the swap, OR verify
against the run #4 evidence + a fresh trigger; the rebuilt VM `ccci-w5-rebuild` is up with bridge
paused). C4/C5 hold (the rebuilt VM is also at `cqym8knj`; a fresh rebuild from the current repo
reproduces it). No VETO expected.
## Gate
**Gate: W4 — PASS @2026-05-27 18:55Z (Adversary, cold independent rebuild).** C4 + C5 verified on the
Adversary's own fresh blank VM `ccci-w5-rebuild`: single switch → `ld19aj2` byte-identical, 0 failed,
6/6 stacks, all secrets+cert from git via recovery key, TLS leaf == git cert. **C1C5 all
Adversary-PASS, no VETO.** D8 honest (infeasible superseded). Narrow signed-off limitation: Drone↔Gitea
OAuth grant (install.md §2 manual post-step) — validated functionally by E2E-TESTME next.
**Now (Builder): swap (`ccci-w5-rebuild @ 100.97.167.73` → cc-nix-test) + run E2E-TESTME (E1E6).**
<details><summary>prior W4 CLAIMED</summary>
**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
</details>
**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.
<details><summary>prior</summary>
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
Acceptance to verify (cold): (1) byte-identical `nixos-rebuild build .#cc-ci` == `/run/current-system`
(`vh6vwxbl4qr9whzpwgjimhf9gn4329p8`) — **must init the submodule** (`git clone --recursive` / `git
submodule update --init`, bot creds) then build `--flake 'git+file://<clone>?submodules=1#cc-ci'`, else
`secrets/` is empty; (2) cert sops-decrypted from git to `/var/lib/ci-certs/live/` (symlinks → /run/secrets,
sha256 `c1d96d61…`/`9ec25d00…`) + live TLS served (`https://ci.commoninternet.net`); (3) no plaintext
secret in base repo or Nix store (all 8 secrets ENC in cc-ci-secrets; cert decrypts to tmpfs, not store).
See JOURNAL-1c 2026-05-27 W2a entry for full evidence.
</details>
## Definition of Done (C1C7 — see phase plan §3)
- [x] C1 — Secrets-repo split (Adversary-PASS 16:55Z; re-exercised cold on blank host at C4)
- [x] C2 — Cert in git (Adversary-PASS 16:55Z; re-exercised at C4)
- [x] C3 — All secrets in git, one exception = bootstrap age key (Adversary-PASS 16:55Z; keyFile-on-throwaway at W4)
- [x] C4 — Genuine throwaway-VM live rebuild (Adversary-PASS W5 18:55Z, cold; rebuilt VM at cqym8knj)
- [x] C5 — Honest D8 (Adversary-PASS W5; static+live, "infeasible" superseded; narrow OAuth limitation signed off)
- [x] C6 — cc-nix-test 6→4 GB; first throwaway destroyed; final sizing = PROMOTE rebuilt VM (operator override, kept)
- [~] C7 — install.md/secrets.md/architecture.md + plan.md done; Adversary re-verify of architecture.md pending (ADV-1c-1, addressed 6276bfd)
## ✅ E2E-TESTME — PASS @2026-05-27 (functional acceptance of D8/clean-room)
Real `!testme` on the rebuilt-from-git VM (swapped in as cc-nix-test) over the PUBLIC domain:
**E1** public 200/ssl_verify=0; **E2** bridge→new Drone build #4 (>baseline #3, not manual); **E3**
app `cust-bdddd9.ci.commoninternet.net` EXTERNAL via gateway → HTTP/2 200, ssl_verify=0, real nginx
body, `CN=*.ci.commoninternet.net` cert; **E4** build #4 success, log shows real install/upgrade/backup
(Playwright incl.) all passed, no softening; **E5** clean undeploy (0 residual); **E6** bridge PR
comment "✅ passed →…/cc-ci/4" + dashboard custom-html/success/#4. Evidence: JOURNAL-1c. Caught+fixed
the Drone-bot-token reproducibility gap (af46aca) en route. **Adversary independently verifies E1-E6.**
Remaining: swap-back; re-deploy af46aca to cc-ci (byte-identical at new toplevel `cqym8knj…`).
## SWAP REVERTED (2026-05-27 ~20:00Z) — public back on the ORIGINAL cc-ci
E2E-TESTME passed; swapped back: `cc-nix-test` (MagicDNS) → `100.90.116.4` (original), public
`ci.commoninternet.net` → 200 ssl_verify=0 (original); original bridge restored 1/1, healthy. The
rebuilt VM `ccci-w5-rebuild` @ `100.97.167.73` is **kept running** (C6 override, operator promotes it)
with its **bridge paused** (`ccci-bridge_app` 0) to avoid dual-trigger on real PRs (operator restores
at promotion). Remaining: re-deploy af46aca (Drone-token fix, toplevel `cqym8knj…`) to the original cc-ci
→ re-verify byte-identical; Adversary re-checks C1 + verifies E1-E6.
<details><summary>swap-active history</summary>
Public gateway pointed at the rebuilt VM (`100.97.167.73`) during the e2e; original was cc-nix-test-orig.</details>
**E2E progress (2026-05-27 ~19:45Z):** E1 PASS (public 200/ssl_verify=0). Original's bridge PAUSED
(`ccci-bridge_app` 1/0 on cc-nix-test-orig). Rebuilt VM Drone OAuth done (admin=true, cc-ci active) —
needed a script fix (auto-approve, committed ee585ef). **Clean-room finding (committed af46aca):**
`DRONE_USER_CREATE` lacked `token:` → rebuilt Drone's bot token ≠ sops `bridge_drone_token` → bridge
401. Fix injects the sops token. **NOT yet applied to the rebuilt VM** (a no-op rebuild ran with old
config first). **NEXT:** (1) git pull af46aca on rebuilt VM + `nixos-rebuild switch` (applies token);
(2) verify bot token == sops (else `docker volume rm` Drone DB + redeploy so DRONE_USER_CREATE recreates
the bot w/ token; then re-run OAuth bootstrap); (3) run `!testme` on custom-html#2 (head db9a9502) →
verify E2E6; (4) swap-back; (5) re-deploy af46aca to cc-ci + re-verify byte-identical (Adversary re-checks C1).
**`ssh cc-ci` (pinned 100.90.116.4) = the ORIGINAL** (cc-nix-test-orig); reach the rebuilt VM via
`100.97.167.73` or `cc-nix-test` MagicDNS.
**SWAP-BACK when e2e done:** rebuilt VM → `tailscale set --hostname=ccci-w5-rebuild`; then
`ssh cc-ci 'tailscale set --hostname=cc-nix-test'`; restore original's bridge (`docker service scale
ccci-bridge_app=1` on the original — paused during e2e to avoid dual-trigger). Keep both VMs running.
## ⚠️ Operator override — do NOT destroy the FINAL throwaway VM (read before W5/W6 cleanup)
The operator (2026-05-27) will **repurpose the final W5/C4-C5 clean-room throwaway VM as the new
cc-nix-test** for a live real-traffic test. So: **KEEP that VM running after W5 PASSes — do NOT tear
it down in C6/W6.** Defer its teardown until the operator explicitly says otherwise. This overrides the
plan's "destroy the throwaway" for that one VM. (Adversary: please do not destroy your W5 VM on PASS.)
This also settles C6 final sizing = **promote the rebuilt VM**. All other cleanup is normal (Builder's
first throwaway already destroyed). See DECISIONS.md Phase-1c.
### Pending functional-acceptance e2e — E2E-TESTME (BUILDER owns swap+test; gated on C4/C5 PASS)
**Authority: `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md`** (supersedes inline wording).
MY test to execute end-to-end (incl. the tailnet swap — **no orchestrator signal**); Adversary
independently verifies but does **NOT** rename nodes (actor/critic split — only ONE loop renames).
**Target VM = the ADVERSARY's kept-running W5 VM** (Incus instance `ccci-w5-rebuild`; mine was
destroyed). **WAIT for: (i) Adversary W5 PASS in REVIEW-1c, AND (ii) the Adversary records that VM's
Incus instance + CURRENT tailscale IP** in REVIEW-1c/STATUS — only then swap. Sequence:
1. **Swap (Builder, 2 reversible `tailscale set --hostname`, ORDER MATTERS):**
(a) `ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'` — original aside, **keep running** (swap-back);
`ssh cc-ci` (pinned IP 100.90.116.4) keeps hitting the ORIGINAL.
(b) Adversary's W5 VM (`ccci-w5-rebuild`) → `cc-nix-test`, using the IP the Adversary recorded
(re-confirm online via `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status`), then
`ssh -i …/vm_ssh_key -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<ip> 'tailscale set --hostname=cc-nix-test'`.
After swap, `cc-nix-test.taila4a0bf.ts.net` → that VM tailnet-wide (gateway auto-follows ~10s);
target !testme/deploys by MagicDNS name, NOT raw IP (raw IP = original).
2. **Verify P1+P2:** `tailscale … status | grep cc-nix-test` → throwaway IP; `curl https://ci.commoninternet.net/` → `200 ssl_verify=0`.
3. **Run E2E-TESTME** (spec §2; E1E6 below). **4. Swap-back when done** (reversible): rebuilt VM →
its old name, then `ssh cc-ci 'tailscale set --hostname=cc-nix-test'` (restores original; gateway re-follows).
Watch-out (handle at execution): the ORIGINAL (cc-nix-test-orig) stays up with its bridge polling
Gitea — to avoid duplicate builds/PR-comments, pause its bridge during the e2e (`docker service
scale ccci-bridge_app=0` on the original, restore after); and the rebuilt VM's Drone needs the
one-time OAuth bootstrap (install.md §2) before it can clone/build.
Then: `!testme` as the bot on one fast enrolled recipe (e.g. `custom-html`) and verify the real path.
Pass criteria (all): **E1** self-check 200/valid cert on rebuilt VM; **E2** new Drone build via the
bridge (run# > baseline, not a manual trigger); **E3** app answers an **EXTERNAL** request at
`<app>.ci.commoninternet.net` through the gateway (real 200 + valid cert + app content, NOT localhost,
NOT a Traefik 404); **E4** real test assertions pass, build success (no softening); **E5** clean
undeploy (no residual stack); **E6** result reported back + dashboard updated. Evidence → JOURNAL-1c,
verdict → STATUS-1c/REVIEW-1c as **E2E-TESTME PASS**. On failure: it's a clean-room finding — fix in
**git source** (base / cc-ci-secrets), NOT the live VM, then re-run.
## Blocked
(none)
## Notes
- Current secret layout: `secrets/secrets.yaml` (6 infra secrets), recipients = host age key
(ssh-to-age of cc-ci's ed25519 host key) + off-box master recovery key
(`/srv/cc-ci/.sops/master-age.txt`, sandbox-only). `.sops.yaml` at repo root.
- Wildcard cert currently out-of-band at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
(operator-provided, LE, next renewal ~2026-08-24); proxy.nix reads it from there. 1c moves it
into sops-in-git, decrypted back to that path at activation.
- Sandbox host has NO sops/nix/age — sops ops run on cc-ci (has nix + host age key) or via the master
key with a sops binary fetched on cc-ci.
- cc-nix-test == the live cc-ci server (100.90.116.4); resizing it (W1) briefly stops it.