# STATUS — Phase 1c (full git reproducibility + genuine D8 live rebuild)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
**Loop state for THIS phase:** STATUS-1c / BACKLOG-1c / REVIEW-1c / JOURNAL-1c (DECISIONS.md shared).
The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this phase's state.

## DONE

**Phase 1c COMPLETE @2026-05-27.** All Definition-of-Done items **C1–C7 + E2E-TESTME** are
Adversary-PASS within 24h (REVIEW-1c: W2 16:55Z, W5/C4/C5 18:55Z, E2E + C1–C6 b301b03, C7 9e0f72a),
**no standing VETO, no open `[adversary]` findings** (ADV-1c-1 closed). Final Builder health check:
cc-ci `running`/0-failed, **byte-identical build==running==`cqym8knjg7nkly1wdgwkyr873fm8scfl` (ZERO
DRIFT)**, 6 stacks, cert sops-from-git `c1d96d61…`, public TLS `ci.commoninternet.net` 200/ssl_verify=0.

The VM is now fully reproducible from git: blank NixOS host + the two repos (`cc-ci` +
`cc-ci-secrets` submodule) + the one bootstrap age key → a single `nixos-rebuild switch` → a
working cc-ci that serves a real `!testme` run end-to-end over the public domain (proven on a
throwaway VM, cold, by both loops). D8 closed honestly (static byte-identical closure + live rebuild;
"infeasible by design" withdrawn). Found+fixed two real reproducibility gaps en route: the
concurrent-`abra` reconcile race (serialized) and the non-deterministic Drone bot token
(`DRONE_USER_CREATE token:`).

- [x] C1 secrets-repo split · [x] C2 cert-in-git · [x] C3 all-secrets-in-git (1 bootstrap key) ·
  [x] C4 throwaway live rebuild · [x] C5 honest D8 · [x] C6 resize+sizing (promote rebuilt VM) ·
  [x] C7 docs · [x] E2E-TESTME (E1–E6).

Open items handed to the operator (not 1c-gating): physical promotion of `ccci-w5-rebuild` → cc-nix-test
(its bridge paused, stack up — restore at promotion); plan.md §4.0/§4.4 still carry pre-1c cert wording
(out-of-repo; superseding note added at §1.5). Adversary will append its final cold sign-off.

<details><summary>pre-DONE phase note</summary>

**1c — Builder COMPLETE; only ADV-1c-1 (C7 re-verify) between here and DONE.** All addressed.

</details>

## In flight — W4 DONE, Gate W4 CLAIMED

- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only
  `/var/lib/sops-nix/key.txt`=recovery key provisioned; `git clone --recursive` + **ONE**
  `nixos-rebuild switch ?submodules=1` → **running, 0 failed**, byte-identical **`ld19aj2`==cc-ci**,
  all 6 stacks 1/1, all secrets+cert decrypted via recovery key, **TLS leaf == git cert**
  (`57:8D:…:B8:A6`), no manual step.
  (Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).
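
The W4 procedure above can be sketched as a dry-run script. This is a minimal sketch under stated assumptions, not the project's actual tooling: the `run` and `cold_rebuild` helpers, the `DRY_RUN` switch, and the `key_src` / `repo_url` arguments are hypothetical names introduced for illustration. The key path and the flake reference are the ones quoted in this file (the flake reference appears in the Gate W4 CLAIMED note). With `DRY_RUN=1`, the default here, it only prints the three commands.

```shell
#!/usr/bin/env bash
# Sketch of the cold rebuild on a blank NixOS host (assumed to be run as root).

KEY_DST=/var/lib/sops-nix/key.txt   # the one bootstrap secret provisioned by hand
CLONE_DIR=/root/cc-ci               # clone location implied by the flake reference
FLAKE_REF="git+file://${CLONE_DIR}?submodules=1#cc-ci"

# run: print the command in dry-run mode (default), otherwise execute it.
# The printed form is for reading only; it drops shell quoting.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# cold_rebuild KEY_SRC REPO_URL
# KEY_SRC and REPO_URL are placeholders; this file does not state the clone URL.
cold_rebuild() {
  local key_src=$1 repo_url=$2
  run install -D -m 0600 "$key_src" "$KEY_DST"        # 1. provision the recovery age key
  run git clone --recursive "$repo_url" "$CLONE_DIR"  # 2. base repo plus secrets submodule
  run nixos-rebuild switch --flake "$FLAKE_REF"       # 3. the single switch
}
```

Calling `cold_rebuild ./recovery-key.txt "$REPO_URL"` with the default settings prints the plan; setting `DRY_RUN=0` would execute it, which only makes sense on a throwaway host.
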

<details><summary>W2 detail (PASS)</summary>

## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed

- [x] **W2 step 1:** private `recipe-maintainers/cc-ci-secrets` created + populated (6 infra
  secrets + wildcard cert/key, sops, both recipients; sha256 byte-perfect) + pushed.
- [x] **W2 step 2:** base repo — `secrets/` is now the cc-ci-secrets submodule (gitlink 2312f1c);
  secrets.nix adds `wildcard_cert`/`wildcard_key` → `/var/lib/ci-certs/live/*`; proxy.nix reframed.
  Pushed f79e542. Switched live cc-ci (toplevel `vh6vwxbl…`). **Verified:** cert sops-decrypts from
  git (symlinks, sha256 match), system running 0 failed, byte-identical (build==running), git-clone
  `?submodules=1` path also reproduces `vh6vwxbl…`, live TLS valid (LE wildcard, ssl_verify=0).
- (Recovery-key `sops.age.keyFile` for the throwaway deferred to W3/W4 — re-verify byte-identical there.)

</details>

## 🟢 CONFIG FINAL @2026-05-27 ~20:05Z — toplevel `cqym8knjg7nkly1wdgwkyr873fm8scfl`

cc-ci switched to the FINAL config (secrets-split + cert-in-git + `sops.age.keyFile` + serialized abra
reconcilers + Drone-token fix). **Byte-identical: build==running==`cqym8knj…` (ZERO DRIFT)**, system
running 0 failed, bridge→Drone token OK. **No more config changes planned.**
**For the Adversary's final DONE verification:** (a) re-confirm **C1 byte-identical at `cqym8knj`**
(supersedes the ld19aj2 18:00Z / 18:55Z clocks — the only delta is the Drone-token fix af46aca);
(b) independently verify **E1–E6** (E2E-TESTME — real `!testme`; note: requires the swap, OR verify
against the run #4 evidence + a fresh trigger; the rebuilt VM `ccci-w5-rebuild` is up with bridge
paused). C4/C5 hold (the rebuilt VM is also at `cqym8knj`; a fresh rebuild from the current repo
reproduces it). No VETO expected.

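
The "build==running" comparison used throughout this file can be sketched as a small helper. This is an illustration only: `same_toplevel` is a hypothetical name, and the sketch assumes the freshly built system is reachable through a `result` symlink (as `nixos-rebuild build` leaves in the working directory) while the running system is `/run/current-system`.

```shell
#!/usr/bin/env bash
# same_toplevel BUILT RUNNING
# Succeeds when both arguments resolve to the same path (same Nix store toplevel).
same_toplevel() {
  local built running
  built=$(readlink -f -- "$1") || return 2
  running=$(readlink -f -- "$2") || return 2
  if [ "$built" = "$running" ]; then
    echo "ZERO DRIFT: $built"
  else
    echo "DRIFT: build=$built running=$running"
    return 1
  fi
}

# Assumed usage on the host, after a build left ./result behind:
#   same_toplevel ./result /run/current-system
```

Comparing resolved store paths is enough here because a NixOS toplevel path is derived from its inputs, so equal paths mean the same closure was built and activated.
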
## Gate

**Gate: W4 — PASS @2026-05-27 18:55Z (Adversary, cold independent rebuild).** C4 + C5 verified on the
Adversary's own fresh blank VM `ccci-w5-rebuild`: single switch → `ld19aj2` byte-identical, 0 failed,
6/6 stacks, all secrets+cert from git via recovery key, TLS leaf == git cert. **C1–C5 all
Adversary-PASS, no VETO.** D8 honest (infeasible superseded). Narrow signed-off limitation: Drone↔Gitea
OAuth grant (install.md §2 manual post-step) — validated functionally by E2E-TESTME next.

**Now (Builder): swap (`ccci-w5-rebuild @ 100.97.167.73` → cc-nix-test) + run E2E-TESTME (E1–E6).**

<details><summary>prior W4 CLAIMED</summary>

**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)

</details>

**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.

<details><summary>prior</summary>

**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
Acceptance to verify (cold): (1) byte-identical `nixos-rebuild build .#cc-ci` == `/run/current-system`
(`vh6vwxbl4qr9whzpwgjimhf9gn4329p8`) — **must init the submodule** (`git clone --recursive` /
`git submodule update --init`, bot creds) then build `--flake 'git+file://<clone>?submodules=1#cc-ci'`, else
`secrets/` is empty; (2) cert sops-decrypted from git to `/var/lib/ci-certs/live/` (symlinks → /run/secrets,
sha256 `c1d96d61…`/`9ec25d00…`) + live TLS served (`https://ci.commoninternet.net`); (3) no plaintext
secret in base repo or Nix store (all 8 secrets ENC in cc-ci-secrets; cert decrypts to tmpfs, not store).
See JOURNAL-1c 2026-05-27 W2a entry for full evidence.

</details>

## Definition of Done (C1–C7 — see phase plan §3)

- [x] C1 — Secrets-repo split (Adversary-PASS 16:55Z; re-exercised cold on blank host at C4)
- [x] C2 — Cert in git (Adversary-PASS 16:55Z; re-exercised at C4)
- [x] C3 — All secrets in git, one exception = bootstrap age key (Adversary-PASS 16:55Z; keyFile-on-throwaway at W4)
- [x] C4 — Genuine throwaway-VM live rebuild (Adversary-PASS W5 18:55Z, cold; rebuilt VM at cqym8knj)
- [x] C5 — Honest D8 (Adversary-PASS W5; static+live, "infeasible" superseded; narrow OAuth limitation signed off)
- [x] C6 — cc-nix-test 6→4 GB; first throwaway destroyed; final sizing = PROMOTE rebuilt VM (operator override, kept)
- [~] C7 — install.md/secrets.md/architecture.md + plan.md done; Adversary re-verify of architecture.md pending (ADV-1c-1, addressed 6276bfd)

## ✅ E2E-TESTME — PASS @2026-05-27 (functional acceptance of D8/clean-room)

Real `!testme` on the rebuilt-from-git VM (swapped in as cc-nix-test) over the PUBLIC domain:
**E1** public 200/ssl_verify=0; **E2** bridge→new Drone build #4 (>baseline #3, not manual); **E3**
app `cust-bdddd9.ci.commoninternet.net` EXTERNAL via gateway → HTTP/2 200, ssl_verify=0, real nginx
body, `CN=*.ci.commoninternet.net` cert; **E4** build #4 success, log shows real install/upgrade/backup
(Playwright incl.) all passed, no softening; **E5** clean undeploy (0 residual); **E6** bridge PR
comment "✅ passed →…/cc-ci/4" + dashboard custom-html/success/#4. Evidence: JOURNAL-1c. Caught+fixed
the Drone-bot-token reproducibility gap (af46aca) en route. **Adversary independently verifies E1–E6.**
Remaining: swap-back; re-deploy af46aca to cc-ci (byte-identical at new toplevel `cqym8knj…`).

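
The "200/ssl_verify=0" shorthand used for E1 and E3 can be reproduced with curl's write-out variables. The sketch below is an assumption about how such a probe might look, not the command the loops actually ran: `probe` and `passes` are hypothetical helper names, and the 20-second timeout is arbitrary.

```shell
#!/usr/bin/env bash
# probe URL
# Prints "<http_code> ssl_verify=<n>" for one HTTPS request made from outside the VM.
probe() {
  curl -sS -o /dev/null --max-time 20 \
       -w '%{http_code} ssl_verify=%{ssl_verify_result}\n' "$1"
}

# passes RESULT
# True only for HTTP 200 with a certificate chain that verified (ssl_verify=0).
passes() {
  [ "$1" = "200 ssl_verify=0" ]
}

# Assumed usage:
#   passes "$(probe https://ci.commoninternet.net/)" && echo "E1 ok"
```

Keeping the classification separate from the network call makes the pass rule easy to check in isolation: a Traefik 404 or a failed verification both fall outside the single accepted string.
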
## SWAP REVERTED (2026-05-27 ~20:00Z) — public back on the ORIGINAL cc-ci

E2E-TESTME passed; swapped back: `cc-nix-test` (MagicDNS) → `100.90.116.4` (original), public
`ci.commoninternet.net` → 200 ssl_verify=0 (original); original bridge restored 1/1, healthy. The
rebuilt VM `ccci-w5-rebuild` @ `100.97.167.73` is **kept running** (C6 override, operator promotes it)
with its **bridge paused** (`ccci-bridge_app` 0) to avoid dual-trigger on real PRs (operator restores
at promotion). Remaining: re-deploy af46aca (Drone-token fix, toplevel `cqym8knj…`) to the original cc-ci
→ re-verify byte-identical; Adversary re-checks C1 + verifies E1–E6.

<details><summary>swap-active history</summary>

Public gateway pointed at the rebuilt VM (`100.97.167.73`) during the e2e; original was cc-nix-test-orig.

</details>

**E2E progress (2026-05-27 ~19:45Z):** E1 PASS (public 200/ssl_verify=0). Original's bridge PAUSED
(`ccci-bridge_app` 1/0 on cc-nix-test-orig). Rebuilt VM Drone OAuth done (admin=true, cc-ci active) —
needed a script fix (auto-approve, committed ee585ef). **Clean-room finding (committed af46aca):**
`DRONE_USER_CREATE` lacked `token:` → rebuilt Drone's bot token ≠ sops `bridge_drone_token` → bridge
401. Fix injects the sops token. **NOT yet applied to the rebuilt VM** (a no-op rebuild ran with old
config first). **NEXT:** (1) git pull af46aca on rebuilt VM + `nixos-rebuild switch` (applies token);
(2) verify bot token == sops (else `docker volume rm` Drone DB + redeploy so DRONE_USER_CREATE recreates
the bot w/ token; then re-run OAuth bootstrap); (3) run `!testme` on custom-html#2 (head db9a9502) →
verify E2–E6; (4) swap-back; (5) re-deploy af46aca to cc-ci + re-verify byte-identical (Adversary re-checks C1).

**`ssh cc-ci` (pinned 100.90.116.4) = the ORIGINAL** (cc-nix-test-orig); reach the rebuilt VM via
`100.97.167.73` or `cc-nix-test` MagicDNS.

**SWAP-BACK when e2e done:** rebuilt VM → `tailscale set --hostname=ccci-w5-rebuild`; then
`ssh cc-ci 'tailscale set --hostname=cc-nix-test'`; restore original's bridge
(`docker service scale ccci-bridge_app=1` on the original — paused during e2e to avoid dual-trigger).
Keep both VMs running.

## ⚠️ Operator override — do NOT destroy the FINAL throwaway VM (read before W5/W6 cleanup)

The operator (2026-05-27) will **repurpose the final W5/C4-C5 clean-room throwaway VM as the new
cc-nix-test** for a live real-traffic test. So: **KEEP that VM running after W5 PASSes — do NOT tear
it down in C6/W6.** Defer its teardown until the operator explicitly says otherwise. This overrides the
plan's "destroy the throwaway" for that one VM. (Adversary: please do not destroy your W5 VM on PASS.)
This also settles C6 final sizing = **promote the rebuilt VM**. All other cleanup is normal (Builder's
first throwaway already destroyed). See DECISIONS.md Phase-1c.

### Pending functional-acceptance e2e — E2E-TESTME (BUILDER owns swap+test; gated on C4/C5 PASS)

**Authority: `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md`** (supersedes inline wording).
MY test to execute end-to-end (incl. the tailnet swap — **no orchestrator signal**); Adversary
independently verifies but does **NOT** rename nodes (actor/critic split — only ONE loop renames).
**Target VM = the ADVERSARY's kept-running W5 VM** (Incus instance `ccci-w5-rebuild`; mine was
destroyed). **WAIT for: (i) Adversary W5 PASS in REVIEW-1c, AND (ii) the Adversary records that VM's
Incus instance + CURRENT tailscale IP** in REVIEW-1c/STATUS — only then swap. Sequence:

1. **Swap (Builder, 2 reversible `tailscale set --hostname`, ORDER MATTERS):**

   (a) `ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'` — original aside, **keep running** (swap-back);
   `ssh cc-ci` (pinned IP 100.90.116.4) keeps hitting the ORIGINAL.

   (b) Adversary's W5 VM (`ccci-w5-rebuild`) → `cc-nix-test`, using the IP the Adversary recorded
   (re-confirm online via `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status`), then
   `ssh -i …/vm_ssh_key -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<ip> 'tailscale set --hostname=cc-nix-test'`.

   After swap, `cc-nix-test.taila4a0bf.ts.net` → that VM tailnet-wide (gateway auto-follows ~10s);
   target !testme/deploys by MagicDNS name, NOT raw IP (raw IP = original).
2. **Verify P1+P2:** `tailscale … status | grep cc-nix-test` → throwaway IP; `curl https://ci.commoninternet.net/` → `200 ssl_verify=0`.
3. **Run E2E-TESTME** (spec §2; E1–E6 below).
4. **Swap-back when done** (reversible): rebuilt VM → its old name, then
   `ssh cc-ci 'tailscale set --hostname=cc-nix-test'` (restores original; gateway re-follows).

Watch-out (handle at execution): the ORIGINAL (cc-nix-test-orig) stays up with its bridge polling
Gitea — to avoid duplicate builds/PR-comments, pause its bridge during the e2e
(`docker service scale ccci-bridge_app=0` on the original, restore after); and the rebuilt VM's Drone
needs the one-time OAuth bootstrap (install.md §2) before it can clone/build.

Then: `!testme` as the bot on one fast enrolled recipe (e.g. `custom-html`) and verify the real path.
Pass criteria (all): **E1** self-check 200/valid cert on rebuilt VM; **E2** new Drone build via the
bridge (run# > baseline, not a manual trigger); **E3** app answers an **EXTERNAL** request at
`<app>.ci.commoninternet.net` through the gateway (real 200 + valid cert + app content, NOT localhost,
NOT a Traefik 404); **E4** real test assertions pass, build success (no softening); **E5** clean
undeploy (no residual stack); **E6** result reported back + dashboard updated. Evidence → JOURNAL-1c,
verdict → STATUS-1c/REVIEW-1c as **E2E-TESTME PASS**. On failure: it's a clean-room finding — fix in
**git source** (base / cc-ci-secrets), NOT the live VM, then re-run.

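
The swap and swap-back ordering described in this section can be written down as a dry-run plan. This is a sketch under assumptions, not the script that was run: `run`, `swap_in`, `swap_back`, and `REBUILT_HOST` are hypothetical names; the ssh options for reaching the rebuilt VM (key file, proxy command) are left out and assumed to live in the caller's ssh config; and pausing the bridge right after the first rename is one possible placement, since the text only says to pause it during the e2e.

```shell
#!/usr/bin/env bash
# REBUILT_HOST is a placeholder for the rebuilt VM's recorded tailnet IP.
REBUILT_HOST=${REBUILT_HOST:-rebuilt-vm.invalid}

# run: print the command in dry-run mode (default), otherwise execute it.
# The printed form is for reading only; it drops shell quoting.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

swap_in() {
  # (a) Rename the original first so the name cc-nix-test becomes free.
  run ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'
  # Pause the original's bridge to avoid duplicate builds and PR comments.
  run ssh cc-ci 'docker service scale ccci-bridge_app=0'
  # (b) Only then give the rebuilt VM the public-facing name.
  run ssh "root@$REBUILT_HOST" 'tailscale set --hostname=cc-nix-test'
}

swap_back() {
  # Reverse order: release the name on the rebuilt VM, then restore the original.
  run ssh "root@$REBUILT_HOST" 'tailscale set --hostname=ccci-w5-rebuild'
  run ssh cc-ci 'tailscale set --hostname=cc-nix-test'
  run ssh cc-ci 'docker service scale ccci-bridge_app=1'
}
```

Both functions rely on `ssh cc-ci` being pinned to the original's IP, as noted above, so the rename of the original never depends on where the MagicDNS name currently points.
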
## Blocked

(none)

## Notes

- Current secret layout: `secrets/secrets.yaml` (6 infra secrets), recipients = host age key
  (ssh-to-age of cc-ci's ed25519 host key) + off-box master recovery key
  (`/srv/cc-ci/.sops/master-age.txt`, sandbox-only). `.sops.yaml` at repo root.
- Wildcard cert currently out-of-band at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
  (operator-provided, LE, next renewal ~2026-08-24); proxy.nix reads it from there. 1c moves it
  into sops-in-git, decrypted back to that path at activation.
- Sandbox host has NO sops/nix/age — sops ops run on cc-ci (has nix + host age key) or via the master
  key with a sops binary fetched on cc-ci.
- cc-nix-test == the live cc-ci server (100.90.116.4); resizing it (W1) briefly stops it.