# REVIEW-canon — Adversary verdicts for the `canon` (canonical-sweep) phase

SSOT for what is being verified: `/srv/cc-ci/cc-ci-plan/plan-phase-canon-canonical-sweep.md`.
Gates: **M1** (machinery works locally, each piece proven) and **M2** (proven end-to-end in real CI),
plus the operator-required **samever-orthogonality** proof. `## DONE` only after fresh PASS on both.

---

## Orientation @ 2026-06-17T06:18Z — Adversary online for canon phase; no gate claimed yet

Prior phase `samever` is DONE + Adversary-verified (M1 1310a95, M2 199f5b6, no VETO). The `canon`
phase has **not** been bootstrapped by the Builder yet: no STATUS-canon.md / BACKLOG-canon.md, no
`claim(`/`status(canon` commits, no inbox. I am idling per liveness protocol and will verify promptly
when M1 is CLAIMED (watchdog will ping on the claim).

### Independent COLD baseline of the claimed starting state (§1) — captured before any canon work

Verified from my own clone + a cold `ssh cc-ci`, NOT from the Builder:

- **Enrollment:** exactly **one** recipe sets `WARM_CANONICAL = True` → `custom-html`. (`grep -rl
  'WARM_CANONICAL *= *True' tests/*/recipe_meta.py` → 1 hit.) Matches §1 "only custom-html enrolled".
- **canonical.json records on cc-ci:** exactly **one**, for `custom-html`:
  `/var/lib/ci-warm/custom-html/canonical.json` =
  `{recipe: custom-html, version: 1.13.0+1.31.1, commit: 2b82ebabde74a9d9b1fd4cb49722a7037b18a176,
  status: idle, ts: 20260617T050314Z}`, retained volume `warm-custom-html_..._content` present.
- **NOTE — plan §1 is now slightly stale.** The plan (authored 04:43Z) says "ZERO canonical.json
  records exist." That was true at authoring, but the just-completed **samever M2** e2e
  (custom-html two-run) wrote this record at **05:03:14Z**. So there is now exactly one canonical,
  produced by samever's promote path. This is *favorable* evidence for canon M1(A) — the promote
  path already demonstrably writes a real, reusable record + retains the volume for custom-html —
  but the Builder must NOT cite custom-html's pre-existing canonical as proof of canon's *new*
  work (tagged-gate, trigger, all-enrolled, mirror-sync). I will require fresh, canon-attributable
  evidence for each M1/M2 sub-claim.
- **Timer:** `nightly-sweep.timer` enabled+active, daily `OnCalendar` (NEXT 2026-06-18 03:00:24 UTC),
  last fired 2026-06-17 03:09:20 UTC exit 0. So the timer plumbing works; the job was a near-no-op
  (only custom-html enrolled). Phase must (F) move this to **weekly** and (M2) prove a real fire
  advances canonicals, not exit-0 on an empty set.

### What I will adversarially probe when claimed (from the plan, not the Builder's narrative)
- M1(A): a canon-attributable green cold run writes canonical.json AND `--quick` warm-reattach reuses
  it; promote now ALSO requires a **release tag** — feed an UNTAGGED state, confirm NO promote.
- M1(C): mirror-sync is *faithful upstream sync only* — never pushes our changes to mirror `main`,
  never disturbs unrelated PRs. Will diff before/after on a mirror.
- M1(D): trigger keyed on **latest release tag vs canonical version**, NOT commit — new untagged
  commits on `main` with same tag ⇒ SKIP; newer tag ⇒ run cold on that tag.
- M1(B): all ~21 recipes enrolled; warm-volume disk budget recorded (not silently dropped).
- M2: full sweep promotes greens / leaves reds intact / skips unchanged; **run-twice ⇒ skip-all**
  determinism; real (non-hollow) timer fire; tagged-promote proof (untagged green ⇒ no promote).
- samever orthogonality: (a) no-new-tag ⇒ SKIPPED; (b) new-tag ⇒ canonical(older)→new, real delta,
  promote; step-back NEVER fires in the sweep. Construct scenarios if the live set doesn't cover both.
- §2.G: if plausible's canonical lands at 3.0.1, `UPGRADE_BASE_VERSION` retired cleanly (key +
  resolver branch + docs + tests) AND plausible still resolves base 3.0.1 dynamically + passes — else
  kept with a recorded DECISIONS reason. Will re-derive, not trust.
- Guardrail: NO AI at runtime (pure script + timer).

## Pre-claim code read @ 2026-06-17T06:41Z — M1 still IN PROGRESS (M1.2 not yet committed)

Builder has landed 4 of 5 M1 items (27e0628 M1.1, 136100f M1.3, f8c0e53 M1.4+M1.5). M1.2 (the
release-tag trigger `sweep_decision` + mirror-sync wiring into `nightly_sweep.sweep()`) is **not yet
committed** — M1 is correctly not-yet-claimed. Read the landed code (NOT JOURNAL); points to scrutinize
when claimed:
- **M1.1 (27e0628):** `should_promote_canonical` gained `tagged` param; caller computes
  `tagged = warm_reconcile.is_released_version(recipe, head_version)`. ⚠️ PROBE: the gate checks
  `head_version` (code under test) but `promote_canonical` records `latest_version(recipe_tags(recipe))`
  (newest tag). Confirm these can't diverge — e.g. a manual latest run where `main` sits on a tagged
  commit OLDER than `latest` tag would gate on the older tag yet promote the newer. In the sweep path
  (D) the tag is checked out so head==tag; verify the manual/`RECIPE=<r>` path too.
- **M1.4 (f8c0e53):** root cause = sweep service ran the nix-STORE runner copy (no `tests/`) so
  `TESTS_DIR` missing → `enrolled_recipes()=[]`. Fix sets `CCCI_REPO=/etc/cc-ci` + `cd` + execs
  `$CCCI_REPO/runner/nightly_sweep.py`. ⚠️ PROBE at M2: confirm `/etc/cc-ci` actually exists on cc-ci,
  has runner/ AND tests/, and is git-pulled before nixos-rebuild (else still hollow). The fix also
  means sweep-logic ships via checkout pull, NOT a store rebuild — verify deploy procedure pulls it.
- **M1.5 (f8c0e53):** `OnCalendar` daily → `Sun *-*-* 03:00:00`, Persistent kept. Trivial; verify the
  deployed timer shows the weekly schedule after M2.1 nixos-rebuild.
- **M1.3 (136100f):** enroll all 21 — verify the count is exactly the `used-recipes.md` set and that
  fixtures (custom-html-*-bad, concurrency, regression) were NOT enrolled.
- **Still owed for M1 claim:** M1.2 `sweep_decision(recipe, latest_tag, canon_version)` →
  run|skip:no-new-version|skip:never-released keyed on `version_key` NOT commit; mirror-sync via
  `open-recipe-pr.sh --reconcile-only` (faithful, vendored); cold-run ON THE TAG. Unit tests for all.

---

## M1: PASS @ 2026-06-17T07:12Z — machinery cold-verified (claim 626badd, code @ d4cc9e4)

Verified from a COLD start: my own clone for code/pure-logic, a fresh independent clone on cc-ci
(`/tmp/adv-canon` @ 626badd) for the unit suite, and a cold `ssh cc-ci` for live state. I did NOT
read JOURNAL-canon.md before forming this verdict. Every M1 sub-claim re-derived against the plan,
not the Builder's narrative.

**M1.1 tagged-promote gate (§2.A) — PASS.**
- Code: `should_promote_canonical` returns `is_enrolled and overall==0 and not quick and not ref and
  tagged`; caller computes `tagged = is_released_version(recipe, head_version)`; `promote_canonical`
  now records the TESTED `head_version` (commit d4cc9e4), not a re-derived `latest_version`. My prior
  PROBE (head_version-vs-latest_version divergence on a manual `RECIPE=<r>` run) is CLOSED by d4cc9e4
  — read the diff, it promotes exactly the tested version.
- Unit: ran `tests/unit/test_promote.py` myself in the fresh cc-ci clone — all 6 pass, each gate
  clause individually exercised (`test_no_promote_when_untagged` asserts `tagged=False → False`;
  all-conditions asserts `tagged=True → True`). Not hollow.
- Live PROMOTE: re-derived `git rev-list -n1 1.13.0+1.31.1` = `df2e27339f983a25da548fc8b8d56e9af8645f83`
  and `/var/lib/ci-warm/custom-html/canonical.json` records EXACTLY that commit + version
  `1.13.0+1.31.1`, status idle, retained volume `warm-custom-html_..._content` present. So the promote
  recorded the tag's own commit (correcting samever's earlier `2b82eba` merge-commit record) — the
  divergence fix is live-proven, not just unit-tested.
- Live UNTAGGED → NO PROMOTE: independently confirmed `1.13.1+1.31.1` is `NOT-A-TAG` in the custom-html
  clone → `is_released_version` returns False → gate blocks. canonical.json is unchanged (still
  df2e273). The full live tagged-vs-untagged e2e is M2.4; at M1 the code + unit + live-not-a-tag +
  unchanged-canonical chain is sufficient.

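A minimal sketch of the tagged-promote gate as described in the M1.1 bullets, for readers who want the clause list in executable form. The boolean expression is the one quoted above; the parameter order, the type hints, and the use of `None` for "no REF" are assumptions, not the Builder's actual signature.

```python
from typing import Optional


def should_promote_canonical(
    is_enrolled: bool,
    overall: int,
    quick: bool,
    ref: Optional[str],
    tagged: bool,
) -> bool:
    """Return True only for an enrolled, green, cold, non-REF run of a released version.

    Parameter names mirror the clauses quoted in the review; the ordering is hypothetical.
    """
    return bool(is_enrolled and overall == 0 and not quick and not ref and tagged)


# Baseline where every clause is satisfied; flipping any single clause must block the promote.
BASELINE = dict(is_enrolled=True, overall=0, quick=False, ref=None, tagged=True)

if __name__ == "__main__":
    print(should_promote_canonical(**BASELINE))
    print(should_promote_canonical(**{**BASELINE, "tagged": False}))
```

Flipping one clause at a time against the baseline is the same shape of check the unit suite is credited with above: each clause is exercised on its own rather than only in the all-conditions case.
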
**M1.2 release-tag trigger + faithful mirror-sync (§2.C/§2.D) — PASS.**
- `sweep_decision` re-derived directly (no pytest) — truth table exactly right and VERSION-keyed, not
  commit-keyed: new>canon→run; equal→skip no-new-version; older→skip; no tag→skip never-released; no
  canon→run(seed). The function takes only (latest_tag, canon_version) — it CANNOT see commits, so new
  untagged commits on `main` can never trigger a run. That IS the operator's refinement.
- `scripts/recipe-mirror-sync.sh` read in full: pins an explicit coopcloud `upstream` remote,
  force-syncs mirror `main := upstream/main` + all tags, pushes NOTHING of our own. PR close is gated on
  `git merge-tree --write-tree NEW_MAIN_SHA <pr-head>` == upstream `MAIN_TREE` (i.e. the PR's merge is
  a no-op because it's already in upstream) → close; otherwise "left as-is". Faithful, never merges,
  never disturbs unrelated PRs.
- `nightly_sweep.sweep()` wiring read: per enrolled recipe `mirror_sync → fetch_recipe →
  sweep_decision → run_on_tag` (checkout the release tag + `CCCI_SKIP_FETCH=1` so head IS the tag →
  tagged-gate passes; REF popped → cold → promote allowed). Pure script.

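The truth table in the first M1.2 bullet can be sketched as below. This is an illustration under stated assumptions, not the repository's code: the two-argument signature follows the review's remark that the function sees only `(latest_tag, canon_version)`; the `version_key` here is a hypothetical stand-in that compares the dotted-integer part before `+`; and mapping the "older" case to `skip:no-new-version` is an assumption, since the review only says it skips.

```python
from typing import Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Hypothetical ordering key: the dotted-integer recipe part before '+'.

    The real `version_key` may order versions differently.
    """
    recipe_part = version.split("+", 1)[0]
    return tuple(int(piece) for piece in recipe_part.split("."))


def sweep_decision(latest_tag: Optional[str], canon_version: Optional[str]) -> str:
    """Decide run/skip from versions only; commits are deliberately not an input."""
    if not latest_tag:
        return "skip:never-released"  # recipe has no release tag at all
    if not canon_version:
        return "run"  # no canonical yet: seed run
    if version_key(latest_tag) > version_key(canon_version):
        return "run"  # newer tagged release than the known-good canonical
    return "skip:no-new-version"  # equal or older tag (older case: assumed label)


if __name__ == "__main__":
    print(sweep_decision("3.6.0+1.24.2-rootless", "3.5.3+1.24.2-rootless"))
    print(sweep_decision("1.13.0+1.31.1", "1.13.0+1.31.1"))
```

Because no commit is passed in, new untagged commits on `main` cannot change the result, which is the property the review calls the operator's refinement.
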
**M1.3 all recipes enrolled (§2.B) — PASS.** My `grep -rl 'WARM_CANONICAL = True'` set is EXACTLY the
21 `used-recipes.md` rows (incl. `uptime-kuma`, the lone `external` row — correctly enrolled for
CI/canonical even though excluded from weekly upgrade). Fixtures (`custom-html-*-bad`, `concurrency`,
`regression`) NOT enrolled.

**M1.4 hollow-sweep fix — PASS (code; live is M2.1).** `nix/modules/nightly-sweep.nix` exports
`CCCI_REPO=/etc/cc-ci`, `cd`s there, and execs `$CCCI_REPO/runner/nightly_sweep.py` — the checkout WITH
`tests/`, replacing the store copy whose missing `tests/` caused `enrolled_recipes()=[]`. Root cause
correctly addressed in code. ⚠️ CARRIED TO M2: `/etc/cc-ci` is currently STALE — `git -C /etc/cc-ci`
HEAD is `e60415d` (Phase-3 era), canon code NOT yet there. M2.1 deploy MUST `git -C /etc/cc-ci pull`
before `nixos-rebuild`, else the deployed timer stays hollow. I will verify the pull + a real fire at
M2.5.

**M1.5 weekly timer (§2.F) — PASS (code).** `OnCalendar = "Sun *-*-* 03:00:00"`, `Persistent = true`.
Deployed-timer schedule verified at M2.

**Guardrail NO-AI-at-runtime — PASS.** grep of `nightly_sweep.py` / `warm_reconcile.py` /
`recipe-mirror-sync.sh` for anthropic|claude|openai|llm|gpt|ai_ → only one code COMMENT match, zero
calls. Pure script + systemd timer.

**Full unit suite — PASS.** Ran `cc-ci-run -m pytest tests/unit/` in the fresh independent cc-ci clone
@ 626badd → **295 passed in 5.60s**, matching the claim. Enrolling 21 recipes broke nothing.

**Minor narrative note (not a defect):** the claim cites proof-A ts `065027Z` but live canonical ts is
`065532Z`; promoting the same tag again yields the same version+commit (only ts moves), so this is a
benign re-run, not a divergence — the recorded version/commit are correct either way.

**Verdict: M1 PASS.** No VETO. All M1 DoD items cold-verified; the deployed-state items (M1.4 live,
M1.5 timer schedule) are honestly scoped by the Builder to M2 and I will hold them there. (Consulted
JOURNAL-canon.md only AFTER writing this verdict: no surprises — confirms the proof-A/C sequence.)

---

## Pre-claim observation @ 2026-06-17T07:23Z — M2.1 deploy verified live (NOT a gate verdict)

Builder inbox: M1 PASS consumed; M2.1 deploy done; M2.2 full sweep started (long, serial, hours).
M2 NOT yet claimed — no formal verdict here, just an opportunistic READ-ONLY check that resolves my
two carried-to-M2 code-only probes (favorable; I'll still re-verify the live proofs at the M2 claim):
- **/etc/cc-ci now at `3bdd5d1`** (current main; was stale `e60415d` Phase-3 era), with `tests/` +
  `runner/nightly_sweep.py` present → the deploy DID `git -C /etc/cc-ci pull`. My M1.4 "deploy must
  pull or stays hollow" risk is cleared.
- **Deployed timer:** `systemctl cat nightly-sweep.timer` → `OnCalendar=Sun *-*-* 03:00:00`,
  `Persistent=true` (weekly, live). M1.5 deployed-schedule probe cleared.
- **Deployed code path is the non-hollow one:** the in-flight sweep (PID 1620630) runs
  `nightly_sweep.sweep()` from `/etc/cc-ci/runner`, and `run_recipe_ci.py` runs from
  `/etc/cc-ci/runner/` — i.e. the checkout WITH `tests/`, not the store copy. Root cause fixed live.

STILL OWED at the M2 claim (I will cold-verify, not trust the sweep log): canonicals actually promoted
for greens / reds left intact / no-new-tag skipped (M2.2); run-twice→skip-all (M2.3); live
tagged-vs-untagged (M2.4); real timer fire advances canonicals via full main() incl. roll (M2.5);
samever never fires in-sweep (M2.6); disk budget recorded (M2.7); §2.G UPGRADE_BASE_VERSION
retirement (M2.8). Staying read-only while the sweep is in flight (single node).

---

## Pre-claim finding @ 2026-06-17T08:40Z — M2.2 sweep: PASS-labelled but promotes mostly FAILING (evidence captured)

NOT a verdict (M2 unclaimed). Read-only capture from `/root/canon-verify/_sweep.log` so the evidence
survives log growth. Per-recipe promote outcomes observed (alphabetical sweep, ~7 recipes deep):
- bluesky-pds: cold rc=0; `WC5 promote failed: abra app deploy warm-bluesky-pds… failed (1)` → NO canonical; logged `PASS (promoted)`.
- cryptpad: cold rc=0; `canonical cryptpad advanced to known-good 0.6.0+v2026.5.1` → canonical WRITTEN. ✓ (the only real promote so far)
- custom-html: SKIP no-new-version (pre-existing canonical). ✓ expected.
- custom-html-tiny: cold rc=0; `WC5 promote failed: warm-custom-html-tiny… not healthy over HTTPS / (404)` → NO canonical; logged `PASS (promoted)`.
- discourse: cold rc=142 (deploy timeout — the 51m wedge I flagged) → `FAIL (canonical unchanged)`. Legit red.
- drone: cold rc=0; `WC5 promote failed: …warm-drone… timed out after 600 seconds` → NO canonical; logged `PASS (promoted)`.
- ghost: cold rc=0; `WC5 promote failed: abra app new ghost… failed (1)` → NO canonical; logged `PASS (promoted)`.
- gitea: promote in progress at capture.

Live `/var/lib/ci-warm/*/canonical.json` = {cryptpad, custom-html} only. NET NEW this sweep = 1 (cryptpad).
Leftover warm volumes w/ NO registry record: drone, gitea, custom-html-tiny (partial-promote residue).

**DEFECT-1 [adversary] (results-label):** `nightly_sweep.sweep()` line ~119 sets
`results[r] = "PASS (promoted)" if rc==0 else "FAIL …"`. Because `promote_canonical` is non-fatal
(swallows its own exception so it "never fails a green run"), a FAILED promote still yields rc=0 →
the summary asserts "PASS (promoted)" when NO canonical was written. The per-recipe results log — the
DoD's evidence that "canonicals actually promoted for the green recipes" — is therefore UNTRUSTWORTHY.
Repro: `grep "WC5 promote failed" _sweep.log` vs `grep "PASS (promoted)" _sweep.log` — failed promotes
appear in BOTH. Fix direction: label from "does a canonical record now exist at the tested version",
not from rc.

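The fix direction for DEFECT-1 can be sketched as follows: derive the label from what the registry holds after the run, and use `rc` only to separate red from green. This is a hedged illustration, not the Builder's patch. The function name `sweep_label`, its parameters, and the exact label strings are hypothetical.

```python
from typing import Optional


def sweep_label(rc: int, canonical_version: Optional[str], tested_version: str) -> str:
    """Label a recipe's sweep outcome from registry state rather than from rc alone.

    canonical_version is whatever the registry records AFTER the run (None if absent).
    All names and label strings here are illustrative.
    """
    if rc != 0:
        return "FAIL (red; canonical unchanged)"
    if canonical_version == tested_version:
        return f"PASS (promoted {tested_version})"
    # Green cold run, but the registry does not hold the tested version: promote failed.
    recorded = canonical_version or "none"
    return f"GREEN-BUT-PROMOTE-FAILED (canonical={recorded}, expected {tested_version})"


if __name__ == "__main__":
    # A green run whose promote was swallowed no longer reads as a promotion.
    print(sweep_label(0, None, "0.3.0+v0.4.219"))
    print(sweep_label(0, "0.6.0+v2026.5.1", "0.6.0+v2026.5.1"))
```

With this shape a swallowed promote exception cannot produce a promoted label, because the label depends on the record that was or was not written.
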
**DEFECT-2 [adversary] (promote path failing broadly):** 4 of 5 completed promotes FAILED across 4
modes (warm `app deploy` failed(1) / timed-out 600s / unhealthy-404 / `app new` failed(1)). Cold CI is
green for each, so this is specifically the WARM-CANONICAL promote deploy failing — the exact
end-to-end step this phase exists to make real. Root cause TBD (node contention on the long serial
run / unclean cold-test teardown / discourse residue / flat 600s warm timeout) — Builder's to diagnose.

**Determinism risk (M2.3):** every recipe left without a canonical (bluesky-pds, custom-html-tiny,
drone, ghost, discourse…) will `sweep_decision(latest, None) → run` on a second sweep, NOT skip — so
run-twice ≠ skip-all until promotes actually succeed. I will hard-test this at the M2 claim.

Sent the Builder a BUILDER-INBOX heads-up (ba28a88). When M2 is claimed I will cold-verify, per recipe,
that a canonical record exists at the tested tag version (not trust the PASS label), and re-run the
determinism no-op myself. If promotes are still failing / mislabelled, M2 FAILs.

## Pre-claim note @ 2026-06-17T09:11Z — fix f94de22 validated by Builder; M2 re-run in flight (NOT a verdict)

Consumed ADVERSARY-INBOX (Builder ~09:10Z): DEFECT-1/DEFECT-2 fix validated live — custom-html-tiny
PROMOTED (1.2.0+2.43.0, was 404) and ghost PROMOTED (1.4.0+6.45.0-alpine, was app-new dirty-tree FATA);
label now derives from "canonical record exists at tested version". 7 canonicals claimed (cryptpad,
custom-html, custom-html-tiny, ghost, gitea, hedgedoc, immich). Full sweep re-run in flight. M2 unclaimed.
Staying read-only off the node (sweep in flight, single node).

**bluesky-pds "documented RED" — must scrutinise at M2 claim, two ways it could be wrong:**
1. The conservative direction is CORRECT per guardrail (no force-promote; prior known-good kept). But I
   must confirm bluesky has NO stale/partial canonical written, and that it is recorded as an exception
   in DECISIONS (plan §2.B: "don't silently skip" / §4 "documented exception"), not just left silent.
2. **The real risk:** Builder says warm health fails because traefik doesn't route the WARM domain
   (`warm-bluesky-pds…` → 000) though internal localhost:3000 = 200, and "cold domain worked." I must
   verify this is genuinely bluesky-SPECIFIC and not a warm-canonical-deploy machinery defect (warm
   domain label/overlay/router rule) that could equally hit other recipes — if the warm-domain routing
   is systemically flaky, a recipe could intermittently fail to promote (or, worse, a health probe could
   pass spuriously). At claim I will: (a) confirm OTHER promoted recipes (custom-html-tiny, ghost, immich)
   actually answered 200 over HTTPS on THEIR warm domains during promote (grep ready-probe lines), and
   (b) independently curl a couple of the live warm canonical domains. If warm-domain routing is broadly
   unreliable, the promote evidence is suspect and M2 is not done.

## Pre-claim observation @ 2026-06-17T09:34Z — read-only sweep-progress peek (NOT a verdict)

Sweep re-run still in flight (proc 1712141 from `/etc/cc-ci/runner`); 7 canonicals on disk. Captured
from `_sweep.log` so it survives log growth:
- **DEFECT-1 fix is LIVE and honest:** `sweep: bluesky-pds rc=0 (GREEN-BUT-PROMOTE-FAILED
  (canonical=none, expected 0.3.0+v0.4.219))` — the label no longer claims `PASS (promoted)` on a
  failed promote. Favorable; I will still confirm the label matches the on-disk registry per recipe at
  claim before closing DEFECT-1.
- `cryptpad / custom-html / custom-html-tiny` → `SKIP no-new-version` (latest tag == canonical). The
  skip path works for promoted recipes.
- `discourse rc=143 → FAIL (red; canonical unchanged)` — legit red (timeout/SIGTERM), canonical kept.
- **NEW — `sweep: mirror-sync drone rc=128 (non-fatal — continuing)`:** drone's faithful mirror-sync
  FAILED (git rc=128) yet the sweep proceeded to RUN drone against the un-synced mirror. SCRUTINISE at
  claim: plan §2.C requires the mirror be reconciled to upstream FIRST; a swallowed sync failure means
  the recipe may be tested against a stale mirror (wrong tags/version) — the trigger (D) and tagged
  promote then rest on un-synced state. Is rc=128 a benign "already up to date / no upstream" case or a
  real sync failure? Must check what drone's sync hit and whether the tested tag is genuinely upstream's.
- **DETERMINISM (M2.3) — central risk crystallising:** bluesky-pds (promote-failed) and discourse (red)
  both end `canonical=none`, so a 2nd sweep → `sweep_decision(latest, None) → RUN`, NOT skip. Plan M2.3
  literally requires run-twice → "SKIPS every recipe." That can hold ONLY if every enrolled recipe
  actually promoted. Red/promote-failed recipes legitimately re-run (no known-good to protect) — which
  is arguably correct behaviour but is NOT "skip every recipe." At the M2 claim I will require the
  Builder's determinism evidence to honestly reconcile this with §3/§5: either (i) every recipe promotes
  so run-twice is a true no-op, or (ii) a reasoned, plan-consistent argument that the no-op property
  applies to the promoted set and red recipes correctly retry — and I'll judge it against the plan, not
  accept a partial skip-all relabelled as success.

## Pre-claim observation @ 2026-06-17T10:20Z — TWO concurrent sweeps (transient process state, captured)

Read-only `ps` on cc-ci caught a non-serial condition while M2 is mid-development (NOT a verdict; M2
unclaimed):
- PID **1712141** = OLD sweep (started 09:10:40, code f94de22) — WEDGED: child PID 1720589
  (`run_recipe_ci.py`, started 09:33:58, alive ~46 min) is the drone cold-dep self-deadlock the
  lock-release fix (655a999) addresses. The old sweep process is still ALIVE, holding cold-test locks.
- PID **1736506** = NEW sweep (started 10:16:27, code 655a999), already cold-testing recipe 1.

So at 10:20Z two `nightly_sweep.sweep()` ran simultaneously. This violates §4 SERIAL and, more
pointedly, **invalidates the documented precondition of `release_app_locks()`** ("serial sweep → no
concurrent run relies on these locks") — the wedged old run still holds drone/gitea locks, so the two
can collide. **Any M2 promote/determinism/log evidence from a sweep that overlapped the wedged one is
non-serial and I will not accept it.** Canonical count is 8 (drone now promoted → lock-release fix
works), so the fix itself is good; the issue is the leftover concurrent process. Sent BUILDER-INBOX
asking the Builder to kill the wedged old sweep, confirm a clean single serial run, and regenerate M2
evidence. **SCRUTINY CARRIED TO CLAIM:** confirm the claimed M2 sweep ran with exactly ONE sweep
process and no overlap (check run start time vs old-sweep kill time); and verify `release_app_locks()`
cannot free a lock still guarding a live app under any interleaving the in-flight guard permits.

**Update @ 10:24Z:** Builder consumed the alert and acted correctly — SIGKILLed both sweeps + the
wedged drone child, cleared stale `/run/lock/cc-ci-app-*.lock`, confirmed no leftover warm-*/dep stacks,
**discarded drone's concurrency-tainted canonical** (promoted by a standalone validation at 10:06:45
that overlapped the wedged old sweep), kept the 7 single-run canonicals, and relaunched ONE clean serial
sweep (pid 1741209, code 655a999) as the M2.2 evidence run. Concurrency window was ~10:06–10:24 (old
sweep 1712141 alive 09:10→killed 10:24). **CARRIED TO CLAIM:** independently confirm each of the 7 kept
canonicals (cryptpad, custom-html, custom-html-tiny, ghost, gitea, hedgedoc, immich) has a ts OUTSIDE
the concurrency window and was produced single-run — do NOT take the Builder's accounting on faith;
check `canonical.json` ts per recipe vs the 09:10–10:24 overlap. And confirm the claimed sweep (1741209)
ran start→finish with no second sweep process alive.

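The per-recipe timestamp check carried to claim can be sketched as below. It assumes the `ts` field keeps the compact UTC form seen in the registry records quoted earlier (for example `20260617T050314Z`). The helper names, the window bounds written in that same format, and the sample timestamps are illustrative, not values read from the node.

```python
from datetime import datetime, timezone

# Compact UTC timestamp layout, as seen in the canonical.json records quoted in this review.
TS_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_ts(ts: str) -> datetime:
    """Parse a compact registry timestamp into an aware UTC datetime."""
    return datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)


def outside_window(ts: str, window_start: str, window_end: str) -> bool:
    """True if ts falls strictly outside the closed [window_start, window_end] interval."""
    moment = parse_ts(ts)
    return not (parse_ts(window_start) <= moment <= parse_ts(window_end))


if __name__ == "__main__":
    # Hypothetical bounds for the 09:10 to 10:24 overlap, and two sample timestamps.
    start, end = "20260617T091000Z", "20260617T102400Z"
    for sample in ("20260617T083930Z", "20260617T100645Z"):
        print(sample, "outside" if outside_window(sample, start, end) else "INSIDE")
```

A timestamp outside the window is necessary but not sufficient: it shows the record was not written during the overlap, while the single-process claim still has to be checked against the process list.
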
## Pre-claim observation @ 2026-06-17T10:47Z — clean serial sweep progress (NOT a verdict)

ONE sweep proc confirmed (serial intact). Transient `_sweep.log` lines captured before rotation:
- **CONCERN — `drone rc=0 GREEN-BUT-PROMOTE-FAILED (canonical=none, expected 1.9.0+2.26.0)` in the
  CLEAN serial run.** Drone promoted under the discarded tainted validation but FAILS to promote
  clean-serial — and it no longer hangs (returns cleanly), so the lock-release fix (655a999) cured the
  46-min deadlock but drone's warm promote still fails for a DIFFERENT reason (likely warm gitea-dep
  provisioning or warm deploy/health). Net: the lock fix is necessary-but-not-sufficient for drone;
  drone will lack a canonical → hits both promote-evidence and determinism (run-twice) at the claim.
  Builder will see it in their own running log; theirs to diagnose. I'll require drone to either promote
  clean or be a recorded DECISIONS exception (like bluesky) at claim — a silent no-canonical is not OK.
- **FAVORABLE — `gitea RUN — new release 3.6.0+1.24.2-rootless > canonical 3.5.3+1.24.2-rootless;
  cold-testing tagged release 3.6.0…`** — a LIVE instance of the new-release-tag trigger advancing an
  existing canonical (older→newer TAGGED), i.e. exactly the M2.6 samever-orthogonality path (2):
  canonical(older)→new tagged, real delta, promote-if-green. If gitea promotes to 3.6.0 this is strong
  M2.6 evidence (no constructed scenario needed). VERIFY AT CLAIM: gitea's canonical advances 3.5.3→3.6.0
  with the new tag's own commit, and samever's same-version step-back NEVER fired in the run (the tag
  trigger guarantees vX→vY, Y>X, so no vX→vX). Watch that gitea actually promotes (not GREEN-BUT-FAILED).
- SKIPs (cryptpad/custom-html/custom-html-tiny/ghost = no-new-version) and discourse rc=143 red:
  consistent with prior runs.

## Pre-claim note @ 2026-06-17T10:59Z — two more Builder fixes; M2-evidence-sweep recency criterion

Builder landed ca89d44 (promote clears stale warm-stack on FRESH SEED only — fixes the failed-promote
secret residue, e.g. drone's gitea `client_secret_v1` blocking `abra app secret insert` on retry;
correctly does NOT teardown when a canonical exists → retained volume safe) and d072d7e (de-enroll
keycloak — structural collision with the live-warm OIDC provider on `warm-keycloak.ci...`; thorough
DECISIONS entry; enrolled now 20 + 1 documented exception). Both reasonable. The residue fix is the
likely root cause of the clean-serial drone promote-fail I flagged.
**M2-EVIDENCE RECENCY CRITERION (new, checkable):** the in-flight sweep pid 1741209 launched ~10:16 —
BEFORE ca89d44 (10:51) and d072d7e (10:54) — so its parent-process enrolled set still includes keycloak
and its sweep logic predates the residue fix (only per-recipe run_recipe_ci.py picks up new code if
/etc/cc-ci is pulled mid-run; nightly_sweep.sweep()'s enrolled list + decisioning is fixed at launch).
Therefore the authoritative M2.2 sweep I accept MUST be one launched with /etc/cc-ci at a HEAD that
contains BOTH fixes, enrolled=20 (keycloak absent), single serial proc. At claim: check the evidence
sweep's launch time vs these commit times, and confirm drone now PROMOTES (residue fix) or is a recorded
exception. Also verify ca89d44's fresh-seed teardown can't nuke a shared/retained volume (guarded by
`if not read_registry(recipe)` — only when no canonical exists, so nothing known-good to lose; confirm).

## Pre-claim verification @ 2026-06-17T11:12Z — fresh-seed-teardown × live-keycloak footgun: MITIGATED

Identified a real footgun in ca89d44: the fresh-seed branch does `teardown_app(canonical_domain(recipe))`
for any enrolled recipe lacking a canonical. For keycloak, `canonical_domain` == the LIVE shared OIDC
provider domain `warm-keycloak.ci...` — so a fresh-seed keycloak promote would have TORN DOWN the live
provider that lasuite-*/drone depend on. The de-enroll (d072d7e) is precisely what prevents this.
INDEPENDENTLY VERIFIED (read-only, my own checks, not Builder's word):
- At HEAD: `tests/keycloak/recipe_meta.py` → `WARM_CANONICAL = False`; `canonical.enrolled_recipes()` =
  **20, keycloak NOT in set** → the post-fix sweep never runs the fresh-seed teardown against keycloak.
- Live `https://warm-keycloak.ci.commoninternet.net/realms/master` → **200**; services
  `warm-keycloak_..._app` + `_db` both **1/1** → the pre-fix sweep 1741209's keycloak promote attempt
  (old promote, no teardown) did NOT disrupt the live provider. Healthy.

Conclusion: footgun is structurally mitigated AND live-confirmed unharmed — favorable. STILL CARRY TO
CLAIM: confirm NO OTHER enrolled recipe's `canonical_domain` collides with a live/shared service (so the
fresh-seed teardown only ever hits a disposable warm-<recipe> stack), and that the final sweep's keycloak
absence holds at the sweep's launch HEAD.

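The collision check carried to claim could be approached along these lines. This is a sketch under loud assumptions: `read_registry` and `canonical_domain` are passed in as plain callables standing in for the real helpers, the function name `fresh_seed_collisions` is hypothetical, and the domains in the example are made up for illustration.

```python
from typing import Callable, Iterable, List, Optional, Set


def fresh_seed_collisions(
    enrolled: Iterable[str],
    read_registry: Callable[[str], Optional[dict]],
    canonical_domain: Callable[[str], str],
    shared_domains: Set[str],
) -> List[str]:
    """Recipes whose fresh-seed teardown would target a live/shared domain.

    A recipe is at risk only when it has no canonical record (the fresh-seed
    branch) AND its canonical domain is one that other services rely on.
    """
    return [
        recipe
        for recipe in enrolled
        if not read_registry(recipe) and canonical_domain(recipe) in shared_domains
    ]


if __name__ == "__main__":
    # Illustrative data only: made-up domains, a two-entry registry.
    registry = {"custom-html": {"version": "1.13.0+1.31.1"}, "drone": None}
    domain_of = lambda recipe: f"warm-{recipe}.example.test"
    shared = {"warm-keycloak.example.test"}
    print(fresh_seed_collisions(["custom-html", "drone", "keycloak"], registry.get, domain_of, shared))
```

An empty result for the enrolled set at the sweep's launch HEAD would support the claim that the teardown only ever hits disposable stacks; a non-empty one names the recipes to de-enroll or guard.
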
## Pre-claim observation @ 2026-06-17T11:23Z — pre-fix sweep FINISHED (0 procs); 15 canonicals

Final tail of the pre-fix serial sweep (1741209): n8n PASS(3.4.0+2.23.2), plausible
PASS(3.1.0+v2.0.0), uptime-kuma PASS(3.1.0+2.4.0); **mumble rc=1 FAIL (red; canonical unchanged)**.
Canonical count = 15. Two new claim-scrutiny points:
- **mumble — NEW red (rc=1, not a timeout), not previously documented.** Before M2 it must be either
  fixed (promotes clean) or recorded as a DECISIONS exception with a reason — a silent no-canonical is
  not acceptable (same bar I'm holding bluesky/discourse/drone to). Watch for the diagnosis.
- **plausible promoted at `3.1.0+v2.0.0`, NOT the `3.0.1` the plan §2.G anticipated.** The §2.G
  UPGRADE_BASE_VERSION retirement reasoning ("canonical at 3.0.1 → dynamic base resolves 3.0.1 → pin
  redundant, drop the broken 3.0.0") must be RE-DERIVED against the actual canonical 3.1.0+v2.0.0: at
  claim verify that with plausible's real canonical, the dynamic upgrade base resolves to a correct
  green release (NOT the broken 3.0.0 clickhouse-404 base) and plausible's upgrade tier passes — only
  then is dropping the pin safe. If not, the pin stays with a recorded reason (§2.G GATE).

Builder's plan next: deploy fixes to /etc/cc-ci, re-promote drone (fresh-seed fix) + retry gitea 3.6.0,
then launch the FINAL authoritative sweep = the M2.2 evidence (postdates ca89d44+d072d7e, enrolled=20).

## Pre-claim @ 2026-06-17T11:35Z — FINAL authoritative sweep launched; recency criterion MET (confirmed)

Builder launched the authoritative M2.2 sweep (pid 1960362, ~11:26Z) from `/etc/cc-ci @ 12acf94`. I
INDEPENDENTLY confirmed `git merge-base --is-ancestor`: **ca89d44 (residue) AND d072d7e (keycloak) are
both ancestors of 12acf94** → the evidence sweep postdates both fixes, enrolled=20, single serial.
My M2-evidence recency criterion is satisfied — this run is the legitimate M2.2 evidence. (Still verify
at claim: it ran start→finish with no second sweep proc.)

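The ancestry check above can be scripted so it is repeatable at claim time. A minimal sketch, assuming `git` is on `PATH`; the function names and the injectable `run` parameter (there only so the logic can be exercised without a repository) are illustrative. It relies on the documented exit codes of `git merge-base --is-ancestor`: 0 when the first commit is an ancestor of the second, 1 when it is not, anything else on error.

```python
import subprocess
from typing import Callable, Iterable, List


def is_ancestor(repo: str, ancestor: str, head: str, run: Callable = subprocess.run) -> bool:
    """True if `ancestor` is reachable from `head` in the checkout at `repo`."""
    proc = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", ancestor, head],
        check=False,
    )
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    # Any other exit status means git itself failed (bad ref, not a repo, ...).
    raise RuntimeError(f"git merge-base failed with exit status {proc.returncode}")


def missing_fixes(repo: str, head: str, fixes: Iterable[str], run: Callable = subprocess.run) -> List[str]:
    """Fix commits NOT contained in `head`; an empty list means the criterion holds."""
    return [fix for fix in fixes if not is_ancestor(repo, fix, head, run)]


if __name__ == "__main__":
    # Example invocation against the deployed checkout named in this review.
    print(missing_fixes("/etc/cc-ci", "12acf94", ["ca89d44", "d072d7e"]))
```

Raising on unexpected exit codes keeps a mistyped or unfetched hash from being read as "not an ancestor", which would otherwise look like a legitimate recency failure.
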
**Red diagnoses to verify at claim (Builder posture = "red test is information, never weakened" — correct):**
|
||
- discourse: upstream 0.8.1 compose invalid (`sidekiq` → undefined service `discourse`). VERIFY: it's a
|
||
genuine upstream defect (re-read the compose), not our overlay; canonical unchanged.
|
||
- mattermost-lts: `test_restore.py::test_restore_returns_state` FAILED at latest. VERIFY: the test is
|
||
unmodified (git-blame the test vs main; not weakened/xfail'd to dodge), failure is real.
|
||
- mumble: `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` FAILED.
|
||
VERIFY: test unmodified, real failure.
|
||
- bluesky-pds: cold green, warm-promote health 000 (traefik doesn't route warm domain; PDS 200 on
|
||
localhost:3000). VERIFY recipe-specific (not machinery): confirm other promoted recipes DID answer 200
|
||
over HTTPS on their warm domains (already favorable — 15 promoted healthy).
|
||
ALL FOUR must be recorded as DECISIONS exceptions with reasons (not silent no-canonicals) before M2.
|
||
Expected from this sweep: ~14 SKIP (determinism), drone PROMOTES (residue fix), gitea 3.5.3→3.6.0 advance.
|
||
|
||
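The recency check at the top of this entry can be scripted roughly as below. This is a sketch, not a
tool that exists in the repo: `is_ancestor` and `sweep_postdates_fixes` are hypothetical helper names.
It relies only on the documented exit codes of `git merge-base --is-ancestor` (0 = ancestor, 1 = not
an ancestor, anything else = error). The `run` parameter exists purely so the logic can be exercised
without a real clone.

```python
import subprocess


def is_ancestor(repo, ancestor, descendant, run=subprocess.run):
    """True if `ancestor` is reachable from `descendant` in the repo at `repo`."""
    proc = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", ancestor, descendant],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    # Any other code means git itself failed (bad ref, not a repo, ...).
    raise RuntimeError(f"git merge-base failed ({proc.returncode}): {proc.stderr}")


def sweep_postdates_fixes(repo, sweep_commit, fix_commits, run=subprocess.run):
    """The sweep commit postdates the fixes only if EVERY fix is one of its ancestors."""
    return all(is_ancestor(repo, fix, sweep_commit, run) for fix in fix_commits)


if __name__ == "__main__":
    # Values from this entry; needs a clone at /etc/cc-ci to actually run.
    print(sweep_postdates_fixes("/etc/cc-ci", "12acf94", ["ca89d44", "d072d7e"]))
```

Raising on unexpected exit codes matters here: treating a git error as "not an ancestor" (or worse, as
"ancestor") would turn a broken check into a false verdict.
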
## Pre-claim findings @ 2026-06-17T11:58Z — final sweep crux outcomes (drone ✓, gitea advance ✗)

Cold-read from cc-ci (raw canonical.json, my own check). 16 canonical recipes on disk: cryptpad,
custom-html, custom-html-tiny, drone, ghost, gitea, hedgedoc, immich, lasuite-{docs,drive,meet}, mailu,
matrix-synapse, n8n, plausible, uptime-kuma. 16 promoted + 4 documented reds (discourse, mattermost-lts,
mumble, bluesky-pds) = 20 enrolled. Clean accounting.
- **drone — PROMOTED CLEAN ✓ (favorable, DEFECT-2 closing evidence).**
  `/var/lib/ci-warm/drone/canonical.json` = `{version 1.9.0+2.26.0, commit 91b27ceb…, status idle, ts 20260617T115046Z}` —
  fresh, from THIS final post-fix sweep; log `sweep: drone rc=0 (PASS (promoted 1.9.0+2.26.0))`. The
  fresh-seed-teardown residue fix (ca89d44) resolved the once-failed-promote secret residue. (At the
  formal claim I'll re-derive that commit == the 1.9.0+2.26.0 tag's commit, and confirm warm reattach.)
- **gitea — ADVANCE FAILED AGAIN ✗ (CLAIM-BLOCKER for M2.6 + M2.3).** Log: `sweep: gitea RUN — new
  release 3.6.0+1.24.2-rootless > canonical 3.5.3+1.24.2-rootless … rc=0 (GREEN-BUT-PROMOTE-FAILED
  (canonical=3.5.3…, expected 3.6.0…))`. canonical.json still `3.5.3+1.24.2-rootless` (ts 083930Z, OLD)
  — known-good correctly PRESERVED on the failed advance, but the advance did NOT happen. Impact:
  1. **M2.6 not demonstrated:** gitea was the live new-tag→`canonical(older)→new` advance proof. The
     trigger fired (RUN on the newer tag) and old-known-good was kept, but a SUCCESSFUL promote to the
     new tagged version — which §3/§5 M2.6 requires — did not occur. Needs a real fix or the plan's
     alternative (construct custom-html older→new).
  2. **M2.3 determinism dirtied:** on a 2nd sweep `sweep_decision(gitea, 3.6.0, 3.5.3) → RUN`, so gitea
     re-runs — and it is NOT a genuine red (cold test is GREEN; only the warm advance promote times
     out ~600s). So it is NOT covered by "reds correctly retry"; it is a green recipe whose promote
     deterministically fails, which both wastes a CI rerun AND breaks "run-twice → skip-all". A plain
     retry won't fix a deterministic timeout — needs the warm-advance timeout raised / the in-place
     version-bump deploy diagnosed, OR gitea documented like the reds (but it's green, so that's weaker).

  Sending the Builder a heads-up so they don't claim M2 with this open.

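Why a failed promote dirties determinism, as a minimal sketch. This is NOT the real `sweep_decision`
(its body is not quoted in this log). `sweep_decision_sketch` is a hypothetical stand-in that assumes
the decision compares the recipe part of the version (before `+`) and takes arguments in the order
the log line uses: recipe, latest release, current canonical.

```python
def core_version(tag):
    """'3.6.0+1.24.2-rootless' -> (3, 6, 0). Only the part before '+' is compared."""
    return tuple(int(piece) for piece in tag.split("+", 1)[0].split("."))


def sweep_decision_sketch(recipe, latest, canonical):
    """Return 'RUN' or 'SKIP' for one recipe (assumed rule, see lead-in)."""
    if canonical is None:
        return "RUN"  # never promoted: nothing to skip against
    if core_version(latest) > core_version(canonical):
        return "RUN"  # new release > canonical
    return "SKIP"  # no-new-version


# First sweep: gitea runs because 3.6.0 > 3.5.3.
first = sweep_decision_sketch("gitea", "3.6.0+1.24.2-rootless", "3.5.3+1.24.2-rootless")

# The promote failed, so the canonical is STILL 3.5.3 and the inputs are unchanged:
second = sweep_decision_sketch("gitea", "3.6.0+1.24.2-rootless", "3.5.3+1.24.2-rootless")

# drone promoted to latest, so an immediate second sweep skips it.
drone = sweep_decision_sketch("drone", "1.9.0+2.26.0", "1.9.0+2.26.0")
print(first, second, drone)
```

The decision is a pure function of (latest, canonical). A promote that deterministically fails never
changes the canonical, so the same RUN comes back on every sweep until the promote itself is fixed.
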
**Sweep completion @ 12:00:03Z:** authoritative sweep `=== M2.2 FULL SWEEP done rc=0
2026-06-17T12:00:03Z ===` (ran 11:25:57→12:00:03, ~34m; node idle after, no sweep/run procs). Determinism
preview already visible IN this run: n8n/plausible/uptime-kuma/immich/lasuite-*/mailu/matrix-synapse all
`SKIP no-new-version` = the just-promoted recipes correctly skip. Builder consumed my gitea heads-up
(9303359: "gitea 3.6.0 advance — fixing; drone promoted clean"). Awaiting gitea fix + M2.3/M2.5/M2.6/
M2.7/M2.8 proofs before any M2 claim.

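A record counts as "fresh, from THIS sweep" only if its `ts` falls inside the sweep window. A small
sketch of that comparison, assuming `ts` always uses the compact UTC form seen in the records
(`YYYYMMDDTHHMMSSZ`). `parse_canonical_ts` and `promoted_in_window` are hypothetical helper names.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y%m%dT%H%M%SZ"  # e.g. 20260617T115046Z


def parse_canonical_ts(ts):
    """Parse a canonical.json `ts` string into an aware UTC datetime."""
    return datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)


def promoted_in_window(ts, start, end):
    """True if the record was written between sweep start and sweep end (inclusive)."""
    return start <= parse_canonical_ts(ts) <= end


# Sweep window from the completion line above: 11:25:57 -> 12:00:03 on 2026-06-17 (UTC).
sweep_start = datetime(2026, 6, 17, 11, 25, 57, tzinfo=timezone.utc)
sweep_end = datetime(2026, 6, 17, 12, 0, 3, tzinfo=timezone.utc)

drone_fresh = promoted_in_window("20260617T115046Z", sweep_start, sweep_end)
# The custom-html ts quoted in the Orientation entry predates this sweep.
older_record_fresh = promoted_in_window("20260617T050314Z", sweep_start, sweep_end)
print(drone_fresh, older_record_fresh)
```

The same comparison is what separates a non-hollow run from exit-0 on empty: at least one `ts` has to
land inside the window of the run being claimed.
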
## Pre-claim assessment @ 2026-06-17T12:21Z — gitea-exception diagnosis + M2.3 reframing (my acceptance bar)

Builder landed bdc2ec4 (DECISIONS): gitea 3.6.0 warm-advance documented as a RECIPE issue + an M2.3
determinism reframing. My standard for accepting these at the M2 claim:

**gitea 3.6.0 exception — diagnosis plausible; two things I will independently verify (not take on faith):**
- Builder's isolation claim is the right shape: the warm-ADVANCE machinery is proven via a CONSTRUCTED
  custom-html older→new advance (M2.6), so gitea's failure is gitea-specific not machinery. VERIFY the
  custom-html advance ACTUALLY promoted (canonical advanced old→new, healthy) — that's load-bearing.
- The gitea crash is `JWT Secret … app.ini: read-only file system`. Cold FRESH 3.6.0 passes; warm
  reattach-advance crashes. VERIFY this is genuinely a gitea-3.6.0/rootless-config + retained-volume
  interaction (e.g. pre-existing 3.5.3 app.ini / rootless-UID), NOT our warm-promote mounting app.ini
  read-only. If OUR machinery makes app.ini read-only (cold doesn't, warm does), it's a MACHINERY defect
  mislabeled as a recipe issue — that would NOT be an acceptable exception and would fail M1(A)/M2.
  Check: how does the warm advance mount/derive app.ini vs the cold install for gitea.
- gitea correctly KEEPS 3.5.3 (never promote unhealthy) — good; confirm 3.5.3 record + volume intact.

**M2.3 reframing — ACCEPTABLE ONLY IF rigorously demonstrated + flagged as a DoD deviation.** Plan
§3/§5 LITERALLY say run-twice → "SKIPS every recipe … clean no-op". That ideal assumed all-promote;
reality = 15 promoted-at-latest + 5 that can't (4 genuine/documented reds + gitea recipe-bug). Builder's
operative property = "no promoted-at-latest recipe re-runs; reds + gitea correctly retry." This is
plan-consistent in SPIRIT (the no-op's purpose is no needless re-test of good-current recipes) and the
plan forbids weakening tests to force promotes — so the literal ideal is unachievable honestly. I will
ACCEPT it IFF: (i) an actual immediate 2nd sweep shows EXACTLY the 15 promoted-at-latest SKIP (no CI
rerun) and ONLY the documented exceptions (gitea + 4 reds) RUN — I will re-run/inspect this myself, not
trust a summary; (ii) every re-running recipe has a recorded DECISIONS reason; (iii) it is explicitly
noted as a deviation from the literal "skip every recipe" so the operator sees it. If a
promoted-at-latest recipe needlessly re-runs, or an undocumented recipe re-runs, M2.3 FAILs. NOT a veto
now — this is the bar I'll hold at the claim.

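Conditions (i) and (ii) of that bar, written out as a checker I could run against a parsed second
sweep. A sketch only: the outcome and reason dicts are hypothetical inputs (how they get parsed from
the sweep log and DECISIONS is not shown), and `check_second_sweep` is not an existing tool. The
recipe sets are the ones named in the entries above.

```python
PROMOTED_AT_LATEST = frozenset({
    "cryptpad", "custom-html", "custom-html-tiny", "drone", "ghost", "hedgedoc",
    "immich", "lasuite-docs", "lasuite-drive", "lasuite-meet", "mailu",
    "matrix-synapse", "n8n", "plausible", "uptime-kuma",
})
DOCUMENTED_EXCEPTIONS = frozenset({
    "gitea", "discourse", "mattermost-lts", "mumble", "bluesky-pds",
})


def check_second_sweep(outcomes, reasons):
    """Return a list of M2.3 violations; an empty list means the bar is met.

    outcomes: recipe -> "SKIP" | "RUN" from the immediate second sweep.
    reasons:  recipe -> recorded DECISIONS reason text (missing/empty = none).
    """
    problems = []
    for recipe in sorted(PROMOTED_AT_LATEST):
        if outcomes.get(recipe) != "SKIP":
            problems.append(f"{recipe}: promoted-at-latest but did not SKIP")
    for recipe, outcome in sorted(outcomes.items()):
        if outcome != "RUN" or recipe in PROMOTED_AT_LATEST:
            continue  # promoted recipes were already judged above
        if recipe not in DOCUMENTED_EXCEPTIONS:
            problems.append(f"{recipe}: re-ran but is not a documented exception")
        elif not reasons.get(recipe):
            problems.append(f"{recipe}: re-ran without a recorded DECISIONS reason")
    return problems


# Hypothetical "clean" second sweep: 15 SKIP, 5 documented RUN, all with reasons.
outcomes = {r: "SKIP" for r in PROMOTED_AT_LATEST}
outcomes.update({r: "RUN" for r in DOCUMENTED_EXCEPTIONS})
reasons = {r: "recorded" for r in DOCUMENTED_EXCEPTIONS}
print(check_second_sweep(outcomes, reasons))
```

Condition (iii), the explicit deviation note for the operator, is a documentation check and stays
manual. The checker also only sees what it is fed, so the outcomes must come from my own read of the
raw sweep log, not from a Builder summary.
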
## Pre-claim pre-verification @ 2026-06-17T12:34Z — §2.G strip (M2.8) favorable; M2.5 bash-fix needs redeploy

- **§2.G UPGRADE_BASE_VERSION retirement (f611dda, 83c183d) — code-level strip CONFIRMED complete.**
  `grep -rn UPGRADE_BASE_VERSION` (excl. machine-docs) → only EXPLANATORY comments/docs remain (testing.md,
  plausible/bluesky-pds/discourse meta comments, test_meta + test_upgrade_base comments, the resolver
  removal comment at run_recipe_ci.py:132) — NO live key/branch. plausible's pin gone (meta comment:
  dynamic base STEPS BACK to newest-published-strictly-older-than-3.1.0 = 3.0.1+v2.0.0 = the correct base,
  avoiding broken 3.0.0); meta KEYS 15→14 (test_meta.py); bluesky-pds comment now points to dynamic base.
  AT CLAIM: run the full unit suite (test_meta/test_upgrade_base green post-strip) + confirm plausible's
  UPGRADE tier actually resolves base 3.0.1+v2.0.0 dynamically AND passes (Builder claims "verified
  dynamic-base green" — re-run it myself). §2.G GATE (keep-if-broken) does NOT apply since plausible works.
- **M2.5 real timer fire — IN PROGRESS, caught a real bug.** cebd293: the actual timer fire revealed the
  deployed nightly-sweep service was MISSING `bash` in nix runtimeInputs (a manual run wouldn't catch it —
  exactly why "real fire, not manual" is the DoD). Fix adds bash. NOTE: this is a nix module change →
  requires `git -C /etc/cc-ci pull` + `nixos-rebuild switch` to deploy, THEN a fresh real timer fire that
  ADVANCES ≥1 canonical (non-hollow). AT CLAIM: confirm the fix is deployed AND a post-fix real fire
  (systemctl start nightly-sweep.service or the timer) ran the non-hollow job to completion with evidence
  (a canonical ts moved / log shows the 20-recipe sweep), not exit-0 on empty.