# JOURNAL — phase `canon` (canonical sweep, make it real)

Builder reasoning log. WHY lives here; WHAT/HOW/EXPECTED/WHERE live in STATUS-canon.md.

## 2026-06-17 — bootstrap / code survey

Read the phase canon (`plan-phase-canon-canonical-sweep.md`) + plan.md §6.1/§7/§9. Surveyed the existing canonical/sweep machinery before designing. Key findings:

### Clone identity

`/srv/cc-ci` is a symlink → `/srv/cc-ci-orch`; the env's two "working dirs" are the same directory. This IS the Builder clone (reflog shows the `claim(M2)`/`status(samever) ## DONE` commits). The Adversary cold-verifies from its own fresh clones. No collision.

### What already works (phase doc is partly stale)

- The phase doc says "ZERO canonical.json exist". **Not true any more**: a real canonical for `custom-html` exists on the host at `/var/lib/ci-warm/custom-html/canonical.json` (`version 1.13.0+1.31.1`, commit `2b82eba…`, status idle, ts `20260617T050314Z`) with its retained data volume `warm-custom-html_..._content`. It was produced by a **manual** cold run during the `samever` phase, NOT by the timer. So the *promote primitive* (seed_canonical → write_registry + warmsnap) demonstrably works; the **sweep that should drive it is what's hollow.**

### The real "hollow sweep" defect (root cause, confirmed live)

The deployed `nightly-sweep.timer` fired 2026-06-17 03:09 and logged: `===== nightly cold sweep: enrolled canonicals = [] =====` → a true no-op. Cause: `nightly_sweep.py` does `REPO = os.environ.get("CCCI_REPO", "/root/cc-ci")` then `sys.path.insert(0, REPO/runner); from harness import canonical`. The systemd unit (`nix/modules/nightly-sweep.nix`) sets **no `CCCI_REPO`**, and `/root/cc-ci` **does not exist** on the host. So the import falls through to the harness packaged in the **nix store** (`runnerSrc=../../runner` — runner/ only, NO tests/). `meta.TESTS_DIR = ROOT/tests` then points at a nonexistent dir → `enrolled_recipes()` swallows the OSError → `[]`.
Even though `custom-html` is enrolled in the repo, the deployed timer never sees it. **This is the machinery that was "specified but never doing anything."** Fix: point the sweep at a real, current checkout that has `tests/`.

### How current code stays live on the host

- Normal recipe CI: Drone `exec` pipeline auto-clones cc-ci per build into its workspace, then runs `cc-ci-run runner/run_recipe_ci.py` from that fresh clone → tests/runner always current.
- `/etc/cc-ci` is a **git clone** (the nixos flake source: `nixos-rebuild --flake /etc/cc-ci#…`). It is currently STALE (`e60415d`, far behind main) because recent phases only touched `runner/` (picked up by Drone's fresh clone) and needed no nixos-rebuild. The sweep is the first thing that needs `/etc/cc-ci` current.
- Plan: sweep service sets `CCCI_REPO=/etc/cc-ci` and runs `nightly_sweep.py` FROM the checkout (change the nix to exec `$CCCI_REPO/runner/nightly_sweep.py`, not the store copy) → after a deploy that does `git -C /etc/cc-ci pull && nixos-rebuild`, the sweep reads current tests/ + runner. This reuses the flake-source checkout (declarative, reproducible) rather than inventing a new clone.

### Promote path (the core, §2.A)

- `should_promote_canonical(recipe, ref, overall, quick)` = enrolled & green & cold(not quick) & not-ref (no PR head). `promote_canonical` deploys `latest_version(recipe_tags(recipe))` (the latest git tag) fresh/in-place, waits healthy, undeploys, `seed_canonical` (snapshot + write_registry).
- **Tagged-promote addition needed:** the green gate currently tests *whatever fetch_recipe checked out* (catalogue `main` HEAD for a cold run), which can be untagged-ahead of the latest tag, while promote always writes the latest TAG. Per operator: a canonical must only ever be a real release. Add a `tagged` requirement: the tested head version (`abra.head_compose_version`, the compose `version` label) must equal a published release tag (`recipe_tags`).
  When main HEAD == latest release (the common just-cut case) head_version == latest tag → promote; when main is untagged-ahead → no promote.

### Trigger on a NEW RELEASE TAG (§2.D) + test the tag (not main)

- Version ordering is centralized in `warm_reconcile.version_key` / `latest_version` / `newest_older_version` (already used by samever step-back). Reuse them.
- Trigger (pure, in the sweep, per recipe): after mirror-sync, `latest = latest_version(recipe_tags)`; `canon = read_registry(recipe).version`. No tag → SKIP (never released). `latest <= canon` (by version_key) → SKIP no-new-version (even if main has untagged commits — we compare tags not commits). `latest > canon` → run cold on the tag.
- **Test the TAG cold:** to honour "run CI cold on that tagged version" (and so a green gate proves the exact thing that gets promoted), check out the latest tag in `~/.abra/recipes/` and run with `CCCI_SKIP_FETCH=1` (the existing staging mechanism) → head_version = tag, head_ref = tag commit, REF empty (so `not ref` still holds → promote allowed). The upgrade-base resolver then sees canonical(older) < head(new tag) → real delta (samever step-back never fires: tag>canon by construction).

### samever orthogonality (operator-required)

The release-tag trigger guarantees, in the sweep, version-under-test > canonical, so the upgrade base is strictly older → `samever`'s same-version step-back never fires. (a) no new tag → SKIP, no upgrade-tier run; (b) new tag → canonical(older)→new, real delta, promote. samever's same-version behaviour stays owned by the samever phase on the PR path. Will demonstrate both in M2.

### Enroll-all set (§2.B)

Authoritative inventory = `cc-ci-plan/used-recipes.md` (21 rows: 20 `weekly` + `uptime-kuma` `external`). NOT the test fixtures (custom-html-bkp-bad / -rst-bad, concurrency, regression, _generic). custom-html-tiny IS in used-recipes (weekly) → enroll it too.

### Disk budget (§2.B watch-item)

Host `/`: 150G total, 104G used, **40G free (73%)**.
`du` of /var/lib/ci-warm today: custom-html 32K, keycloak 159M. Retaining ~21 fresh-install data volumes should be a few GB; immich/matrix/mailu are the ones to watch. Will measure during the M2 full sweep and record the real budget; raise the VM disk (orchestrator) rather than silently drop recipes if it binds.

### §2.G UPGRADE_BASE_VERSION retirement — gated on M2

`plausible` pins `UPGRADE_BASE_VERSION="3.0.1+v2.0.0"`; `bluesky-pds` only references it in a comment. Retirement requires plausible's canonical to actually land at its latest green release so the dynamic resolver picks the right base — so this is sequenced AFTER M2 promotes plausible. Keep the pin if plausible can't go green dynamically (record why).

## 2026-06-17 — M1 built + live-proven (CLAIMED)

All M1 code landed (HEAD d4cc9e4). Reasoning behind the choices:

- **Tagged-gate computes `tagged` at the call site, not inside the gate** — keeps `should_promote_canonical` pure (the Adversary anti-anchoring + the existing unit-test contract). `is_released_version` lives in warm_reconcile (owns version logic + recipe_tags I/O).
- **Promote the TESTED version (divergence fix, d4cc9e4):** the Adversary's pre-claim probe flagged that the gate checks `head_version` but promote recorded `latest_version(recipe_tags)`. Live proof-A made this concrete and favourable: the OLD record had commit `2b82eba` (a merge-to-main commit), but the tag `1.13.0+1.31.1` actually points to `df2e273`. Recording the tested version's head_ref now writes the TAG commit — strictly more correct. Sweep path was already safe (head==tag), but the manual `RECIPE=` path needed it.
- **Why a vendored mirror-sync script, not the nix-store open-recipe-pr.sh:** the recipe clones on cc-ci have INCONSISTENT remotes (n8n: origin=mirror; mumble: origin=coopcloud; ghost/discourse: origin=mirror, no `upstream`). open-recipe-pr.sh assumes origin=coopcloud → would force-sync mirror main to *mirror* main (no-op) for most.
  The vendored `scripts/recipe-mirror-sync.sh` pins an explicit coopcloud `upstream` remote from the recipe name, syncs main+TAGS (canon needs upstream tags for the trigger), and authes via the bot token (self-contained, not host .git-credentials). Behaviour matches the phase's described open-recipe-pr.sh --reconcile-only (faithful, close merged-upstream PRs, leave unrelated). See DECISIONS.
- **Why test the TAG via checkout+CCCI_SKIP_FETCH (run_on_tag), not just REF=tag:** REF alone (no SRC) takes fetch_recipe's `abra recipe fetch` branch (ignores REF) AND would set `ref` → should_promote blocks. Staging the tag in the clone + CCCI_SKIP_FETCH makes head=tag with REF empty → promote allowed, and exercises the real "cold on the tagged release" path.

### Live proof evidence (cc-ci, /root/canon-verify @ d4cc9e4)

- proof-A (promote): canonical.json fresh ts 065027Z, commit df2e273 (=tag commit). Note: because custom-html canonical already == latest, run_on_tag here re-promoted an EQUAL version → the samever step-back fired (base 1.11.0+1.29.0). That is an artifact of bypassing the trigger for the proof; the REAL sweep SKIPs equal-version (sweep_decision), so the step-back never fires in the sweep — to be shown live in M2 (canonical(older)→new tag, base=canonical, no step-back).
- proof-B (reattach): --quick reattached the retained volume, green (4 tests passed), known-good version+commit UNCHANGED (df2e273); ts re-stamped only by the idle-status write (write_registry stamps ts on every status write) — NOT a promote.
- proof-C (untagged→no-promote): green cold run (level 5/5) on an untagged head (label 1.13.1+1.31.1) → 0 promote log lines, canonical.json byte-identical before/after. Tagged-gate works live.
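
The gate that proofs A to C exercise can be sketched as follows. This is a minimal illustration, not the repository code: the signature extends the `should_promote_canonical(recipe, ref, overall, quick)` form quoted in the bootstrap entry with a `tagged` flag, and `is_released_version` here takes the tag list as an argument (an assumption; the real helper lives in warm_reconcile and does its own recipe_tags I/O). The `ENROLLED` set and the `"green"` status string are hypothetical stand-ins.

```python
from typing import Iterable, Optional

# Hypothetical enrolment set; the real list is derived from the repo checkout.
ENROLLED = {"custom-html", "custom-html-tiny"}


def is_released_version(head_version: str, tags: Iterable[str]) -> bool:
    """Exact-match check: the tested version must be a published release tag."""
    return head_version in set(tags)


def should_promote_canonical(recipe: str, ref: Optional[str], overall: str,
                             quick: bool, tagged: bool) -> bool:
    """Pure gate: enrolled, green, cold (not quick), no PR ref, and tagged."""
    return (recipe in ENROLLED and overall == "green" and not quick
            and not ref and tagged)


# The caller computes `tagged`, so the gate itself performs no I/O.
tags = ["1.11.0+1.29.0", "1.13.0+1.31.1"]
promote_tagged = should_promote_canonical(
    "custom-html", None, "green", False,
    is_released_version("1.13.0+1.31.1", tags))
promote_untagged = should_promote_canonical(
    "custom-html", None, "green", False,
    is_released_version("1.13.1+1.31.1", tags))
print(promote_tagged, promote_untagged)
```

Keeping the tag lookup outside the gate means the gate can be unit-tested with plain booleans, which is the purity argument made in the M1 notes above.
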
## 2026-06-17 — M2 prep recon (non-advancing, while awaiting M1 verdict)

Read-only sweep_decision survey across the 21 enrolled (from existing host clones; the real sweep mirror-syncs+fetches first so tags may differ slightly):

- **20 recipes have NO canonical yet → first sweep RUNs (seed) each**; only custom-html SKIPs.
- plausible latest tag = **3.0.1+v2.0.0** (== the §2.G UPGRADE_BASE_VERSION pin target) → once the sweep seeds plausible's canonical at 3.0.1, the dynamic base should resolve 3.0.1 and the pin can go.

M2 risks to plan for (when M1 PASSes):

1. **Runtime:** 20 full cold deploy/test/teardown runs, several heavy (matrix-synapse, immich, mailu, discourse, ghost, mattermost) at 15-25 min each → a single full sweep likely EXCEEDS the timer's 6h TimeoutStartSec. Options: run M2.2 in the foreground (not the timer) for the full promote proof, raise TimeoutStartSec, and prove the real-timer-fire (M2.5) on a smaller already-canonical set (so the fire advances at least one canonical, not exit-0 on empty).
2. **Disk:** 20 retained data volumes on 40G free. Measure as it runs; raise the VM disk (orchestrator) if it binds rather than dropping recipes (per §2.B). Heavy: immich/matrix/mailu.
3. **Reds are acceptable** (canonical just not advanced) — but maximise greens; investigate any red.
4. Unusual tag formats (ghost 1.3.0+6.42.0-alpine, gitea 3.5.3+1.24.2-rootless, mumble 1.0.0+v1.6.870-0) — version_key parses leading numerics; is_released_version exact-match covers them.

## 2026-06-17 — promote fix validated (DEFECT-1/2 response)

Validated f94de22 on the 3 distinct failure classes via run_on_tag from /etc/cc-ci:

- custom-html-tiny (install_steps content): PROMOTED 1.2.0+2.43.0 ✓
- ghost (dirty-tree app-new FATA): PROMOTED 1.4.0+6.45.0-alpine ✓
- bluesky-pds (special secret): secret now inserted in promote + deploy succeeds, but warm health fails — PDS is healthy INTERNALLY (200 on localhost:3000) yet not routed via traefik on the warm domain (000).
  This is a bluesky-specific WARM-DOMAIN ROUTING issue (cold-test domain worked), NOT the promote-wiring bug. Documented as a known red pending follow-up (the sweep leaves it intact per guardrails).

DEFECT-1 (label) fixed: sweep result now derives from canonical existence. Full sweep re-run launched (skips the 7 already-promoted = determinism evidence; runs the rest).

## 2026-06-17 ~13:20 — RESUME reconstruction (post-compaction) + real-timer re-fire in flight

Reconstructed state from cc-ci (not memory): the parity fix (2c61f2f) is DEPLOYED — the deployed nix-store sweep script `/nix/store/2q6a27hnnmy0.../cc-ci-nightly-sweep` contains `export PATH="/run/current-system/sw/bin:/run/wrappers/bin:$PATH"`. A prior iteration committed 2c61f2f (13:00) → pulled /etc/cc-ci → nixos-rebuild → `systemctl start nightly-sweep.service` (13:01), then handed off. So the **DEFECT-3 production-env re-fire is IN FLIGHT** as the real timer service (PID 2149231, `TriggeredBy: nightly-sweep.timer`, ppid=1, journald socket).

Parity precondition CONFIRMED real (not asserted): `git-lfs` → `/run/current-system/sw/bin/git-lfs` (symlink to git-lfs-3.6.1); Drone exec runner `/proc//environ` PATH = `/run/current-system/sw/bin:/run/wrappers/bin` — identical head to the sweep's now-prepended PATH.

This fire so far (journalctl -u nightly-sweep.service --since 13:01):

- custom-html RUN — new release 1.13.0+1.31.1 > canonical **1.11.0+1.29.0** → **PASS (promoted 1.13.0+1.31.1)** @13:15:17. A real-timer non-hollow promotion + the constructed older→new advance (M2.6 path 2 / M2.5 non-hollow) under the deployed parity env. (custom-html canonical had been reset to 1.11.0 pre-fire to stage the advance.)
- cryptpad SKIP, custom-html-tiny SKIP (determinism — promoted-at-latest skip), bluesky-pds GREEN-BUT-PROMOTE-FAILED (documented warm-routing red).
- Now at discourse (RUN seed, deploying).

CRUX still pending: gitea (8th) must flip cold-GREEN under the parity PATH (git-lfs now present) — that is the DEFECT-3 acceptance criterion. Polling every ~5 min (single node, fire in flight). Not touching the node until it completes.

## 2026-06-17 ~14:40 — production re-fire COMPLETE; DEFECT-3 closed; launching clean determinism 2nd sweep

The DEFECT-3 re-fire (nightly-sweep.service, 13:01:01→14:37:22, Result=success, status=0, single serial) completed cleanly under the deployed Drone-parity PATH.

**gitea crux RESOLVED:** `test_lfs_roundtrip PASSED` (the test that redded on the missing-git-lfs fire) → gitea cold-GREEN in production env, then the documented app.ini warm-advance exception (3.5.3 kept). So the only reason gitea redded before was the timer-env git-lfs gap, now fixed by host-PATH parity — confirming the fix is the right one (the sweep validates exactly as Drone CI does). No NEW promote failures surfaced that the manual env had masked → DEFECT-3 is the LAST env-parity gap, now closed.

custom-html 1.11.0→1.13.0 advance promoted in this real timer fire: this is simultaneously the M2.5 non-hollow real-fire proof AND the M2.6 constructed older→new advance (canonical(older)→new tagged, real delta, samever step-back never fires because tag>canon by construction). 14 promoted-at-latest recipes SKIP no-new-version live = determinism preview inside the production fire.

**Why a clean 2nd sweep now (M2.3):** in this fire custom-html was the one promoted recipe that RAN (I'd reset its canonical to 1.11.0 pre-fire to stage the advance). Now it's at 1.13.0 = latest, so all 16 promoted canonicals are at-latest. An immediate 2nd sweep therefore yields the clean run-twice result the plan's M2.3 asks for: the 15 promoted-at-latest SKIP (incl. custom-html), and ONLY the 5 documented exceptions RUN (gitea 3.6.0 advance retry, discourse/mattermost-lts/mumble reds, bluesky warm-routing).
Reds re-running is the accepted, DECISIONS-recorded deviation from the literal "skip every recipe" (cannot weaken a test to force a promote). Launching it as the real service again (systemctl start) for max faithfulness; ~96 min (discourse's deterministic 60-min deploy-timeout dominates).

Disk budget healthy: ci-warm 1.1G / 16 volumes, 38G free.
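
The SKIP/RUN split expected from the second sweep follows from the release-tag trigger described in the bootstrap entry. Below is a minimal sketch under stated assumptions: `version_key` here is a simplified stand-in that orders tags by the numeric `X.Y.Z` part before the `+` (the journal only says the real one "parses leading numerics"), and `sweep_decision` takes plain values instead of reading the registry or the recipe clone. The return strings are illustrative.

```python
import re
from typing import List, Optional, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Simplified stand-in: order by the numeric X.Y.Z part before '+'."""
    recipe_part = tag.split("+", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", recipe_part))


def latest_version(tags: List[str]) -> Optional[str]:
    """Newest tag by version_key, or None when the recipe was never released."""
    return max(tags, key=version_key) if tags else None


def sweep_decision(tags: List[str], canonical: Optional[str]) -> str:
    """Return 'RUN' or a 'SKIP ...' reason for one recipe."""
    latest = latest_version(tags)
    if latest is None:
        return "SKIP never-released"
    if canonical is not None and version_key(latest) <= version_key(canonical):
        return "SKIP no-new-version"
    return "RUN"  # no canonical yet (seed) or a newer release tag exists


tags = ["1.11.0+1.29.0", "1.13.0+1.31.1"]
print(sweep_decision(tags, "1.11.0+1.29.0"))  # staged older canonical
print(sweep_decision(tags, "1.13.0+1.31.1"))  # canonical already at latest
print(sweep_decision(tags, None))             # no canonical yet: seed run
print(sweep_decision([], None))               # recipe with no tags
```

Under this rule a recipe whose canonical never landed (a red) keeps returning `RUN` on every sweep, which matches the deviation recorded above: reds re-run while promoted-at-latest recipes skip.
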