cc-ci/machine-docs/JOURNAL-canon.md
autonomic-bot 23c02c59b6
status(canon): bootstrap phase canon — state files, hollow-sweep root cause, M1/M2 backlog
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 06:28:35 +00:00

JOURNAL — phase canon (canonical sweep, make it real)

Builder reasoning log. WHY lives here; WHAT/HOW/EXPECTED/WHERE live in STATUS-canon.md.

2026-06-17 — bootstrap / code survey

Read the phase canon (plan-phase-canon-canonical-sweep.md) + plan.md §6.1/§7/§9. Surveyed the existing canonical/sweep machinery before designing. Key findings:

Clone identity

/srv/cc-ci is a symlink → /srv/cc-ci-orch; the env's two "working dirs" are the same directory. This IS the Builder clone (reflog shows the claim(M2)/status(samever) ## DONE commits). The Adversary cold-verifies from its own fresh clones. No collision.

What already works (phase doc is partly stale)

  • The phase doc says "ZERO canonical.json exist". Not true any more: a real canonical for custom-html exists on the host at /var/lib/ci-warm/custom-html/canonical.json (version 1.13.0+1.31.1, commit 2b82eba…, status idle, ts 20260617T050314Z) with its retained data volume warm-custom-html_..._content. It was produced by a manual cold run during the samever phase, NOT by the timer. So the promote primitive (seed_canonical → write_registry + warmsnap) demonstrably works; the sweep that should drive it is what's hollow.

The real "hollow sweep" defect (root cause, confirmed live)

The deployed nightly-sweep.timer fired 2026-06-17 03:09 and logged: ===== nightly cold sweep: enrolled canonicals = [] ===== → a true no-op. Cause: nightly_sweep.py sets REPO = os.environ.get("CCCI_REPO", "/root/cc-ci"), puts REPO/runner at the front of sys.path, then does from harness import canonical. The systemd unit (nix/modules/nightly-sweep.nix) sets no CCCI_REPO, and /root/cc-ci does not exist on the host, so the import falls through to the harness packaged in the nix store (runnerSrc=../../runner, i.e. runner/ only, NO tests/). meta.TESTS_DIR = ROOT/tests then points at a nonexistent dir, enrolled_recipes() swallows the resulting OSError, and the sweep sees []. Even though custom-html is enrolled in the repo, the deployed timer never sees it. This is the machinery that was "specified but never doing anything." Fix: point the sweep at a real, current checkout that has tests/.
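The failure mode above can be reproduced in isolation. The sketch below is not the real harness code: enrolled_recipes_lenient and resolve_tests_dir are hypothetical names, and treating every subdirectory of tests/ as "enrolled" is a simplifying assumption. It only illustrates why a swallowed OSError makes a misconfigured path look identical to "nothing enrolled", and how a fail-loud guard would surface it instead.

```python
import tempfile
from pathlib import Path


def enrolled_recipes_lenient(tests_dir: Path) -> list[str]:
    """Mimic the described behaviour: a missing tests/ dir silently yields []."""
    try:
        # Assumption: one subdirectory per enrolled recipe.
        return sorted(p.name for p in tests_dir.iterdir() if p.is_dir())
    except OSError:
        return []


def resolve_tests_dir(env: dict[str, str]) -> Path:
    """Hypothetical guard: refuse to run unless CCCI_REPO has a tests/ dir."""
    repo = Path(env.get("CCCI_REPO", "/root/cc-ci"))
    tests_dir = repo / "tests"
    if not tests_dir.is_dir():
        raise SystemExit(f"sweep misconfigured: {tests_dir} is not a directory")
    return tests_dir


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "no-such-checkout"

    # Lenient path: the bad path is indistinguishable from an empty enrollment.
    hollow = enrolled_recipes_lenient(missing / "tests")

    # Guarded path: the same misconfiguration becomes a hard failure.
    try:
        resolve_tests_dir({"CCCI_REPO": str(missing)})
        guard_fired = False
    except SystemExit:
        guard_fired = True

    # With a checkout that does contain tests/, the recipe is visible.
    checkout = Path(tmp) / "cc-ci"
    (checkout / "tests" / "custom-html").mkdir(parents=True)
    seen = enrolled_recipes_lenient(
        resolve_tests_dir({"CCCI_REPO": str(checkout)})
    )
```

Whether the real sweep should exit non-zero or only log loudly on a missing tests/ is a design choice this sketch does not settle.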

How current code stays live on the host

  • Normal recipe CI: Drone exec pipeline auto-clones cc-ci per build into its workspace, then runs cc-ci-run runner/run_recipe_ci.py from that fresh clone → tests/runner always current.
  • /etc/cc-ci is a git clone (the nixos flake source: nixos-rebuild --flake /etc/cc-ci#…). It is currently STALE (e60415d, far behind main) because recent phases only touched runner/ (picked up by Drone's fresh clone) and needed no nixos-rebuild. The sweep is the first thing that needs /etc/cc-ci current.
  • Plan: the sweep service sets CCCI_REPO=/etc/cc-ci and runs nightly_sweep.py FROM the checkout (change the nix module to exec $CCCI_REPO/runner/nightly_sweep.py, not the store copy). After a deploy that does git -C /etc/cc-ci pull && nixos-rebuild, the sweep then reads current tests/ + runner. This reuses the flake-source checkout (declarative, reproducible) rather than inventing a new clone.

Promote path (the core, §2.A)

  • should_promote_canonical(recipe, ref, overall, quick) is true only when the recipe is enrolled, the run is green, the run is cold (not quick), and there is no ref (no PR head). promote_canonical deploys latest_version(recipe_tags(recipe)) (the latest git tag) fresh/in-place, waits until healthy, undeploys, then calls seed_canonical (snapshot + write_registry).
  • Tagged-promote addition needed: the green gate currently tests whatever fetch_recipe checked out (catalogue main HEAD for a cold run), which can be untagged-ahead of the latest tag, while promote always writes the latest TAG. Per operator: a canonical must only ever be a real release. Add a tagged requirement: the tested head version (abra.head_compose_version, the compose version label) must equal a published release tag (recipe_tags). When main HEAD == latest release (the common just-cut case) head_version == latest tag → promote; when main is untagged-ahead → no promote.
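The gate plus the proposed tagged requirement can be sketched as one pure predicate. This is an illustration, not the existing should_promote_canonical: the flattened signature is hypothetical, it assumes release tag names are directly comparable to the compose version label, and the tag 1.12.0+1.31.0 and the untagged label in the examples are made up.

```python
def should_promote(
    enrolled: bool,
    overall_green: bool,
    quick: bool,
    ref: str,
    head_version: str,
    release_tags: list[str],
) -> bool:
    """Existing gate (enrolled, green, cold, no PR ref) plus the tagged rule."""
    base_gate = enrolled and overall_green and not quick and not ref
    # Tagged requirement: the version that was tested must be a real release.
    is_release = head_version in release_tags
    return base_gate and is_release


tags = ["1.12.0+1.31.0", "1.13.0+1.31.1"]  # first tag is illustrative only

# main HEAD == latest release: the tested version is a tag, so promote.
just_cut = should_promote(True, True, False, "", "1.13.0+1.31.1", tags)

# main is ahead of the latest tag: the tested label is not a release.
untagged = should_promote(True, True, False, "", "1.13.1+1.31.2", tags)

# A PR head never promotes, even when green and tagged.
pr_head = should_promote(True, True, False, "refs/pull/1/head", "1.13.0+1.31.1", tags)
```

Membership in the tag list follows the wording above. Because promote always writes the latest tag, a stricter variant could compare against the latest tag only.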

Trigger on a NEW RELEASE TAG (§2.D) + test the tag (not main)

  • Version ordering is centralized in warm_reconcile.version_key / latest_version / newest_older_version (already used by samever step-back). Reuse them.
  • Trigger (pure, in the sweep, per recipe): after mirror-sync, latest = latest_version(recipe_tags); canon = read_registry(recipe).version. No tag → SKIP (never released). latest <= canon (by version_key) → SKIP no-new-version (even if main has untagged commits — we compare tags not commits). latest > canon → run cold on the tag.
  • Test the TAG cold: to honour "run CI cold on that tagged version" (and so a green gate proves the exact thing that gets promoted), check out the latest tag in ~/.abra/recipes/<recipe> and run with CCCI_SKIP_FETCH=1 (the existing staging mechanism) → head_version = tag, head_ref = tag commit, REF empty (so not ref still holds → promote allowed). The upgrade-base resolver then sees canonical(older) < head(new tag) → real delta (samever step-back never fires: tag>canon by construction).
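The per-recipe trigger described in these bullets is a small pure decision. In the sketch below, version_key is a hypothetical stand-in for warm_reconcile.version_key (the real ordering may differ), the decision labels are invented, and running when no canonical exists yet is an assumption the notes above do not spell out. The tag 1.14.0+1.31.1 is illustrative.

```python
import re
from typing import Optional


def version_key(version: str) -> tuple:
    """Stand-in ordering: compare the numeric fields of each '+'-separated part."""
    return tuple(
        tuple(int(n) for n in re.findall(r"\d+", part))
        for part in version.split("+")
    )


def sweep_decision(
    tags: list[str], canon_version: Optional[str]
) -> tuple[str, Optional[str]]:
    """Return (decision, tag_to_test) for one recipe after mirror-sync."""
    if not tags:
        return ("skip-never-released", None)
    latest = max(tags, key=version_key)
    # Tags are compared, not commits: untagged work on main never triggers.
    if canon_version is not None and version_key(latest) <= version_key(canon_version):
        return ("skip-no-new-version", None)
    # Assumption: with no canonical yet, the latest tag is tested cold.
    return ("run-cold-on-tag", latest)


canon = "1.13.0+1.31.1"
no_tags = sweep_decision([], canon)
same = sweep_decision(["1.13.0+1.31.1"], canon)
newer = sweep_decision(["1.13.0+1.31.1", "1.14.0+1.31.1"], canon)
```

Since the run only happens when the latest tag sorts strictly above the canonical version, the upgrade base is always older than the version under test.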

samever orthogonality (operator-required)

The release-tag trigger guarantees, in the sweep, version-under-test > canonical, so the upgrade base is strictly older → samever's same-version step-back never fires. (a) no new tag → SKIP, no upgrade-tier run; (b) new tag → canonical(older)→new, real delta, promote. samever's same-version behaviour stays owned by the samever phase on the PR path. Will demonstrate both in M2.

Enroll-all set (§2.B)

Authoritative inventory = cc-ci-plan/used-recipes.md (21 rows: 20 weekly + uptime-kuma external). NOT the test fixtures (custom-html-bkp-bad / -rst-bad, concurrency, regression, _generic). custom-html-tiny IS in used-recipes (weekly) → enroll it too.

Disk budget (§2.B watch-item)

Host /: 150G total, 104G used, 40G free (73%). du of /var/lib/ci-warm today: custom-html 32K, keycloak 159M. Retaining ~21 fresh-install data volumes should be a few GB; immich/matrix/mailu are the ones to watch. Will measure during the M2 full sweep and record the real budget; raise the VM disk (orchestrator) rather than silently drop recipes if it binds.
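The budget check planned for the M2 sweep reduces to simple arithmetic. The sketch below uses the two sizes measured above and the 40G free figure. The 10 GiB reserve is a made-up safety floor, budget_headroom is a hypothetical helper, and two recipes are nowhere near the full set of about 21.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def budget_headroom(free_bytes: int, retained: dict[str, int], reserve_bytes: int) -> int:
    """Bytes left after keeping the retained volumes and a safety reserve."""
    needed = sum(retained.values())
    return free_bytes - reserve_bytes - needed


# Sizes from today's du of /var/lib/ci-warm; other recipes are not measured yet.
measured = {"custom-html": 32 * KIB, "keycloak": 159 * MIB}

headroom = budget_headroom(40 * GIB, measured, reserve_bytes=10 * GIB)
fits = headroom >= 0  # a negative headroom means the disk budget binds
```

On the host, free_bytes could come from shutil.disk_usage("/").free, and the per-recipe sizes from the measurements recorded during the full sweep.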

§2.G UPGRADE_BASE_VERSION retirement — gated on M2

plausible pins UPGRADE_BASE_VERSION="3.0.1+v2.0.0"; bluesky-pds only references it in a comment. Retirement requires plausible's canonical to actually land at its latest green release so the dynamic resolver picks the right base — so this is sequenced AFTER M2 promotes plausible. Keep the pin if plausible can't go green dynamically (record why).