cc-ci/machine-docs/JOURNAL-canon.md
autonomic-bot 23c02c59b6
status(canon): bootstrap phase canon — state files, hollow-sweep root cause, M1/M2 backlog
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 06:28:35 +00:00

# JOURNAL — phase `canon` (canonical sweep, make it real)
Builder reasoning log. WHY lives here; WHAT/HOW/EXPECTED/WHERE live in STATUS-canon.md.
## 2026-06-17 — bootstrap / code survey
Read the phase canon (`plan-phase-canon-canonical-sweep.md`) + plan.md §6.1/§7/§9. Surveyed the
existing canonical/sweep machinery before designing. Key findings:
### Clone identity
`/srv/cc-ci` is a symlink → `/srv/cc-ci-orch`; the env's two "working dirs" are the same directory.
This IS the Builder clone (reflog shows the `claim(M2)`/`status(samever) ## DONE` commits). The
Adversary cold-verifies from its own fresh clones. No collision.
### What already works (phase doc is partly stale)
- The phase doc says "ZERO canonical.json exist". **Not true any more**: a real canonical for
`custom-html` exists on the host at `/var/lib/ci-warm/custom-html/canonical.json`
(`version 1.13.0+1.31.1`, commit `2b82eba…`, status idle, ts `20260617T050314Z`) with its retained
data volume `warm-custom-html_..._content`. It was produced by a **manual** cold run during the
`samever` phase, NOT by the timer. So the *promote primitive* (seed_canonical → write_registry +
warmsnap) demonstrably works; the **sweep that should drive it is what's hollow.**
### The real "hollow sweep" defect (root cause, confirmed live)
The deployed `nightly-sweep.timer` fired 2026-06-17 03:09 and logged:
`===== nightly cold sweep: enrolled canonicals = [] =====` → a true no-op.
Cause: `nightly_sweep.py` does `REPO = os.environ.get("CCCI_REPO", "/root/cc-ci")` then
`sys.path.insert(0, REPO/runner); from harness import canonical`. The systemd unit
(`nix/modules/nightly-sweep.nix`) sets **no `CCCI_REPO`**, and `/root/cc-ci` **does not exist** on the
host. So the import falls through to the harness packaged in the **nix store** (`runnerSrc=../../runner`
— runner/ only, NO tests/). `meta.TESTS_DIR = ROOT/tests` then points at a nonexistent dir →
`enrolled_recipes()` swallows the OSError → `[]`. Even though `custom-html` is enrolled in the repo,
the deployed timer never sees it. **This is the machinery that was "specified but never doing
anything."** Fix: point the sweep at a real, current checkout that has `tests/`.
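
The failure mode can be reproduced in isolation. The sketch below is a simplified stand-in, not the real harness code: this `enrolled_recipes` only lists subdirectories of a tests directory and swallows `OSError`, matching the behaviour described above. The function body and the one-directory-per-recipe layout are assumptions for illustration.

```python
import os
import tempfile


def enrolled_recipes(tests_dir):
    """Hypothetical stand-in for the harness helper (assumed layout: one dir per recipe).

    A missing tests dir raises OSError inside os.listdir; swallowing it means
    the caller cannot tell "nothing enrolled" from "tests/ is not there".
    """
    try:
        return sorted(
            name
            for name in os.listdir(tests_dir)
            if os.path.isdir(os.path.join(tests_dir, name))
        )
    except OSError:
        return []


with tempfile.TemporaryDirectory() as root:
    # A real checkout: tests/ exists and holds an enrolled recipe.
    os.makedirs(os.path.join(root, "checkout", "tests", "custom-html"))
    from_checkout = enrolled_recipes(os.path.join(root, "checkout", "tests"))

    # The packaged copy: runner/ only, so ROOT/tests does not exist.
    os.makedirs(os.path.join(root, "store", "runner"))
    from_store = enrolled_recipes(os.path.join(root, "store", "tests"))

print(from_checkout)  # ['custom-html']
print(from_store)     # []
```

The second call is the deployed timer's situation: the empty list is indistinguishable from a legitimately empty enrollment, which is why the sweep logged a clean no-op instead of failing.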
### How current code stays live on the host
- Normal recipe CI: Drone `exec` pipeline auto-clones cc-ci per build into its workspace, then runs
`cc-ci-run runner/run_recipe_ci.py` from that fresh clone → tests/runner always current.
- `/etc/cc-ci` is a **git clone** (the nixos flake source: `nixos-rebuild --flake /etc/cc-ci#…`).
It is currently STALE (`e60415d`, far behind main) because recent phases only touched `runner/`
(picked up by Drone's fresh clone) and needed no nixos-rebuild. The sweep is the first thing that
needs `/etc/cc-ci` current.
- Plan: sweep service sets `CCCI_REPO=/etc/cc-ci` and runs `nightly_sweep.py` FROM the checkout
(change the nix to exec `$CCCI_REPO/runner/nightly_sweep.py`, not the store copy) → after a deploy
that does `git -C /etc/cc-ci pull && nixos-rebuild`, the sweep reads current tests/ + runner. This
reuses the flake-source checkout (declarative, reproducible) rather than inventing a new clone.
### Promote path (the core, §2.A)
- `should_promote_canonical(recipe, ref, overall, quick)` = enrolled & green & cold(not quick) &
not-ref (no PR head). `promote_canonical` deploys `latest_version(recipe_tags(recipe))` (the latest
git tag) fresh/in-place, waits healthy, undeploys, `seed_canonical` (snapshot + write_registry).
- **Tagged-promote addition needed:** the green gate currently tests *whatever fetch_recipe checked
out* (catalogue `main` HEAD for a cold run), which can be untagged-ahead of the latest tag, while
promote always writes the latest TAG. Per operator: a canonical must only ever be a real release.
Add a `tagged` requirement: the tested head version (`abra.head_compose_version`, the compose
`version` label) must equal a published release tag (`recipe_tags`). When main HEAD == latest
release (the common just-cut case) head_version == latest tag → promote; when main is untagged-ahead
→ no promote.
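
A minimal sketch of the gate with the proposed `tagged` requirement, written as a pure function so the cases above can be checked directly. This is not the real `should_promote_canonical` signature: the parameter names, the boolean `green` in place of `overall`, and every version string other than `1.13.0+1.31.1` are assumptions for illustration.

```python
def should_promote(enrolled, green, quick, ref, head_version, release_tags):
    """Hypothetical pure form of the promote gate plus the `tagged` check.

    enrolled, green, quick: booleans.
    ref: PR head ref, or empty/None for a non-PR run.
    head_version: compose `version` label of the head that was tested.
    release_tags: published release tags for the recipe.
    """
    cold = not quick
    tagged = head_version in set(release_tags)
    return bool(enrolled and green and cold and not ref and tagged)


tags = ["1.12.0+1.31.0", "1.13.0+1.31.1"]  # illustrative tag list

# main HEAD == latest release: tested version is a published tag.
print(should_promote(True, True, False, "", "1.13.0+1.31.1", tags))  # True

# main is untagged-ahead: green, cold, no ref, but not a release.
print(should_promote(True, True, False, "", "1.14.0+1.32.0", tags))  # False

# PR run: never promotes, even on a tagged version.
print(should_promote(True, True, False, "pr-head", "1.13.0+1.31.1", tags))  # False
```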
### Trigger on a NEW RELEASE TAG (§2.D) + test the tag (not main)
- Version ordering is centralized in `warm_reconcile.version_key` / `latest_version` /
`newest_older_version` (already used by samever step-back). Reuse them.
- Trigger (pure, in the sweep, per recipe): after mirror-sync, `latest = latest_version(recipe_tags)`;
`canon = read_registry(recipe).version`. No tag → SKIP (never released). `latest <= canon` (by
version_key) → SKIP no-new-version (even if main has untagged commits — we compare tags not
commits). `latest > canon` → run cold on the tag.
- **Test the TAG cold:** to honour "run CI cold on that tagged version" (and so a green gate proves
the exact thing that gets promoted), check out the latest tag in `~/.abra/recipes/<recipe>` and run
with `CCCI_SKIP_FETCH=1` (the existing staging mechanism) → head_version = tag, head_ref = tag
commit, REF empty (so `not ref` still holds → promote allowed). The upgrade-base resolver then sees
canonical(older) < head(new tag), i.e. a real delta (samever step-back never fires: tag > canon by
construction).
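
The per-recipe trigger described above can be sketched as a pure decision function. The `version_key` here is a stand-in, not the one in `warm_reconcile`: it assumes versions look like `1.13.0+1.31.1` and orders them by the numeric fields of each `+`-separated part. Treating a missing canonical as "run" is also an assumption; the notes above do not say how a first run is handled.

```python
import re


def version_key(version):
    """Stand-in ordering key (assumption): numeric fields of each '+'-separated part."""
    return tuple(
        tuple(int(n) for n in re.findall(r"\d+", part))
        for part in version.split("+")
    )


def sweep_decision(tags, canon_version):
    """Return ("SKIP", reason) or ("RUN", tag_to_test) for one recipe.

    tags: published release tags after mirror-sync.
    canon_version: version recorded in the canonical registry, or None.
    """
    if not tags:
        return ("SKIP", "never released")
    latest = max(tags, key=version_key)
    if canon_version is not None and version_key(latest) <= version_key(canon_version):
        # Tags are compared, not commits: untagged work on main is ignored.
        return ("SKIP", "no new version")
    return ("RUN", latest)


print(sweep_decision([], "1.13.0+1.31.1"))
# ('SKIP', 'never released')
print(sweep_decision(["1.13.0+1.31.1"], "1.13.0+1.31.1"))
# ('SKIP', 'no new version')
print(sweep_decision(["1.9.0+1.30.0", "1.13.0+1.31.1"], "1.9.0+1.30.0"))
# ('RUN', '1.13.0+1.31.1')
```

Because a `RUN` result is only produced when the tag sorts strictly above the canonical, the version under test is always newer than the upgrade base on this path.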
### samever orthogonality (operator-required)
The release-tag trigger guarantees, in the sweep, version-under-test > canonical, so the upgrade
base is strictly older → `samever`'s same-version step-back never fires. (a) no new tag → SKIP, no
upgrade-tier run; (b) new tag → canonical(older)→new, real delta, promote. samever's same-version
behaviour stays owned by the samever phase on the PR path. Will demonstrate both in M2.
### Enroll-all set (§2.B)
Authoritative inventory = `cc-ci-plan/used-recipes.md` (21 rows: 20 `weekly` + `uptime-kuma`
`external`). NOT the test fixtures (custom-html-bkp-bad / -rst-bad, concurrency, regression,
_generic). custom-html-tiny IS in used-recipes (weekly) → enroll it too.
### Disk budget (§2.B watch-item)
Host `/`: 150G total, 104G used, **40G free (73% used)**. `du` of /var/lib/ci-warm today: custom-html 32K,
keycloak 159M. Retaining ~21 fresh-install data volumes should be a few GB; immich/matrix/mailu are
the ones to watch. Will measure during the M2 full sweep and record the real budget; raise the VM
disk (orchestrator) rather than silently drop recipes if it binds.
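
One way the M2 measurement could be recorded, as a hedged sketch: `tree_bytes` and `budget_report` are hypothetical helpers, not part of the runner. The sketch sums apparent file sizes (not allocated blocks, so it will differ slightly from `du`) and only sees what lives under the directory it is given; retained data volumes stored elsewhere would need their own path.

```python
import os
import shutil


def tree_bytes(root):
    """Sum apparent sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def budget_report(warm_root, mount="/"):
    """Per-recipe usage under warm_root plus free/total bytes of the mount."""
    usage = shutil.disk_usage(mount)
    per_recipe = {
        entry.name: tree_bytes(entry.path)
        for entry in os.scandir(warm_root)
        if entry.is_dir(follow_symlinks=False)
    }
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "per_recipe": per_recipe,
    }
```

Calling `budget_report("/var/lib/ci-warm")` before and after the full sweep would give the per-recipe growth and the remaining headroom in one record.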
### §2.G UPGRADE_BASE_VERSION retirement — gated on M2
`plausible` pins `UPGRADE_BASE_VERSION="3.0.1+v2.0.0"`; `bluesky-pds` only references it in a comment.
Retirement requires plausible's canonical to actually land at its latest green release so the dynamic
resolver picks the right base — so this is sequenced AFTER M2 promotes plausible. Keep the pin if
plausible can't go green dynamically (record why).