cc-ci-orchestrator/cc-ci-plan/plan-phase-canon-canonical-sweep.md
autonomic-bot ee8d30b43e plan(canon): retire UPGRADE_BASE_VERSION (gated) — plausible's pin becomes redundant under the dynamic canonical base
Operator 2026-06-17: UPGRADE_BASE_VERSION is still used (plausible pins 3.0.1+
v2.0.0 to dodge the broken 3.0.0 base; bluesky-pds references it as a future
re-enable). Once canon establishes plausible's canonical at 3.0.1, the dynamic
base resolves correctly without the pin -> strip the key (meta/resolver/docs/
tests) + migrate plausible + update bluesky-pds note. GATED: keep it if
plausible genuinely still needs the escape-hatch (never drop upgrade coverage).
2026-06-17 04:43:23 +00:00


Phase canon — make the canonical sweep actually work (the real "nightly sweep") + verify it

Mission (operator-specified 2026-06-17): the "nightly sweep" was specified in theory but never actually did anything. Confirmed live: nightly-sweep.timer is deployed and fires green (nightly_sweep.py, last run 2026-06-17 03:09 UTC, exit 0), but only custom-html is WARM_CANONICAL-enrolled and ZERO canonical.json records exist, i.e. the machinery has never actually promoted a canonical end-to-end. This phase makes it real and proven, as the substitute for that hollow nightly sweep, with the operator's refinements (2026-06-17):

  1. Sync each recipe mirror's main on git.autonomic.zone/recipe-maintainers/<recipe> to its upstream (git.coopcloud.tech/coop-cloud/<recipe>) first, so the sweep sees true upstream tags/latest.
  2. Trigger on a new RELEASED VERSION, not a new commit. Test a recipe only when its latest release tag on the synced main is newer than its current canonical version — skip when there is no new version, even if main has new untagged commits. The sweep tests releases, not arbitrary commits.
  3. Promote the canonical only to a TAGGED release. A canonical advances only to a version that has a real release tag (a published release) — never to an arbitrary untagged commit.

…then run CI cold-on-main for each recipe and actually promote the canonical for any that pass — and prove the whole thing works. The deliverable is correctness, verified end-to-end — and the operator specifically wants confidence it plays nicely with the samever upgrade-base work (§2 "Plays-nice-with-samever"). Operator decisions (2026-06-17): all recipes enrolled (§2.B), and the cadence is weekly (change the existing daily timer to weekly — a one-line OnCalendar tune; exact day/time is not critical). This REPLACES the hollow nightly sweep; it is not a parallel job.

State files: STATUS-canon.md, BACKLOG-canon.md, REVIEW-canon.md, JOURNAL-canon.md. DECISIONS.md shared.

1. Verified starting state (2026-06-17)

  • nightly-sweep.timer enabled + active (next ~03:00 UTC); nightly_sweep.py runs and exits 0. The timer/service plumbing already works — reuse it, don't rebuild it.
  • Only custom-html sets WARM_CANONICAL = True. The sweep iterates canonical.enrolled_recipes() → essentially one recipe → near-no-op across the fleet.
  • No canonical.json exists on the host → the promote path (should_promote_canonical → promote_canonical → write_registry) has never successfully produced a canonical, even for custom-html. This is the crux of "theory, not actually doing it."
  • The sweep does not reconcile mirrors to upstream, and does not skip-when-unchanged.

2. The work

A. Prove + fix the promote path FIRST (the core). On custom-html (already enrolled), make a green cold-on-latest run actually write canonical.json (recipe/version/commit/status) AND prove a subsequent --quick warm-reattach uses it (deploy_canonical reattaches the retained volume). If it doesn't happen today, find and fix why (this is the real defect behind the hollow sweep). A canonical must demonstrably exist and be reusable before anything else is meaningful.
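As a rough illustration of what "a canonical must demonstrably exist and be reusable" means at the file level, the sketch below writes and reads back a record with the four fields named above (recipe/version/commit/status). The helper names, the "green" status value, the example commit string, and the write-then-rename approach are all assumptions made for illustration; the real promote_canonical / write_registry code is not shown in this plan and may differ.

```python
import json
import os
from pathlib import Path


def write_canonical(path, recipe, version, commit, status="green"):
    """Write a canonical record via a temp file + rename (hypothetical helper)."""
    path = Path(path)
    record = {
        "recipe": recipe,
        "version": version,
        "commit": commit,
        "status": status,  # assumption: the real status vocabulary is not given here
    }
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(record, indent=2) + "\n")
    os.replace(tmp, path)  # rename so a reader never sees a half-written file
    return record


def read_canonical(path):
    """Return the record, or None when no canonical has ever been promoted."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

A warm-reattach check would then start from read_canonical(...) returning a record rather than None, which is exactly what the verified starting state says is missing today.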

  • Promote-gate addition (operator 2026-06-17): only promote to a TAGGED release. Extend should_promote_canonical so a promote ALSO requires the tested version to correspond to a published release tag (warm_reconcile.recipe_tags): green + cold + latest + enrolled + tagged. The canonical must always be a real release — never an arbitrary untagged main commit. An untagged state must never be written as a canonical.
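The five-way gate described above (green + cold + latest + enrolled + tagged) can be sketched as a pure predicate. This is a minimal model, not the real should_promote_canonical: the RunResult type, the promote_gate name, and the argument shapes are hypothetical, and release_tags merely stands in for whatever warm_reconcile.recipe_tags returns. The version strings in the usage are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one CI run."""
    recipe: str
    version: str     # version label that was tested
    green: bool      # every tier passed
    cold: bool       # cold deploy, not a warm reattach
    on_latest: bool  # tested the latest version, not an older one


def promote_gate(result, enrolled, release_tags):
    """True only when all five conditions hold; any single miss blocks the promote."""
    tagged = result.version in release_tags  # an untagged state can never match
    return (
        result.green
        and result.cold
        and result.on_latest
        and result.recipe in enrolled
        and tagged
    )


if __name__ == "__main__":
    tags = {"1.0.0+1.27", "1.1.0+1.28"}  # illustrative tags, not real releases
    good = RunResult("custom-html", "1.1.0+1.28", True, True, True)
    untagged = RunResult("custom-html", "main-abc123", True, True, True)
    print(promote_gate(good, {"custom-html"}, tags))      # True
    print(promote_gate(untagged, {"custom-html"}, tags))  # False
```

Keeping the gate a side-effect-free function makes the "untagged → no promote" and "RED → no promote" cases cheap to unit-test in isolation.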

B. Enroll ALL recipes (operator decision 2026-06-17). Set WARM_CANONICAL = True for every recipe cc-ci tracks (the used-recipes.md set) — the sweep promotes a canonical for each that passes, not just custom-html.

  • Watch the warm-volume disk budget: ~21 recipes each retaining a data volume on the single node is real disk. Verify headroom, lean on the existing WC8 disk-hygiene / ci-docker-prune, and if disk becomes the binding limit, raise it rather than silently dropping recipes (a fallback if needed: decouple the cheap last-green version record — kept for all — from the expensive retained volume). Default remains all-enrolled.
  • If a specific recipe genuinely cannot be enrolled (e.g. unbounded data, no stable health), record the exception + reason in DECISIONS — don't silently skip it.
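The disk-headroom check mentioned above could be as small as the following sketch. The function name, the per-volume estimate, and the reserve are assumptions; the plan gives no figures beyond "~21 recipes", so real numbers would have to come from measuring the retained volumes.

```python
import shutil


def has_headroom(path, recipe_count, per_volume_bytes, reserve_bytes):
    """Check that free space on `path` covers the retained volumes plus a reserve.

    All sizing inputs are caller-supplied estimates (hypothetical), not values
    taken from the plan.
    """
    free = shutil.disk_usage(path).free
    needed = recipe_count * per_volume_bytes + reserve_bytes
    return free >= needed
```

A False result here is the signal to raise the issue (or fall back to record-only canonicals), rather than to silently drop recipes from enrollment.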

C. Add the upstream mirror-sync step. Before the per-recipe CI, reconcile each mirror's main + tags to coopcloud upstream — reuse recipe-upgrade's open-recipe-pr.sh <recipe> --reconcile-only (handles go-git private-mirror auth, fetches coopcloud via an upstream remote, closes already-merged-upstream PRs, leaves unrelated PRs). This is a faithful mirror sync, not a push of our own changes.
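A minimal sketch of invoking the reconcile step once per recipe from the sweep follows. The script name and the --reconcile-only flag come from the text above; the script location, the function names, and the "record failure and continue" policy are assumptions. The runner is injectable so the loop can be unit-tested without touching any mirror.

```python
import subprocess
from pathlib import Path

# Assumption: the real location of recipe-upgrade's script is not given in this plan.
RECONCILE_SCRIPT = Path("open-recipe-pr.sh")


def reconcile_command(recipe, script=RECONCILE_SCRIPT):
    """Build the argv for a faithful mirror sync of one recipe."""
    return [str(script), recipe, "--reconcile-only"]


def sync_mirrors(recipes, run=subprocess.run):
    """Sync each mirror serially; return {recipe: True if the sync exited 0}."""
    results = {}
    for recipe in recipes:
        proc = run(reconcile_command(recipe), check=False)
        results[recipe] = proc.returncode == 0
    return results
```

A recipe whose sync fails would presumably be skipped for that sweep with a recorded reason, since its tags can no longer be trusted to reflect upstream; that policy is a suggestion, not something the plan specifies.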

D. Trigger on a new RELEASED VERSION (skip when no new version). After sync, compute the recipe's latest release tag version reachable on main and compare it to the canonical record's version:

  • latest release tag == canonical version → skip (SKIP no-new-version), even if main has new untagged commits. The sweep tests releases, not arbitrary commits.
  • latest release tag newer than canonical → run CI cold on that tagged version → promote on green (tagged, per §2.A).
  • no release tag at all (recipe never released) → skip with a recorded reason.

This is the operator's trigger refinement (version/tag-keyed, not commit-keyed) and the determinism property (M2 run-twice → everything skips).
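The three trigger cases above can be sketched as one decision function. Several things here are assumptions: the function names, that tags look like "3.0.1+v2.0.0" with a dotted-integer recipe version before the "+", that ordering uses only that recipe-version part, that a missing canonical means "run", and that a latest tag older than the canonical is treated as a skip. The input is assumed to be the list of release tags already filtered to those reachable on the synced main.

```python
import re

_TAG = re.compile(r"\d+(\.\d+)*(\+.*)?")


def version_key(label):
    """Sort key from the part before '+', e.g. "3.0.1+v2.0.0" -> (3, 0, 1)."""
    recipe_part = label.split("+", 1)[0]
    return tuple(int(p) for p in recipe_part.split("."))


def latest_release(tags):
    """Newest well-formed release tag, or None if the recipe was never released."""
    valid = [t for t in tags if _TAG.fullmatch(t)]
    return max(valid, key=version_key) if valid else None


def decide(tags, canonical_version):
    """Return ("RUN", tag) or ("SKIP", reason) for one recipe."""
    latest = latest_release(tags)
    if latest is None:
        return ("SKIP", "no-release-tag")
    if canonical_version is not None and version_key(latest) <= version_key(canonical_version):
        return ("SKIP", "no-new-version")
    return ("RUN", latest)


if __name__ == "__main__":
    tags = ["3.0.0+v2.0.0", "3.0.1+v2.0.0"]
    print(decide(tags, "3.0.1+v2.0.0"))  # ('SKIP', 'no-new-version')
    print(decide(tags, "3.0.0+v2.0.0"))  # ('RUN', '3.0.1+v2.0.0')
    print(decide([], None))              # ('SKIP', 'no-release-tag')
```

Because the decision reads only tags and the canonical record, never the commit on main, new untagged commits cannot change its output.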

E. Keep it deterministic + AI-free at runtime (it already is — a script + timer). The additions must stay pure code: no AI calls during the run. AI (the loops) only authors + verifies.

F. Make the timer weekly (operator preference): change the existing daily OnCalendar to weekly. The exact day/time is not critical — pick a low-traffic slot; it's a one-line tune. Persistent = true to catch up a missed run. This is the only schedule work; do not over-invest in it.

G. Retire UPGRADE_BASE_VERSION if plausible no longer needs it (operator 2026-06-17). Today it is still used: plausible sets UPGRADE_BASE_VERSION = "3.0.1+v2.0.0" (the old static [-2] default picked 3.0.0, whose clickhouse entrypoint 404s on amd64 → base never converges; the pin forces the newest published 3.0.1); bluesky-pds only references it (in an EXPECTED_NA upgrade-skip note as a future re-enable path). Once this phase enrolls plausible and promotes its canonical to its latest green release (3.0.1), the dynamic base resolves to 3.0.1 on its own — the correct base, avoiding the broken 3.0.0 — so the explicit pin becomes redundant. Therefore:

  • With plausible's canonical established at 3.0.1, remove the pin from tests/plausible/recipe_meta.py and confirm its upgrade tier still resolves the correct base (3.0.1) and passes under the dynamic resolver.
  • If that holds, strip UPGRADE_BASE_VERSION entirely: the meta key (runner/harness/meta.py KEYS), the override branch in resolve_upgrade_base (run_recipe_ci.py), the docs (recipe-customization.md §4/§5, testing.md), and the unit tests (test_meta.py, test_upgrade_base.py); and update bluesky-pds's comment so its re-enable path is the dynamic base, not the removed key.
  • GATE (do not force it): if plausible genuinely still can't get the right base dynamically (e.g. 3.0.1 itself won't cold-deploy green, so no canonical), KEEP UPGRADE_BASE_VERSION as the escape-hatch and record why in DECISIONS — never drop a recipe's upgrade coverage to delete a key.
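The redundancy argument and its gate can be modelled in a few lines. This is a deliberately simplified stand-in for resolve_upgrade_base, whose real signature and logic are not shown in this plan: it assumes the dynamic base is simply the canonical record's version, and that the record stores the same full label as the pin ("3.0.1+v2.0.0"). Both function names are hypothetical.

```python
def resolve_base_sketch(meta, canonical_version):
    """Simplified model: an explicit pin wins, otherwise use the canonical version."""
    pinned = meta.get("UPGRADE_BASE_VERSION")
    if pinned is not None:
        return pinned  # the override branch that retirement would remove
    return canonical_version  # dynamic base (assumed equal to the canonical)


def pin_is_redundant(meta, canonical_version):
    """True when dropping the pin would leave the resolved base unchanged."""
    pinned = meta.get("UPGRADE_BASE_VERSION")
    if pinned is None:
        return False
    without_pin = {k: v for k, v in meta.items() if k != "UPGRADE_BASE_VERSION"}
    return resolve_base_sketch(without_pin, canonical_version) == pinned


if __name__ == "__main__":
    plausible_meta = {"UPGRADE_BASE_VERSION": "3.0.1+v2.0.0"}
    print(pin_is_redundant(plausible_meta, "3.0.1+v2.0.0"))  # True: safe to remove
    print(pin_is_redundant(plausible_meta, None))            # False: keep the escape-hatch
```

The second call corresponds to the GATE case: with no canonical for plausible, removing the pin would change (or lose) the upgrade base, so the key stays.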

Plays-nice-with-samever (operator wants this CONFIRMED). The release-tag trigger (D) makes the sweep and samever orthogonal — confirm they don't interfere:

  • In the sweep, a recipe runs only when a new release tag exists, so the version under test is always newer than the canonical → the upgrade tier's base (previous canonical/released version) is strictly older → samever's same-version step-back never fires in the sweep (the tag trigger already prevents a vX→vX run; no-new-version recipes are skipped outright).
  • samever remains the guard for the PR path (!testme), where a PR can carry the same version label as the canonical without cutting a release — that's where the step-back matters, and it's owned/proven by the samever phase, not here. So in the sweep, verify only: (a) no new release tag → recipe SKIPPED (no upgrade-tier run, no promote); (b) new release tag → canonical(older) → new tagged version, a real delta, promote (tagged). The sweep must never promote an untagged version and never run a same-version upgrade.

3. Gates

M1 — machinery works locally, each piece proven. (A) a real canonical.json is produced by a green cold run on ≥1 recipe and reused by a warm reattach — demonstrated, not assumed — and the promote gate now also requires a release tag (untagged → no promote). (C) mirror-sync and (D) the new-release-tag trigger implemented, reusing the existing reconcile + sweep code, with unit tests (trigger = latest release tag vs canonical version, NOT commit; sync invoked per recipe; promote gated on green+cold+latest+enrolled+tagged). (B) all recipes enrolled. Adversary cold-verifies: a canonical actually exists + reattaches; an untagged state never promotes; the trigger skips no-new-tag recipes and runs new-tag ones; sync is faithful-mirror-only; a RED recipe does NOT promote; no AI at runtime.

M2 — proven end-to-end in real CI (the heart of this phase). A full sweep run across the enrolled set on cc-ci: mirrors synced to upstream, canonicals actually promoted for the green recipes (records exist with correct version+commit), red recipes left intact, unchanged recipes skipped — with a per-recipe results log. Determinism proof: run the sweep a SECOND time immediately → it SKIPS every recipe (latest release tag == canonical version for all → skip all) = a clean no-op, no CI rerun. Confirm the deployed timer fires the real (non-hollow) job — after a fire, canonicals have advanced (evidence), not exit-0 on an empty set. Tagged-promote proven: show a green run on an untagged state does NOT promote, and a green run on a tagged release DOES.
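The run-twice determinism proof can be rehearsed against an in-memory model before doing it on cc-ci. In the sketch below the sweep function, the log strings, and the dict-based registry are all hypothetical; the latest tag per recipe is assumed to be precomputed, and "new version" is simplified to "latest differs from the canonical".

```python
def sweep(latest_tags, registry, run_ci):
    """One serial sweep pass; mutates `registry` and returns a per-recipe log."""
    log = {}
    for recipe in sorted(latest_tags):  # stable order keeps the log reproducible
        latest = latest_tags[recipe]
        current = registry.get(recipe)
        if latest is None:
            log[recipe] = "SKIP no-release-tag"
        elif latest == current:
            log[recipe] = "SKIP no-new-version"
        elif run_ci(recipe, latest):
            registry[recipe] = latest  # promote only after a green run
            log[recipe] = f"PROMOTED {latest}"
        else:
            log[recipe] = "RED kept-prior"
    return log


if __name__ == "__main__":
    latest = {"a": "1.1.0", "b": "2.0.0", "c": None}  # illustrative versions
    registry = {"a": "1.0.0"}
    always_green = lambda recipe, version: True
    print(sweep(latest, registry, always_green))
    print(sweep(latest, registry, always_green))  # second pass: every entry is a SKIP
```

Note that in this model a recipe whose run was red keeps its old canonical, so its latest tag still differs and it would be retried on the next pass; the all-skip second run holds only when every tested recipe went green, or if failed versions are also recorded somewhere.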

samever orthogonality proven (operator-required). Demonstrate, with evidence, the two sweep paths: (1) no new release tag (latest tag == canonical version, even with new untagged commits on main) → recipe SKIPPED — no upgrade-tier run, no promote; (2) new release tag → cold-test the new tagged version, upgrade canonical(older) → new, a real delta, promote (because tagged). Confirm samever's step-back never fires inside the sweep (the tag trigger prevents same-version runs) — its same-version behavior is owned/proven by the samever phase on the PR path. Construct scenarios if the live recipe set doesn't cover both.

No AI in the loop. Fresh Adversary PASS on both milestones → ## DONE.

4. Guardrails

  • Correctness over cadence. The bar is that the machinery demonstrably promotes canonicals, syncs mirrors, skips unchanged recipes, and plays nicely with samever. The cadence is decided (weekly): set it in one OnCalendar line and move on; don't agonize over the exact slot.
  • No AI at runtime — pure script + systemd timer; AI only builds/verifies.
  • Single-node safety: serial; skip the whole run if a Drone/test build is in flight (reuse the existing nightly guard); tear down every deploy; bound total runtime; mind the warm-volume disk budget.
  • Never force-promote / never weaken: promote only on green-cold-latest-enrolled-tagged (the gate in §2.A); a red recipe keeps its prior known-good. Never weaken a test to make a recipe promote.
  • Faithful mirror sync only: force-sync main/tags to coopcloud upstream; never push our own changes to mirror main; never merge/disturb unrelated PRs.
  • Nix/host changes (enrollment is recipe-meta; any timer/module tweak is a nixos-rebuild): loops may deploy if clean and verify host health after; else file for the orchestrator. Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push every commit; abra over a pseudo-TTY.
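The single-node safety guardrail above (serial, skip the whole run if a build is in flight, tear down every deploy, bound total runtime) could take roughly the following shape. Every name here is hypothetical, and the callables stand in for the existing nightly guard, the per-recipe CI, and the teardown, none of which are shown in this plan.

```python
import time


def run_serial(recipes, test_one, teardown, build_in_flight, budget_s, clock=time.monotonic):
    """Run recipes one at a time under a total time budget (illustrative sketch)."""
    if build_in_flight():
        # Mirrors the existing nightly guard: do nothing rather than contend.
        return {"skipped_run": "build-in-flight"}
    deadline = clock() + budget_s
    outcome = {}
    for recipe in recipes:
        if clock() >= deadline:
            outcome[recipe] = "not-run (budget exhausted)"
            continue
        try:
            outcome[recipe] = "green" if test_one(recipe) else "red"
        finally:
            teardown(recipe)  # runs even when test_one raises
    return outcome
```

The budget is only checked between recipes, so a single long-running recipe can still overshoot it; a hard per-recipe timeout would be a separate mechanism.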

5. Definition of Done

The canonical sweep actually works and is proven: a green cold run on a tagged release produces a real, reusable canonical.json (an untagged state never promotes); the sweep reconciles each recipe mirror's main to upstream, skips recipes with no new release tag (even if main has new untagged commits), runs CI cold on the new tagged version for the rest, and promotes the canonical only to that tagged release — across all enrolled recipes, demonstrated end-to-end in CI, including the run-twice no-op determinism proof, the tagged-promote proof, and a real (non-hollow) timer fire. samever confirmed orthogonal (never fires in the sweep). All recipes enrolled + warm-volume budget recorded. UPGRADE_BASE_VERSION retired (key + resolver branch + docs + tests removed, plausible migrated to the dynamic base) if plausible works without it — else kept with a recorded reason (§2.G). The runtime job is AI-free; it is the substitute for the hollow nightly sweep (not a parallel job). M1 + M2 fresh Adversary PASSes in REVIEW-canon.md.