cc-ci-orchestrator/cc-ci-plan/plan-phase-canon-canonical-sweep.md
autonomic-bot ee8d30b43e plan(canon): retire UPGRADE_BASE_VERSION (gated) — plausible's pin becomes redundant under the dynamic canonical base
Operator 2026-06-17: UPGRADE_BASE_VERSION is still used (plausible pins 3.0.1+
v2.0.0 to dodge the broken 3.0.0 base; bluesky-pds references it as a future
re-enable). Once canon establishes plausible's canonical at 3.0.1, the dynamic
base resolves correctly without the pin -> strip the key (meta/resolver/docs/
tests) + migrate plausible + update bluesky-pds note. GATED: keep it if
plausible genuinely still needs the escape-hatch (never drop upgrade coverage).
2026-06-17 04:43:23 +00:00


# Phase `canon` — make the canonical sweep actually work (the real "nightly sweep") + verify it
**Mission (operator-specified 2026-06-17):** the "nightly sweep" was specified in theory but **was never
actually doing anything** — confirmed live: `nightly-sweep.timer` is deployed and fires green
(`nightly_sweep.py`, last run 2026-06-17 03:09 UTC exit 0), but **only `custom-html` is `WARM_CANONICAL`-enrolled
and ZERO `canonical.json` records exist**: the machinery has **never actually promoted a
canonical end-to-end**. This phase makes it **real and proven**, as the **substitute for** that hollow
canonical end-to-end**. This phase makes it **real and proven**, as the **substitute for** that hollow
nightly sweep, with the operator's refinements (2026-06-17):
1. **Sync each recipe mirror's `main`** on `git.autonomic.zone/recipe-maintainers/<recipe>` to its
**upstream** (`git.coopcloud.tech/coop-cloud/<recipe>`) first, so the sweep sees true upstream
tags/latest.
2. **Trigger on a new RELEASED VERSION, not a new commit.** Test a recipe only when its latest **release
tag** on the synced `main` is **newer** than its current canonical version — **skip when there is no
new version**, even if `main` has new *untagged* commits. The sweep tests releases, not arbitrary commits.
3. **Promote the canonical only to a TAGGED release.** A canonical advances only to a version that has a
real release tag (a published release) — never to an arbitrary untagged commit.
…then **run CI cold-on-`main` for each recipe and actually promote the canonical for any that pass**
and **prove the whole thing works**. **The deliverable is correctness, verified end-to-end** — and the
operator specifically wants confidence it **plays nicely with the `samever` upgrade-base work** (§2
"Plays-nice-with-samever"). Operator decisions (2026-06-17): **all recipes enrolled** (§2.B), and the
**cadence is weekly** (change the existing daily timer to weekly — a one-line `OnCalendar` tune; exact
day/time is not critical). This REPLACES the hollow nightly sweep; it is not a parallel job.
State files: `STATUS-canon.md`, `BACKLOG-canon.md`, `REVIEW-canon.md`, `JOURNAL-canon.md`. DECISIONS.md shared.
## 1. Verified starting state (2026-06-17)
- `nightly-sweep.timer` enabled + active (next ~03:00 UTC); `nightly_sweep.py` runs and exits 0. The
timer/service plumbing already works — **reuse it, don't rebuild it.**
- **Only `custom-html` sets `WARM_CANONICAL = True`.** The sweep iterates `canonical.enrolled_recipes()`
→ essentially one recipe → near-no-op across the fleet.
- **No `canonical.json` exists** on the host → the promote path (`should_promote_canonical` →
`promote_canonical` → `write_registry`) has **never successfully produced a canonical**, even for
custom-html. This is the crux of "theory, not actually doing it."
- The sweep does **not** reconcile mirrors to upstream, and does **not** skip-when-unchanged.
## 2. The work
**A. Prove + fix the promote path FIRST (the core).** On `custom-html` (already enrolled), make a green
cold-on-latest run **actually write `canonical.json`** (recipe/version/commit/status) AND prove a
subsequent `--quick` warm-reattach uses it (`deploy_canonical` reattaches the retained volume). If it
doesn't happen today, find and fix why (this is the real defect behind the hollow sweep). A canonical
must demonstrably exist and be reusable before anything else is meaningful.
- **Promote-gate addition (operator 2026-06-17): only promote to a TAGGED release.** Extend
`should_promote_canonical` so a promote ALSO requires the tested version to correspond to a published
**release tag** (`warm_reconcile.recipe_tags`): green + cold + latest + enrolled **+ tagged**. The
canonical must always be a real release — never an arbitrary untagged `main` commit. An untagged state
must never be written as a canonical.
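
The extended gate can be pictured as a single boolean conjunction. This is a minimal sketch, assuming a hypothetical `RunResult` record and a `release_tags` collection standing in for whatever `warm_reconcile.recipe_tags` returns. The real `should_promote_canonical` signature is not shown in this plan and may differ, and the version strings in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Collection


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one CI run; field names are illustrative only."""
    recipe: str
    version: str     # version label that was tested
    green: bool      # every tier passed
    cold: bool       # deployed from scratch, not warm-reattached
    is_latest: bool  # tested the newest version on the synced main
    enrolled: bool   # recipe sets WARM_CANONICAL = True


def should_promote(result: RunResult, release_tags: Collection[str]) -> bool:
    """Promote only when every existing gate holds AND the version is tagged."""
    tagged = result.version in release_tags
    return (result.green and result.cold and result.is_latest
            and result.enrolled and tagged)
```

With `RunResult("custom-html", "1.0.0+x", True, True, True, True)`, the call returns `True` when `"1.0.0+x"` is among the release tags and `False` when the tag collection is empty, which is the "untagged state never promotes" property M1 asks the Adversary to check.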
**B. Enroll ALL recipes (operator decision 2026-06-17).** Set `WARM_CANONICAL = True` for **every** recipe
cc-ci tracks (the `used-recipes.md` set) — the sweep promotes a canonical for each that passes, not just
custom-html.
- **Watch the warm-volume disk budget:** ~21 recipes each retaining a data volume on the single node is
real disk. Verify headroom, lean on the existing WC8 disk-hygiene / `ci-docker-prune`, and if disk
becomes the binding limit, **raise it** rather than silently dropping recipes (a fallback if needed:
decouple the cheap last-green *version record* — kept for all — from the expensive retained *volume*).
Default remains all-enrolled.
- If a specific recipe genuinely cannot be enrolled (e.g. unbounded data, no stable health), record the
exception + reason in DECISIONS — don't silently skip it.
**C. Add the upstream mirror-sync step.** Before the per-recipe CI, reconcile each mirror's `main` + tags
to coopcloud upstream — reuse `recipe-upgrade`'s `open-recipe-pr.sh <recipe> --reconcile-only` (handles
go-git private-mirror auth, fetches coopcloud via an `upstream` remote, closes already-merged-upstream
PRs, leaves unrelated PRs). This is a **faithful mirror sync, not a push of our own changes.**
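
One way the sweep could drive the per-recipe sync is sketched below. It is a rough illustration, assuming `open-recipe-pr.sh` is on `PATH` and signals failure through its exit code. The helper names `reconcile_command` and `sync_mirrors` are hypothetical, and the plan does not say how a failed sync should be handled, so the sketch only reports it.

```python
import subprocess
from typing import Callable, Iterable, List


def reconcile_command(recipe: str, script: str = "open-recipe-pr.sh") -> List[str]:
    """Build the reconcile-only invocation named in the plan."""
    return [script, recipe, "--reconcile-only"]


def sync_mirrors(recipes: Iterable[str],
                 run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run
                 ) -> List[str]:
    """Sync each mirror serially; return the recipes whose sync exited non-zero."""
    failed = []
    for recipe in recipes:
        proc = run(reconcile_command(recipe), check=False)
        if proc.returncode != 0:
            failed.append(recipe)  # assumption: caller decides whether to skip CI
    return failed
```

Passing `run` as a parameter keeps the loop unit-testable without touching the network, which fits the M1 requirement that "sync invoked per recipe" is covered by unit tests.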
**D. Trigger on a new RELEASED VERSION (skip when no new version).** After sync, compute the recipe's
latest **release tag** version reachable on `main` and compare it to the canonical record's version:
- latest release tag **== canonical version** → **skip** (`SKIP no-new-version`) — *even if `main` has
new untagged commits*. The sweep tests releases, not arbitrary commits.
- latest release tag **newer than** canonical → run CI **cold on that tagged version** → promote on green
(tagged, per §2.A).
- no release tag at all (recipe never released) → skip with a recorded reason.
This is the operator's trigger refinement (version/tag-keyed, **not** commit-keyed) and the determinism
property (M2 run-twice → everything skips).
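
The three cases above reduce to a small decision function. The following is a sketch under stated assumptions: versions look like `3.0.1+v2.0.0` and are ordered by the dotted numeric part before the `+`; a recipe with no canonical record yet is treated as "run" (the plan implies this, since no canonicals exist today, but does not spell it out); and the `SKIP no-release-tag` reason string is invented here, whereas `SKIP no-new-version` comes from the plan.

```python
from typing import Optional, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Order by the recipe part before '+', e.g. '3.0.1+v2.0.0' -> (3, 0, 1)."""
    recipe_part = tag.split("+", 1)[0]
    return tuple(int(piece) for piece in recipe_part.split("."))


def decide(latest_tag: Optional[str], canonical_version: Optional[str]) -> str:
    """Version-keyed trigger: untagged commits on main never enter the decision."""
    if latest_tag is None:
        return "SKIP no-release-tag"   # recipe never released
    if canonical_version is None:
        return f"RUN cold {latest_tag}"  # assumption: first canonical for this recipe
    if version_key(latest_tag) > version_key(canonical_version):
        return f"RUN cold {latest_tag}"
    return "SKIP no-new-version"       # equal (or, unexpectedly, older)
```

Because the function takes only the latest release tag and the canonical version, new untagged commits on `main` cannot change its result, which is the property that makes an immediate second run a no-op.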
**E. Keep it deterministic + AI-free at runtime** (it already is — a script + timer). The additions must
stay pure code: no AI calls during the run. AI (the loops) only authors + verifies.
**F. Make the timer weekly** (operator preference): change the existing daily `OnCalendar` to weekly. The
exact day/time is not critical — pick a low-traffic slot; it's a one-line tune. `Persistent = true` to
catch up a missed run. This is the only schedule work; do not over-invest in it.
**G. Retire `UPGRADE_BASE_VERSION` if plausible no longer needs it (operator 2026-06-17).** Today it is
still used: **`plausible`** sets `UPGRADE_BASE_VERSION = "3.0.1+v2.0.0"` (the old static `[-2]` default
picked `3.0.0`, whose clickhouse entrypoint 404s on amd64 → base never converges; the pin forces the
newest published `3.0.1`); **`bluesky-pds`** only *references* it (in an `EXPECTED_NA` upgrade-skip note as
a future re-enable path). Once this phase enrolls plausible and promotes its canonical to its latest green
release (`3.0.1`), the **dynamic base resolves to `3.0.1` on its own** — the correct base, avoiding the
broken `3.0.0` — so the explicit pin becomes redundant. Therefore:
- With plausible's canonical established at `3.0.1`, **remove the pin from `tests/plausible/recipe_meta.py`
and confirm its upgrade tier still resolves the correct base (`3.0.1`) and passes** under the dynamic
resolver.
- If that holds, **strip `UPGRADE_BASE_VERSION` entirely**: the meta key (`runner/harness/meta.py` KEYS),
the override branch in `resolve_upgrade_base` (`run_recipe_ci.py`), the docs (`recipe-customization.md`
§4/§5, `testing.md`), and the unit tests (`test_meta.py`, `test_upgrade_base.py`); and update
`bluesky-pds`'s comment so its re-enable path is the dynamic base, not the removed key.
- **GATE (do not force it):** if plausible genuinely still can't get the right base dynamically (e.g.
`3.0.1` itself won't cold-deploy green, so no canonical), **KEEP `UPGRADE_BASE_VERSION`** as the
escape-hatch and record why in DECISIONS — never drop a recipe's upgrade coverage to delete a key.
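
The gate in the bullets above amounts to checking that the dynamic base lands on the same version the pin forces. A hedged sketch follows, in which `resolve_base` is a simplified stand-in for the real `resolve_upgrade_base` in `run_recipe_ci.py` (whose actual signature is not shown in this plan) and `pin_is_redundant` is a hypothetical helper. It covers only the resolution half of the gate; the upgrade tier still has to pass under the dynamic resolver.

```python
from typing import Optional


def resolve_base(canonical_version: Optional[str],
                 pinned: Optional[str] = None) -> Optional[str]:
    """Pin wins when set (the override branch §2.G would delete); else canonical."""
    if pinned is not None:
        return pinned
    return canonical_version


def pin_is_redundant(pinned: str, canonical_version: Optional[str]) -> bool:
    """True when dropping the pin would leave the resolved base unchanged."""
    if canonical_version is None:
        return False  # no canonical yet: keep the escape hatch
    return resolve_base(canonical_version) == pinned
```

For plausible, `pin_is_redundant("3.0.1+v2.0.0", canonical)` becomes true only once a canonical with exactly that version has been promoted; until then the pin stays, in line with the GATE bullet.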
**Plays-nice-with-`samever` (operator wants this CONFIRMED).** The release-tag trigger (D) makes the
sweep and `samever` **orthogonal** — confirm they don't interfere:
- In the **sweep**, a recipe runs **only when a new release tag exists**, so the version under test is
always *newer* than the canonical → the upgrade tier's base (previous canonical/released version) is
strictly older → **`samever`'s same-version step-back never fires in the sweep** (the tag trigger
already prevents a `vX→vX` run; no-new-version recipes are skipped outright).
- `samever` remains the guard for the **PR path** (`!testme`), where a PR can carry the same version
label as the canonical without cutting a release — that's where the step-back matters, and it's
owned/proven by the `samever` phase, not here.
So in the sweep, verify only: (a) no new release tag → recipe SKIPPED (no upgrade-tier run, no promote);
(b) new release tag → `canonical(older) → new tagged version`, a real delta, promote (tagged). The sweep
must never promote an untagged version and never run a same-version upgrade.
## 3. Gates
**M1 — machinery works locally, each piece proven.** (A) a real `canonical.json` is produced by a green
cold run on ≥1 recipe and reused by a warm reattach — **demonstrated, not assumed** — and the promote gate
now also requires a **release tag** (untagged → no promote). (C) mirror-sync and (D) the **new-release-tag
trigger** implemented, reusing the existing reconcile + sweep code, with unit tests (trigger = latest
release tag vs canonical version, NOT commit; sync invoked per recipe; promote gated on
green+cold+latest+enrolled+**tagged**). (B) all recipes enrolled. Adversary cold-verifies: a canonical
actually exists + reattaches; an **untagged** state never promotes; the trigger skips no-new-tag recipes
and runs new-tag ones; sync is faithful-mirror-only; a RED recipe does NOT promote; no AI at runtime.
**M2 — proven end-to-end in real CI (the heart of this phase).** A full sweep run across the enrolled set
on cc-ci: mirrors synced to upstream, **canonicals actually promoted for the green recipes** (records
exist with correct version+commit), red recipes left intact, unchanged recipes skipped — with a
per-recipe results log. **Determinism proof: run the sweep a SECOND time immediately → it SKIPS every
recipe** (latest release tag == canonical version for all → skip all) = a clean no-op, no CI rerun.
Confirm the **deployed timer fires the real (non-hollow) job** — after a fire, canonicals have advanced
(evidence), not exit-0 on an empty set. **Tagged-promote proven:** show a green run on an untagged state
does NOT promote, and a green run on a tagged release DOES.
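
The run-twice proof can be rehearsed against an in-memory model before it is attempted on cc-ci. This is a toy sketch, not the real sweep: `sweep`, the dict-based `registry`, and the `ci_passes` callback are all hypothetical, the recipe names and versions are made up, and equality stands in for the "newer than" comparison.

```python
from typing import Callable, Dict, Optional


def sweep(latest_tags: Dict[str, Optional[str]],
          registry: Dict[str, str],
          ci_passes: Callable[[str, str], bool]) -> Dict[str, str]:
    """One serial pass; mutates `registry` (a stand-in for canonical.json records)."""
    log = {}
    for recipe in sorted(latest_tags):
        tag = latest_tags[recipe]
        if tag is None:
            log[recipe] = "SKIP no-release-tag"
        elif registry.get(recipe) == tag:
            log[recipe] = "SKIP no-new-version"
        elif ci_passes(recipe, tag):
            registry[recipe] = tag           # promote: canonical advances
            log[recipe] = f"PROMOTE {tag}"
        else:
            log[recipe] = "RED keep-prior"   # prior canonical left intact
    return log


registry: Dict[str, str] = {}
tags = {"alpha": "1.0.0+x", "beta": None}
first = sweep(tags, registry, lambda recipe, tag: True)
second = sweep(tags, registry, lambda recipe, tag: True)
```

Here `first` promotes `alpha` and `second` contains only `SKIP` entries. Note that in this model a recipe that stayed red is retried on the second pass, because its canonical did not advance; whether the real sweep should behave that way is not settled by this plan.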
**`samever` orthogonality proven (operator-required).** Demonstrate, with evidence, the two sweep paths:
(1) **no new release tag** (latest tag == canonical version, even with new untagged commits on `main`) →
recipe SKIPPED — no upgrade-tier run, no promote; (2) **new release tag** → cold-test the new tagged
version, upgrade `canonical(older) → new`, a real delta, promote (because tagged). Confirm `samever`'s
step-back **never fires inside the sweep** (the tag trigger prevents same-version runs) — its same-version
behavior is owned/proven by the `samever` phase on the PR path. Construct scenarios if the live recipe set
doesn't cover both.
No AI in the loop. Fresh Adversary PASS on both milestones → `## DONE`.
## 4. Guardrails
- **Correctness over cadence.** The bar is the machinery *demonstrably promotes canonicals, syncs mirrors,
skips unchanged, and plays nicely with `samever`.* The cadence is decided (**weekly**) — set it in one
`OnCalendar` line and move on; don't agonize over the exact slot.
- **No AI at runtime** — pure script + systemd timer; AI only builds/verifies.
- **Single-node safety:** serial; skip the whole run if a Drone/test build is in flight (reuse the
existing nightly guard); tear down every deploy; bound total runtime; mind the warm-volume disk budget.
- **Never force-promote / never weaken:** promote only on green-cold-latest-enrolled-tagged (§2.A); a red
recipe keeps its prior known-good. Never weaken a test to make a recipe promote.
- **Faithful mirror sync only:** force-sync `main`/tags to coopcloud upstream; never push our own changes
to mirror `main`; never merge/disturb unrelated PRs.
- **Nix/host changes** (enrollment is recipe-meta; any timer/module tweak is a nixos-rebuild): loops may
deploy if clean and **verify host health after**; else file for the orchestrator. Commit author
`autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit; abra over a pseudo-TTY.
## 5. Definition of Done
The canonical sweep **actually works and is proven**: a green cold run on a **tagged release** produces a
real, reusable `canonical.json` (an untagged state never promotes); the sweep reconciles each recipe
mirror's `main` to upstream, **skips recipes with no new release tag** (even if `main` has new untagged
commits), runs CI cold on the new tagged version for the rest, and promotes the canonical only to that
tagged release — across all enrolled recipes, demonstrated end-to-end in CI, including the run-twice no-op
determinism proof, the tagged-promote proof, and a real (non-hollow) timer fire. `samever` confirmed
orthogonal (never fires in the sweep). All recipes enrolled + warm-volume budget recorded. `UPGRADE_BASE_VERSION`
retired (key + resolver branch + docs + tests removed, plausible migrated to the dynamic base) **if**
plausible works without it — else kept with a recorded reason (§2.G). The runtime job is AI-free; it is the
substitute for the hollow nightly sweep (not a parallel job). M1 + M2 fresh Adversary PASSes in REVIEW-canon.md.