Operator 2026-06-17: UPGRADE_BASE_VERSION is still used (plausible pins 3.0.1+v2.0.0 to dodge the broken 3.0.0 base; bluesky-pds references it as a future re-enable). Once canon establishes plausible's canonical at 3.0.1, the dynamic base resolves correctly without the pin -> strip the key (meta/resolver/docs/tests) + migrate plausible + update bluesky-pds note. GATED: keep it if plausible genuinely still needs the escape-hatch (never drop upgrade coverage).
# Phase `canon` — make the canonical sweep actually work (the real "nightly sweep") + verify it

**Mission (operator-specified 2026-06-17):** the "nightly sweep" was specified in theory but **was never
actually doing anything** — confirmed live: `nightly-sweep.timer` is deployed and fires green
(`nightly_sweep.py`, last run 2026-06-17 03:09 UTC exit 0), but **only `custom-html` is
`WARM_CANONICAL`-enrolled and ZERO `canonical.json` records exist** — i.e. the machinery has **never actually promoted a
canonical end-to-end**. This phase makes it **real and proven**, as the **substitute for** that hollow
nightly sweep, with the operator's refinements (2026-06-17):

1. **Sync each recipe mirror's `main`** on `git.autonomic.zone/recipe-maintainers/<recipe>` to its
   **upstream** (`git.coopcloud.tech/coop-cloud/<recipe>`) first, so the sweep sees true upstream
   tags/latest.
2. **Trigger on a new RELEASED VERSION, not a new commit.** Test a recipe only when its latest **release
   tag** on the synced `main` is **newer** than its current canonical version — **skip when there is no
   new version**, even if `main` has new *untagged* commits. The sweep tests releases, not arbitrary commits.
3. **Promote the canonical only to a TAGGED release.** A canonical advances only to a version that has a
   real release tag (a published release) — never to an arbitrary untagged commit.

…then **run CI cold-on-`main` for each recipe and actually promote the canonical for any that pass** —
and **prove the whole thing works**. **The deliverable is correctness, verified end-to-end** — and the
operator specifically wants confidence it **plays nicely with the `samever` upgrade-base work** (§2
"Plays-nice-with-samever"). Operator decisions (2026-06-17): **all recipes enrolled** (§2.B), and the
**cadence is weekly** (change the existing daily timer to weekly — a one-line `OnCalendar` tune; exact
day/time is not critical). This REPLACES the hollow nightly sweep; it is not a parallel job.

State files: `STATUS-canon.md`, `BACKLOG-canon.md`, `REVIEW-canon.md`, `JOURNAL-canon.md`. DECISIONS.md shared.

## 1. Verified starting state (2026-06-17)

- `nightly-sweep.timer` enabled + active (next ~03:00 UTC); `nightly_sweep.py` runs and exits 0. The
  timer/service plumbing already works — **reuse it, don't rebuild it.**
- **Only `custom-html` sets `WARM_CANONICAL = True`.** The sweep iterates `canonical.enrolled_recipes()`
  → essentially one recipe → near-no-op across the fleet.
- **No `canonical.json` exists** on the host → the promote path (`should_promote_canonical` →
  `promote_canonical` → `write_registry`) has **never successfully produced a canonical**, even for
  custom-html. This is the crux of "theory, not actually doing it."
- The sweep does **not** reconcile mirrors to upstream, and does **not** skip-when-unchanged.

## 2. The work

**A. Prove + fix the promote path FIRST (the core).** On `custom-html` (already enrolled), make a green
cold-on-latest run **actually write `canonical.json`** (recipe/version/commit/status) AND prove a
subsequent `--quick` warm-reattach uses it (`deploy_canonical` reattaches the retained volume). If it
doesn't happen today, find and fix why (this is the real defect behind the hollow sweep). A canonical
must demonstrably exist and be reusable before anything else is meaningful.
- **Promote-gate addition (operator 2026-06-17): only promote to a TAGGED release.** Extend
  `should_promote_canonical` so a promote ALSO requires the tested version to correspond to a published
  **release tag** (`warm_reconcile.recipe_tags`): green + cold + latest + enrolled **+ tagged**. The
  canonical must always be a real release — never an arbitrary untagged `main` commit. An untagged state
  must never be written as a canonical.

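The five-part promote gate in §2.A can be sketched as a pure predicate. This is a minimal illustration, not the actual `should_promote_canonical`: the `RunResult` type, its field names, and the `release_tags` argument (standing in for whatever `warm_reconcile.recipe_tags` returns) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one CI run; field names are illustrative."""
    recipe: str
    version: str    # version label that was tested, e.g. "3.0.1+v2.0.0"
    green: bool     # every tier passed
    cold: bool      # fresh deploy, not a warm reattach
    latest: bool    # the newest version was tested, not a pinned older one
    enrolled: bool  # recipe sets WARM_CANONICAL = True


def should_promote(run: RunResult, release_tags: set[str]) -> bool:
    """Promote only when green + cold + latest + enrolled + tagged all hold."""
    tagged = run.version in release_tags  # an untagged state never promotes
    return run.green and run.cold and run.latest and run.enrolled and tagged
```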
**B. Enroll ALL recipes (operator decision 2026-06-17).** Set `WARM_CANONICAL = True` for **every** recipe
cc-ci tracks (the `used-recipes.md` set) — the sweep promotes a canonical for each that passes, not just
custom-html.
- **Watch the warm-volume disk budget:** ~21 recipes each retaining a data volume on the single node is
  real disk. Verify headroom, lean on the existing WC8 disk-hygiene / `ci-docker-prune`, and if disk
  becomes the binding limit, **raise it** rather than silently dropping recipes (a fallback if needed:
  decouple the cheap last-green *version record* — kept for all — from the expensive retained *volume*).
  Default remains all-enrolled.
- If a specific recipe genuinely cannot be enrolled (e.g. unbounded data, no stable health), record the
  exception + reason in DECISIONS — don't silently skip it.

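One way to make the "verify headroom" step in §2.B concrete is a pre-flight check on the filesystem that holds the retained volumes. The sketch below uses only the standard library; the 20% threshold and both function names are assumptions, not values or helpers from cc-ci, and the path to check is left to the caller.

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the filesystem that is still free (0.0 to 1.0)."""
    return free_bytes / total_bytes if total_bytes else 0.0


def has_headroom(path: str, min_free: float = 0.20) -> bool:
    """True when the filesystem holding `path` has at least `min_free` free.

    `min_free` is an illustrative default, not a cc-ci setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return free_fraction(usage.total, usage.free) >= min_free
```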
**C. Add the upstream mirror-sync step.** Before the per-recipe CI, reconcile each mirror's `main` + tags
to coopcloud upstream — reuse `recipe-upgrade`'s `open-recipe-pr.sh <recipe> --reconcile-only` (handles
go-git private-mirror auth, fetches coopcloud via an `upstream` remote, closes already-merged-upstream
PRs, leaves unrelated PRs). This is a **faithful mirror sync, not a push of our own changes.**

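The per-recipe sync call in §2.C could be wrapped roughly as follows. This is a sketch only: the script location (`SCRIPT`) and the wrapper function names are assumptions, while the `<recipe> --reconcile-only` argument order is the one quoted in the text. The runner is injectable so the loop can be unit-tested without touching any mirror.

```python
import subprocess
from typing import Callable, Iterable, Sequence

SCRIPT = "./open-recipe-pr.sh"  # assumed location of recipe-upgrade's script


def reconcile_cmd(recipe: str) -> list[str]:
    """Build the argv for a faithful mirror sync of one recipe."""
    return [SCRIPT, recipe, "--reconcile-only"]


def run_cmd(argv: Sequence[str]) -> int:
    """Default runner: execute argv and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def sync_mirrors(
    recipes: Iterable[str],
    runner: Callable[[Sequence[str]], int] = run_cmd,
) -> dict[str, bool]:
    """Sync each mirror serially; map recipe -> whether the sync exited 0."""
    return {recipe: runner(reconcile_cmd(recipe)) == 0 for recipe in recipes}
```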
**D. Trigger on a new RELEASED VERSION (skip when no new version).** After sync, compute the recipe's
latest **release tag** version reachable on `main` and compare it to the canonical record's version:
- latest release tag **== canonical version** → **skip** (`SKIP no-new-version`) — *even if `main` has
  new untagged commits*. The sweep tests releases, not arbitrary commits.
- latest release tag **newer than** canonical → run CI **cold on that tagged version** → promote on green
  (tagged, per §2.A).
- no release tag at all (recipe never released) → skip with a recorded reason.

This is the operator's trigger refinement (version/tag-keyed, **not** commit-keyed), and it is what gives
the sweep its determinism property (M2 run-twice → everything skips).

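The three cases in §2.D reduce to a small decision function. The sketch below rests on assumptions: versions look like `3.0.1+v2.0.0` and are ordered by the dotted-integer part before the `+`, which may not match the comparison cc-ci actually uses, and a tag that does not fit that shape would raise `ValueError`. `SKIP no-new-version` is the label from the text; the other labels, the function names, and the "no canonical yet means run" branch are hypothetical.

```python
from typing import Optional, Sequence


def recipe_part(version: str) -> tuple[int, ...]:
    """'3.0.1+v2.0.0' -> (3, 0, 1): order by the part before '+' (assumed scheme)."""
    return tuple(int(p) for p in version.split("+", 1)[0].split("."))


def decide(release_tags: Sequence[str], canonical: Optional[str]) -> str:
    """Return the sweep action for one recipe after its mirror is synced."""
    if not release_tags:
        return "SKIP no-release-tag"      # never released (hypothetical label)
    latest = max(release_tags, key=recipe_part)
    if canonical is None:
        return f"RUN {latest}"            # assumed: no canonical yet, test latest
    if recipe_part(latest) > recipe_part(canonical):
        return f"RUN {latest}"            # new release: cold CI on that tag
    return "SKIP no-new-version"          # untagged commits on main are ignored
```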
**E. Keep it deterministic + AI-free at runtime** (it already is — a script + timer). The additions must
stay pure code: no AI calls during the run. AI (the loops) only authors + verifies.

**F. Make the timer weekly** (operator preference): change the existing daily `OnCalendar` to weekly. The
exact day/time is not critical — pick a low-traffic slot; it's a one-line tune. `Persistent = true` to
catch up a missed run. This is the only schedule work; do not over-invest in it.

**G. Retire `UPGRADE_BASE_VERSION` if plausible no longer needs it (operator 2026-06-17).** Today it is
still used: **`plausible`** sets `UPGRADE_BASE_VERSION = "3.0.1+v2.0.0"` (the old static `[-2]` default
picked `3.0.0`, whose clickhouse entrypoint 404s on amd64 → base never converges; the pin forces the
newest published `3.0.1`); **`bluesky-pds`** only *references* it (in an `EXPECTED_NA` upgrade-skip note as
a future re-enable path). Once this phase enrolls plausible and promotes its canonical to its latest green
release (`3.0.1`), the **dynamic base resolves to `3.0.1` on its own** — the correct base, avoiding the
broken `3.0.0` — so the explicit pin becomes redundant. Therefore:
- With plausible's canonical established at `3.0.1`, **remove the pin from `tests/plausible/recipe_meta.py`
  and confirm its upgrade tier still resolves the correct base (`3.0.1`) and passes** under the dynamic
  resolver.
- If that holds, **strip `UPGRADE_BASE_VERSION` entirely**: the meta key (`runner/harness/meta.py` KEYS),
  the override branch in `resolve_upgrade_base` (`run_recipe_ci.py`), the docs (`recipe-customization.md`
  §4/§5, `testing.md`), and the unit tests (`test_meta.py`, `test_upgrade_base.py`); and update
  `bluesky-pds`'s comment so its re-enable path is the dynamic base, not the removed key.
- **GATE (do not force it):** if plausible genuinely still can't get the right base dynamically (e.g.
  `3.0.1` itself won't cold-deploy green, so no canonical), **KEEP `UPGRADE_BASE_VERSION`** as the
  escape-hatch and record why in DECISIONS — never drop a recipe's upgrade coverage to delete a key.

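To illustrate why the pin in §2.G becomes redundant, here is a hypothetical simplification of the two resolution strategies. It is not the actual `resolve_upgrade_base`: both function names and the priority order are assumptions, and reading the `[-2]` default as "second-newest tag in an oldest-to-newest list" is an interpretation of the text. Version labels are shortened to `3.0.0` and `3.0.1` for the example.

```python
from typing import Optional, Sequence


def static_base(tags: Sequence[str], pin: Optional[str] = None) -> Optional[str]:
    """Old behaviour as described: an explicit pin wins, else the [-2] tag.

    `tags` is assumed to be ordered oldest to newest.
    """
    if pin is not None:
        return pin                      # UPGRADE_BASE_VERSION-style override
    return tags[-2] if len(tags) >= 2 else None


def dynamic_base(canonical_version: Optional[str]) -> Optional[str]:
    """Assumed new behaviour: the base is the promoted canonical version."""
    return canonical_version


# With tags ["3.0.0", "3.0.1"], the unpinned static default lands on the
# broken "3.0.0", while a canonical promoted at "3.0.1" gives the dynamic
# resolver the correct base without any pin.
```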
**Plays-nice-with-`samever` (operator wants this CONFIRMED).** The release-tag trigger (D) makes the
sweep and `samever` **orthogonal** — confirm they don't interfere:
- In the **sweep**, a recipe runs **only when a new release tag exists**, so the version under test is
  always *newer* than the canonical → the upgrade tier's base (previous canonical/released version) is
  strictly older → **`samever`'s same-version step-back never fires in the sweep** (the tag trigger
  already prevents a `vX→vX` run; no-new-version recipes are skipped outright).
- `samever` remains the guard for the **PR path** (`!testme`), where a PR can carry the same version
  label as the canonical without cutting a release — that's where the step-back matters, and it's
  owned/proven by the `samever` phase, not here.

So in the sweep, verify only: (a) no new release tag → recipe SKIPPED (no upgrade-tier run, no promote);
(b) new release tag → `canonical(older) → new tagged version`, a real delta, promote (tagged). The sweep
must never promote an untagged version and never run a same-version upgrade.

## 3. Gates

**M1 — machinery works locally, each piece proven.** (A) a real `canonical.json` is produced by a green
cold run on ≥1 recipe and reused by a warm reattach — **demonstrated, not assumed** — and the promote gate
now also requires a **release tag** (untagged → no promote). (C) mirror-sync and (D) the **new-release-tag
trigger** implemented, reusing the existing reconcile + sweep code, with unit tests (trigger = latest
release tag vs canonical version, NOT commit; sync invoked per recipe; promote gated on
green+cold+latest+enrolled+**tagged**). (B) all recipes enrolled. Adversary cold-verifies: a canonical
actually exists + reattaches; an **untagged** state never promotes; the trigger skips no-new-tag recipes
and runs new-tag ones; sync is faithful-mirror-only; a RED recipe does NOT promote; no AI at runtime.

**M2 — proven end-to-end in real CI (the heart of this phase).** A full sweep run across the enrolled set
on cc-ci: mirrors synced to upstream, **canonicals actually promoted for the green recipes** (records
exist with correct version+commit), red recipes left intact, unchanged recipes skipped — with a
per-recipe results log. **Determinism proof: run the sweep a SECOND time immediately → it SKIPS every
recipe** (latest release tag == canonical version for all → skip all) = a clean no-op, no CI rerun.
Confirm the **deployed timer fires the real (non-hollow) job** — after a fire, canonicals have advanced
(evidence), not exit-0 on an empty set. **Tagged-promote proven:** show a green run on an untagged state
does NOT promote, and a green run on a tagged release DOES.

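The run-twice determinism property in M2 can be stated as a small simulation. This sketch uses an in-memory dict in place of the `canonical.json` records and a stub callable in place of CI; every name is hypothetical, and plain equality stands in for the real "newer than" comparison. Note that in this simplified model a recipe whose run was red is tested again on the next sweep, because its canonical did not advance; the "second run skips everything" outcome holds here only when every recipe with a new tag went green.

```python
from typing import Callable, Dict, List


def sweep(
    latest_tags: Dict[str, str],
    canonicals: Dict[str, str],
    ci_passes: Callable[[str, str], bool],
) -> List[str]:
    """Run one sweep; return the recipes that were actually tested."""
    tested = []
    for recipe, tag in sorted(latest_tags.items()):  # serial, stable order
        if canonicals.get(recipe) == tag:
            continue                                  # no new version: skip
        tested.append(recipe)
        if ci_passes(recipe, tag):
            canonicals[recipe] = tag                  # promote on green only
    return tested
```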
**`samever` orthogonality proven (operator-required).** Demonstrate, with evidence, the two sweep paths:
(1) **no new release tag** (latest tag == canonical version, even with new untagged commits on `main`) →
recipe SKIPPED — no upgrade-tier run, no promote; (2) **new release tag** → cold-test the new tagged
version, upgrade `canonical(older) → new`, a real delta, promote (because tagged). Confirm `samever`'s
step-back **never fires inside the sweep** (the tag trigger prevents same-version runs) — its same-version
behavior is owned/proven by the `samever` phase on the PR path. Construct scenarios if the live recipe set
doesn't cover both.

No AI in the loop. Fresh Adversary PASS on both milestones → `## DONE`.

## 4. Guardrails

- **Correctness over cadence.** The bar is the machinery *demonstrably promotes canonicals, syncs mirrors,
  skips unchanged, and plays nicely with `samever`.* The cadence is decided (**weekly**) — set it in one
  `OnCalendar` line and move on; don't agonize over the exact slot.
- **No AI at runtime** — pure script + systemd timer; AI only builds/verifies.
- **Single-node safety:** serial; skip the whole run if a Drone/test build is in flight (reuse the
  existing nightly guard); tear down every deploy; bound total runtime; mind the warm-volume disk budget.
- **Never force-promote / never weaken:** promote only on green-cold-latest-enrolled-tagged (§2.A); a red
  recipe keeps its prior known-good. Never weaken a test to make a recipe promote.
- **Faithful mirror sync only:** force-sync `main`/tags to coopcloud upstream; never push our own changes
  to mirror `main`; never merge/disturb unrelated PRs.
- **Nix/host changes** (enrollment is recipe-meta; any timer/module tweak is a nixos-rebuild): loops may
  deploy if clean and **verify host health after**; else file for the orchestrator. Commit author
  `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit; abra over a pseudo-TTY.

## 5. Definition of Done

The canonical sweep **actually works and is proven**: a green cold run on a **tagged release** produces a
real, reusable `canonical.json` (an untagged state never promotes); the sweep reconciles each recipe
mirror's `main` to upstream, **skips recipes with no new release tag** (even if `main` has new untagged
commits), runs CI cold on the new tagged version for the rest, and promotes the canonical only to that
tagged release — across all enrolled recipes, demonstrated end-to-end in CI, including the run-twice no-op
determinism proof, the tagged-promote proof, and a real (non-hollow) timer fire. `samever` confirmed
orthogonal (never fires in the sweep). All recipes enrolled + warm-volume budget recorded. `UPGRADE_BASE_VERSION`
retired (key + resolver branch + docs + tests removed, plausible migrated to the dynamic base) **if**
plausible works without it — else kept with a recorded reason (§2.G). The runtime job is AI-free; it is the
substitute for the hollow nightly sweep (not a parallel job). M1 + M2 fresh Adversary PASSes in REVIEW-canon.md.