diff --git a/cc-ci-plan/launch.sh b/cc-ci-plan/launch.sh
index 2b5d44e..7797302 100755
--- a/cc-ci-plan/launch.sh
+++ b/cc-ci-plan/launch.sh
@@ -7,7 +7,8 @@
 # • Adversary (tmux session: cc-ci-adv) working clone /srv/cc-ci/cc-ci-adv
 # coordinating only through the git repo on git.autonomic.zone.
 #
-# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2 → 2b → 3 → 4).
+# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2w → 2 → 2b → 3 → 4;
+# 2w = warm-canonical/--quick, interjected; Phase 2 pauses for it then resumes).
 # Each phase has its own plan + phase-namespaced loop-state files (STATUS-.md etc.). When a phase's
 # STATUS-.md shows "## DONE", the watchdog AUTO-TRANSITIONS to the next phase; after the LAST
 # phase (4, final review/polish/cleanup) it STOPS the loops and exits (end of the whole build).
@@ -55,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
 # Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
 # auto-transitions on the phase's "## DONE" (in BUILDER_DIR/), and STOPS after the
 # last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
-PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
+PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
 IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
 PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
 # --------------------------------------------------------------------------
diff --git a/cc-ci-plan/plan-phase2w-warm-canonical-quick.md b/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
new file mode 100644
index 0000000..9c8d103
--- /dev/null
+++ b/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
@@ -0,0 +1,146 @@
+# cc-ci Phase 2w — Warm canonical deployments + `--quick` CI mode (Autonomous Build Plan)
+
+**Status:** ACTIVE — **interjected into Phase 2** by operator decision (2026-05-28). Phase 2
+(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (its STATUS-2/BACKLOG-2 state is
+preserved); the loops do this phase now, then **Phase 2 resumes automatically** where it left off.
+**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2w.md` the watchdog returns to Phase 2.
+**Builds on:** the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade
+to PR-head, the sso-dep pattern `plan-sso-dep-testing.md`) and the now-wired Docker Hub auth.
+**Owner agents:** Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
+**Phase order now:** 1c → 1b → 1d → 1e → 2(paused) → **2w** → 2(resume) → 2b → 3 → 4.
+
+---
+
+## 0. Why this phase
+
+Cold-start CI (fresh `abra app new` → deploy → DB-init/first-boot → … → teardown) is slow, and it
+re-pays that cost on every run and for every SSO dependency. This phase adds a **warm-data** layer:
+keep each app's **data volume** around between runs (Co-op Cloud's `undeploy` frees RAM but keeps
+volumes), so a fast `--quick` run can reattach it, upgrade to the PR code, and assert — without the
+cold-provisioning cost. A persistent **keycloak** serves SSO-dependent recipes without a fresh
+co-deploy each run. A **last-known-good snapshot per app** means a bad PR tested under `--quick` can
+never destroy the working state+data — we roll back.
+
+**Terminology (use these terms throughout code/docs/decisions):**
+- **live-warm** — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
+- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
+  `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
+- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
+  deletes the volume. The authoritative default.
+
+**Design principles settled with the operator (do not relitigate):**
+- **Keep keycloak live-warm; keep everything else data-warm.** Only keycloak (shared dep) + the one
+  app under test run at a time. RAM stops being the limiter; **disk is the budget** (monitor; bump
+  only if needed — test fixtures are small).
+- **Default `!testme` = full cold** (authoritative; its upgrade tier already exercises PR-upgrade per
+  1e). **`--quick` is an opt-in flag**, a lower-confidence fast lane.
+- **The canonical known-good advances ONLY via cold runs** (esp. the nightly sweep). `--quick` NEVER
+  promotes the canonical — it consumes it read-mostly and rolls back on failure.
+- **Snapshots: raw volume copy taken while UNDEPLOYED** (fast + consistent because nothing is
+  writing). **One last-known-good per app.**
+- Warm volumes + snapshots are **cache, not source** — not in the git/D8 closure; re-seeded by cold
+  runs, not restored on a VM rebuild.
+
+---
+
+## 1. Definition of Done (Phase 2w exit condition)
+
+Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
+`machine-docs/REVIEW-2w.md`):
+
+- [ ] **WC1 — Live-warm keycloak (SSO dep).** A persistent (live-warm) keycloak runs at a stable domain. SSO-dependent
+      recipes (per `plan-sso-dep-testing.md`) point their `setup_custom_tests` at the warm keycloak
+      and create a **per-run namespaced realm+client**, then **delete that realm** after the run
+      (cleanup), instead of co-deploying a fresh keycloak. Proven: a dependent recipe's SSO custom
+      tests pass against the warm keycloak; concurrent dependents don't collide (distinct realms);
+      leftover realms are reaped.
+- [ ] **WC2 — Data-warm canonical model.** A canonical per warmed recipe at a **stable domain**
+      (distinct from cold per-run `-<6hex>` domains), kept **data-warm** (undeployed-when-idle,
+      volume retained). A small declarative registry/reconciler tracks which recipes are
+      canonical and **at which commit** their known-good is. Re-warmable from scratch (cache).
+- [ ] **WC3 — Known-good snapshots.** For each canonical app, a **raw copy of its data volume(s)
+      taken while undeployed**, stored under a stable path (e.g. `/var/lib/ci-warm//`),
+      tagged with the commit it passed on. **One last-known-good retained per app** (prior is
+      replaced atomically on update). Restore is proven to bring the app back healthy with its data.
+- [ ] **WC4 — `--quick` mode.** `runner/run_recipe_ci.py` gains a `--quick` path (flag/env): reattach
+      the canonical warm volume (`abra app deploy` of the canonical) → **upgrade to PR head** (chaos
+      redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) →
+      **on PASS:** `abra app undeploy` (keep volume), do NOT alter the known-good; **on FAIL:**
+      restore the last-known-good snapshot, then undeploy. `--quick` **never promotes** the canonical.
+- [ ] **WC5 — Canonical advancement via cold only.** A **green full-cold run on latest** re-snapshots +
+      re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold
+      run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest
+      makes an app canonical.
+- [ ] **WC6 — Nightly full-cold sweep.** A scheduled job runs the **full cold** suite across enrolled
+      recipes nightly — refreshing every canonical's known-good (WC5) AND serving as a daily
+      authoritative regression run. Mechanism settled in DECISIONS (systemd timer on cc-ci / Drone
+      cron / bridge), declarative + reproducible. Bounded by MAX_TESTS (serial is fine — nightly).
+- [ ] **WC7 — Trigger + authority + labeling.** Default `!testme` = full cold (unchanged). `--quick`
+      is opt-in (`!testme --quick`, or a build param) and **never gates merge**. Run results carry
+      the **mode** (cold vs quick) so a `--quick` pass is distinctly labeled lower-confidence (feeds
+      Phase 3). Quick requires an existing canonical; if none, it cleanly falls back to (or reports
+      "no canonical — run cold first").
+- [ ] **WC8 — Resource safety + isolation.** Warm-base runs serialize per app (MAX_TESTS honored);
+      warm keycloak shared safely via per-run realms; **disk monitored** (warm volumes + one snapshot
+      each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays
+      sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility
+      closure (documented as cache).
+- [ ] **WC9 — Documented + cold-verified, incl. the rollback proof.** `docs/` explains warm/quick;
+      the Adversary cold-verifies, **including deliberately failing a PR under `--quick` and
+      confirming the canonical's last-known-good is restored intact (data preserved)**, and that a
+      `--quick` pass did not move the known-good. No softened tests.
+
+When WC1–WC9 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-2w.md` → the watchdog
+auto-returns to **Phase 2** (resume recipe authoring).
+
+---
+
+## 2. The `--quick` flow (reference)
+
+```
+PRECOND: a canonical for exists (seeded by a prior green cold run); else fall back/report.
+  1. abra app deploy # reattach warm volume -> fast warm boot at known-good commit
+  2. wait_healthy
+  3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
+  4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout) # the op, once
+  5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
+ 6a. PASS -> abra app undeploy # keep volume; known-good UNCHANGED
+ 6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy # roll back, data safe
+  7. (deps) delete the per-run realm from the warm keycloak
+```
+Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot
+the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the
+new canonical instead of deleting it.
+
+## 3. Milestones (bounded)
+- **W0 — Warm keycloak (WC1).** Highest ROI; unblocks faster SSO recipe tests for the resumed Phase 2.
+- **W1 — Canonical registry + snapshot/restore (WC2, WC3).** Stable-domain warm apps;
+  raw-while-stopped snapshot + restore; prove restore round-trips data.
+- **W2 — `--quick` mode (WC4, WC7).** Orchestrator path + labeling + fallback.
+- **W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).** Promote-on-green-cold; scheduled job.
+- **W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9).** Then
+  `## DONE`.
+
+## 4. Guardrails
+- **`--quick` never advances the canonical; only cold does.** Anchors the baseline to verified states.
+- **Never lose the known-good** — snapshot before mutate (or rely on the standing known-good); restore
+  on any quick failure. The rollback proof (WC9) is mandatory.
+- **Default stays cold; quick is opt-in + clearly lower-confidence.** Don't let a quick pass read as
+  full confidence.
+- **Snapshot only while undeployed** (consistency). **One last-known-good per app** (disk).
+- **Cold teardown stays sacred** (deletes per-run volumes); warm volumes are a managed cache, never
+  confused with per-run state; warm data excluded from D8.
+- **Never weaken a test** (cardinal rule). Generic-first invariant holds in `--quick` too.
+- **Bounded** — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all
+  recipes here (the nightly sweep populates canonicals over time).
+
+## 5. Open decisions (log in machine-docs/DECISIONS.md)
+- Canonical **stable-domain scheme** (distinct from cold per-run domains) + how the registry/reconciler
+  is declared.
+- **Snapshot storage + format** (raw tar vs reflink/CoW copy) under `/var/lib/ci-warm/`; atomic replace.
+- **Nightly sweep mechanism** (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
+- `--quick` **trigger surface** (`!testme --quick` comment vs Drone build param) + the "no canonical
+  yet" fallback (run cold vs report-and-skip).
+- **Disk budget**: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is
+  needed or the warm set stays bounded.
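
Reviewer sketch (not part of the patch): the control flow of section 2 of the plan, written as a minimal Python function so the PASS/FAIL/rollback ordering can be checked in isolation. Everything here is an assumption made for illustration: `run_quick`, `QuickOutcome`, and the step names in `ops` are hypothetical and do not come from `runner/run_recipe_ci.py`. The real steps would shell out to `abra` and the keycloak admin API, whereas this sketch takes them as injected callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QuickOutcome:
    mode: str = "quick"        # label so a quick pass is never read as a cold pass (WC7)
    status: str = "skipped"    # "pass" | "fail" | "skipped" (no canonical yet)
    promoted: bool = False     # --quick never promotes the canonical (WC4)
    steps: List[str] = field(default_factory=list)


def run_quick(has_canonical: bool, ops: Dict[str, Callable[[], object]]) -> QuickOutcome:
    """Run the --quick flow. `ops` maps hypothetical step names to callables;
    'assert_all' must return a truthy value on PASS."""
    out = QuickOutcome()

    def step(name: str) -> object:
        out.steps.append(name)
        return ops[name]()

    if not has_canonical:
        return out                          # PRECOND failed: report "run cold first"

    step("deploy_canonical")                # 1. reattach the warm volume
    step("wait_healthy")                    # 2.
    step("create_realm")                    # 3. per-run namespaced realm+client
    try:
        try:
            step("upgrade_to_pr_head")      # 4. chaos redeploy to the PR checkout
            passed = bool(step("assert_all"))  # 5. generic + overlay + custom
        except Exception:
            passed = False                  # any error during 4-5 counts as FAIL
        if not passed:
            step("restore_snapshot")        # 6b. roll back to last-known-good
        step("undeploy")                    # 6a/6b. volume is kept either way
    finally:
        step("delete_realm")                # 7. always clean up the per-run realm
    out.status = "pass" if passed else "fail"
    return out


# Usage with stand-in steps: a failing assertion triggers the rollback path.
fake_ops = {name: (lambda: True) for name in (
    "deploy_canonical", "wait_healthy", "create_realm", "upgrade_to_pr_head",
    "assert_all", "restore_snapshot", "undeploy", "delete_realm")}
fake_ops["assert_all"] = lambda: False
print(run_quick(True, fake_ops).steps)
```

The `try`/`finally` keeps realm cleanup unconditional, and `promoted` is never set, which mirrors the guardrail that only cold runs advance the canonical.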