plan: Phase 2w — warm canonical deployments + --quick CI mode (interjected into Phase 2)

Operator-directed: pause Phase 2, build the warm-data + --quick system, then resume Phase 2.

- live-warm keycloak (SSO dep, realm-per-run), data-warm canonicals (undeploy keeps volume), cold = authoritative default. --quick reattaches the canonical, upgrades to PR head, asserts, and rolls back to the last-known-good snapshot on failure (never loses working data).
- known-good = raw volume copy taken while undeployed (consistent), one per app, advanced ONLY by green cold runs; a nightly full-cold sweep refreshes canonicals + is a daily regression run.
- launch.sh: insert 2w at the current index (Phase 2 -> resumes after 2w DONE); seq is now 1c 1b 1d 1e 2w 2 2b 3 4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -7,7 +7,8 @@
 # • Adversary (tmux session: cc-ci-adv) working clone /srv/cc-ci/cc-ci-adv
 # coordinating only through the git repo on git.autonomic.zone.
 #
-# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2 → 2b → 3 → 4).
+# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2w → 2 → 2b → 3 → 4;
+# 2w = warm-canonical/--quick, interjected; Phase 2 pauses for it then resumes).
 # Each phase has its own plan + phase-namespaced loop-state files (STATUS-<id>.md etc.). When a phase's
 # STATUS-<id>.md shows "## DONE", the watchdog AUTO-TRANSITIONS to the next phase; after the LAST
 # phase (4, final review/polish/cleanup) it STOPS the loops and exits (end of the whole build).
@@ -55,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
 # Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
 # auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
 # last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
-PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
+PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
 IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
 PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
 # --------------------------------------------------------------------------

146
cc-ci-plan/plan-phase2w-warm-canonical-quick.md
Normal file
@@ -0,0 +1,146 @@
# cc-ci Phase 2w — Warm canonical deployments + `--quick` CI mode (Autonomous Build Plan)

**Status:** ACTIVE — **interjected into Phase 2** by operator decision (2026-05-28). Phase 2
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (its STATUS-2/BACKLOG-2 state is
preserved); the loops do this phase now, then **Phase 2 resumes automatically** where it left off.
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2w.md` the watchdog returns to Phase 2.
**Builds on:** the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade
to PR-head, the sso-dep pattern `plan-sso-dep-testing.md`) and the now-wired Docker Hub auth.
**Owner agents:** Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Phase order now:** 1c → 1b → 1d → 1e → 2(paused) → **2w** → 2(resume) → 2b → 3 → 4.

---

## 0. Why this phase

Cold-start CI (fresh `abra app new` → deploy → DB-init/first-boot → … → teardown) is slow, and it
re-pays that cost on every run and for every SSO dependency. This phase adds a **warm-data** layer:
keep each app's **data volume** around between runs (Co-op Cloud's `undeploy` frees RAM but keeps
volumes), so a fast `--quick` run can reattach it, upgrade to the PR code, and assert — without the
cold-provisioning cost. A persistent **keycloak** serves SSO-dependent recipes without a fresh
co-deploy each run. A **last-known-good snapshot per app** means a bad PR tested under `--quick` can
never destroy the working state+data — we roll back.

**Terminology (use these terms throughout code/docs/decisions):**
- **live-warm** — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
  `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
  deletes the volume. The authoritative default.

**Design principles settled with the operator (do not relitigate):**
- **Keep keycloak live-warm; keep everything else data-warm.** Only keycloak (shared dep) + the one
  app under test run at a time. RAM stops being the limiter; **disk is the budget** (monitor; bump
  only if needed — test fixtures are small).
- **Default `!testme` = full cold** (authoritative; its upgrade tier already exercises PR-upgrade per
  1e). **`--quick` is an opt-in flag**, a lower-confidence fast lane.
- **The canonical known-good advances ONLY via cold runs** (esp. the nightly sweep). `--quick` NEVER
  promotes the canonical — it consumes it read-mostly and rolls back on failure.
- **Snapshots: raw volume copy taken while UNDEPLOYED** (fast + consistent because nothing is
  writing). **One last-known-good per app.**
- Warm volumes + snapshots are **cache, not source** — not in the git/D8 closure; re-seeded by cold
  runs, not restored on a VM rebuild.

---

## 1. Definition of Done (Phase 2w exit condition)

Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-2w.md`):

- [ ] **WC1 — Live-warm keycloak (SSO dep).** A persistent (live-warm) keycloak runs at a stable domain. SSO-dependent
  recipes (per `plan-sso-dep-testing.md`) point their `setup_custom_tests` at the warm keycloak
  and create a **per-run namespaced realm+client**, then **delete that realm** after the run
  (cleanup), instead of co-deploying a fresh keycloak. Proven: a dependent recipe's SSO custom
  tests pass against the warm keycloak; concurrent dependents don't collide (distinct realms);
  leftover realms are reaped.
- [ ] **WC2 — Data-warm canonical model.** A canonical per warmed recipe at a **stable domain**
  (distinct from cold per-run `<recipe>-<6hex>` domains), kept **data-warm** (undeployed-when-idle,
  volume retained). A small declarative registry/reconciler tracks which recipes are
  canonical and **at which commit** their known-good is. Re-warmable from scratch (cache).
- [ ] **WC3 — Known-good snapshots.** For each canonical app, a **raw copy of its data volume(s)
  taken while undeployed**, stored under a stable path (e.g. `/var/lib/ci-warm/<recipe>/`),
  tagged with the commit it passed on. **One last-known-good retained per app** (prior is
  replaced atomically on update). Restore is proven to bring the app back healthy with its data.
- [ ] **WC4 — `--quick` mode.** `runner/run_recipe_ci.py` gains a `--quick` path (flag/env): reattach
  the canonical warm volume (`abra app deploy` of the canonical) → **upgrade to PR head** (chaos
  redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) →
  **on PASS:** `abra app undeploy` (keep volume), do NOT alter the known-good; **on FAIL:**
  restore the last-known-good snapshot, then undeploy. `--quick` **never promotes** the canonical.
- [ ] **WC5 — Canonical advancement via cold only.** A **green full-cold run on latest** re-snapshots
  + re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold
  run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest
  makes an app canonical.
- [ ] **WC6 — Nightly full-cold sweep.** A scheduled job runs the **full cold** suite across enrolled
  recipes nightly — refreshing every canonical's known-good (WC5) AND serving as a daily
  authoritative regression run. Mechanism settled in DECISIONS (systemd timer on cc-ci / Drone
  cron / bridge), declarative + reproducible. Bounded by MAX_TESTS (serial is fine — nightly).
- [ ] **WC7 — Trigger + authority + labeling.** Default `!testme` = full cold (unchanged). `--quick`
  is opt-in (`!testme --quick`, or a build param) and **never gates merge**. Run results carry
  the **mode** (cold vs quick) so a `--quick` pass is distinctly labeled lower-confidence (feeds
  Phase 3). Quick requires an existing canonical; if none exists, it either cleanly falls back to a
  cold run or reports "no canonical — run cold first" (which of the two is an open decision, see §5).
- [ ] **WC8 — Resource safety + isolation.** Warm-base runs serialize per app (MAX_TESTS honored);
  warm keycloak shared safely via per-run realms; **disk monitored** (warm volumes + one snapshot
  each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays
  sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility
  closure (documented as cache).
- [ ] **WC9 — Documented + cold-verified, incl. the rollback proof.** `docs/` explains warm/quick;
  the Adversary cold-verifies, **including deliberately failing a PR under `--quick` and
  confirming the canonical's last-known-good is restored intact (data preserved)**, and that a
  `--quick` pass did not move the known-good. No softened tests.

When WC1–WC9 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-2w.md` → the watchdog
auto-returns to **Phase 2** (resume recipe authoring).

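
The per-run realm namespacing and reaping required by WC1 could look roughly like the sketch below. It only covers naming and stale-realm selection; creating and deleting realms against keycloak is left out. The `ci-` prefix, the `ci-<recipe>-<unix-ts>-<6hex>` name layout, and both function names are assumptions for illustration, not something this plan has settled.

```python
import secrets
import time
from typing import List, Optional

REALM_PREFIX = "ci-"  # hypothetical marker for realms owned by CI runs


def new_realm_name(recipe: str, now: Optional[float] = None) -> str:
    """Return a per-run realm name of the form ci-<recipe>-<unix-ts>-<6hex>."""
    ts = int(now if now is not None else time.time())
    return f"{REALM_PREFIX}{recipe}-{ts}-{secrets.token_hex(3)}"


def stale_realms(realms: List[str], max_age_s: int, now: Optional[float] = None) -> List[str]:
    """Select CI-owned realms older than max_age_s; other realms are never selected."""
    current = now if now is not None else time.time()
    stale = []
    for name in realms:
        if not name.startswith(REALM_PREFIX):
            continue  # not ours (e.g. keycloak's built-in realm): leave it alone
        parts = name.rsplit("-", 2)  # [prefix+recipe, timestamp, random suffix]
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # does not match the naming scheme: do not guess
        if current - int(parts[1]) > max_age_s:
            stale.append(name)
    return stale
```

Embedding the creation time in the name lets a reaper find leftovers from crashed runs without any extra bookkeeping; the random suffix keeps concurrent dependents of the same recipe in distinct realms.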
---

## 2. The `--quick` flow (reference)

```
PRECOND: a canonical for <recipe> exists (seeded by a prior green cold run); else fall back/report.
1. abra app deploy <canonical-domain> # reattach warm volume -> fast warm boot at known-good commit
2. wait_healthy
3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout) # the op, once
5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
6a. PASS -> abra app undeploy <canonical-domain> # keep volume; known-good UNCHANGED
6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy # roll back, data safe
7. (deps) delete the per-run realm from the warm keycloak
```
Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot
the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the
new canonical instead of deleting it.

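
As a rough illustration of the control flow above, the following sketch wires steps 1 to 7 around injected callables, so it runs without `abra` or keycloak. `QuickSteps`, `run_quick`, and the result dictionary are hypothetical names, not the actual `runner/run_recipe_ci.py` interface. The failure path follows the ordering listed in step 6b (restore, then undeploy), and the "no canonical" case is shown as report-and-skip purely as an assumption, since that choice is still an open decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QuickSteps:
    """Hypothetical hooks; a real runner would wrap abra and the test harness here."""
    deploy_canonical: Callable[[], None]   # steps 1-2: reattach warm volume, wait healthy
    upgrade_to_pr: Callable[[], None]      # step 4: chaos redeploy to the PR checkout
    run_assertions: Callable[[], bool]     # step 5: generic + overlay + custom
    restore_snapshot: Callable[[], None]   # step 6b: roll back to last-known-good
    undeploy: Callable[[], None]           # steps 6a/6b: free RAM, keep the volume
    create_realm: Optional[Callable[[], None]] = None  # step 3 (SSO deps only)
    delete_realm: Optional[Callable[[], None]] = None  # step 7 (SSO deps only)


def run_quick(has_canonical: bool, steps: QuickSteps) -> dict:
    """Run the quick lane; the known-good is never promoted from here."""
    if not has_canonical:
        return {"mode": "quick", "result": "skipped", "reason": "no-canonical"}

    passed = False
    realm_created = False
    try:
        steps.deploy_canonical()
        if steps.create_realm is not None:
            steps.create_realm()
            realm_created = True
        steps.upgrade_to_pr()
        passed = bool(steps.run_assertions())
    except Exception:
        passed = False  # any error during the run counts as a failed quick run

    try:
        if not passed:
            steps.restore_snapshot()
    finally:
        steps.undeploy()
        if realm_created and steps.delete_realm is not None:
            steps.delete_realm()
    return {"mode": "quick", "result": "pass" if passed else "fail", "promoted": False}
```

Carrying `mode` and `promoted` in the result keeps a quick pass distinguishable from a cold pass downstream.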
## 3. Milestones (bounded)
- **W0 — Warm keycloak (WC1).** Highest ROI; unblocks faster SSO recipe tests for the resumed Phase 2.
- **W1 — Canonical registry + snapshot/restore (WC2, WC3).** Stable-domain warm apps; raw-while-stopped
  snapshot + restore; prove restore round-trips data.
- **W2 — `--quick` mode (WC4, WC7).** Orchestrator path + labeling + fallback.
- **W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).** Promote-on-green-cold; scheduled job.
- **W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9).** Then
  `## DONE`.

## 4. Guardrails
- **`--quick` never advances the canonical; only cold does.** Anchors the baseline to verified states.
- **Never lose the known-good** — snapshot before mutate (or rely on the standing known-good); restore
  on any quick failure. The rollback proof (WC9) is mandatory.
- **Default stays cold; quick is opt-in + clearly lower-confidence.** Don't let a quick pass read as
  full confidence.
- **Snapshot only while undeployed** (consistency). **One last-known-good per app** (disk).
- **Cold teardown stays sacred** (deletes per-run volumes); warm volumes are a managed cache, never
  confused with per-run state; warm data excluded from D8.
- **Never weaken a test** (cardinal rule). Generic-first invariant holds in `--quick` too.
- **Bounded** — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all
  recipes here (the nightly sweep populates canonicals over time).

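
The first guardrail can be enforced structurally rather than by convention. The sketch below is one possible shape for the registry, with promotion refusing anything that is not a green cold run. `KnownGood`, `CanonicalRegistry`, and the `mode`/`green` parameters are illustrative assumptions; the real registry/reconciler format is still an open decision.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class KnownGood:
    recipe: str
    commit: str         # commit the snapshot passed on
    snapshot_path: str  # e.g. somewhere under /var/lib/ci-warm/<recipe>/


@dataclass
class CanonicalRegistry:
    """In-memory stand-in for the declarative registry (persistence omitted)."""
    entries: Dict[str, KnownGood] = field(default_factory=dict)

    def get(self, recipe: str) -> Optional[KnownGood]:
        return self.entries.get(recipe)

    def promote(self, recipe: str, commit: str, snapshot_path: str, *, mode: str, green: bool) -> bool:
        """Advance the known-good only for a green cold run; otherwise change nothing."""
        if mode != "cold" or not green:
            return False
        # One last-known-good per app: the previous entry is simply replaced.
        self.entries[recipe] = KnownGood(recipe, commit, snapshot_path)
        return True
```

Because `promote` is the only mutation path, a quick run that calls it by mistake is a no-op instead of a corrupted baseline.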
## 5. Open decisions (log in machine-docs/DECISIONS.md)
- Canonical **stable-domain scheme** (distinct from cold per-run domains) + how the registry/reconciler
  is declared.
- **Snapshot storage + format** (raw tar vs reflink/CoW copy) under `/var/lib/ci-warm/`; atomic replace.
- **Nightly sweep mechanism** (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
- `--quick` **trigger surface** (`!testme --quick` comment vs Drone build param) + the "no canonical
  yet" fallback (run cold vs report-and-skip).
- **Disk budget**: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is
  needed or the warm set stays bounded.

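
To make the "raw tar + atomic replace" option concrete, here is a minimal sketch under the assumption that a snapshot is a single uncompressed tar file and that the volume is reachable as a plain directory while the app is undeployed. `snapshot_volume` and `restore_volume` are hypothetical helpers; the reflink/CoW alternative is not shown. The prior snapshot is replaced with `os.replace`, which is atomic when the temporary file and the destination are on the same filesystem.

```python
import os
import shutil
import tarfile
import tempfile


def snapshot_volume(volume_dir: str, dest_tar: str) -> None:
    """Archive volume_dir into dest_tar, atomically replacing any prior snapshot."""
    dest_dir = os.path.dirname(os.path.abspath(dest_tar))
    os.makedirs(dest_dir, exist_ok=True)
    # Write next to the destination so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w") as tar:
            tar.add(volume_dir, arcname=".")
        os.replace(tmp_path, dest_tar)  # old snapshot stays valid until this succeeds
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def restore_volume(src_tar: str, volume_dir: str) -> None:
    """Empty volume_dir, then unpack the snapshot into it."""
    for name in os.listdir(volume_dir):
        path = os.path.join(volume_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.unlink(path)
    # The archive is our own snapshot, so keep ownership and modes where the
    # running Python supports extraction filters.
    kwargs = {"filter": "fully_trusted"} if hasattr(tarfile, "fully_trusted_filter") else {}
    with tarfile.open(src_tar, "r") as tar:
        tar.extractall(volume_dir, **kwargs)
```

A failed or interrupted snapshot leaves the previous last-known-good untouched, which is the property the "never lose the known-good" guardrail depends on. Durability details such as `fsync` are left out of this sketch.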