Phase 2b narrowed to "confirm minimal deploys"; perf ideas moved to IDEAS

Operator (2026-05-30): the real deploy-speed bottleneck was hardware (cc-ci VM was 2 vCPU on a 4-core host + disk-I/O-bound; RAM fine), now fixed directly (bumped to 4 vCPU, made cc-nix-test the only running VM on b1). The 2b software micro-optimizations are judged unlikely to help, so:

- IDEAS.md: parked the whole empirical-perf program (instrumentation, baseline, attribution) + the optimization menu (image cache/prepull, readiness tuning, warm-SSO start/stop, runner caching, concurrency sizing, resources, secret overhead) under "Phase-2b empirical performance work", revisit only if measurement later proves a specific software bottleneck.
- plan-phase2b: reduced to ONE goal: confirm (and fix if needed) that the per-recipe test sequence already uses the minimum deploys (1 base shared by install+functional+backup/restore, +1 for the upgrade tier, +1 per dep), enforced by the existing DG4.1 deploy-count check, WITHOUT weakening any test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:07:49 +01:00
parent 1c2be64124
commit e85e16318c
2 changed files with 84 additions and 167 deletions
--- a/cc-ci-plan/IDEAS.md
+++ b/cc-ci-plan/IDEAS.md
@@ -82,3 +82,32 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
  bottleneck** (e.g. D8 throwaway-rebuild / fresh-canonical seeding) **AND** the cache lives on
  **recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
  Otherwise it's complexity without payoff. See DECISIONS.md "Phase 2pc". *Added:* 2026-05-29.
+
+- **Phase-2b empirical performance work (moved out of the 2b phase).** The original Phase 2b was a full
+  empirical perf program: per-phase timing instrumentation in `results.json`, a cold/warm baseline
+  across representative recipes, a Pareto attribution, and a menu of software optimizations. **Deferred
+  (operator, 2026-05-30):** the real deploy-speed bottleneck turned out to be **hardware**, not
+  software — the cc-ci VM was **2 vCPU on a 4-core host** and **disk-I/O-bound** (load ~8, io pressure
+  ~65%) while running warm-keycloak (JVM) + all infra; RAM was never the constraint. Fixed **directly**:
+  bumped to **4 vCPU** and made cc-nix-test the **only running VM** on b1. The software micro-opts below
+  are judged unlikely to move the needle enough to justify the work; revisit ONLY if measurement later
+  shows a specific software bottleneck. (Phase 2b is narrowed to just confirming the test sequence
+  already minimizes deploys — see plan-phase2b.) Parked ideas:
+  - **Per-phase timing instrumentation** + cold/warm **baseline** + **attribution** — do this first if
+    perf is ever revisited; numbers should drive any change.
+  - **Image pulls:** local registry pull-through cache (see the item above) and/or pre-pull/warm the
+    enrolled recipes' image set so the first run doesn't pay the cold pull.
+  - **Readiness/convergence:** replace fixed sleeps with tight health-endpoint polling; per-recipe
+    readiness probes; parallelize independent readiness checks within a run.
+  - **Warm shared SSO provider** (already partly live as warm-keycloak): saves per-run SSO deploy time
+    but is a steady JVM CPU tax that slows non-SSO recipes — only worth it with proven per-run
+    isolation; consider start-when-needed / stop-when-idle rather than always-on.
+  - **Runner/build caching:** persistent nix store + warm flake eval; cache pip/uv wheels + Playwright
+    browsers in a persistent volume.
+  - **Concurrency sizing:** tune `MAX_TESTS`/runner capacity + per-recipe weights so light recipes run
+    concurrently while heavy ones serialize, without overcommitting the node.
+  - **Resources:** further vCPU/RAM/disk-I/O sizing (the 4-vCPU bump is done; storage I/O on b1 is the
+    harder co-bottleneck — a faster storage pool if it ever matters).
+  - **abra/secret overhead:** avoid regenerating/re-inserting secrets redundantly across stages.
+  *Why deferred:* hardware was the real lever and is fixed; these are speculative software gains best
+  validated by measurement, not assumed. *Added:* 2026-05-30.
--- a/cc-ci-plan/plan-phase2b-test-performance.md
+++ b/cc-ci-plan/plan-phase2b-test-performance.md
@@ -1,181 +1,69 @@
-# cc-ci Phase 2b — Test performance: measure, attribute, improve (Autonomous Build Plan)
+# cc-ci Phase 2b — Confirm the test sequence minimizes deploys (no redundant deploys)

-**Status:** QUEUED — starts after Phase 2 (`plan-phase2-recipe-tests.md`) reaches `## DONE`, and runs
-**before** Phase 3 (`plan-phase3-results-ux.md`).
-**Transition:** **manual** (operator kicks it off).
-**Builds on:** Phase 1 (runner, Drone, harness, `MAX_TESTS`) + Phase 2 (the full per-recipe suites —
-the *real workload* we're optimizing).
-**Owner agents:** same Builder + Adversary loops + protocol as Phase 1 (`plan.md` §6/§7). Here the
-Adversary's job is to **independently re-measure** claimed speed-ups and ensure **no test was
-weakened or isolation broken** to gain them.
-**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase2b-test-performance.md`
+**Status:** QUEUED — starts after Phase 2 (`plan-phase2-recipe-tests.md`) reaches `## DONE`, before
+Phase 3. **Transition:** manual (operator kicks it off). **Owner:** Builder + Adversary loops.
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2b-test-performance.md`

 ---

-## 0. Why this phase
+## 0. Scope (NARROWED — operator, 2026-05-30)

-Runs are slow — a Keycloak test took ~30 minutes (the seed observation). Before we polish results
-(Phase 3), make the pipeline fast enough to be pleasant and to scale across all recipes. This phase
-is **empirical**: instrument → measure a baseline → attribute where the time goes → try improvements
-as controlled experiments → keep what measurably helps → re-measure. **No guessing; numbers decide.**
-Speed must **never** come from weakening tests, reducing real isolation unsafely, or skipping stages.
+The original Phase 2b was a broad empirical performance program (instrument → baseline → attribute →
+optimize). **That has been removed and parked in `IDEAS.md`** ("Phase-2b empirical performance work").

----
+**Why:** the real deploy-speed bottleneck was **hardware**, not software — the cc-ci VM was **2 vCPU
+on a 4-core host** and **disk-I/O-bound** (load ~8, io pressure ~65%), with warm-keycloak (JVM) + all
+infra resident; RAM was never the constraint. That was fixed **directly**: cc-nix-test bumped to
+**4 vCPU** and made the **only running VM** on b1 (full host CPU). The software micro-optimizations are
+judged unlikely to be worth the effort and are deferred to IDEAS, to be revisited only if measurement
+later proves a specific software bottleneck.
+
+**So Phase 2b is reduced to ONE thing:** confirm the per-recipe test sequence already uses the
+**minimum number of deploys** — and fix it if it doesn't — **without weakening any test**. (Operator's
+expectation: we have probably already done this via the deploy-once / deploy-sharing design.)

 ## 1. Mission

-Understand *where* recipe-test time goes (per phase, cold vs warm) and *reduce it measurably* on the
-real Phase-2 workload, with before/after numbers for every change and no loss of correctness.
+Verify that a recipe's full test sequence does **not** redeploy more than necessary, and document the
+deploy budget. Reuse a single deployment across the stages that can safely share one; only deploy
+again where a stage genuinely requires a distinct deployment.

----
+## 2. Definition of Done (Adversary cold-verifies → REVIEW.md)

-## 2. Definition of Done (Phase 2b exit condition)
+- [ ] **B1 — Deploy budget is documented and minimal.** Write down, per recipe run, exactly how many
+      `abra app deploy`/`upgrade` cycles happen and why each is necessary. Expected minimum:
+      - **one** base deploy shared by **install + functional/custom + backup→restore** (restore
+        redeploys onto the same app only as the restore mechanism itself requires);
+      - **one** additional prior-version deploy **only** for the **upgrade** tier (old→new is the
+        whole point of that tier);
+      - **one** deploy per declared **dependency** (e.g. an SSO provider), deployed once and reused.
+      i.e. `deploys == 1 (base) + 1 (upgrade tier) + N_deps` — no extra/redundant redeploys.
+- [ ] **B2 — Enforced, not just claimed.** The harness already emits a deploy count and fails on a
+      mismatch (the DG4.1 `deploy-count != expected` check + the `RUN SUMMARY` `deploy-count` line) —
+      point to that as the enforcement and confirm `expected_deploy_count` reflects the minimal budget
+      in B1. If any recipe exceeds it, **remove the redundant deploy** (e.g. collapse a needless
+      re-deploy between install and functional) and re-verify.
+- [ ] **B3 — No test weakened to save a deploy.** Every stage still runs its real assertions and real
+      isolation/teardown; sharing a deployment must not skip or soften any check. Adversary confirms
+      from a cold start that suite coverage is unchanged — only the deploy count is reduced/confirmed.
+- [ ] **B4 — Recorded.** A short note (`docs/perf/deploys.md` or DECISIONS.md) states the confirmed
+      per-recipe deploy budget and that it is minimal. If it was already minimal, say so explicitly
+      (the likely outcome); if a redundant deploy was removed, record before/after counts.

-Terminates only when every item holds **and the Adversary has independently re-verified each within
-24h** (logged in `REVIEW.md`):
+When B1–B4 hold and are Adversary-verified, write `## DONE` to Phase-2b `STATUS.md`.

-- [ ] **T1 — Instrumentation.** The runner emits **per-phase timings** for every run (image pull,
-      `abra app new`/deploy, service convergence, secret generation, each stage install/upgrade/
-      backup-restore, each functional test, dependency/SSO setup, teardown) into the run's
-      `results.json` (the same artifact Phase 3 consumes). Timings are visible per run.
-- [ ] **T2 — Baseline.** A measured baseline across a **representative recipe set** — at least: a
-      light/stateless recipe (custom-html), a single-DB recipe (n8n), a heavy JVM/SSO recipe
-      (keycloak), and an SSO-*dependent* recipe (lasuite-docs). Each measured **cold** (empty image
-      cache) and **warm** (cached), multiple runs to capture variance. Recorded in `docs/perf/baseline.md`.
-- [ ] **T3 — Attribution.** A written attribution (`docs/perf/attribution.md`) showing the **Pareto
-      breakdown** — which phases dominate, cold vs warm — e.g. "keycloak warm: 8m converge + 4m
-      backup + …". The biggest levers are identified from data, not intuition.
-- [ ] **T4 — Experiments.** Each improvement idea (§4) tried as a **controlled experiment** (change
-      one variable, hold the rest), with **before/after numbers** in `docs/perf/experiments.md`:
-      what was changed, the measured delta, and keep/discard. Failed experiments are recorded as
-      dead-ends (don't re-try).
-- [ ] **T5 — Adopted improvements + measured gain.** The beneficial changes are adopted (Nix-declared
-      / harness / Drone config) and the **overall run time is measurably reduced** vs the T2 baseline.
-      Set a concrete target in `DECISIONS.md` from the attribution (e.g. "median warm heavy-recipe run
-      ≤ X min; light recipe ≤ Y min") and hit it, with the single node still safe (RAM/disk/concurrency).
-- [ ] **T6 — No regression.** Adversary confirms, from a cold start, that after the speed-ups **every
-      Phase-2 test still passes and isolation/teardown still hold** (no shared-state contamination, no
-      weakened/skipped assertions, no leaked apps). A speed-up that compromises correctness is reverted.
-- [ ] **T7 — Recommendations.** `docs/perf/README.md` summarizes findings, the recommended config
-      (e.g. `MAX_TESTS`, cache settings, warm-infra choices) and per-recipe sizing/timeouts, and what
-      *didn't* help. A new engineer can understand the perf model and re-run the measurements.
+## 3. Method
+1. Read `run_recipe_ci.py`/harness: trace every `abra app deploy`/`abra app upgrade` call across the
+   stage sequence; count them; map each to a stage and a justification.
+2. Compare to the minimal budget (B1). The existing `deploy-count`/`expected_deploy_count` logic is the
+   reference — verify it equals the minimum and that runs pass it.
+3. If over budget on any recipe, eliminate the redundant deploy **without** changing what's tested;
+   re-run the full suite (Adversary cold-verifies green + isolation intact).
+4. If already minimal, document the confirmation and finish — do NOT add speculative perf changes
+   (those live in IDEAS).

-When T1–T7 hold and are Adversary-verified, write `## DONE` to Phase-2b `STATUS.md`.
-
----
-
-## 3. Method (the empirical loop)
-
-1. **Instrument first (T1).** You cannot optimize what you don't measure. Add lightweight timing
-   spans around every phase in `run_recipe_ci.py`/harness; emit to `results.json`. Keep overhead negligible.
-2. **Baseline (T2).** Run the representative set repeatedly, cold and warm; record medians + spread.
-   Distinguish **cold-cache** (first pull/eval) from **warm-cache** (steady state) — they have very
-   different profiles and call for different fixes.
-3. **Attribute (T3).** Rank phases by time. Optimize the **biggest contributors first**; ignore noise.
-4. **Experiment (T4).** One change at a time, re-measure on the same recipes, compare to baseline.
-   Keep if the delta is real and correctness holds; otherwise revert and log the dead-end. **Cap
-   retries** (don't thrash on a change that isn't helping).
-5. **Adopt + re-measure (T5).** Land the winners declaratively (Nix/harness/Drone), then re-baseline
-   to confirm the cumulative gain.
-6. **Guard correctness throughout (T6).** Every speed run is also a correctness run; the Adversary
-   re-verifies independently.
-
----
-
-## 4. Ideas to try (hypotheses — validate empirically, don't assume)
-
-Grouped by where time likely goes. Each is a hypothesis to **measure**, not a guaranteed win.
-
-**A. Image pulls (often the cold-cache dominator).**
-- Stand up a **local Docker registry pull-through cache / mirror** on cc-ci (or `registry-mirrors`)
-  so recipe images aren't re-downloaded across runs.
-- **Pre-pull/warm** the image set for enrolled recipes (a warm-images step / on enroll), so the first
-  real run isn't paying the cold pull.
-- Ensure pinned tags (no `:latest` re-pulls); rely on the node's layer cache (don't prune images the
-  active recipes need — reconcile with Phase-1's `autoPrune`).
-
-**B. Service convergence / readiness (often the warm-cache dominator).**
-- Replace any fixed `sleep`s with **tight readiness polling** against real health endpoints (short
-  interval, sensible cap) — over-waiting is pure waste.
-- Per-recipe **readiness probes** tuned to the app (e.g. keycloak `/realms/master`, DB `pg_isready`)
-  instead of a generic HTTP wait.
-- Parallelize independent readiness checks within a run.
-
-**C. Redundant deploy cycles.**
-- A run currently deploys multiple times (install; upgrade = old→new; backup = deploy→wipe→restore→
-  redeploy). **Share one deployment** where safe: run install + functional + backup-restore against a
-  single deploy; only the upgrade stage needs a separate prior-version deploy. Measure the saving vs
-  any isolation cost.
-- Scope backups to the **minimal data volumes** (restic over only what matters) to cut backup/restore time.
-
-**D. Warm / shared dependency infra (biggest lever for SSO recipes — but mind isolation).**
-- Deploying an SSO provider (keycloak/authentik) *per run* is expensive. Consider a **long-lived warm
-  provider** that recipe tests register a per-run realm/client against, instead of a fresh deploy each
-  run. **Tradeoff:** shared state risks cross-run interference — only adopt if per-run isolation
-  (unique realm/client/users, cleaned up) is provably maintained; the Adversary must verify no
-  contamination. If isolation can't be guaranteed, keep per-run deploys.
-- Keep traefik/the proxy warm (already persistent in Phase 1).
-
-**E. Runner / build caching.**
-- Persistent **nix store** + warm flake eval on the runner (don't re-evaluate/re-fetch per build).
-- Cache test-dependency installs (pip/uv wheels, Playwright browser binaries) in a persistent volume
-  or Drone cache, so each build doesn't refetch.
-
-**F. Concurrency, sized per recipe.**
-- Tune `MAX_TESTS`/`DRONE_RUNNER_CAPACITY` empirically: **light recipes can run concurrently** while
-  heavy ones serialize. Consider a per-recipe **weight/size** so the scheduler packs the node without
-  overcommitting RAM/CPU (now 6GB / 2 vCPU). Parallelize independent functional tests within a run.
-
-**G. Resources.**
-- Right-size the VM: RAM (now 6GB), **vCPU** (currently 2 — more cores speed parallel pulls/builds/
-  JVM), disk I/O. Measure whether CPU or RAM is the bottleneck for heavy recipes before bumping.
-
-**H. abra/secret overhead.**
-- Profile `abra app secret generate` and `abra app new`; avoid regenerating/re-inserting secrets
-  redundantly across stages (reuse the per-run secret store from Phase-1 §4.4-B).
-
-(Validate each on the baseline recipes; keep only measured winners. The list is a starting menu, not
-a mandate.)
-
----
-
-## 5. Milestones (each ends with an Adversary gate)
-
-- **V0 — Instrument + baseline.** Per-phase timing in `results.json`; baseline for the representative
-  set, cold & warm, in `docs/perf/baseline.md`. *Accept:* Adversary reproduces a baseline run and the
-  timings match reality.
-- **V1 — Attribution.** `docs/perf/attribution.md` ranks the time sinks (cold vs warm) and names the
-  top 2–3 levers. *Accept:* the attribution is supported by the recorded numbers.
-- **V2 — Quick wins.** Land the cheapest high-impact fixes (image cache/pre-pull, readiness-wait
-  tuning, dedup deploys) with before/after numbers. *Accept:* measured improvement on the baseline,
-  all tests still green.
-- **V3 — Structural wins.** Evaluate warm/shared infra, runner caching, concurrency sizing, vCPU —
-  adopt the ones that pay off *and* preserve isolation. *Accept:* cumulative improvement vs T2; the
-  Adversary confirms isolation/correctness intact (esp. for any shared-infra change).
-- **V4 — Lock in + document.** Re-baseline to confirm the gain; record adopted config + dead-ends +
-  recommendations in `docs/perf/`. *Accept:* target from T5 met (or a documented, justified best
-  effort); no regressions; flip Phase-2b `STATUS.md` to `## DONE`.
-
----
-
-## 6. Guardrails (inherit Phase 1 §9 + Phase 2 §7.1)
-
-- **Speed never beats correctness.** No change may weaken/skip a test, reduce a real assertion, or
-  break isolation/teardown to look faster. Every perf experiment is re-run as a correctness run.
-- **Shared/warm infra is opt-in and isolation-proven.** Only adopt shared dependencies if per-run
-  isolation (unique namespaces, cleanup) is verified by the Adversary; otherwise keep per-run deploys.
-- **Stay within the node budget.** Concurrency/resource changes must respect RAM/disk/CPU limits
-  (Phase-1 `MAX_TESTS`); don't trade overload for apparent speed.
-- **Change one variable at a time; cap retries.** Attribute gains to specific changes; record
-  dead-ends in `DECISIONS.md` and stop thrashing.
-- **Measure honestly.** Report medians + variance, cold vs warm; don't cherry-pick a lucky fast run.
-
----
-
-## 7. Open decisions (log in DECISIONS.md)
-- The concrete perf **target** (per-recipe time budgets), derived from the attribution.
-- Local registry **pull-through cache** vs explicit pre-pull (or both).
-- Whether to use **warm shared SSO providers** (speed) or keep **per-run providers** (isolation) —
-  decided by the measured saving vs the verified isolation cost.
-- `MAX_TESTS` and per-recipe **weights**; whether to raise vCPU.
-- Whether stage **deploy-sharing** (install+functional+backup on one deploy) is safe per recipe.
+## 4. Guardrails
+- **Correctness first:** never weaken/skip/soften a test or break isolation/teardown to cut a deploy.
+- **Bounded:** this phase ONLY confirms/fixes deploy count. Any other perf idea → `IDEAS.md`
+  ("Phase-2b empirical performance work"); do not re-import them here.
+- **Real abra path** throughout (no docker-level shortcuts).
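
Illustrative sketch (not part of this commit): the B1 budget, `deploys == 1 (base) + 1 (upgrade tier) + N_deps`, and the B2 "fail on mismatch" rule can be written down in a few lines. The names `DeployBudget`, `DeployCountMismatch`, `check_deploy_count`, and the `has_upgrade_tier` flag are hypothetical and exist only for this example; the actual enforcement is the harness's `expected_deploy_count` / DG4.1 check, whose code is not shown in this diff.

```python
from dataclasses import dataclass


class DeployCountMismatch(RuntimeError):
    """Raised when a run's observed deploy count differs from the budget."""


@dataclass(frozen=True)
class DeployBudget:
    """Minimal deploy budget for one recipe run (hypothetical helper)."""

    n_deps: int                    # declared dependencies, e.g. an SSO provider
    has_upgrade_tier: bool = True  # assumption: some recipes may lack this tier

    def expected(self) -> int:
        if self.n_deps < 0:
            raise ValueError("n_deps must be non-negative")
        base = 1  # shared by install + functional/custom + backup/restore
        upgrade = 1 if self.has_upgrade_tier else 0  # prior-version deploy
        return base + upgrade + self.n_deps  # one deploy per dependency


def check_deploy_count(observed: int, budget: DeployBudget) -> None:
    """Fail on any mismatch, in the spirit of `deploy-count != expected`."""
    expected = budget.expected()
    if observed != expected:
        raise DeployCountMismatch(
            f"deploy-count {observed} != expected {expected}"
        )


if __name__ == "__main__":
    # A recipe with one declared dependency and an upgrade tier: 1 + 1 + 1.
    budget = DeployBudget(n_deps=1)
    check_deploy_count(observed=3, budget=budget)  # within budget, no error
    print(budget.expected())
```

Under these assumptions a recipe with no dependencies budgets two deploys, and each declared dependency adds exactly one. Because the check is an equality rather than an upper bound, a run that deploys fewer times than expected (for example, a silently skipped upgrade tier) fails as well as one that deploys too often.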