cc-ci-orchestrator/cc-ci-plan/plan-phase2b-test-performance.md
autonomic-bot 2d3c17f4bd Add Phase-2b plan: test performance (measure, attribute, improve empirically)
Phase 2b (after Phase 2, before Phase 3): instrument per-phase timings, baseline a
representative recipe set (cold vs warm), attribute where time goes (Pareto), then try
improvements as controlled before/after experiments and keep measured winners — image
pull cache/pre-pull, readiness-wait tuning, dedup deploy cycles, warm/shared infra
(isolation-proven), runner caching, concurrency sizing, vCPU. Speed never weakens tests
or isolation (Adversary re-measures + re-verifies). Phase 3 now follows 2b. Linked in README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:26:27 +01:00


# cc-ci Phase 2b — Test performance: measure, attribute, improve (Autonomous Build Plan)
**Status:** QUEUED — starts after Phase 2 (`plan-phase2-recipe-tests.md`) reaches `## DONE`, and runs
**before** Phase 3 (`plan-phase3-results-ux.md`).
**Transition:** **manual** (operator kicks it off).
**Builds on:** Phase 1 (runner, Drone, harness, `MAX_TESTS`) + Phase 2 (the full per-recipe suites —
the *real workload* we're optimizing).
**Owner agents:** same Builder + Adversary loops + protocol as Phase 1 (`plan.md` §6/§7). Here the
Adversary's job is to **independently re-measure** claimed speed-ups and ensure **no test was
weakened or isolation broken** to gain them.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase2b-test-performance.md`
---
## 0. Why this phase
Runs are slow — a Keycloak test took ~30 minutes (the seed observation). Before we polish results
(Phase 3), make the pipeline fast enough to be pleasant and to scale across all recipes. This phase
is **empirical**: instrument → measure a baseline → attribute where the time goes → try improvements
as controlled experiments → keep what measurably helps → re-measure. **No guessing; numbers decide.**
Speed must **never** come from weakening tests, reducing real isolation unsafely, or skipping stages.
---
## 1. Mission
Understand *where* recipe-test time goes (per phase, cold vs warm) and *reduce it measurably* on the
real Phase-2 workload, with before/after numbers for every change and no loss of correctness.
---
## 2. Definition of Done (Phase 2b exit condition)
Terminates only when every item holds **and the Adversary has independently re-verified each within
24h** (logged in `REVIEW.md`):
- [ ] **T1 — Instrumentation.** The runner emits **per-phase timings** for every run (image pull,
`abra app new`/deploy, service convergence, secret generation, each stage install/upgrade/
backup-restore, each functional test, dependency/SSO setup, teardown) into the run's
`results.json` (the same artifact Phase 3 consumes). Timings are visible per run.
- [ ] **T2 — Baseline.** A measured baseline across a **representative recipe set** — at least: a
light/stateless recipe (custom-html), a single-DB recipe (n8n), a heavy JVM/SSO recipe
(keycloak), and an SSO-*dependent* recipe (lasuite-docs). Each measured **cold** (empty image
cache) and **warm** (cached), multiple runs to capture variance. Recorded in `docs/perf/baseline.md`.
- [ ] **T3 — Attribution.** A written attribution (`docs/perf/attribution.md`) showing the **Pareto
breakdown** — which phases dominate, cold vs warm — e.g. "keycloak warm: 8m converge + 4m
backup + …". The biggest levers are identified from data, not intuition.
- [ ] **T4 — Experiments.** Each improvement idea (§4) tried as a **controlled experiment** (change
one variable, hold the rest), with **before/after numbers** in `docs/perf/experiments.md`:
what was changed, the measured delta, and keep/discard. Failed experiments are recorded as
dead-ends (don't re-try).
- [ ] **T5 — Adopted improvements + measured gain.** The beneficial changes are adopted (Nix-declared
/ harness / Drone config) and the **overall run time is measurably reduced** vs the T2 baseline.
Set a concrete target in `DECISIONS.md` from the attribution (e.g. "median warm heavy-recipe run
≤ X min; light recipe ≤ Y min") and hit it, with the single node still safe (RAM/disk/concurrency).
- [ ] **T6 — No regression.** Adversary confirms, from a cold start, that after the speed-ups **every
Phase-2 test still passes and isolation/teardown still hold** (no shared-state contamination, no
weakened/skipped assertions, no leaked apps). A speed-up that compromises correctness is reverted.
- [ ] **T7 — Recommendations.** `docs/perf/README.md` summarizes findings, the recommended config
(e.g. `MAX_TESTS`, cache settings, warm-infra choices) and per-recipe sizing/timeouts, and what
*didn't* help. A new engineer can understand the perf model and re-run the measurements.
When T1 through T7 hold and are Adversary-verified, write `## DONE` to Phase-2b `STATUS.md`.
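
The per-phase timings required by T1 could be collected with a small context manager wrapped around each phase. This is a minimal sketch, not the harness's actual code: the `PhaseTimer` class, the `timings` key, and the `phase`/`seconds`/`status` field names are hypothetical, since the plan does not fix the `results.json` schema. The `time.sleep` calls stand in for real work. A monotonic clock keeps durations correct if the wall clock is adjusted mid-run, and a failed phase is still recorded before the exception propagates.

```python
import json
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations for the named phases of one run."""

    def __init__(self):
        self.spans = []  # one dict per finished phase, in completion order

    @contextmanager
    def span(self, name):
        start = time.monotonic()  # monotonic: immune to wall-clock adjustments
        status = "ok"
        try:
            yield
        except BaseException:
            status = "error"  # record the failed phase, then re-raise
            raise
        finally:
            elapsed = round(time.monotonic() - start, 3)
            self.spans.append({"phase": name, "seconds": elapsed, "status": status})

    def to_json(self):
        # Hypothetical layout: a "timings" list to be merged into results.json.
        return json.dumps({"timings": self.spans}, indent=2)


timer = PhaseTimer()
with timer.span("image_pull"):
    time.sleep(0.01)  # stand-in for the real phase
with timer.span("deploy"):
    time.sleep(0.02)
print(timer.to_json())
```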
---
## 3. Method (the empirical loop)
1. **Instrument first (T1).** You cannot optimize what you don't measure. Add lightweight timing
spans around every phase in `run_recipe_ci.py`/harness; emit to `results.json`. Keep overhead negligible.
2. **Baseline (T2).** Run the representative set repeatedly, cold and warm; record medians + spread.
Distinguish **cold-cache** (first pull/eval) from **warm-cache** (steady state) — they have very
different profiles and call for different fixes.
3. **Attribute (T3).** Rank phases by time. Optimize the **biggest contributors first**; ignore noise.
4. **Experiment (T4).** One change at a time, re-measure on the same recipes, compare to baseline.
Keep if the delta is real and correctness holds; otherwise revert and log the dead-end. **Cap
retries** (don't thrash on a change that isn't helping).
5. **Adopt + re-measure (T5).** Land the winners declaratively (Nix/harness/Drone), then re-baseline
to confirm the cumulative gain.
6. **Guard correctness throughout (T6).** Every speed run is also a correctness run; the Adversary
re-verifies independently.
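
Step 3 (rank phases by time) can be sketched as a Pareto table built from a run's timing spans. The `pareto` helper and the span shape (`phase`, `seconds`) are assumptions for illustration, and the sample numbers are placeholders echoing the "8m converge + 4m backup" example, not measurements. The cumulative-share column shows how few phases account for most of the run, which is what "biggest contributors first" relies on.

```python
def pareto(timings):
    """Rank phases by total seconds, with share and cumulative share of the run."""
    totals = {}
    for span in timings:
        totals[span["phase"]] = totals.get(span["phase"], 0.0) + span["seconds"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = []
    cumulative = 0.0
    for phase, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += seconds
        rows.append((phase, seconds, seconds / grand_total, cumulative / grand_total))
    return rows


# Placeholder numbers for illustration only.
sample = [
    {"phase": "converge", "seconds": 480.0},
    {"phase": "backup_restore", "seconds": 240.0},
    {"phase": "image_pull", "seconds": 60.0},
    {"phase": "teardown", "seconds": 20.0},
]
rows = pareto(sample)
for phase, seconds, share, cumulative_share in rows:
    print(f"{phase:<16}{seconds:>8.1f}s {share:>7.1%} {cumulative_share:>7.1%}")
```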
---
## 4. Ideas to try (hypotheses — validate empirically, don't assume)
Grouped by where time likely goes. Each is a hypothesis to **measure**, not a guaranteed win.
**A. Image pulls (often the cold-cache dominator).**
- Stand up a **local Docker registry pull-through cache / mirror** on cc-ci (or `registry-mirrors`)
so recipe images aren't re-downloaded across runs.
- **Pre-pull/warm** the image set for enrolled recipes (a warm-images step / on enroll), so the first
real run isn't paying the cold pull.
- Ensure pinned tags (no `:latest` re-pulls); rely on the node's layer cache (don't prune images the
active recipes need — reconcile with Phase-1's `autoPrune`).
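
A check for the "pinned tags" point could look like the sketch below. The `is_pinned` helper and the example image names are hypothetical. The logic relies only on the standard image-reference shape: an optional registry host (possibly with a port), a name, then either a `:tag` or an `@digest`, with an omitted tag defaulting to `latest`.

```python
def is_pinned(image_ref):
    """True if the reference carries a digest or an explicit tag other than latest."""
    if "@" in image_ref:  # digest form, e.g. name@sha256:...
        return True
    last_segment = image_ref.rsplit("/", 1)[-1]  # drop any registry host:port prefix
    if ":" not in last_segment:  # no tag: the implicit default is latest
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical references for illustration.
for ref in ["nginx", "nginx:latest", "nginx:1.27", "localhost:5000/app", "localhost:5000/app:2.0"]:
    print(f"{ref:<26} pinned={is_pinned(ref)}")
```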
**B. Service convergence / readiness (often the warm-cache dominator).**
- Replace any fixed `sleep`s with **tight readiness polling** against real health endpoints (short
interval, sensible cap) — over-waiting is pure waste.
- Per-recipe **readiness probes** tuned to the app (e.g. keycloak `/realms/master`, DB `pg_isready`)
instead of a generic HTTP wait.
- Parallelize independent readiness checks within a run.
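
The "short interval, sensible cap" polling described above might be sketched as follows. `wait_until_ready` is a hypothetical helper and the default timeout and interval are placeholders to be tuned from measurements. The probe is any callable returning a truthy value when the service is ready; a real one would wrap a health-endpoint request or `pg_isready`, which is deliberately left out here. A probe that raises is treated as "not ready yet", and the function returns the time actually waited so it can be recorded as a timing.

```python
import time


def wait_until_ready(probe, timeout=120.0, interval=0.5):
    """Poll probe() until it is truthy; return seconds waited or raise TimeoutError."""
    start = time.monotonic()
    while True:
        try:
            if probe():
                return time.monotonic() - start
        except Exception:
            pass  # e.g. connection refused: treat as not ready yet
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"not ready after {timeout}s")
        time.sleep(interval)


# Demonstration with a fake probe that becomes ready on its third call.
attempts = {"n": 0}


def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3


waited = wait_until_ready(fake_probe, timeout=5.0, interval=0.01)
print(f"ready after {attempts['n']} probes, {waited:.3f}s")
```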
**C. Redundant deploy cycles.**
- A run currently deploys multiple times (install; upgrade = old→new; backup = deploy→wipe→restore→
redeploy). **Share one deployment** where safe: run install + functional + backup-restore against a
single deploy; only the upgrade stage needs a separate prior-version deploy. Measure the saving vs
any isolation cost.
- Scope backups to the **minimal data volumes** (restic over only what matters) to cut backup/restore time.
**D. Warm / shared dependency infra (biggest lever for SSO recipes — but mind isolation).**
- Deploying an SSO provider (keycloak/authentik) *per run* is expensive. Consider a **long-lived warm
provider** that recipe tests register a per-run realm/client against, instead of a fresh deploy each
run. **Tradeoff:** shared state risks cross-run interference — only adopt if per-run isolation
(unique realm/client/users, cleaned up) is provably maintained; the Adversary must verify no
contamination. If isolation can't be guaranteed, keep per-run deploys.
- Keep traefik/the proxy warm (already persistent in Phase 1).
**E. Runner / build caching.**
- Persistent **nix store** + warm flake eval on the runner (don't re-evaluate/re-fetch per build).
- Cache test-dependency installs (pip/uv wheels, Playwright browser binaries) in a persistent volume
or Drone cache, so each build doesn't refetch.
**F. Concurrency, sized per recipe.**
- Tune `MAX_TESTS`/`DRONE_RUNNER_CAPACITY` empirically: **light recipes can run concurrently** while
heavy ones serialize. Consider a per-recipe **weight/size** so the scheduler packs the node without
overcommitting RAM/CPU (now 6GB / 2 vCPU). Parallelize independent functional tests within a run.
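
One way to picture per-recipe weights is a greedy admission check against a RAM budget. Everything here is an assumption for illustration: the `pick_runnable` helper, the weights, the default weight for unlisted recipes, and the 4096 MB budget (chosen only to leave headroom on the 6GB node) are placeholders that the baseline would have to supply. First-fit lets a light recipe start past a heavy one that does not fit, which is the "light recipes run concurrently while heavy ones serialize" behaviour.

```python
def pick_runnable(queue, running, budget_mb, weights_mb, default_mb):
    """Greedy first-fit: admit queued recipes in order while the RAM budget holds."""
    used = sum(weights_mb.get(recipe, default_mb) for recipe in running)
    admitted = []
    for recipe in queue:
        need = weights_mb.get(recipe, default_mb)
        if used + need <= budget_mb:
            admitted.append(recipe)
            used += need
    return admitted


# Placeholder weights in MB, not measured values.
weights = {"custom-html": 256, "n8n": 1024, "keycloak": 3072}

first = pick_runnable(["keycloak", "n8n", "custom-html", "lasuite-docs"], [], 4096, weights, 2048)
print(first)
second = pick_runnable(["keycloak", "custom-html"], ["keycloak"], 4096, weights, 2048)
print(second)
```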
**G. Resources.**
- Right-size the VM: RAM (now 6GB), **vCPU** (currently 2 — more cores speed parallel pulls/builds/
JVM), disk I/O. Measure whether CPU or RAM is the bottleneck for heavy recipes before bumping.
**H. abra/secret overhead.**
- Profile `abra app secret generate` and `abra app new`; avoid regenerating/re-inserting secrets
redundantly across stages (reuse the per-run secret store from Phase-1 §4.4-B).
(Validate each on the baseline recipes; keep only measured winners. The list is a starting menu, not
a mandate.)
---
## 5. Milestones (each ends with an Adversary gate)
- **V0 — Instrument + baseline.** Per-phase timing in `results.json`; baseline for the representative
set, cold & warm, in `docs/perf/baseline.md`. *Accept:* Adversary reproduces a baseline run and the
timings match reality.
- **V1 — Attribution.** `docs/perf/attribution.md` ranks the time sinks (cold vs warm) and names the
top 2-3 levers. *Accept:* the attribution is supported by the recorded numbers.
- **V2 — Quick wins.** Land the cheapest high-impact fixes (image cache/pre-pull, readiness-wait
tuning, dedup deploys) with before/after numbers. *Accept:* measured improvement on the baseline,
all tests still green.
- **V3 — Structural wins.** Evaluate warm/shared infra, runner caching, concurrency sizing, vCPU —
adopt the ones that pay off *and* preserve isolation. *Accept:* cumulative improvement vs T2; the
Adversary confirms isolation/correctness intact (esp. for any shared-infra change).
- **V4 — Lock in + document.** Re-baseline to confirm the gain; record adopted config + dead-ends +
recommendations in `docs/perf/`. *Accept:* target from T5 met (or a documented, justified best
effort); no regressions; flip Phase-2b `STATUS.md` to `## DONE`.
---
## 6. Guardrails (inherit Phase 1 §9 + Phase 2 §7.1)
- **Speed never beats correctness.** No change may weaken/skip a test, reduce a real assertion, or
break isolation/teardown to look faster. Every perf experiment is re-run as a correctness run.
- **Shared/warm infra is opt-in and isolation-proven.** Only adopt shared dependencies if per-run
isolation (unique namespaces, cleanup) is verified by the Adversary; otherwise keep per-run deploys.
- **Stay within the node budget.** Concurrency/resource changes must respect RAM/disk/CPU limits
(Phase-1 `MAX_TESTS`); don't trade overload for apparent speed.
- **Change one variable at a time; cap retries.** Attribute gains to specific changes; record
dead-ends in `DECISIONS.md` and stop thrashing.
- **Measure honestly.** Report medians + variance, cold vs warm; don't cherry-pick a lucky fast run.
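
Reporting medians with spread, and judging a before/after experiment on medians, could be sketched like this. The `summarize` and `compare` helpers are hypothetical and the sample durations are placeholders, not measurements. A median delta smaller than the run-to-run spread is a reasonable signal to collect more runs before calling the change a win.

```python
import statistics


def summarize(samples):
    """Median and spread (min, max, sample stdev) of run durations in seconds."""
    return {
        "n": len(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def compare(before, after):
    """Median delta of after vs before; a negative delta means faster."""
    b = statistics.median(before)
    a = statistics.median(after)
    return {"before_median": b, "after_median": a, "delta": a - b, "relative": (a - b) / b}


# Placeholder durations in seconds for illustration only.
before = [900.0, 940.0, 1010.0, 920.0, 960.0]
after = [700.0, 720.0, 690.0, 850.0, 710.0]

print(summarize(before))
print(summarize(after))
result = compare(before, after)
print(f"median delta: {result['delta']:+.0f}s ({result['relative']:+.1%})")
```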
---
## 7. Open decisions (log in DECISIONS.md)
- The concrete perf **target** (per-recipe time budgets), derived from the attribution.
- Local registry **pull-through cache** vs explicit pre-pull (or both).
- Whether to use **warm shared SSO providers** (speed) or keep **per-run providers** (isolation) —
decided by the measured saving vs the verified isolation cost.
- `MAX_TESTS` and per-recipe **weights**; whether to raise vCPU.
- Whether stage **deploy-sharing** (install+functional+backup on one deploy) is safe per recipe.