Add Phase-2b plan: test performance (measure, attribute, improve empirically)

Phase 2b (after Phase 2, before Phase 3): instrument per-phase timings, baseline a representative recipe set (cold vs warm), attribute where time goes (Pareto), then try improvements as controlled before/after experiments and keep measured winners: image pull cache/pre-pull, readiness-wait tuning, dedup deploy cycles, warm/shared infra (isolation-proven), runner caching, concurrency sizing, vCPU. Speed never weakens tests or isolation (Adversary re-measures + re-verifies). Phase 3 now follows 2b. Linked in README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -16,7 +16,8 @@ autonomous Claude loops (a Builder and an adversarial Reviewer) running over day
|---|---|
| `plan.md` | The Phase-1 plan (build the CI server). Agents treat it as their single source of truth. |
| `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase-1 `## DONE`): author comprehensive per-recipe tests — port every recipe-maintainer test + ≥2 recipe-specific tests per app. |
-| `plan-phase3-results-ux.md` | **Phase 3** (after Phase-2 `## DONE`): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
+| `plan-phase2b-test-performance.md` | **Phase 2b** (after Phase 2, before Phase 3): empirically measure where test time goes and reduce it (image cache, readiness tuning, dedup deploys, warm infra, concurrency) — no weakened tests. |
+| `plan-phase3-results-ux.md` | **Phase 3** (after Phase 2b): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
| `IDEAS.md` | Deferred/future ideas, parked out of current scope. |
| `brief.md` | The original one-page brief (context only; `plan.md` supersedes it). |
| `kickoff.md` | Launch & supervision guide. |
181
cc-ci-plan/plan-phase2b-test-performance.md
Normal file
@@ -0,0 +1,181 @@
# cc-ci Phase 2b — Test performance: measure, attribute, improve (Autonomous Build Plan)

**Status:** QUEUED — starts after Phase 2 (`plan-phase2-recipe-tests.md`) reaches `## DONE`, and runs
**before** Phase 3 (`plan-phase3-results-ux.md`).
**Transition:** **manual** (operator kicks it off).
**Builds on:** Phase 1 (runner, Drone, harness, `MAX_TESTS`) + Phase 2 (the full per-recipe suites —
the *real workload* we're optimizing).
**Owner agents:** same Builder + Adversary loops + protocol as Phase 1 (`plan.md` §6/§7). Here the
Adversary's job is to **independently re-measure** claimed speed-ups and ensure **no test was
weakened or isolation broken** to gain them.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase2b-test-performance.md`

---

## 0. Why this phase

Runs are slow: a Keycloak test took ~30 minutes (the seed observation). Before we polish results
(Phase 3), make the pipeline fast enough to be pleasant and to scale across all recipes. This phase
is **empirical**: instrument → measure a baseline → attribute where the time goes → try improvements
as controlled experiments → keep what measurably helps → re-measure. **No guessing; numbers decide.**
Speed must **never** come from weakening tests, compromising isolation, or skipping stages.

---

## 1. Mission

Understand *where* recipe-test time goes (per phase, cold vs warm) and *reduce it measurably* on the
real Phase-2 workload, with before/after numbers for every change and no loss of correctness.

---

## 2. Definition of Done (Phase 2b exit condition)

Phase 2b terminates only when every item holds **and the Adversary has independently re-verified each
within 24h** (logged in `REVIEW.md`):

- [ ] **T1 — Instrumentation.** The runner emits **per-phase timings** for every run (image pull,
      `abra app new`/deploy, service convergence, secret generation, each of the install/upgrade/
      backup-restore stages, each functional test, dependency/SSO setup, teardown) into the run's
      `results.json` (the same artifact Phase 3 consumes). Timings are visible per run.
- [ ] **T2 — Baseline.** A measured baseline across a **representative recipe set**, covering at least a
      light/stateless recipe (custom-html), a single-DB recipe (n8n), a heavy JVM/SSO recipe
      (keycloak), and an SSO-*dependent* recipe (lasuite-docs). Each is measured **cold** (empty image
      cache) and **warm** (cached), with multiple runs to capture variance. Recorded in `docs/perf/baseline.md`.
- [ ] **T3 — Attribution.** A written attribution (`docs/perf/attribution.md`) showing the **Pareto
      breakdown** of which phases dominate, cold vs warm (e.g. "keycloak warm: 8m converge + 4m
      backup + …"). The biggest levers are identified from data, not intuition.
- [ ] **T4 — Experiments.** Each improvement idea (§4) is tried as a **controlled experiment** (change
      one variable, hold the rest), with **before/after numbers** in `docs/perf/experiments.md`:
      what was changed, the measured delta, and keep/discard. Failed experiments are recorded as
      dead-ends so they are not retried.
- [ ] **T5 — Adopted improvements + measured gain.** The beneficial changes are adopted (Nix-declared
      / harness / Drone config) and the **overall run time is measurably reduced** vs the T2 baseline.
      Set a concrete target in `DECISIONS.md` from the attribution (e.g. "median warm heavy-recipe run
      ≤ X min; light recipe ≤ Y min") and hit it, while the single node stays within safe
      RAM/disk/concurrency limits.
- [ ] **T6 — No regression.** The Adversary confirms, from a cold start, that after the speed-ups **every
      Phase-2 test still passes and isolation/teardown still hold** (no shared-state contamination, no
      weakened/skipped assertions, no leaked apps). A speed-up that compromises correctness is reverted.
- [ ] **T7 — Recommendations.** `docs/perf/README.md` summarizes the findings, the recommended config
      (e.g. `MAX_TESTS`, cache settings, warm-infra choices), per-recipe sizing/timeouts, and what
      *didn't* help. A new engineer can understand the perf model and re-run the measurements.

When T1–T7 hold and are Adversary-verified, write `## DONE` to Phase-2b `STATUS.md`.
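
As an illustration of what T1 asks for, per-phase timings could be collected with a small context manager wrapped around each phase. This is a minimal sketch under stated assumptions: `PhaseTimer`, its `span` method, and the `timings` key are hypothetical names, not part of the existing harness or the current `results.json` schema.

```python
import json
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations per named phase (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested without sleeping
        self.timings = []    # entries: {"phase": str, "seconds": float, "ok": bool}

    @contextmanager
    def span(self, phase):
        start = self._clock()
        ok = True
        try:
            yield
        except Exception:
            ok = False  # record the failed phase, then let the error propagate
            raise
        finally:
            elapsed = round(self._clock() - start, 3)
            self.timings.append({"phase": phase, "seconds": elapsed, "ok": ok})

    def to_results(self, results=None):
        """Return a copy of `results` with timings under the assumed "timings" key."""
        merged = dict(results or {})
        merged["timings"] = list(self.timings)
        return merged


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.span("image_pull"):
        time.sleep(0.01)  # stand-in for the real phase
    with timer.span("deploy"):
        time.sleep(0.01)
    print(json.dumps(timer.to_results({"recipe": "custom-html"}), indent=2))
```

Using a monotonic clock keeps the spans immune to wall-clock adjustments, and recording failed phases as well as successful ones means a slow failure still shows up in the attribution.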

---

## 3. Method (the empirical loop)

1. **Instrument first (T1).** You cannot optimize what you don't measure. Add lightweight timing
   spans around every phase in `run_recipe_ci.py`/harness; emit to `results.json`. Keep overhead negligible.
2. **Baseline (T2).** Run the representative set repeatedly, cold and warm; record medians + spread.
   Distinguish **cold-cache** (first pull/eval) from **warm-cache** (steady state): they have very
   different profiles and call for different fixes.
3. **Attribute (T3).** Rank phases by time. Optimize the **biggest contributors first**; ignore noise.
4. **Experiment (T4).** One change at a time, re-measure on the same recipes, compare to baseline.
   Keep if the delta is real and correctness holds; otherwise revert and log the dead-end. **Cap
   retries** (don't thrash on a change that isn't helping).
5. **Adopt + re-measure (T5).** Land the winners declaratively (Nix/harness/Drone), then re-baseline
   to confirm the cumulative gain.
6. **Guard correctness throughout (T6).** Every speed run is also a correctness run; the Adversary
   re-verifies independently.
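
Steps 2 and 3 can be sketched as two small functions: one that reduces repeated runs to a median and spread per phase, and one that ranks phases Pareto-style. This is an illustrative sketch only; the function names and the input shape (one `{phase: seconds}` dict per run) are assumptions, and the sample numbers are made up rather than measured.

```python
from statistics import median


def summarize(runs):
    """Reduce repeated runs to {phase: (median, min, max)}.

    `runs` is a list of {phase: seconds} dicts for one recipe and one
    cache state (cold or warm).
    """
    phases = sorted({phase for run in runs for phase in run})
    summary = {}
    for phase in phases:
        samples = [run[phase] for run in runs if phase in run]
        summary[phase] = (median(samples), min(samples), max(samples))
    return summary


def pareto(summary):
    """Rank phases by median time; return (phase, median, share, cumulative share)."""
    total = sum(med for med, _, _ in summary.values())
    ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
    rows, cumulative = [], 0.0
    for phase, (med, _, _) in ranked:
        share = med / total if total else 0.0
        cumulative += share
        rows.append((phase, med, share, cumulative))
    return rows


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only (not measurements).
    sample_runs = [
        {"converge": 480, "backup": 240, "image_pull": 30},
        {"converge": 500, "backup": 200, "image_pull": 50},
        {"converge": 460, "backup": 220, "image_pull": 40},
    ]
    for phase, med, share, cumulative in pareto(summarize(sample_runs)):
        print(f"{phase:12s} {med:7.1f}s {share:6.1%} {cumulative:6.1%}")
```

Keeping cold and warm runs in separate `summarize` calls preserves the distinction step 2 asks for, since mixing them would blur two different profiles into one ranking.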

---

## 4. Ideas to try (hypotheses — validate empirically, don't assume)

Grouped by where time likely goes. Each is a hypothesis to **measure**, not a guaranteed win.

**A. Image pulls (often the cold-cache dominator).**
- Stand up a **local Docker registry pull-through cache / mirror** on cc-ci (or configure
  `registry-mirrors`) so recipe images aren't re-downloaded across runs.
- **Pre-pull/warm** the image set for enrolled recipes (a warm-images step / on enroll), so the first
  real run isn't paying the cold pull.
- Ensure pinned tags (no `:latest` re-pulls); rely on the node's layer cache (don't prune images the
  active recipes need; reconcile with Phase-1's `autoPrune`).
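
The pinned-tag check in group A could be automated with a small reference parser. The sketch below is an assumption-laden illustration: `is_pinned` and `unpinned` are hypothetical helpers, and "pinned" here means an explicit non-`latest` tag or a digest. A version tag can still be re-pushed upstream, so a digest is the stricter form.

```python
def is_pinned(image_ref):
    """True if the reference has a digest or an explicit tag other than `latest`."""
    if "@" in image_ref:
        return True  # digest-pinned, e.g. repo/app@sha256:...
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last = image_ref.rsplit("/", 1)[-1]
    if ":" not in last:
        return False  # no tag means Docker resolves it as :latest
    tag = last.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned(image_refs):
    """Return the references that would float with upstream pushes."""
    return [ref for ref in image_refs if not is_pinned(ref)]


if __name__ == "__main__":
    # Hypothetical image list for illustration.
    print(unpinned(["nginx:1.27.0", "postgres", "localhost:5000/app:latest"]))
```

Running such a check at enroll time would flag floating references before they cause surprise re-pulls during a measured run.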

**B. Service convergence / readiness (often the warm-cache dominator).**
- Replace any fixed `sleep`s with **tight readiness polling** against real health endpoints (short
  interval, sensible cap); over-waiting is pure waste.
- Per-recipe **readiness probes** tuned to the app (e.g. keycloak `/realms/master`, DB `pg_isready`)
  instead of a generic HTTP wait.
- Parallelize independent readiness checks within a run.
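
The first bullet in group B (polling with a short interval and a cap instead of a fixed `sleep`) might look like the following. It is a minimal sketch, not the harness's actual wait logic: `wait_until_ready` and `http_ok` are hypothetical names, and the default timeout and interval are placeholders to be tuned from measurements.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout=300.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True; return the seconds waited.

    Raises TimeoutError once `timeout` seconds have elapsed without success.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            pass  # connection refused, HTTP errors etc. mean "not ready yet"
        if clock() - start >= timeout:
            raise TimeoutError(f"not ready after {timeout:.0f}s")
        sleep(interval)


def http_ok(url, timeout=2.0):
    """Build a probe that succeeds when `url` answers with a 2xx status."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    return probe
```

A per-recipe probe would then be passed in, for example `wait_until_ready(http_ok(base_url + "/realms/master"))` for keycloak, where `base_url` stands for the app's per-run URL. Returning the waited time makes the convergence phase directly recordable in the run's timings.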

**C. Redundant deploy cycles.**
- A run currently deploys multiple times (install; upgrade = old→new; backup = deploy→wipe→restore→
  redeploy). **Share one deployment** where safe: run install + functional + backup-restore against a
  single deploy; only the upgrade stage needs a separate prior-version deploy. Measure the saving vs
  any isolation cost.
- Scope backups to the **minimal data volumes** (restic over only what matters) to cut backup/restore time.

**D. Warm / shared dependency infra (biggest lever for SSO recipes, but mind isolation).**
- Deploying an SSO provider (keycloak/authentik) *per run* is expensive. Consider a **long-lived warm
  provider** that recipe tests register a per-run realm/client against, instead of a fresh deploy each
  run. **Tradeoff:** shared state risks cross-run interference. Only adopt if per-run isolation
  (unique realm/client/users, cleaned up) is provably maintained; the Adversary must verify no
  contamination. If isolation can't be guaranteed, keep per-run deploys.
- Keep traefik/the proxy warm (already persistent in Phase 1).

**E. Runner / build caching.**
- Persistent **nix store** + warm flake eval on the runner (don't re-evaluate/re-fetch per build).
- Cache test-dependency installs (pip/uv wheels, Playwright browser binaries) in a persistent volume
  or Drone cache, so each build doesn't refetch.

**F. Concurrency, sized per recipe.**
- Tune `MAX_TESTS`/`DRONE_RUNNER_CAPACITY` empirically: **light recipes can run concurrently** while
  heavy ones serialize. Consider a per-recipe **weight/size** so the scheduler packs the node without
  overcommitting RAM/CPU (currently 6 GB / 2 vCPU). Parallelize independent functional tests within a run.
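
The per-recipe weight idea in group F can be pictured as a greedy admission check. The sketch below is one possible shape, not a description of how Drone or the runner schedules today: `admit` is a hypothetical function, and the weights are made-up budget units (for example GB of RAM) that would have to come from the baseline.

```python
def admit(queue, running_weight, capacity):
    """Pick queued recipes, in order, that fit the remaining capacity.

    `queue` is a list of (recipe, weight) pairs. Recipes that do not fit
    right now are skipped and stay queued for a later pass.
    """
    started, used = [], running_weight
    for recipe, weight in queue:
        if weight > capacity:
            raise ValueError(f"{recipe} (weight {weight}) can never fit capacity {capacity}")
        if used + weight <= capacity:
            started.append(recipe)
            used += weight
    return started


if __name__ == "__main__":
    # Hypothetical weights for illustration only.
    queue = [("keycloak", 4), ("custom-html", 1), ("n8n", 2), ("lasuite-docs", 4)]
    print(admit(queue, running_weight=0, capacity=6))
```

One design caveat with first-fit packing is that light recipes can keep jumping ahead of a heavy one, so a real scheduler would likely need an aging or priority rule; whether that matters is something the experiments would show.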

**G. Resources.**
- Right-size the VM: RAM (currently 6 GB), **vCPU** (currently 2; more cores speed up parallel
  pulls/builds/JVM), disk I/O. Measure whether CPU or RAM is the bottleneck for heavy recipes before bumping.

**H. abra/secret overhead.**
- Profile `abra app secret generate` and `abra app new`; avoid regenerating/re-inserting secrets
  redundantly across stages (reuse the per-run secret store from Phase-1 §4.4-B).

(Validate each on the baseline recipes; keep only measured winners. The list is a starting menu, not
a mandate.)

---

## 5. Milestones (each ends with an Adversary gate)

- **V0 — Instrument + baseline.** Per-phase timing in `results.json`; baseline for the representative
  set, cold & warm, in `docs/perf/baseline.md`. *Accept:* Adversary reproduces a baseline run and the
  timings match reality.
- **V1 — Attribution.** `docs/perf/attribution.md` ranks the time sinks (cold vs warm) and names the
  top 2–3 levers. *Accept:* the attribution is supported by the recorded numbers.
- **V2 — Quick wins.** Land the cheapest high-impact fixes (image cache/pre-pull, readiness-wait
  tuning, dedup deploys) with before/after numbers. *Accept:* measured improvement on the baseline,
  all tests still green.
- **V3 — Structural wins.** Evaluate warm/shared infra, runner caching, concurrency sizing, and vCPU;
  adopt the ones that pay off *and* preserve isolation. *Accept:* cumulative improvement vs the T2
  baseline; the Adversary confirms isolation/correctness intact (esp. for any shared-infra change).
- **V4 — Lock in + document.** Re-baseline to confirm the gain; record adopted config + dead-ends +
  recommendations in `docs/perf/`. *Accept:* target from T5 met (or a documented, justified best
  effort); no regressions; flip Phase-2b `STATUS.md` to `## DONE`.

---

## 6. Guardrails (inherit Phase 1 §9 + Phase 2 §7.1)

- **Speed never beats correctness.** No change may weaken/skip a test, reduce a real assertion, or
  break isolation/teardown to look faster. Every perf experiment is re-run as a correctness run.
- **Shared/warm infra is opt-in and isolation-proven.** Only adopt shared dependencies if per-run
  isolation (unique namespaces, cleanup) is verified by the Adversary; otherwise keep per-run deploys.
- **Stay within the node budget.** Concurrency/resource changes must respect RAM/disk/CPU limits
  (Phase-1 `MAX_TESTS`); don't trade overload for apparent speed.
- **Change one variable at a time; cap retries.** Attribute gains to specific changes; record
  dead-ends in `DECISIONS.md` and stop thrashing.
- **Measure honestly.** Report medians + variance, cold vs warm; don't cherry-pick a lucky fast run.
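
The last two guardrails suggest a mechanical keep/discard rule for each experiment. The sketch below shows one possible rule under explicit assumptions: `decide` is a hypothetical helper, the 10% minimum gain is a placeholder rather than a project decision, and "noise" is approximated as the baseline's min-to-max range relative to its median.

```python
from statistics import median


def decide(before, after, tests_green, min_gain=0.10):
    """Compare before/after run times (seconds) for one single-variable change.

    Keep the change only if correctness held, the median improved by at
    least `min_gain`, and the gain exceeds the baseline's own spread.
    """
    before_median = median(before)
    after_median = median(after)
    gain = (before_median - after_median) / before_median
    noise = (max(before) - min(before)) / before_median
    keep = bool(tests_green) and gain >= min_gain and gain > noise
    return {
        "before_median": before_median,
        "after_median": after_median,
        "gain": gain,
        "noise": noise,
        "keep": keep,
    }
```

Comparing medians over several runs, rather than a single best run, is what stops a lucky fast run from being recorded as a win, and gating on `tests_green` keeps a faster but broken run from ever being kept.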

---

## 7. Open decisions (log in DECISIONS.md)
- The concrete perf **target** (per-recipe time budgets), derived from the attribution.
- Local registry **pull-through cache** vs explicit pre-pull (or both).
- Whether to use **warm shared SSO providers** (speed) or keep **per-run providers** (isolation),
  decided by the measured saving vs the verified isolation cost.
- `MAX_TESTS` and per-recipe **weights**; whether to raise vCPU.
- Whether stage **deploy-sharing** (install+functional+backup on one deploy) is safe per recipe.
@@ -1,6 +1,7 @@
# cc-ci Phase 3 — Beautiful YunoHost-style results (Autonomous Build Plan)

-**Status:** QUEUED — starts only after Phase 2 (`plan-phase2-recipe-tests.md`) reaches `## DONE`.
+**Status:** QUEUED — starts after Phase 2 (`plan-phase2-recipe-tests.md`) and Phase 2b
+(`plan-phase2b-test-performance.md`) reach `## DONE`.
**Transition:** **manual** (operator kicks it off; check in / test between phases).
**Builds on:** Phase 1 (`plan.md` — dashboard `dashboard/`, the `!testme` bridge's PR comment,
the runner, Playwright in the harness) and Phase 2 (the rich per-recipe test taxonomy → meaningful