Phase 2b (after Phase 2, before Phase 3): instrument per-phase timings, baseline a representative recipe set (cold vs warm), attribute where time goes (Pareto), then try improvements as controlled before/after experiments and keep measured winners — image pull cache/pre-pull, readiness-wait tuning, dedup deploy cycles, warm/shared infra (isolation-proven), runner caching, concurrency sizing, vCPU. Speed never weakens tests or isolation (Adversary re-measures + re-verifies). Phase 3 now follows 2b. Linked in README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc-ci Phase 2b — Test performance: measure, attribute, improve (Autonomous Build Plan)
Status: QUEUED — starts after Phase 2 (plan-phase2-recipe-tests.md) reaches ## DONE, and runs
before Phase 3 (plan-phase3-results-ux.md).
Transition: manual (operator kicks it off).
Builds on: Phase 1 (runner, Drone, harness, MAX_TESTS) + Phase 2 (the full per-recipe suites —
the real workload we're optimizing).
Owner agents: same Builder + Adversary loops + protocol as Phase 1 (plan.md §6/§7). Here the
Adversary's job is to independently re-measure claimed speed-ups and ensure no test was
weakened or isolation broken to gain them.
This file's path: /srv/cc-ci/cc-ci-plan/plan-phase2b-test-performance.md
0. Why this phase
Runs are slow — a Keycloak test took ~30 minutes (the seed observation). Before we polish results (Phase 3), make the pipeline fast enough to be pleasant and to scale across all recipes. This phase is empirical: instrument → measure a baseline → attribute where the time goes → try improvements as controlled experiments → keep what measurably helps → re-measure. No guessing; numbers decide. Speed must never come from weakening tests, reducing real isolation unsafely, or skipping stages.
1. Mission
Understand where recipe-test time goes (per phase, cold vs warm) and reduce it measurably on the real Phase-2 workload, with before/after numbers for every change and no loss of correctness.
2. Definition of Done (Phase 2b exit condition)
Terminates only when every item holds and the Adversary has independently re-verified each within
24h (logged in REVIEW.md):
- T1: Instrumentation. The runner emits per-phase timings for every run (image pull, abra app new/deploy, service convergence, secret generation, each stage install/upgrade/backup-restore, each functional test, dependency/SSO setup, teardown) into the run's results.json (the same artifact Phase 3 consumes). Timings are visible per run.
- T2: Baseline. A measured baseline across a representative recipe set, at least: a light/stateless recipe (custom-html), a single-DB recipe (n8n), a heavy JVM/SSO recipe (keycloak), and an SSO-dependent recipe (lasuite-docs). Each measured cold (empty image cache) and warm (cached), multiple runs to capture variance. Recorded in docs/perf/baseline.md.
- T3: Attribution. A written attribution (docs/perf/attribution.md) showing the Pareto breakdown (which phases dominate, cold vs warm), e.g. "keycloak warm: 8m converge + 4m backup + …". The biggest levers are identified from data, not intuition.
- T4: Experiments. Each improvement idea (§4) tried as a controlled experiment (change one variable, hold the rest), with before/after numbers in docs/perf/experiments.md: what was changed, the measured delta, and keep/discard. Failed experiments are recorded as dead-ends (don't re-try).
- T5: Adopted improvements + measured gain. The beneficial changes are adopted (Nix-declared / harness / Drone config) and the overall run time is measurably reduced vs the T2 baseline. Set a concrete target in DECISIONS.md from the attribution (e.g. "median warm heavy-recipe run ≤ X min; light recipe ≤ Y min") and hit it, with the single node still safe (RAM/disk/concurrency).
- T6: No regression. Adversary confirms, from a cold start, that after the speed-ups every Phase-2 test still passes and isolation/teardown still hold (no shared-state contamination, no weakened/skipped assertions, no leaked apps). A speed-up that compromises correctness is reverted.
- T7: Recommendations. docs/perf/README.md summarizes findings, the recommended config (e.g. MAX_TESTS, cache settings, warm-infra choices) and per-recipe sizing/timeouts, and what didn't help. A new engineer can understand the perf model and re-run the measurements.
When T1–T7 hold and are Adversary-verified, write ## DONE to Phase-2b STATUS.md.
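The per-phase timings that T1 asks for could be collected with a small context manager. This is a minimal sketch, not the harness's actual code: the `PhaseTimer` class, the `timings` key, and the `phase`/`seconds` field names are all assumptions, since this plan does not define the results.json schema.

```python
import json
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations for the named phases of one run."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be unit-tested
        self.spans = []

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the phase raised, so a failed run
            # still shows where its time went.
            elapsed = round(self._clock() - start, 3)
            self.spans.append({"phase": name, "seconds": elapsed})

    def merge_into(self, results):
        """Return a copy of `results` with a hypothetical "timings" key added."""
        merged = dict(results)
        merged["timings"] = list(self.spans)
        return merged


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.span("image_pull"):
        time.sleep(0.01)  # stand-in for the real phase
    with timer.span("converge"):
        time.sleep(0.02)  # stand-in for the real phase
    print(json.dumps(timer.merge_into({"recipe": "custom-html"}), indent=2))
```

`time.monotonic` is used instead of wall-clock time so that clock adjustments on the node cannot produce negative or inflated spans.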
3. Method (the empirical loop)
- Instrument first (T1). You cannot optimize what you don't measure. Add lightweight timing spans around every phase in run_recipe_ci.py/harness; emit to results.json. Keep overhead negligible.
- Baseline (T2). Run the representative set repeatedly, cold and warm; record medians + spread. Distinguish cold-cache (first pull/eval) from warm-cache (steady state): they have very different profiles and call for different fixes.
- Attribute (T3). Rank phases by time. Optimize the biggest contributors first; ignore noise.
- Experiment (T4). One change at a time, re-measure on the same recipes, compare to baseline. Keep if the delta is real and correctness holds; otherwise revert and log the dead-end. Cap retries (don't thrash on a change that isn't helping).
- Adopt + re-measure (T5). Land the winners declaratively (Nix/harness/Drone), then re-baseline to confirm the cumulative gain.
- Guard correctness throughout (T6). Every speed run is also a correctness run; the Adversary re-verifies independently.
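The Attribute step above amounts to a Pareto ranking of phase durations. The sketch below shows one way to compute it. The function names and the 80% threshold are assumptions. The sample figures are illustrative only: the converge and backup values echo the "8m converge + 4m backup" example in T3, and the rest are made up.

```python
def pareto(phase_seconds):
    """Rank phases by time, with each phase's share and cumulative share."""
    total = sum(phase_seconds.values())
    if total <= 0:
        return []
    rows = []
    running = 0.0
    ranked = sorted(phase_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for phase, seconds in ranked:
        running += seconds
        rows.append({
            "phase": phase,
            "seconds": seconds,
            "share": round(seconds / total, 3),
            "cumulative": round(running / total, 3),
        })
    return rows


def top_levers(rows, threshold=0.8):
    """Smallest leading set of phases that covers `threshold` of the run time."""
    levers = []
    for row in rows:
        levers.append(row["phase"])
        if row["cumulative"] >= threshold:
            break
    return levers


if __name__ == "__main__":
    # Illustrative numbers, not measurements.
    sample = {"converge": 480, "backup": 240, "image_pull": 60, "teardown": 20}
    rows = pareto(sample)
    for row in rows:
        print(f'{row["phase"]:<12} {row["seconds"]:>5}s  {row["cumulative"]:.1%}')
    print("optimize first:", top_levers(rows))
```

Running the same ranking separately for cold and warm samples keeps the two profiles from masking each other.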
4. Ideas to try (hypotheses — validate empirically, don't assume)
Grouped by where time likely goes. Each is a hypothesis to measure, not a guaranteed win.
A. Image pulls (often the cold-cache dominator).
- Stand up a local Docker registry pull-through cache / mirror on cc-ci (or registry-mirrors) so recipe images aren't re-downloaded across runs.
- Pre-pull/warm the image set for enrolled recipes (a warm-images step / on enroll), so the first real run isn't paying the cold pull.
- Ensure pinned tags (no :latest re-pulls); rely on the node's layer cache (don't prune images the active recipes need; reconcile with Phase-1's autoPrune).
B. Service convergence / readiness (often the warm-cache dominator).
- Replace any fixed sleeps with tight readiness polling against real health endpoints (short interval, sensible cap); over-waiting is pure waste.
- Per-recipe readiness probes tuned to the app (e.g. keycloak /realms/master, DB pg_isready) instead of a generic HTTP wait.
- Parallelize independent readiness checks within a run.
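Replacing a fixed sleep with capped polling could look like the sketch below. It assumes the harness can express each readiness check as a zero-argument callable. `wait_until_ready` and `http_ok` are hypothetical names, and the default timeout and interval are placeholders to be tuned from measurements.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout=300.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout` seconds have passed.

    Returns the seconds waited; raises TimeoutError once the cap is reached.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # Assumption: a probe that raises means "not ready yet".
            pass
        if clock() - start >= timeout:
            raise TimeoutError(f"not ready after {timeout}s")
        sleep(interval)


def http_ok(url, timeout=2.0):
    """Build a probe that is true when `url` answers with a 2xx status."""
    def probe():
        # urlopen raises on connection errors and on 4xx/5xx responses;
        # wait_until_ready treats both as "not ready yet".
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    return probe
```

A per-recipe probe would then be passed in by the caller, for example `wait_until_ready(http_ok(base_url + "/realms/master"))` for keycloak, where `base_url` is whatever the harness already knows for the deployed app. The returned wait time can be recorded as the convergence span.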
C. Redundant deploy cycles.
- A run currently deploys multiple times (install; upgrade = old→new; backup = deploy→wipe→restore→redeploy). Share one deployment where safe: run install + functional + backup-restore against a single deploy; only the upgrade stage needs a separate prior-version deploy. Measure the saving vs any isolation cost.
- Scope backups to the minimal data volumes (restic over only what matters) to cut backup/restore time.
D. Warm / shared dependency infra (biggest lever for SSO recipes — but mind isolation).
- Deploying an SSO provider (keycloak/authentik) per run is expensive. Consider a long-lived warm provider that recipe tests register a per-run realm/client against, instead of a fresh deploy each run. Tradeoff: shared state risks cross-run interference — only adopt if per-run isolation (unique realm/client/users, cleaned up) is provably maintained; the Adversary must verify no contamination. If isolation can't be guaranteed, keep per-run deploys.
- Keep traefik/the proxy warm (already persistent in Phase 1).
E. Runner / build caching.
- Persistent nix store + warm flake eval on the runner (don't re-evaluate/re-fetch per build).
- Cache test-dependency installs (pip/uv wheels, Playwright browser binaries) in a persistent volume or Drone cache, so each build doesn't refetch.
F. Concurrency, sized per recipe.
- Tune MAX_TESTS/DRONE_RUNNER_CAPACITY empirically: light recipes can run concurrently while heavy ones serialize. Consider a per-recipe weight/size so the scheduler packs the node without overcommitting RAM/CPU (now 6GB / 2 vCPU). Parallelize independent functional tests within a run.
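The per-recipe weight idea can be illustrated with a simple admission check. This is only a sketch of the packing logic: the weights and capacity below are hypothetical placeholders (real values would come from the measured RAM/CPU footprint), and how such a check would be wired into Drone or the harness is left open.

```python
def admit(queue, running_weight, capacity, weights, default_weight=1):
    """Pick queued recipes, in queue order, that fit the remaining capacity.

    A recipe that does not fit is skipped rather than blocking lighter ones
    behind it.
    """
    admitted = []
    used = running_weight
    for recipe in queue:
        weight = weights.get(recipe, default_weight)
        if used + weight <= capacity:
            admitted.append(recipe)
            used += weight
    return admitted


if __name__ == "__main__":
    # Hypothetical weights: a heavy recipe takes the whole node.
    weights = {"keycloak": 4, "lasuite-docs": 4, "n8n": 2, "custom-html": 1}
    print(admit(["n8n", "custom-html", "keycloak"], 0, 4, weights))
    print(admit(["keycloak", "custom-html", "n8n"], 0, 4, weights))
```

Skipping ahead keeps the node busy but can starve heavy recipes under a steady stream of light ones, so an adopted version would likely need an ageing or priority rule.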
G. Resources.
- Right-size the VM: RAM (now 6GB), vCPU (currently 2; more cores speed parallel pulls/builds/JVM), disk I/O. Measure whether CPU or RAM is the bottleneck for heavy recipes before bumping.
H. abra/secret overhead.
- Profile abra app secret generate and abra app new; avoid regenerating/re-inserting secrets redundantly across stages (reuse the per-run secret store from Phase-1 §4.4-B).
(Validate each on the baseline recipes; keep only measured winners. The list is a starting menu, not a mandate.)
5. Milestones (each ends with an Adversary gate)
- V0: Instrument + baseline. Per-phase timing in results.json; baseline for the representative set, cold & warm, in docs/perf/baseline.md. Accept: Adversary reproduces a baseline run and the timings match reality.
- V1: Attribution. docs/perf/attribution.md ranks the time sinks (cold vs warm) and names the top 2–3 levers. Accept: the attribution is supported by the recorded numbers.
- V2: Quick wins. Land the cheapest high-impact fixes (image cache/pre-pull, readiness-wait tuning, dedup deploys) with before/after numbers. Accept: measured improvement on the baseline, all tests still green.
- V3: Structural wins. Evaluate warm/shared infra, runner caching, concurrency sizing, vCPU; adopt the ones that pay off and preserve isolation. Accept: cumulative improvement vs T2; the Adversary confirms isolation/correctness intact (esp. for any shared-infra change).
- V4: Lock in + document. Re-baseline to confirm the gain; record adopted config + dead-ends + recommendations in docs/perf/. Accept: target from T5 met (or a documented, justified best effort); no regressions; flip Phase-2b STATUS.md to ## DONE.
6. Guardrails (inherit Phase 1 §9 + Phase 2 §7.1)
- Speed never beats correctness. No change may weaken/skip a test, reduce a real assertion, or break isolation/teardown to look faster. Every perf experiment is re-run as a correctness run.
- Shared/warm infra is opt-in and isolation-proven. Only adopt shared dependencies if per-run isolation (unique namespaces, cleanup) is verified by the Adversary; otherwise keep per-run deploys.
- Stay within the node budget. Concurrency/resource changes must respect RAM/disk/CPU limits (Phase-1 MAX_TESTS); don't trade overload for apparent speed.
- Change one variable at a time; cap retries. Attribute gains to specific changes; record dead-ends in DECISIONS.md and stop thrashing.
- Measure honestly. Report medians + variance, cold vs warm; don't cherry-pick a lucky fast run.
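The "measure honestly" guardrail can be made mechanical with a small helper that reports median and spread and refuses to call a change a win unless it clears the observed noise. The decision rule in `is_real_gain` (median drop larger than the noisier side's standard deviation and at least 10% of the baseline) is an assumed heuristic, not a threshold set by this plan. The sample durations are illustrative.

```python
import statistics


def summarize(samples):
    """Median and spread, in seconds, for one (recipe, cache state) sample set."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to report variance")
    return {
        "n": len(samples),
        "median": statistics.median(samples),
        "stdev": round(statistics.stdev(samples), 1),
        "min": min(samples),
        "max": max(samples),
    }


def is_real_gain(before, after, min_ratio=0.1):
    """Assumed keep/discard heuristic for one controlled experiment."""
    b, a = summarize(before), summarize(after)
    delta = b["median"] - a["median"]
    noise = max(b["stdev"], a["stdev"])
    return delta > noise and delta >= min_ratio * b["median"]


if __name__ == "__main__":
    # Illustrative durations in seconds, not measurements.
    baseline = [600, 620, 610]
    with_change = [400, 410, 405]
    print(summarize(baseline))
    print(summarize(with_change))
    print("keep" if is_real_gain(baseline, with_change) else "discard")
```

Comparing whole sample sets, rather than the single best run on each side, is what prevents a lucky fast run from being recorded as a gain.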
7. Open decisions (log in DECISIONS.md)
- The concrete perf target (per-recipe time budgets), derived from the attribution.
- Local registry pull-through cache vs explicit pre-pull (or both).
- Whether to use warm shared SSO providers (speed) or keep per-run providers (isolation) — decided by the measured saving vs the verified isolation cost.
- MAX_TESTS and per-recipe weights; whether to raise vCPU.
- Whether stage deploy-sharing (install+functional+backup on one deploy) is safe per recipe.