autonomic-bot 2d3c17f4bd Add Phase-2b plan: test performance (measure, attribute, improve empirically)

Phase 2b (after Phase 2, before Phase 3): instrument per-phase timings, baseline a
representative recipe set (cold vs warm), attribute where time goes (Pareto), then try
improvements as controlled before/after experiments and keep measured winners — image
pull cache/pre-pull, readiness-wait tuning, dedup deploy cycles, warm/shared infra
(isolation-proven), runner caching, concurrency sizing, vCPU. Speed never weakens tests
or isolation (Adversary re-measures + re-verifies). Phase 3 now follows 2b. Linked in README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 04:26:27 +01:00


cc-ci Phase 3 — Beautiful YunoHost-style results (Autonomous Build Plan)

Status: QUEUED — starts after Phase 2 (plan-phase2-recipe-tests.md) and Phase 2b (plan-phase2b-test-performance.md) reach ## DONE.
Transition: manual (operator kicks it off; check in / test between phases).
Builds on: Phase 1 (plan.md — dashboard dashboard/, the !testme bridge's PR comment, the runner, Playwright in the harness) and Phase 2 (the rich per-recipe test taxonomy → meaningful levels).
Reference style: the YunoHost app-CI result comment, e.g. https://github.com/YunoHost-Apps/lichenmarkdown_ynh/pull/20#issuecomment-3543928229 — see §3.
Owner agents: same Builder + Adversary loops + protocol as Phase 1 (plan.md §6/§7).
This file's path: /srv/cc-ci/cc-ci-plan/plan-phase3-results-ux.md

0. Relationship to earlier phases

Phase 1 gave us a functional results surface (Drone's per-run logs, a basic overview dashboard, and a PR comment with run URL + pass/fail — D7). Phase 2 gave us a rich, layered test taxonomy per recipe (install / upgrade / backup-data-integrity / recipe-specific / SSO-integration / recipe-local).

Phase 3 makes the results beautiful and YunoHost-style: a computed level per run, an image-forward PR comment (status badge + a rendered summary card with an app screenshot), and a polished overview dashboard comparable to ci-apps.yunohost.org. It is presentation + level-scoring on top of existing data — it must not change what the tests assert.

Do not start until Phase 2 STATUS.md shows ## DONE (Adversary-verified) and Phase 2b has likewise reached ## DONE. Same loop protocol.

1. Mission

Turn cc-ci results into something a maintainer is happy to see on a PR and on a status page:

a level (0–N) summarizing how far up the quality ladder the recipe got,
an image-forward PR comment like YunoHost's (🌻 + status badge + summary image that includes a screenshot of the actually-deployed app), linking to the full run,
a dashboard that looks and feels like the YunoHost app list (per-recipe level badges, latest status, screenshots, history).

2. Definition of Done (Phase 3 exit condition)

Terminates only when every item holds and the Adversary has independently re-verified each within the preceding 24h (logged in REVIEW.md):

R1 — Level ladder. A documented level ladder (§4.1) maps which test sets passed → a single integer level, computed per run. Missing a lower rung caps the level (YunoHost semantics).
R2 — Image-forward PR comment. On a !testme run, the bridge posts/updates a Gitea PR comment in the YunoHost shape: a marker (🌻), a status/level badge, and a summary image, with the badge and the image each linking to the full run/dashboard. Re-running updates the same comment.
R3 — Summary card image. Each run renders a PNG summary card showing: recipe + version, the level, a per-stage/per-test ✔/✘ breakdown, and an embedded screenshot of the deployed app. Served at a stable URL; embedded in the comment and the dashboard.
R4 — App screenshot. The runner captures a real screenshot of the deployed app (Playwright, reusing the Phase-1 harness) — post-login where the landing page requires it — for the card.
R5 — Dashboard polish. The overview at ci.commoninternet.net looks/feels like ci-apps.yunohost.org: a table/grid of recipes with level badge, latest pass/fail, last tested version, app screenshot/thumbnail, and a link to history. Regenerated on completion.
R6 — Badges. A per-recipe level/status badge endpoint (SVG) embeddable in recipe READMEs and the dashboard.
R7 — Safe & robust. No secrets in images, comments, badges, or screenshots (reuse Phase-1 §4.4 redaction; the screenshot step must not capture secret values — e.g. don't shoot pages showing generated admin passwords). Image/screenshot generation never blocks or fails the pipeline: on error it falls back to a text comment + records the failure, and the test verdict is unaffected.
R8 — Docs. docs/ explains the level ladder, how the card/screenshot/badge are generated, and how to embed a badge.

When R1–R8 hold and are Adversary-verified, write ## DONE to Phase-3 STATUS.md.

3. Reference: the YunoHost comment style

The linked YunoHost CI comment is deliberately minimal and visual (verified by fetching it):

A header marker (🌻).
A shield-style test badge linking to the CI job (ci-apps.yunohost.org/ci/job/<id>).
A summary image (PNG) — a rendered card with the result/level — also linking to the job.
No verbose inline table; the per-test breakdown + level live inside the rendered image and on the dashboard. Users click through for full logs.

Mirror this shape for cc-ci (Gitea renders markdown images in comments): marker + badge + summary PNG, both linking to the cc-ci run/dashboard. YunoHost also shows a screenshot of the app — we do the same in the card.

4. Design

4.1 The level ladder (proposed default — finalize in `DECISIONS.md`)

A single integer; each rung requires all lower rungs (a gap caps the level, like YunoHost):

L0 — install failed / app never became healthy.
L1 — Installs: deploys and passes health/readiness.
L2 — Upgrades: previous published version → PR version, stays healthy, data intact.
L3 — Backup/restore: seeded data survives backup → wipe → restore (real data-integrity, P4).
L4 — Functional: the recipe-specific functional tests pass (Phase-2 parity + ≥2 specific).
L5 — Integration: SSO/OIDC and cross-app integration tests pass (for recipes that have them; recipes with no integration surface cap at L4 by definition — record this so the level is fair).
L6 — Recipe-local: the recipe repo's own tests/ (D4) pass and are merged.

(Also surface, as badges/flags rather than levels: clean-teardown ✔, no-secret-leak ✔ — these are gating invariants from Phase 1, shown but not part of the climb.)
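
The capping rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plan's implementation: the rung names in `LADDER` and the `compute_level` function are hypothetical, and treating "not applicable" the same as "not passed" reflects only the proposed cap-at-L4 default, which §7 still lists as an open decision.

```python
from typing import Mapping, Optional

# Rungs in climbing order. These names are illustrative stand-ins for the
# Phase-2 test sets, not identifiers taken from the cc-ci codebase.
LADDER = [
    "install",         # L1
    "upgrade",         # L2
    "backup_restore",  # L3
    "functional",      # L4
    "integration",     # L5
    "recipe_local",    # L6
]


def compute_level(passed: Mapping[str, Optional[bool]]) -> int:
    """Return the highest rung reached with no gap below it.

    passed[name] is True (passed), False (failed), or None / missing
    (not run or not applicable). Anything other than True stops the
    climb, so a recipe with no integration surface caps at L4.
    """
    level = 0
    for rung, name in enumerate(LADDER, start=1):
        if passed.get(name) is not True:
            break
        level = rung
    return level


if __name__ == "__main__":
    # Fails the upgrade rung: later passes do not count, level stays at 1.
    print(compute_level({"install": True, "upgrade": False, "backup_restore": True}))
```

Because the loop stops at the first rung that is not a clean pass, a higher rung can never compensate for a lower failure, which is the property the Adversary would need to re-check against raw results.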

4.2 Data flow

run_recipe_ci.py emits a structured results.json per run
  { recipe, version, pr, stages:[{name,status,tests:[{name,status,ms}]}], level, screenshot.png }
        │
        ├─► summary-card renderer: HTML template (recipe, level badge, ✔/✘ table, app screenshot)
        │      → render to PNG (Playwright screenshot of the HTML, reusing the harness browser)
        │      → publish at ci.commoninternet.net/runs/<id>/summary.png  (+ badge.svg)
        │
        ├─► bridge updates the Gitea PR comment: 🌻 + [badge] + [summary.png], linking to the run
        │
        └─► dashboard generator: overview grid (per-recipe level badge, screenshot, last status,
               version, history) regenerated on build-completion → ci.commoninternet.net
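
One way to pin down the results.json shape from the diagram is a small set of dataclasses. The field names follow the diagram; the class names, the `"pass"`/`"fail"` status vocabulary, and the sample values are assumptions made only for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CaseResult:
    name: str
    status: str  # "pass" or "fail" (vocabulary assumed for this sketch)
    ms: int      # duration in milliseconds


@dataclass
class Stage:
    name: str
    status: str
    tests: List[CaseResult] = field(default_factory=list)


@dataclass
class RunResults:
    recipe: str
    version: str
    pr: int
    stages: List[Stage]
    level: int
    screenshot: str = "screenshot.png"


def to_json(results: RunResults) -> str:
    """Serialise deterministically so renderers and reviewers see stable output."""
    return json.dumps(asdict(results), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical sample run, for illustration only.
    run = RunResults(
        recipe="example-recipe",
        version="1.0.0",
        pr=1,
        stages=[Stage("install", "pass", [CaseResult("health", "pass", 1200)])],
        level=1,
    )
    print(to_json(run))
```

Keeping the card renderer, the bridge, and the dashboard generator as pure consumers of this one file makes it straightforward to compare any rendered artifact against the raw data.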

Summary image: render an HTML results card → PNG via Playwright (already in the harness — no new heavy dep). Keep a deterministic template; embed the app screenshot.
App screenshot: Playwright navigates the live <recipe>-pr<n>-<sha>.ci.commoninternet.net (logging in via the test user where needed) and screenshots the main view — captured during the run while the app is up, before teardown.
Badges: generate SVG (shields-style) per run + a per-recipe latest-level badge endpoint.
Hosting: the dashboard service (Phase-1 dashboard/) serves /runs/<id>/... and /badge/...; Gitea comments embed them by URL.
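
For the self-rendered option, a shields-style badge is little more than two rectangles and two text labels. The sketch below is an assumption-laden illustration: `badge_svg`, `level_badge_svg`, the `cc-ci` label, the colours, and the rough 7-pixels-per-character width estimate are all hypothetical choices, not part of the plan.

```python
from html import escape


def badge_svg(label: str, value: str, color: str) -> str:
    """Render a flat two-part badge; text widths are a crude estimate."""
    lw = 7 * len(label) + 10  # left (label) box width
    vw = 7 * len(value) + 10  # right (value) box width
    label_x = lw / 2
    value_x = lw + vw / 2
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{lw + vw}" height="20" '
        f'role="img" aria-label="{escape(label)}: {escape(value)}">'
        f'<rect width="{lw}" height="20" fill="#555"/>'
        f'<rect x="{lw}" width="{vw}" height="20" fill="{escape(color)}"/>'
        f'<g fill="#fff" font-family="Verdana,sans-serif" font-size="11" '
        f'text-anchor="middle">'
        f'<text x="{label_x}" y="14">{escape(label)}</text>'
        f'<text x="{value_x}" y="14">{escape(value)}</text>'
        f'</g></svg>'
    )


def level_badge_svg(level: int) -> str:
    """Red for level 0 (install failed), green otherwise (colours assumed)."""
    color = "#e05d44" if level == 0 else "#4c1"
    return badge_svg("cc-ci", f"level {level}", color)


if __name__ == "__main__":
    print(level_badge_svg(3))
```

Escaping every interpolated value keeps a recipe name or version string from breaking the SVG, which matters once badges are embedded in READMEs outside the dashboard.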

4.3 PR comment (YunoHost-shaped)

On run start: a placeholder comment ("⏳ testing … level pending", link to live logs). On completion: update the same comment to 🌻 + level/status badge + summary card image, linking to the run and the dashboard. One comment per PR, updated in place; re-!testme refreshes it.
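
The comment body and the update-in-place lookup might look like the following. This is a sketch only: the hidden `MARKER` string, the function names, and the assumption that the bridge has already fetched the PR's comments as a list of dicts with `id` and `body` keys are all hypothetical, and whether Gitea preserves an HTML comment in a stored comment body would need checking.

```python
from typing import Iterable, Mapping, Optional

# Hidden tag used to recognise the bridge's own comment (an assumption of
# this sketch; a stored comment id would work equally well).
MARKER = "<!-- cc-ci-result -->"


def build_placeholder(logs_url: str) -> str:
    """Body posted when the run starts."""
    return f"{MARKER}\n⏳ testing … level pending ([live logs]({logs_url}))"


def build_comment(run_url: str, badge_url: str, card_url: str) -> str:
    """Body posted on completion: marker, badge, and card, each linking to the run."""
    return "\n".join(
        [
            MARKER,
            "🌻",
            f"[![cc-ci level]({badge_url})]({run_url})",
            f"[![cc-ci summary]({card_url})]({run_url})",
        ]
    )


def find_own_comment(comments: Iterable[Mapping]) -> Optional[int]:
    """Return the id of the existing cc-ci comment, or None if there is none."""
    for comment in comments:
        if MARKER in comment.get("body", ""):
            return comment["id"]
    return None


if __name__ == "__main__":
    base = "https://ci.commoninternet.net/runs/123"  # hypothetical run id
    print(build_comment(base, f"{base}/badge.svg", f"{base}/summary.png"))
```

If `find_own_comment` returns an id, the bridge edits that comment; otherwise it creates a new one. That keeps a single comment per PR across repeated !testme runs.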

5. Milestones (each ends with an Adversary gate)

U0 — Results schema + level. run_recipe_ci.py emits results.json (per-stage/per-test) and computes the level (§4.1). Accept: level is correct for a recipe that passes through L4 and one that fails at L2 (capped at L1).
U1 — App screenshot. Harness captures a real screenshot of the deployed app (post-login where needed), secret-safe. Accept: screenshot of a sample recipe shows the working UI, no secrets.
U2 — Summary card + badge. Render the HTML card → PNG (level, ✔/✘ table, screenshot) + SVG badge, served at stable URLs. Accept: card + badge render correctly for pass and fail runs.
U3 — YunoHost-style PR comment. Bridge posts/updates the image-forward comment (marker + badge + card, linked). Accept: live on a scratch PR — comment shows badge + card + screenshot, updates on re-run, contains no secrets.
U4 — Dashboard polish. Overview grid with per-recipe level badges, screenshots, status, version, history — comparable look/feel to ci-apps.yunohost.org. Accept: matches reality across several runs; Adversary confirms it mirrors the underlying results.
U5 — Badges + docs + hardening. Embeddable per-recipe badges; docs for the ladder + embedding; fallback-to-text on render failure; secret-scan over images/screenshots/comments. Accept: Adversary's leak scan over published images/comments finds nothing; killing the renderer degrades gracefully to text without affecting the verdict; flip Phase-3 STATUS.md to ## DONE.
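
For the text-based artifacts (comment bodies, badge SVG source, the card's HTML before it is rendered), a leak scan can be as simple as a verbatim search for the run's known secret values. The sketch assumes the scanner is handed those values as a list; `find_leaks` and the `min_len` threshold are hypothetical, and rendered PNGs or screenshots are not covered by this approach.

```python
from typing import Iterable, List


def find_leaks(artifact_text: str, secrets: Iterable[str], min_len: int = 6) -> List[str]:
    """Return the secrets that appear verbatim in artifact_text.

    Values shorter than min_len are skipped to avoid matching on
    trivially common substrings.
    """
    return [s for s in secrets if len(s) >= min_len and s in artifact_text]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    html = "<p>admin password: hunter2-generated</p>"
    print(find_leaks(html, ["hunter2-generated", "abc"]))
```

An empty result is necessary but not sufficient for R7: it says nothing about encoded or partially displayed values, so it complements rather than replaces the Phase-1 redaction.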

6. Guardrails (inherit Phase 1 §9 + Phase 2 §7.1)

Presentation never changes the verdict. The level and card report test outcomes; they must not let a run look greener than its tests. The Adversary checks the rendered level/card against the raw results.json and the actual test outcomes — a card that overstates the result is a FAIL.
No secrets in any artifact (R7) — comments, badges, summary cards, app screenshots. The screenshot step must avoid pages that display generated credentials.
Never block the pipeline on cosmetics — image/screenshot/badge generation failures degrade to a text comment and a recorded warning; they never fail or hang a test run (respect Phase-1 timeouts).
Don't weaken tests to raise a level (carry-over of the cardinal rule) — the Adversary watches for tests softened or levels mis-mapped to inflate the displayed quality.
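
The "never block on cosmetics" guardrail amounts to wrapping every presentation step so that a failure turns into a text fallback plus a recorded warning. A minimal sketch, assuming the renderer is passed in as a callable that returns the image-forward comment body; `comment_with_fallback` and the logger name are hypothetical, and a timeout around the renderer (needed to cover hangs) is deliberately left out.

```python
import logging
from typing import Callable, Tuple

log = logging.getLogger("cc-ci.presentation")


def comment_with_fallback(
    render_comment: Callable[[], str], text_summary: str
) -> Tuple[str, bool]:
    """Return (comment_body, degraded).

    Any rendering error is logged and replaced by the plain-text summary.
    The test verdict is computed elsewhere and is never touched here.
    """
    try:
        return render_comment(), False
    except Exception:
        log.warning("card rendering failed; falling back to text", exc_info=True)
        return text_summary, True


if __name__ == "__main__":
    def broken_renderer() -> str:
        raise RuntimeError("browser crashed")

    print(comment_with_fallback(broken_renderer, "cc-ci: level 1 (text fallback)"))
```

Returning the `degraded` flag lets the caller record the failure alongside the run without feeding it back into the pass/fail result.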

7. Open decisions (log in DECISIONS.md)

Exact level ladder + how recipes without an integration/SSO surface are scored fairly (cap vs N/A).
Summary-card rendering: HTML→Playwright-PNG (default, reuses the harness) vs a dedicated image lib.
Where app screenshots are hosted/retained and for how long (retention/cleanup, like run logs).
Badge implementation: self-rendered SVG vs a shields.io endpoint pattern.
Whether to also post a compact markdown fallback table beneath the image for accessibility.
