BACKLOG-shot.md — phase shot (recipe screenshot audit & repair)

SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-shot-screenshots.md. Gates: M1 (audit+diagnosis), M2 (all OK / agreed N/A).

Build backlog

P1 — Audit matrix (status: complete, all 19 PNGs visually inspected 2026-06-11)

Enrolled set (19) = tests/<r>/recipe_meta.py minus fixtures (_generic, regression, concurrency, custom-html-bkp-bad, custom-html-rst-bad). Evidence: /var/lib/cc-ci-runs/<run>/ on cc-ci; PNGs pulled to /tmp/shot-audit/ on the builder host and each one Read (visually).

| recipe | latest run w/ artifacts | screenshot field | PNG bytes | visual content (I looked) | class |
|---|---|---|---|---|---|
| bluesky-pds | ab-bluesky-pds-oldmain | null | | no PNG; install=fail level=0 (upstream image breakage, rcust DEFERRED) → capture correctly skipped (if deploy_ok) | N-A-candidate (blocked upstream) |
| cryptpad | m2r-cryptpad | screenshot.png | 4802 | solid light-grey frame, nothing else | BLANK |
| custom-html | m2r-custom-html | screenshot.png | 35707 | "Welcome to nginx!" default page | OK? (diagnose: is this the recipe's true fresh-install content?) |
| custom-html-tiny | m2r-custom-html-tiny | screenshot.png | 12950 | seeded CI content ("cc-ci custom-html-tiny … DG5") | OK |
| discourse | m2p-discourse | screenshot.png | 66121 | real forum UI, welcome topic, Sign Up/Log In | OK |
| ghost | m2r-ghost | screenshot.png | 444183 | real blog landing ("Thoughts, stories and ideas") | OK |
| hedgedoc | m2r-hedgedoc | screenshot.png | 131967 | real landing (logo, Sign In, feature intro) | OK |
| immich | 356 | screenshot.png | 4801 | pure white frame | BLANK |
| keycloak | m2r-keycloak | screenshot.png | 8764 | spinner + "Loading the Administration Console" | LOADING |
| lasuite-docs | m2r-lasuite-docs | screenshot.png | 6022 | lone spinner on white | LOADING |
| lasuite-drive | m2p2-lasuite-drive | screenshot.png | 5895 | lone spinner on white | LOADING |
| lasuite-meet | m2r-lasuite-meet | screenshot.png | 4801 | pure white frame | BLANK |
| mailu | m2r-mailu | screenshot.png | 33800 | real sign-in page (empty fields) | OK |
| matrix-synapse | m2r-matrix-synapse | screenshot.png | 33296 | "It works! Synapse is running" landing | OK |
| mattermost-lts | m2b-mattermost-lts | screenshot.png | 242139 | brand splash/loading screen (logo on blue), NOT the login form | LOADING (borderline: brand-recognizable but a loading state) |
| mumble | m2r-mumble | screenshot.png | 7913 | spinner on grey; a web page IS served on the domain | LOADING (diagnose what serves it; N/A may NOT be justified) |
| n8n | m2r-n8n | screenshot.png | 4801 | off-white blank frame. Flaky: run 197 (30256 B) shows the real "Set up owner account" form (empty fields, credential-free) | BLANK (flaky) |
| plausible | 357 | null | | no PNG on ANY run (122→357) | NULL |
| uptime-kuma | m2r-uptime-kuma | screenshot.png | 30858 | real "Create your admin account" setup form (empty fields) | OK |

PNG-size note: 4801/4802 B at 1280×800 is a byte-stable blank-frame fingerprint (3 different apps, same size).

P2 — Root-cause diagnoses

  • NULL (plausible). Evidence, from the Drone build 357 ci-step log at t=73s: screenshot: capture failed (non-fatal, verdict unaffected): page.goto(https://plau-b51425.ci.commoninternet.net/) never returned a status in (200, 301, 302, 303, 401, 403) after 15 attempts (45s); last status=500. Plausible's root path / returns 500 by design under DISABLE_AUTH=true (auth_controller; documented in the tests/plausible/functional/test_health_check.py docstring and in recipe_meta, which is why HEALTH_PATH is /api/health). The default landing-page capture can therefore NEVER succeed; it needs a per-recipe SCREENSHOT hook pointing at a path that actually renders (probe live, e.g. /login or /sites).
  • NULL (bluesky-pds): install fails (level=0) before the app is up, so the "if deploy_ok:" gate at runner/run_recipe_ci.py:1024 correctly skips capture. Not a screenshot defect; the upstream image breakage is already filed in machine-docs/DEFERRED.md (rcust). → documented N/A while upstream is broken.
  • BLANK class (immich, lasuite-meet, n8n (flaky), cryptpad): SPA paint race. capture() navigates with wait_until="domcontentloaded" (runner/harness/screenshot.py:91) and screenshots immediately; the SPA shell HTML has loaded but the JS has not painted yet, which yields the solid 4801/4802 B frame. The n8n flakiness is the same race, which the JS sometimes wins (run 197 captured the real form).
  • LOADING class (keycloak, lasuite-docs, lasuite-drive, mumble, mattermost-lts (borderline)): same race, caught mid-paint (spinner/splash rendered, app JS still loading/connecting).
  • mumble web stack identified: recipe deploys a web service (mumble-web client) on the domain — spinner is its connecting state; landing renders a connect dialog once JS settles. NOT an N/A.
  • custom-html nginx-welcome question: the recipe's fresh install genuinely serves the nginx default page at / (no content seeded for this recipe's install; only custom-html-tiny seeds via install_steps.sh). Screenshot is an honest representative view of a fresh install. → OK as-is.
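To make the plausible diagnosis concrete, here is a minimal sketch of a status-polling loop shaped like the one the logged failure message describes. It is not the harness code: the function name, the `fetch_status` callable, and the 3 s interval (inferred from "15 attempts (45s)") are assumptions for illustration. Only the accepted status tuple and the attempt count come from the log line above.

```python
import time
from typing import Callable, Tuple

# Accepted statuses, copied from the logged failure message.
ACCEPTED_STATUSES = (200, 301, 302, 303, 401, 403)


def poll_for_accepted_status(
    fetch_status: Callable[[], int],  # hypothetical: returns the HTTP status of one navigation
    attempts: int = 15,               # from the log: "after 15 attempts"
    interval_s: float = 3.0,          # assumption: roughly 45 s spread over 15 attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Return (accepted, last_status) after at most `attempts` tries."""
    last_status = -1
    for attempt in range(attempts):
        last_status = fetch_status()
        if last_status in ACCEPTED_STATUSES:
            return True, last_status
        if attempt < attempts - 1:
            sleep(interval_s)  # wait between attempts, not after the final one
    return False, last_status
```

With a path that always answers 500, as / does for plausible under DISABLE_AUTH=true, this loop exhausts every attempt and reports failure no matter how long it waits, which is why a different capture path is needed rather than a longer timeout.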

P3 — Fixes

  • Harness default improvement (fixes BLANK+LOADING classes): after domcontentloaded nav, bounded network-idle/paint wait + blank-frame detect (tiny PNG → one retry with stronger wait), all within NAV_DEADLINE_S=45 / step worst-case ≤ ~60s. Unit tests in tests/unit/test_screenshot.py.
  • plausible SCREENSHOT hook (tests/plausible/recipe_meta.py) to a rendering, credential-free path.
  • Re-audit mattermost-lts / mumble / keycloak / lasuite-* after harness fix; per-recipe hooks only where the default still can't work.
  • bluesky-pds: document N/A in matrix (Adversary agreement at M1/M2).
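The harness default improvement in the first item above could look roughly like the following sketch. It assumes a Playwright-style `page` object (`goto` and `wait_for_load_state` are real Playwright calls); the function name, the `settle_cap_s` parameter, and the injectable `clock` are hypothetical and not taken from runner/harness/screenshot.py. NAV_DEADLINE_S=45 is the value quoted in the backlog.

```python
import time

NAV_DEADLINE_S = 45  # from the backlog; navigation and settle share this budget


def navigate_and_settle(page, url, deadline_s=NAV_DEADLINE_S,
                        settle_cap_s=10.0, clock=time.monotonic):
    """Navigate, then wait (bounded) for network idle. Returns True if the page settled.

    `settle_cap_s` is an assumed upper bound on the extra wait, not a harness value.
    """
    start = clock()
    # Same navigation as today: return as soon as the shell HTML is parsed.
    page.goto(url, wait_until="domcontentloaded", timeout=deadline_s * 1000)

    # Spend only what is left of the deadline, capped, on the paint/settle wait.
    remaining = deadline_s - (clock() - start)
    budget_s = max(0.0, min(settle_cap_s, remaining))
    if budget_s <= 0:
        return False  # a Playwright timeout of 0 means "no timeout", so skip the wait instead
    try:
        page.wait_for_load_state("networkidle", timeout=budget_s * 1000)
        return True
    except Exception:
        # Not settling is non-fatal; the caller still takes the screenshot.
        return False
```

The settle wait is best-effort by design: apps that hold a websocket or poll continuously may never reach network idle, so a timeout there falls through to the screenshot rather than failing the step.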

P4 — Proof runs

  • Fresh real-CI run per fixed recipe (immich, lasuite-meet, n8n, cryptpad, keycloak, lasuite-docs, lasuite-drive, mumble, mattermost-lts, plausible), ≥2 via drone !testme; visual check each PNG; card + dashboard render. Healthy class: cite existing artifact + visual check (done in P1).

Adversary findings

[adversary] A1 — blank-retry can REGRESS a larger frame to a worse one (LOW, non-blocking) — CLOSED @2026-06-11T06:32Z

CLOSED: fixed in 7ad7d1f (the retry is snapped to a temp path; os.replace runs only if retry >= first, otherwise the retry is discarded, with cleanup in finally). Re-verified COLD with my own probe (not the Builder's test): the exact filed case [9999,4801] now keeps 9999 (retry discarded, no temp leak); the original cases are intact ([4801,30256]→30256, [4801,4802]→4802, [35707]→1 shot, [5000,5000]→replace). 5/5 pass. R7 contract preserved (a raise during the retry still propagates to capture's swallow → None; the first frame stays on disk).

--- original finding (for the record) ---

Where: runner/harness/screenshot.py _snap_with_blank_retry (ce50f64).

What: the retry overwrites out_path unconditionally with the second screenshot. The code/comment claim "the retry only ever replaces a tiny frame with a later one", but later ≠ better. If the first frame is e.g. 9999 B (a partial render, just under BLANK_SIZE_BYTES=10000) and the page regresses during the extra 4 s settle (redirect, session-timeout splash, error overlay), the retry can yield a 4801 B blank that overwrites the better 9999 B frame. The Builder's unit test only covers blank→blank (4801→4802); the bigger→smaller regression is untested.

Repro (cold, my independent probe, not the Builder's test file): fake page returning sizes [9999, 4801] → _snap_with_blank_retry keeps 4801 (the worse frame).

Severity: LOW. R7 holds (cosmetic only, never affects verdict); my M2 per-PNG visual check is the backstop, since any actually-blank final PNG will FAIL that recipe regardless. Filed for hardening, not a veto.

Suggested guard (trivial, strictly safer): keep the larger frame, i.e. only overwrite if getsize(retry) >= getsize(first) (or snap the retry to a temp path and pick the max). Then extend the unit test with a bigger→smaller case asserting that the larger frame survives.

Closes: only I close this, after re-test. Non-blocking for an M2 claim, but I will re-check at M2.
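The guard described in this finding (retry to a temp path, replace only if the retry is at least as large, clean up in finally) can be sketched as follows. This is an illustration of the technique, not the code from 7ad7d1f: the `snap` and `settle` callables and the return value are assumptions, while BLANK_SIZE_BYTES=10000 is the value quoted above.

```python
import os
import tempfile

BLANK_SIZE_BYTES = 10_000  # value quoted in the finding


def snap_with_blank_retry(snap, out_path, settle=lambda: None):
    """Take a screenshot; if it looks blank, retry once and keep the larger frame.

    `snap(path)` is a hypothetical callable that writes a PNG to `path`;
    `settle()` stands in for the extra wait before the retry.
    Returns the number of shots taken (1 or 2).
    """
    snap(out_path)
    first_size = os.path.getsize(out_path)
    if first_size >= BLANK_SIZE_BYTES:
        return 1  # frame is large enough, no retry

    settle()
    out_dir = os.path.dirname(out_path) or "."
    fd, tmp_path = tempfile.mkstemp(suffix=".png", dir=out_dir)
    os.close(fd)
    try:
        # If this raises, the first frame is still on disk and the error propagates.
        snap(tmp_path)
        if os.path.getsize(tmp_path) >= first_size:
            os.replace(tmp_path, out_path)  # atomic swap; tmp_path no longer exists
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # discard a smaller (worse) retry, leave no temp file
    return 2
```

Creating the temp file in the same directory as out_path keeps os.replace on one filesystem, so the swap stays atomic.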