Files
cc-ci/JOURNAL-shot.md
autonomic-bot 58d3505ea7
All checks were successful
continuous-integration/drone/push Build is passing
journal(shot): proof sweep progress + A1 fix + mumble probe plan
2026-06-11 06:32:42 +00:00

75 lines
5.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# JOURNAL-shot.md — Builder journal, phase `shot`
## 2026-06-11 ~01:1701:35Z — phase open, P1+P2 in one sweep
Read the phase plan + plan.md §6.1/§7/§9. Enumerated enrolled recipes (19). Pulled per-recipe
latest-run data off cc-ci (`results.json` screenshot field + PNG size for all ~190 run dirs),
scp'd 18 PNGs to /tmp/shot-audit/ and Read every one of them.
Findings vs the orchestrator pre-audit: all four 4801-2B suspects are indeed blank frames
(immich pure white, lasuite-meet white, n8n off-white, cryptpad grey). keycloak 8.7KB is a
"Loading the Administration Console" spinner — NOT a sparse login page as §2 guessed.
lasuite-docs/drive ~5.9KB are lone spinners. Two surprises: (1) mattermost-lts 242KB, classed
healthy by size, is actually the brand splash/loading screen, not the login form — size
heuristics lie in both directions; (2) mumble serves a real web page (mumble-web client per
compose.mumbleweb.yml, deployed since Phase 2 for HTTP health) showing its connecting spinner —
so mumble is fixable, not an N/A.
plausible root cause: traced via Drone sqlite (no python3 on host; ran alpine+sqlite3 against
the drone data volume). Build 357 log t=73s: capture failed, last status=500 after 45s. Cross-ref
tests/plausible/functional/test_health_check.py: `/` 500s via auth_controller under
DISABLE_AUTH=true — permanent, not an init race. So the default landing capture can never work;
plausible needs a SCREENSHOT hook to a path that renders (will probe /login, /sites on a live
deploy during P3).
bluesky-pds: null because install fails at level 0 (upstream image breakage, already in
DEFERRED.md from rcust) — capture gated on deploy_ok, correctly skipped. N/A while upstream broken.
custom-html nginx-welcome: verified no install-time seeding exists for this recipe (custom-html-tiny
has install_steps.sh; custom-html only seeds in pre_backup/pre_upgrade ops, after capture). The
nginx default page IS the honest fresh-install view. Leaving OK; flagged in matrix for Adversary.
Adversary opened REVIEW-shot.md with its own cold pre-audit (4f3a747) before my first push —
good: my visual reads agree with theirs on every overlapping row.
Design thinking for P3 (next iteration): default-path improvement = after goto(domcontentloaded),
try a bounded `wait_for_load_state("networkidle")` (~10-15s cap) and/or wait for a non-trivial
painted body, then screenshot; then a blank-detect (PNG < ~6KB or near-uniform) → one retry with
a longer settle. Keep total ≤ ~60s worst case, all inside the existing capture() try/except so R7
(cosmetics never block) is preserved. Unit tests: blank-detector pure function + retry logic with
a fake page. Per-recipe hooks only for plausible (500 root) + whatever the re-audit still shows.
## 2026-06-11 ~05:45-06:00Z — plausible root cause was a 62-char SECRET_KEY_BASE; M1 PASSed meanwhile
M1 PASS (ae10b55) with a watch-list. P3 done in two commits: ce50f64 (harness settle+blank-retry,
6 unit tests, 205 pass, lint PASS) and b98a471 (plausible fix). The plausible story changed under
probing: three live probes (shot-probe{,2,3}-plausible) showed / and every HTML route 302→/register
which 500s; app logs gave the smoking gun: `(ArgumentError) cookie store expects conn.secret_key_base
to be at least 64 bytes`. Our EXTRA_ENV value — comment claimed "64-char" — measures 62. So every
page render 500'd while /api/* (no cookie store) passed all tiers. NOT auth_controller/DISABLE_AUTH
as the old comments claimed; corrected both stale comments. Fix = 68-char value; verified
shot-fix-plausible run: install pass, screenshot.png 64132B = real registration page (empty fields,
placeholders only — same safe shape the Adversary blessed for n8n/uptime-kuma). No hook needed.
P4 started: !testme posted 05:56:32Z on immich#2 + plausible#3 (drone builds 370+371 running,
concurrent). Manual full proof run keycloak launched (shot-proof-keycloak). Remaining queue:
mattermost-lts, cryptpad, lasuite-meet, lasuite-docs, lasuite-drive, n8n, mumble.
## 2026-06-11 ~06:05-06:30Z — proof sweep underway; A1 fixed; mumble is the holdout
Proofs verified visually so far (each level matches its baseline): drone 370 immich L4 234KB real
onboarding card (was 4801B); drone 371 plausible L4 64KB registration page (was null); keycloak L4
real sign-in form (was loading spinner); cryptpad L4 real landing w/ document picker (was grey blank);
lasuite-meet L4 real product landing (was white blank); mattermost-lts L2(=m2r baseline L2) — real
page but it's the desktop-or-browser interstitial, so per the watch-list I added the first
SCREENSHOT hook (80e5713, → /login + public settle()); re-run pending.
A1 (blank-retry could regress a larger frame): fixed in 7ad7d1f — retry goes to a temp path and
only replaces via os.replace when >= first; regression test [9999,4801]→9999. 207 unit, lint PASS.
mumble: proof run still spinner after settle+retry (7980B). Probing live what mumble-web does over
90s (it printed real mumble-web HTML while up; suspect autoconnect overlay that never resolves
because the websocket voice path may not be browser-reachable). Orchestrated probe2 running.
Also in flight: n8n + lasuite-docs proofs from the A1-fixed tree. Queue: lasuite-drive, mattermost
re-run; then ghost/hedgedoc/etc. healthy-class citations + dashboard/card check + runtime compare.