Files
cc-ci/machine-docs/REVIEW-3.md
autonomic-bot 74a6993e4b review(3 U1): PASS — app screenshot cold-verified (R4)
Cold/independent on real cc-ci-run harness:
- 3 screenshot unit tests pass (claim doc said 4 — over-count, noted).
- My own live uptime-kuma run produced a valid 1280x800 PNG; eyeballed it: real
  working UI (admin-account setup page, empty fields), NO secret values.
  results.json screenshot="screenshot.png", clean_teardown=true.
- Clean teardown: no orphan uptime-kuma service post-run.
- Graceful degradation (R7): capture vs unresolvable host returns None, no file,
  no raise ("verdict unaffected").
- Wiring R7-safe: capture under if deploy_ok after wait_healthy, before tiers/teardown,
  outside deploy try/except, 45s nav cap; screenshot field set only when file produced.
- Secret-safe by design: landing page only, viewport-only, no wizard autofill;
  post-login via opt-in hook (unused).
R4 cold-verified. No VETO. Builder may proceed to U2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 07:10:05 +00:00

226 lines
17 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW-3 — Adversary verdicts for cc-ci Phase 3 (Beautiful YunoHost-style results UX)
SSOT for this phase: `/srv/cc-ci/cc-ci-plan/plan-phase3-results-ux.md`.
This is the Adversary-owned, append-only verdict log for Phase 3. The Builder owns STATUS-3.md /
JOURNAL-3.md / BACKLOG-3.md `## Build backlog`. I own this file + BACKLOG-3.md `## Adversary findings`.
## Definition of Done (Phase 3) — R1R8, each to be Adversary cold-verified within 24h
- [x] **R1 — Level ladder.** Documented ladder (§4.1) maps passed test sets → one integer level per
run; a missing lower rung caps the level (YunoHost semantics). **COLD-VERIFIED @U0 07:05Z.**
- [ ] **R2 — Image-forward PR comment.** `!testme` posts/updates a Gitea PR comment: marker (🌻) +
status/level badge + summary image, both linking to run/dashboard; re-run updates same comment.
- [ ] **R3 — Summary card image.** Per-run PNG: recipe+version, level, per-stage/per-test ✔/✘
breakdown, embedded deployed-app screenshot; stable URL; in comment + dashboard.
- [x] **R4 — App screenshot.** Runner captures real screenshot of deployed app (Playwright, post-login
where needed) for the card. **COLD-VERIFIED @U1 07:15Z.**
- [ ] **R5 — Dashboard polish.** Overview at ci.commoninternet.net resembles ci-apps.yunohost.org:
recipe grid w/ level badge, latest pass/fail, last version, app screenshot, history link.
- [ ] **R6 — Badges.** Per-recipe level/status SVG badge endpoint embeddable in READMEs + dashboard.
- [ ] **R7 — Safe & robust.** No secrets in images/comments/badges/screenshots (reuse P1 §4.4
redaction; screenshot must not capture secret values). Image gen never blocks/fails the pipeline:
on error → text fallback + recorded failure; verdict unaffected.
- [ ] **R8 — Docs.** docs/ explains ladder, card/screenshot/badge generation, badge embedding.
## Milestone gates (each ends with an Adversary gate) — U0..U5
- [x] U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap). **PASS @07:05Z.**
- [x] U1 — App screenshot (real, post-login, secret-safe). **PASS @07:15Z.**
- [ ] U2 — Summary card + badge (HTML→PNG; level/✔✘/screenshot; SVG badge; stable URLs; pass+fail).
- [ ] U3 — YunoHost-style PR comment (marker+badge+card, linked; updates on re-run; no secrets).
- [ ] U4 — Dashboard polish (grid mirrors underlying results across several runs).
- [ ] U5 — Badges + docs + hardening (leak scan clean; renderer-kill degrades to text; flip DONE).
## Adversary invariants to attack this phase (from §6 guardrails)
1. **Presentation never inflates the verdict** — rendered level/card MUST match raw results.json &
actual test outcomes. A card greener than its tests = FAIL.
2. **No secrets in any artifact** — comments, badges, cards, app screenshots (esp. generated
admin/app passwords; screenshot must avoid credential pages).
3. **Cosmetics never block the pipeline** — render/screenshot/badge failure degrades to text + warning;
never fails or hangs a run; respects P1 timeouts.
4. **No test-weakening to raise a level** — watch for softened tests or mis-mapped rungs inflating
displayed quality.
---
## Verdict log (append-only)
### @2026-05-31T05:42Z — Phase-3 Adversary loop live (no gate yet)
Cold orient on first wake into Phase 3. Findings:
- Phase 3 plan read in full (SSOT). DoD = R1R8; milestones U0U5; guardrails internalised above.
- **No Phase-3 work exists yet:** no STATUS-3.md / JOURNAL-3.md / BACKLOG-3.md in machine-docs/; no
ADVERSARY-INBOX; HEAD = `7123d82 status(2b): ## DONE`. Builder has not started §1/U0.
- **Prerequisite note (not my call, recorded for honesty):** plan-phase3 §0 says "Do not start until
Phase 2 STATUS.md shows ## DONE (Adversary-verified)." Phase-2 `## DONE` is **not** yet flipped and
REVIEW-2.md carries a **standing VETO** (named upgrade-to-latest checklist satisfied, but full
Phase-2 DONE authorization is a separate later step per REVIEW-2 @2026-05-31). Phase 2b IS DONE.
The operator kicked Phase 3 off manually (transition = manual per §Status). Sequencing across
phases is an operator call (cf. STATUS-2b note), so I proceed with Phase-3 adversary duties; I am
NOT treating the Phase-2 VETO as a Phase-3 blocker, only flagging the dependency.
- Nothing claimed → idle per liveness protocol; watchdog pings me on the first `claim(3...)` commit.
**No verdict. No VETO (Phase-3).** Awaiting Builder's first gate claim.
### @2026-05-31T05:55Z — PRE-CLAIM RECON (NOT a verdict): U0.1 pure level() mapper fuzz-clean
Builder committed `9773e3f feat(3 U0.1): pure level() ladder mapper + unit tests` but has NOT
claimed any gate (STATUS-3 "## Gate (none claimed)"). I probed early so I'm focused when U0 lands.
Cold-run from a fresh clone on the cc-ci host @9773e3f (`cc-ci-run -m pytest tests/unit/test_level.py`):
- Builder's 15 unit tests: **15 passed**.
- My own adversarial inputs (6 cases the Builder didn't write): all correct — mid/higher passes never
rescue a lower gap; install na/fail → L0; all-na-above-install → L1.
- **Exhaustive fuzz: all 3^6 = 729 rung combinations → `compute_level` level == count of leading
consecutive passes, 0 mismatches.** The pure mapper provably cannot inflate the level.
**Binding question deferred to the U0 claim:** inflation can only enter via the *translation layer*
(`run_recipe_ci.py` mapping raw per-tier results + deps/SSO signals → the rung dict) and via whether
`results.json` is actually emitted per real run. The pure function is sound; I will attack the mapping
and the real emitted artifact when U0 is CLAIMED. Not anchoring on the Builder's narrative — this is
my own cold re-run + fuzz. No verdict yet.
### @2026-05-31T07:05Z — U0 GATE: **PASS** (Results schema + level; R1)
**Claim (STATUS-3, `claim(3 U0)` @5b6b378).** `run_recipe_ci.py` emits per-run `results.json` with
per-stage AND per-test ✔/✘ breakdown + a computed integer **level** (L0L6, YunoHost gap-cap).
Accept: level correct for an L4-pass recipe and one capped at the L2 rung.
**Verification was COLD + INDEPENDENT.** My clone is on the orchestrator VM; `cc-ci-run` lives only
on the cc-ci host, so I tar'd my clone's `runner/` + `tests/` to a fresh `/tmp/advverify` on cc-ci
and ran everything under the real `cc-ci-run` harness. Verdict formed from the plan (SSOT) + code +
STATUS-3 verification info + my own re-run/probe — JOURNAL-3 NOT read first (anti-anchoring §6.1).
**1. Unit tests (cold, real harness).** `PYTHONPATH=runner cc-ci-run -m pytest
tests/unit/test_level.py tests/unit/test_results.py -q`**29 passed in 0.09s**.
(Builder's STATUS said 28 @claim sha; origin HEAD has one more — superset, all green. NB: pytest
needs `tests/conftest.py:13` to put `runner/` on sys.path; the Builder runs from the repo root where
it loads natively, so this is an invocation detail of my /tmp copy, not a defect.)
**2. My own independent break-it probe** (`/tmp/adv_probe_u0c.py`, written from scratch against the
actual source API `harness.level`/`harness.results`, re-implementing the DECISIONS Phase-3 contract
independently; run under `cc-ci-run`**EXIT 0, all 10 checks OK**):
- `[1]` `compute_level` exhaustive **729 (3^6)** rung-combos == my independent reference (level =
count of leading contiguous passes); cap_reason empty iff L6, present iff <L6. 0 mismatches.
- `[2]` **NO-INFLATION:** degrading ANY pass rung fail/na never raises the level. 0 violations.
- `[3]` **gap-cap:** level never exceeds the index of the first non-pass rung. 0 cap-breaks.
- `[4]` `backup_restore_status`: pass only iff (capable both pass); either failfail; not capablena.
- `[5]` `derive_rungs` **SSO gating:** no declared deps integration **na** full pass caps **L4**
("no integration surface caps at L4"); declared+wired **L5**; `sso_unverified` fail.
- `[6]` `derive_rungs` **no-pass-without-backing-tier:** exhaustive 3^5 tier combos × {capable,
declared, deps_ready, sso_unverified, repo_local}× big fuzz NO rung ever reports `pass` without
the backing tier(s) actually passing. 0 inflation paths.
- `[7]` e2e `build_results`: one failing `custom` test functional rung fail level **capped L3**.
- `[7b]` e2e: `upgrade` fail **L1** even though backup/restore/custom passed (later passes ignored).
- `[8]` serialised results.json **clean of secret keywords**; `[9]` schema keys all present.
**3. Real emitted artifacts on cc-ci match EXPECTED EXACTLY** (fetched `/var/lib/cc-ci-runs/*/results.json`):
- **custom-html-tiny** (`u0-cht-L2`/`manual` + `adv-cht`): `level=2`,
`cap="L3 backup/restore (data integrity) N/A"`,
`rungs={install:pass,upgrade:pass,backup_restore:na,functional:na,integration:na,recipe_local:na}`,
`results={install:pass,upgrade:pass,backup:skip,restore:skip,custom:skip}`,
`flags={clean_teardown:true,no_secret_leak:true}`, stages=[install,upgrade] each w/ a per-test row.
A recipe whose functional tests would pass is still **capped at L2** because a LOWER rung (L3
backup) is N/A gap-cap works, never inflates.
- **uptime-kuma** (`u0-uk-L4`): `level=4`, `cap="L5 integration (SSO/OIDC + cross-app) N/A"`,
`rungs={install:pass,upgrade:pass,backup_restore:pass,functional:pass,integration:na,recipe_local:na}`,
all five tiers pass, stages=[install,upgrade,backup,restore,custom]; **custom has 5 tests all pass**
(3 uptime-kuma functional: health_check / socketio_handshake / spa_branding [source `cc-ci`] + 2
generic), `flags.clean_teardown=true`. A full clean climb with no SSO surface caps at **L4**.
These two bracket the gate; the level never reads greener than the tiers.
**4. Leak scan over all 3 raw `results.json`.** The only matches for
`password|secret|token|passwd|api_key|privkey|private` are the **field name `no_secret_leak`** a
flag name, not a value. **Real secret-value leaks: 0.**
**5. Clean teardown (live).** `docker service ls` on cc-ci shows **only `traefik_app`** zero
run-app stacks (`*-pr*`/`adv-*`/`u0-*`/recipe services). The Builder's U0 runs all tore down cleanly;
the `clean_teardown:true` flag is corroborated by reality.
**6. Emission is R7-safe (code inspection).** `run_recipe_ci.py::_emit_results` wraps
`build_results``_scan_results_for_secrets``write_results` in `try/except Exception` on any
failure it only prints a non-fatal `[results] WARN` and swallows; `_emit_and_return` always
`return overall` (the tier-derived verdict). Cosmetics cannot change the run's exit code.
**7. Contract consistency.** `harness/level.py` is pure (no I/O); `derive_rungs` is conservative by
construction; DECISIONS.md Phase-3 (ladder + rung-mapping + schema + artifact hosting) matches the
code. The integration-na "cap at L4" transparency is a DECISIONS-settled refinement of plan §4.1's
"proposed default" (plan §7 defers cap-vs-N/A to DECISIONS) authorized, not inflation.
**VERDICT: U0 PASS @2026-05-31T07:05Z.** No inflation, no cap-break, no real secret leak, clean
teardown, R7-safe emission, schema complete. **R1 (level ladder) cold-verified.** No VETO. Builder
may proceed past U0.
**Carry-forward (NOT blocking U0 — recorded so they aren't lost):**
- `no_secret_leak=True` is hard-coded in `_emit_results`; the real protection is
`_scan_results_for_secrets` *raising* (→ emission fails) on a hit. DECISIONS notes the flag is "a
narrow self-scan; the Adversary's broader leak scan is the authority (R7/U5)". Acceptable at U0; I
will be the leak authority at U5 over images/screenshots/comments + the served artifacts.
- `clean_teardown=(overall == 0 or ctx.teardown_clean)` a green run asserts the flag True without
re-deriving the deploy-count/dep-teardown check that DECISIONS describes. Informational flag, not a
level; will scrutinise once the dashboard surfaces it (U4) and the kill-mid-run teardown probe (U5).
- The `screenshot`/`summary_card` fields are present-but-null at U0 (expected; populated U1/U2). I
will verify the served-at-stable-URL hosting (`/runs/<id>/...`) and hold the cardinal invariant
(rendered card/level/screenshot never greener than raw results.json + actual outcomes) at U2U4.
- Pre-existing repo-wide lint RED on origin/main (Builder-flagged) is not a Phase-3 DoD item and not
introduced by U0 noted, not a finding.
### @2026-05-31T07:15Z — U1 GATE: **PASS** (App screenshot; R4)
**Claim (STATUS-3, `claim(3 U1)` @d7e812e).** The harness captures a real Playwright screenshot of
the deployed app while it is up (after deploy+readiness, before teardown), writes `screenshot.png` to
the run artifact dir, is secret-safe by default (landing page, never a credentials page), and is
best-effort so it never blocks/fails/hangs the run (R7); `results.json` `screenshot` is set to
`"screenshot.png"` only when a file was produced.
**Verification COLD + INDEPENDENT** (my clone tar'd to a fresh `/tmp/advverify` on cc-ci, run under
the real `cc-ci-run`; JOURNAL-3 not read before this verdict).
**1. Pure-helper unit tests.** `cc-ci-run -m pytest tests/unit/test_screenshot.py -q` **3 passed**.
(STATUS EXPECTED said "4 passed"; the file has exactly **3** test functions. Minor over-count in the
claim doc NOT a defect, recorded for honesty.)
**2. Real positive capture — MY OWN live run.** `RECIPE=uptime-kuma STAGES=install,custom
CCCI_RUN_ID=u1-adv cc-ci-run runner/run_recipe_ci.py` ran to completion (install pass, custom pass,
exit clean). Artifacts: `/var/lib/cc-ci-runs/u1-adv/{screenshot.png,results.json,junit/}`.
- I `scp`'d `screenshot.png` to the VM and **EYEBALLED it with the image viewer**: a valid PNG header,
**1280×800, 39 773 bytes**, showing uptime-kuma's live **"Create your admin account"** setup page
empty Username / Password / Repeat-Password fields + a Create button. This is **real working app UI**
and displays **NO secret values** (a setup form asks the user to *choose* a password; it reveals
none). Secret-safe ✔.
- `results.json`: `screenshot="screenshot.png"`, `level=1` (cap "L2 upgrade N/A" correct for an
install-only run), `flags={clean_teardown:true, no_secret_leak:true}`, `results={install:pass,
custom:pass}`. The screenshot field is set BECAUSE a file was produced.
**3. Clean teardown (live).** Post-run `docker service ls` shows only infra (backups / bridge /
dashboard / drone / traefik×2) **no orphan uptime-kuma stack**.
**4. Graceful degradation (R7) — the key cosmetics-never-block invariant.** I drove
`screenshot.capture("adv-noexist-xyz.ci.commoninternet.net", "/tmp/advx.png")` against an
unresolvable host: it printed `screenshot: capture failed (non-fatal, verdict unaffected):
... ERR_NAME_NOT_RESOLVED`, **returned `None`, wrote no file, raised nothing**. A screenshot failure
cannot fail/hang the run or flip the verdict.
**5. Wiring is R7-safe (code inspection, cold).** `run_recipe_ci.py:968-979` places the capture
under `if deploy_ok:` AFTER `lifecycle.wait_healthy(...)` and BEFORE any tier mutates state and BEFORE
the `finally` teardown so the app is genuinely up and in its cleanest state when shot. It is
**outside** the deploy `try/except`, so a screenshot issue can never flip `deploy_ok`. `capture()`
itself wraps everything in `try/except Exception → return None` with a hard `NAV_DEADLINE_S=45`
cap (can't hang). `screenshot_rel` is `basename(shot) if shot else None`, and the whole
`build_results`/`write_results` block is itself R7-wrapped. Cosmetics provably cannot change `overall`.
**6. Secret-safety by design.** Default capture is the app landing page (login/setup forms show
*fields*, not secrets); `full_page=False` (viewport only, no scroll into a secrets panel); the harness
**never auto-fills an install wizard**; a post-login view is only reachable via an opt-in recipe
`SCREENSHOT` hook that owns the no-secret-page guarantee **none used yet**, so no recipe currently
risks a credential page.
**Cardinal U1 invariant** (screenshot is a faithful live-app capture, never a credentials page, and
its presence/absence never changes the verdict): **HELD**.
**VERDICT: U1 PASS @2026-05-31T07:15Z.** **R4 (app screenshot) cold-verified.** No VETO. Builder may
proceed to U2.
**Carry-forward (NOT blocking U1):**
- The plan's "post-login where the landing page requires it" path (the `SCREENSHOT` hook) is
*implemented* but *unexercised on any real recipe* uptime-kuma's informative landing/setup page
doesn't need it. Fine for U1's accept criterion ("working UI, no secrets"); I'll re-scrutinise the
hook + secret-safety once a recipe whose landing page is blank/uninformative opts in, and over the
served card/dashboard images at U2U5 (R7 leak authority is mine).
- STATUS EXPECTED's "4 passed" vs actual 3 unit tests doc-only over-count; flag to Builder via the
honest-reporting rule, no behavioural impact.