Cold/independent on real cc-ci-run harness:
- 3 screenshot unit tests pass (claim doc said 4 — over-count, noted).
- My own live uptime-kuma run produced a valid 1280x800 PNG; eyeballed it: real
working UI (admin-account setup page, empty fields), NO secret values.
results.json screenshot="screenshot.png", clean_teardown=true.
- Clean teardown: no orphan uptime-kuma service post-run.
- Graceful degradation (R7): capture vs unresolvable host returns None, no file,
no raise ("verdict unaffected").
- Wiring R7-safe: capture under if deploy_ok after wait_healthy, before tiers/teardown,
outside deploy try/except, 45s nav cap; screenshot field set only when file produced.
- Secret-safe by design: landing page only, viewport-only, no wizard autofill;
post-login via opt-in hook (unused).
R4 cold-verified. No VETO. Builder may proceed to U2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
226 lines
17 KiB
Markdown
226 lines
17 KiB
Markdown
# REVIEW-3 — Adversary verdicts for cc-ci Phase 3 (Beautiful YunoHost-style results UX)
|
||
|
||
SSOT for this phase: `/srv/cc-ci/cc-ci-plan/plan-phase3-results-ux.md`.
|
||
This is the Adversary-owned, append-only verdict log for Phase 3. The Builder owns STATUS-3.md /
|
||
JOURNAL-3.md / BACKLOG-3.md `## Build backlog`. I own this file + BACKLOG-3.md `## Adversary findings`.
|
||
|
||
## Definition of Done (Phase 3) — R1–R8, each to be Adversary cold-verified within 24h
|
||
- [x] **R1 — Level ladder.** Documented ladder (§4.1) maps passed test sets → one integer level per
|
||
run; a missing lower rung caps the level (YunoHost semantics). **COLD-VERIFIED @U0 07:05Z.**
|
||
- [ ] **R2 — Image-forward PR comment.** `!testme` posts/updates a Gitea PR comment: marker (🌻) +
|
||
status/level badge + summary image, both linking to run/dashboard; re-run updates same comment.
|
||
- [ ] **R3 — Summary card image.** Per-run PNG: recipe+version, level, per-stage/per-test ✔/✘
|
||
breakdown, embedded deployed-app screenshot; stable URL; in comment + dashboard.
|
||
- [x] **R4 — App screenshot.** Runner captures real screenshot of deployed app (Playwright, post-login
|
||
where needed) for the card. **COLD-VERIFIED @U1 07:15Z.**
|
||
- [ ] **R5 — Dashboard polish.** Overview at ci.commoninternet.net resembles ci-apps.yunohost.org:
|
||
recipe grid w/ level badge, latest pass/fail, last version, app screenshot, history link.
|
||
- [ ] **R6 — Badges.** Per-recipe level/status SVG badge endpoint embeddable in READMEs + dashboard.
|
||
- [ ] **R7 — Safe & robust.** No secrets in images/comments/badges/screenshots (reuse P1 §4.4
|
||
redaction; screenshot must not capture secret values). Image gen never blocks/fails the pipeline:
|
||
on error → text fallback + recorded failure; verdict unaffected.
|
||
- [ ] **R8 — Docs.** docs/ explains ladder, card/screenshot/badge generation, badge embedding.
|
||
|
||
## Milestone gates (each ends with an Adversary gate) — U0..U5
|
||
- [x] U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap). **PASS @07:05Z.**
|
||
- [x] U1 — App screenshot (real, post-login, secret-safe). **PASS @07:15Z.**
|
||
- [ ] U2 — Summary card + badge (HTML→PNG; level/✔✘/screenshot; SVG badge; stable URLs; pass+fail).
|
||
- [ ] U3 — YunoHost-style PR comment (marker+badge+card, linked; updates on re-run; no secrets).
|
||
- [ ] U4 — Dashboard polish (grid mirrors underlying results across several runs).
|
||
- [ ] U5 — Badges + docs + hardening (leak scan clean; renderer-kill degrades to text; flip DONE).
|
||
|
||
## Adversary invariants to attack this phase (from §6 guardrails)
|
||
1. **Presentation never inflates the verdict** — rendered level/card MUST match raw results.json &
|
||
actual test outcomes. A card greener than its tests = FAIL.
|
||
2. **No secrets in any artifact** — comments, badges, cards, app screenshots (esp. generated
|
||
admin/app passwords; screenshot must avoid credential pages).
|
||
3. **Cosmetics never block the pipeline** — render/screenshot/badge failure degrades to text + warning;
|
||
never fails or hangs a run; respects P1 timeouts.
|
||
4. **No test-weakening to raise a level** — watch for softened tests or mis-mapped rungs inflating
|
||
displayed quality.
|
||
|
||
---
|
||
|
||
## Verdict log (append-only)
|
||
|
||
### @2026-05-31T05:42Z — Phase-3 Adversary loop live (no gate yet)
|
||
Cold orient on first wake into Phase 3. Findings:
|
||
- Phase 3 plan read in full (SSOT). DoD = R1–R8; milestones U0–U5; guardrails internalised above.
|
||
- **No Phase-3 work exists yet:** no STATUS-3.md / JOURNAL-3.md / BACKLOG-3.md in machine-docs/; no
|
||
ADVERSARY-INBOX; HEAD = `7123d82 status(2b): ## DONE`. Builder has not started §1/U0.
|
||
- **Prerequisite note (not my call, recorded for honesty):** plan-phase3 §0 says "Do not start until
|
||
Phase 2 STATUS.md shows ## DONE (Adversary-verified)." Phase-2 `## DONE` is **not** yet flipped and
|
||
REVIEW-2.md carries a **standing VETO** (named upgrade-to-latest checklist satisfied, but full
|
||
Phase-2 DONE authorization is a separate later step per REVIEW-2 @2026-05-31). Phase 2b IS DONE.
|
||
The operator kicked Phase 3 off manually (transition = manual per §Status). Sequencing across
|
||
phases is an operator call (cf. STATUS-2b note), so I proceed with Phase-3 adversary duties; I am
|
||
NOT treating the Phase-2 VETO as a Phase-3 blocker, only flagging the dependency.
|
||
- Nothing claimed → idle per liveness protocol; watchdog pings me on the first `claim(3...)` commit.
|
||
|
||
**No verdict. No VETO (Phase-3).** Awaiting Builder's first gate claim.
|
||
|
||
### @2026-05-31T05:55Z — PRE-CLAIM RECON (NOT a verdict): U0.1 pure level() mapper fuzz-clean
|
||
Builder committed `9773e3f feat(3 U0.1): pure level() ladder mapper + unit tests` but has NOT
|
||
claimed any gate (STATUS-3 "## Gate (none claimed)"). I probed early so I'm focused when U0 lands.
|
||
Cold-run from a fresh clone on the cc-ci host @9773e3f (`cc-ci-run -m pytest tests/unit/test_level.py`):
|
||
- Builder's 15 unit tests: **15 passed**.
|
||
- My own adversarial inputs (6 cases the Builder didn't write): all correct — mid/higher passes never
|
||
rescue a lower gap; install na/fail → L0; all-na-above-install → L1.
|
||
- **Exhaustive fuzz: all 3^6 = 729 rung combinations → `compute_level` level == count of leading
|
||
consecutive passes, 0 mismatches.** The pure mapper provably cannot inflate the level.
|
||
**Binding question deferred to the U0 claim:** inflation can only enter via the *translation layer*
|
||
(`run_recipe_ci.py` mapping raw per-tier results + deps/SSO signals → the rung dict) and via whether
|
||
`results.json` is actually emitted per real run. The pure function is sound; I will attack the mapping
|
||
and the real emitted artifact when U0 is CLAIMED. Not anchoring on the Builder's narrative — this is
|
||
my own cold re-run + fuzz. No verdict yet.
|
||
|
||
### @2026-05-31T07:05Z — U0 GATE: **PASS** (Results schema + level; R1)
|
||
|
||
**Claim (STATUS-3, `claim(3 U0)` @5b6b378).** `run_recipe_ci.py` emits per-run `results.json` with
|
||
per-stage AND per-test ✔/✘ breakdown + a computed integer **level** (L0–L6, YunoHost gap-cap).
|
||
Accept: level correct for an L4-pass recipe and one capped at the L2 rung.
|
||
|
||
**Verification was COLD + INDEPENDENT.** My clone is on the orchestrator VM; `cc-ci-run` lives only
|
||
on the cc-ci host, so I tar'd my clone's `runner/` + `tests/` to a fresh `/tmp/advverify` on cc-ci
|
||
and ran everything under the real `cc-ci-run` harness. Verdict formed from the plan (SSOT) + code +
|
||
STATUS-3 verification info + my own re-run/probe — JOURNAL-3 NOT read first (anti-anchoring §6.1).
|
||
|
||
**1. Unit tests (cold, real harness).** `PYTHONPATH=runner cc-ci-run -m pytest
|
||
tests/unit/test_level.py tests/unit/test_results.py -q` → **29 passed in 0.09s**.
|
||
(Builder's STATUS said 28 @claim sha; origin HEAD has one more — superset, all green. NB: pytest
|
||
needs `tests/conftest.py:13` to put `runner/` on sys.path; the Builder runs from the repo root where
|
||
it loads natively, so this is an invocation detail of my /tmp copy, not a defect.)
|
||
|
||
**2. My own independent break-it probe** (`/tmp/adv_probe_u0c.py`, written from scratch against the
|
||
actual source API `harness.level`/`harness.results`, re-implementing the DECISIONS Phase-3 contract
|
||
independently; run under `cc-ci-run` — **EXIT 0, all 10 checks OK**):
|
||
- `[1]` `compute_level` exhaustive **729 (3^6)** rung-combos == my independent reference (level =
|
||
count of leading contiguous passes); cap_reason empty iff L6, present iff <L6. 0 mismatches.
|
||
- `[2]` **NO-INFLATION:** degrading ANY pass rung → fail/na never raises the level. 0 violations.
|
||
- `[3]` **gap-cap:** level never exceeds the index of the first non-pass rung. 0 cap-breaks.
|
||
- `[4]` `backup_restore_status`: pass only iff (capable ∧ both pass); either fail→fail; not capable→na.
|
||
- `[5]` `derive_rungs` **SSO gating:** no declared deps → integration **na** → full pass caps **L4**
|
||
("no integration surface caps at L4"); declared+wired → **L5**; `sso_unverified` → fail.
|
||
- `[6]` `derive_rungs` **no-pass-without-backing-tier:** exhaustive 3^5 tier combos × {capable,
|
||
declared, deps_ready, sso_unverified, repo_local}× big fuzz — NO rung ever reports `pass` without
|
||
the backing tier(s) actually passing. 0 inflation paths.
|
||
- `[7]` e2e `build_results`: one failing `custom` test ⇒ functional rung fail ⇒ level **capped L3**.
|
||
- `[7b]` e2e: `upgrade` fail ⇒ **L1** even though backup/restore/custom passed (later passes ignored).
|
||
- `[8]` serialised results.json **clean of secret keywords**; `[9]` schema keys all present.
|
||
|
||
**3. Real emitted artifacts on cc-ci match EXPECTED EXACTLY** (fetched `/var/lib/cc-ci-runs/*/results.json`):
|
||
- **custom-html-tiny** (`u0-cht-L2`/`manual` + `adv-cht`): `level=2`,
|
||
`cap="L3 backup/restore (data integrity) N/A"`,
|
||
`rungs={install:pass,upgrade:pass,backup_restore:na,functional:na,integration:na,recipe_local:na}`,
|
||
`results={install:pass,upgrade:pass,backup:skip,restore:skip,custom:skip}`,
|
||
`flags={clean_teardown:true,no_secret_leak:true}`, stages=[install,upgrade] each w/ a per-test row.
|
||
A recipe whose functional tests would pass is still **capped at L2** because a LOWER rung (L3
|
||
backup) is N/A — gap-cap works, never inflates. ✔
|
||
- **uptime-kuma** (`u0-uk-L4`): `level=4`, `cap="L5 integration (SSO/OIDC + cross-app) N/A"`,
|
||
`rungs={install:pass,upgrade:pass,backup_restore:pass,functional:pass,integration:na,recipe_local:na}`,
|
||
all five tiers pass, stages=[install,upgrade,backup,restore,custom]; **custom has 5 tests all pass**
|
||
(3 uptime-kuma functional: health_check / socketio_handshake / spa_branding [source `cc-ci`] + 2
|
||
generic), `flags.clean_teardown=true`. A full clean climb with no SSO surface caps at **L4**. ✔
|
||
These two bracket the gate; the level never reads greener than the tiers.
|
||
|
||
**4. Leak scan over all 3 raw `results.json`.** The only matches for
|
||
`password|secret|token|passwd|api_key|privkey|private` are the **field name `no_secret_leak`** — a
|
||
flag name, not a value. **Real secret-value leaks: 0.**
|
||
|
||
**5. Clean teardown (live).** `docker service ls` on cc-ci shows **only `traefik_app`** — zero
|
||
run-app stacks (`*-pr*`/`adv-*`/`u0-*`/recipe services). The Builder's U0 runs all tore down cleanly;
|
||
the `clean_teardown:true` flag is corroborated by reality.
|
||
|
||
**6. Emission is R7-safe (code inspection).** `run_recipe_ci.py::_emit_results` wraps
|
||
`build_results`→`_scan_results_for_secrets`→`write_results` in `try/except Exception` → on any
|
||
failure it only prints a non-fatal `[results] WARN` and swallows; `_emit_and_return` always
|
||
`return overall` (the tier-derived verdict). Cosmetics cannot change the run's exit code.
|
||
|
||
**7. Contract consistency.** `harness/level.py` is pure (no I/O); `derive_rungs` is conservative by
|
||
construction; DECISIONS.md Phase-3 (ladder + rung-mapping + schema + artifact hosting) matches the
|
||
code. The integration-na "cap at L4" transparency is a DECISIONS-settled refinement of plan §4.1's
|
||
"proposed default" (plan §7 defers cap-vs-N/A to DECISIONS) — authorized, not inflation.
|
||
|
||
**VERDICT: U0 PASS @2026-05-31T07:05Z.** No inflation, no cap-break, no real secret leak, clean
|
||
teardown, R7-safe emission, schema complete. **R1 (level ladder) cold-verified.** No VETO. Builder
|
||
may proceed past U0.
|
||
|
||
**Carry-forward (NOT blocking U0 — recorded so they aren't lost):**
|
||
- ⚠️ `no_secret_leak=True` is hard-coded in `_emit_results`; the real protection is
|
||
`_scan_results_for_secrets` *raising* (→ emission fails) on a hit. DECISIONS notes the flag is "a
|
||
narrow self-scan; the Adversary's broader leak scan is the authority (R7/U5)". Acceptable at U0; I
|
||
will be the leak authority at U5 over images/screenshots/comments + the served artifacts.
|
||
- ⚠️ `clean_teardown=(overall == 0 or ctx.teardown_clean)` — a green run asserts the flag True without
|
||
re-deriving the deploy-count/dep-teardown check that DECISIONS describes. Informational flag, not a
|
||
level; will scrutinise once the dashboard surfaces it (U4) and the kill-mid-run teardown probe (U5).
|
||
- The `screenshot`/`summary_card` fields are present-but-null at U0 (expected; populated U1/U2). I
|
||
will verify the served-at-stable-URL hosting (`/runs/<id>/...`) and hold the cardinal invariant
|
||
(rendered card/level/screenshot never greener than raw results.json + actual outcomes) at U2–U4.
|
||
- Pre-existing repo-wide lint RED on origin/main (Builder-flagged) is not a Phase-3 DoD item and not
|
||
introduced by U0 — noted, not a finding.
|
||
|
||
### @2026-05-31T07:15Z — U1 GATE: **PASS** (App screenshot; R4)
|
||
|
||
**Claim (STATUS-3, `claim(3 U1)` @d7e812e).** The harness captures a real Playwright screenshot of
|
||
the deployed app while it is up (after deploy+readiness, before teardown), writes `screenshot.png` to
|
||
the run artifact dir, is secret-safe by default (landing page, never a credentials page), and is
|
||
best-effort so it never blocks/fails/hangs the run (R7); `results.json` `screenshot` is set to
|
||
`"screenshot.png"` only when a file was produced.
|
||
|
||
**Verification COLD + INDEPENDENT** (my clone tar'd to a fresh `/tmp/advverify` on cc-ci, run under
|
||
the real `cc-ci-run`; JOURNAL-3 not read before this verdict).
|
||
|
||
**1. Pure-helper unit tests.** `cc-ci-run -m pytest tests/unit/test_screenshot.py -q` → **3 passed**.
|
||
(STATUS EXPECTED said "4 passed"; the file has exactly **3** test functions. Minor over-count in the
|
||
claim doc — NOT a defect, recorded for honesty.)
|
||
|
||
**2. Real positive capture — MY OWN live run.** `RECIPE=uptime-kuma STAGES=install,custom
|
||
CCCI_RUN_ID=u1-adv cc-ci-run runner/run_recipe_ci.py` ran to completion (install pass, custom pass,
|
||
exit clean). Artifacts: `/var/lib/cc-ci-runs/u1-adv/{screenshot.png,results.json,junit/}`.
|
||
- I `scp`'d `screenshot.png` to the VM and **EYEBALLED it with the image viewer**: a valid PNG header,
|
||
**1280×800, 39 773 bytes**, showing uptime-kuma's live **"Create your admin account"** setup page —
|
||
empty Username / Password / Repeat-Password fields + a Create button. This is **real working app UI**
|
||
and displays **NO secret values** (a setup form asks the user to *choose* a password; it reveals
|
||
none). Secret-safe ✔.
|
||
- `results.json`: `screenshot="screenshot.png"`, `level=1` (cap "L2 upgrade … N/A" — correct for an
|
||
install-only run), `flags={clean_teardown:true, no_secret_leak:true}`, `results={install:pass,
|
||
custom:pass}`. The screenshot field is set BECAUSE a file was produced. ✔
|
||
|
||
**3. Clean teardown (live).** Post-run `docker service ls` shows only infra (backups / bridge /
|
||
dashboard / drone / traefik×2) — **no orphan uptime-kuma stack**. ✔
|
||
|
||
**4. Graceful degradation (R7) — the key cosmetics-never-block invariant.** I drove
|
||
`screenshot.capture("adv-noexist-xyz.ci.commoninternet.net", "/tmp/advx.png")` against an
|
||
unresolvable host: it printed `screenshot: capture failed (non-fatal, verdict unaffected):
|
||
... ERR_NAME_NOT_RESOLVED`, **returned `None`, wrote no file, raised nothing**. A screenshot failure
|
||
cannot fail/hang the run or flip the verdict. ✔
|
||
|
||
**5. Wiring is R7-safe (code inspection, cold).** `run_recipe_ci.py:968-979` places the capture
|
||
under `if deploy_ok:` AFTER `lifecycle.wait_healthy(...)` and BEFORE any tier mutates state and BEFORE
|
||
the `finally` teardown — so the app is genuinely up and in its cleanest state when shot. It is
|
||
**outside** the deploy `try/except`, so a screenshot issue can never flip `deploy_ok`. `capture()`
|
||
itself wraps everything in `try/except Exception → return None` with a hard `NAV_DEADLINE_S=45`
|
||
cap (can't hang). `screenshot_rel` is `basename(shot) if shot else None`, and the whole
|
||
`build_results`/`write_results` block is itself R7-wrapped. Cosmetics provably cannot change `overall`.
|
||
|
||
**6. Secret-safety by design.** Default capture is the app landing page (login/setup forms show
|
||
*fields*, not secrets); `full_page=False` (viewport only, no scroll into a secrets panel); the harness
|
||
**never auto-fills an install wizard**; a post-login view is only reachable via an opt-in recipe
|
||
`SCREENSHOT` hook that owns the no-secret-page guarantee — **none used yet**, so no recipe currently
|
||
risks a credential page.
|
||
|
||
**Cardinal U1 invariant** (screenshot is a faithful live-app capture, never a credentials page, and
|
||
its presence/absence never changes the verdict): **HELD**.
|
||
|
||
**VERDICT: U1 PASS @2026-05-31T07:15Z.** **R4 (app screenshot) cold-verified.** No VETO. Builder may
|
||
proceed to U2.
|
||
|
||
**Carry-forward (NOT blocking U1):**
|
||
- The plan's "post-login where the landing page requires it" path (the `SCREENSHOT` hook) is
|
||
*implemented* but *unexercised on any real recipe* — uptime-kuma's informative landing/setup page
|
||
doesn't need it. Fine for U1's accept criterion ("working UI, no secrets"); I'll re-scrutinise the
|
||
hook + secret-safety once a recipe whose landing page is blank/uninformative opts in, and over the
|
||
served card/dashboard images at U2–U5 (R7 leak authority is mine).
|
||
- STATUS EXPECTED's "4 passed" vs actual 3 unit tests — doc-only over-count; flag to Builder via the
|
||
honest-reporting rule, no behavioural impact.
|