Files
cc-ci/machine-docs/REVIEW-3.md
autonomic-bot 18d2bd1443 review(3 U0): PASS — results.json schema + level ladder cold-verified
Cold/independent on the real cc-ci-run harness:
- 29 unit tests pass (test_level + test_results, PYTHONPATH=runner).
- Independent break-it probe EXIT 0, all 10 checks: compute_level 729 exhaustive vs own
  reference; no-inflation monotonicity; gap-cap; backup_restore_status; SSO gating
  (no-deps->L4, deps->L5, unverified->fail); derive_rungs no-pass-without-backing big fuzz;
  e2e custom-fail->L3 + upgrade-fail->L1; leak-clean; schema complete.
- Real artifacts match EXPECTED exactly: custom-html-tiny L2 (cap L3 backup N/A),
  uptime-kuma L4 (cap L5 integration N/A). 0 real secret leaks (only field name
  no_secret_leak matched). Clean teardown (only traefik_app live). Emission R7-wrapped
  (try/except; return overall) so cosmetics never change the verdict.
R1 (level ladder) cold-verified. Builder may proceed past U0. No VETO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 06:53:34 +00:00

161 lines
12 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW-3 — Adversary verdicts for cc-ci Phase 3 (Beautiful YunoHost-style results UX)
SSOT for this phase: `/srv/cc-ci/cc-ci-plan/plan-phase3-results-ux.md`.
This is the Adversary-owned, append-only verdict log for Phase 3. The Builder owns STATUS-3.md /
JOURNAL-3.md / BACKLOG-3.md `## Build backlog`. I own this file + BACKLOG-3.md `## Adversary findings`.
## Definition of Done (Phase 3) — R1R8, each to be Adversary cold-verified within 24h
- [x] **R1 — Level ladder.** Documented ladder (§4.1) maps passed test sets → one integer level per
run; a missing lower rung caps the level (YunoHost semantics). **COLD-VERIFIED @U0 07:05Z.**
- [ ] **R2 — Image-forward PR comment.** `!testme` posts/updates a Gitea PR comment: marker (🌻) +
status/level badge + summary image, both linking to run/dashboard; re-run updates same comment.
- [ ] **R3 — Summary card image.** Per-run PNG: recipe+version, level, per-stage/per-test ✔/✘
breakdown, embedded deployed-app screenshot; stable URL; in comment + dashboard.
- [ ] **R4 — App screenshot.** Runner captures real screenshot of deployed app (Playwright, post-login
where needed) for the card.
- [ ] **R5 — Dashboard polish.** Overview at ci.commoninternet.net resembles ci-apps.yunohost.org:
recipe grid w/ level badge, latest pass/fail, last version, app screenshot, history link.
- [ ] **R6 — Badges.** Per-recipe level/status SVG badge endpoint embeddable in READMEs + dashboard.
- [ ] **R7 — Safe & robust.** No secrets in images/comments/badges/screenshots (reuse P1 §4.4
redaction; screenshot must not capture secret values). Image gen never blocks/fails the pipeline:
on error → text fallback + recorded failure; verdict unaffected.
- [ ] **R8 — Docs.** docs/ explains ladder, card/screenshot/badge generation, badge embedding.
## Milestone gates (each ends with an Adversary gate) — U0..U5
- [x] U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap). **PASS @07:05Z.**
- [ ] U1 — App screenshot (real, post-login, secret-safe).
- [ ] U2 — Summary card + badge (HTML→PNG; level/✔✘/screenshot; SVG badge; stable URLs; pass+fail).
- [ ] U3 — YunoHost-style PR comment (marker+badge+card, linked; updates on re-run; no secrets).
- [ ] U4 — Dashboard polish (grid mirrors underlying results across several runs).
- [ ] U5 — Badges + docs + hardening (leak scan clean; renderer-kill degrades to text; flip DONE).
## Adversary invariants to attack this phase (from §6 guardrails)
1. **Presentation never inflates the verdict** — rendered level/card MUST match raw results.json &
actual test outcomes. A card greener than its tests = FAIL.
2. **No secrets in any artifact** — comments, badges, cards, app screenshots (esp. generated
admin/app passwords; screenshot must avoid credential pages).
3. **Cosmetics never block the pipeline** — render/screenshot/badge failure degrades to text + warning;
never fails or hangs a run; respects P1 timeouts.
4. **No test-weakening to raise a level** — watch for softened tests or mis-mapped rungs inflating
displayed quality.
---
## Verdict log (append-only)
### @2026-05-31T05:42Z — Phase-3 Adversary loop live (no gate yet)
Cold orient on first wake into Phase 3. Findings:
- Phase 3 plan read in full (SSOT). DoD = R1R8; milestones U0U5; guardrails internalised above.
- **No Phase-3 work exists yet:** no STATUS-3.md / JOURNAL-3.md / BACKLOG-3.md in machine-docs/; no
ADVERSARY-INBOX; HEAD = `7123d82 status(2b): ## DONE`. Builder has not started §1/U0.
- **Prerequisite note (not my call, recorded for honesty):** plan-phase3 §0 says "Do not start until
Phase 2 STATUS.md shows ## DONE (Adversary-verified)." Phase-2 `## DONE` is **not** yet flipped and
REVIEW-2.md carries a **standing VETO** (named upgrade-to-latest checklist satisfied, but full
Phase-2 DONE authorization is a separate later step per REVIEW-2 @2026-05-31). Phase 2b IS DONE.
The operator kicked Phase 3 off manually (transition = manual per §Status). Sequencing across
phases is an operator call (cf. STATUS-2b note), so I proceed with Phase-3 adversary duties; I am
NOT treating the Phase-2 VETO as a Phase-3 blocker, only flagging the dependency.
- Nothing claimed → idle per liveness protocol; watchdog pings me on the first `claim(3...)` commit.
**No verdict. No VETO (Phase-3).** Awaiting Builder's first gate claim.
### @2026-05-31T05:55Z — PRE-CLAIM RECON (NOT a verdict): U0.1 pure level() mapper fuzz-clean
Builder committed `9773e3f feat(3 U0.1): pure level() ladder mapper + unit tests` but has NOT
claimed any gate (STATUS-3 "## Gate (none claimed)"). I probed early so I'm focused when U0 lands.
Cold-run from a fresh clone on the cc-ci host @9773e3f (`cc-ci-run -m pytest tests/unit/test_level.py`):
- Builder's 15 unit tests: **15 passed**.
- My own adversarial inputs (6 cases the Builder didn't write): all correct — mid/higher passes never
rescue a lower gap; install na/fail → L0; all-na-above-install → L1.
- **Exhaustive fuzz: all 3^6 = 729 rung combinations → `compute_level` level == count of leading
consecutive passes, 0 mismatches.** The pure mapper provably cannot inflate the level.
**Binding question deferred to the U0 claim:** inflation can only enter via the *translation layer*
(`run_recipe_ci.py` mapping raw per-tier results + deps/SSO signals → the rung dict) and via whether
`results.json` is actually emitted per real run. The pure function is sound; I will attack the mapping
and the real emitted artifact when U0 is CLAIMED. Not anchoring on the Builder's narrative — this is
my own cold re-run + fuzz. No verdict yet.
### @2026-05-31T07:05Z — U0 GATE: **PASS** (Results schema + level; R1)
**Claim (STATUS-3, `claim(3 U0)` @5b6b378).** `run_recipe_ci.py` emits per-run `results.json` with
per-stage AND per-test ✔/✘ breakdown + a computed integer **level** (L0L6, YunoHost gap-cap).
Accept: level correct for an L4-pass recipe and one capped at the L2 rung.
**Verification was COLD + INDEPENDENT.** My clone is on the orchestrator VM; `cc-ci-run` lives only
on the cc-ci host, so I tar'd my clone's `runner/` + `tests/` to a fresh `/tmp/advverify` on cc-ci
and ran everything under the real `cc-ci-run` harness. Verdict formed from the plan (SSOT) + code +
STATUS-3 verification info + my own re-run/probe — JOURNAL-3 NOT read first (anti-anchoring §6.1).
**1. Unit tests (cold, real harness).** `PYTHONPATH=runner cc-ci-run -m pytest
tests/unit/test_level.py tests/unit/test_results.py -q`**29 passed in 0.09s**.
(Builder's STATUS said 28 @claim sha; origin HEAD has one more — superset, all green. NB: pytest
needs `tests/conftest.py:13` to put `runner/` on sys.path; the Builder runs from the repo root where
it loads natively, so this is an invocation detail of my /tmp copy, not a defect.)
**2. My own independent break-it probe** (`/tmp/adv_probe_u0c.py`, written from scratch against the
actual source API `harness.level`/`harness.results`, re-implementing the DECISIONS Phase-3 contract
independently; run under `cc-ci-run`**EXIT 0, all 10 checks OK**):
- `[1]` `compute_level` exhaustive **729 (3^6)** rung-combos == my independent reference (level =
count of leading contiguous passes); cap_reason empty iff L6, present iff <L6. 0 mismatches.
- `[2]` **NO-INFLATION:** degrading ANY pass rung fail/na never raises the level. 0 violations.
- `[3]` **gap-cap:** level never exceeds the index of the first non-pass rung. 0 cap-breaks.
- `[4]` `backup_restore_status`: pass only iff (capable both pass); either failfail; not capablena.
- `[5]` `derive_rungs` **SSO gating:** no declared deps integration **na** full pass caps **L4**
("no integration surface caps at L4"); declared+wired **L5**; `sso_unverified` fail.
- `[6]` `derive_rungs` **no-pass-without-backing-tier:** exhaustive 3^5 tier combos × {capable,
declared, deps_ready, sso_unverified, repo_local}× big fuzz NO rung ever reports `pass` without
the backing tier(s) actually passing. 0 inflation paths.
- `[7]` e2e `build_results`: one failing `custom` test functional rung fail level **capped L3**.
- `[7b]` e2e: `upgrade` fail **L1** even though backup/restore/custom passed (later passes ignored).
- `[8]` serialised results.json **clean of secret keywords**; `[9]` schema keys all present.
**3. Real emitted artifacts on cc-ci match EXPECTED EXACTLY** (fetched `/var/lib/cc-ci-runs/*/results.json`):
- **custom-html-tiny** (`u0-cht-L2`/`manual` + `adv-cht`): `level=2`,
`cap="L3 backup/restore (data integrity) N/A"`,
`rungs={install:pass,upgrade:pass,backup_restore:na,functional:na,integration:na,recipe_local:na}`,
`results={install:pass,upgrade:pass,backup:skip,restore:skip,custom:skip}`,
`flags={clean_teardown:true,no_secret_leak:true}`, stages=[install,upgrade] each w/ a per-test row.
A recipe whose functional tests would pass is still **capped at L2** because a LOWER rung (L3
backup) is N/A gap-cap works, never inflates.
- **uptime-kuma** (`u0-uk-L4`): `level=4`, `cap="L5 integration (SSO/OIDC + cross-app) N/A"`,
`rungs={install:pass,upgrade:pass,backup_restore:pass,functional:pass,integration:na,recipe_local:na}`,
all five tiers pass, stages=[install,upgrade,backup,restore,custom]; **custom has 5 tests all pass**
(3 uptime-kuma functional: health_check / socketio_handshake / spa_branding [source `cc-ci`] + 2
generic), `flags.clean_teardown=true`. A full clean climb with no SSO surface caps at **L4**.
These two bracket the gate; the level never reads greener than the tiers.
**4. Leak scan over all 3 raw `results.json`.** The only matches for
`password|secret|token|passwd|api_key|privkey|private` are the **field name `no_secret_leak`** a
flag name, not a value. **Real secret-value leaks: 0.**
**5. Clean teardown (live).** `docker service ls` on cc-ci shows **only `traefik_app`** zero
run-app stacks (`*-pr*`/`adv-*`/`u0-*`/recipe services). The Builder's U0 runs all tore down cleanly;
the `clean_teardown:true` flag is corroborated by reality.
**6. Emission is R7-safe (code inspection).** `run_recipe_ci.py::_emit_results` wraps
`build_results``_scan_results_for_secrets``write_results` in `try/except Exception` on any
failure it only prints a non-fatal `[results] WARN` and swallows; `_emit_and_return` always
`return overall` (the tier-derived verdict). Cosmetics cannot change the run's exit code.
**7. Contract consistency.** `harness/level.py` is pure (no I/O); `derive_rungs` is conservative by
construction; DECISIONS.md Phase-3 (ladder + rung-mapping + schema + artifact hosting) matches the
code. The integration-na "cap at L4" transparency is a DECISIONS-settled refinement of plan §4.1's
"proposed default" (plan §7 defers cap-vs-N/A to DECISIONS) authorized, not inflation.
**VERDICT: U0 PASS @2026-05-31T07:05Z.** No inflation, no cap-break, no real secret leak, clean
teardown, R7-safe emission, schema complete. **R1 (level ladder) cold-verified.** No VETO. Builder
may proceed past U0.
**Carry-forward (NOT blocking U0 — recorded so they aren't lost):**
- `no_secret_leak=True` is hard-coded in `_emit_results`; the real protection is
`_scan_results_for_secrets` *raising* (→ emission fails) on a hit. DECISIONS notes the flag is "a
narrow self-scan; the Adversary's broader leak scan is the authority (R7/U5)". Acceptable at U0; I
will be the leak authority at U5 over images/screenshots/comments + the served artifacts.
- `clean_teardown=(overall == 0 or ctx.teardown_clean)` a green run asserts the flag True without
re-deriving the deploy-count/dep-teardown check that DECISIONS describes. Informational flag, not a
level; will scrutinise once the dashboard surfaces it (U4) and the kill-mid-run teardown probe (U5).
- The `screenshot`/`summary_card` fields are present-but-null at U0 (expected; populated U1/U2). I
will verify the served-at-stable-URL hosting (`/runs/<id>/...`) and hold the cardinal invariant
(rendered card/level/screenshot never greener than raw results.json + actual outcomes) at U2U4.
- Pre-existing repo-wide lint RED on origin/main (Builder-flagged) is not a Phase-3 DoD item and not
introduced by U0 noted, not a finding.