review(3 U0): PASS — results.json schema + level ladder cold-verified

Cold/independent on the real cc-ci-run harness:

- 29 unit tests pass (test_level + test_results, PYTHONPATH=runner).
- Independent break-it probe EXIT 0, all 10 checks: compute_level 729 exhaustive vs own reference; no-inflation monotonicity; gap-cap; backup_restore_status; SSO gating (no-deps->L4, deps->L5, unverified->fail); derive_rungs no-pass-without-backing big fuzz; e2e custom-fail->L3 + upgrade-fail->L1; leak-clean; schema complete.
- Real artifacts match EXPECTED exactly: custom-html-tiny L2 (cap L3 backup N/A), uptime-kuma L4 (cap L5 integration N/A).

0 real secret leaks (only field name no_secret_leak matched). Clean teardown (only traefik_app live). Emission R7-wrapped (try/except; return overall) so cosmetics never change the verdict. R1 (level ladder) cold-verified. Builder may proceed past U0. No VETO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 06:53:34 +00:00
parent 442741c0c8
commit 18d2bd1443
1 changed file with 88 additions and 3 deletions
--- a/machine-docs/REVIEW-3.md
+++ b/machine-docs/REVIEW-3.md
@@ -5,8 +5,8 @@ This is the Adversary-owned, append-only verdict log for Phase 3. The Builder ow
 JOURNAL-3.md / BACKLOG-3.md `## Build backlog`. I own this file + BACKLOG-3.md `## Adversary findings`.

 ## Definition of Done (Phase 3) — R1–R8, each to be Adversary cold-verified within 24h
- [ ] **R1 — Level ladder.** Documented ladder (§4.1) maps passed test sets → one integer level per
-      run; a missing lower rung caps the level (YunoHost semantics).
+- [x] **R1 — Level ladder.** Documented ladder (§4.1) maps passed test sets → one integer level per
+      run; a missing lower rung caps the level (YunoHost semantics). **COLD-VERIFIED @U0 07:05Z.**
 - [ ] **R2 — Image-forward PR comment.** `!testme` posts/updates a Gitea PR comment: marker (🌻) +
      status/level badge + summary image, both linking to run/dashboard; re-run updates same comment.
 - [ ] **R3 — Summary card image.** Per-run PNG: recipe+version, level, per-stage/per-test ✔/✘
@@ -22,7 +22,7 @@ JOURNAL-3.md / BACKLOG-3.md `## Build backlog`. I own this file + BACKLOG-3.md `
 - [ ] **R8 — Docs.** docs/ explains ladder, card/screenshot/badge generation, badge embedding.

 ## Milestone gates (each ends with an Adversary gate) — U0..U5
- [ ] U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap).
+- [x] U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap). **PASS @07:05Z.**
 - [ ] U1 — App screenshot (real, post-login, secret-safe).
 - [ ] U2 — Summary card + badge (HTML→PNG; level/✔✘/screenshot; SVG badge; stable URLs; pass+fail).
 - [ ] U3 — YunoHost-style PR comment (marker+badge+card, linked; updates on re-run; no secrets).
@@ -73,3 +73,88 @@ Cold-run from a fresh clone on the cc-ci host @9773e3f (`cc-ci-run -m pytest tes
 `results.json` is actually emitted per real run. The pure function is sound; I will attack the mapping
 and the real emitted artifact when U0 is CLAIMED. Not anchoring on the Builder's narrative — this is
 my own cold re-run + fuzz. No verdict yet.
+
+### @2026-05-31T07:05Z — U0 GATE: **PASS** (Results schema + level; R1)
+
+**Claim (STATUS-3, `claim(3 U0)` @5b6b378).** `run_recipe_ci.py` emits per-run `results.json` with
+per-stage AND per-test ✔/✘ breakdown + a computed integer **level** (L0–L6, YunoHost gap-cap).
+Accept: level correct for an L4-pass recipe and one capped at the L2 rung.
+
+**Verification was COLD + INDEPENDENT.** My clone is on the orchestrator VM; `cc-ci-run` lives only
+on the cc-ci host, so I tar'd my clone's `runner/` + `tests/` to a fresh `/tmp/advverify` on cc-ci
+and ran everything under the real `cc-ci-run` harness. Verdict formed from the plan (SSOT) + code +
+STATUS-3 verification info + my own re-run/probe — JOURNAL-3 NOT read first (anti-anchoring §6.1).
+
+**1. Unit tests (cold, real harness).** `PYTHONPATH=runner cc-ci-run -m pytest
+tests/unit/test_level.py tests/unit/test_results.py -q` → **29 passed in 0.09s**.
+(Builder's STATUS said 28 @claim sha; origin HEAD has one more — superset, all green. NB: pytest
+needs `tests/conftest.py:13` to put `runner/` on sys.path; the Builder runs from the repo root where
+it loads natively, so this is an invocation detail of my /tmp copy, not a defect.)
+
+**2. My own independent break-it probe** (`/tmp/adv_probe_u0c.py`, written from scratch against the
+actual source API `harness.level`/`harness.results`, re-implementing the DECISIONS Phase-3 contract
+independently; run under `cc-ci-run` — **EXIT 0, all 10 checks OK**):
+- `[1]` `compute_level` exhaustive **729 (3^6)** rung-combos == my independent reference (level =
+  count of leading contiguous passes); cap_reason empty iff L6, present iff <L6. 0 mismatches.
+- `[2]` **NO-INFLATION:** degrading ANY pass rung → fail/na never raises the level. 0 violations.
+- `[3]` **gap-cap:** level never exceeds the index of the first non-pass rung. 0 cap-breaks.
+- `[4]` `backup_restore_status`: pass iff (capable ∧ both pass); either fail→fail; not capable→na.
+- `[5]` `derive_rungs` **SSO gating:** no declared deps → integration **na** → full pass caps **L4**
+  ("no integration surface caps at L4"); declared+wired → **L5**; `sso_unverified` → fail.
+- `[6]` `derive_rungs` **no-pass-without-backing-tier:** exhaustive 3^5 tier combos × {capable,
+  declared, deps_ready, sso_unverified, repo_local} × big fuzz: NO rung ever reports `pass` without
+  the backing tier(s) actually passing. 0 inflation paths.
+- `[7]` e2e `build_results`: one failing `custom` test ⇒ functional rung fail ⇒ level **capped L3**.
+- `[7b]` e2e: `upgrade` fail ⇒ **L1** even though backup/restore/custom passed (later passes ignored).
+- `[8]` serialised results.json **clean of secret keywords**; `[9]` schema keys all present.
+
+**3. Real emitted artifacts on cc-ci match EXPECTED EXACTLY** (fetched `/var/lib/cc-ci-runs/*/results.json`):
+- **custom-html-tiny** (`u0-cht-L2`/`manual` + `adv-cht`): `level=2`,
+  `cap="L3 backup/restore (data integrity) N/A"`,
+  `rungs={install:pass,upgrade:pass,backup_restore:na,functional:na,integration:na,recipe_local:na}`,
+  `results={install:pass,upgrade:pass,backup:skip,restore:skip,custom:skip}`,
+  `flags={clean_teardown:true,no_secret_leak:true}`, stages=[install,upgrade] each w/ a per-test row.
+  A recipe whose functional tests would pass is still **capped at L2** because a LOWER rung (L3
+  backup) is N/A — gap-cap works, never inflates. ✔
+- **uptime-kuma** (`u0-uk-L4`): `level=4`, `cap="L5 integration (SSO/OIDC + cross-app) N/A"`,
+  `rungs={install:pass,upgrade:pass,backup_restore:pass,functional:pass,integration:na,recipe_local:na}`,
+  all five tiers pass, stages=[install,upgrade,backup,restore,custom]; **custom has 5 tests all pass**
+  (3 uptime-kuma functional: health_check / socketio_handshake / spa_branding [source `cc-ci`] + 2
+  generic), `flags.clean_teardown=true`. A full clean climb with no SSO surface caps at **L4**. ✔
+  These two bracket the gate; the level never reads greener than the tiers.
+
+**4. Leak scan over all 3 raw `results.json`.** The only matches for
+`password|secret|token|passwd|api_key|privkey|private` are the **field name `no_secret_leak`** — a
+flag name, not a value. **Real secret-value leaks: 0.**
+
+**5. Clean teardown (live).** `docker service ls` on cc-ci shows **only `traefik_app`** — zero
+run-app stacks (`*-pr*`/`adv-*`/`u0-*`/recipe services). The Builder's U0 runs all tore down cleanly;
+the `clean_teardown:true` flag is corroborated by reality.
+
+**6. Emission is R7-safe (code inspection).** `run_recipe_ci.py::_emit_results` wraps
+`build_results`→`_scan_results_for_secrets`→`write_results` in `try/except Exception` → on any
+failure it only prints a non-fatal `[results] WARN` and swallows the exception; `_emit_and_return` always
+returns `overall` (the tier-derived verdict). Cosmetics cannot change the run's exit code.
+
+**7. Contract consistency.** `harness/level.py` is pure (no I/O); `derive_rungs` is conservative by
+construction; DECISIONS.md Phase-3 (ladder + rung-mapping + schema + artifact hosting) matches the
+code. The integration-na "cap at L4" transparency is a DECISIONS-settled refinement of plan §4.1's
+"proposed default" (plan §7 defers cap-vs-N/A to DECISIONS) — authorized, not inflation.
+
+**VERDICT: U0 PASS @2026-05-31T07:05Z.** No inflation, no cap-break, no real secret leak, clean
+teardown, R7-safe emission, schema complete. **R1 (level ladder) cold-verified.** No VETO. Builder
+may proceed past U0.
+
+**Carry-forward (NOT blocking U0 — recorded so they aren't lost):**
+- ⚠️ `no_secret_leak=True` is hard-coded in `_emit_results`; the real protection is
+  `_scan_results_for_secrets` *raising* (→ emission fails) on a hit. DECISIONS notes the flag is "a
+  narrow self-scan; the Adversary's broader leak scan is the authority (R7/U5)". Acceptable at U0; I
+  will be the leak authority at U5 over images/screenshots/comments + the served artifacts.
+- ⚠️ `clean_teardown=(overall == 0 or ctx.teardown_clean)` — a green run asserts the flag True without
+  re-deriving the deploy-count/dep-teardown check that DECISIONS describes. Informational flag, not a
+  level; I will scrutinise it once the dashboard surfaces it (U4) and again in the kill-mid-run teardown probe (U5).
+- The `screenshot`/`summary_card` fields are present-but-null at U0 (expected; populated U1/U2). I
+  will verify the served-at-stable-URL hosting (`/runs/<id>/...`) and hold the cardinal invariant
+  (rendered card/level/screenshot never greener than raw results.json + actual outcomes) at U2–U4.
+- Pre-existing repo-wide lint RED on origin/main (Builder-flagged) is not a Phase-3 DoD item and not
+  introduced by U0 — noted, not a finding.
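
The ladder rule exercised by probe checks `[1]` to `[3]` in the verdict above (level = count of leading contiguous passes, exhaustive over 3^6 rung combinations, no inflation when a passing rung is degraded) can be sketched as follows. This is a minimal illustration, not the Adversary's actual `/tmp/adv_probe_u0c.py` and not the real `harness.level.compute_level`, whose signature is not shown in this commit. The names `reference_level`, `all_rung_combos`, and `count_inflation_violations` are hypothetical; only the rung names and the three states are taken from the `rungs={...}` maps quoted in the review.

```python
from itertools import product

# Rung order as it appears in the `rungs={...}` maps quoted in the review.
RUNGS = ("install", "upgrade", "backup_restore",
         "functional", "integration", "recipe_local")
STATES = ("pass", "fail", "na")


def reference_level(rungs):
    """Hypothetical reference: level = number of leading contiguous passes.

    The first rung that is not `pass` (fail or na) caps the level, so
    later passes are ignored (gap-cap).
    """
    level = 0
    for name in RUNGS:
        if rungs[name] != "pass":
            break
        level += 1
    return level


def all_rung_combos():
    """Yield every assignment of a state to each rung (3^6 = 729 dicts)."""
    for states in product(STATES, repeat=len(RUNGS)):
        yield dict(zip(RUNGS, states))


def count_inflation_violations(level_fn=reference_level):
    """Count cases where degrading one passing rung raises the level.

    `level_fn` is an assumed adapter: any callable that maps a rung dict
    to an integer level can be checked the same way.
    """
    violations = 0
    for rungs in all_rung_combos():
        base = level_fn(rungs)
        for name in RUNGS:
            if rungs[name] != "pass":
                continue
            for worse in ("fail", "na"):
                degraded = dict(rungs, **{name: worse})
                if level_fn(degraded) > base:
                    violations += 1
    return violations
```

Passing a thin wrapper around the real level function as `level_fn` would let the same exhaustive loop act as a cross-check; the wrapper itself would depend on the actual API and is left out here.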