Files

autonomic-bot 18d2bd1443 review(3 U0): PASS — results.json schema + level ladder cold-verified

Cold/independent on the real cc-ci-run harness:
- 29 unit tests pass (test_level + test_results, PYTHONPATH=runner).
- Independent break-it probe EXIT 0, all 10 checks: compute_level 729 exhaustive vs own
  reference; no-inflation monotonicity; gap-cap; backup_restore_status; SSO gating
  (no-deps->L4, deps->L5, unverified->fail); derive_rungs no-pass-without-backing big fuzz;
  e2e custom-fail->L3 + upgrade-fail->L1; leak-clean; schema complete.
- Real artifacts match EXPECTED exactly: custom-html-tiny L2 (cap L3 backup N/A),
  uptime-kuma L4 (cap L5 integration N/A). 0 real secret leaks (only field name
  no_secret_leak matched). Clean teardown (only traefik_app live). Emission R7-wrapped
  (try/except; return overall) so cosmetics never change the verdict.
R1 (level ladder) cold-verified. Builder may proceed past U0. No VETO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-31 06:53:34 +00:00

12 KiB

Raw Blame History

REVIEW-3 — Adversary verdicts for cc-ci Phase 3 (Beautiful YunoHost-style results UX)

SSOT for this phase: /srv/cc-ci/cc-ci-plan/plan-phase3-results-ux.md. This is the Adversary-owned, append-only verdict log for Phase 3. The Builder owns STATUS-3.md / JOURNAL-3.md / BACKLOG-3.md ## Build backlog. I own this file + BACKLOG-3.md ## Adversary findings.

Definition of Done (Phase 3) — R1–R8, each to be Adversary cold-verified within 24h

R1 — Level ladder. Documented ladder (§4.1) maps passed test sets → one integer level per run; a missing lower rung caps the level (YunoHost semantics). COLD-VERIFIED @U0 07:05Z.
R2 — Image-forward PR comment. !testme posts/updates a Gitea PR comment: marker (🌻) + status/level badge + summary image, both linking to run/dashboard; re-run updates same comment.
R3 — Summary card image. Per-run PNG: recipe+version, level, per-stage/per-test ✔/✘ breakdown, embedded deployed-app screenshot; stable URL; in comment + dashboard.
R4 — App screenshot. Runner captures real screenshot of deployed app (Playwright, post-login where needed) for the card.
R5 — Dashboard polish. Overview at ci.commoninternet.net resembles ci-apps.yunohost.org: recipe grid w/ level badge, latest pass/fail, last version, app screenshot, history link.
R6 — Badges. Per-recipe level/status SVG badge endpoint embeddable in READMEs + dashboard.
R7 — Safe & robust. No secrets in images/comments/badges/screenshots (reuse P1 §4.4 redaction; screenshot must not capture secret values). Image gen never blocks/fails the pipeline: on error → text fallback + recorded failure; verdict unaffected.
R8 — Docs. docs/ explains ladder, card/screenshot/badge generation, badge embedding.

Milestone gates (each ends with an Adversary gate) — U0..U5

U0 — Results schema + level (results.json per-stage/per-test; level correct for L4-pass & L2-cap). PASS @07:05Z.
U1 — App screenshot (real, post-login, secret-safe).
U2 — Summary card + badge (HTML→PNG; level/✔✘/screenshot; SVG badge; stable URLs; pass+fail).
U3 — YunoHost-style PR comment (marker+badge+card, linked; updates on re-run; no secrets).
U4 — Dashboard polish (grid mirrors underlying results across several runs).
U5 — Badges + docs + hardening (leak scan clean; renderer-kill degrades to text; flip DONE).

Adversary invariants to attack this phase (from §6 guardrails)

Presentation never inflates the verdict — rendered level/card MUST match raw results.json & actual test outcomes. A card greener than its tests = FAIL.
No secrets in any artifact — comments, badges, cards, app screenshots (esp. generated admin/app passwords; screenshot must avoid credential pages).
Cosmetics never block the pipeline — render/screenshot/badge failure degrades to text + warning; never fails or hangs a run; respects P1 timeouts.
No test-weakening to raise a level — watch for softened tests or mis-mapped rungs inflating displayed quality.

Verdict log (append-only)

@2026-05-31T05:42Z — Phase-3 Adversary loop live (no gate yet)

Cold orient on first wake into Phase 3. Findings:

Phase 3 plan read in full (SSOT). DoD = R1–R8; milestones U0–U5; guardrails internalised above.
No Phase-3 work exists yet: no STATUS-3.md / JOURNAL-3.md / BACKLOG-3.md in machine-docs/; no ADVERSARY-INBOX; HEAD = 7123d82 status(2b): ## DONE. Builder has not started §1/U0.
Prerequisite note (not my call, recorded for honesty): plan-phase3 §0 says "Do not start until Phase 2 STATUS.md shows ## DONE (Adversary-verified)." Phase-2 ## DONE is not yet flipped and REVIEW-2.md carries a standing VETO (named upgrade-to-latest checklist satisfied, but full Phase-2 DONE authorization is a separate later step per REVIEW-2 @2026-05-31). Phase 2b IS DONE. The operator kicked Phase 3 off manually (transition = manual per §Status). Sequencing across phases is an operator call (cf. STATUS-2b note), so I proceed with Phase-3 adversary duties; I am NOT treating the Phase-2 VETO as a Phase-3 blocker, only flagging the dependency.
Nothing claimed → idle per liveness protocol; watchdog pings me on the first claim(3...) commit.

No verdict. No VETO (Phase-3). Awaiting Builder's first gate claim.

@2026-05-31T05:55Z — PRE-CLAIM RECON (NOT a verdict): U0.1 pure level() mapper fuzz-clean

Builder committed 9773e3f feat(3 U0.1): pure level() ladder mapper + unit tests but has NOT claimed any gate (STATUS-3 "## Gate (none claimed)"). I probed early so I'm focused when U0 lands. Cold-run from a fresh clone on the cc-ci host @9773e3f (cc-ci-run -m pytest tests/unit/test_level.py):

Builder's 15 unit tests: 15 passed.
My own adversarial inputs (6 cases the Builder didn't write): all correct — mid/higher passes never rescue a lower gap; install na/fail → L0; all-na-above-install → L1.
Exhaustive fuzz: all 3^6 = 729 rung combinations → compute_level level == count of leading consecutive passes, 0 mismatches. The pure mapper provably cannot inflate the level. Binding question deferred to the U0 claim: inflation can only enter via the translation layer (run_recipe_ci.py mapping raw per-tier results + deps/SSO signals → the rung dict) and via whether results.json is actually emitted per real run. The pure function is sound; I will attack the mapping and the real emitted artifact when U0 is CLAIMED. Not anchoring on the Builder's narrative — this is my own cold re-run + fuzz. No verdict yet.

@2026-05-31T07:05Z — U0 GATE: PASS (Results schema + level; R1)

Claim (STATUS-3, claim(3 U0) @5b6b378). run_recipe_ci.py emits per-run results.json with per-stage AND per-test ✔/✘ breakdown + a computed integer level (L0–L6, YunoHost gap-cap). Accept: level correct for an L4-pass recipe and one capped at the L2 rung.

Verification was COLD + INDEPENDENT. My clone is on the orchestrator VM; cc-ci-run lives only on the cc-ci host, so I tar'd my clone's runner/ + tests/ to a fresh /tmp/advverify on cc-ci and ran everything under the real cc-ci-run harness. Verdict formed from the plan (SSOT) + code + STATUS-3 verification info + my own re-run/probe — JOURNAL-3 NOT read first (anti-anchoring §6.1).

1. Unit tests (cold, real harness). PYTHONPATH=runner cc-ci-run -m pytest tests/unit/test_level.py tests/unit/test_results.py -q → 29 passed in 0.09s. (Builder's STATUS said 28 @claim sha; origin HEAD has one more — superset, all green. NB: pytest needs tests/conftest.py:13 to put runner/ on sys.path; the Builder runs from the repo root where it loads natively, so this is an invocation detail of my /tmp copy, not a defect.)

2. My own independent break-it probe (/tmp/adv_probe_u0c.py, written from scratch against the actual source API harness.level/harness.results, re-implementing the DECISIONS Phase-3 contract independently; run under cc-ci-run — EXIT 0, all 10 checks OK):

[1] compute_level exhaustive 729 (3^6) rung-combos == my independent reference (level = count of leading contiguous passes); cap_reason empty iff L6, present iff <L6. 0 mismatches.
[2] NO-INFLATION: degrading ANY pass rung → fail/na never raises the level. 0 violations.
[3] gap-cap: level never exceeds the index of the first non-pass rung. 0 cap-breaks.
[4] backup_restore_status: pass only iff (capable ∧ both pass); either fail→fail; not capable→na.
[5] derive_rungs SSO gating: no declared deps → integration na → full pass caps L4 ("no integration surface caps at L4"); declared+wired → L5; sso_unverified → fail.
[6] derive_rungs no-pass-without-backing-tier: exhaustive 3^5 tier combos × {capable, declared, deps_ready, sso_unverified, repo_local}× big fuzz — NO rung ever reports pass without the backing tier(s) actually passing. 0 inflation paths.
[7] e2e build_results: one failing custom test ⇒ functional rung fail ⇒ level capped L3.
[7b] e2e: upgrade fail ⇒ L1 even though backup/restore/custom passed (later passes ignored).
[8] serialised results.json clean of secret keywords; [9] schema keys all present.

3. Real emitted artifacts on cc-ci match EXPECTED EXACTLY (fetched /var/lib/cc-ci-runs/*/results.json):

custom-html-tiny (u0-cht-L2/manual + adv-cht): level=2, cap="L3 backup/restore (data integrity) N/A", rungs={install:pass,upgrade:pass,backup_restore:na,functional:na,integration:na,recipe_local:na}, results={install:pass,upgrade:pass,backup:skip,restore:skip,custom:skip}, flags={clean_teardown:true,no_secret_leak:true}, stages=[install,upgrade] each w/ a per-test row. A recipe whose functional tests would pass is still capped at L2 because a LOWER rung (L3 backup) is N/A — gap-cap works, never inflates. ✔
uptime-kuma (u0-uk-L4): level=4, cap="L5 integration (SSO/OIDC + cross-app) N/A", rungs={install:pass,upgrade:pass,backup_restore:pass,functional:pass,integration:na,recipe_local:na}, all five tiers pass, stages=[install,upgrade,backup,restore,custom]; custom has 5 tests all pass (3 uptime-kuma functional: health_check / socketio_handshake / spa_branding [source cc-ci] + 2 generic), flags.clean_teardown=true. A full clean climb with no SSO surface caps at L4. ✔ These two bracket the gate; the level never reads greener than the tiers.

5. Clean teardown (live). docker service ls on cc-ci shows only traefik_app — zero run-app stacks (*-pr*/adv-*/u0-*/recipe services). The Builder's U0 runs all tore down cleanly; the clean_teardown:true flag is corroborated by reality.

6. Emission is R7-safe (code inspection). run_recipe_ci.py::_emit_results wraps build_results→_scan_results_for_secrets→write_results in try/except Exception → on any failure it only prints a non-fatal [results] WARN and swallows; _emit_and_return always return overall (the tier-derived verdict). Cosmetics cannot change the run's exit code.

7. Contract consistency. harness/level.py is pure (no I/O); derive_rungs is conservative by construction; DECISIONS.md Phase-3 (ladder + rung-mapping + schema + artifact hosting) matches the code. The integration-na "cap at L4" transparency is a DECISIONS-settled refinement of plan §4.1's "proposed default" (plan §7 defers cap-vs-N/A to DECISIONS) — authorized, not inflation.

VERDICT: U0 PASS @2026-05-31T07:05Z. No inflation, no cap-break, no real secret leak, clean teardown, R7-safe emission, schema complete. R1 (level ladder) cold-verified. No VETO. Builder may proceed past U0.

Carry-forward (NOT blocking U0 — recorded so they aren't lost):

⚠️ no_secret_leak=True is hard-coded in _emit_results; the real protection is _scan_results_for_secrets raising (→ emission fails) on a hit. DECISIONS notes the flag is "a narrow self-scan; the Adversary's broader leak scan is the authority (R7/U5)". Acceptable at U0; I will be the leak authority at U5 over images/screenshots/comments + the served artifacts.
⚠️ clean_teardown=(overall == 0 or ctx.teardown_clean) — a green run asserts the flag True without re-deriving the deploy-count/dep-teardown check that DECISIONS describes. Informational flag, not a level; will scrutinise once the dashboard surfaces it (U4) and the kill-mid-run teardown probe (U5).
The screenshot/summary_card fields are present-but-null at U0 (expected; populated U1/U2). I will verify the served-at-stable-URL hosting (/runs/<id>/...) and hold the cardinal invariant (rendered card/level/screenshot never greener than raw results.json + actual outcomes) at U2–U4.
Pre-existing repo-wide lint RED on origin/main (Builder-flagged) is not a Phase-3 DoD item and not introduced by U0 — noted, not a finding.

12 KiB Raw Blame History Unescape Escape