Files
cc-ci/machine-docs/REVIEW-prevb.md

17 KiB
Raw Blame History

REVIEW — phase prevb (Adversary verdicts)

Append-only. Gates this phase: M1 (implemented + green locally), M2 (proven in real CI + spot-check). SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-prevb-previous-dynamic-base.md.

Status

  • 2026-06-16T23:57Z — Adversary live for prevb. No Builder claim yet (no STATUS-prevb.md, no claim(). Cold-start recon done: baseline mechanism understood —
    • base resolution: run_recipe_ci.upgrade_basemeta.UPGRADE_BASE_VERSION or lifecycle.previous_version (vers[-2]); discourse pins 0.7.0+3.3.1.
    • overlay tests/discourse/compose.ccci.yml applied to ALL deploys via EXTRA_ENV.COMPOSE_FILE; fuses environmental (start_period 20m, order stop-first) + version-specific (bitnamilegacy image pin + sidekiq block) — the bug.
    • existing unit tests to watch for weakening: tests/unit/test_upgrade_base.py, tests/unit/test_meta.py. Idle until a gate is CLAIMED.
  • 2026-06-17T00:12Z — Independently cold-verified the Builder's STATUS ground-truth facts via gitea API (NOT trusting STATUS): PR #4 head ae5a81802b4d1d6cd1b449ac46cfa16d80730aaa compose.ymlapp.image = discourse/discourse:3.5.3, no sidekiq service; .diff shows -bitnamilegacy/discourse:3.5.0+discourse/discourse:3.5.3 + full sidekiq: block removed. main → app+sidekiq = bitnamilegacy/discourse:3.5.0, sidekiq present, base f87c612d. Facts CONFIRMED. (Caution noted: gitea raw?ref=<shortsha> silently falls back to default branch — must use the FULL sha when cold-verifying head content.) Foundation for "discourse needs no previous/" holds.

Pre-review (M1 code, gate NOT yet CLAIMED — preliminary recon, not a verdict)

2026-06-17T00:30Z — studied the M1 feat commit bb2e3c6 (code/diff only, NOT JOURNAL). Design looks sound:

  • resolve_upgrade_base → BasePlan(kind, version, ref, reason): override → last-green (canonical.read_registry) → main-tip (recipe_branch_commit) → skip. .runs gates the upgrade tier. head_ref = recipe_head_commit.
  • previous/ surface (lifecycle): has_previous, previous_target_version (VERSION marker), previous_status (version-guarded apply/stale), provide/remove overlay, compose_file add/remove. Base-only; stripped before head redeploy (generic.perform_upgraderemove_previous_overlay + COMPOSE_FILE strip). Good teeth.
  • discourse migrated: compose.ccci.yml now ENVIRONMENTAL-ONLY (order: stop-first); bitnamilegacy pins + sidekiq + UPGRADE_BASE_VERSION removed. test_upgrade.py asserts running app image == official discourse/discourse:3.5.3 (not bitnamilegacy) + sidekiq gone; resolves as the upgrade-tier overlay (resolve_overlay_optest_{op}.py), run as its own pytest → rc!=0 fails the tier. Real teeth confirmed.
  • Unit tests run cold (nix pytest): 63 passed (test_upgrade_base + test_previous + test_meta). Matrix EXPANDED, not weakened (override-wins / last-green-primary / main-tip-fallback / head==main-tip skip / no-pred skip).

STILL REQUIRED for the formal M1 PASS (needs the Builder's e2e claim + my cold acceptance run): (a) discourse upgrade tier GREEN locally with proof the head ran real 3.5.3 (not bitnamilegacy) + no sidekiq; (b) BREAK-IT: a deliberately-broken head still fails the upgrade tier (base resolution didn't paper over it); (c) base falls back to main when last-green absent (unit-covered; e2e desirable); (d) previous/ ignored for the head (code-confirmed; e2e desirable).

Adversary findings (pre-review notes)

  • [F-prevb-A] (PRE-EXISTING, NOT a prevb regression; INFO) tests/unit/test_warm_reconcile.py:: test_traefik_spec_is_stateless_with_setup is RED on main — KeyError: 'health_domain'. Fails identically at the gtea-DONE commit 778720c (verified by checkout), and the prevb feat never touched warm_reconcile — the pxgate-M1 traefik-probe change (0e9fd38) refactored the spec without updating this test. Out of prevb scope, but it means the FULL tests/unit/ suite is NOT all-green (283 pass / 1 fail). Flagging so "unit green" claims are scoped honestly. Not an M1 blocker.
  • [F-prevb-B] (NIT) old test_expected_na_other_rung_does_not_suppress was dropped in the rewrite; the behavior (an EXPECTED_NA for a non-upgrade rung must not suppress the base) is preserved via .get("upgrade") but no longer has a dedicated test. Low risk; consider re-adding one line of coverage.

M1 cold acceptance — IN FLIGHT (2026-06-17T00:42Z)

Gate M1 CLAIMED @00:40Z (code commit e1b32ea; claim commit bb79e91 = machine-docs only). Cold-verifying from a FRESH clone on cc-ci (/root/cc-ci-adv-prevb @ bb79e91), not the Builder's tree. Done so far (cold):

  • prevb unit surface: 64 passed (test_upgrade_base+test_previous+test_meta) via nix pytest.
  • statics: compose.ccci.yml env-only (order: stop-first); discourse recipe_meta.py has NO UPGRADE_BASE_VERSION assignment.
  • prune_orphan_services reviewed: removes only services NOT in the head compose → cannot mask the prevb bug (if overlay leaked sidekiq into the head compose it'd be in defined → not pruned → test RED). Teeth preserved.
  • e2e launched (RECIPE=discourse SRC=recipe-maintainers/discourse REF=ae5a8180… PR=4 STAGES=install,upgrade), run manual-1344943. Early log CONFIRMS upgrade base: kind=ref ref=f87c612d71b4 (target-branch (main) tip) → base = main-tip chaos deploy (matches claim). Base deploy (main-tip, has the known sidekiq depends_on bug) in progress; observed a non-fatal lint rung: fail R011 on the base — watching whether it blocks.
  • CONCURRENCY observed: a Builder keycloak spot-check (PR#3) runs simultaneously in /root/prevb-deploy. My discourse run's janitor saw the keycloak lock and LEFT IT (live concurrent run, leaving it) — per-run ABRA_DIR isolation holding. Watching for memory-pressure false-failures on the shared 7GB node. UPDATE 2026-06-17T01:00Z (post-reboot, cold re-check of completed run):
  • e2e manual-1344943 COMPLETED GREEN (read full log /root/cc-ci-adv-prevb-e2e.log): upgrade base: kind=ref ref=f87c612d71b4 (target-branch (main) tip); upgrade→PR-head head_ref=ae5a8180; generic test_upgrade_reconverges PASSED; discourse test_head_runs_official_image_not_bitnamilegacy PASSED + test_sidekiq_service_dropped_by_head PASSED; RUN SUMMARY deploy-count=1 (expect 1), install:pass upgrade:pass, level=2/5. Matches STATUS EXPECTED exactly.
  • TEARDOWN clean: docker stack ls shows NO discourse stack; no discourse secrets/volumes. (warm-keycloak stack present = Builder's concurrent spot-check, not mine.)
  • BREAK-IT: my first probe (manual-1357729, broken-head ref 94ebaaa = head image discourse/discourse:99.99.99-adversary-broken) was SIGTERM-killed mid-base-deploy by MY reboot — INCOMPLETE. RE-LAUNCHED as manual-1360025 (same broken head, base resolving to main-tip f87c612d as expected). In flight. STILL TO CONFIRM: break-it manual-1360025 → upgrade tier RED (broken head not papered over).

Verdicts

M1: PASS @2026-06-17T01:03Z (code commit e1b32ea / claim bb79e91)

Cold-verified from a fresh clone on cc-ci (/root/cc-ci-adv-prevb), independent of the Builder's tree. Every M1 DoD item (plan §4) re-executed and confirmed:

  1. Dynamic base resolution (last-green → main-tip → skip). e2e manual-1344943 log: upgrade base: kind=ref ref=f87c612d71b4 (target-branch (main) tip) — correctly falls back to main-tip (discourse has NO last-green warm canonical and its only published tag is 0.7.0, behind main). Unit matrix re-run cold (nix pytest, 64 passed): override-wins / last-green-primary / main-tip-fallback / head==main-tip skip / no-predecessor skip. Matrix EXPANDED vs old upgrade_base, not weakened.
  2. previous/ surface (discovery + base-only application + version-guard/stale-flag): unit-covered (test_previous), code-confirmed base-only (stripped before head redeploy via perform_upgraderemove_previous_overlay + COMPOSE_FILE strip). discourse ships NO previous/ (base deploys clean) — matches plan §3 thesis.
  3. Environmental vs version-specific separated. tests/discourse/compose.ccci.yml is env-only (app.deploy.update_config.order: stop-first); bitnamilegacy image pins + sidekiq block removed; UPGRADE_BASE_VERSION removed from recipe_meta.py (grep: none). Verified statically in cold clone.
  4. discourse migrated — confirmed via #3 + e2e behaviour.
  5. discourse upgrade tier GREEN locally w/ proof head ran the REAL official image. e2e manual-1344943: generic test_upgrade_reconverges PASSED; discourse test_head_runs_official_image_not_bitnamilegacy PASSED + test_sidekiq_service_dropped_by_head PASSED; RUN SUMMARY deploy-count=1 (expect 1), install:pass, upgrade:pass, level=2/5. upgrade→PR-head head_ref=ae5a8180 version=0.8.1+3.5.0→1.0.0+3.5.3.
  6. TEETH — deliberately-broken head still goes RED (base resolution did NOT paper it over). Break-it probe manual-1360025: broken-head commit 94ebaaa sets head app.image = discourse/discourse:99.99.99-adversary-broken. Base resolved to main-tip f87c612d (same as GREEN run), install:pass, then the HEAD redeploy failed: prepull: docker pull discourse/discourse:99.99.99-adversary-broken failed — manifest unknownupgrade:fail (level 1/5). Proves the head's real (broken) image is what gets deployed; base/prune/previous machinery cannot mask a broken head.
  7. Clean teardown after BOTH the GREEN run and the broken/failed run: docker stack ls / secret ls / volume ls show NO discourse stack, secrets, or volumes. (warm-keycloak stack present = Builder's concurrent spot-check, not discourse.)
  8. No test weakened. F-prevb-B addressed — test_expected_na_other_rung_does_not_suppress_upgrade re-added (commit e1b32ea), present in cold clone. Net coverage up (+ resolver matrix + previous/ layering).

SCOPE CAVEAT (not an M1 blocker): the FULL tests/unit/ suite has 1 PRE-EXISTING unrelated red — test_warm_reconcile.py::test_traefik_spec_is_stateless_with_setup (KeyError 'health_domain'), failing identically at gtea-DONE 778720c, untouched by prevb (see [F-prevb-A]). prevb's own surface is all-green.

(JOURNAL not consulted before this verdict, per anti-anchoring. M1 stands on the plan, the code/diff, the STATUS verification info, and my own cold re-runs.)

M2 cold acceptance — IN FLIGHT (2026-06-17T01:45Z)

Gate M2 CLAIMED @01:40Z (HEAD 71399f6). Cold-verifying independently (gitea API + host artifacts + own re-run). CONFIRMED so far:

  • discourse PR#4 !testme GREEN in REAL CI — verified via gitea API (NOT trusting STATUS): !testme comment @01:27:09Z → bridge reply @01:27:25Z 🌻 cc-ci — discourse @ ae5a8180 ✅ **passed** → Drone 717. (Teeth of the signal: an EARLIER !testme @22:34 → run 700 → ❌ failure — !testme genuinely CAN go RED; 717's pass is meaningful, not a rubber-stamp. 700 failed pre-mint_admin-fix.)
  • Drone 717 junit cold-read: all 10 suites errors=0 failures=0 (install/upgrade ×2/backup ×2/restore ×2/custom create_topic+health_check+site_basic). results.json: level=4, results{install,upgrade,backup, restore,custom}=all pass; clean_teardown=true; no_secret_leak=true; ref=ae5a8180 (real PR head).
  • Head genuinely ran official 3.5.3 — REAL TEETH: tests/discourse/test_upgrade.py asserts via lifecycle.deployed_identity (= docker service inspect <stack>_app …ContainerSpec.Image — the LIVE running swarm image, not a compose grep) that image startswith discourse/discourse:3.5.3 & no bitnamilegacy; + stack_service_names (= docker stack services) that sidekiq is gone. Both PASS in 717.
  • lint R011 is a level-cap RUNG, NOT a gate (verified in code): run_recipe_ci.py:770 passed = warm_ok and bool(results) and all(v!='fail' for v in results.values()) and not sso_unverified — covers only the 5 functional tiers, NOT lint. So R011 caps level at 4/5 but cannot turn !testme RED. (R011 = "all services have images" on the official-image head + "invalid reference format" warns — a RECIPE-head lint nit, not a prevb/cc-ci failure; candidate PR comment, not a blocker.)
  • Secret-leak (independent scan of the PUBLIC surface): dashboard index (lists 717), results.json (all 11 test message fields empty on PASS), summary.html, junit, lint.txt — NO secret/password/token values. no_secret_leak flag scans results.json vs /run/secrets/* (infra secrets). NOTE [F-prevb-C, INFO]: mint_admin prints the minted plaintext discourse ApiKey to stdout → it lands in the Drone RAW build log (access-controlled, 401 w/o token — NOT the public dashboard). Pre-existing behavior (prevb only made the path image-agnostic, b66abc4; the .key print predates prevb). Not a public-surface leak; low severity.
  • Spot-checks (cold-read Builder logs + dynamic-base confirmed): cryptpad#5 base=ref 36ee3451 (main tip; =PR#5's real base sha, gitea-confirmed), keycloak#3 base=ref 12ac6db8 (main tip via master fallback), hedgedoc#1 base=ref 09bf4d54 (main tip). All install:pass upgrade:pass deploy-count=1; cryptpad test_upgrade_preserves_data PASS, keycloak test_upgrade_preserves_realm PASS. No leftover stacks (only infra + pre-existing warm-keycloak orphan).
  • INDEPENDENT re-run in flight: re-executing cryptpad#5 (REF=9c18c176) from MY cold clone @71399f6 (normal fetch, not the Builder's tree) to confirm dynamic-base generality isn't tree/env-specific. STILL TO CONFIRM: my cryptpad re-run resolves base=main-tip 36ee3451, install+upgrade pass, clean teardown. → CONFIRMED @01:58Z: my cold-clone (@71399f6, normal fetch) cryptpad#5 re-run: upgrade base: kind=ref ref=36ee3451a354 (target-branch (main) tip); install:pass upgrade:pass deploy-count=1; tests/cryptpad/test_upgrade.py::test_upgrade_preserves_data PASSED; NO leftover cryptpad stack (clean teardown). Dynamic base generality is NOT tree/env-specific — reproduced from my own clone.

Verdicts (cont.)

M2: PASS @2026-06-17T01:58Z (code/claim commit 71399f6)

Cold-verified independently of the Builder's tree — gitea API for the real-CI verdict, host-shared Drone artifacts read cold, code-read for the gating logic, + my OWN spot-check re-run. Every M2 DoD item (plan §4):

  1. discourse PR#4 !testme GREEN in real CI — gitea API (not STATUS): !testme @01:27:09Z → bridge 🌻 cc-ci — discourse @ ae5a8180 ✅ passed @01:27:25Z → Drone 717. Meaningful (earlier !testme @22:34 → run 700 → ❌ failure pre-fix; !testme genuinely can go RED).
  2. Head genuinely ran official discourse/discourse:3.5.3 (migration exercised) — REAL TEETH. 717 junit upgrade__cc-ci__test_upgrade.xml: test_head_runs_official_image_not_bitnamilegacy + test_sidekiq_service_dropped_by_head both PASS, asserting against the LIVE swarm service (docker service inspect …ContainerSpec.Image / docker stack services) — not a compose grep. Image is official 3.5.3 (not bitnamilegacy), sidekiq gone → the official-image migration the PR claims was tested.
  3. All tiers GREEN. 717: 10 junit suites errors=0 failures=0; results{install,upgrade,backup,restore, custom}=pass; level 4/5. The only non-pass is the lint rung (R011) — code-verified NON-GATING (run_recipe_ci.py:770 passed covers only the 5 functional results, not lint) → caps level, can't turn the verdict RED. R011 ("all services have images" + "invalid reference format") is a RECIPE-head lint nit (candidate PR comment per guardrail), not a prevb/cc-ci defect.
  4. Spot-check ≥3 recipes green under dynamic base. cryptpad#5 (base=main-tip 36ee3451), keycloak#3 (base=main-tip 12ac6db8 via master fallback; prune-orphans safe-skip), hedgedoc#1 (base=main-tip 09bf4d54) — all install:pass upgrade:pass deploy-count=1, data-preservation tests pass, no leftover stacks. PLUS my OWN cold re-run of cryptpad#5 reproduced base=main-tip + green + clean teardown.
  5. Secrets — independent scan of the PUBLIC surface clean. dashboard index, results.json (all test message empty on PASS), summary.html, junit, lint.txt — no secret values; clean_teardown=true, no_secret_leak=true. [F-prevb-C, INFO/pre-existing]: mint_admin prints the minted plaintext discourse ApiKey → it reaches only the access-controlled Drone RAW log (401 w/o token), NOT the public dashboard; prevb only made the path image-agnostic (the print predates prevb). Low severity, not a blocker.
  6. Levels/records reconciled — results.json levels correctly derived (discourse 4/5 lint-capped, cryptpad 2/5 install+upgrade-only); PR runs don't promote last-green (correct — nothing merged).

Nothing merged on any mirror (verified: PRs #4/#5 still open). No test weakened. M1 already PASS @01:03Z. Both milestones now have fresh Adversary PASSes → no VETO; the Builder may write ## DONE. (JOURNAL not consulted before this verdict, per anti-anchoring.)

Open VETOes

(none)