Files
cc-ci/machine-docs/REVIEW-prevb.md

198 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — phase `prevb` (Adversary verdicts)
Append-only. Gates this phase: **M1** (implemented + green locally), **M2** (proven in real CI + spot-check).
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-prevb-previous-dynamic-base.md`.
## Status
- 2026-06-16T23:57Z — Adversary live for `prevb`. No Builder claim yet (no STATUS-prevb.md, no `claim(`).
Cold-start recon done: baseline mechanism understood —
- base resolution: `run_recipe_ci.upgrade_base``meta.UPGRADE_BASE_VERSION or lifecycle.previous_version` (`vers[-2]`); discourse pins `0.7.0+3.3.1`.
- overlay `tests/discourse/compose.ccci.yml` applied to ALL deploys via `EXTRA_ENV.COMPOSE_FILE`; fuses environmental (start_period 20m, order stop-first) + version-specific (bitnamilegacy image pin + sidekiq block) — the bug.
- existing unit tests to watch for weakening: `tests/unit/test_upgrade_base.py`, `tests/unit/test_meta.py`.
Idle until a gate is CLAIMED.
- 2026-06-17T00:12Z — Independently cold-verified the Builder's STATUS ground-truth facts via gitea API
(NOT trusting STATUS): PR #4 head `ae5a81802b4d1d6cd1b449ac46cfa16d80730aaa` `compose.yml`
`app.image = discourse/discourse:3.5.3`, **no `sidekiq` service**; `.diff` shows
`-bitnamilegacy/discourse:3.5.0``+discourse/discourse:3.5.3` + full `sidekiq:` block removed.
main → `app`+`sidekiq` = `bitnamilegacy/discourse:3.5.0`, sidekiq present, base `f87c612d`.
Facts CONFIRMED. (Caution noted: gitea `raw?ref=<shortsha>` silently falls back to default branch —
must use the FULL sha when cold-verifying head content.) Foundation for "discourse needs no previous/" holds.
## Pre-review (M1 code, gate NOT yet CLAIMED — preliminary recon, not a verdict)
2026-06-17T00:30Z — studied the M1 `feat` commit bb2e3c6 (code/diff only, NOT JOURNAL). Design looks sound:
- `resolve_upgrade_base` → BasePlan(kind, version, ref, reason): override → last-green (`canonical.read_registry`)
→ main-tip (`recipe_branch_commit`) → skip. `.runs` gates the upgrade tier. head_ref = `recipe_head_commit`.
- `previous/` surface (lifecycle): `has_previous`, `previous_target_version` (VERSION marker), `previous_status`
(version-guarded apply/stale), provide/remove overlay, compose_file add/remove. Base-only; **stripped before
head redeploy** (`generic.perform_upgrade``remove_previous_overlay` + COMPOSE_FILE strip). Good teeth.
- discourse migrated: `compose.ccci.yml` now ENVIRONMENTAL-ONLY (`order: stop-first`); bitnamilegacy pins +
sidekiq + UPGRADE_BASE_VERSION **removed**. `test_upgrade.py` asserts running `app` image == official
`discourse/discourse:3.5.3` (not bitnamilegacy) + sidekiq gone; resolves as the upgrade-tier overlay
(`resolve_overlay_op``test_{op}.py`), run as its own pytest → rc!=0 fails the tier. Real teeth confirmed.
- **Unit tests run cold (nix pytest): 63 passed** (test_upgrade_base + test_previous + test_meta). Matrix
EXPANDED, not weakened (override-wins / last-green-primary / main-tip-fallback / head==main-tip skip / no-pred skip).
STILL REQUIRED for the formal M1 PASS (needs the Builder's e2e claim + my cold acceptance run):
(a) discourse upgrade tier GREEN locally with proof the head ran real 3.5.3 (not bitnamilegacy) + no sidekiq;
(b) BREAK-IT: a deliberately-broken head still fails the upgrade tier (base resolution didn't paper over it);
(c) base falls back to main when last-green absent (unit-covered; e2e desirable);
(d) `previous/` ignored for the head (code-confirmed; e2e desirable).
## Adversary findings (pre-review notes)
- [F-prevb-A] (PRE-EXISTING, NOT a prevb regression; INFO) `tests/unit/test_warm_reconcile.py::
test_traefik_spec_is_stateless_with_setup` is RED on main — `KeyError: 'health_domain'`. Fails identically at
the gtea-DONE commit 778720c (verified by checkout), and the prevb feat never touched warm_reconcile — the
`pxgate-M1` traefik-probe change (0e9fd38) refactored the spec without updating this test. Out of prevb scope,
but it means the FULL `tests/unit/` suite is NOT all-green (283 pass / 1 fail). Flagging so "unit green" claims
are scoped honestly. Not an M1 blocker.
- [F-prevb-B] (NIT) old `test_expected_na_other_rung_does_not_suppress` was dropped in the rewrite; the behavior
(an EXPECTED_NA for a non-upgrade rung must not suppress the base) is preserved via `.get("upgrade")` but no
longer has a dedicated test. Low risk; consider re-adding one line of coverage.
## M1 cold acceptance — IN FLIGHT (2026-06-17T00:42Z)
Gate M1 CLAIMED @00:40Z (code commit e1b32ea; claim commit bb79e91 = machine-docs only). Cold-verifying from a
FRESH clone on cc-ci (`/root/cc-ci-adv-prevb` @ bb79e91), not the Builder's tree.
Done so far (cold):
- prevb unit surface: **64 passed** (`test_upgrade_base`+`test_previous`+`test_meta`) via nix pytest.
- statics: `compose.ccci.yml` env-only (`order: stop-first`); discourse `recipe_meta.py` has NO `UPGRADE_BASE_VERSION` assignment.
- `prune_orphan_services` reviewed: removes only services NOT in the head compose → cannot mask the prevb bug
(if overlay leaked sidekiq into the head compose it'd be in `defined` → not pruned → test RED). Teeth preserved.
- e2e launched (`RECIPE=discourse SRC=recipe-maintainers/discourse REF=ae5a8180… PR=4 STAGES=install,upgrade`),
run `manual-1344943`. Early log CONFIRMS `upgrade base: kind=ref ref=f87c612d71b4 (target-branch (main) tip)`
→ base = main-tip chaos deploy (matches claim). Base deploy (main-tip, has the known sidekiq depends_on bug)
in progress; observed a non-fatal `lint rung: fail R011` on the base — watching whether it blocks.
- CONCURRENCY observed: a Builder keycloak spot-check (PR#3) runs simultaneously in `/root/prevb-deploy`. My
discourse run's janitor saw the keycloak lock and LEFT IT (`live concurrent run, leaving it`) — per-run
ABRA_DIR isolation holding. Watching for memory-pressure false-failures on the shared 7GB node.
UPDATE 2026-06-17T01:00Z (post-reboot, cold re-check of completed run):
- e2e `manual-1344943` COMPLETED **GREEN** (read full log /root/cc-ci-adv-prevb-e2e.log): `upgrade base:
kind=ref ref=f87c612d71b4 (target-branch (main) tip)`; `upgrade→PR-head head_ref=ae5a8180`;
generic `test_upgrade_reconverges` PASSED; discourse `test_head_runs_official_image_not_bitnamilegacy`
PASSED + `test_sidekiq_service_dropped_by_head` PASSED; RUN SUMMARY deploy-count=1 (expect 1),
install:pass upgrade:pass, level=2/5. Matches STATUS EXPECTED exactly.
- TEARDOWN clean: `docker stack ls` shows NO discourse stack; no discourse secrets/volumes. (warm-keycloak
stack present = Builder's concurrent spot-check, not mine.)
- BREAK-IT: my first probe (`manual-1357729`, broken-head ref 94ebaaa = head image
`discourse/discourse:99.99.99-adversary-broken`) was SIGTERM-killed mid-base-deploy by MY reboot — INCOMPLETE.
RE-LAUNCHED as `manual-1360025` (same broken head, base resolving to main-tip f87c612d as expected). In flight.
STILL TO CONFIRM: break-it `manual-1360025` → upgrade tier RED (broken head not papered over).
## Verdicts
### M1: PASS @2026-06-17T01:03Z (code commit e1b32ea / claim bb79e91)
Cold-verified from a fresh clone on cc-ci (`/root/cc-ci-adv-prevb`), independent of the Builder's tree.
Every M1 DoD item (plan §4) re-executed and confirmed:
1. **Dynamic base resolution (last-green → main-tip → skip).** e2e `manual-1344943` log: `upgrade base:
kind=ref ref=f87c612d71b4 (target-branch (main) tip)` — correctly falls back to main-tip (discourse has
NO last-green warm canonical and its only published tag is 0.7.0, behind main). Unit matrix re-run cold
(nix pytest, **64 passed**): override-wins / last-green-primary / main-tip-fallback / head==main-tip skip /
no-predecessor skip. Matrix EXPANDED vs old `upgrade_base`, not weakened.
2. **`previous/` surface** (discovery + base-only application + version-guard/stale-flag): unit-covered
(`test_previous`), code-confirmed base-only (stripped before head redeploy via `perform_upgrade` →
`remove_previous_overlay` + COMPOSE_FILE strip). discourse ships NO `previous/` (base deploys clean) —
matches plan §3 thesis.
3. **Environmental vs version-specific separated.** `tests/discourse/compose.ccci.yml` is env-only
(`app.deploy.update_config.order: stop-first`); bitnamilegacy image pins + `sidekiq` block removed;
`UPGRADE_BASE_VERSION` removed from `recipe_meta.py` (grep: none). Verified statically in cold clone.
4. **discourse migrated** — confirmed via #3 + e2e behaviour.
5. **discourse upgrade tier GREEN locally w/ proof head ran the REAL official image.** e2e `manual-1344943`:
generic `test_upgrade_reconverges` PASSED; discourse `test_head_runs_official_image_not_bitnamilegacy`
PASSED + `test_sidekiq_service_dropped_by_head` PASSED; RUN SUMMARY deploy-count=1 (expect 1),
install:pass, upgrade:pass, level=2/5. `upgrade→PR-head head_ref=ae5a8180 version=0.8.1+3.5.0→1.0.0+3.5.3`.
6. **TEETH — deliberately-broken head still goes RED (base resolution did NOT paper it over).** Break-it
probe `manual-1360025`: broken-head commit `94ebaaa` sets head `app.image =
discourse/discourse:99.99.99-adversary-broken`. Base resolved to main-tip f87c612d (same as GREEN run),
**install:pass**, then the HEAD redeploy failed: `prepull: docker pull
discourse/discourse:99.99.99-adversary-broken failed — manifest unknown` → **upgrade:fail (level 1/5)**.
Proves the head's real (broken) image is what gets deployed; base/prune/previous machinery cannot mask a
broken head.
7. **Clean teardown** after BOTH the GREEN run and the broken/failed run: `docker stack ls` / `secret ls` /
`volume ls` show NO discourse stack, secrets, or volumes. (warm-keycloak stack present = Builder's
concurrent spot-check, not discourse.)
8. **No test weakened.** F-prevb-B addressed — `test_expected_na_other_rung_does_not_suppress_upgrade`
re-added (commit e1b32ea), present in cold clone. Net coverage up (+ resolver matrix + previous/ layering).
SCOPE CAVEAT (not an M1 blocker): the FULL `tests/unit/` suite has 1 PRE-EXISTING unrelated red —
`test_warm_reconcile.py::test_traefik_spec_is_stateless_with_setup` (KeyError 'health_domain'), failing
identically at gtea-DONE 778720c, untouched by prevb (see [F-prevb-A]). prevb's own surface is all-green.
(JOURNAL not consulted before this verdict, per anti-anchoring. M1 stands on the plan, the code/diff, the
STATUS verification info, and my own cold re-runs.)
## M2 cold acceptance — IN FLIGHT (2026-06-17T01:45Z)
Gate M2 CLAIMED @01:40Z (HEAD 71399f6). Cold-verifying independently (gitea API + host artifacts + own re-run).
CONFIRMED so far:
- **discourse PR#4 !testme GREEN in REAL CI** — verified via gitea API (NOT trusting STATUS): `!testme`
comment @01:27:09Z → bridge reply @01:27:25Z `🌻 cc-ci — discourse @ ae5a8180 ✅ **passed**` → Drone 717.
(Teeth of the signal: an EARLIER !testme @22:34 → run 700 → `❌ failure` — !testme genuinely CAN go RED;
717's pass is meaningful, not a rubber-stamp. 700 failed pre-mint_admin-fix.)
- **Drone 717 junit cold-read**: all 10 suites errors=0 failures=0 (install/upgrade ×2/backup ×2/restore
×2/custom create_topic+health_check+site_basic). results.json: level=4, results{install,upgrade,backup,
restore,custom}=all pass; clean_teardown=true; no_secret_leak=true; ref=ae5a8180 (real PR head).
- **Head genuinely ran official 3.5.3 — REAL TEETH**: `tests/discourse/test_upgrade.py` asserts via
`lifecycle.deployed_identity` (= `docker service inspect <stack>_app …ContainerSpec.Image` — the LIVE
running swarm image, not a compose grep) that image startswith `discourse/discourse:3.5.3` & no
bitnamilegacy; + `stack_service_names` (= `docker stack services`) that sidekiq is gone. Both PASS in 717.
- **lint R011 is a level-cap RUNG, NOT a gate** (verified in code): `run_recipe_ci.py:770` `passed =
warm_ok and bool(results) and all(v!='fail' for v in results.values()) and not sso_unverified` — covers
only the 5 functional tiers, NOT lint. So R011 caps level at 4/5 but cannot turn !testme RED. (R011 =
"all services have images" on the official-image head + "invalid reference format" warns — a RECIPE-head
lint nit, not a prevb/cc-ci failure; candidate PR comment, not a blocker.)
- **Secret-leak (independent scan of the PUBLIC surface)**: dashboard index (lists 717), results.json (all
11 test `message` fields empty on PASS), summary.html, junit, lint.txt — NO secret/password/token values.
`no_secret_leak` flag scans results.json vs `/run/secrets/*` (infra secrets). NOTE [F-prevb-C, INFO]:
`mint_admin` prints the minted plaintext discourse ApiKey to stdout → it lands in the Drone RAW build log
(access-controlled, 401 w/o token — NOT the public dashboard). Pre-existing behavior (prevb only made the
path image-agnostic, b66abc4; the `.key` print predates prevb). Not a public-surface leak; low severity.
- **Spot-checks (cold-read Builder logs + dynamic-base confirmed)**: cryptpad#5 base=ref 36ee3451 (main tip;
=PR#5's real base sha, gitea-confirmed), keycloak#3 base=ref 12ac6db8 (main tip via master fallback),
hedgedoc#1 base=ref 09bf4d54 (main tip). All install:pass upgrade:pass deploy-count=1; cryptpad
`test_upgrade_preserves_data` PASS, keycloak `test_upgrade_preserves_realm` PASS. No leftover stacks
(only infra + pre-existing warm-keycloak orphan).
- **INDEPENDENT re-run in flight**: re-executing cryptpad#5 (REF=9c18c176) from MY cold clone @71399f6
(normal fetch, not the Builder's tree) to confirm dynamic-base generality isn't tree/env-specific.
STILL TO CONFIRM: my cryptpad re-run resolves base=main-tip 36ee3451, install+upgrade pass, clean teardown.
→ CONFIRMED @01:58Z: my cold-clone (@71399f6, normal fetch) cryptpad#5 re-run: `upgrade base: kind=ref
ref=36ee3451a354 (target-branch (main) tip)`; install:pass upgrade:pass deploy-count=1;
`tests/cryptpad/test_upgrade.py::test_upgrade_preserves_data` PASSED; NO leftover cryptpad stack
(clean teardown). Dynamic base generality is NOT tree/env-specific — reproduced from my own clone.
## Verdicts (cont.)
### M2: PASS @2026-06-17T01:58Z (code/claim commit 71399f6)
Cold-verified independently of the Builder's tree — gitea API for the real-CI verdict, host-shared Drone
artifacts read cold, code-read for the gating logic, + my OWN spot-check re-run. Every M2 DoD item (plan §4):
1. **discourse PR#4 `!testme` GREEN in real CI** — gitea API (not STATUS): `!testme` @01:27:09Z → bridge
`🌻 cc-ci — discourse @ ae5a8180 ✅ passed` @01:27:25Z → Drone 717. Meaningful (earlier !testme @22:34
→ run 700 → `❌ failure` pre-fix; !testme genuinely can go RED).
2. **Head genuinely ran official `discourse/discourse:3.5.3` (migration exercised) — REAL TEETH.** 717 junit
`upgrade__cc-ci__test_upgrade.xml`: `test_head_runs_official_image_not_bitnamilegacy` +
`test_sidekiq_service_dropped_by_head` both PASS, asserting against the LIVE swarm service
(`docker service inspect …ContainerSpec.Image` / `docker stack services`) — not a compose grep. Image is
official 3.5.3 (not bitnamilegacy), sidekiq gone → the official-image migration the PR claims was tested.
3. **All tiers GREEN.** 717: 10 junit suites errors=0 failures=0; results{install,upgrade,backup,restore,
custom}=pass; level 4/5. The only non-pass is the `lint` rung (R011) — code-verified NON-GATING
(`run_recipe_ci.py:770` `passed` covers only the 5 functional results, not lint) → caps level, can't turn
the verdict RED. R011 ("all services have images" + "invalid reference format") is a RECIPE-head lint nit
(candidate PR comment per guardrail), not a prevb/cc-ci defect.
4. **Spot-check ≥3 recipes green under dynamic base.** cryptpad#5 (base=main-tip 36ee3451), keycloak#3
(base=main-tip 12ac6db8 via master fallback; prune-orphans safe-skip), hedgedoc#1 (base=main-tip
09bf4d54) — all install:pass upgrade:pass deploy-count=1, data-preservation tests pass, no leftover
stacks. PLUS my OWN cold re-run of cryptpad#5 reproduced base=main-tip + green + clean teardown.
5. **Secrets — independent scan of the PUBLIC surface clean.** dashboard index, results.json (all test
`message` empty on PASS), summary.html, junit, lint.txt — no secret values; `clean_teardown=true`,
`no_secret_leak=true`. [F-prevb-C, INFO/pre-existing]: `mint_admin` prints the minted plaintext discourse
ApiKey → it reaches only the access-controlled Drone RAW log (401 w/o token), NOT the public dashboard;
prevb only made the path image-agnostic (the print predates prevb). Low severity, not a blocker.
6. **Levels/records reconciled** — results.json levels correctly derived (discourse 4/5 lint-capped,
cryptpad 2/5 install+upgrade-only); PR runs don't promote last-green (correct — nothing merged).
Nothing merged on any mirror (verified: PRs #4/#5 still open). No test weakened. M1 already PASS @01:03Z.
**Both milestones now have fresh Adversary PASSes → no VETO; the Builder may write `## DONE`.**
(JOURNAL not consulted before this verdict, per anti-anchoring.)
## Open VETOes
(none)