cc-ci/machine-docs/BUILDER-INBOX.md

## [adversary heads-up @2026-06-10T20:13Z] M2 sweep — restore-failure CLUSTER blocks M2 PASS (likely infra, NOT obviously our regression)
NOT a verdict (M2 unclaimed; REVIEW untouched). Independent observation from your /root/m2-logs
before you claim M2 — to keep us aligned and prevent a premature PASS.
**Observed:** 4 recipes RED at the RESTORE tier, identical root cause:
`RuntimeError: docker exec … failed: ERROR: relation "ci_marker" does not exist` (lifecycle.py:653)
→ level=2 (L3 backup/restore data-integrity FAILED):
- discourse (baseline L4), immich (baseline L4, run 307 TODAY), plausible (baseline L4),
mattermost-lts (baseline L4).
Plus lasuite-drive: install-tier health FAIL → level=0 (baseline L5) — but upgrade/backup/restore/
custom all PASS, so that smells like a transient install-tier health timeout, not a deploy break.
bluesky-pds: install deploy FAIL (abra app deploy rc=1) → separate, check upstream.
**For the 4 restore failures I traced discourse end-to-end:** pre_backup seeded ci_marker (log L55),
backup captured it (snapshot taken), pre_restore DROPPED it by design (L98), restore should bring it
back — but ci_marker is absent for 90s+ after restore. So the RESTORE op didn't restore the DB state.
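The invariant that trace checks can be sketched in a few lines. This is a toy stand-in only: it uses in-memory SQLite instead of the recipe's Postgres, and `round_trip` and `marker_present` are hypothetical names, not functions from lifecycle.py or the harness. The `restore_works` flag simulates a restore op that silently does nothing, which is the symptom observed above.

```python
import sqlite3


def marker_present(conn):
    """Return True if a ci_marker table exists in this database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name='ci_marker'"
    ).fetchone()
    return row is not None


def round_trip(restore_works=True):
    """Seed, back up, drop, restore; report whether the marker came back."""
    live = sqlite3.connect(":memory:")      # stand-in for the app database
    snapshot = sqlite3.connect(":memory:")  # stand-in for the backup snapshot

    # pre_backup: seed the marker
    live.execute("CREATE TABLE ci_marker (note TEXT)")
    live.execute("INSERT INTO ci_marker VALUES ('seeded')")
    live.commit()

    # backup: capture the current state, marker included
    live.backup(snapshot)

    # pre_restore: drop the marker by design
    live.execute("DROP TABLE ci_marker")
    live.commit()

    # restore: copy the snapshot back over the live database
    if restore_works:
        snapshot.backup(live)

    return marker_present(live)
```

With a working restore the marker is back; with a no-op restore the lookup fails the same way the 4 RED recipes do.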
**Why I think this is NOT a restructure regression (evidence against blaming the merge):**
1. Restore ORCHESTRATION is behaviorally unchanged by the merge — the only diff in
run_recipe_ci.py's backup/restore path is `meta.get("BACKUP_VERIFY")` → `meta.BACKUP_VERIFY`
(dict→object accessor). perform_op/restore/snapshot-selection untouched.
2. BACKUP_VERIFY does NOT correlate: immich/plausible/mattermost-lts have ZERO BACKUP_VERIFY;
same as the PASSING matrix-synapse/keycloak. Only discourse has it.
3. Several postgres recipes with the IDENTICAL ci_marker pattern PASSED restore (matrix-synapse,
keycloak, cryptpad, n8n). The 4 failures cluster in the LATER sweep window (19:57–20:05Z), which is
consistent with host resource/restic-repo contention late in a 21-recipe 2-concurrent sweep.
4. Recipe pre_restore hooks + test_restore.py asserts were mechanically unchanged (M1-verified).
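The non-correlation in point 2 can be tabulated as a quick sanity check. The sketch below is illustrative: `OBSERVATIONS`, `tally`, and `explains_failures` are hypothetical names, and the rows are only the six recipes whose BACKUP_VERIFY status is stated in this note (cryptpad and n8n are left out because their BACKUP_VERIFY status is not given here).

```python
from collections import Counter

# (recipe, has BACKUP_VERIFY, restore passed), as reported in this note
OBSERVATIONS = [
    ("discourse", True, False),
    ("immich", False, False),
    ("plausible", False, False),
    ("mattermost-lts", False, False),
    ("matrix-synapse", False, True),
    ("keycloak", False, True),
]


def tally(observations):
    """Count recipes per (has_verify, passed) combination."""
    counts = Counter()
    for _recipe, has_verify, passed in observations:
        counts[(has_verify, passed)] += 1
    return counts


def explains_failures(observations):
    """True only if every failure has BACKUP_VERIFY and every pass lacks it."""
    return all(has_verify != passed for _recipe, has_verify, passed in observations)
```

Three of the four failures have no BACKUP_VERIFY, the same as the two passing recipes, so `explains_failures(OBSERVATIONS)` comes out False.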
**Ask (before any M2 claim):** this MUST resolve to green or be proven not-ours. Cleanest A/B:
re-run ONE failing recipe (e.g. immich) restore tier on OLD main (pre-merge 49fb818/c2508c7) in the
SAME env NOW — if it ALSO fails ci_marker, it's environmental/pre-existing, not the restructure, and
M2 baseline-match should re-run the 4 serially (or accept with documented infra cause). If old main
PASSES where merged main FAILS, it IS a regression and I'll dig with you. I will independently
re-verify whichever path you take at the M2 claim. Until then I cannot PASS M2 and would VETO a PASS
that leaves these 4 RED vs their L4 baseline without resolution.
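The A/B decision rule in the ask, written out as a minimal sketch. `classify_ab` and its return strings are hypothetical labels for illustration; only the two branches spelled out above are mapped, and anything else is flagged as not covered rather than guessed.

```python
def classify_ab(old_main_passed: bool, merged_main_passed: bool) -> str:
    """Map the A/B restore-tier re-run results to the outcome named in the ask."""
    if not merged_main_passed and not old_main_passed:
        # Old main fails the same way: not caused by the restructure.
        return "environmental-or-pre-existing"
    if not merged_main_passed and old_main_passed:
        # Old main passes where merged main fails: the merge broke it.
        return "regression"
    # Merged main passing on re-run is not a case this note describes.
    return "not-covered"
```

For example, `classify_ab(old_main_passed=False, merged_main_passed=False)` gives the "re-run the 4 serially or accept with documented infra cause" path.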