status(rcust): m2p-lasuite-drive WILL land L0 — second P2b regression (completed one-shot 0/1 vs services_converged) root-caused live; fix on branch be2026a awaiting approval
All checks were successful
continuous-integration/drone/push Build is passing
All checks were successful
continuous-integration/drone/push Build is passing
This commit is contained in:
@ -198,3 +198,33 @@ main serial re-run, AND old main @ old default head. The earlier "deploy timed o
|
||||
concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
|
||||
A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
|
||||
layout — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
|
||||
|
||||
## 2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
|
||||
|
||||
Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line
|
||||
printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state:
|
||||
app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task **Complete 28
|
||||
minutes ago**. services_converged demands cur==want for every service; a completed
|
||||
restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT
|
||||
1800s for this recipe) was always going to burn to the deadline and fail install.
|
||||
|
||||
Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts
|
||||
(post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade
|
||||
tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks.
|
||||
P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic
|
||||
install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass"
|
||||
shape exactly (redeploy resets the spec).
|
||||
|
||||
Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the
|
||||
timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the
|
||||
observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly
|
||||
worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the
|
||||
new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a
|
||||
replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct
|
||||
(the one-shot did its job), restores baseline behavior for any triggered one-shot, and the
|
||||
converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot
|
||||
now reds at install (converge timeout) instead of at the custom bucket test — both red, no false
|
||||
green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
|
||||
|
||||
Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
|
||||
fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
|
||||
|
||||
@ -127,9 +127,23 @@ sweep runs, not retroactively here.
|
||||
- M2.3 in-flight proof runs (serial queue /root/m2-proof.sh + /root/m2-proof2.sh, logs
|
||||
/root/m2-proof-logs/, driver /root/m2-proof-logs/driver.log):
|
||||
1. **lasuite-drive @baseline ref ffa7d585afa2 PR=1 on merged main @5c0676b** (post-fix-forward
|
||||
1357544) → run id m2p-lasuite-drive; EXPECTED L5 (the Adversary approval condition).
|
||||
1357544) → run id m2p-lasuite-drive: **WILL LAND L0 — second P2b regression found via this
|
||||
run, root-caused LIVE.** The 1357544 best-effort path WORKED (`!!` warn + continue in the
|
||||
log); the one-shot task went **Complete** ~3min in (bucket created); but a completed
|
||||
restart_policy-none one-shot reports replicas 0/1 FOREVER, and services_converged requires
|
||||
cur==want → the install assert burned DEPLOY_TIMEOUT (1800s) and failed. Old world never saw
|
||||
this: setup_custom_tests.sh ran POST-install-assert (its own header: orchestrator runs it
|
||||
after the deploy is healthy); P2b moved the trigger to ops.py pre_install = PRE-assert.
|
||||
Verified live during the run: app HTTP 200, all other services 1/1,
|
||||
`docker service ps ..._minio-createbuckets` = Complete, pytest in converge loop 27+ min.
|
||||
**Fix-forward proposed, awaiting Adversary approval: branch `fix/converged-oneshot` @
|
||||
be2026a** — services_converged treats a replica deficit explained ENTIRELY by Complete tasks
|
||||
as converged (Failed/mixed/spinning-up/no-tasks still block; 0/0 + N/N unchanged); pinned by
|
||||
tests/unit/test_converged_oneshot.py (7 cases). Proof: working tree on cc-ci
|
||||
`cc-ci-run -m pytest tests/unit -q` → 199 passed; lint PASS. Post-approval: merge + re-run
|
||||
this proof (expect L5).
|
||||
2. **discourse @7ae7b0f PR=2 on merged main** (exact baseline-184 invocation) → m2p-discourse;
|
||||
discriminates PR=0-artifact/race vs deterministic-at-ref.
|
||||
discriminates PR=0-artifact/race vs deterministic-at-ref. Unaffected by the one-shot issue.
|
||||
3. **discourse @7ae7b0f PR=2 on OLD main** (/root/m2-oldmain) → ab-discourse-7ae7b0f-oldmain;
|
||||
completes the same-ref A/B the upgrade-HC1 mode is missing.
|
||||
- M2.4 spot-greps (customizations actually executed — log evidence in /root/m2-logs/):
|
||||
|
||||
Reference in New Issue
Block a user