status(rcust): m2p-lasuite-drive WILL land L0 — second P2b regression (completed one-shot 0/1 vs services_converged) root-caused live; fix on branch be2026a awaiting approval
All checks were successful
continuous-integration/drone/push Build is passing
@@ -198,3 +198,33 @@ main serial re-run, AND old main @ old default head. The earlier "deploy timed o
 concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
 A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
 layout; no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
+
+## 2026-06-11 lasuite-drive root cause #2: completed one-shot poisons convergence (caught live)
+
+Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line
+printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state:
+app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task **Complete 28
+minutes ago**. services_converged demands cur==want for every service; a completed
+restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT
+1800s for this recipe) was always going to burn to the deadline and fail install.
+
+Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts
+(post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade
+tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks.
+P2b folded the trigger into ops.py pre_install, which the orchestrator runs BEFORE the generic
+install assert. This also explains m2rr's odd "install fail but upgrade/backup/restore/custom all
+pass" shape exactly (redeploy resets the spec).
+
+Fix options weighed: (a) hook scales the one-shot back to 0 after the poll: rejected, because on
+the timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so
+the observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly
+worse than baseline; (b) move the trigger to a post-assert hook point: rejected, because no such
+hook exists in the new convention and inventing one mid-M2 is scope creep; (c) teach
+services_converged that a replica deficit consisting entirely of Complete tasks IS converged:
+chosen, because it is semantically correct (the one-shot did its job), restores baseline behavior
+for any triggered one-shot, and the converge window doubles as the late-landing grace. Disclosed
+delta: a genuinely FAILING one-shot now goes red at install (converge timeout) instead of at the
+custom bucket test; both red, no false green. Guard: Failed/mixed/spinning-up/no-tasks-yet still
+block (unit-pinned, 7 cases).
+
+Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
+fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
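The convergence rule chosen in option (c) can be sketched roughly as follows. This is a minimal illustration, not the code on `fix/converged-oneshot`: the `ServiceView` type, the lower-cased state strings, and the assumption that each snapshot holds only the latest task per slot are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ServiceView:
    """Hypothetical snapshot of one service: replica counts plus task states."""
    name: str
    current: int                        # replicas running now ("cur")
    desired: int                        # replicas requested ("want")
    task_states: Tuple[str, ...] = ()   # assumed: latest task per slot, lower-cased


def service_converged(svc: ServiceView) -> bool:
    """cur == want, or a deficit explained entirely by Complete tasks."""
    if svc.current == svc.desired:
        return True                     # 0/0 and N/N behave as before
    if svc.current > svc.desired:
        return False                    # still scaling down
    deficit = svc.desired - svc.current
    not_running = [s for s in svc.task_states if s != "running"]
    # Every non-running task must be Complete, and there must be enough of
    # them to account for the whole deficit. Failed, mixed, spinning-up, and
    # no-tasks-yet therefore all still block.
    return len(not_running) >= deficit and all(s == "complete" for s in not_running)


def services_converged(services: Iterable[ServiceView]) -> bool:
    """The stack is converged only if every service is."""
    return all(service_converged(s) for s in services)


stack = [
    ServiceView("app", 1, 1, ("running",)),
    ServiceView("minio-createbuckets", 0, 1, ("complete",)),
]
print(services_converged(stack))  # True
```

Under these assumptions a finished one-shot no longer holds the poll open, while a one-shot that is still Preparing keeps the converge window running, which is the late-landing grace described above.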
@@ -127,9 +127,23 @@ sweep runs, not retroactively here.
 - M2.3 in-flight proof runs (serial queue /root/m2-proof.sh + /root/m2-proof2.sh, logs
 /root/m2-proof-logs/, driver /root/m2-proof-logs/driver.log):
 1. **lasuite-drive @baseline ref ffa7d585afa2 PR=1 on merged main @5c0676b** (post-fix-forward
-1357544) → run id m2p-lasuite-drive; EXPECTED L5 (the Adversary approval condition).
+1357544) → run id m2p-lasuite-drive: **WILL LAND L0: second P2b regression found via this
+run, root-caused LIVE.** The 1357544 best-effort path WORKED (`!!` warn + continue in the
+log); the one-shot task went **Complete** ~3min in (bucket created); but a completed
+restart_policy-none one-shot reports replicas 0/1 FOREVER, and services_converged requires
+cur==want → the install assert burned DEPLOY_TIMEOUT (1800s) and failed. The old world never saw
+this: setup_custom_tests.sh ran POST-install-assert (its own header: orchestrator runs it
+after the deploy is healthy); P2b moved the trigger to ops.py pre_install = PRE-assert.
+Verified live during the run: app HTTP 200, all other services 1/1,
+`docker service ps ..._minio-createbuckets` = Complete, pytest in converge loop 27+ min.
+**Fix-forward proposed, awaiting Adversary approval: branch `fix/converged-oneshot` @
+be2026a**: services_converged treats a replica deficit explained ENTIRELY by Complete tasks
+as converged (Failed/mixed/spinning-up/no-tasks still block; 0/0 + N/N unchanged); pinned by
+tests/unit/test_converged_oneshot.py (7 cases). Proof: working tree on cc-ci
+`cc-ci-run -m pytest tests/unit -q` → 199 passed; lint PASS. Post-approval: merge + re-run
+this proof (expect L5).
 2. **discourse @7ae7b0f PR=2 on merged main** (exact baseline-184 invocation) → m2p-discourse;
-discriminates PR=0-artifact/race vs deterministic-at-ref.
+discriminates PR=0-artifact/race vs deterministic-at-ref. Unaffected by the one-shot issue.
 3. **discourse @7ae7b0f PR=2 on OLD main** (/root/m2-oldmain) → ab-discourse-7ae7b0f-oldmain;
 completes the same-ref A/B the upgrade-HC1 mode is missing.
 - M2.4 spot-greps (customizations actually executed; log evidence in /root/m2-logs/):
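To illustrate why the install assert in item 1 burned the full DEPLOY_TIMEOUT: a bounded poll whose predicate can never become true always runs to its deadline. The sketch below assumes a generic `wait_until` helper and a `FakeClock` for instant simulation; both are hypothetical and are not the harness's actual converge loop. Only the 1800s figure comes from the text above.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it holds or timeout_s elapses; True on success."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the example finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
cur, want = 0, 1   # a completed one-shot stays at 0/1
ok = wait_until(lambda: cur == want, timeout_s=1800, clock=fake.time, sleep=fake.sleep)
print(ok, fake.now)  # False 1800.0
```

With a strict cur == want predicate the loop cannot exit early for a finished one-shot, so the only possible outcome is a timeout at the deadline.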