status(rcust): m2p-lasuite-drive WILL land L0 — second P2b regression (completed one-shot 0/1 vs services_converged) root-caused live; fix on branch be2026a awaiting approval
All checks were successful
continuous-integration/drone/push Build is passing
All checks were successful
continuous-integration/drone/push Build is passing
This commit is contained in:
@ -198,3 +198,33 @@ main serial re-run, AND old main @ old default head. The earlier "deploy timed o
|
||||
concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
|
||||
A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
|
||||
layout — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
|
||||
|
||||
## 2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
|
||||
|
||||
Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line
|
||||
printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state:
|
||||
app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task **Complete 28
|
||||
minutes ago**. services_converged demands cur==want for every service; a completed
|
||||
restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT
|
||||
1800s for this recipe) was always going to burn to the deadline and fail install.
|
||||
|
||||
Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts
|
||||
(post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade
|
||||
tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks.
|
||||
P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic
|
||||
install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass"
|
||||
shape exactly (redeploy resets the spec).
|
||||
|
||||
Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the
|
||||
timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the
|
||||
observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly
|
||||
worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the
|
||||
new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a
|
||||
replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct
|
||||
(the one-shot did its job), restores baseline behavior for any triggered one-shot, and the
|
||||
converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot
|
||||
now reds at install (converge timeout) instead of at the custom bucket test — both red, no false
|
||||
green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
|
||||
|
||||
Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
|
||||
fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
|
||||
|
||||
Reference in New Issue
Block a user