fix(dstamp): discourse upgrade stop-first overlay (stop 2x-memory start-first OOM→spurious swarm rollback) + harness assert_upgrade_converged (detect rollback/pause → honest upgrade failure, HC1 unweakened). Root cause: failure_action:rollback reverted chaos-version label, masked by start-first+wait_healthy
All checks were successful
continuous-integration/drone/push Build is passing
@ -28,10 +28,30 @@ version: "3.8"
# bad `discourse` key. Instead the 2.4GB `bitnamilegacy/discourse:3.3.1` image is kept warm in the node
# image cache, so the inline pull during deploy is a no-op and convergence isn't pull-bound. (swarm
# ignores depends_on, so the dangling ref has zero runtime effect — a recipe lint nit, not a defect.)
#
# 3. UPGRADE ROLLOUT (dstamp 2026-06-11, direct-evidence attribution in JOURNAL-dstamp): the
# published app service sets `deploy.update_config: { failure_action: rollback, order:
# start-first }`. On the upgrade chaos redeploy (base 0.7.0 → PR head), start-first runs the OLD
# and NEW precompile/Rails-heavy discourse tasks CO-RESIDENT (~2x memory); under host memory
# pressure the NEW task intermittently OOMs/fails swarm's update monitor → `failure_action:
# rollback` reverts the app service to its PREVIOUS spec, INCLUDING the
# `coop-cloud.<stack>.chaos-version` label (head → base). Because start-first keeps the OLD task
# serving, wait_healthy still passes, and HC1 then reads the reverted BASE commit (eb96de9+U) and
# misreports it as 'the re-checkout failed' — the dstamp drift, reproduced solo (runs
# dstamp-repro1/4) with `.Spec.chaos-version=7ae7b0f7+U` (head applied) flipping to
# `.PreviousSpec=eb96de94+U` after the rollback. FIX: `order: stop-first` so the NEW task boots
# with the full host memory (no 2x co-residency) and genuinely becomes healthy → no spurious
# rollback. This is a CI deploy-rollout tweak only: the upgrade still really deploys + asserts the
# PR-head code under test, and `failure_action: rollback` is LEFT intact, so a genuinely broken
# head still rolls back and is caught (lifecycle.assert_upgrade_converged) — NO test is weakened.
# Trade-off: brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT 3600).
services:
app:
image: bitnamilegacy/discourse:3.3.1
healthcheck:
start_period: 20m
deploy:
update_config:
order: stop-first
sidekiq:
image: bitnamilegacy/discourse:3.3.1
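Review bot: the `order: stop-first` override is added only under `app`, while `sidekiq` runs the same `bitnamilegacy/discourse:3.3.1` image and gets no `deploy.update_config` in the visible lines. The comment attributes the rollback to old and new tasks being co-resident under host memory pressure, so it is worth confirming whether the published recipe gives `sidekiq` a start-first update order as well and, if it does, overriding it here too. The overlay as shown also sets no memory reservation or limit, so whether the new task stays up still depends on what else is running on the host. As a style point, the run ids and commit hashes in block 3 would age better in JOURNAL-dstamp, with a short summary and a pointer left in the compose file.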