journal(2): discourse full4 timeout root-cause + full5 fixes (warm image cache + 3600s)

This commit is contained in:
autonomic-bot
2026-05-31 01:26:41 +00:00
parent 8dfd8ed3b3
commit cc952903df

View File

@ -1463,3 +1463,35 @@ omitted → data loss), full8 won it. Merged db healthcheck confirmed retries=10
failure, and an intermittently-broken P4 data-integrity test is a real defect (P7). NOT claiming ghost
on luck. Decision stands: implement the harness backup-integrity verify+re-invoke fix (next), then a
ghost run must pass restore RELIABLY (ideally confirm with 2 consecutive green incl upgrade) before claim.
---
## 2026-05-31T01:2x — discourse full4 timeout root-cause + full5 fixes (Builder)
Woke into the loop with discourse full4 in flight (PR head 3758522, STAGES=install,upgrade,backup,
restore,custom — the VETO-clearing run incl upgrade-to-latest). full4 FAILED at the BASE deploy:
`install: fail`, rest skipped; `abra app deploy disc-ce6450 ... timed out after 2400 seconds`.
Investigation:
- full2 (same REF, same overlay) base deploy SUCCEEDED (install+upgrade tiers passed) → the overlay
approach works; full4's timeout is flakiness at the convergence edge, not a config break.
- The recurring log line `service "sidekiq" depends on undefined service "discourse": invalid compose
project` comes from `abra app config --images` (the prepull step): the published recipe (base 0.7.0
AND PR head) has `sidekiq.depends_on: [discourse]`, but the main service is `app` — `discourse` is
undefined → config rc=15 → prepull SKIPPED → the 2.4GB image is pulled INLINE during deploy.
- On cc-ci the image was cached as `bitnamilegacy/discourse:<none>` (tag dangling) → the deploy
re-pulled 2.4GB, eating the convergence budget. Combined with the node being only **7 GiB RAM**
(not the 28 GiB the plan assumed) + load 6-7 on 4 vCPU during Rails asset-precompile, 40min was too
tight. (swarm IGNORES depends_on, so the dangling ref has zero runtime effect — full2 proves deploy
works despite it; it only breaks the prepull lint.)
Tried to fix prepull by overriding `sidekiq.depends_on:[app]` in the overlay (04cc44c). It does NOT
work: docker normalizes short-form depends_on to a map and map-merge is ADDITIVE → {discourse}+{app}
={discourse,app}, the bad key survives, config --images still rc=15. (My initial "rc=0" test was
bogus — `$?` after `| head` is head's exit code.) Reverted (8dfd8ed); overlay stays minimal.
full5 fixes (the ones that actually address the timeout):
1. Pre-cached `bitnamilegacy/discourse:3.3.1` by TAG on cc-ci (`docker pull`) — was dangling <none>;
now the inline pull during deploy is a no-op (layers present) → convergence not pull-bound.
2. DEPLOY_TIMEOUT/TIMEOUT 2400→3600 (recipe_meta) — headroom for the RAM/CPU-constrained Rails boot.
Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.