journal(2): discourse full4 timeout root-cause + full5 fixes (warm image cache + 3600s)

2026-05-31 01:26:41 +00:00
parent 8dfd8ed3b3
commit cc952903df
1 changed files with 32 additions and 0 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@ -1463,3 +1463,35 @@ omitted → data loss), full8 won it. Merged db healthcheck confirmed retries=10
 failure, and an intermittently-broken P4 data-integrity test is a real defect (P7). NOT claiming ghost
 on luck. Decision stands: implement the harness backup-integrity verify+re-invoke fix (next), then a
 ghost run must pass restore RELIABLY (ideally confirm with 2 consecutive green incl upgrade) before claim.
+
+---
+## 2026-05-31T01:2x — discourse full4 timeout root-cause + full5 fixes (Builder)
+Woke into the loop with discourse full4 in flight (PR head 3758522, STAGES=install,upgrade,backup,
+restore,custom — the VETO-clearing run incl upgrade-to-latest). full4 FAILED at the BASE deploy:
+`install: fail`, rest skipped; `abra app deploy disc-ce6450 ... timed out after 2400 seconds`.
+
+Investigation:
+- full2 (same REF, same overlay) base deploy SUCCEEDED (install+upgrade tiers passed) → the overlay
+  approach works; full4's timeout is flakiness at the convergence edge, not a config break.
+- The recurring log line `service "sidekiq" depends on undefined service "discourse": invalid compose
+  project` comes from `abra app config --images` (the prepull step): the published recipe (base 0.7.0
+  AND PR head) has `sidekiq.depends_on: [discourse]`, but the main service is `app` — `discourse` is
+  undefined → config rc=15 → prepull SKIPPED → the 2.4GB image is pulled INLINE during deploy.
+- On cc-ci the image was cached as `bitnamilegacy/discourse:<none>` (tag dangling) → the deploy
+  re-pulled 2.4GB, eating the convergence budget. Combined with the node being only **7 GiB RAM**
+  (not the 28 GiB the plan assumed) + load 6-7 on 4 vCPU during Rails asset-precompile, 40min was too
+  tight. (swarm IGNORES depends_on, so the dangling ref has zero runtime effect — full2 proves deploy
+  works despite it; it only breaks the prepull lint.)
+
+Tried to fix prepull by overriding `sidekiq.depends_on:[app]` in the overlay (04cc44c). It does NOT
+work: docker normalizes short-form depends_on to a map and map-merge is ADDITIVE → {discourse}+{app}
+={discourse,app}, the bad key survives, config --images still rc=15. (My initial "rc=0" test was
+bogus — `$?` after `| head` is head's exit code.) Reverted (8dfd8ed); overlay stays minimal.
+
+full5 fixes (the ones that actually address the timeout):
+1. Pre-cached `bitnamilegacy/discourse:3.3.1` by TAG on cc-ci (`docker pull`) — was dangling <none>;
+   now the inline pull during deploy is a no-op (layers present) → convergence not pull-bound.
+2. DEPLOY_TIMEOUT/TIMEOUT 2400→3600 (recipe_meta) — headroom for the RAM/CPU-constrained Rails boot.
+Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
+volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
+full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.