diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md
index 61d76df..7bc12c4 100644
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -1463,3 +1463,35 @@ omitted → data loss), full8 won it. Merged db healthcheck confirmed retries=10
 failure, and an intermittently-broken P4 data-integrity test is a real defect (P7). NOT claiming ghost on luck.
 Decision stands: implement the harness backup-integrity verify+re-invoke fix (next), then a ghost run must
 pass restore RELIABLY (ideally confirm with 2 consecutive green incl upgrade) before claim.
+
+---
+## 2026-05-31T01:2x — discourse full4 timeout root-cause + full5 fixes (Builder)
+Woke into the loop with discourse full4 in flight (PR head 3758522, STAGES=install,upgrade,backup,
+restore,custom — the VETO-clearing run incl upgrade-to-latest). full4 FAILED at the BASE deploy:
+`install: fail`, rest skipped; `abra app deploy disc-ce6450 ... timed out after 2400 seconds`.
+
+Investigation:
+- full2 (same REF, same overlay) base deploy SUCCEEDED (install+upgrade tiers passed) → the overlay
+  approach works; full4's timeout is flakiness at the convergence edge, not a config break.
+- The recurring log line `service "sidekiq" depends on undefined service "discourse": invalid compose
+  project` comes from `abra app config --images` (the prepull step): the published recipe (base 0.7.0
+  AND PR head) has `sidekiq.depends_on: [discourse]`, but the main service is `app` — `discourse` is
+  undefined → config rc=15 → prepull SKIPPED → the 2.4GB image is pulled INLINE during deploy.
+- On cc-ci the image was cached as `bitnamilegacy/discourse:` (tag dangling) → the deploy
+  re-pulled 2.4GB, eating the convergence budget. Combined with the node being only **7 GiB RAM**
+  (not the 28 GiB the plan assumed) + load 6-7 on 4 vCPU during Rails asset-precompile, 40min was too
+  tight. (swarm IGNORES depends_on, so the dangling ref has zero runtime effect — full2 proves deploy
+  works despite it; it only breaks the prepull lint.)
+
+Tried to fix prepull by overriding `sidekiq.depends_on:[app]` in the overlay (04cc44c). It does NOT
+work: docker normalizes short-form depends_on to a map and map-merge is ADDITIVE → {discourse}+{app}
+={discourse,app}, the bad key survives, config --images still rc=15. (My initial "rc=0" test was
+bogus — `$?` after `| head` is head's exit code.) Reverted (8dfd8ed); overlay stays minimal.
+
+full5 fixes (the ones that actually address the timeout):
+1. Pre-cached `bitnamilegacy/discourse:3.3.1` by TAG on cc-ci (`docker pull`) — was dangling;
+   now the inline pull during deploy is a no-op (layers present) → convergence not pull-bound.
+2. DEPLOY_TIMEOUT/TIMEOUT 2400→3600 (recipe_meta) — headroom for the RAM/CPU-constrained Rails boot.
+Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
+volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
+full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.
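
Editor's sketch of the `$?`-after-`| head` pitfall noted in the entry above. It is illustrative only and is not the command that was actually run. `failing_cmd` is a hypothetical stand-in, hard-coded to exit 15 so it mimics the reported `config --images` rc, and `image-a` is made-up output. The sketch uses only POSIX sh features; in bash, `${PIPESTATUS[0]}` or `set -o pipefail` are alternatives for the same check.

```shell
# Hypothetical stand-in for a command that prints some output and then fails
# with rc=15. Substitute the real command under test.
failing_cmd() {
  printf 'image-a\n'
  return 15
}

# Pitfall: after a pipeline, $? is the exit code of the LAST stage (head),
# so the failure of the first stage is masked.
failing_cmd | head -n 5 >/dev/null
naive_rc=$?

# Portable fix: capture the output and the exit code first, trim afterwards.
# The `|| real_rc=$?` form also keeps the script alive under `set -e`.
real_rc=0
out=$(failing_cmd) || real_rc=$?
first_lines=$(printf '%s\n' "$out" | head -n 5)

echo "naive_rc=$naive_rc real_rc=$real_rc"
```

Run as written, this prints `naive_rc=0 real_rc=15`: the piped form reports success even though the command failed, which matches the bogus "rc=0" reading described in the entry.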