fix(2): discourse base-deploy timeout — prepull-enable (sidekiq depends_on app, valid compose) + 3600s timeout
The full4 base deploy timed out at 2400s on the 7-GiB single node. Root cause and fixes:
(1) sidekiq.depends_on referenced the undefined service 'discourse' (the main service is 'app'),
so `abra app config --images` failed with rc=15 and prepull was SKIPPED. The 2.4GB image was
then pulled inline during deploy, eating the convergence budget. The overlay now overrides
sidekiq.depends_on: [app], so prepull resolves and pre-pulls the images on both the base and
head deploys. Swarm ignores depends_on, so the override is a no-op at runtime and masks nothing.
(2) Bumped DEPLOY_TIMEOUT/TIMEOUT 2400→3600 for headroom on the RAM/CPU-constrained Rails cold boot.
Also pre-cached bitnamilegacy/discourse:3.3.1 by tag on cc-ci (it was previously dangling as <none>).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
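The dangling reference described in (1) can be illustrated with a minimal sketch. It is not how abra validates a compose project; it only shows the kind of check that fails on the published recipe and passes once the overlay is applied. The function name `dangling_depends_on` and the two dictionaries are hypothetical, hand-written stand-ins for the parsed compose files.

```python
from typing import Any


def dangling_depends_on(compose: dict[str, Any]) -> dict[str, list[str]]:
    """Map each service to the depends_on targets that are not defined services."""
    services = compose.get("services") or {}
    dangling: dict[str, list[str]] = {}
    for name, spec in services.items():
        # depends_on may be a list (short form) or a mapping (long form);
        # iterating either one yields the referenced service names.
        deps = (spec or {}).get("depends_on") or []
        missing = [dep for dep in deps if dep not in services]
        if missing:
            dangling[name] = missing
    return dangling


# Assumed shapes: simplified stand-ins for the recipe, not the real files.
IMAGE = "bitnamilegacy/discourse:3.3.1"
published = {
    "services": {
        "app": {"image": IMAGE},
        "sidekiq": {"image": IMAGE, "depends_on": ["discourse"]},
    }
}
overlaid = {
    "services": {
        "app": {"image": IMAGE},
        "sidekiq": {"image": IMAGE, "depends_on": ["app"]},
    }
}

print(dangling_depends_on(published))  # {'sidekiq': ['discourse']}
print(dangling_depends_on(overlaid))   # {}
```

A check like this could run before the prepull step so that a skipped prepull is reported as a compose-validity problem rather than surfacing later as a deploy timeout.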
@@ -20,6 +20,19 @@ version: "3.8"
 # ships 20m, so this overlay is idempotent on the head (it persists untracked across the checkout).
 # Both changes are namespace/grace-only: identical image content, a healthy check still marks healthy
 # immediately → NO assertion is weakened and no defect is masked.
+#
+# 3. PREPULL-ENABLE (compose-validity): the published recipe (BOTH 0.7.0 base AND the PR head) ships
+# `sidekiq.depends_on: [discourse]`, but the main service is named `app` — `discourse` is an
+# UNDEFINED service, so `abra app config --images` (the harness prepull step) returns
+# `invalid compose project` (rc=15) and prepull is SKIPPED → the 2.4GB discourse image is pulled
+# INLINE during `abra app deploy`, eating the convergence budget on the 7-GiB single node and
+# pushing the slow Rails cold boot past the deploy timeout (full4 base-deploy timed out at 2400s).
+# This overlay overrides sidekiq.depends_on to the real service `app` so the merged compose is
+# VALID → prepull resolves+pre-pulls the images before deploy → reliable convergence. This is a
+# NO-OP at runtime: `docker stack deploy` (swarm) IGNORES depends_on entirely, so correcting a
+# dangling lint-only reference to the real service changes NOTHING swarm does — it masks no defect
+# and weakens no test; it only unblocks the image pre-pull. (The same dangling ref exists upstream;
+# filed as a recipe nit, but the fix here is the immutable-published-base + harness-prepull path.)
 services:
   app:
     image: bitnamilegacy/discourse:3.3.1
@@ -27,3 +40,5 @@ services:
       start_period: 20m
   sidekiq:
     image: bitnamilegacy/discourse:3.3.1
+    depends_on:
+      - app
@@ -6,7 +6,8 @@
 # app is actually serving (the canonical "is discourse up" signal — NOT "/", which may redirect to setup).
 HEALTH_PATH = "/srv/status"
 HEALTH_OK = (200,)
-DEPLOY_TIMEOUT = 2400 # slow Rails cold boot (15-25min); matches the EXTRA_ENV TIMEOUT below
+DEPLOY_TIMEOUT = 3600 # slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for
+                      # headroom after full4's base deploy timed out at 2400s (RAM/CPU-constrained boot + image re-pull).
 HTTP_TIMEOUT = 1200
 
 # Slow-cold-boot handling: the recipe-PR (recipe-maintainers/discourse#1) bumps the app healthcheck
@@ -33,7 +34,7 @@ HTTP_TIMEOUT = 1200
 CHAOS_BASE_DEPLOY = True
 UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
 EXTRA_ENV = {
-    "TIMEOUT": "2400",
+    "TIMEOUT": "3600", # abra's internal convergence wait; matches DEPLOY_TIMEOUT (slow Rails boot headroom)
     "COMPOSE_FILE": "compose.yml:compose.ccci.yml",
 }
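The diff keeps DEPLOY_TIMEOUT and EXTRA_ENV["TIMEOUT"] in sync by hand. One possible way to keep them from drifting is sketched below. It assumes the harness would accept a derived string value in EXTRA_ENV, which this commit does not confirm; `timeouts_match` is a hypothetical helper, not part of the harness.

```python
DEPLOY_TIMEOUT = 3600  # seconds; value taken from this commit

EXTRA_ENV = {
    # Derived from DEPLOY_TIMEOUT so the two values cannot drift apart.
    "TIMEOUT": str(DEPLOY_TIMEOUT),
    "COMPOSE_FILE": "compose.yml:compose.ccci.yml",
}


def timeouts_match(deploy_timeout: int, extra_env: dict[str, str]) -> bool:
    """Return True when the env TIMEOUT equals the harness deploy timeout."""
    return int(extra_env["TIMEOUT"]) == deploy_timeout


print(timeouts_match(DEPLOY_TIMEOUT, EXTRA_ENV))   # True
print(timeouts_match(3600, {"TIMEOUT": "2400"}))   # False
```

If the literal values are preferred for readability, the same helper could instead be used as a one-line assertion at import time.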