harness/screenshot.py: best-effort Playwright capture of the live app (reuses harness browser).
Default = landing page (credential-free, secret-safe R7); recipes needing post-login opt into a
recipe-meta SCREENSHOT hook, which is responsible for avoiding secret pages. Every failure is swallowed
and returns None (cosmetics never block, R7). Pure helpers are unit-tested. Orchestrator wiring + live
demo come after U0 PASSes (avoids deploy contention with the Adversary's cold U0 re-runs).
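
Sketch of the "every failure swallowed -> None" contract (illustrative only, not the harness code; `best_effort_capture` and the injected `capture` callable are hypothetical names, and the real module drives Playwright instead of a plain callable):

```python
from typing import Callable, Optional


def best_effort_capture(capture: Callable[[], bytes]) -> Optional[bytes]:
    """Run a capture callable; return its bytes, or None on any failure."""
    try:
        data = capture()
    except Exception:
        # Cosmetic step: a screenshot failure must never propagate to the run.
        return None
    # An empty result is treated as "no screenshot" too.
    return data or None


def _broken_capture() -> bytes:
    # Stand-in for a browser that is not available.
    raise RuntimeError("browser not available")


ok_result = best_effort_capture(lambda: b"fake-png-bytes")
failed_result = best_effort_capture(_broken_capture)
```

Callers only ever need an `is None` check, so the tier result cannot depend on the screenshot.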
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The depends_on:[app] override in 04cc44c does NOT make compose valid: docker normalizes short-form
depends_on to a map and merges additively, so {discourse}+{app}={discourse,app} keeps the invalid
'discourse' key (config --images still rc=15). Reverted to keep the overlay minimal (re-pin + grace
only). Prepull-skip is harmless because bitnamilegacy/discourse:3.3.1 is warm in the node image cache
→ inline pull is a no-op. Timeout headroom (3600s) retained in recipe_meta.
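
Sketch of why the override could not drop the bad key (a simplified model of the normalize-then-merge behaviour described above, not docker's implementation; the function names are hypothetical and the `service_started` condition is assumed to be what the short form expands to):

```python
def normalize_depends_on(value):
    """Short form (list of service names) -> long form (map keyed by name)."""
    if isinstance(value, list):
        # Assumption: short-form entries expand to the default condition.
        return {name: {"condition": "service_started"} for name in value}
    return dict(value)


def merge_depends_on(base, override):
    """Additive merge: override keys are added, base keys are never removed."""
    merged = normalize_depends_on(base)
    merged.update(normalize_depends_on(override))
    return merged


# Base recipe references the undefined 'discourse'; the overlay names 'app'.
merged = merge_depends_on(["discourse"], ["app"])
merged_keys = sorted(merged)
```

Because the merge is keyed by service name, the overlay can add `app` but has no way to remove `discourse`.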
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full4 base deploy timed out at 2400s on the 7-GiB single node. Root causes:
(1) sidekiq.depends_on referenced undefined service 'discourse' (main svc is 'app') → abra config
--images rc=15 → prepull SKIPPED → 2.4GB image pulled inline during deploy, eating convergence
budget. Overlay now overrides sidekiq.depends_on:[app] (swarm ignores depends_on → no-op at
runtime, masks nothing) so prepull resolves+pre-pulls images on both base+head deploys.
(2) bumped DEPLOY_TIMEOUT/TIMEOUT 2400→3600 for headroom on the RAM/CPU-constrained Rails cold boot.
Also pre-cached bitnamilegacy/discourse:3.3.1 by tag on cc-ci (was dangling <none>).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full9: backup tier FAILed with NameError('__file__' not defined): recipe_meta.py is exec()'d into a
bare namespace, so __file__ is undefined there. The harness already has runner/ on sys.path + harness
imported, so recipe_meta now imports lifecycle directly. (restore PASSED on full9: the data-integrity
fix works; this only fixes the verify probe crashing the backup tier.)
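
Minimal reproduction of the failure mode (a sketch; `load_meta` and the one-line source are hypothetical stand-ins for how the harness loads recipe_meta.py). The second call, which injects `__file__`, is shown only for contrast and is not the fix adopted here:

```python
# A recipe_meta-like source that reads __file__ at module level.
source = "HERE = __file__\n"


def load_meta(src, namespace):
    """exec() the source into the given namespace and return it."""
    exec(compile(src, "recipe_meta.py", "exec"), namespace)
    return namespace


# Bare namespace: exec() adds __builtins__ but never defines __file__.
try:
    load_meta(source, {})
    error = None
except NameError as exc:
    error = exc

# Contrast: the name resolves only if the loader supplies it explicitly.
loaded = load_meta(source, {"__file__": "/hypothetical/recipe_meta.py"})
```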
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook,
but if the DB container cycles mid-dump, the dump is truncated or absent and restic snapshots an empty
path. The cycling is intermittent on the loaded CI node (full5/6/7 RED, full8 green; NOT OOM, NOT
healthcheck). abra app backup 'succeeds', yet a later restore silently loses the data (ghost ci_marker).
Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain)
-> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole
backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's
BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion —
it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent.
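
Sketch of the verify-and-retry shape (illustrative, not the harness code). Assumptions: "up to 3x" is read as three attempts in total, `backup_with_verify` and `is_valid_nonempty_gzip` are hypothetical names, and the gzip check runs on a local path here whereas ghost's probe inspects the file inside the db container:

```python
import gzip
import os
import tempfile
import zlib


def is_valid_nonempty_gzip(path):
    """True if the file decompresses cleanly and yields at least one byte."""
    try:
        total = 0
        with gzip.open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 16), b""):
                total += len(chunk)
        return total > 0
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupt deflate data.
        return False


def backup_with_verify(run_backup, verify, domain, attempts=3):
    """Re-run the whole backup until the read-only probe passes."""
    for _ in range(attempts):
        run_backup(domain)
        if verify(domain):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    payload = gzip.compress(b"INSERT INTO t VALUES (1);\n" * 100)
    good = os.path.join(tmp, "good.sql.gz")
    truncated = os.path.join(tmp, "truncated.sql.gz")
    with open(good, "wb") as fh:
        fh.write(payload)
    with open(truncated, "wb") as fh:
        fh.write(payload[: len(payload) // 2])  # simulate a mid-dump cycle
    good_ok = is_valid_nonempty_gzip(good)
    truncated_ok = is_valid_nonempty_gzip(truncated)

# Two bad captures followed by a good one: the third backup run is accepted.
verdicts = iter([False, False, True])
runs = []
succeeded = backup_with_verify(runs.append, lambda d: next(verdicts), "ghost.example.test")
```

The probe never changes what restore asserts; it only decides whether the capture is worth keeping.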
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full4 timed out: abra deploy killed at 1200s while the app was at the near-final email_recipients
migration tables (still 0/1). Wall-time = mysql fresh-dir init (~6min, app crash-loops on ECONNREFUSED
until DB ready — no migration progress lost) + ~9-15min schema migration (round-trip-bound, slower
under host load). Not a test weakening — bounded wait (matches discourse), a genuine hang still fails.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per Adversary course-correction (bdef282) + plan-ccci-compose-overlay-policy.md §1: upgrade-to-latest
is MANDATORY. The 0.7.0+3.3.1 from-version pins the Docker-Hub-removed bitnami/discourse:3.3.1 (404)
and ships a too-tight 5m start_period for the 15-25min Rails cold boot. Minimal base overlay
compose.ccci.yml re-pins app+sidekiq to bitnamilegacy/discourse:3.3.1 (namespace-only, identical
image — same re-pin the PR head makes) + widens start_period to 20m (grace-only). install_steps.sh
provides it; CHAOS_BASE_DEPLOY skips the clean-tree gate; UPGRADE_BASE_VERSION=0.7.0+3.3.1 sets the
true predecessor. Neither change weakens a test. Run shape returns to STAGES=install,upgrade,backup,
restore,custom.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run #2 base deploy: fresh mysql:8.0 init on the loaded cc-ci host (load ~8) took >6min
(InnoDB ~90s + system-tables + root-pw apply, starved by the app crash-loop churn), exceeding
the recipe's 1m db start_period (+6min retry grace) → swarm killed mysql mid-init (exit 137
unhealthy) → corrupt InnoDB redo logs → permanent deadlock (same signature as run #1's stale
vol). Widen db healthcheck start_period to 15m (matches app) so the slow first-boot finishes
before the healthcheck can fail it. Grace-only, masks no defect; bites base+head (published
recipe ships db start_period 1m everywhere) so overlay covers both. Torn down corrupt vol.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Course correction (REVIEW-2 bdef282) mandates upgrade-to-latest; harness base-deploys
prev published version 1.1.1+6-alpine which predates the recipe-PR 15m start_period bump
(ships 1m) → would deadlock on the ~6-9min fresh-DB migration (swarm kill mid-migration →
held migrations_lock). Policy-blessed minimal base overlay: compose.ccci.yml re-applies the
15m app-healthcheck start_period grace to the BASE so the from-version is deployable;
install_steps.sh provides it; CHAOS_BASE_DEPLOY skips clean-tree on the untracked overlay;
persists across head checkout (idempotent — PR head ships 15m). Grace-only, no test weakened.
Prior corrupt mysql vol (stale, interrupted init) torn down. Next: full run incl upgrade.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
abra rejects env-interpolation in healthcheck start_period (FATA 'Does not match
format duration' for both ${VAR} and quoted forms — validates the literal compose
duration before .env substitution). So §9 pt1's env-var route is impossible for
this field; the §9-compliant fix is a LITERAL start_period:20m bump in the
recipe-PR (recipe everyone runs, not a cc-ci overlay; strictly safer). Remove
APP_START_PERIOD from recipe_meta EXTRA_ENV; record the finding in DECISIONS
(ghost E1 must use the same approach); STATUS-2 → new PR head 7a2e0e0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate discourse off the cc-ci compose overlay per plan §9 / plan-prefer-env-over-compose-overlay.md:
- recipe_meta: drop UPGRADE_BASE_VERSION + COMPOSE_FILE + CHAOS_BASE_DEPLOY; set APP_START_PERIOD=1200s
via EXTRA_ENV (the recipe-PR exposes start_period: ${APP_START_PERIOD:-5m}); declare upgrade tier N/A
(both published prev bases pin removed bitnami images; Adversary §7.1 granted, REVIEW-2 efe3790).
- delete tests/discourse/compose.ccci-health.yml + install_steps.sh (existed only to copy the overlay).
- DECISIONS.md + STATUS-2 record the §9 guardrail + discourse shape (upgrade N/A, env start_period,
pg_backup restore-hook recipe-PR = 5th data-loss recipe cc-ci caught).
recipe-PR head now 8b8df17 (start_period env var added). Not a claim — run STAGES=install,backup,restore,custom next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the real 0.7.0+3.3.1 -> 0.8.0+3.3.1 upgrade crossover instead of a
§7.1 skip-with-sign-off (Adversary leans DENY on the deferral; agreed):
- recipe_meta UPGRADE_BASE_VERSION=0.7.0+3.3.1 + generic support in
run_recipe_ci (prev = meta override or previous_version). Harness default
[-2]=0.6.3+3.1.2 is a hollow base (img 3.1.2 != head 3.3.1); [-1]=0.7.0+3.3.1
is the PR's true predecessor and shares head's servable 3.3.1 image.
- compose.ccci-health.yml re-pins services.{app,sidekiq}.image to
bitnamilegacy/discourse:3.3.1 so the 0.7.0 base (compose pins 404 bitnami:3.3.1)
is servable; idempotent on the head (PR already bitnamilegacy).
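
Sketch of the base-version selection described above (illustrative; `resolve_base_version` is a hypothetical name and the real logic lives in run_recipe_ci; the published list is assumed to be ordered oldest to newest):

```python
def resolve_base_version(meta, published):
    """Pick the upgrade base: recipe_meta override first, else the default."""
    override = meta.get("UPGRADE_BASE_VERSION")
    if override:
        return override
    if len(published) < 2:
        raise ValueError("need at least two published versions for a default base")
    # Harness default per the note above: the [-2] entry.
    return published[-2]


published = ["0.6.3+3.1.2", "0.7.0+3.3.1"]
default_base = resolve_base_version({}, published)
overridden_base = resolve_base_version({"UPGRADE_BASE_VERSION": "0.7.0+3.3.1"}, published)
```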
Consumes Adversary BUILDER-INBOX (deleted), leaves ADVERSARY-INBOX ack; STATUS-2
discourse section updated. Full lifecycle run launching next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_discourse.py: bootstrap an admin (recipe seeds none) + mint an ApiKey via rails runner in the app
container (class-B run-scoped). test_create_topic.py: POST /posts.json (unique marker) -> GET
/t/<id>.json title+cooked round-trip. test_site_basic.py: GET /site.json asserts discourse categories
config. Meets P3 (>=2 functional beyond health).
Install timed out at 1800s: discourse's 15-25min Rails cold boot overran both the deploy timeout and
the recipe healthcheck start_period:5m (swarm killed the booting app). Add compose.ccci-health.yml
(app healthcheck start_period 1200s) via install_steps.sh + recipe_meta COMPOSE_FILE + CHAOS_BASE_DEPLOY,
bump DEPLOY_TIMEOUT/TIMEOUT to 2400. Image re-pin (bitnamilegacy) already proven working. NO test weakened.
Root cause: Ghost's fresh-DB first boot runs a ~6-9min schema migration (round-trip-bound, not CPU);
the recipe healthcheck start_period:1m (~6min grace) kills the still-migrating task, leaving a stale
migrations_lock → every later task deadlocks (MigrationsAreLockedError). Hit on both 2- and 4-vCPU.
Fix (cc-ci deploy overlay, NOT a recipe/test change): compose.ccci-health.yml raises app healthcheck
start_period to 900s, wired via recipe_meta COMPOSE_FILE + install_steps.sh (+ CHAOS_BASE_DEPLOY for
the untracked overlay). No assertion weakened. Budget 1200s = migration + convergence. Only the
install tier needs it (upgrade redeploys on the populated DB → fast boot).
- ops.py + test_{upgrade,backup,restore}.py: seed ci_marker into the MySQL `ghost` DB (db service)
via the mysql CLI; the marker rides the recipe's mysqldump --tab backup. The recipe uses MySQL, not
SQLite (stale comment fixed). Expect restore RED -> recipe-PR (no backupbot.restore hook; same class as immich/mattermost).
- functional/_ghost.py: cookie-aware Ghost Admin API client (stdlib http.cookiejar; Origin CSRF hdr).
- functional/test_post_roundtrip.py: §4.3 create published post + read back (unique marker, non-vacuous);
closes the DEFERRED ghost create-post item.
- PARITY.md + recipe_meta.py updated. Authored node-free; full-lifecycle run next, NOT yet claimed.
Root cause of the 2 failing custom tests: TLS_FLAVOR=notls → dovecot refuses plaintext auth over the
network (port 143), so host-side IMAP login/auth isn't a meaningful signal. Smoke2 PROVED the in-container
path: sendmail (postfix container) local-injects a marker mail → doveadm search (imap container) finds
it in INBOX. test_mail_flow now exercises the real postfix→rspamd→dovecot deliver/store/fetch via
exec_in_app(service=smtp/imap). Dropped test_imap_login (network plaintext-auth disallowed under notls).
test_mailbox (create+config-export read-back) unchanged. PARITY.md updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green
on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not
running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by
the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy
passed while the app churned → backup-bot exec'd into a container that was not running. Fix: extend
lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects});
mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up
after install AND upgrade before backup.
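
Sketch of a "stable=N consecutive connects" TCP probe (illustrative, not the lifecycle.wait_ready_probes code; the function name, the timeout and interval parameters, and their defaults are assumptions):

```python
import socket
import time


def wait_tcp_stable(host, port, stable=3, timeout=60.0, interval=1.0, connect_timeout=2.0):
    """Return True once `stable` consecutive TCP connects succeed before the deadline."""
    deadline = time.monotonic() + timeout
    streak = 0
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                streak += 1
        except OSError:
            # A refused or timed-out connect resets the streak: the port is still churning.
            streak = 0
        if streak >= stable:
            return True
        time.sleep(interval)
    return False
```

Requiring a streak rather than one successful connect is what filters out the window where the old task has released the host port and the new one has only just bound it.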
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PRAGMA busy_timeout=N emits its own result row, polluting the read-back parse (seed read back
'20000\nupgrade-survives' → AssertionError 'seed did not commit', failing upgrade/backup/restore ops
— though the INSERT actually committed). Switch _sqlite to 'sqlite3 -cmd ".timeout 20000"' which sets
the busy timeout silently. install+custom already green (handshake/welcome/web/tcp PASS); this fixes
the P4 lifecycle ops.
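
Reproduction of the stray result row using Python's stdlib sqlite3 module (a sketch; the harness shells out to the sqlite3 CLI instead, and the `ci_marker` table layout here is a hypothetical stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ci_marker (value TEXT)")
conn.execute("INSERT INTO ci_marker VALUES ('upgrade-survives')")
conn.commit()

# Setting the timeout via PRAGMA also returns the new value as a result row.
pragma_rows = conn.execute("PRAGMA busy_timeout=20000").fetchall()
select_rows = conn.execute("SELECT value FROM ci_marker").fetchall()

# Mimic CLI stdout when both statements run in one invocation.
cli_like_output = "\n".join(str(row[0]) for row in pragma_rows + select_rows)

# A read-back that expects the marker alone no longer matches.
naive_match = cli_like_output.strip() == "upgrade-survives"
conn.close()
```

The INSERT is committed either way; only the comparison against the combined output fails.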
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because
install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True +
lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the
checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md
records the full mumble enrollment design.
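
Sketch of the flag lookup and branch (illustrative; the real helpers are lifecycle._recipe_meta_flag and deploy_app, whose signatures are not shown here, so the function names and the "chaos"/"pinned" labels below are hypothetical):

```python
def recipe_meta_flag(meta, name, default=False):
    """Read an optional boolean flag from a loaded recipe_meta namespace."""
    return bool(meta.get(name, default))


def base_deploy_mode(meta):
    """Chaos skips the clean-tree/lint gate but still deploys the checked-out pin."""
    return "chaos" if recipe_meta_flag(meta, "CHAOS_BASE_DEPLOY") else "pinned"


mumble_mode = base_deploy_mode({"CHAOS_BASE_DEPLOY": True})
default_mode = base_deploy_mode({})
```

Recipes that never set the flag keep the existing base-deploy path.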
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The upstream compose.host-ports.yml exists only from v1.0.0+, but the upgrade-tier base deploy is
the previous published version (0.2.0+), which predates it — so EXTRA_ENV's COMPOSE_FILE failed to
resolve on the base deploy (config --images rc=14, deploy FATA). install_steps.sh now copies a
cc-ci-owned identical overlay into the recipe checkout when absent, so 64738 is host-published for
every version (base + upgrade) and on-host protocol tests reach 127.0.0.1:64738.
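
Sketch of the copy-when-absent step, rendered in Python for illustration (install_steps.sh is a shell script; `ensure_overlay` and the directory layout in the demo are hypothetical):

```python
import shutil
import tempfile
from pathlib import Path


def ensure_overlay(owned_copy, checkout, name="compose.host-ports.yml"):
    """Copy the cc-ci-owned overlay into the checkout only if it is missing."""
    target = Path(checkout) / name
    if target.exists():
        # Versions that already ship the file (v1.0.0+) are left untouched.
        return False
    shutil.copyfile(owned_copy, target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    owned = Path(tmp) / "owned-overlay.yml"
    owned.write_text("services: {}\n")
    checkout = Path(tmp) / "recipe"
    checkout.mkdir()
    first = ensure_overlay(owned, checkout)   # base checkout lacks the file
    second = ensure_overlay(owned, checkout)  # already present on the second call
    copied_text = (checkout / "compose.host-ports.yml").read_text()
```

Skipping the copy when the file exists keeps the upstream overlay authoritative on versions that ship it.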
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>