Explain the cc-ci server -> Hetzner migration (ssh cc-ci now 91.98.47.73, 135G free,
authed docker pulls), the orchestrator-authored a216395 eth0 fix + cc-ci-hetzner host
commits, that the old-box OOM/disk/rate-limit notes are stale, and that the DNS cutover
(in flight) explains any public-URL health-check flakes. Loops delete on consume.
nixos-infect emitted defaultGateway6.address="" and ipv6.routes=[{address="";
prefixLength=128}] for this v4-only Hetzner instance, so network-addresses-eth0.service
failed at boot ("ip route add /128 ... any valid prefix is expected rather than /128").
The box has no real IPv6 (link-local only, kernel-managed), so remove the empty IPv6
gateway, address, and route. IPv4 unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port from terraform-hetzner branch. Adds the Hetzner cc-ci flake host with
all 3 root authorized keys so nixos-rebuild doesn't lock out SSH access.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The depends_on:[app] override in 04cc44c does NOT make compose valid: docker normalizes short-form
depends_on to a map and merges additively, so {discourse}+{app}={discourse,app} keeps the invalid
'discourse' key (config --images still rc=15). Reverted to keep the overlay minimal (re-pin + grace
only). Prepull-skip is harmless because bitnamilegacy/discourse:3.3.1 is warm in the node image cache
→ inline pull is a no-op. Timeout headroom (3600s) retained in recipe_meta.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full4 base deploy timed out at 2400s on the 7-GiB single node. Root causes:
(1) sidekiq.depends_on referenced undefined service 'discourse' (main svc is 'app') → abra config
--images rc=15 → prepull SKIPPED → 2.4GB image pulled inline during deploy, eating convergence
budget. Overlay now overrides sidekiq.depends_on:[app] (swarm ignores depends_on → no-op at
runtime, masks nothing) so prepull resolves+pre-pulls images on both base+head deploys.
(2) bumped DEPLOY_TIMEOUT/TIMEOUT 2400→3600 for headroom on the RAM/CPU-constrained Rails cold boot.
Also pre-cached bitnamilegacy/discourse:3.3.1 by tag on cc-ci (was dangling <none>).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full9: backup tier FAILed with NameError('__file__' not defined) — recipe_meta.py is exec()'d into a
bare namespace so __file__ is undefined. The harness already has runner/ on sys.path + harness imported,
so import lifecycle directly. (restore PASSED on full9 — the data-integrity fix works; this just fixes
the verify probe crashing the backup tier.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook,
but if the DB container cycles mid-dump (intermittent on the loaded CI node — full5/6/7 RED, full8
green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path —
abra app backup 'succeeds' yet a later restore silently loses the data (ghost ci_marker).
Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain)
-> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole
backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's
BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion —
it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
full4 timed out: abra deploy killed at 1200s while the app was at the near-final email_recipients
migration tables (still 0/1). Wall-time = mysql fresh-dir init (~6min, app crash-loops on ECONNREFUSED
until DB ready — no migration progress lost) + ~9-15min schema migration (round-trip-bound, slower
under host load). Not a test weakening — bounded wait (matches discourse), a genuine hang still fails.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per Adversary course-correction (bdef282) + plan-ccci-compose-overlay-policy.md §1: upgrade-to-latest
is MANDATORY. The 0.7.0+3.3.1 from-version pins the Docker-Hub-removed bitnami/discourse:3.3.1 (404)
and ships a too-tight 5m start_period for the 15-25min Rails cold boot. Minimal base overlay
compose.ccci.yml re-pins app+sidekiq to bitnamilegacy/discourse:3.3.1 (namespace-only, identical
image — same re-pin the PR head makes) + widens start_period to 20m (grace-only). install_steps.sh
provides it; CHAOS_BASE_DEPLOY skips the clean-tree gate; UPGRADE_BASE_VERSION=0.7.0+3.3.1 sets the
true predecessor. Neither change weakens a test. Run shape returns to STAGES=install,upgrade,backup,
restore,custom.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run #2 base deploy: fresh mysql:8.0 init on the loaded cc-ci host (load ~8) took >6min
(InnoDB ~90s + system-tables + root-pw apply, starved by the app crash-loop churn), exceeding
the recipe's 1m db start_period (+6min retry grace) → swarm killed mysql mid-init (exit 137
unhealthy) → corrupt InnoDB redo logs → permanent deadlock (same signature as run #1's stale
vol). Widen db healthcheck start_period to 15m (matches app) so the slow first-boot finishes
before the healthcheck can fail it. Grace-only, masks no defect; bites base+head (published
recipe ships db start_period 1m everywhere) so overlay covers both. Torn down corrupt vol.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Course correction (REVIEW-2 bdef282) mandates upgrade-to-latest; harness base-deploys
prev published version 1.1.1+6-alpine which predates the recipe-PR 15m start_period bump
(ships 1m) → would deadlock on the ~6-9min fresh-DB migration (swarm kill mid-migration →
held migrations_lock). Policy-blessed minimal base overlay: compose.ccci.yml re-applies the
15m app-healthcheck start_period grace to the BASE so the from-version is deployable;
install_steps.sh provides it; CHAOS_BASE_DEPLOY skips clean-tree on the untracked overlay;
persists across head checkout (idempotent — PR head ships 15m). Grace-only, no test weakened.
Prior corrupt mysql vol (stale, interrupted init) torn down. Next: full run incl upgrade.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>