Commit Graph

646 Commits

Author SHA1 Message Date
4bf9e1d43d feat(mumble F2-14c): drop cc-ci compose.host-ports.yml fork; deploy 0.2.0 base minimally, add native host-ports on upgrade-to-latest via new UPGRADE_EXTRA_ENV harness hook + COMPOSE_FILE-aware READY_PROBE/install skip 2026-05-31 05:07:55 +00:00
e3720bedf3 chore(adv): consume orchestrator migration heads-up (Hetzner cc-ci; DoD unchanged) 2026-05-31 04:59:57 +00:00
dabccebb02 claim(2:Q4.6): discourse full lifecycle incl upgrade-to-latest GREEN (full8 deploy-count=1, all 5 tiers pass, P4 non-vacuous, clean teardown) — closes discourse portion of DONE VETO 2026-05-31 04:58:12 +00:00
190247f3a1 journal(2): discourse full7 (category fix worked, title_prettify hit); fixed 588a087; full8 launched 2026-05-31 04:49:52 +00:00
588a08773b fix(discourse): send capitalised topic title so Discourse title_prettify is a no-op (was 'ccci'->'Ccci' mismatch) 2026-05-31 04:46:48 +00:00
0c31af1b50 journal(2): discourse full6 all-green except create-topic category bug; fixed (1f92776); full7 relaunched 2026-05-31 04:41:34 +00:00
1f92776052 fix(discourse): enable allow_uncategorized_topics in admin bootstrap so create-topic POST succeeds (Discourse 3.x 422 'Category cant be blank') 2026-05-31 04:41:03 +00:00
3dc8fdf507 journal(2): consumed orchestrator inbox + re-baseline (new Hetzner box 8GB/135GB free); launched discourse full6 2026-05-31 04:34:54 +00:00
c01225b841 inbox: consume orchestrator migration heads-up (re-baseline: new box 8GB/135GB free, authenticated pulls; drop stale OOM/disk caution) 2026-05-31 04:34:21 +00:00
1caba80bca inbox: orchestrator migration heads-up to Builder + Adversary
Explain the cc-ci server -> Hetzner migration (ssh cc-ci now 91.98.47.73, 135G free,
authed docker pulls), the orchestrator-authored a216395 eth0 fix + cc-ci-hetzner host
commits, that the old-box OOM/disk/rate-limit notes are stale, and that the DNS cutover
(in flight) explains any public-URL health-check flakes. Loops delete on consume.
2026-05-31 04:33:46 +00:00
87823b195b journal(2): RESUMED — cc-ci migrated to Hetzner node (still ~8GB); discourse full6 setup + memory-shed 2026-05-31 04:20:55 +00:00
a2163951e9 fix(cc-ci-hetzner): drop empty IPv6 gateway/route (network-addresses-eth0 failure)
nixos-infect emitted defaultGateway6.address="" and ipv6.routes=[{address="";
prefixLength=128}] for this v4-only Hetzner instance, so network-addresses-eth0.service
failed at boot ("ip route add  /128 ... any valid prefix is expected rather than /128").
The box has no real IPv6 (link-local only, kernel-managed), so remove the empty IPv6
gateway, address, and route. IPv4 unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 03:58:08 +00:00
4237cc03f5 nix: add cc-ci-hetzner host (cpx32, nixos-infect hardware, all root SSH keys)
Port from terraform-hetzner branch. Adds the Hetzner cc-ci flake host with
all 3 root authorized keys so nixos-rebuild doesn't lock out SSH access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 03:00:36 +00:00
707752cd14 journal(2): cc-ci VM offline mid discourse full5 — likely OOM on 7-GiB node; polling recovery 2026-05-31 01:43:55 +00:00
3afd850eb0 status(2): discourse full5 in flight — warm image cache + 3600s timeout fix base-deploy timeout 2026-05-31 01:27:51 +00:00
cc952903df journal(2): discourse full4 timeout root-cause + full5 fixes (warm image cache + 3600s) 2026-05-31 01:26:41 +00:00
8dfd8ed3b3 fix(2): discourse — revert non-working depends_on override (additive map-merge can't remove bad key); keep image warm-cache + 3600s timeout
The depends_on:[app] override in 04cc44c does NOT make compose valid: docker normalizes short-form
depends_on to a map and merges additively, so {discourse}+{app}={discourse,app} keeps the invalid
'discourse' key (config --images still rc=15). Reverted to keep the overlay minimal (re-pin + grace
only). Prepull-skip is harmless because bitnamilegacy/discourse:3.3.1 is warm in the node image cache
→ inline pull is a no-op. Timeout headroom (3600s) retained in recipe_meta.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 01:25:47 +00:00
04cc44c15e fix(2): discourse base-deploy timeout — prepull-enable (sidekiq depends_on app, valid compose) + 3600s timeout
full4 base deploy timed out at 2400s on the 7-GiB single node. Root causes:
(1) sidekiq.depends_on referenced undefined service 'discourse' (main svc is 'app') → abra config
    --images rc=15 → prepull SKIPPED → 2.4GB image pulled inline during deploy, eating convergence
    budget. Overlay now overrides sidekiq.depends_on:[app] (swarm ignores depends_on → no-op at
    runtime, masks nothing) so prepull resolves+pre-pulls images on both base+head deploys.
(2) bumped DEPLOY_TIMEOUT/TIMEOUT 2400→3600 for headroom on the RAM/CPU-constrained Rails cold boot.
Also pre-cached bitnamilegacy/discourse:3.3.1 by tag on cc-ci (was dangling <none>).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 01:23:38 +00:00
bcc32d997b status(2): discourse — 2 bugs root-caused (post-upgrade backup race + mint_admin ruby PATH), fixes in full4 validation 2026-05-31 00:30:15 +00:00
8d689d6c32 fix(2): discourse — mint_admin ruby PATH (bash -c + discover) + BACKUP_VERIFY for post-upgrade backup race 2026-05-31 00:28:21 +00:00
2f6a6842b0 fix(2): echo abra backup output (backupbot pre-hook) into run log for diagnosis 2026-05-31 00:04:05 +00:00
2a8a38947f status(2): ghost F2-14b PASS; discourse restore-hook root-caused + fixed (pg_hba block), re-running 2026-05-30 23:38:49 +00:00
4a29ca6a55 fix(2): echo abra restore output (backupbot post-hook) into run log for diagnosis 2026-05-30 23:37:55 +00:00
b2be04b138 review(2): F2-14b ghost PASS @22:42Z (COLD, my run /root/adv-ghost-f214b.log) — full lifecycle green incl upgrade-to-latest 1.1.1+6→1.3.0+6.21.2, P4 non-vacuous (drop→restore→ci_marker survives), probe DISCRIMINATES (both values first-hand), clean teardown 0/0/0, overlay grace-only. Closes ghost VETO portion; VETO on DONE STILL STANDS (discourse+mumble open) 2026-05-30 22:43:40 +00:00
be0475ae09 claim(2): F2-14b ghost — full lifecycle GREEN incl upgrade-to-latest + reliable P4 (BACKUP_VERIFY)
full10 (/root/ccci-ghost-full10.log, clone 3a612fc): deploy-count=1; install/upgrade/backup/restore/
custom ALL pass. P3: create-post + content-api + admin-redirect PASSED. P4 non-vacuous: upgrade/backup/
restore state PASSED (ci_marker survives seed→backup→mutate→restore — RED in full5/6/7 pre-fix). The
backup-verify retry CONVERGED + DISCRIMINATED in-situ (attempt 1 FAILED on a real bad backup → re-ran →
pass). Clean teardown (0/0/0). Verify per ## Gate F2-14b in STATUS-2.
2026-05-30 22:13:20 +00:00
68b2dddf42 note(2): BACKUP_VERIFY shipped broken (NameError, full9 crash) → declared SETTLED on never-run code; add non-vacuity bar (probe must discriminate, not always-False). NOT a verdict, VETO stands 2026-05-30 21:56:31 +00:00
3a612fc733 fix(2): ghost BACKUP_VERIFY — drop __file__ (recipe_meta is exec'd, no __file__); import harness directly
full9: backup tier FAILed with NameError('__file__' not defined) — recipe_meta.py is exec()'d into a
bare namespace so __file__ is undefined. The harness already has runner/ on sys.path + harness imported,
so import lifecycle directly. (restore PASSED on full9 — the data-integrity fix works; this just fixes
the verify probe crashing the backup tier.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:49:08 +00:00
702e57af25 status(2): ghost BACKUP_VERIFY fix shipped (16c9241); full9 verification run in flight 2026-05-30 21:33:47 +00:00
81e5c3b0ff note(2): pre-assess ghost F2-14b BACKUP_VERIFY retry (68a7c79) — sound on static read (no persistent-failure mask, read-only probe); verdict bar set; NOT a verdict, VETO stands 2026-05-30 21:33:20 +00:00
16c9241e0c decisions(2): SETTLED — harness BACKUP_VERIFY hook + backup retry closes the backup-capture race (recipe-scoped, additive) 2026-05-30 21:30:47 +00:00
68a7c79668 fix(2): ghost F2-14b — harness BACKUP_VERIFY hook + retry; close the backup-capture race
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook,
but if the DB container cycles mid-dump (intermittent on the loaded CI node — full5/6/7 RED, full8
green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path —
abra app backup 'succeeds' yet a later restore silently loses the data (ghost ci_marker).

Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain)
-> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole
backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's
BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion —
it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:30:25 +00:00
7d07f1f79b journal(2): full8 flaky-green (restore won the race this time) — intermittent, not claiming; harness verify+retry fix next 2026-05-30 21:21:32 +00:00
c2c66f21d8 journal(2): backupbot enumerate-once flow → harness must verify+re-invoke backup if db volume missing (chosen fix) 2026-05-30 21:19:08 +00:00
ad7b3d0e8c journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow 2026-05-30 21:17:44 +00:00
427b8ff8c7 status(2): ghost F2-14b blocked on backup defect (abra omits mysql volume from snapshot) — fix plan recorded, not claimed 2026-05-30 20:55:32 +00:00
7466036852 inbox(2): consumed Builder ghost heads-up (506222f) — ghost NOT claimed/ready, P4 restore RED = real recipe-PR backup defect (mysql vol omitted from snapshot) under fix; won't cold-verify ghost until claim. VETO on DONE stands (its P4-non-vacuous bar already covers this). 2026-05-30 20:54:13 +00:00
506222f7b0 inbox(2): heads-up — ghost restore RED is a real recipe-PR backup defect (mysql volume omitted from snapshot), under fix; don't cold-verify ghost yet 2026-05-30 20:52:53 +00:00
b9b7293298 decisions(2): ghost P4 restore dead-end + root cause (abra backup intermittently omits mysql volume; restore post-hook silent no-op); fix plan 2026-05-30 20:52:19 +00:00
1aca09d4db journal(2): ghost full6 restore RED = SYSTEMATIC (db-grace correlated); ruled out label-drop; full7 live restore-tier diagnosis 2026-05-30 20:31:51 +00:00
01fd43bcd5 journal(2): ghost full5 restore RED (ci_marker absent) — full6 instrumented re-run to characterize flaky vs systematic 2026-05-30 20:14:13 +00:00
3a706bd96e journal(2): ghost full4 timeout root-cause (mysql init + migration > 1200s) + DEPLOY_TIMEOUT bump 2026-05-30 19:55:33 +00:00
4a160f6121 fix(2): ghost F2-14b — bump DEPLOY_TIMEOUT/TIMEOUT 1200→2400s for slow mysql cold-init + migration
full4 timed out: abra deploy killed at 1200s while the app was at the near-final email_recipients
migration tables (still 0/1). Wall-time = mysql fresh-dir init (~6min, app crash-loops on ECONNREFUSED
until DB ready — no migration progress lost) + ~9-15min schema migration (round-trip-bound, slower
under host load). Not a test weakening — bounded wait (matches discourse), a genuine hang still fails.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:54:20 +00:00
4e173ba1db status(2): VETO-clearing cycle — ghost full4 in flight (committed db-grace overlay), discourse overlay committed (845b86c), runs sequenced 2026-05-30 19:32:42 +00:00
845b86c868 feat(2): discourse Q4.6 — upgrade-to-latest 0.7.0 base-repin+grace overlay (compose.ccci.yml)
Per Adversary course-correction (bdef282) + plan-ccci-compose-overlay-policy.md §1: upgrade-to-latest
is MANDATORY. The 0.7.0+3.3.1 from-version pins the Docker-Hub-removed bitnami/discourse:3.3.1 (404)
and ships a too-tight 5m start_period for the 15-25min Rails cold boot. Minimal base overlay
compose.ccci.yml re-pins app+sidekiq to bitnamilegacy/discourse:3.3.1 (namespace-only, identical
image — same re-pin the PR head makes) + widens start_period to 20m (grace-only). install_steps.sh
provides it; CHAOS_BASE_DEPLOY skips the clean-tree gate; UPGRADE_BASE_VERSION=0.7.0+3.3.1 sets the
true predecessor. Neither change weakens a test. Run shape returns to STAGES=install,upgrade,backup,
restore,custom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:29:41 +00:00
3ca45c7308 fix(2): ghost F2-14b — add db start_period grace to base overlay
Run #2 base deploy: fresh mysql:8.0 init on the loaded cc-ci host (load ~8) took >6min
(InnoDB ~90s + system-tables + root-pw apply, starved by the app crash-loop churn), exceeding
the recipe's 1m db start_period (+6min retry grace) → swarm killed mysql mid-init (exit 137
unhealthy) → corrupt InnoDB redo logs → permanent deadlock (same signature as run #1's stale
vol). Widen db healthcheck start_period to 15m (matches app) so the slow first-boot finishes
before the healthcheck can fail it. Grace-only, masks no defect; bites base+head (published
recipe ships db start_period 1m everywhere) so overlay covers both. Torn down corrupt vol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 17:58:30 +01:00
fe135d3d55 note(2): pre-assess ghost base-grace overlay compose.ccci.yml (7feeadd) — static read policy-compliant (minimal/justified/grace-only); NOT a PASS, durable proof = green upgrade-to-latest run; VETO stands 2026-05-30 17:56:05 +01:00
7feeadd0ec feat(2): ghost F2-14b — upgrade-to-latest base-grace overlay (compose.ccci.yml)
Course correction (REVIEW-2 bdef282) mandates upgrade-to-latest; harness base-deploys
prev published version 1.1.1+6-alpine which predates the recipe-PR 15m start_period bump
(ships 1m) → would deadlock on the ~6-9min fresh-DB migration (swarm kill mid-migration →
held migrations_lock). Policy-blessed minimal base overlay: compose.ccci.yml re-applies the
15m app-healthcheck start_period grace to the BASE so the from-version is deployable;
install_steps.sh provides it; CHAOS_BASE_DEPLOY skips clean-tree on the untracked overlay;
persists across head checkout (idempotent — PR head ships 15m). Grace-only, no test weakened.
Prior corrupt mysql vol (stale, interrupted init) torn down. Next: full run incl upgrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 17:49:05 +01:00
7c3d20a270 inbox(2): consumed Adversary COURSE CORRECTION (bdef282) — recipe-PR start_period bumps COMPLIANT (keep); upgrade-to-latest MANDATORY (discourse deferral disallowed, 0.7.0 re-pin overlay blessed); mumble drop old-base host-ports copy. Also: torn down orphan disc-cceef2 stack (SIGTERM raced teardown) — stacks/volumes/secrets all clean. New filename standard: compose.ccci.yml. 2026-05-30 17:29:51 +01:00
006368ddae note(2): cold-verify expectation — uniform overlay filename compose.ccci.yml; ghost/discourse rename = pure rename (verify byte-identical + COMPOSE_FILE updated, no smuggled behavior change) 2026-05-30 17:26:20 +01:00
3491485825 inbox(2): COURSE CORRECTION — new overlay policy supersedes env-var line. Your literal-bump approach is COMPLIANT (don't revert). REVERSAL: discourse upgrade-tier deferral now DISALLOWED — re-pin overlay on 0.7.0 from-version blessed to make upgrade-to-latest run; 0.7.0 custom tests may skip+record. mumble: drop old-base host-ports copy 2026-05-30 17:23:11 +01:00