diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index c08353a..d2a43fb 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -932,3 +932,32 @@ script, `backup` = pg_dump|gzip → backup.sql, `backupbot.backup.volumes.postgr
 (archive only the dump), `restore` post-hook = terminate connections + DROP DATABASE FORCE + createdb +
 reimport. postgres:15 plain dump → no special handling (mechanism already proven generic on immich).
 Validated: `RECIPE=mattermost-lts PR=1` full lifecycle GREEN, restore tier PASSES (ci_marker survives).
+
+## 2026-05-30 — ghost MySQL cold-boot: healthcheck start_period overlay + abra +U marker normalization
+
+**Context (Phase 2 Q4.4 ghost).** The current ghost recipe (1.x+6.21.2) uses **MySQL** (not sqlite).
+Its fresh-DB first boot runs a ~6-9min schema migration (dozens of CREATE TABLEs, each a separate
+MySQL round-trip → round-trip-bound, NOT CPU-bound; reproduced on both 2- and 4-vCPU). The upstream
+recipe healthcheck `start_period:1m` (+10×30s ≈ 6min grace) kills the still-migrating task, leaving a
+stale `migrations_lock` → every later task deadlocks (`MigrationsAreLockedError`).
+
+**Decision 1 — cc-ci deploy overlay for the healthcheck (NOT a recipe change, NOT a test change).**
+`tests/ghost/compose.ccci-health.yml` raises ONLY the app healthcheck `start_period` to 900s, applied
+via `recipe_meta.EXTRA_ENV COMPOSE_FILE=compose.yml:compose.ccci-health.yml` + `install_steps.sh`
+(copies the overlay into the recipe checkout) + `CHAOS_BASE_DEPLOY=True` (the untracked overlay trips
+abra's pinned clean-tree check). Rationale: failures are ignored during start_period but a PASS still
+marks healthy immediately, so the fresh migration finishes + releases the lock with no other change;
+the app's real healthcheck still gates readiness — no assertion weakened. Only the install tier's
+fresh migration needs it (the upgrade redeploys on the populated DB → fast boot). The overlay is an
+untracked file, so `git checkout -f` (upgrade re-checkout) preserves it → COMPOSE_FILE keeps
+resolving across install AND upgrade. This is the general pattern for any recipe whose upstream
+healthcheck is too tight for its own first-boot migration on cc-ci infra.
+
+**Decision 2 — assert_upgraded normalizes abra's `+U` working-tree marker (shared harness).**
+A cc-ci overlay sitting in the recipe checkout as an untracked file makes abra stamp
+`coop-cloud..chaos-version='+U'` (U=untracked). The COMMIT still equals head_ref (HC1
+satisfied) but the `+U` suffix broke `assert_upgraded`'s exact-prefix match → spurious upgrade FAIL.
+Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.split('+',1)[0]`).
+HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
+redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
+cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.
diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md
index 1a9e87f..ac6b418 100644
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -1315,3 +1315,36 @@ or a sqlite-aware restore hook). Deploy-test first to find out (don't assume).
 
 Checkpointing here (node clean, no gate pending — all 3 claims this session PASSed) to take bluesky
 fresh next cycle; the analysis above lets it start at the overlay, not the investigation.
+
+## 2026-05-30 — Q4.4 ghost: P3 create-post GREEN + P4 non-vacuous; migration-lock deadlock + +U fixes
+
+Authored ghost P4 overlays (MySQL `ci_marker` in the `ghost` DB — recipe is MySQL not sqlite; stale
+comment) + §4.3 create-post round-trip (cookie-aware Admin API client `_ghost.py`). Run-4 results
+(`/root/ccci-ghost-4.log`): deploy-count=1; install/backup/custom PASS; `test_create_post_roundtrip
+PASSED (22s)`; P4 upgrade+backup markers PASS; restore RED (real recipe gap — no reimport-on-restore).
+
+**Why two deploys failed first (NOT test issues):**
+1. **migrations_lock deadlock.** Ghost's fresh-DB first boot runs a ~6-9min schema migration (dozens
+   of CREATE TABLEs, each a separate MySQL round-trip — round-trip-bound, NOT CPU-bound: hit on BOTH
+   2- and 4-vCPU). The recipe healthcheck `start_period:1m` (+10×30s ≈ 6min grace) marks the still-
+   migrating task unhealthy → swarm kills it mid-migration → leaves `migrations_lock.locked=1,
+   released_at=NULL` → every later task boots, sees the held lock, refuses (`MigrationsAreLockedError`)
+   → permanent deadlock. Bumping the abra TIMEOUT does NOT help (the lock never clears). FIX: a cc-ci
+   DEPLOY overlay `compose.ccci-health.yml` raising the app healthcheck start_period to 900s (failures
+   ignored during it; a PASS still marks healthy at once) so the fresh migration finishes + releases
+   the lock. Wired via recipe_meta COMPOSE_FILE + install_steps.sh + CHAOS_BASE_DEPLOY. NOT a test
+   change — the real healthcheck still gates readiness. Validated: migration ran past the old kill
+   point, install converged 1/1. (Operator bumped the VM 2→4 vCPU mid-session; didn't fix this — the
+   migration is round-trip-bound — but made everything else snappier.)
+2. **`+U` chaos-version marker.** The untracked overlay makes abra stamp `chaos-version='+U'`
+   (U=untracked). The commit equals head_ref (HC1 satisfied) but `+U` broke assert_upgraded's exact-
+   prefix match → spurious upgrade FAIL. FIX: strip the working-tree-state marker before the commit
+   match (commit identity still enforced; HC1 preserved). mumble dodged this only because its overlay
+   is tracked natively in newer versions; cc-ci overlays generally aren't → general harness fix.
+
+**P4 restore gap (real recipe defect → recipe-PR):** ghost db service has `mysqldump --tab` backup
+pre-hook but NO `backupbot.restore.*` hook, and the mysql data volume isn't backupbot-labelled → the
+dump is restored to disk but never reimported → dropped `ci_marker` doesn't return. Non-vacuous
+(backup PASS with marker, restore RED). Same class as immich#1 / mattermost-lts#1. FIX = recipe-PR
+adding a mysql dump+reimport hook (mirror mattermost `pg_backup.sh` → `mysql_backup.sh`). Ghost not
+yet mirrored on gitea (404) → mirror first (plan §0b), then PR, then final green run, then claim.
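
Reviewer sketch of the `+U` normalization from Decision 2 (not part of this diff). The only expression taken from the change is `chaos.split('+',1)[0]`. The function names, the direction of the prefix match (abbreviated chaos commit checked against a full `head_ref`), and the sample hashes are assumptions for illustration; the real `assert_upgraded` may differ.

```python
def normalize_chaos_version(chaos: str) -> str:
    """Drop the working-tree-state suffix (e.g. '+U') from a chaos-version label.

    Uses the expression quoted in Decision 2: everything from the first '+'
    onward is treated as the working-tree marker and discarded.
    """
    return chaos.split('+', 1)[0]


def commit_matches_head(chaos: str, head_ref: str) -> bool:
    """Hypothetical stand-in for the commit check inside assert_upgraded.

    Assumption: the chaos label carries an abbreviated commit and head_ref is
    the full commit, so the check is a prefix match on head_ref.
    """
    commit = normalize_chaos_version(chaos)
    # An empty commit would be a prefix of every head_ref, so reject it
    # explicitly to keep commit identity (HC1) enforced.
    if not commit:
        return False
    return head_ref.startswith(commit)


if __name__ == "__main__":
    head = "1a2b3c4d5e6f7a8b9c0d"  # made-up head_ref for the demo
    print(commit_matches_head("1a2b3c4d+U", head))  # overlaid deploy of head
    print(commit_matches_head("deadbeef+U", head))  # stale prev-checkout deploy
```

Stripping only the suffix keeps the commit comparison intact: an overlaid redeploy of a stale checkout still carries the previous commit and fails the match, which is the HC1 property the decision calls out.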