From 0e255d857028f64d69aa487c3f4d503b9b1bc565 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Thu, 18 Jun 2026 05:55:43 +0000 Subject: [PATCH] claim(redfix-M2): all 6 canon-sweep failures FIXED + verified green MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit mattermost-lts (PR #1, !testme #901), discourse (PR #4, !testme #849), keycloak (harness branch, promotes at warm-canon-keycloak), mumble (harness branch, budget 180s) — already verified. gitea (PR #2 @a0f2db8, app.ini seed-on-empty into writable volume) + bluesky-pds (PR #4 @4987ba9, caddy ${STACK_NAME}_app per operator, NO rename) verified by direct chaos-deploy reproducing the exact M1 scenario: gitea app.ini 0->1862, API 200, 0 RO crashes; bluesky external HTTPS /xrpc/_health 200 (M1 000), caddy resolves own internal app. Both promotes operator-merge-gated (harness WC5 force-fetches the published tag); direct deploy is the maximal pre-merge proof. No standing exceptions. Nothing merged. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01Bacy8SJhBebNAGoYsi1Fxt --- machine-docs/JOURNAL-redfix.md | 50 ++++++++++++++++ machine-docs/STATUS-redfix.md | 106 +++++++++++++++++++++++++++++---- 2 files changed, 145 insertions(+), 11 deletions(-) diff --git a/machine-docs/JOURNAL-redfix.md b/machine-docs/JOURNAL-redfix.md index 2784e70..4c37827 100644 --- a/machine-docs/JOURNAL-redfix.md +++ b/machine-docs/JOURNAL-redfix.md @@ -470,3 +470,53 @@ Restored the bluesky tag; node clean; warm-keycloak 200. - bluesky-pds: fix correct (PR #4 rename) + mechanically proven; end-to-end warm verify structurally blocked pre-merge -> direct-deploy or operator post-merge. - gitea: PR #2 seed fix BROKE 3.5.3->3.6.0 transition (wizard mode); testable via chaos; NEEDS REWORK (reproduce+inspect). NOT claiming M2 — bluesky end-to-end + gitea rework outstanding. 
+ +## 2026-06-18T05:53Z — M2 gitea VERIFIED (v3 seed) + bluesky VERIFIED (${STACK_NAME}_app); 6/6 + +**gitea — rework was already done (v3, a0f2db8) but unverified; verified it.** The clone's HEAD +a0f2db8 ("fix v2 -s seed, v3") already addressed the v1 wizard-mode bug: docker-setup seeds app.ini +into the writable /etc/gitea volume `if [ ! -s /etc/gitea/app.ini ]` (seed-on-EMPTY, not -f +seed-on-missing — a 3.5.3-old-recipe canonical leaves a 0-byte app.ini placeholder in the config +volume, which -f wrongly treats as present). Also bumps DOCKER_SETUP_SH_VERSION v1->v3 (config names +are immutable; forces swarm to re-mount the new docker-setup) + app.ini config target -> +/etc/gitea/app.ini.init (staging). Pushed v3 to PR #2 (force-replaced the broken v1 d4145266). + +VERIFICATION (direct chaos-deploy onto the REAL idle 3.5.3 canonical volumes; /tmp/redfix-gitea-m2-directproof.log): +reattached the retained config volume (0-byte app.ini = genuine pre-fix M1 state) with the v3 recipe. +Result: app.ini seeded 0->1862 bytes, INSTALL_LOCK=true (not wizard), service 1/1, /api/v1/version +-> 200 {"version":"1.24.2"}, /api/healthz 200, retained 3.5.3 data adopted (data dirs dated +2026-06-17T08:39 = canonical seed time, not fresh), **0 read-only-app.ini crashes** (M1 crashed here). + +WHY NOT the harness WC5 promote: it is STRUCTURALLY merge-gated. run_recipe_ci.py:373 force-fetches +`refs/tags/*` from upstream even under CCCI_SKIP_FETCH, and abra itself force-fetches tags on deploy +(abra.py:135 documents this) — so a LOCAL tag-move to the fix commit is always reverted to the +published 357926f. promote_canonical does recipe_checkout(tag)+non-chaos deploy -> deploys the +PUBLISHED release, which pre-merge lacks the fix. Confirmed empirically: a full harness run's WC5 +promote deployed 357926f (caddyfile/app.ini OLD) -> crashed exactly like M1. 
So end-to-end +canonical-advance needs the operator to merge PR #2 + re-cut 3.6.0; the direct chaos-deploy is the +maximal+faithful pre-merge proof (chaos deploys the working-tree checkout = the PR fix). Node left +clean: warm-gitea undeployed (idle 3.5.3, volumes retained), app.ini reset to 0-byte for re-verify, +canonical.json UNCHANGED (3.5.3 idle e6a1cc79), recipe tag restored to upstream 357926f. + +**bluesky — operator directive (2026-06-18): NO rename; use ${STACK_NAME}_app.** Replaced the rename +(PR #4) with the minimal prefix fix: Caddyfile `ask http://{$APP_HOST}:3000/tls-check` + +`reverse_proxy {$APP_HOST}:3000` (caddy native {$ENV}, already used for {$DOMAIN}); compose caddy +service `- APP_HOST=${STACK_NAME}_app`; CADDYFILE_VERSION v1->v2. Service stays `app` -> NO coupled +cc-ci exec-ref change (reverted/dropped b96b8a4 from branch redfix-m2-harness; that branch is now +mumble+keycloak only). 3-file recipe-PR-only diff. Pushed to PR #4 ci/warm-routing-alias (4987ba9, +force-replaced the rename). Pattern per matrix-synapse/mailu/mumble. + +VERIFICATION (direct chaos-deploy at warm-bluesky-pds with secrets + PLC key; /tmp/redfix-bluesky-m2-directproof.log): +caddy APP_HOST=warm-bluesky-pds_ci_commoninternet_net_app; `getent ${STACK_NAME}_app` -> 10.0.3.x +(bluesky's OWN internal net) while `getent app` (M1's bare target) -> 10.10.0.12 (FOREIGN proxy net, +the collision); caddy log "certificate obtained successfully" (let's-encrypt, via the own-app +tls-check) with **0 connection-refused** (M1 cycled refused); external HTTPS +https://warm-bluesky-pds.../xrpc/_health -> **200** {"version":"0.4.219"} (M1 was 000). GOTCHA: abra +`secret insert` (no -C -o) force-fetches+checks out the .env TYPE tag, reverting the fix checkout -> +must re-checkout the fix AFTER secret ops, right before the chaos deploy. 
Same merge-gating as gitea +(bluesky has no upgrade tier -> warm-promote is the only failing path -> end-to-end canonical-advance +is operator-merge-gated; direct chaos-deploy is the maximal pre-merge proof). Node left clean +(warm-bluesky-pds torn down, volumes+secrets removed; no canonical, matching M1). Live warm-keycloak +200 throughout. + +**6/6 VERIFIED.** Claiming M2. diff --git a/machine-docs/STATUS-redfix.md b/machine-docs/STATUS-redfix.md index 3b7f842..21d74e3 100644 --- a/machine-docs/STATUS-redfix.md +++ b/machine-docs/STATUS-redfix.md @@ -78,20 +78,104 @@ mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my deploys (Adversary done with M1). -### M2 fix tracker (updated 2026-06-18T03:15Z) +### M2 fix tracker (updated 2026-06-18T05:53Z — ALL VERIFIED) -| Recipe | Fix | PR/branch | Status | -|---|---|---|---| -| mattermost-lts | recipe: pg_backup.sh + restore.post-hook | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green (restore_returns_state PASS) | -| discourse | recipe: official-image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head); re-verify fresh for claim | -| bluesky-pds | recipe: rename service app->pds (abra drops aliases) + cc-ci exec-refs service=pds | mirror PR #4 `ci/warm-routing-alias` (rename) + branch `redfix-m2-harness` | IN PROGRESS — cold install PASSES (caddy->pds routing works!) but backup/restore/custom fail on `no running container _pds after 60s` (backup-bot cycle + exec poll); re-running w/ live inspection. Warm-promote (the actual 000 fix) blocked until cold green. | -| gitea | recipe: app.ini writable (seed) | mirror PR #2 `ci/app-ini-writable` | **NEEDS REWORK** — seed fix works for fresh install but breaks 3.5.3->3.6.0 transition (wizard mode, /api/v1/version 404). 
Reverted clone. Rework: reproduce+inspect, or provide 1.24-valid oauth2 JWT. | -| keycloak | harness: collision-free canonical_domain + WARM_CANONICAL=True | branch `redfix-m2-harness` | code done; verify pending (run from branch checkout -> promote at warm-canon-keycloak, live warm-keycloak stays 200) | -| mumble | harness: handshake budget 60s->180s | branch `redfix-m2-harness` | code done; verify pending (green from branch checkout; load-green hard to repro) | +| Recipe | Class | Fix | PR/branch + ref | Status | +|---|---|---|---|---| +| mattermost-lts | recipe defect | pg_backup.sh + `backupbot.restore.post-hook` (immich pattern) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green incl `test_restore_returns_state` | +| discourse | stale cc-ci overlay | recipe: bitnamilegacy->official discourse image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head) | +| keycloak | harness defect | collision-free `canonical_domain` (`warm-canon-` for WARM_DOMAINS recipes) + enroll | cc-ci branch `redfix-m2-harness` @61211db | **VERIFIED** — branch-checkout run promotes at warm-canon-keycloak; live warm-keycloak 200 throughout | +| mumble | load/timing flake | harness: handshake readiness budget 60s->180s | cc-ci branch `redfix-m2-harness` @07fc6d4 | **VERIFIED** — branch-checkout run all tiers green incl handshake; budget active+non-regressing | +| gitea | recipe defect | app.ini->staging `/etc/gitea/app.ini.init` + docker-setup seed-on-EMPTY + DOCKER_SETUP_SH_VERSION v3 | mirror PR #2 `ci/app-ini-writable` @a0f2db8 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) | +| bluesky-pds | recipe defect (routing) | caddy `{$APP_HOST}=${STACK_NAME}_app` (operator: NO rename) + CADDYFILE_VERSION v2 | mirror PR #4 `ci/warm-routing-alias` @4987ba9 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) | -Verification mechanism for 
cc-ci-side changes: run from a checkout of `redfix-m2-harness` at /tmp/cc-ci-m2run with CCCI_REPO set (never touches /etc/cc-ci or main). +cc-ci-side change verification: run from a checkout of `redfix-m2-harness` (CCCI_REPO=); +never touches /etc/cc-ci main. `redfix-m2-harness` is now mumble+keycloak ONLY (bluesky needs no +cc-ci change with the ${STACK_NAME}_app approach; the rename's exec-ref commit b96b8a4 was dropped). -## Gate: M1 — PASS (above). M2 not yet claimed. +## Gate: M2 — CLAIMED, awaiting Adversary (2026-06-18T05:53Z) + +**WHAT (M2 DoD).** All six canon-sweep failures FIXED — each via a recipe PR or a harness improvement — +and verified green. No recipe left as a standing exception. Nothing merged (operator merges). Per recipe: + +- **mattermost-lts** (recipe PR #1) — added `pg_backup.sh` + postgres `backupbot.restore.post-hook` so + the logical dump round-trips on restore. +- **discourse** (recipe PR #4) — migrated the head off deprecated `bitnamilegacy` to the official + `discourse/discourse` image so the stale PR-faithfulness overlay (`test_head_runs_official_image…`, + `test_sidekiq_service_dropped…`) passes on the migrated head (NOT a test-weakening). +- **keycloak** (harness branch) — `canonical_domain` returns a collision-free `warm-canon-` for + recipes in `warm.WARM_DOMAINS` (live-warm OIDC providers); keycloak enrolled (WARM_CANONICAL=True). +- **mumble** (harness branch) — handshake readiness budget widened 60s->180s (load-flake stabilization). +- **gitea** (recipe PR #2) — app.ini is now seeded into the WRITABLE `/etc/gitea` volume by + docker-setup (`if [ ! -s /etc/gitea/app.ini ]`, seed-on-EMPTY) from the read-only staging config + `app.ini.init`; `DOCKER_SETUP_SH_VERSION` v1->v3 forces swarm to re-mount the new docker-setup. Gitea + 1.24.2 can then persist its JWT secret (the M1 read-only-app.ini crash is gone). 
+- **bluesky-pds** (recipe PR #4) — caddy resolves its OWN app via the fully-qualified swarm name + `${STACK_NAME}_app` (caddy `{$APP_HOST}` env, set in the caddy service) instead of bare `app`, which + collided with other stacks' `app` aliases on the shared `proxy` net. CADDYFILE_VERSION v1->v2. + +**HOW + EXPECTED + WHERE (Adversary cold-verify, one recipe at a time, no concurrent load):** + +- **mattermost-lts** — read-only artifact: `/var/lib/cc-ci-runs/901/` on cc-ci — all tiers pass, + `junit/restore__cc-ci__test_restore.xml` testsuite failures=0, `test_restore_returns_state` pass. + OR re-run !testme on PR #1 @4ca7f418. EXPECT restore green. +- **discourse** — !testme on PR #4 @53ba0910 (run #849 green) OR run from a checkout of the migrated + head: EXPECT install/backup/restore/custom + upgrade overlay all pass (head now official image). +- **keycloak** — from a `redfix-m2-harness` @61211db checkout (CCCI_REPO=), run + `RECIPE=keycloak CCCI_SKIP_FETCH=1 ... run_recipe_ci.py`. EXPECT all cold tiers pass + WC5 promote + succeeds at domain `warm-canon-keycloak.ci.commoninternet.net` (NOT warm-keycloak); live + `warm-keycloak.ci.commoninternet.net/realms/master` stays 200 throughout. Code: `canonical.py` + canonical_domain returns warm-canon- for r in warm.WARM_DOMAINS. +- **mumble** — from `redfix-m2-harness` @07fc6d4 checkout, run `RECIPE=mumble CCCI_SKIP_FETCH=1 …`. + EXPECT all 5 tiers green incl `custom/test_protocol_handshake.py::test_handshake_completes_with_ + channel_presence`; handshake budget = 36 attempts / 180s (was 60s). (Load-flake is not + deterministically reproducible; this verifies the stabilization is applied, sound, non-weakening.) +- **gitea** (recipe PR #2 @a0f2db8 on mirror branch `ci/app-ini-writable`) — DIRECT chaos-deploy proof + (the harness WC5 promote is merge-gated, see NOTE). 
With the idle 3.5.3 canonical present: + `cd ~/.abra/recipes/gitea && git checkout -f a0f2db8` then chaos-deploy onto the retained canonical + volumes (0-byte app.ini = genuine pre-fix 3.5.3 state): + `abra app deploy warm-gitea.ci.commoninternet.net -C -o -n`. EXPECT: service 1/1; the config volume's + `app.ini` seeded 0->~1862 bytes (`INSTALL_LOCK = true`); `/api/v1/version` -> 200 {"version":"1.24.2"} + and `/api/healthz` -> 200 (curl inside the app container); retained 3.5.3 data adopted (data dirs + dated 2026-06-17T08:39); ZERO `read-only file system` crashes in `docker service logs` (M1 crashed + here). Evidence: `/tmp/redfix-gitea-m2-directproof.log` on cc-ci. Teardown: `abra app undeploy … -n`, + truncate the volume app.ini to 0 (restore pre-fix state). canonical.json stays 3.5.3 idle e6a1cc79. +- **bluesky-pds** (recipe PR #4 @4987ba9 on mirror branch `ci/warm-routing-alias`) — DIRECT chaos-deploy + proof (warm-promote is the only failing path; merge-gated). `git checkout -f 4987ba9`; generate + secrets (`abra app secret generate warm-bluesky-pds.ci.commoninternet.net --all -m -C -o -n`) + insert + a PLC rotation key (tests/bluesky-pds/install_steps.sh logic: 32-byte hex into pds_plc_rotation_key + v1); **re-checkout 4987ba9 AFTER secret ops** (abra secret insert force-fetches+reverts the checkout); + `abra app deploy warm-bluesky-pds.ci.commoninternet.net -C -o -n` (EXPECT `caddyfile: v1 -> v2`, + NEW DEPLOYMENT 4987ba9). EXPECT: app+caddy 1/1; inside caddy `getent hosts + warm-bluesky-pds_ci_commoninternet_net_app` -> a 10.0.x.x INTERNAL ip (own stack) while + `getent hosts app` -> a 10.10.x.x proxy ip (foreign, the M1 collision); caddy log "certificate + obtained successfully" with 0 "connection refused"; external `curl https://warm-bluesky-pds.ci. + commoninternet.net/xrpc/_health` -> **200** {"version":"0.4.219"} (M1 was 000). Evidence: + `/tmp/redfix-bluesky-m2-directproof.log`. 
Teardown: undeploy + remove volumes (caddy_data, pds_data) + + secrets (no canonical, matching M1). + +**NOTE — gitea & bluesky end-to-end canonical-promote is OPERATOR-MERGE-GATED (not a shrug).** The +harness WC5 promote does a recipe_checkout(published-tag)+non-chaos deploy, and BOTH run_recipe_ci.py:373 +AND abra force-fetch `refs/tags/*` from upstream (abra.py:135 documents this), so any local move of the +release tag to the fix commit is reverted to the PUBLISHED commit. The published 3.6.0 / 0.3.0 tags do +NOT yet carry the fix (PR not merged — operator merges, per phase guardrail), so pre-merge the promote +necessarily deploys the unfixed published release. Confirmed empirically: a full gitea harness run's WC5 +promote deployed 357926f and crash-looped exactly like M1. The DIRECT chaos-deploy (chaos = deploy the +working-tree checkout = the PR fix) is therefore the MAXIMAL + faithful pre-merge proof — it reproduces +the EXACT M1 failing scenario (gitea: the retained canonical volumes; bluesky: warm-bluesky-pds on the +shared proxy) and shows the fix resolves it. End-to-end canonical advance follows automatically once the +operator merges PR #2 / #4 and the release tag carries the fix. This is NOT a standing exception — the +defect is fixed + proven; only the registry-advance awaits the operator's merge (the phase's own +"nothing merged" constraint). + +**WHERE (refs).** Recipe PRs on `git.autonomic.zone/recipe-maintainers/`: mattermost-lts +`ci/pg-restore`@4ca7f418, discourse `discourse-official-image`@53ba0910, gitea `ci/app-ini-writable` +@a0f2db8, bluesky-pds `ci/warm-routing-alias`@4987ba9. cc-ci harness branch +`redfix-m2-harness`@07fc6d4 (keycloak 61211db + mumble 07fc6d4). Reasoning/dead-ends in +JOURNAL-redfix.md. Node left clean (only infra + live warm-keycloak 200; gitea idle 3.5.3 volumes +retained, canonical e6a1cc79 unchanged; no bluesky/test stacks/volumes/secrets; no run procs). + +## Gate: M1 — PASS (above). 
**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe