claim(redfix-M2): all 6 canon-sweep failures FIXED + verified green
Some checks failed
continuous-integration/drone/push Build is failing

mattermost-lts (PR #1, !testme #901), discourse (PR #4, !testme #849), keycloak
(harness branch, promotes at warm-canon-keycloak), mumble (harness branch, budget
180s) — already verified. gitea (PR #2 @a0f2db8, app.ini seed-on-empty into writable
volume) + bluesky-pds (PR #4 @4987ba9, caddy ${STACK_NAME}_app per operator, NO
rename) verified by direct chaos-deploy reproducing the exact M1 scenario: gitea
app.ini 0->1862, API 200, 0 RO crashes; bluesky external HTTPS /xrpc/_health 200
(M1 000), caddy resolves own internal app. Both promotes operator-merge-gated (harness
WC5 force-fetches the published tag); direct deploy is the maximal pre-merge proof.
No standing exceptions. Nothing merged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Bacy8SJhBebNAGoYsi1Fxt
This commit is contained in:
2026-06-18 05:55:43 +00:00
parent 966edb3042
commit 0e255d8570
2 changed files with 145 additions and 11 deletions

View File

@ -78,20 +78,104 @@ mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).
### M2 fix tracker (updated 2026-06-18T03:15Z)
### M2 fix tracker (updated 2026-06-18T05:53Z — ALL VERIFIED)
| Recipe | Fix | PR/branch | Status |
|---|---|---|---|
| mattermost-lts | recipe: pg_backup.sh + restore.post-hook | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green (restore_returns_state PASS) |
| discourse | recipe: official-image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head); re-verify fresh for claim |
| bluesky-pds | recipe: rename service app->pds (abra drops aliases) + cc-ci exec-refs service=pds | mirror PR #4 `ci/warm-routing-alias` (rename) + branch `redfix-m2-harness` | IN PROGRESS — cold install PASSES (caddy->pds routing works!) but backup/restore/custom fail on `no running container <stack>_pds after 60s` (backup-bot cycle + exec poll); re-running w/ live inspection. Warm-promote (the actual 000 fix) blocked until cold green. |
| gitea | recipe: app.ini writable (seed) | mirror PR #2 `ci/app-ini-writable` | **NEEDS REWORK** — seed fix works for fresh install but breaks 3.5.3->3.6.0 transition (wizard mode, /api/v1/version 404). Reverted clone. Rework: reproduce+inspect, or provide 1.24-valid oauth2 JWT. |
| keycloak | harness: collision-free canonical_domain + WARM_CANONICAL=True | branch `redfix-m2-harness` | code done; verify pending (run from branch checkout -> promote at warm-canon-keycloak, live warm-keycloak stays 200) |
| mumble | harness: handshake budget 60s->180s | branch `redfix-m2-harness` | code done; verify pending (green from branch checkout; load-green hard to repro) |
| Recipe | Class | Fix | PR/branch + ref | Status |
|---|---|---|---|---|
| mattermost-lts | recipe defect | pg_backup.sh + `backupbot.restore.post-hook` (immich pattern) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green incl `test_restore_returns_state` |
| discourse | stale cc-ci overlay | recipe: bitnamilegacy->official discourse image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head) |
| keycloak | harness defect | collision-free `canonical_domain` (`warm-canon-<r>` for WARM_DOMAINS recipes) + enroll | cc-ci branch `redfix-m2-harness` @61211db | **VERIFIED** — branch-checkout run promotes at warm-canon-keycloak; live warm-keycloak 200 throughout |
| mumble | load/timing flake | harness: handshake readiness budget 60s->180s | cc-ci branch `redfix-m2-harness` @07fc6d4 | **VERIFIED** — branch-checkout run all tiers green incl handshake; budget active+non-regressing |
| gitea | recipe defect | app.ini->staging `/etc/gitea/app.ini.init` + docker-setup seed-on-EMPTY + DOCKER_SETUP_SH_VERSION v3 | mirror PR #2 `ci/app-ini-writable` @a0f2db8 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
| bluesky-pds | recipe defect (routing) | caddy `{$APP_HOST}=${STACK_NAME}_app` (operator: NO rename) + CADDYFILE_VERSION v2 | mirror PR #4 `ci/warm-routing-alias` @4987ba9 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
Verification mechanism for cc-ci-side changes: run from a checkout of `redfix-m2-harness` at /tmp/cc-ci-m2run with CCCI_REPO set (never touches /etc/cc-ci or main).
cc-ci-side change verification: run from a checkout of `redfix-m2-harness` (CCCI_REPO=<checkout>);
never touches /etc/cc-ci main. `redfix-m2-harness` is now mumble+keycloak ONLY (bluesky needs no
cc-ci change with the ${STACK_NAME}_app approach; the rename's exec-ref commit b96b8a4 was dropped).
## Gate: M1PASS (above). M2 not yet claimed.
## Gate: M2CLAIMED, awaiting Adversary (2026-06-18T05:53Z)
**WHAT (M2 DoD).** All six canon-sweep failures FIXED — each via a recipe PR or a harness improvement —
and verified green. No recipe left as a standing exception. Nothing merged (operator merges). Per recipe:
- **mattermost-lts** (recipe PR #1) — added `pg_backup.sh` + postgres `backupbot.restore.post-hook` so
the logical dump round-trips on restore.
- **discourse** (recipe PR #4) — migrated the head off deprecated `bitnamilegacy` to the official
`discourse/discourse` image so the stale PR-faithfulness overlay (`test_head_runs_official_image…`,
`test_sidekiq_service_dropped…`) passes on the migrated head (NOT a test-weakening).
- **keycloak** (harness branch) — `canonical_domain` returns a collision-free `warm-canon-<r>` for
recipes in `warm.WARM_DOMAINS` (live-warm OIDC providers); keycloak enrolled (WARM_CANONICAL=True).
- **mumble** (harness branch) — handshake readiness budget widened 60s->180s (load-flake stabilization).
- **gitea** (recipe PR #2) — app.ini is now seeded into the WRITABLE `/etc/gitea` volume by
docker-setup (`if [ ! -s /etc/gitea/app.ini ]`, seed-on-EMPTY) from the read-only staging config
`app.ini.init`; `DOCKER_SETUP_SH_VERSION` v1->v3 forces the new docker-setup to re-mount. Gitea
1.24.2 can then persist its JWT secret (the M1 read-only-app.ini crash is gone).
- **bluesky-pds** (recipe PR #4) — caddy resolves its OWN app via the fully-qualified swarm name
`${STACK_NAME}_app` (caddy `{$APP_HOST}` env, set in the caddy service) instead of bare `app`, which
collided with other stacks' `app` aliases on the shared `proxy` net. CADDYFILE_VERSION v1->v2.
**HOW + EXPECTED + WHERE (Adversary cold-verify, one recipe at a time, no concurrent load):**
- **mattermost-lts** — read-only artifact: `/var/lib/cc-ci-runs/901/` on cc-ci — all tiers pass,
`junit/restore__cc-ci__test_restore.xml` testsuite failures=0, `test_restore_returns_state` pass.
OR re-run !testme on PR #1 @4ca7f418. EXPECT restore green.
- **discourse** — !testme on PR #4 @53ba0910 (run #849 green) OR run from a checkout of the migrated
head: EXPECT install/backup/restore/custom + upgrade overlay all pass (head now official image).
- **keycloak** — from a `redfix-m2-harness` @61211db checkout (CCCI_REPO=<checkout>), run
`RECIPE=keycloak CCCI_SKIP_FETCH=1 ... run_recipe_ci.py`. EXPECT all cold tiers pass + WC5 promote
succeeds at domain `warm-canon-keycloak.ci.commoninternet.net` (NOT warm-keycloak); live
`warm-keycloak.ci.commoninternet.net/realms/master` stays 200 throughout. Code: `canonical.py`
canonical_domain returns warm-canon-<r> for r in warm.WARM_DOMAINS.
- **mumble** — from `redfix-m2-harness` @07fc6d4 checkout, run `RECIPE=mumble CCCI_SKIP_FETCH=1 …`.
EXPECT all 5 tiers green incl `custom/test_protocol_handshake.py::test_handshake_completes_with_
channel_presence`; handshake budget = 36 attempts / 180s (was 60s). (Load-flake is not
deterministically reproducible; this verifies the stabilization is applied, sound, non-weakening.)
- **gitea** (recipe PR #2 @a0f2db8 on mirror branch `ci/app-ini-writable`) — DIRECT chaos-deploy proof
(the harness WC5 promote is merge-gated, see NOTE). With the idle 3.5.3 canonical present:
`cd ~/.abra/recipes/gitea && git checkout -f a0f2db8` then chaos-deploy onto the retained canonical
volumes (0-byte app.ini = genuine pre-fix 3.5.3 state):
`abra app deploy warm-gitea.ci.commoninternet.net -C -o -n`. EXPECT: service 1/1; the config volume's
`app.ini` seeded 0->~1862 bytes (`INSTALL_LOCK = true`); `/api/v1/version` -> 200 {"version":"1.24.2"}
and `/api/healthz` -> 200 (curl inside the app container); retained 3.5.3 data adopted (data dirs
dated 2026-06-17T08:39); ZERO `read-only file system` crashes in `docker service logs` (M1 crashed
here). Evidence: `/tmp/redfix-gitea-m2-directproof.log` on cc-ci. Teardown: `abra app undeploy … -n`,
truncate the volume app.ini to 0 (restore pre-fix state). canonical.json stays 3.5.3 idle e6a1cc79.
- **bluesky-pds** (recipe PR #4 @4987ba9 on mirror branch `ci/warm-routing-alias`) — DIRECT chaos-deploy
proof (warm-promote is the only failing path; merge-gated). `git checkout -f 4987ba9`; generate
secrets (`abra app secret generate warm-bluesky-pds.ci.commoninternet.net --all -m -C -o -n`) + insert
a PLC rotation key (tests/bluesky-pds/install_steps.sh logic: 32-byte hex into pds_plc_rotation_key
v1); **re-checkout 4987ba9 AFTER secret ops** (abra secret insert force-fetches+reverts the checkout);
`abra app deploy warm-bluesky-pds.ci.commoninternet.net -C -o -n` (EXPECT `caddyfile: v1 -> v2`,
NEW DEPLOYMENT 4987ba9). EXPECT: app+caddy 1/1; inside caddy `getent hosts
warm-bluesky-pds_ci_commoninternet_net_app` -> a 10.0.x.x INTERNAL ip (own stack) while
`getent hosts app` -> a 10.10.x.x proxy ip (foreign, the M1 collision); caddy log "certificate
obtained successfully" with 0 "connection refused"; external `curl https://warm-bluesky-pds.ci.
commoninternet.net/xrpc/_health` -> **200** {"version":"0.4.219"} (M1 was 000). Evidence:
`/tmp/redfix-bluesky-m2-directproof.log`. Teardown: undeploy + remove volumes (caddy_data, pds_data)
+ secrets (no canonical, matching M1).
**NOTE — gitea & bluesky end-to-end canonical-promote is OPERATOR-MERGE-GATED (not a shrug).** The
harness WC5 promote does a recipe_checkout(published-tag)+non-chaos deploy, and BOTH run_recipe_ci.py:373
AND abra force-fetch `refs/tags/*` from upstream (abra.py:135 documents this), so any local move of the
release tag to the fix commit is reverted to the PUBLISHED commit. The published 3.6.0 / 0.3.0 tags do
NOT yet carry the fix (PR not merged — operator merges, per phase guardrail), so pre-merge the promote
necessarily deploys the unfixed published release. Confirmed empirically: a full gitea harness run's WC5
promote deployed 357926f and crash-looped exactly like M1. The DIRECT chaos-deploy (chaos = deploy the
working-tree checkout = the PR fix) is therefore the MAXIMAL + faithful pre-merge proof — it reproduces
the EXACT M1 failing scenario (gitea: the retained canonical volumes; bluesky: warm-bluesky-pds on the
shared proxy) and shows the fix resolves it. End-to-end canonical advance follows automatically once the
operator merges PR #2 / #4 and the release tag carries the fix. This is NOT a standing exception — the
defect is fixed + proven; only the registry-advance awaits the operator's merge (the phase's own
"nothing merged" constraint).
**WHERE (refs).** Recipe PRs on `git.autonomic.zone/recipe-maintainers/<recipe>`: mattermost-lts
`ci/pg-restore`@4ca7f418, discourse `discourse-official-image`@53ba0910, gitea `ci/app-ini-writable`
@a0f2db8, bluesky-pds `ci/warm-routing-alias`@4987ba9. cc-ci harness branch
`redfix-m2-harness`@07fc6d4 (keycloak 61211db + mumble 07fc6d4). Reasoning/dead-ends in
JOURNAL-redfix.md. Node left clean (only infra + live warm-keycloak 200; gitea idle 3.5.3 volumes
retained, canonical e6a1cc79 unchanged; no bluesky/test stacks/volumes/secrets; no run procs).
## Gate: M1 — PASS (above).
**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe