claim(redfix-M2): all 6 canon-sweep failures FIXED + verified green
Some checks failed
continuous-integration/drone/push Build is failing

mattermost-lts (PR #1, !testme #901), discourse (PR #4, !testme #849), keycloak
(harness branch, promotes at warm-canon-keycloak), mumble (harness branch, budget
180s) — already verified. gitea (PR #2 @a0f2db8, app.ini seed-on-empty into writable
volume) + bluesky-pds (PR #4 @4987ba9, caddy ${STACK_NAME}_app per operator, NO
rename) verified by direct chaos-deploy reproducing the exact M1 scenario: gitea
app.ini seeded 0->1862 bytes, API 200, 0 read-only crashes; bluesky external HTTPS /xrpc/_health 200
(M1 was 000), caddy resolves its own internal app. Both promotes are operator-merge-gated (harness
WC5 force-fetches the published tag); direct chaos-deploy is the maximal pre-merge proof.
No standing exceptions. Nothing merged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Bacy8SJhBebNAGoYsi1Fxt
2026-06-18 05:55:43 +00:00
parent 966edb3042
commit 0e255d8570
2 changed files with 145 additions and 11 deletions


@@ -470,3 +470,53 @@ Restored the bluesky tag; node clean; warm-keycloak 200.
- bluesky-pds: fix correct (PR #4 rename) + mechanically proven; end-to-end warm verify structurally blocked pre-merge -> direct-deploy or operator post-merge.
- gitea: PR #2 seed fix BROKE 3.5.3->3.6.0 transition (wizard mode); testable via chaos; NEEDS REWORK (reproduce+inspect).
NOT claiming M2 — bluesky end-to-end + gitea rework outstanding.
## 2026-06-18T05:53Z — M2 gitea VERIFIED (v3 seed) + bluesky VERIFIED (${STACK_NAME}_app); 6/6
**gitea — rework was already done (v3, a0f2db8) but unverified; verified it.** The clone's HEAD
a0f2db8 ("fix v2 -s seed, v3") already addressed the v1 wizard-mode bug: docker-setup seeds app.ini
into the writable /etc/gitea volume `if [ ! -s /etc/gitea/app.ini ]` (seed-on-EMPTY, not -f
seed-on-missing — a 3.5.3-old-recipe canonical leaves a 0-byte app.ini placeholder in the config
volume, which -f wrongly treats as present). Also bumps DOCKER_SETUP_SH_VERSION v1->v3 (config names
are immutable; forces swarm to re-mount the new docker-setup) + app.ini config target ->
/etc/gitea/app.ini.init (staging). Pushed v3 to PR #2 (force-replaced the broken v1 d4145266).
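The `-s` versus `-f` distinction above can be reproduced in isolation. This is a minimal sketch, not the recipe's actual docker-setup script: a temp directory stands in for the `/etc/gitea` config volume, and the staging file content is a hypothetical placeholder.

```shell
#!/bin/sh
# Sketch only: a temp dir stands in for the /etc/gitea config volume.
set -eu
conf_dir=$(mktemp -d)

# Hypothetical staging content; the real app.ini.init is much larger.
printf '[security]\nINSTALL_LOCK = true\n' > "$conf_dir/app.ini.init"

# 0-byte placeholder, as left behind by the old-recipe canonical.
: > "$conf_dir/app.ini"

# -f is true for the empty placeholder, so a seed-on-missing check would skip seeding.
if [ -f "$conf_dir/app.ini" ]; then f_sees_file=yes; else f_sees_file=no; fi

# -s is true only for a non-empty file, so seed-on-empty does seed here.
if [ ! -s "$conf_dir/app.ini" ]; then
    cp "$conf_dir/app.ini.init" "$conf_dir/app.ini"
fi

seeded_bytes=$(wc -c < "$conf_dir/app.ini" | tr -d ' ')
echo "f_sees_file=$f_sees_file seeded_bytes=$seeded_bytes"
```

On a second run against an already seeded file the `-s` guard is false, so an existing non-empty app.ini is left untouched.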
VERIFICATION (direct chaos-deploy onto the REAL idle 3.5.3 canonical volumes; /tmp/redfix-gitea-m2-directproof.log):
reattached the retained config volume (0-byte app.ini = genuine pre-fix M1 state) with the v3 recipe.
Result: app.ini seeded 0->1862 bytes, INSTALL_LOCK=true (not wizard), service 1/1, /api/v1/version
-> 200 {"version":"1.24.2"}, /api/healthz 200, retained 3.5.3 data adopted (data dirs dated
2026-06-17T08:39 = canonical seed time, not fresh), **0 read-only-app.ini crashes** (M1 crashed here).
WHY NOT the harness WC5 promote: it is STRUCTURALLY merge-gated. run_recipe_ci.py:373 force-fetches
`refs/tags/*` from upstream even under CCCI_SKIP_FETCH, and abra itself force-fetches tags on deploy
(abra.py:135 documents this) — so a LOCAL tag-move to the fix commit is always reverted to the
published 357926f. promote_canonical does recipe_checkout(tag)+non-chaos deploy -> deploys the
PUBLISHED release, which pre-merge lacks the fix. Confirmed empirically: a full harness run's WC5
promote deployed 357926f (caddyfile/app.ini OLD) -> crashed exactly like M1. So end-to-end
canonical-advance needs the operator to merge PR #2 + re-cut 3.6.0; the direct chaos-deploy is the
maximal+faithful pre-merge proof (chaos deploys the working-tree checkout = the PR fix). Node left
clean: warm-gitea undeployed (idle 3.5.3, volumes retained), app.ini reset to 0-byte for re-verify,
canonical.json UNCHANGED (3.5.3 idle e6a1cc79), recipe tag restored to upstream 357926f.
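The tag revert described above is plain git behaviour and can be shown with throwaway repositories. This is a sketch under assumptions: the repos, commit messages and the `git_c` helper are illustrative, and the exact fetch command used at run_recipe_ci.py:373 or inside abra is not shown in this journal, so the refspec below is only a typical force-fetch of tags.

```shell
#!/bin/sh
# Sketch only: throwaway repos; "upstream" stands in for the published recipe remote.
set -eu
work=$(mktemp -d)

# Hypothetical helper so commits work without a global git identity.
git_c() { git -c user.name=demo -c user.email=demo@example.invalid "$@"; }

git init -q "$work/upstream"
cd "$work/upstream"
git_c commit -q --allow-empty -m "published release"
git_c tag 3.6.0
published=$(git rev-parse 3.6.0)

git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"
git_c commit -q --allow-empty -m "local fix"
git_c tag -f 3.6.0 >/dev/null      # move the tag locally onto the fix commit
moved=$(git rev-parse 3.6.0)

# Force-fetching tags overwrites the local tag with the published one.
git fetch -q --force origin 'refs/tags/*:refs/tags/*'
after=$(git rev-parse 3.6.0)

echo "published=$published"
echo "moved=$moved"
echo "after=$after"
```

After the fetch, `after` equals `published` again, which is why a tag-based, non-chaos deploy cannot pick up an unmerged fix.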
**bluesky — operator directive (2026-06-18): NO rename; use ${STACK_NAME}_app.** Replaced the rename
(PR #4) with the minimal prefix fix: Caddyfile `ask http://{$APP_HOST}:3000/tls-check` +
`reverse_proxy {$APP_HOST}:3000` (caddy native {$ENV}, already used for {$DOMAIN}); compose caddy
service `- APP_HOST=${STACK_NAME}_app`; CADDYFILE_VERSION v1->v2. Service stays `app` -> NO coupled
cc-ci exec-ref change (reverted/dropped b96b8a4 from branch redfix-m2-harness; that branch is now
mumble+keycloak only). 3-file recipe-PR-only diff. Pushed to PR #4 ci/warm-routing-alias (4987ba9,
force-replaced the rename). Pattern per matrix-synapse/mailu/mumble.
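A small sketch of how the fully-qualified name is formed. It assumes the stack name is the app domain with dots replaced by underscores, which matches the value recorded in the verification notes but is not a statement of abra's full naming rules.

```shell
#!/bin/sh
# Sketch only: assumes stack name = app domain with '.' replaced by '_'.
set -eu
domain="warm-bluesky-pds.ci.commoninternet.net"
stack_name=$(printf '%s' "$domain" | tr '.' '_')

# Value the compose file passes to the caddy service as APP_HOST.
app_host="${stack_name}_app"
echo "APP_HOST=$app_host"
# prints: APP_HOST=warm-bluesky-pds_ci_commoninternet_net_app
```

Because the name carries the stack prefix, it cannot match another stack's bare `app` alias on the shared proxy network.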
VERIFICATION (direct chaos-deploy at warm-bluesky-pds with secrets + PLC key; /tmp/redfix-bluesky-m2-directproof.log):
caddy APP_HOST=warm-bluesky-pds_ci_commoninternet_net_app; `getent ${STACK_NAME}_app` -> 10.0.3.x
(bluesky's OWN internal net) while `getent app` (M1's bare target) -> 10.10.0.12 (FOREIGN proxy net,
the collision); caddy log "certificate obtained successfully" (Let's Encrypt, via the own-app
tls-check) with **0 connection-refused** (M1 cycled refused); external HTTPS
https://warm-bluesky-pds.../xrpc/_health -> **200** {"version":"0.4.219"} (M1 was 000). GOTCHA: abra
`secret insert` (no -C -o) force-fetches+checks out the .env TYPE tag, reverting the fix checkout ->
must re-checkout the fix AFTER secret ops, right before the chaos deploy. Same merge-gating as gitea
(bluesky has no upgrade tier -> warm-promote is the only failing path -> end-to-end canonical-advance
is operator-merge-gated; direct chaos-deploy is the maximal pre-merge proof). Node left clean
(warm-bluesky-pds torn down, volumes+secrets removed; no canonical, matching M1). Live warm-keycloak
200 throughout.
**6/6 VERIFIED.** Claiming M2.


@@ -78,20 +78,104 @@ mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).
### M2 fix tracker (updated 2026-06-18T03:15Z)
### M2 fix tracker (updated 2026-06-18T05:53Z — ALL VERIFIED)
| Recipe | Fix | PR/branch | Status |
|---|---|---|---|
| mattermost-lts | recipe: pg_backup.sh + restore.post-hook | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green (restore_returns_state PASS) |
| discourse | recipe: official-image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head); re-verify fresh for claim |
| bluesky-pds | recipe: rename service app->pds (abra drops aliases) + cc-ci exec-refs service=pds | mirror PR #4 `ci/warm-routing-alias` (rename) + branch `redfix-m2-harness` | IN PROGRESS — cold install PASSES (caddy->pds routing works!) but backup/restore/custom fail on `no running container <stack>_pds after 60s` (backup-bot cycle + exec poll); re-running w/ live inspection. Warm-promote (the actual 000 fix) blocked until cold green. |
| gitea | recipe: app.ini writable (seed) | mirror PR #2 `ci/app-ini-writable` | **NEEDS REWORK** — seed fix works for fresh install but breaks 3.5.3->3.6.0 transition (wizard mode, /api/v1/version 404). Reverted clone. Rework: reproduce+inspect, or provide 1.24-valid oauth2 JWT. |
| keycloak | harness: collision-free canonical_domain + WARM_CANONICAL=True | branch `redfix-m2-harness` | code done; verify pending (run from branch checkout -> promote at warm-canon-keycloak, live warm-keycloak stays 200) |
| mumble | harness: handshake budget 60s->180s | branch `redfix-m2-harness` | code done; verify pending (green from branch checkout; load-green hard to repro) |
| Recipe | Class | Fix | PR/branch + ref | Status |
|---|---|---|---|---|
| mattermost-lts | recipe defect | pg_backup.sh + `backupbot.restore.post-hook` (immich pattern) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green incl `test_restore_returns_state` |
| discourse | stale cc-ci overlay | recipe: bitnamilegacy->official discourse image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head) |
| keycloak | harness defect | collision-free `canonical_domain` (`warm-canon-<r>` for WARM_DOMAINS recipes) + enroll | cc-ci branch `redfix-m2-harness` @61211db | **VERIFIED** — branch-checkout run promotes at warm-canon-keycloak; live warm-keycloak 200 throughout |
| mumble | load/timing flake | harness: handshake readiness budget 60s->180s | cc-ci branch `redfix-m2-harness` @07fc6d4 | **VERIFIED** — branch-checkout run all tiers green incl handshake; budget active+non-regressing |
| gitea | recipe defect | app.ini->staging `/etc/gitea/app.ini.init` + docker-setup seed-on-EMPTY + DOCKER_SETUP_SH_VERSION v3 | mirror PR #2 `ci/app-ini-writable` @a0f2db8 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
| bluesky-pds | recipe defect (routing) | caddy `{$APP_HOST}=${STACK_NAME}_app` (operator: NO rename) + CADDYFILE_VERSION v2 | mirror PR #4 `ci/warm-routing-alias` @4987ba9 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
Verification mechanism for cc-ci-side changes: run from a checkout of `redfix-m2-harness` at /tmp/cc-ci-m2run with CCCI_REPO set (never touches /etc/cc-ci or main).
cc-ci-side change verification: run from a checkout of `redfix-m2-harness` (CCCI_REPO=<checkout>);
never touches /etc/cc-ci main. `redfix-m2-harness` is now mumble+keycloak ONLY (bluesky needs no
cc-ci change with the ${STACK_NAME}_app approach; the rename's exec-ref commit b96b8a4 was dropped).
## Gate: M1PASS (above). M2 not yet claimed.
## Gate: M2CLAIMED, awaiting Adversary (2026-06-18T05:53Z)
**WHAT (M2 DoD).** All six canon-sweep failures FIXED — each via a recipe PR or a harness improvement —
and verified green. No recipe left as a standing exception. Nothing merged (operator merges). Per recipe:
- **mattermost-lts** (recipe PR #1) — added `pg_backup.sh` + postgres `backupbot.restore.post-hook` so
the logical dump round-trips on restore.
- **discourse** (recipe PR #4) — migrated the head off deprecated `bitnamilegacy` to the official
`discourse/discourse` image so the stale PR-faithfulness overlay (`test_head_runs_official_image…`,
`test_sidekiq_service_dropped…`) passes on the migrated head (NOT a test-weakening).
- **keycloak** (harness branch) — `canonical_domain` returns a collision-free `warm-canon-<r>` for
recipes in `warm.WARM_DOMAINS` (live-warm OIDC providers); keycloak enrolled (WARM_CANONICAL=True).
- **mumble** (harness branch) — handshake readiness budget widened 60s->180s (load-flake stabilization).
- **gitea** (recipe PR #2) — app.ini is now seeded into the WRITABLE `/etc/gitea` volume by
docker-setup (`if [ ! -s /etc/gitea/app.ini ]`, seed-on-EMPTY) from the read-only staging config
`app.ini.init`; bumping `DOCKER_SETUP_SH_VERSION` v1->v3 forces swarm to re-mount the new docker-setup. Gitea
1.24.2 can then persist its JWT secret (the M1 read-only-app.ini crash is gone).
- **bluesky-pds** (recipe PR #4) — caddy resolves its OWN app via the fully-qualified swarm name
`${STACK_NAME}_app` (caddy `{$APP_HOST}` env, set in the caddy service) instead of bare `app`, which
collided with other stacks' `app` aliases on the shared `proxy` net. CADDYFILE_VERSION v1->v2.
**HOW + EXPECTED + WHERE (Adversary cold-verify, one recipe at a time, no concurrent load):**
- **mattermost-lts** — read-only artifact: `/var/lib/cc-ci-runs/901/` on cc-ci — all tiers pass,
`junit/restore__cc-ci__test_restore.xml` testsuite failures=0, `test_restore_returns_state` pass.
OR re-run !testme on PR #1 @4ca7f418. EXPECT restore green.
- **discourse** — !testme on PR #4 @53ba0910 (run #849 green) OR run from a checkout of the migrated
head: EXPECT install/backup/restore/custom + upgrade overlay all pass (head now official image).
- **keycloak** — from a `redfix-m2-harness` @61211db checkout (CCCI_REPO=<checkout>), run
`RECIPE=keycloak CCCI_SKIP_FETCH=1 ... run_recipe_ci.py`. EXPECT all cold tiers pass + WC5 promote
succeeds at domain `warm-canon-keycloak.ci.commoninternet.net` (NOT warm-keycloak); live
`warm-keycloak.ci.commoninternet.net/realms/master` stays 200 throughout. Code: `canonical.py`
canonical_domain returns warm-canon-<r> for r in warm.WARM_DOMAINS.
- **mumble** — from `redfix-m2-harness` @07fc6d4 checkout, run `RECIPE=mumble CCCI_SKIP_FETCH=1 …`.
EXPECT all 5 tiers green incl `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence`;
handshake budget = 36 attempts / 180s (was 60s). (Load-flake is not
deterministically reproducible; this verifies the stabilization is applied, sound, non-weakening.)
- **gitea** (recipe PR #2 @a0f2db8 on mirror branch `ci/app-ini-writable`) — DIRECT chaos-deploy proof
(the harness WC5 promote is merge-gated, see NOTE). With the idle 3.5.3 canonical present:
`cd ~/.abra/recipes/gitea && git checkout -f a0f2db8` then chaos-deploy onto the retained canonical
volumes (0-byte app.ini = genuine pre-fix 3.5.3 state):
`abra app deploy warm-gitea.ci.commoninternet.net -C -o -n`. EXPECT: service 1/1; the config volume's
`app.ini` seeded 0->~1862 bytes (`INSTALL_LOCK = true`); `/api/v1/version` -> 200 {"version":"1.24.2"}
and `/api/healthz` -> 200 (curl inside the app container); retained 3.5.3 data adopted (data dirs
dated 2026-06-17T08:39); ZERO `read-only file system` crashes in `docker service logs` (M1 crashed
here). Evidence: `/tmp/redfix-gitea-m2-directproof.log` on cc-ci. Teardown: `abra app undeploy … -n`,
truncate the volume app.ini to 0 (restore pre-fix state). canonical.json stays 3.5.3 idle e6a1cc79.
- **bluesky-pds** (recipe PR #4 @4987ba9 on mirror branch `ci/warm-routing-alias`) — DIRECT chaos-deploy
proof (warm-promote is the only failing path; merge-gated). `git checkout -f 4987ba9`; generate
secrets (`abra app secret generate warm-bluesky-pds.ci.commoninternet.net --all -m -C -o -n`) + insert
a PLC rotation key (tests/bluesky-pds/install_steps.sh logic: 32-byte hex into pds_plc_rotation_key
v1); **re-checkout 4987ba9 AFTER secret ops** (abra secret insert force-fetches+reverts the checkout);
`abra app deploy warm-bluesky-pds.ci.commoninternet.net -C -o -n` (EXPECT `caddyfile: v1 -> v2`,
NEW DEPLOYMENT 4987ba9). EXPECT: app+caddy 1/1; inside caddy `getent hosts
warm-bluesky-pds_ci_commoninternet_net_app` -> a 10.0.x.x INTERNAL ip (own stack) while
`getent hosts app` -> a 10.10.x.x proxy ip (foreign, the M1 collision); caddy log "certificate
obtained successfully" with 0 "connection refused"; external
`curl https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health` -> **200** {"version":"0.4.219"} (M1 was 000). Evidence:
`/tmp/redfix-bluesky-m2-directproof.log`. Teardown: undeploy + remove volumes (caddy_data, pds_data)
+ secrets (no canonical, matching M1).
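For the mumble item in the list above, 36 attempts over 180s implies a 5s interval between attempts. The loop below is a sketch of such a bounded readiness poll, not the harness code: `probe` is a hypothetical stand-in for the real handshake check, and the 5s interval is derived from the quoted numbers rather than read from the source.

```shell
#!/bin/sh
# Sketch only: probe() is a hypothetical stand-in for the real handshake check.
set -eu
ATTEMPTS=36
INTERVAL=5                          # assumed: 180s / 36 attempts
budget=$((ATTEMPTS * INTERVAL))     # total readiness budget in seconds

calls=0
probe() {
    # Pretend the service becomes ready on the third attempt.
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

# wait_ready ATTEMPTS INTERVAL: poll until probe succeeds or attempts run out.
wait_ready() {
    attempt=1
    while [ "$attempt" -le "$1" ]; do
        if probe; then return 0; fi
        attempt=$((attempt + 1))
        sleep "$2"
    done
    return 1
}

# Interval 0 keeps the demo instant; a real run would pass "$INTERVAL".
if wait_ready "$ATTEMPTS" 0; then
    echo "ready after $calls attempts (budget ${budget}s)"
else
    echo "not ready within ${budget}s"
fi
```

Widening the budget only changes how long a slow start is tolerated; a service that never becomes ready still fails once the attempts are exhausted.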
**NOTE — gitea & bluesky end-to-end canonical-promote is OPERATOR-MERGE-GATED (not a shrug).** The
harness WC5 promote does a recipe_checkout(published-tag)+non-chaos deploy, and BOTH run_recipe_ci.py:373
AND abra force-fetch `refs/tags/*` from upstream (abra.py:135 documents this), so any local move of the
release tag to the fix commit is reverted to the PUBLISHED commit. The published 3.6.0 / 0.3.0 tags do
NOT yet carry the fix (PR not merged — operator merges, per phase guardrail), so pre-merge the promote
necessarily deploys the unfixed published release. Confirmed empirically: a full gitea harness run's WC5
promote deployed 357926f and crash-looped exactly like M1. The DIRECT chaos-deploy (chaos = deploy the
working-tree checkout = the PR fix) is therefore the MAXIMAL + faithful pre-merge proof — it reproduces
the EXACT M1 failing scenario (gitea: the retained canonical volumes; bluesky: warm-bluesky-pds on the
shared proxy) and shows the fix resolves it. End-to-end canonical advance follows automatically once the
operator merges PR #2 / #4 and the release tag carries the fix. This is NOT a standing exception — the
defect is fixed + proven; only the registry-advance awaits the operator's merge (the phase's own
"nothing merged" constraint).
**WHERE (refs).** Recipe PRs on `git.autonomic.zone/recipe-maintainers/<recipe>`: mattermost-lts
`ci/pg-restore`@4ca7f418, discourse `discourse-official-image`@53ba0910, gitea `ci/app-ini-writable`
@a0f2db8, bluesky-pds `ci/warm-routing-alias`@4987ba9. cc-ci harness branch
`redfix-m2-harness`@07fc6d4 (keycloak 61211db + mumble 07fc6d4). Reasoning/dead-ends in
JOURNAL-redfix.md. Node left clean (only infra + live warm-keycloak 200; gitea idle 3.5.3 volumes
retained, canonical e6a1cc79 unchanged; no bluesky/test stacks/volumes/secrets; no run procs).
## Gate: M1 — PASS (above).
**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe