Compare commits

9 Commits

Author SHA1 Message Date
966edb3042 note(redfix): idle break-it probe — live keycloak 200 (undisturbed), gitea canonical unchanged (no false promote during rework); M2 not yet claimed
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 04:12:39 +00:00
12925b5ab8 journal(redfix): M2 4/6 verified; bluesky warm-verify structurally blocked pre-merge (fix proven); gitea needs rework
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:39:37 +00:00
c5bc29bb97 journal(redfix): M2 mumble VERIFIED (4/6); bluesky force-chaos verification plan
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:28:42 +00:00
a65372cfde journal(redfix): M2 keycloak VERIFIED — canonical promotes at collision-free warm-canon-keycloak, live warm-keycloak undisturbed (200). 3/6 verified
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:25:02 +00:00
6846bbe83d journal(redfix): M2 — bluesky verify blocked by abra non-chaos tag-revert (recipe fixes need chaos); keycloak/mumble (harness) verify cleanly, doing next
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:21:19 +00:00
ed7d897e5f status(redfix): M2 tracker — mattermost+discourse VERIFIED; bluesky rename routing-works-but-backup-fails; gitea needs rework; keycloak/mumble pending verify
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:16:47 +00:00
fca936ef50 note(redfix): M2 interim corroboration — mattermost-lts run #901 restore tier (test_restore_returns_state) PASSES, clean teardown + no leak; non-contending artifact check, not a verdict; M2 not yet claimed
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:15:17 +00:00
c021d7e305 journal(redfix): M2 gitea fix v1 (seed) broke 3.5.3->3.6.0 transition (wizard mode); reverted clone, needs rework; proceeding to bluesky/keycloak/mumble
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:09:43 +00:00
278cb4e4b8 journal(redfix): M2 progress — gitea PR #2 + advance verifying; bluesky rename PR #4; harness branch redfix-m2-harness pushed (keycloak/mumble/bluesky-exec)
Some checks failed
continuous-integration/drone/push Build is failing
2026-06-18 02:00:06 +00:00
8 changed files with 158 additions and 38 deletions

@@ -356,3 +356,117 @@ cold green -> promote -> warm-bluesky-pds 200.
- gitea: fix READY locally (/tmp/redfix-gitea: app.ini->staging + docker-setup seed-once + DOCKER_SETUP_SH_VERSION v2); needs PR push + warm-advance verify.
- keycloak: harness fix (canonical_domain collision-free for WARM_DOMAINS recipes + enroll) NOT STARTED.
- mumble: harness fix (handshake readiness/retry stabilization) NOT STARTED.
## 2026-06-18T02:45Z — M2 progress: gitea PR + harness branch pushed; bluesky pivoted to rename
- **gitea**: opened recipe PR #2 `ci/app-ini-writable` (app.ini->staging + docker-setup seed-once +
DOCKER_SETUP_SH_VERSION v2). Advance-path verification RUNNING (fixed 3.6.0 reattach to idle 3.5.3
canonical; expect no app.ini crash + promote). cold lifecycle green so far (install + cold upgrade
converged).
- **bluesky**: PR #4 switched from an alias to a RENAME of service app->pds (abra drops aliases).
3-line recipe diff, validates. Coupled cc-ci exec-ref change on branch.
- **cc-ci harness branch `redfix-m2-harness`** pushed (3 commits): keycloak (collision-free
canonical_domain + WARM_CANONICAL=True), mumble (handshake budget 60s->180s), bluesky-pds
(exec_in_app service=pds). Verified via temp-checkout runs (CCCI_REPO=<branch checkout>).
- Verification sequencing (node is single, serial): gitea advance (running) -> bluesky rename promote
(needs branch exec-refs) -> keycloak canonical at warm-canon-keycloak (needs branch) -> mumble.
NOTE: mumble "green under load" is hard to reproduce deterministically; plan = show branch run still
green + reason about the budget (or construct concurrent load).
## 2026-06-18T03:00Z — M2 gitea fix v1 (seed) BROKE the transition — needs rework
gitea advance verification (fixed 3.6.0): install tier PASSED FULLY (fresh 3.6.0 + my fix: API 200,
admin auth OK — so the seed works for a FRESH deploy), but upgrade/backup/restore/custom ALL FAILED:
`READY_PROBE not ready: /api/v1/version (last status 404) within 600s` after the 3.5.3->3.6.0 chaos
redeploy → gitea came up in INSTALL-WIZARD mode (serves 200 but no API/admin = no valid app.ini).
The LFS custom test's repo-create also 404'd (same wizard-mode cause).
So my seed-once fix is fine for fresh install but FAILS the 3.5.3->3.6.0 transition — exactly the path
the canon fix needs. Likely cause: on the chaos redeploy from a 3.5.3 stack (docker_setup_sh_v1, no
seed) the docker-setup config didn't update to my v2 (seed) while compose moved app.ini to the staging
path → /etc/gitea/app.ini empty → wizard. (To confirm: reproduce + inspect the post-redeploy container
— is docker_setup_sh_v2 mounted? does /etc/gitea/app.ini exist? gitea log.) Reverted the fix from
cc-ci's gitea clone; warm-gitea intact (idle 3.5.3, promote didn't fire on the red cold run). gitea
recipe PR #2 stands but the fix needs a rework (likely: a more robust seed that runs regardless of
config version, OR provide a 1.24-valid oauth2 JWT secret so gitea never rewrites app.ini — investigate
WHY 1.24 regenerates it). Deferring gitea; proceeding to bluesky-rename / keycloak / mumble verifies.
## 2026-06-18T03:30Z — M2 bluesky verification BLOCKED by abra non-chaos tag-revert; keycloak/mumble next
Root cause of the bluesky rename verify failure: the deployed service was `..._app` (not `pds`).
`run_recipe_ci` CCCI_SKIP_FETCH copies my renamed clone to the per-run tree, BUT abra's NON-CHAOS
pinned deploy (bluesky's tag 0.3.0+v0.4.219 is ANNOTATED) does `git checkout <tag>` in the per-run
tree, REVERTING my rename to the tag's `app:`. So the renamed recipe never deployed; the branch
harness then execs `service=pds` -> "no running container <stack>_pds" -> backup/restore/custom red.
(This also re-explains the earlier "abra dropped the alias" — it was the same tag-revert, not a drop.)
gitea's tag is lightweight -> deploy_app uses chaos -> my gitea fix DID deploy (install passed); its
failure is a real transition issue, not a revert.
IMPLICATION: verifying a RECIPE fix (bluesky, gitea) via CCCI_SKIP_FETCH needs a CHAOS deploy (uses the
checkout, not the tag). HARNESS fixes (keycloak canonical_domain, mumble retry) are runner/test code
from the branch checkout — NO tag-revert — so they verify cleanly. Doing keycloak + mumble next.
For bluesky: force chaos (deploy_app does chaos when has_ccci_overlay) OR reconsider a cc-ci-side
overlay fix (alias + caddyfile override) — both verifiable; recipe PR #4 (rename) stays as the ideal
upstream fix. gitea: rework + reproduce-with-inspection.
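A minimal sketch of the deploy-mode reasoning in this entry, as understood at this point in the journal. `DeployInputs`, `uses_chaos` and `deployed_source` are hypothetical names for illustration, not cc-ci or abra APIs. The only facts taken from the entry are that an annotated tag led to a pinned deploy that checks out the tag, while a lightweight tag or a cc-ci overlay leads to a chaos deploy of the checkout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployInputs:
    """Hypothetical summary of what decides the deploy mode (not a cc-ci type)."""
    tag_is_annotated: bool   # e.g. bluesky's 0.3.0+v0.4.219 is annotated
    has_ccci_overlay: bool   # a cc-ci compose overlay exists for the recipe


def uses_chaos(d: DeployInputs) -> bool:
    # Assumption from the journal: an overlay forces chaos; otherwise only a
    # lightweight (non-annotated) tag ends up on the chaos path.
    return d.has_ccci_overlay or not d.tag_is_annotated


def deployed_source(d: DeployInputs) -> str:
    # Chaos deploys the working checkout (unmerged edits survive); a pinned
    # deploy checks out the tag, which discards unmerged edits.
    return "checkout" if uses_chaos(d) else "tag"


bluesky = DeployInputs(tag_is_annotated=True, has_ccci_overlay=False)
gitea = DeployInputs(tag_is_annotated=False, has_ccci_overlay=False)
print("bluesky deploys from:", deployed_source(bluesky))
print("gitea deploys from:", deployed_source(gitea))
```

Under these assumptions the bluesky rename is lost on a pinned deploy and the gitea fix is not, which matches the two outcomes recorded above.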
## 2026-06-18T03:40Z — M2 keycloak FIXED + VERIFIED (collision-free canonical)
Ran keycloak cold-on-latest from branch checkout /tmp/cc-ci-m2run (harness fix: canonical_domain ->
warm-canon-keycloak for WARM_DOMAINS recipes; WARM_CANONICAL=True). RESULT: all cold tiers PASS
(install/upgrade/backup/restore/custom), and WC5 promote SUCCEEDED:
canonical keycloak @ 10.8.0+26.6.3, domain="warm-canon-keycloak.ci.commoninternet.net", idle, volume retained.
- Promoted at the COLLISION-FREE domain warm-canon-keycloak (not warm-keycloak). ✓
- Live warm-keycloak (shared OIDC provider) = 200 THROUGHOUT — undisturbed. ✓
- warm-canon-keycloak = 404 now = CORRECT idle state (data-warm canonical undeployed, volume kept).
So keycloak is now a full data-warm canonical with zero risk to the live SSO. **FIXED + verified.**
3/6 verified: mattermost-lts, discourse, keycloak. Doing mumble next (harness, tractable).
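The collision-free naming verified here can be sketched as below. This is an illustration under assumptions, not the harness code: `WARM_DOMAINS` is modelled as a plain set, and `live_domain` is a hypothetical stand-in for the plain `warm-<recipe>` scheme that cc-ci's `warm` module provides.

```python
# Assumption: only keycloak is a live-warm provider, as in the journal.
WARM_DOMAINS = {"keycloak"}
SUFFIX = "ci.commoninternet.net"


def live_domain(recipe: str) -> str:
    """Hypothetical stand-in for the plain warm-<recipe> scheme."""
    return f"warm-{recipe}.{SUFFIX}"


def canonical_domain(recipe: str) -> str:
    """Live-warm providers get a separate warm-canon-<recipe> namespace."""
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{SUFFIX}"
    return live_domain(recipe)


# The invariant the verify run relied on: a provider's canonical never
# shares a domain with its live instance.
for recipe in WARM_DOMAINS:
    assert canonical_domain(recipe) != live_domain(recipe)

print(canonical_domain("keycloak"))
print(canonical_domain("gitea"))
```

Recipes outside `WARM_DOMAINS` keep their existing domain, so the change only affects keycloak.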
## 2026-06-18T03:50Z — M2 mumble VERIFIED (stabilization); 4/6 done
Ran mumble from branch checkout (handshake budget attempts=36/180s). ALL tiers PASS incl
test_handshake_completes_with_channel_presence; promote succeeded (canonical 1.0.0+v1.6.870-0 idle).
The longer budget is active + non-regressing. NOTE: mumble is green in isolation regardless of budget
(the 60s sufficed in isolation); the budget matters UNDER LOAD, which is hard to reproduce
deterministically — so this verifies the stabilization is applied + sound + non-weakening, not a literal
load-flake repro. (M1 already established green-isolation/red-under-canon-load; the fix gives the
handshake 3x the readiness window.) **Stabilization fix verified.** 4/6: mattermost, discourse,
keycloak, mumble. Remaining: bluesky (force-chaos verify of the rename), gitea (rework).
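A minimal sketch of why the longer budget does not weaken the assertion. `retry_until` and `ready_after` are hypothetical helpers written for this illustration; they are not the `_mumble_proto.retry_handshake` implementation, which is not shown in this diff. The probe is simulated and no real sleeping happens.

```python
import time
from typing import Callable


def retry_until(probe: Callable[[], bool], attempts: int, interval: float,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times, sleeping `interval` between tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(interval)
    return False


def budget_seconds(attempts: int, interval: float) -> float:
    # Nominal budget as the journal counts it: attempts x interval.
    return attempts * interval


def ready_after(n: int) -> Callable[[], bool]:
    """Simulated server that only answers from the n-th probe onward."""
    calls = {"count": 0}

    def probe() -> bool:
        calls["count"] += 1
        return calls["count"] >= n

    return probe


def no_sleep(_seconds: float) -> None:
    """Skip real waiting so the sketch runs instantly."""


print(budget_seconds(12, 5.0), budget_seconds(36, 5.0))
# A server that needs ~100s of warmup (20 probes) fails the old budget...
print(retry_until(ready_after(20), 12, 5.0, no_sleep))
# ...passes the new one...
print(retry_until(ready_after(20), 36, 5.0, no_sleep))
# ...and a server that never answers still fails.
print(retry_until(lambda: False, 36, 5.0, no_sleep))
```

The last call is the point of the design: extending the window changes how long a slow server is tolerated, not whether a dead one is reported.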
## 2026-06-18T03:52Z — M2 bluesky force-chaos verification approach
bluesky's rename can't deploy via the normal path (annotated tag -> non-chaos -> abra checks out the
tag, reverting the rename). In PRODUCTION post-merge the new tag would carry the rename (non-chaos
deploys it fine). For PRE-merge verification I force chaos via a temporary tests/bluesky-pds/
compose.ccci.yml scaffold on the branch (has_ccci_overlay -> deploy_app uses chaos -> deploys my
renamed checkout). Then cold goes green (service pds + branch exec-refs) and the promote deploys the
renamed recipe at warm-bluesky-pds via chaos -> caddy resolves the unique `pds` -> expect 200 (vs M1
000). The overlay is a verification scaffold (NOT part of recipe PR #4); removed after.
## 2026-06-18T04:05Z — M2 bluesky verification: STRUCTURAL blocker (pre-merge warm-promote)
bluesky rename verification keeps deploying the TAG's `app:` (not my rename), even with the tag moved to
the rename commit AND a force-chaos overlay. Root cause: the warm-promote/cold-on-latest path resolves the
recipe at the UPSTREAM annotated tag (deploy_app's recipe_checkout(tag) reverts unmerged content; the
chaos+overlay path STILL runs recipe_checkout on the pinned version). Unlike gitea (lightweight tag -> the
upgrade-tier chaos_redeploy uses the CHECKOUT, so the gitea fix deployed), bluesky has NO upgrade tier
(EXPECTED_NA) -> no chaos_redeploy path -> the rename never deploys on the promote path.
CONSEQUENCE: an unmerged RECIPE fix whose failure is WARM-PROMOTE-ONLY (bluesky 000) cannot be
end-to-end-verified via the standard harness pre-merge. mattermost/discourse were verifiable because
their failures are COLD tiers (restore/upgrade-overlay) reachable by !testme on the PR head.
bluesky fix correctness is nonetheless ESTABLISHED by: (1) M1 root cause (Adversary-confirmed): bare
`app` collides on the shared proxy; (2) docker test (proven): a unique service name/alias resolves to
the local service (no collision). Renaming app->pds (PR #4) gives a unique name -> caddy resolves THIS
PDS -> cert issued -> 200. End-to-end warm-200 needs either a DIRECT abra chaos deploy at
warm-bluesky-pds (manual app+secrets+PLC-key setup; next iteration) or operator post-merge verify.
Restored the bluesky tag; node clean; warm-keycloak 200.
## M2 STATUS (2026-06-18T04:05Z) — 4/6 verified
- mattermost-lts: VERIFIED (PR #1 ci/pg-restore, !testme run #901 all-green incl restore).
- discourse: VERIFIED (PR #4 discourse-official-image, !testme run #849 green).
- keycloak: VERIFIED (branch redfix-m2-harness; canonical promotes at warm-canon-keycloak, live warm-keycloak undisturbed 200).
- mumble: VERIFIED-stabilization (branch; green + budget 180s active; load-flake not deterministically reproducible).
- bluesky-pds: fix correct (PR #4 rename) + mechanically proven; end-to-end warm verify structurally blocked pre-merge -> direct-deploy or operator post-merge.
- gitea: PR #2 seed fix BROKE 3.5.3->3.6.0 transition (wizard mode); testable via chaos; NEEDS REWORK (reproduce+inspect).
NOT claiming M2 — bluesky end-to-end + gitea rework outstanding.

@@ -133,3 +133,26 @@ _(prior placeholder removed)_
save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's
classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify
canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.
- 2026-06-18T02:15Z — **M2 interim corroboration (NOT a verdict — M2 not yet claimed).** Node cold-checked
idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed
OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE
(mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run **#901** artifacts on cc-ci
(`/var/lib/cc-ci-runs/901/`) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all
pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact
M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite
`failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a
read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold
re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.
- 2026-06-18T04:12Z — **Idle break-it probe (NOT a verdict — M2 not yet claimed).** Cold-checked node
while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally
blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live
`warm-keycloak.ci.commoninternet.net/realms/master` = **200** (live shared SSO undisturbed by the
keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks =
infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version`=404 =
wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan
test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. **Critical break-it check PASSED: gitea
canonical is UNCHANGED** — `/var/lib/ci-warm/gitea/canonical.json` still `3.5.3+1.24.2-rootless`,
commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea
fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.
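The read-only junit check from the 02:15Z entry above can be sketched as follows. The XML in `SAMPLE` is an illustrative stand-in with the attribute values quoted in that entry, not the actual contents of run #901, and `suite_is_clean` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative junit fragment (assumption: shaped like standard junit output).
SAMPLE = """<testsuite name="restore" tests="1" failures="0" errors="0" skipped="0">
  <testcase classname="test_restore" name="test_restore_returns_state"/>
</testsuite>"""


def suite_is_clean(xml_text: str, expected_case: str) -> bool:
    """True if the suite containing `expected_case` ran it with no failure, error or skip."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this handles both a bare
    # <testsuite> document and one wrapped in <testsuites>.
    for suite in root.iter("testsuite"):
        names = {case.get("name") for case in suite.iter("testcase")}
        if expected_case not in names:
            continue
        bad = sum(int(suite.get(k, "0")) for k in ("failures", "errors", "skipped"))
        return bad == 0 and int(suite.get("tests", "0")) >= 1
    return False


print(suite_is_clean(SAMPLE, "test_restore_returns_state"))
```

Checking `skipped` alongside `failures` and `errors` matters here: a skipped restore test would leave the other two counters at zero without showing that the restore works.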

@@ -78,16 +78,18 @@ mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness
 on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
 deploys (Adversary done with M1).
-### M2 fix tracker
+### M2 fix tracker (updated 2026-06-18T03:15Z)
-| Recipe | Fix type | PR/branch | Status |
+| Recipe | Fix | PR/branch | Status |
 |---|---|---|---|
-| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **DONE — !testme run #901 ALL tiers green** (restore__cc-ci failures=0 skipped=0; the M1-failing test_restore_returns_state now PASSES) |
-| bluesky-pds | recipe PR (unique `pds` internal alias for caddy) | mirror PR #4 `ci/warm-routing-alias` | PR created; verifying on PROMOTE path (warm-bluesky-pds → expect 200 vs M1 000; !testme cold-only won't reproduce) |
-| gitea | recipe PR (app.ini → writable volume) | — | pending |
-| keycloak | harness (collision-free canonical_domain) + enroll | — | pending |
-| mumble | harness (handshake readiness/retry stabilization) | — | pending |
-| discourse | recipe PR (official-image migration) | mirror PR #4 `discourse-official-image` | already !testme-GREEN @53ba0910 (run #849, 16:36Z); re-verify fresh |
+| mattermost-lts | recipe: pg_backup.sh + restore.post-hook | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green (restore_returns_state PASS) |
+| discourse | recipe: official-image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head); re-verify fresh for claim |
+| bluesky-pds | recipe: rename service app->pds (abra drops aliases) + cc-ci exec-refs service=pds | mirror PR #4 `ci/warm-routing-alias` (rename) + branch `redfix-m2-harness` | IN PROGRESS — cold install PASSES (caddy->pds routing works!) but backup/restore/custom fail on `no running container <stack>_pds after 60s` (backup-bot cycle + exec poll); re-running w/ live inspection. Warm-promote (the actual 000 fix) blocked until cold green. |
+| gitea | recipe: app.ini writable (seed) | mirror PR #2 `ci/app-ini-writable` | **NEEDS REWORK** — seed fix works for fresh install but breaks 3.5.3->3.6.0 transition (wizard mode, /api/v1/version 404). Reverted clone. Rework: reproduce+inspect, or provide 1.24-valid oauth2 JWT. |
+| keycloak | harness: collision-free canonical_domain + WARM_CANONICAL=True | branch `redfix-m2-harness` | code done; verify pending (run from branch checkout -> promote at warm-canon-keycloak, live warm-keycloak stays 200) |
+| mumble | harness: handshake budget 60s->180s | branch `redfix-m2-harness` | code done; verify pending (green from branch checkout; load-green hard to repro) |
+Verification mechanism for cc-ci-side changes: run from a checkout of `redfix-m2-harness` at /tmp/cc-ci-m2run with CCCI_REPO set (never touches /etc/cc-ci or main).
 ## Gate: M1 — PASS (above). M2 not yet claimed.

@@ -40,17 +40,7 @@ def is_enrolled(recipe: str) -> bool:
 def canonical_domain(recipe: str) -> str:
-    """Stable data-warm domain for the recipe's canonical.
-    For a recipe that is ALSO a live-warm provider (in `warm.WARM_DOMAINS` — e.g. keycloak, whose
-    always-on shared OIDC instance lives at `warm-keycloak…`), the data-warm canonical MUST use a
-    DISTINCT domain: otherwise the sweep's promote deploy/teardown at `warm-<recipe>` collides with —
-    and could disrupt — the live shared service that other recipes (lasuite-*/drone) depend on. Give
-    those recipes a collision-free `warm-canon-<recipe>` namespace (a separate stack/domain that can
-    never touch the live provider); every other recipe keeps the plain `warm-<recipe>` scheme
-    (zero blast radius on the 15 existing canonicals)."""
-    if recipe in warm.WARM_DOMAINS:
-        return f"warm-canon-{recipe}.ci.commoninternet.net"
+    """Stable data-warm domain for the recipe's canonical."""
     return warm.stable_domain(recipe)

@@ -37,7 +37,7 @@ def _goat_admin(domain: str, args: str) -> str:
         f'--admin-password "$(cat /run/secrets/pds_admin_password)" '
         f"--pds-host {PDS_HOST_LOCAL} 2>&1"
     )
-    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="pds", timeout=120)
+    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], timeout=120)
 def account_did(domain: str) -> str | None:

@@ -46,7 +46,7 @@ def _in_container(domain: str, shell_cmd: str) -> str:
     """Run `shell_cmd` inside the PDS app container via exec_in_app (sh -c wrapper)."""
     # The admin_pw_flag uses $(cat ...) which only the sh inside the container can expand —
     # callers pass the raw shell command including those substitutions.
-    return lifecycle.exec_in_app(domain, ["sh", "-c", shell_cmd], service="pds", timeout=120)
+    return lifecycle.exec_in_app(domain, ["sh", "-c", shell_cmd], timeout=120)
 def _goat_admin(domain: str, args: str) -> str:

@@ -7,12 +7,10 @@ DEPLOY_TIMEOUT = (
 )
 HTTP_TIMEOUT = 900
-# phase redfix: keycloak IS now a data-warm canonical. The original canon §2.B exception de-enrolled
-# it because its canonical would have used the SAME domain as the live-warm OIDC provider
-# (warm-keycloak.ci.commoninternet.net), so the sweep's promote deploy/teardown would collide with the
-# live service lasuite-*/drone depend on. That collision is now structurally impossible:
-# `canonical.canonical_domain()` routes any recipe in `warm.WARM_DOMAINS` (keycloak) to a distinct
-# `warm-canon-<recipe>` domain/stack, so the data-warm canonical and the live-warm provider are
-# separate deployments that can never touch each other. keycloak therefore gets full data-warm
-# canonical coverage (a real promote on its latest release) without risking the live OIDC service.
-WARM_CANONICAL = True
+# canon §2.B EXCEPTION (recorded in DECISIONS): keycloak is NOT a data-warm canonical. It is the
+# project's LIVE-WARM OIDC dep provider — an always-on shared service at the SAME stable domain a
+# data-warm canonical would use (warm-keycloak.ci.commoninternet.net). Enrolling it would make the
+# sweep's promote deploy/teardown collide with the live provider that lasuite-*/drone depend on for
+# SSO. keycloak is instead kept current by the sweep's roll_warm_infra step (the health-gated
+# warm/infra reconciler, WC1.1) — so it never lacks coverage. WARM_CANONICAL stays False.
+WARM_CANONICAL = False

@@ -19,14 +19,7 @@ import _mumble_proto # noqa: E402
 def test_handshake_completes_with_channel_presence(live_app):
-    # Readiness budget: 36×5s = 180s. The TCP READY_PROBE (recipe_meta) only proves port 64738 is
-    # LISTENING; the murmur control channel needs additional warmup before it completes a full
-    # TLS+Version+ServerSync handshake. Under concurrent node load (the canon sweep) that warmup
-    # exceeded the old 60s budget and flaked this test RED, while it is reliably GREEN in isolation
-    # (phase redfix M1: 3× isolation green, 0 isolation reds). The longer budget absorbs the
-    # load-induced readiness delay WITHOUT weakening the assertion — a genuinely non-responsive
-    # server still exhausts all retries and FAILs (the asserts below are unchanged).
-    r = _mumble_proto.retry_handshake(attempts=36, interval=5.0)
+    r = _mumble_proto.retry_handshake(attempts=12, interval=5.0)
     assert r["tls_connect"], f"TLS connection to 127.0.0.1:64738 failed — {r.get('error')}"
     assert r["server_version"] is not None, "server did not send a Version message"