Compare commits
9 Commits: redfix-m2-...main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 966edb3042 | | |
| | 12925b5ab8 | | |
| | c5bc29bb97 | | |
| | a65372cfde | | |
| | 6846bbe83d | | |
| | ed7d897e5f | | |
| | fca936ef50 | | |
| | c021d7e305 | | |
| | 278cb4e4b8 | | |
@@ -356,3 +356,117 @@ cold green -> promote -> warm-bluesky-pds 200.
 - gitea: fix READY locally (/tmp/redfix-gitea: app.ini->staging + docker-setup seed-once + DOCKER_SETUP_SH_VERSION v2); needs PR push + warm-advance verify.
 - keycloak: harness fix (canonical_domain collision-free for WARM_DOMAINS recipes + enroll) NOT STARTED.
 - mumble: harness fix (handshake readiness/retry stabilization) NOT STARTED.
+
+## 2026-06-18T02:45Z — M2 progress: gitea PR + harness branch pushed; bluesky pivoted to rename
+
+- **gitea**: opened recipe PR #2 `ci/app-ini-writable` (app.ini->staging + docker-setup seed-once +
+  DOCKER_SETUP_SH_VERSION v2). Advance-path verification RUNNING (fixed 3.6.0 reattach to idle 3.5.3
+  canonical; expect no app.ini crash + promote). cold lifecycle green so far (install + cold upgrade
+  converged).
+- **bluesky**: PR #4 updated alias->RENAME service app->pds (abra drops aliases). 3-line recipe diff,
+  validates. Coupled cc-ci exec-ref change on branch.
+- **cc-ci harness branch `redfix-m2-harness`** pushed (3 commits): keycloak (collision-free
+  canonical_domain + WARM_CANONICAL=True), mumble (handshake budget 60s->180s), bluesky-pds
+  (exec_in_app service=pds). Verified via temp-checkout runs (CCCI_REPO=<branch checkout>).
+- Verification sequencing (node is single, serial): gitea advance (running) -> bluesky rename promote
+  (needs branch exec-refs) -> keycloak canonical at warm-canon-keycloak (needs branch) -> mumble.
+  NOTE: mumble "green under load" is hard to reproduce deterministically; plan = show branch run still
+  green + reason about the budget (or construct concurrent load).
+
+## 2026-06-18T03:00Z — M2 gitea fix v1 (seed) BROKE the transition — needs rework
+
+gitea advance verification (fixed 3.6.0): install tier PASSED FULLY (fresh 3.6.0 + my fix: API 200,
+admin auth OK — so the seed works for a FRESH deploy), but upgrade/backup/restore/custom ALL FAILED:
+`READY_PROBE not ready: /api/v1/version (last status 404) within 600s` after the 3.5.3->3.6.0 chaos
+redeploy → gitea came up in INSTALL-WIZARD mode (serves 200 but no API/admin = no valid app.ini).
+The LFS custom test's repo-create also 404'd (same wizard-mode cause).
+
+So my seed-once fix is fine for fresh install but FAILS the 3.5.3->3.6.0 transition — exactly the path
+the canon fix needs. Likely cause: on the chaos redeploy from a 3.5.3 stack (docker_setup_sh_v1, no
+seed) the docker-setup config didn't update to my v2 (seed) while compose moved app.ini to the staging
+path → /etc/gitea/app.ini empty → wizard. (To confirm: reproduce + inspect the post-redeploy container
+— is docker_setup_sh_v2 mounted? does /etc/gitea/app.ini exist? gitea log.) Reverted the fix from
+cc-ci's gitea clone; warm-gitea intact (idle 3.5.3, promote didn't fire on the red cold run). gitea
+recipe PR #2 stands but the fix needs a rework (likely: a more robust seed that runs regardless of
+config version, OR provide a 1.24-valid oauth2 JWT secret so gitea never rewrites app.ini — investigate
+WHY 1.24 regenerates it). Deferring gitea; proceeding to bluesky-rename / keycloak / mumble verifies.
+
+## 2026-06-18T03:30Z — M2 bluesky verification BLOCKED by abra non-chaos tag-revert; keycloak/mumble next
+
+Root cause of the bluesky rename verify failure: the deployed service was `..._app` (not `pds`).
+`run_recipe_ci` CCCI_SKIP_FETCH copies my renamed clone to the per-run tree, BUT abra's NON-CHAOS
+pinned deploy (bluesky's tag 0.3.0+v0.4.219 is ANNOTATED) does `git checkout <tag>` in the per-run
+tree, REVERTING my rename to the tag's `app:`. So the renamed recipe never deployed; the branch
+harness then execs `service=pds` -> "no running container <stack>_pds" -> backup/restore/custom red.
+(This also re-explains the earlier "abra dropped the alias" — it was the same tag-revert, not a drop.)
+gitea's tag is lightweight -> deploy_app uses chaos -> my gitea fix DID deploy (install passed); its
+failure is a real transition issue, not a revert.
+
+IMPLICATION: verifying a RECIPE fix (bluesky, gitea) via CCCI_SKIP_FETCH needs a CHAOS deploy (uses the
+checkout, not the tag). HARNESS fixes (keycloak canonical_domain, mumble retry) are runner/test code
+from the branch checkout — NO tag-revert — so they verify cleanly. Doing keycloak + mumble next.
+For bluesky: force chaos (deploy_app does chaos when has_ccci_overlay) OR reconsider a cc-ci-side
+overlay fix (alias + caddyfile override) — both verifiable; recipe PR #4 (rename) stays as the ideal
+upstream fix. gitea: rework + reproduce-with-inspection.
+
+## 2026-06-18T03:40Z — M2 keycloak FIXED + VERIFIED (collision-free canonical)
+
+Ran keycloak cold-on-latest from branch checkout /tmp/cc-ci-m2run (harness fix: canonical_domain ->
+warm-canon-keycloak for WARM_DOMAINS recipes; WARM_CANONICAL=True). RESULT: all cold tiers PASS
+(install/upgrade/backup/restore/custom), and WC5 promote SUCCEEDED:
+canonical keycloak @ 10.8.0+26.6.3, domain="warm-canon-keycloak.ci.commoninternet.net", idle, volume retained.
+- Promoted at the COLLISION-FREE domain warm-canon-keycloak (not warm-keycloak). ✓
+- Live warm-keycloak (shared OIDC provider) = 200 THROUGHOUT — undisturbed. ✓
+- warm-canon-keycloak = 404 now = CORRECT idle state (data-warm canonical undeployed, volume kept).
+So keycloak is now a full data-warm canonical with zero risk to the live SSO. **FIXED + verified.**
+3/6 verified: mattermost-lts, discourse, keycloak. Doing mumble next (harness, tractable).
+
+## 2026-06-18T03:50Z — M2 mumble VERIFIED (stabilization); 4/6 done
+
+Ran mumble from branch checkout (handshake budget attempts=36/180s). ALL tiers PASS incl
+test_handshake_completes_with_channel_presence; promote succeeded (canonical 1.0.0+v1.6.870-0 idle).
+The longer budget is active + non-regressing. NOTE: mumble is green in isolation regardless of budget
+(the 60s sufficed in isolation); the budget matters UNDER LOAD, which is hard to reproduce
+deterministically — so this verifies the stabilization is applied + sound + non-weakening, not a literal
+load-flake repro. (M1 already established green-isolation/red-under-canon-load; the fix gives the
+handshake 3x the readiness window.) **Stabilization fix verified.** 4/6: mattermost, discourse,
+keycloak, mumble. Remaining: bluesky (force-chaos verify of the rename), gitea (rework).
+
+## 2026-06-18T03:52Z — M2 bluesky force-chaos verification approach
+
+bluesky's rename can't deploy via the normal path (annotated tag -> non-chaos -> abra checks out the
+tag, reverting the rename). In PRODUCTION post-merge the new tag would carry the rename (non-chaos
+deploys it fine). For PRE-merge verification I force chaos via a temporary tests/bluesky-pds/
+compose.ccci.yml scaffold on the branch (has_ccci_overlay -> deploy_app uses chaos -> deploys my
+renamed checkout). Then cold goes green (service pds + branch exec-refs) and the promote deploys the
+renamed recipe at warm-bluesky-pds via chaos -> caddy resolves the unique `pds` -> expect 200 (vs M1
+000). The overlay is a verification scaffold (NOT part of recipe PR #4); removed after.
+
+## 2026-06-18T04:05Z — M2 bluesky verification: STRUCTURAL blocker (pre-merge warm-promote)
+
+bluesky rename verification keeps deploying the TAG's `app:` (not my rename), even with: tag moved to
+the rename commit AND a force-chaos overlay. Root: the warm-promote/cold-on-latest path resolves the
+recipe at the UPSTREAM annotated tag (deploy_app recipe_checkout(tag) reverts unmerged content; the
+chaos+overlay path STILL recipe_checkout's the pinned version). Unlike gitea (lightweight tag -> the
+upgrade-tier chaos_redeploy uses the CHECKOUT, so the gitea fix deployed), bluesky has NO upgrade tier
+(EXPECTED_NA) -> no chaos_redeploy path -> the rename never deploys on the promote path.
+
+CONSEQUENCE: an unmerged RECIPE fix whose failure is WARM-PROMOTE-ONLY (bluesky 000) cannot be
+end-to-end-verified via the standard harness pre-merge. mattermost/discourse were verifiable because
+their failures are COLD tiers (restore/upgrade-overlay) reachable by !testme on the PR head.
+
+bluesky fix correctness is nonetheless ESTABLISHED by: (1) M1 root cause (Adversary-confirmed): bare
+`app` collides on the shared proxy; (2) docker test (proven): a unique service name/alias resolves to
+the local service (no collision). Renaming app->pds (PR #4) gives a unique name -> caddy resolves THIS
+PDS -> cert issued -> 200. End-to-end warm-200 needs either a DIRECT abra chaos deploy at
+warm-bluesky-pds (manual app+secrets+PLC-key setup; next iteration) or operator post-merge verify.
+Restored the bluesky tag; node clean; warm-keycloak 200.
+
+## M2 STATUS (2026-06-18T04:05Z) — 4/6 verified
+- mattermost-lts: VERIFIED (PR #1 ci/pg-restore, !testme run #901 all-green incl restore).
+- discourse: VERIFIED (PR #4 discourse-official-image, !testme run #849 green).
+- keycloak: VERIFIED (branch redfix-m2-harness; canonical promotes at warm-canon-keycloak, live warm-keycloak undisturbed 200).
+- mumble: VERIFIED-stabilization (branch; green + budget 180s active; load-flake not deterministically reproducible).
+- bluesky-pds: fix correct (PR #4 rename) + mechanically proven; end-to-end warm verify structurally blocked pre-merge -> direct-deploy or operator post-merge.
+- gitea: PR #2 seed fix BROKE 3.5.3->3.6.0 transition (wizard mode); testable via chaos; NEEDS REWORK (reproduce+inspect).
+NOT claiming M2 — bluesky end-to-end + gitea rework outstanding.

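Reviewer aside on the journal hunk above: the 03:30Z and 04:05Z entries turn on whether a recipe's release tag is annotated (bluesky) or lightweight (gitea). Git itself reports the difference through the object type of the tag ref: `git cat-file -t` prints `tag` for an annotated tag and `commit` for a lightweight one. The sketch below is only an illustration of that check; `classify_tag` and `tag_kind` are hypothetical helper names and are not part of cc-ci or abra.

```python
import subprocess


def classify_tag(object_type: str) -> str:
    """Map the output of `git cat-file -t <tag-ref>` to a tag kind."""
    kind = object_type.strip()
    if kind == "tag":
        # The ref points at a tag object: annotated tag.
        return "annotated"
    if kind == "commit":
        # The ref points straight at a commit: lightweight tag.
        return "lightweight"
    raise ValueError(f"unexpected object type for a tag ref: {kind!r}")


def tag_kind(repo_dir: str, tag: str) -> str:
    """Ask git which kind of tag `tag` is in the checkout at `repo_dir`."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "cat-file", "-t", f"refs/tags/{tag}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return classify_tag(out)
```

A pre-flight call such as `tag_kind(recipe_dir, "0.3.0+v0.4.219")` would show up front which of the two deploy behaviours described in the journal to expect, before an unmerged recipe change is put through a verification run.
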
@@ -133,3 +133,26 @@ _(prior placeholder removed)_
   save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's
   classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify
   canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.
+
+- 2026-06-18T02:15Z — **M2 interim corroboration (NOT a verdict — M2 not yet claimed).** Node cold-checked
+  idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed
+  OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE
+  (mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run **#901** artifacts on cc-ci
+  (`/var/lib/cc-ci-runs/901/`) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all
+  pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact
+  M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite
+  `failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a
+  read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold
+  re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.
+
+- 2026-06-18T04:12Z — **Idle break-it probe (NOT a verdict — M2 not yet claimed).** Cold-checked node
+  while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally
+  blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live
+  `warm-keycloak.ci.commoninternet.net/realms/master` = **200** (live shared SSO undisturbed by the
+  keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks =
+  infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version`=404 =
+  wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan
+  test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. **Critical break-it check PASSED: gitea
+  canonical is UNCHANGED** — `/var/lib/ci-warm/gitea/canonical.json` still `3.5.3+1.24.2-rootless`,
+  commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea
+  fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.

@@ -78,16 +78,18 @@ mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness
 on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
 deploys (Adversary done with M1).

-### M2 fix tracker
+### M2 fix tracker (updated 2026-06-18T03:15Z)

-| Recipe | Fix type | PR/branch | Status |
+| Recipe | Fix | PR/branch | Status |
 |---|---|---|---|
-| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **DONE — !testme run #901 ALL tiers green** (restore__cc-ci failures=0 skipped=0; the M1-failing test_restore_returns_state now PASSES) |
-| bluesky-pds | recipe PR (unique `pds` internal alias for caddy) | mirror PR #4 `ci/warm-routing-alias` | PR created; verifying on PROMOTE path (warm-bluesky-pds → expect 200 vs M1 000; !testme cold-only won't reproduce) |
-| gitea | recipe PR (app.ini → writable volume) | — | pending |
-| keycloak | harness (collision-free canonical_domain) + enroll | — | pending |
-| mumble | harness (handshake readiness/retry stabilization) | — | pending |
-| discourse | recipe PR (official-image migration) | mirror PR #4 `discourse-official-image` | already !testme-GREEN @53ba0910 (run #849, 16:36Z); re-verify fresh |
+| mattermost-lts | recipe: pg_backup.sh + restore.post-hook | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green (restore_returns_state PASS) |
+| discourse | recipe: official-image migration | mirror PR #4 `discourse-official-image` @53ba0910 | **VERIFIED** — !testme run #849 green (overlay passes on migrated head); re-verify fresh for claim |
+| bluesky-pds | recipe: rename service app->pds (abra drops aliases) + cc-ci exec-refs service=pds | mirror PR #4 `ci/warm-routing-alias` (rename) + branch `redfix-m2-harness` | IN PROGRESS — cold install PASSES (caddy->pds routing works!) but backup/restore/custom fail on `no running container <stack>_pds after 60s` (backup-bot cycle + exec poll); re-running w/ live inspection. Warm-promote (the actual 000 fix) blocked until cold green. |
+| gitea | recipe: app.ini writable (seed) | mirror PR #2 `ci/app-ini-writable` | **NEEDS REWORK** — seed fix works for fresh install but breaks 3.5.3->3.6.0 transition (wizard mode, /api/v1/version 404). Reverted clone. Rework: reproduce+inspect, or provide 1.24-valid oauth2 JWT. |
+| keycloak | harness: collision-free canonical_domain + WARM_CANONICAL=True | branch `redfix-m2-harness` | code done; verify pending (run from branch checkout -> promote at warm-canon-keycloak, live warm-keycloak stays 200) |
+| mumble | harness: handshake budget 60s->180s | branch `redfix-m2-harness` | code done; verify pending (green from branch checkout; load-green hard to repro) |

+Verification mechanism for cc-ci-side changes: run from a checkout of `redfix-m2-harness` at /tmp/cc-ci-m2run with CCCI_REPO set (never touches /etc/cc-ci or main).
+
 ## Gate: M1 — PASS (above). M2 not yet claimed.


@@ -40,17 +40,7 @@ def is_enrolled(recipe: str) -> bool:


 def canonical_domain(recipe: str) -> str:
-    """Stable data-warm domain for the recipe's canonical.
-
-    For a recipe that is ALSO a live-warm provider (in `warm.WARM_DOMAINS` — e.g. keycloak, whose
-    always-on shared OIDC instance lives at `warm-keycloak…`), the data-warm canonical MUST use a
-    DISTINCT domain: otherwise the sweep's promote deploy/teardown at `warm-<recipe>` collides with —
-    and could disrupt — the live shared service that other recipes (lasuite-*/drone) depend on. Give
-    those recipes a collision-free `warm-canon-<recipe>` namespace (a separate stack/domain that can
-    never touch the live provider); every other recipe keeps the plain `warm-<recipe>` scheme
-    (zero blast radius on the 15 existing canonicals)."""
-    if recipe in warm.WARM_DOMAINS:
-        return f"warm-canon-{recipe}.ci.commoninternet.net"
+    """Stable data-warm domain for the recipe's canonical."""
     return warm.stable_domain(recipe)



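Reviewer aside on the `canonical_domain` hunk above: a minimal, self-contained sketch of the collision-free routing that the longer variant of the function describes. The `WARM_DOMAINS` mapping and `stable_domain` here are stand-ins for the real `warm` module and are assumptions made for illustration; the real module's contents are not shown in this diff, and the `warm-<recipe>.ci.commoninternet.net` shape is inferred from the docstring and the journal.

```python
# Stand-ins for the real `warm` module (assumed shapes, for illustration only).
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}


def stable_domain(recipe: str) -> str:
    """Plain `warm-<recipe>` scheme used by ordinary canonicals (assumed)."""
    return f"warm-{recipe}.ci.commoninternet.net"


def canonical_domain(recipe: str) -> str:
    """Route live-warm providers to a separate `warm-canon-<recipe>` domain."""
    if recipe in WARM_DOMAINS:
        # Never reuse the live provider's domain for the data-warm canonical.
        return f"warm-canon-{recipe}.ci.commoninternet.net"
    return stable_domain(recipe)


if __name__ == "__main__":
    for name in ("keycloak", "gitea"):
        print(name, "->", canonical_domain(name))
```

Under these assumptions only recipes listed in `WARM_DOMAINS` change domain, which matches the docstring's point that the existing canonicals keep their `warm-<recipe>` addresses.
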
@@ -37,7 +37,7 @@ def _goat_admin(domain: str, args: str) -> str:
         f'--admin-password "$(cat /run/secrets/pds_admin_password)" '
         f"--pds-host {PDS_HOST_LOCAL} 2>&1"
     )
-    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="pds", timeout=120)
+    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], timeout=120)


 def account_did(domain: str) -> str | None:

@@ -46,7 +46,7 @@ def _in_container(domain: str, shell_cmd: str) -> str:
     """Run `shell_cmd` inside the PDS app container via exec_in_app (sh -c wrapper)."""
     # The admin_pw_flag uses $(cat ...) which only the sh inside the container can expand —
     # callers pass the raw shell command including those substitutions.
-    return lifecycle.exec_in_app(domain, ["sh", "-c", shell_cmd], service="pds", timeout=120)
+    return lifecycle.exec_in_app(domain, ["sh", "-c", shell_cmd], timeout=120)


 def _goat_admin(domain: str, args: str) -> str:

@@ -7,12 +7,10 @@ DEPLOY_TIMEOUT = (
 )
 HTTP_TIMEOUT = 900

-# phase redfix: keycloak IS now a data-warm canonical. The original canon §2.B exception de-enrolled
-# it because its canonical would have used the SAME domain as the live-warm OIDC provider
-# (warm-keycloak.ci.commoninternet.net), so the sweep's promote deploy/teardown would collide with the
-# live service lasuite-*/drone depend on. That collision is now structurally impossible:
-# `canonical.canonical_domain()` routes any recipe in `warm.WARM_DOMAINS` (keycloak) to a distinct
-# `warm-canon-<recipe>` domain/stack, so the data-warm canonical and the live-warm provider are
-# separate deployments that can never touch each other. keycloak therefore gets full data-warm
-# canonical coverage (a real promote on its latest release) without risking the live OIDC service.
-WARM_CANONICAL = True
+# canon §2.B EXCEPTION (recorded in DECISIONS): keycloak is NOT a data-warm canonical. It is the
+# project's LIVE-WARM OIDC dep provider — an always-on shared service at the SAME stable domain a
+# data-warm canonical would use (warm-keycloak.ci.commoninternet.net). Enrolling it would make the
+# sweep's promote deploy/teardown collide with the live provider that lasuite-*/drone depend on for
+# SSO. keycloak is instead kept current by the sweep's roll_warm_infra step (the health-gated
+# warm/infra reconciler, WC1.1) — so it never lacks coverage. WARM_CANONICAL stays False.
+WARM_CANONICAL = False

@@ -19,14 +19,7 @@ import _mumble_proto # noqa: E402


 def test_handshake_completes_with_channel_presence(live_app):
-    # Readiness budget: 36×5s = 180s. The TCP READY_PROBE (recipe_meta) only proves port 64738 is
-    # LISTENING; the murmur control channel needs additional warmup before it completes a full
-    # TLS+Version+ServerSync handshake. Under concurrent node load (the canon sweep) that warmup
-    # exceeded the old 60s budget and flaked this test RED, while it is reliably GREEN in isolation
-    # (phase redfix M1: 3× isolation green, 0 isolation reds). The longer budget absorbs the
-    # load-induced readiness delay WITHOUT weakening the assertion — a genuinely non-responsive
-    # server still exhausts all retries and FAILs (the asserts below are unchanged).
-    r = _mumble_proto.retry_handshake(attempts=36, interval=5.0)
+    r = _mumble_proto.retry_handshake(attempts=12, interval=5.0)

     assert r["tls_connect"], f"TLS connection to 127.0.0.1:64738 failed — {r.get('error')}"
     assert r["server_version"] is not None, "server did not send a Version message"

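Reviewer aside on the mumble hunk above: the two variants differ only in the retry budget, `attempts × interval` (12 × 5 s = 60 s versus 36 × 5 s = 180 s). The sketch below shows a generic bounded-retry loop of that shape. It is not the implementation of `_mumble_proto.retry_handshake`, which this diff does not show; `retry_until` and its `probe`/`sleep` parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until(
    probe: Callable[[], Optional[T]],
    attempts: int = 36,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `probe` up to `attempts` times, pausing `interval` seconds between tries.

    Returns the first non-None probe result, or None once the budget is spent.
    The waiting time is bounded by roughly attempts * interval seconds.
    """
    for attempt in range(attempts):
        result = probe()
        if result is not None:
            return result
        # No need to wait after the final attempt.
        if attempt < attempts - 1:
            sleep(interval)
    return None
```

A larger `attempts` value only extends how long a slow-but-healthy server is tolerated; a probe that never succeeds still returns `None`, so assertions on the result fail exactly as before, which is the "non-weakening" property the removed comment describes.
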