status(2): Q2 RE-CLAIMED — F2-5 dep-teardown-verify fix cold-verified clean

Per REVIEW-2 ## Q2 FAIL @2026-05-28 (F2-5 dep teardown leak + F2-6 cold install flake + F2-7
SSO setup keycloak-hardcoded):

F2-5 closed by commit c6e94af: teardown_deps now uses verify=True so residuals raise; failures
propagate to orchestrator exit code + run summary. Cold-verified: lasuite-docs+keycloak e2e
PASS, dep teardown clean, post-run docker stack/volume/secret with 'keyc' filter all empty.

This also explained my Q3.1 flake: the leaked Q2.4 dep keycloak (deterministic dep domain) had
collided with my next dep deploy. With F2-5 fixed, that class of cross-run collision can no
longer go unnoticed (teardown now raises if it leaks, so the run fails BEFORE the next one starts).

F2-7 acknowledged: setup_keycloak_realm is keycloak-specific; authentik would need a parallel
backend. Logged for Q2.2/Q5.

F2-6 (cold keycloak install 502) — real but secondary; will checkpoint in Q4 sweep.

Side-effect: Q3.1 partial also landed (PARITY.md + test_health_check parity port +
test_auth_required + the prior test_oidc_with_keycloak.py as Q3.1 third specific test).

Cold evidence: ssh cc-ci 'RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py'
  deploy-count=2 (expect 2), all 5 assertions PASS, dep teardown clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 09:22:24 +01:00
parent 874bfbb915
commit 54b1fe326c
3 changed files with 89 additions and 9 deletions


@@ -312,3 +312,65 @@ generality. From now on: when a recipe-overlay needs a robustness pattern, ask i
to a shared helper BEFORE fixing in-place.
Q2 CLAIMED; awaiting Adversary cold-verify. Continuing on Q3 (SSO-dependent suite) in parallel.
## 2026-05-28 — Q2 FAIL on F2-5; fixed; RE-CLAIMED
Adversary FAILed Q2 on three findings:
- **F2-5 (gate-blocker):** `teardown_deps` silently suppressed teardown failures via
`contextlib.suppress(Exception)`. The `===== DEPS teardown =====` print fired even when undeploy
raised. On Adversary cold-check 14+ minutes after my Q2.4 run, the dep keycloak stack
`keyc-c12afe` was STILL UP — 2 services + leftover secrets/volumes. The "green" Q2.4 run leaked.
- **F2-6 (secondary):** cold keycloak install flake (502 from /realms/master). Real issue, but
unrelated to Q2 acceptance — flagged for future infra hardening.
- **F2-7 (transparency):** SSO setup is keycloak-hardcoded; `setup_authentik_realm` would need a
parallel backend. Documented for Q5 to avoid skipping authentik on the false premise that the
harness is reusable for it.
**This explained my Q3.1 flake!** When I ran lasuite-docs+keycloak again after the Q2.4 run, the
dep domain (`keyc-c12afe.ci.commoninternet.net`, deterministic per parent+dep+pr+ref) was the
SAME, and the leftover stack from Q2.4 collided with the new deploy. The "502 from /realms/master"
actually came from the OLD stack, which was still running when I tried to deploy a fresh keycloak
on top of it. `abra app new` succeeded (it created a new .env), but the swarm services were
already running, so `abra app deploy` behaved unpredictably, and Traefik routed to the OLD running
stack (which was timing out / not healthy after the secrets had been swapped).
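To illustrate why a re-run lands on the same dep domain, here is a minimal sketch. It assumes the domain is a short hash of parent+dep+pr+ref prefixed with the first four letters of the dep name; the function name `dep_domain`, the hash choice, and the `pr`/`ref` values are hypothetical and not taken from the harness.

```python
import hashlib


def dep_domain(parent: str, dep: str, pr: str, ref: str,
               base: str = "ci.commoninternet.net") -> str:
    """Hypothetical derivation: identical inputs always yield the same domain."""
    key = "|".join((parent, dep, pr, ref)).encode()
    digest = hashlib.sha256(key).hexdigest()[:6]  # assumed: 6 hex chars
    return f"{dep[:4]}-{digest}.{base}"


# Illustrative inputs only; the real pr/ref values are not in this log.
first = dep_domain("lasuite-docs", "keycloak", "0", "main")
rerun = dep_domain("lasuite-docs", "keycloak", "0", "main")
other = dep_domain("lasuite-docs", "keycloak", "0", "other-branch")
```

Because `first` and `rerun` are equal, a stack leaked by one run occupies exactly the name the next run wants to deploy to.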
**Fix (commit `c6e94af`):**
- `deps.py::teardown_deps`: switched to `verify=True` so `lifecycle.teardown_app` raises on
residuals; the loop catches per-dep failures, logs them LOUDLY, but continues to tear down the
other deps; after all attempts, it raises a combined `TeardownError`.
- `run_recipe_ci.py`: catches the dep `TeardownError` in finally; surfaces via
`dep_teardown_error` in the summary + non-zero exit code; run still prints diagnostics so a
teardown failure doesn't hide other failures.
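The shape of the two changes above can be sketched as follows. This is an illustration, not the code in commit `c6e94af`: the `teardown_app(dep, verify=True)` call signature, the `TeardownError` definition, the `run` function, and the summary keys other than `dep_teardown_error` are assumptions.

```python
class TeardownError(RuntimeError):
    """Stand-in for the harness's TeardownError (real definition not shown)."""


def teardown_deps(deps, teardown_app):
    """Tear down every dep; collect failures and raise once at the end."""
    failures = []
    for dep in deps:
        try:
            teardown_app(dep, verify=True)  # assumed to raise on residuals
        except Exception as exc:
            # Loud, but keep going so the remaining deps are still torn down.
            print(f"!!!!! DEP TEARDOWN FAILED: {dep}: {exc}")
            failures.append(f"{dep}: {exc}")
    if failures:
        raise TeardownError("; ".join(failures))


def run(deps, teardown_app, tiers):
    """Orchestrator sketch: a teardown failure reaches summary and exit code."""
    summary = {"tiers_ok": False, "dep_teardown_error": None}
    try:
        summary["tiers_ok"] = tiers()
    finally:
        try:
            teardown_deps(deps, teardown_app)
        except TeardownError as exc:
            summary["dep_teardown_error"] = str(exc)
    ok = summary["tiers_ok"] and summary["dep_teardown_error"] is None
    return (0 if ok else 1), summary
```

Recording the error instead of re-raising it inside `finally` keeps the summary and other diagnostics printable, while the exit code still reflects the leak.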
**Cold-verified e2e** (log `/root/ccci-f25-verify.log`):
```
RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
===== DEPS: ['keycloak'] =====
dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
===== TIER: install ===== 2 PASS
===== TIER: custom ===== 3 PASS (incl. test_oidc_password_grant_against_dep_keycloak)
===== DEPS teardown =====
dep: tearing down keycloak @ keyc-c12afe.ci.commoninternet.net
===== RUN SUMMARY =====
deploy-count = 2 (expect 2)
```
Post-run cc-ci state (verified 30s later): `docker stack ls | grep keyc` → empty;
`docker volume ls | grep keyc` → empty; `docker secret ls | grep keyc` → empty. No leak.
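The manual post-run check above could be scripted roughly like this. It is a sketch, not part of the harness: `find_residuals` and the injectable `run_cmd` parameter are hypothetical names, and the only real CLI calls used are `docker stack ls`, `docker volume ls`, and `docker secret ls` with `--format`.

```python
import subprocess

CHECKS = (
    ("stack", ["docker", "stack", "ls", "--format", "{{.Name}}"]),
    ("volume", ["docker", "volume", "ls", "--format", "{{.Name}}"]),
    ("secret", ["docker", "secret", "ls", "--format", "{{.Name}}"]),
)


def _docker(cmd):
    """Run a docker command and return its stdout; raises on non-zero exit."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def find_residuals(needle, run_cmd=_docker):
    """Return {kind: [names]} for docker objects whose name contains needle."""
    leftovers = {}
    for kind, cmd in CHECKS:
        names = [n for n in run_cmd(cmd).splitlines() if needle in n]
        if names:
            leftovers[kind] = names
    return leftovers
```

An empty result from `find_residuals("keyc")` corresponds to the three empty `grep keyc` outputs recorded above; anything else names the leaked objects by kind.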
Side-effect of the cleanup: also landed Q3.1 partial (PARITY.md + 2 new functional tests for
lasuite-docs — test_health_check parity port + test_auth_required showing 401 on protected API).
test_oidc_with_keycloak.py is the third specific test (Q2.4 acceptance + Q3.1 OIDC coverage).
**Lessons:**
1. **Silent exception suppression in cleanup paths is a bug**, not robustness. Use it ONLY for
things you know are inherently best-effort and don't have downstream effects. Dep teardown
has downstream effects (deterministic dep domain → next-run collision); it MUST be loud.
2. **Deterministic per-run domains amplify state leaks.** When parent+pr+ref+dep produces the
same hash on a re-run, any leak from the prior run silently corrupts the next. The fix
options were either (a) make teardown sacred (chosen: the F2-5 fix), or (b) make the domain
random/timestamped. (a) is right because deterministic domains help debugging and
concurrent-safety, provided teardown is verified to be complete.
Q2 RE-CLAIMED. Continuing Q3 work in parallel.