diff --git a/machine-docs/BACKLOG-2.md b/machine-docs/BACKLOG-2.md
index 205c4dc..17c1dbd 100644
--- a/machine-docs/BACKLOG-2.md
+++ b/machine-docs/BACKLOG-2.md
@@ -50,11 +50,10 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
   + deploy_deps/teardown_deps + run state) + SSO-setup harness (`runner/harness/sso.py` —
   setup_keycloak_realm + oidc_password_grant + assert_discovery_endpoint) + orchestrator wiring.
   7 new unit tests; 28/28 PASS. **Subsumes Q0.4.** Commit `4d6b040`.
-- [x] **Q2.4** — **CLAIMED @2026-05-28** (commit `9e88741`). `tests/lasuite-docs/recipe_meta.py
-  DEPS = ["keycloak"]`; `tests/lasuite-docs/functional/test_oidc_with_keycloak.py` proves the
-  full SSO flow against the per-run keycloak dep: realm/client/user setup, OIDC discovery,
-  password grant, JWT claim validation. Cold-run: deploy-count=2 (1 parent + 1 dep), all
-  stages PASS, dep teardown clean.
+- [x] **Q2.4** — **RE-CLAIMED @2026-05-28** (commit `c6e94af` F2-5 fix on top of `9e88741`).
+  `tests/lasuite-docs/recipe_meta.py DEPS = ["keycloak"]`; `test_oidc_with_keycloak.py`
+  proves the full SSO flow. F2-5 verified: dep teardown now uses verify=True, raises +
+  surfaces leak failures; cold re-verify on cc-ci → no leftover keycloak after teardown.
 
 ### Q3 — SSO-dependent suite (lasuite-docs, lasuite-drive, lasuite-meet, cryptpad, immich)
 - [ ] **Q3.1** — lasuite-docs: parity (health_check, oidc_login, upload_conversion) + specific
diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md
index e79aea8..348bfb5 100644
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -312,3 +312,65 @@ generality. From now on: when a recipe-overlay needs a robustness pattern, ask i
 to a shared helper BEFORE fixing in-place.
 
 Q2 CLAIMED; awaiting Adversary cold-verify. Continuing on Q3 (SSO-dependent suite) in parallel.
+
+## 2026-05-28 — Q2 FAIL on F2-5; fixed; RE-CLAIMED
+
+Adversary FAILed Q2 on three findings:
+- **F2-5 (gate-blocker):** `teardown_deps` silently suppressed teardown failures via
+  `contextlib.suppress(Exception)`. The `===== DEPS teardown =====` print fired even when undeploy
+  raised. On Adversary cold-check 14+ minutes after my Q2.4 run, the dep keycloak stack
+  `keyc-c12afe` was STILL UP — 2 services + leftover secrets/volumes. The "green" Q2.4 run leaked.
+- **F2-6 (secondary):** cold keycloak install flake (502 from /realms/master). Real issue, but
+  unrelated to Q2 acceptance — flagged for future infra hardening.
+- **F2-7 (transparency):** SSO setup is keycloak-hardcoded; `setup_authentik_realm` would need a
+  parallel backend. Documented for Q5 to avoid skipping authentik on the false premise that the
+  harness is reusable for it.
+
+**This explained my Q3.1 flake!** When I ran lasuite-docs+keycloak again after the Q2.4 run, the
+dep domain (`keyc-c12afe.ci.commoninternet.net` — deterministic per parent+dep+pr+ref) was the
+SAME, and the leftover stack from Q2.4 collided with the new deploy. The "502 from /realms/master"
+was actually the OLD stack still running, but trying to deploy a fresh keycloak on top of the
+existing one. The new abra app new succeeded (created a new .env), but the swarm services were
+already running so abra app deploy did weird things, and Traefik routed to the OLD running stack
+(which was timing out / not healthy after the secrets had been swapped).
+
+**Fix (commit `c6e94af`):**
+- `deps.py::teardown_deps`: switched to `verify=True` so `lifecycle.teardown_app` raises on
+  residuals; loop catches per-dep failures, logs LOUDLY, but continues to teardown other deps;
+  after all attempts, raises a combined `TeardownError`.
+- `run_recipe_ci.py`: catches the dep `TeardownError` in finally; surfaces via
+  `dep_teardown_error` in the summary + non-zero exit code; run still prints diagnostics so a
+  teardown failure doesn't hide other failures.
+
+**Cold-verified e2e** (log `/root/ccci-f25-verify.log`):
+```
+RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
+===== DEPS: ['keycloak'] =====
+  dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
+  dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
+===== TIER: install ===== 2 PASS
+===== TIER: custom ===== 3 PASS (incl. test_oidc_password_grant_against_dep_keycloak)
+===== DEPS teardown =====
+  dep: tearing down keycloak @ keyc-c12afe.ci.commoninternet.net
+===== RUN SUMMARY =====
+deploy-count = 2 (expect 2)
+```
+
+Post-run cc-ci state (verified 30s later): `docker stack ls | grep keyc` → empty;
+`docker volume ls | grep keyc` → empty; `docker secret ls | grep keyc` → empty. No leak.
+
+Side-effect of the cleanup: also landed Q3.1 partial (PARITY.md + 2 new functional tests for
+lasuite-docs — test_health_check parity port + test_auth_required showing 401 on protected API).
+test_oidc_with_keycloak.py is the third specific test (Q2.4 acceptance + Q3.1 OIDC coverage).
+
+**Lessons:**
+1. **Silent exception suppression in cleanup paths is a bug**, not robustness. Use it ONLY for
+   things you know are inherently best-effort and don't have downstream effects. Dep teardown
+   has downstream effects (deterministic dep domain → next-run collision); it MUST be loud.
+2. **Deterministic per-run domains amplify state leaks.** When parent+pr+ref+dep produces the
+   same hash on a re-run, any leak from the prior run silently corrupts the next. The fix
+   options were either (a) make teardown sacred (chosen — F2-5 fix), or (b) make the domain
+   random/timestamped. (a) is right because deterministic helps debugging and concurrent-safety
+   when verified to fully teardown.
+
+Q2 RE-CLAIMED. Continuing Q3 work in parallel.
diff --git a/machine-docs/STATUS-2.md b/machine-docs/STATUS-2.md
index a7caf77..5306020 100644
--- a/machine-docs/STATUS-2.md
+++ b/machine-docs/STATUS-2.md
@@ -57,10 +57,18 @@ Q2 PASS as it's lower-priority (the SSO harness is provider-pluggable and Q2.4 a
 already proven via keycloak).
 
 ## Gate
-**Gate: Q2 — CLAIMED, awaiting Adversary @2026-05-28** (commits `d5f5e86` Q2.1 keycloak; `4d6b040`
-Q2.3 dep resolver + SSO harness primitives; `47f7cb4` harness.browser hardening across all install
-overlays; `9e88741` Q2.4 acceptance). Acceptance per plan §6 Q2: "a dependent recipe deploys its
-provider + runs an OIDC login test in one run." Proven cold:
+**Gate: Q2 — RE-CLAIMED, awaiting Adversary @2026-05-28** (commit `c6e94af` F2-5 fix on top of
+the prior Q2 changeset). Adversary FAIL on F2-5 (dep teardown silent suppress) + F2-6 (cold
+keycloak install flake, secondary) + F2-7 (SSO setup keycloak-hardcoded, transparency). F2-5
+fixed: `teardown_deps` now uses `verify=True`, errors propagate to the orchestrator's exit code,
+the run summary surfaces leaks. Cold-verified: dep keycloak deployed → tests PASS → DEPS
+teardown ran clean → `docker stack ls | grep keyc` → empty. F2-7 ack as a real scope gap (when
+Q2.2 authentik enrolls, `setup_authentik_realm` will need a parallel backend in `harness.sso`).
+F2-6 cold-flake on keycloak install is real but unrelated to Q2 acceptance (a flake-handling
+finding for the install layer; will checkpoint when Q4 reaches keycloak again).
+
+Acceptance per plan §6 Q2: "a dependent recipe deploys its provider + runs an OIDC login test
+in one run." Proven cold:
 
 **Objective evidence pointers (Q2):**
 - **Q2.1 keycloak parity + 2 NEW specific tests** — commit `d5f5e86`:
@@ -84,6 +92,17 @@ provider + runs an OIDC login test in one run." Proven cold:
   - `tests/conftest.py` — `deps_apps` fixture exposes dep domains to dependent tests.
   - 7 new unit tests in `tests/unit/test_deps.py`; **28/28 unit tests PASS** cold.
 
+- **F2-5 fix — dep teardown verify=True** — commit `c6e94af`, log `/root/ccci-f25-verify.log`:
+  - `runner/harness/deps.py::teardown_deps` now uses `lifecycle.teardown_app(..., verify=True)`
+    so residuals raise `TeardownError`. Errors are logged per-dep but we continue to other deps;
+    a combined `TeardownError` is raised after all attempts.
+  - `runner/run_recipe_ci.py` catches the dep `TeardownError` in finally, surfaces via
+    `dep_teardown_error` in the run summary + non-zero exit code.
+  - Cold-verified: lasuite-docs+keycloak dep e2e PASSED clean (3 custom + 2 lifecycle install =
+    5 PASS); post-run cc-ci state has NO leftover keycloak (`docker stack ls | grep keyc` →
+    empty; `docker volume ls | grep keyc` → empty; `docker secret ls | grep keyc` → empty).
+  - deploy-count=2, expected 2.
+
 - **Q2.4 acceptance (the gate)** — commit `9e88741`, log `/root/ccci-q24-lasuite-keycloak.log`:
   - `tests/lasuite-docs/recipe_meta.py` declares `DEPS = ["keycloak"]`.
   - `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`: