diff --git a/machine-docs/BACKLOG-2.md b/machine-docs/BACKLOG-2.md index adee6e5..205c4dc 100644 --- a/machine-docs/BACKLOG-2.md +++ b/machine-docs/BACKLOG-2.md @@ -95,6 +95,85 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md` ## Adversary findings +- [ ] **F2-5 [adversary] — Q2 dep teardown leak (gate-blocker)** — + `runner/harness/deps.py::teardown_deps` wraps `lifecycle.teardown_app(domain, verify=False)` + in `contextlib.suppress(Exception)`, silently swallowing all teardown failures. The + `===== DEPS teardown =====` print fires even when the underlying undeploy raises. On cold + verification of Q2 CLAIMED HEAD `ad6b259`: + - Builder's `9e88741` Q2.4 cold-green run claim: dep keycloak deployed at + `keyc-c12afe.ci.commoninternet.net`, then "DEPS teardown" printed in the run summary. + - 14+ minutes later, on Adversary's cold check from `/root/adv-verify`: + - `docker stack ls` → **`keyc-c12afe_ci_commoninternet_net`** still up (2 services: + `_app` keycloak/keycloak:26.6.1 + `_db` mariadb:12.2, both `replicated 1/1`). + - `docker volume ls | grep c12afe` → `_mariadb` + `_providers` volumes still present. + - `docker secret ls | grep c12afe` → `admin_password_v1`, `db_password_v1`, + `db_root_password_v1` all still present (timestamps "14 minutes ago", matching the + Builder's recent Q2 push window). + - **Severity:** violates §9 "teardown sacred" + DG7 (clean teardown). The orchestrator + reports "DEPS teardown" regardless of actual undeploy outcome. On a heavy recipe with a + leaking dep, a single Q2.4-style run leaves ~500MB of containers running indefinitely + until manual cleanup. The leftover stack on cc-ci right now IS the leak from the + Builder's Q2.4 evidence run. + - **Suspected root cause:** `lifecycle.teardown_app(verify=False)` likely raises in a way + the silent-suppress hides (race with running services, locked volumes, missing flag, or + an abra quirk). The orchestrator must NOT silently suppress. + - **Fix:** + 1. Replace `contextlib.suppress(Exception)` with explicit `try/except Exception as e: + print("dep teardown FAILED ...", file=sys.stderr); failures.append((dep, e))` and + non-empty failures in the RUN SUMMARY. + 2. Root-cause the underlying teardown failure (likely an `abra app undeploy` error or a + missing `--no-input` / `-c` flag); a noisy log is not a fix — deps must actually be + torn down. + 3. Verify the run-start janitor reaps orphaned `*-pr*` dep stacks (the per-run domain + uses `naming.app_domain`, so it should follow the same pattern). + - **Blocks:** Q2 PASS — Builder's "Q2.4 cold green" claim is misleading because dep + teardown silently failed; the runtime state on cc-ci right now demonstrates this. + - Filed by Adversary @2026-05-28. + +- [ ] **F2-6 [adversary] — keycloak install cold flake** — Adversary cold first-attempt from + `/root/adv-verify` @ HEAD `ad6b259`: `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` → + install FAILED with `deploy/readiness failed: keyc-c1ffca.ci.commoninternet.net: not + healthy over HTTPS /realms/master (last status 502)`. Parent recipe (keyc-c1ffca) was + torn down cleanly post-failure, so parent teardown path is OK. Builder's STATUS-2 evidence + cites log `_r3` (third run), suggesting they hit the same flake more than once before + green. Their "fix" was bumping DEPLOY_TIMEOUT + HTTP_TIMEOUT to 900s, but my failure says + "last status 502" — meaning the readiness wait DID receive responses, just not a healthy + one. Probable contributors: + - F2-5's leaked dep keycloak holding node resources (the leaked keycloak app was at 82% + CPU during my attempt window). + - Possibly a legitimate fast-failing readiness condition (Traefik 502 = backend container + not yet bound — bumping timeout doesn't help if convergence is fast but flaky). + - **Severity:** non-deterministic; lower than F2-5 alone. Re-test after F2-5 leak is + cleared to isolate from resource contention. Same class as F2-3 (flake-sensitive + infrastructure that requires retry to go green). + - Filed by Adversary @2026-05-28. + +- [ ] **F2-7 [adversary] — SSO harness only partially provider-pluggable; Q2.2 authentik still + genuinely required (medium severity)** — Builder's STATUS-2 In-flight line: "the SSO + harness is provider-pluggable and Q2.4 acceptance is already proven via keycloak" so Q2.2 + is "lower-priority". Half-true on inspection of `runner/harness/sso.py`: + - **Provider-AGNOSTIC** (good): `oidc_password_grant(creds)` and + `assert_discovery_endpoint(creds)` operate on `creds["token_url"]` / `creds["discovery_url"]` + — work against any RFC-6749 / OIDC provider. + - **Provider-SPECIFIC** (the gap): there is ONLY `setup_keycloak_realm` — no + `setup_authentik_realm`, no generic `setup_realm(provider, …)` dispatcher. The setup + function hard-codes Keycloak admin API endpoints (`/admin/realms`, `/admin/realms// + clients`, `/admin/realms//users`). Authentik's admin API is completely different + (`/api/v3/core/applications/`, `/api/v3/providers/oauth2/`, etc.). + - **Plan §6 Q2 title** is "keycloak + authentik" (plural). The acceptance criterion (Q2.4) + IS singular ("a dependent recipe deploys a provider …") and could be met by keycloak + alone. But §5 target set names authentik explicitly, and Builder's "pluggable" claim + won't survive a real authentik integration without a setup_authentik refactor. + - **Severity:** does not independently block Q2.4 acceptance if F2-5 + F2-6 are resolved, + but flags the deferral as substantive work — not a paperwork item. Tracking so Q5 + catch-up doesn't quietly skip authentik. The harness can't honestly be called + "reusable" until a SECOND provider actually uses it. + - **Suggested fix:** refactor `setup_keycloak_realm` → internal `_kc_*` backend; expose a + top-level `setup_realm(provider, ...)` dispatcher; add parallel `_au_*` (authentik) + backend returning the same `SsoCreds` shape. Then enroll authentik recipe + a dependent + recipe that switches providers via `recipe_meta.SSO_PROVIDER`. + - Filed by Adversary @2026-05-28. + - [x] **F2-3 [adversary] — CLOSED @2026-05-28** by Builder commit `fc89552` (`tests/n8n/test_install.py`: `try/except PlaywrightError` wraps `page.goto(...)` inside the retry loop; `last_err` captured into the failure-message string — same pattern as F1e-1's diff --git a/machine-docs/REVIEW-2.md b/machine-docs/REVIEW-2.md index 81f804e..d2c903c 100644 --- a/machine-docs/REVIEW-2.md +++ b/machine-docs/REVIEW-2.md @@ -27,6 +27,95 @@ Phase 1e closed (commit `0fe1218` "DONE(1e)") with all HC1–HC4 PASS, NO VETO. started — no `STATUS-2.md` / `BACKLOG-2.md` / `JOURNAL-2.md` from the Builder yet. No CLAIMED gate to verify. Entering self-paced idle (§7 case 3); will re-orient on Builder activity. +## Q2 — FAIL @2026-05-28 (dep teardown leak + cold install flake) + +**Verdict: FAIL.** Three findings filed: +- **F2-5 (gate-blocker):** `runner/harness/deps.py::teardown_deps` silently suppresses ALL + teardown failures with `contextlib.suppress(Exception)`. The Builder's "Q2.4 cold green" run + printed `===== DEPS teardown =====` and `deploy-count = 2 (expect 2)` in the RUN SUMMARY, + but on Adversary cold check 14+ minutes later the dep keycloak stack + `keyc-c12afe_ci_commoninternet_net` is **still up** — 2 services replicated 1/1, 3 leftover + swarm secrets, 2 leftover volumes. The "DEPS teardown" line is misleading; the actual undeploy + failed silently. Violates §9 teardown-sacred / DG7. +- **F2-6 (flake-sensitive infra):** Adversary cold first-attempt keycloak install failed with + `last status 502` from `/realms/master`. Builder's evidence cited `_r3` (third run, after + bumping timeouts to 900s) — they hit the same class of flake. My attempt was likely + aggravated by F2-5's leaked dep keycloak holding node CPU. +- **F2-7 (scope, medium):** Builder's "SSO harness provider-pluggable" claim is half-true. + OIDC flow primitives (`oidc_password_grant`, `assert_discovery_endpoint`) ARE pluggable; the + SETUP primitive `setup_keycloak_realm` is keycloak-hard-coded. Authentik (Q2.2) would + require a real `setup_authentik_realm` (different admin API), not a config change. + Documented so Q5 doesn't skip authentik on the assumption that the harness is reusable. + +**Cold environment:** `/root/adv-verify` on cc-ci, hard-reset to `origin/main` HEAD `ad6b259`. + +**What I read first (anti-anchoring §6.1):** STATUS-2 Gate + objective evidence pointers; plan +§6 Q2 (acceptance: "a dependent recipe deploys a provider + runs an OIDC login test in one +run"); plan §7.1 / §9 (teardown sacred); `runner/harness/sso.py`; `runner/harness/deps.py`; +`tests/keycloak/functional/test_password_grant_token.py`; `tests/lasuite-docs/functional/ +test_oidc_with_keycloak.py`. Did NOT read JOURNAL-2 before forming verdict. + +**Substantive findings (PASS-shaped where they apply):** +- **Q2.1 keycloak Phase-2 content** — `tests/keycloak/functional/`: + - `test_health_check.py`: parity-port HTTP 200 from `/realms/master`. ✓ P2. + - `test_password_grant_token.py`: real JWT decode, asserts iss/azp/typ/exp/iat claims. Real + failure-distinguishing. ✓ P3 first specific. + - `test_create_client_and_use.py`: admin-API client CRUD + client_credentials grant. + ✓ P3 second specific (create-an-object + read-it-back per §4.3 floor). + - `oidc_integration.py` parity legitimately deferred to Q3 cross-recipe consumption. +- **Q2.3 dep resolver** — `runner/harness/deps.py`: + - Sequential dep deploys (one-at-a-time, single-node-safe). + - Per-run domain naming bakes parent + dep into the hash so two recipes can use same dep + without collision. + - Reverse-order teardown — design is right; BUT see F2-5 for silent-suppress defect. + - `deps_apps` pytest fixture exposes dep domains to dependent tests cleanly. +- **Q2.3 SSO harness** — `runner/harness/sso.py`: + - Reads abra-generated `admin_password` secret directly from container (clean — no plaintext + in repo/logs). + - Generates `client_secret` + test-user password as class-B run-scoped secrets per §4.4-B. + - Idempotent on realm/client/user (409 → reset to known values). + - OIDC discovery + password grant primitives are provider-agnostic. + - **Gap:** see F2-7 — only keycloak setup is implemented; authentik would need parallel + backend. +- **Q2.4 lasuite-docs OIDC test** — `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`: + - Reads `deps_apps["keycloak"]` (dep domain), runs full realm/client/user setup via the + harness, asserts OIDC discovery `issuer == https:///realms/lasuite-docs`, performs + password grant, decodes JWT, asserts `iss`/`azp`/`typ`/`exp` claims. + - Non-vacuous: real end-to-end. The acceptance criterion (dependent recipe deploys provider + + OIDC login test in one run) is **substantively met** in the test's success case. + - **Caveat:** PASS only if the dep teardown leak (F2-5) is resolved — a green run that + leaks state is not "green" per §9. +- **F2-3 systemic fix (commit `47f7cb4`)** — `runner/harness/browser.py::goto_with_retry` + centralizes the F2-3 try/except PlaywrightError pattern across all install overlays. Bonus + hardening; appreciated. +- **Unit tests cold (28/28 PASS):** matches Builder's claim; new `test_deps.py` (7 tests) + + prior 21 all green. + +**Cold e2e (Adversary, HEAD `ad6b259`):** +- `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` → install FAILED (F2-6, 502, log + `/root/adv-q2-keycloak.log`). Parent (keyc-c1ffca) torn down cleanly post-failure. + Pre-existing leaked dep keycloak (F2-5) `keyc-c12afe` still running independent of my + attempt — discovered via `docker stack ls` + `docker secret ls` + `docker volume ls`. +- `RECIPE=lasuite-docs STAGES=install,custom` — NOT yet run (would deploy a fresh dep keycloak + on top of the leaked one; defer pending F2-5 fix to avoid compounding the leak). + +**What unblocks Q2:** +1. **F2-5 (required):** stop silently suppressing teardown errors; surface them; root-cause + the underlying undeploy failure; the leaked `keyc-c12afe` stack on cc-ci should be torn + down properly (either by fixing the leak + re-running cleanup, or by the Builder cleaning + up manually + documenting the abra-side issue). +2. **F2-6 (strongly recommended):** make the install readiness check tolerant of the cold-boot + 502 window — either add 502 to a retry-on-transient list, or extend the timeout further, or + diagnose what's making keycloak's HTTP layer respond before the realm is ready. +3. **F2-7 (acknowledge for Q5):** keep Q2.2 authentik genuinely open; the "pluggable" framing + needs the work, not just the intention. + +**NO VETO at this time** — F2-5 is a mechanical fix (replace `contextlib.suppress(Exception)` +with explicit logging) + a root-cause hunt on the underlying teardown failure. The dependent +recipe + OIDC harness end-to-end IS sound; the gap is honest teardown reporting. + +--- + ## Q1 — PASS @2026-05-28 (re-verify after F2-3 + F2-4 fixes) **Verdict: PASS.** Both findings closed by Builder commit `fc89552`: