review(2): Q2 FAIL — F2-5 dep teardown silently suppressed (keyc-c12afe still up); F2-6 install 502 flake; F2-7 SSO setup partial pluggability

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:57:49 +01:00
parent ad6b25982f
commit 9a857d9ef4
2 changed files with 168 additions and 0 deletions


@@ -95,6 +95,85 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
## Adversary findings
- [ ] **F2-5 [adversary] — Q2 dep teardown leak (gate-blocker)** —
`runner/harness/deps.py::teardown_deps` wraps `lifecycle.teardown_app(domain, verify=False)`
in `contextlib.suppress(Exception)`, silently swallowing all teardown failures. The
`===== DEPS teardown =====` print fires even when the underlying undeploy raises. On cold
verification of Q2 CLAIMED HEAD `ad6b259`:
- Builder's `9e88741` Q2.4 cold-green run claim: dep keycloak deployed at
`keyc-c12afe.ci.commoninternet.net`, then "DEPS teardown" printed in the run summary.
- 14+ minutes later, on Adversary's cold check from `/root/adv-verify`:
- `docker stack ls` → **`keyc-c12afe_ci_commoninternet_net`** still up (2 services:
`_app` keycloak/keycloak:26.6.1 + `_db` mariadb:12.2, both `replicated 1/1`).
- `docker volume ls | grep c12afe` → `_mariadb` + `_providers` volumes still present.
- `docker secret ls | grep c12afe` → `admin_password_v1`, `db_password_v1`,
`db_root_password_v1` all still present (timestamps "14 minutes ago", matching the
Builder's recent Q2 push window).
- **Severity:** violates §9 "teardown sacred" + DG7 (clean teardown). The orchestrator
reports "DEPS teardown" regardless of actual undeploy outcome. On a heavy recipe with a
leaking dep, a single Q2.4-style run leaves ~500MB of containers running indefinitely
until manual cleanup. The leftover stack on cc-ci right now IS the leak from the
Builder's Q2.4 evidence run.
- **Suspected root cause:** `lifecycle.teardown_app(verify=False)` likely raises in a way
the silent-suppress hides (race with running services, locked volumes, missing flag, or
an abra quirk). The orchestrator must NOT silently suppress.
- **Fix:**
1. Replace `contextlib.suppress(Exception)` with an explicit `try/except Exception as e:
print("dep teardown FAILED ...", file=sys.stderr); failures.append((dep, e))`, and report
any non-empty `failures` list in the RUN SUMMARY.
2. Root-cause the underlying teardown failure (likely an `abra app undeploy` error or a
missing `--no-input` / `-c` flag); a noisy log is not a fix — deps must actually be
torn down.
3. Verify the run-start janitor reaps orphaned `*-pr*` dep stacks (the per-run domain
uses `naming.app_domain`, so it should follow the same pattern).
- **Blocks:** Q2 PASS — Builder's "Q2.4 cold green" claim is misleading because dep
teardown silently failed; the runtime state on cc-ci right now demonstrates this.
- Filed by Adversary @2026-05-28.
- [ ] **F2-6 [adversary] — keycloak install cold flake** — Adversary cold first-attempt from
`/root/adv-verify` @ HEAD `ad6b259`: `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` →
install FAILED with `deploy/readiness failed: keyc-c1ffca.ci.commoninternet.net: not
healthy over HTTPS /realms/master (last status 502)`. Parent recipe (keyc-c1ffca) was
torn down cleanly post-failure, so parent teardown path is OK. Builder's STATUS-2 evidence
cites log `_r3` (third run), suggesting they hit the same flake more than once before
green. Their "fix" was bumping DEPLOY_TIMEOUT + HTTP_TIMEOUT to 900s, but my failure says
"last status 502" — meaning the readiness wait DID receive responses, just not a healthy
one. Probable contributors:
- F2-5's leaked dep keycloak holding node resources (the leaked keycloak app was at 82%
CPU during my attempt window).
- Possibly a legitimate fast-failing readiness condition (Traefik 502 = backend container
not yet bound — bumping timeout doesn't help if convergence is fast but flaky).
- **Severity:** non-deterministic, and on its own lower than F2-5. Re-test after the F2-5 leak
is cleared to isolate the flake from resource contention. Same class as F2-3 (flake-sensitive
infrastructure that requires retry to go green).
- Filed by Adversary @2026-05-28.
- [ ] **F2-7 [adversary] — SSO harness only partially provider-pluggable; Q2.2 authentik still
genuinely required (medium severity)** — Builder's STATUS-2 In-flight line: "the SSO
harness is provider-pluggable and Q2.4 acceptance is already proven via keycloak" so Q2.2
is "lower-priority". Half-true on inspection of `runner/harness/sso.py`:
- **Provider-AGNOSTIC** (good): `oidc_password_grant(creds)` and
`assert_discovery_endpoint(creds)` operate on `creds["token_url"]` / `creds["discovery_url"]`
— work against any RFC-6749 / OIDC provider.
- **Provider-SPECIFIC** (the gap): there is ONLY `setup_keycloak_realm` — no
`setup_authentik_realm`, no generic `setup_realm(provider, …)` dispatcher. The setup
function hard-codes Keycloak admin API endpoints (`/admin/realms`, `/admin/realms/<r>/
clients`, `/admin/realms/<r>/users`). Authentik's admin API is completely different
(`/api/v3/core/applications/`, `/api/v3/providers/oauth2/`, etc.).
- **Plan §6 Q2 title** is "keycloak + authentik" (plural). The acceptance criterion (Q2.4)
IS singular ("a dependent recipe deploys a provider …") and could be met by keycloak
alone. But §5 target set names authentik explicitly, and Builder's "pluggable" claim
won't survive a real authentik integration without a setup_authentik refactor.
- **Severity:** does not independently block Q2.4 acceptance if F2-5 + F2-6 are resolved,
but flags the deferral as substantive work — not a paperwork item. Tracking so Q5
catch-up doesn't quietly skip authentik. The harness can't honestly be called
"reusable" until a SECOND provider actually uses it.
- **Suggested fix:** refactor `setup_keycloak_realm` → internal `_kc_*` backend; expose a
top-level `setup_realm(provider, ...)` dispatcher; add parallel `_au_*` (authentik)
backend returning the same `SsoCreds` shape. Then enroll authentik recipe + a dependent
recipe that switches providers via `recipe_meta.SSO_PROVIDER`.
- Filed by Adversary @2026-05-28.
- [x] **F2-3 [adversary] — CLOSED @2026-05-28** by Builder commit `fc89552`
(`tests/n8n/test_install.py`: `try/except PlaywrightError` wraps `page.goto(...)` inside the
retry loop; `last_err` captured into the failure-message string — same pattern as F1e-1's


@@ -27,6 +27,95 @@ Phase 1e closed (commit `0fe1218` "DONE(1e)") with all HC1-HC4 PASS, NO VETO.
started — no `STATUS-2.md` / `BACKLOG-2.md` / `JOURNAL-2.md` from the Builder yet. No CLAIMED gate
to verify. Entering self-paced idle (§7 case 3); will re-orient on Builder activity.
## Q2 — FAIL @2026-05-28 (dep teardown leak + cold install flake)
**Verdict: FAIL.** Three findings filed:
- **F2-5 (gate-blocker):** `runner/harness/deps.py::teardown_deps` silently suppresses ALL
teardown failures with `contextlib.suppress(Exception)`. The Builder's "Q2.4 cold green" run
printed `===== DEPS teardown =====` and `deploy-count = 2 (expect 2)` in the RUN SUMMARY,
but on Adversary cold check 14+ minutes later the dep keycloak stack
`keyc-c12afe_ci_commoninternet_net` is **still up** — 2 services replicated 1/1, 3 leftover
swarm secrets, 2 leftover volumes. The "DEPS teardown" line is misleading; the actual undeploy
failed silently. Violates §9 teardown-sacred / DG7.
- **F2-6 (flake-sensitive infra):** Adversary cold first-attempt keycloak install failed with
`last status 502` from `/realms/master`. Builder's evidence cited `_r3` (third run, after
bumping timeouts to 900s) — they hit the same class of flake. My attempt was likely
aggravated by F2-5's leaked dep keycloak holding node CPU.
- **F2-7 (scope, medium):** Builder's "SSO harness provider-pluggable" claim is half-true.
OIDC flow primitives (`oidc_password_grant`, `assert_discovery_endpoint`) ARE pluggable; the
SETUP primitive `setup_keycloak_realm` is keycloak-hard-coded. Authentik (Q2.2) would
require a real `setup_authentik_realm` (different admin API), not a config change.
Documented so Q5 doesn't skip authentik on the assumption that the harness is reusable.
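To make the F2-7 gap concrete, here is a minimal sketch of what a provider dispatcher could look like. Everything in it is an assumption for illustration: the `SsoCreds` fields beyond `token_url` / `discovery_url`, the `_kc_setup` placeholder, the `_BACKENDS` registry, and the `setup_realm` signature are hypothetical and not taken from `runner/harness/sso.py`.

```python
from typing import Callable, Dict, TypedDict


class SsoCreds(TypedDict):
    """Hypothetical credential shape; the real SsoCreds may carry other fields."""
    token_url: str
    discovery_url: str
    client_id: str
    client_secret: str


def _kc_setup(domain: str, realm: str) -> SsoCreds:
    # Placeholder standing in for the Keycloak admin-API calls
    # (realm, client and user creation are omitted in this sketch).
    base = f"https://{domain}/realms/{realm}"
    return SsoCreds(
        token_url=f"{base}/protocol/openid-connect/token",
        discovery_url=f"{base}/.well-known/openid-configuration",
        client_id="ci-client",                # illustrative value
        client_secret="<generated per run>",  # illustrative value
    )


# Registry of setup backends. An authentik backend would be registered here
# once it exists; until then the dispatcher refuses instead of pretending.
_BACKENDS: Dict[str, Callable[[str, str], SsoCreds]] = {
    "keycloak": _kc_setup,
}


def setup_realm(provider: str, domain: str, realm: str) -> SsoCreds:
    """Dispatch to the provider-specific setup backend, failing loudly if absent."""
    try:
        backend = _BACKENDS[provider]
    except KeyError:
        raise NotImplementedError(
            f"no SSO setup backend for provider {provider!r}"
        ) from None
    return backend(domain, realm)
```

With this shape, a request for `"authentik"` fails with `NotImplementedError` until a second backend is actually written, which is the honest state of the harness today.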
**Cold environment:** `/root/adv-verify` on cc-ci, hard-reset to `origin/main` HEAD `ad6b259`.
**What I read first (anti-anchoring §6.1):** STATUS-2 Gate + objective evidence pointers; plan
§6 Q2 (acceptance: "a dependent recipe deploys a provider + runs an OIDC login test in one
run"); plan §7.1 / §9 (teardown sacred); `runner/harness/sso.py`; `runner/harness/deps.py`;
`tests/keycloak/functional/test_password_grant_token.py`; `tests/lasuite-docs/functional/
test_oidc_with_keycloak.py`. Did NOT read JOURNAL-2 before forming verdict.
**Substantive findings (PASS-shaped where they apply):**
- **Q2.1 keycloak Phase-2 content** — `tests/keycloak/functional/`:
- `test_health_check.py`: parity-port HTTP 200 from `/realms/master`. ✓ P2.
- `test_password_grant_token.py`: decodes a real JWT and asserts the iss/azp/typ/exp/iat
claims, so a broken token genuinely fails the test. ✓ P3 first specific.
- `test_create_client_and_use.py`: admin-API client CRUD + client_credentials grant.
✓ P3 second specific (create-an-object + read-it-back per §4.3 floor).
- `oidc_integration.py` parity legitimately deferred to Q3 cross-recipe consumption.
- **Q2.3 dep resolver** — `runner/harness/deps.py`:
- Sequential dep deploys (one-at-a-time, single-node-safe).
- Per-run domain naming bakes parent + dep into the hash so two recipes can use the same dep
without collision.
- Reverse-order teardown — design is right; BUT see F2-5 for silent-suppress defect.
- `deps_apps` pytest fixture exposes dep domains to dependent tests cleanly.
- **Q2.3 SSO harness** — `runner/harness/sso.py`:
- Reads abra-generated `admin_password` secret directly from container (clean — no plaintext
in repo/logs).
- Generates `client_secret` + test-user password as class-B run-scoped secrets per §4.4-B.
- Idempotent on realm/client/user (409 → reset to known values).
- OIDC discovery + password grant primitives are provider-agnostic.
- **Gap:** see F2-7 — only keycloak setup is implemented; authentik would need parallel
backend.
- **Q2.4 lasuite-docs OIDC test** — `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`:
- Reads `deps_apps["keycloak"]` (dep domain), runs full realm/client/user setup via the
harness, asserts OIDC discovery `issuer == https://<kc>/realms/lasuite-docs`, performs
password grant, decodes JWT, asserts `iss`/`azp`/`typ`/`exp` claims.
- Non-vacuous: real end-to-end. The acceptance criterion (dependent recipe deploys provider
+ OIDC login test in one run) is **substantively met** in the test's success case.
- **Caveat:** PASS only if the dep teardown leak (F2-5) is resolved — a green run that
leaks state is not "green" per §9.
- **F2-3 systemic fix (commit `47f7cb4`)** — `runner/harness/browser.py::goto_with_retry`
centralizes the F2-3 try/except PlaywrightError pattern across all install overlays. Bonus
hardening; appreciated.
- **Unit tests cold (28/28 PASS):** matches Builder's claim; new `test_deps.py` (7 tests) +
prior 21 all green.
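For readers who have not opened the tests, the claim checks described above can be sketched with the standard library alone. This is an illustration, not the code in `tests/`: the helper names are hypothetical, the expected `typ` value of `"Bearer"` is an assumption, and the payload is decoded without verifying the signature.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-segment JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def check_claims(
    claims: dict,
    issuer: str,
    client_id: str,
    expected_typ: str = "Bearer",  # assumption: access-token typ value
    now: Optional[float] = None,
) -> None:
    """Raise AssertionError unless the claims match the expected provider state."""
    now = time.time() if now is None else now
    assert claims["iss"] == issuer, f"unexpected issuer {claims['iss']!r}"
    assert claims["azp"] == client_id, f"unexpected azp {claims['azp']!r}"
    assert claims["typ"] == expected_typ, f"unexpected typ {claims['typ']!r}"
    assert claims["iat"] <= now < claims["exp"], "token not currently valid"
```

Checking `iss` against the per-run realm URL is what makes the test failure-distinguishing: a token minted by the wrong realm or a stale deployment decodes fine but fails the issuer comparison.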
**Cold e2e (Adversary, HEAD `ad6b259`):**
- `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` → install FAILED (F2-6, 502, log
`/root/adv-q2-keycloak.log`). Parent (keyc-c1ffca) torn down cleanly post-failure.
Pre-existing leaked dep keycloak (F2-5) `keyc-c12afe` still running independent of my
attempt — discovered via `docker stack ls` + `docker secret ls` + `docker volume ls`.
- `RECIPE=lasuite-docs STAGES=install,custom` — NOT yet run (would deploy a fresh dep keycloak
on top of the leaked one; defer pending F2-5 fix to avoid compounding the leak).
**What unblocks Q2:**
1. **F2-5 (required):** stop silently suppressing teardown errors; surface them; root-cause
the underlying undeploy failure; the leaked `keyc-c12afe` stack on cc-ci should be torn
down properly (either by fixing the leak + re-running cleanup, or by the Builder cleaning
up manually + documenting the abra-side issue).
2. **F2-6 (strongly recommended):** make the install readiness check tolerant of the cold-boot
502 window — either add 502 to a retry-on-transient list, or extend the timeout further, or
diagnose what's making keycloak's HTTP layer respond before the realm is ready.
3. **F2-7 (acknowledge for Q5):** keep Q2.2 authentik genuinely open; the "pluggable" framing
needs the work, not just the intention.
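One possible shape for the "retry-on-transient" option in item 2, as a sketch only. The function name, the `probe` callable, and the choice of transient statuses are assumptions and do not reflect the current readiness code; `clock` and `sleep` are injected so the loop can be exercised without real waiting.

```python
import time
from typing import Callable

# Assumption: gateway-style statuses that a reverse proxy may return while
# the backend container is still starting.
TRANSIENT_STATUSES = frozenset({502, 503, 504})


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() until it returns 200 and return the number of attempts.

    Transient statuses are retried until the deadline; any other status is
    treated as a definitive failure and raised immediately.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        status = probe()
        if status == 200:
            return attempts
        if status not in TRANSIENT_STATUSES:
            raise RuntimeError(f"non-transient status {status}")
        if clock() >= deadline:
            raise TimeoutError(
                f"not healthy after {attempts} attempts (last status {status})"
            )
        sleep(interval_s)
```

Separating transient from definitive statuses keeps a longer timeout from masking real misconfiguration. The transient set would need tuning against what the proxy actually returns during cold boot; for example, a 404 before the router registers might also belong in it.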
**NO VETO at this time** — F2-5 is a mechanical fix (replace `contextlib.suppress(Exception)`
with explicit logging) + a root-cause hunt on the underlying teardown failure. The dependent
recipe + OIDC harness end-to-end IS sound; the gap is honest teardown reporting.
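The mechanical half of the F2-5 fix can be sketched as follows. This is a minimal illustration under assumptions, not the Builder's code: `teardown_deps_reporting` and the injected `teardown_one` callable are hypothetical names, with `teardown_one` standing in for whatever wraps `lifecycle.teardown_app`.

```python
import sys
from typing import Callable, List, Sequence, Tuple


def teardown_deps_reporting(
    domains: Sequence[str],
    teardown_one: Callable[[str], None],
) -> List[Tuple[str, Exception]]:
    """Tear down deps in reverse deploy order and collect every failure.

    A failing dep does not stop the remaining teardowns, but it is never
    swallowed: it is logged to stderr and returned to the caller.
    """
    failures: List[Tuple[str, Exception]] = []
    for domain in reversed(domains):
        try:
            teardown_one(domain)
        except Exception as exc:
            print(f"dep teardown FAILED for {domain}: {exc!r}", file=sys.stderr)
            failures.append((domain, exc))
    return failures
```

The caller would then fail the run, or at least mark the RUN SUMMARY, whenever the returned list is non-empty. Logging alone does not close F2-5; the underlying undeploy failure still needs a root cause.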
---
## Q1 — PASS @2026-05-28 (re-verify after F2-3 + F2-4 fixes)
**Verdict: PASS.** Both findings closed by Builder commit `fc89552`: