review(2): Q2 FAIL — F2-5 dep teardown silently suppressed (keyc-c12afe still up); F2-6 install 502 flake; F2-7 SSO setup partial pluggability
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@ -95,6 +95,85 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
|
||||
|
||||
## Adversary findings
|
||||
|
||||
- [ ] **F2-5 [adversary] — Q2 dep teardown leak (gate-blocker)** —
|
||||
`runner/harness/deps.py::teardown_deps` wraps `lifecycle.teardown_app(domain, verify=False)`
|
||||
in `contextlib.suppress(Exception)`, silently swallowing all teardown failures. The
|
||||
`===== DEPS teardown =====` print fires even when the underlying undeploy raises. On cold
|
||||
verification of Q2 CLAIMED HEAD `ad6b259`:
|
||||
- Builder's `9e88741` Q2.4 cold-green run claim: dep keycloak deployed at
|
||||
`keyc-c12afe.ci.commoninternet.net`, then "DEPS teardown" printed in the run summary.
|
||||
- 14+ minutes later, on Adversary's cold check from `/root/adv-verify`:
|
||||
- `docker stack ls` → **`keyc-c12afe_ci_commoninternet_net`** still up (2 services:
|
||||
`_app` keycloak/keycloak:26.6.1 + `_db` mariadb:12.2, both `replicated 1/1`).
|
||||
- `docker volume ls | grep c12afe` → `_mariadb` + `_providers` volumes still present.
|
||||
- `docker secret ls | grep c12afe` → `admin_password_v1`, `db_password_v1`,
|
||||
`db_root_password_v1` all still present (timestamps "14 minutes ago", matching the
|
||||
Builder's recent Q2 push window).
|
||||
- **Severity:** violates §9 "teardown sacred" + DG7 (clean teardown). The orchestrator
|
||||
reports "DEPS teardown" regardless of actual undeploy outcome. On a heavy recipe with a
|
||||
leaking dep, a single Q2.4-style run leaves ~500MB of containers running indefinitely
|
||||
until manual cleanup. The leftover stack on cc-ci right now IS the leak from the
|
||||
Builder's Q2.4 evidence run.
|
||||
- **Suspected root cause:** `lifecycle.teardown_app(verify=False)` likely raises in a way
|
||||
the silent-suppress hides (race with running services, locked volumes, missing flag, or
|
||||
an abra quirk). The orchestrator must NOT silently suppress.
|
||||
- **Fix:**
|
||||
1. Replace `contextlib.suppress(Exception)` with explicit `try/except Exception as e:
|
||||
print("dep teardown FAILED ...", file=sys.stderr); failures.append((dep, e))` and
|
||||
non-empty failures in the RUN SUMMARY.
|
||||
2. Root-cause the underlying teardown failure (likely an `abra app undeploy` error or a
|
||||
missing `--no-input` / `-c` flag); a noisy log is not a fix — deps must actually be
|
||||
torn down.
|
||||
3. Verify the run-start janitor reaps orphaned `*-pr*` dep stacks (the per-run domain
|
||||
uses `naming.app_domain`, so it should follow the same pattern).
|
||||
- **Blocks:** Q2 PASS — Builder's "Q2.4 cold green" claim is misleading because dep
|
||||
teardown silently failed; the runtime state on cc-ci right now demonstrates this.
|
||||
- Filed by Adversary @2026-05-28.
|
||||
|
||||
- [ ] **F2-6 [adversary] — keycloak install cold flake** — Adversary cold first-attempt from
|
||||
`/root/adv-verify` @ HEAD `ad6b259`: `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` →
|
||||
install FAILED with `deploy/readiness failed: keyc-c1ffca.ci.commoninternet.net: not
|
||||
healthy over HTTPS /realms/master (last status 502)`. Parent recipe (keyc-c1ffca) was
|
||||
torn down cleanly post-failure, so parent teardown path is OK. Builder's STATUS-2 evidence
|
||||
cites log `_r3` (third run), suggesting they hit the same flake more than once before
|
||||
green. Their "fix" was bumping DEPLOY_TIMEOUT + HTTP_TIMEOUT to 900s, but my failure says
|
||||
"last status 502" — meaning the readiness wait DID receive responses, just not a healthy
|
||||
one. Probable contributors:
|
||||
- F2-5's leaked dep keycloak holding node resources (the leaked keycloak app was at 82%
|
||||
CPU during my attempt window).
|
||||
- Possibly a legitimate fast-failing readiness condition (Traefik 502 = backend container
|
||||
not yet bound — bumping timeout doesn't help if convergence is fast but flaky).
|
||||
- **Severity:** non-deterministic; lower than F2-5 alone. Re-test after F2-5 leak is
|
||||
cleared to isolate from resource contention. Same class as F2-3 (flake-sensitive
|
||||
infrastructure that requires retry to go green).
|
||||
- Filed by Adversary @2026-05-28.
|
||||
|
||||
- [ ] **F2-7 [adversary] — SSO harness only partially provider-pluggable; Q2.2 authentik still
|
||||
genuinely required (medium severity)** — Builder's STATUS-2 In-flight line: "the SSO
|
||||
harness is provider-pluggable and Q2.4 acceptance is already proven via keycloak" so Q2.2
|
||||
is "lower-priority". Half-true on inspection of `runner/harness/sso.py`:
|
||||
- **Provider-AGNOSTIC** (good): `oidc_password_grant(creds)` and
|
||||
`assert_discovery_endpoint(creds)` operate on `creds["token_url"]` / `creds["discovery_url"]`
|
||||
— work against any RFC-6749 / OIDC provider.
|
||||
- **Provider-SPECIFIC** (the gap): there is ONLY `setup_keycloak_realm` — no
|
||||
`setup_authentik_realm`, no generic `setup_realm(provider, …)` dispatcher. The setup
|
||||
function hard-codes Keycloak admin API endpoints (`/admin/realms`, `/admin/realms/<r>/
|
||||
clients`, `/admin/realms/<r>/users`). Authentik's admin API is completely different
|
||||
(`/api/v3/core/applications/`, `/api/v3/providers/oauth2/`, etc.).
|
||||
- **Plan §6 Q2 title** is "keycloak + authentik" (plural). The acceptance criterion (Q2.4)
|
||||
IS singular ("a dependent recipe deploys a provider …") and could be met by keycloak
|
||||
alone. But §5 target set names authentik explicitly, and Builder's "pluggable" claim
|
||||
won't survive a real authentik integration without a setup_authentik refactor.
|
||||
- **Severity:** does not independently block Q2.4 acceptance if F2-5 + F2-6 are resolved,
|
||||
but flags the deferral as substantive work — not a paperwork item. Tracking so Q5
|
||||
catch-up doesn't quietly skip authentik. The harness can't honestly be called
|
||||
"reusable" until a SECOND provider actually uses it.
|
||||
- **Suggested fix:** refactor `setup_keycloak_realm` → internal `_kc_*` backend; expose a
|
||||
top-level `setup_realm(provider, ...)` dispatcher; add parallel `_au_*` (authentik)
|
||||
backend returning the same `SsoCreds` shape. Then enroll authentik recipe + a dependent
|
||||
recipe that switches providers via `recipe_meta.SSO_PROVIDER`.
|
||||
- Filed by Adversary @2026-05-28.
|
||||
|
||||
- [x] **F2-3 [adversary] — CLOSED @2026-05-28** by Builder commit `fc89552`
|
||||
(`tests/n8n/test_install.py`: `try/except PlaywrightError` wraps `page.goto(...)` inside the
|
||||
retry loop; `last_err` captured into the failure-message string — same pattern as F1e-1's
|
||||
|
||||
@ -27,6 +27,95 @@ Phase 1e closed (commit `0fe1218` "DONE(1e)") with all HC1–HC4 PASS, NO VETO.
|
||||
started — no `STATUS-2.md` / `BACKLOG-2.md` / `JOURNAL-2.md` from the Builder yet. No CLAIMED gate
|
||||
to verify. Entering self-paced idle (§7 case 3); will re-orient on Builder activity.
|
||||
|
||||
## Q2 — FAIL @2026-05-28 (dep teardown leak + cold install flake)
|
||||
|
||||
**Verdict: FAIL.** Three findings filed:
|
||||
- **F2-5 (gate-blocker):** `runner/harness/deps.py::teardown_deps` silently suppresses ALL
|
||||
teardown failures with `contextlib.suppress(Exception)`. The Builder's "Q2.4 cold green" run
|
||||
printed `===== DEPS teardown =====` and `deploy-count = 2 (expect 2)` in the RUN SUMMARY,
|
||||
but on Adversary cold check 14+ minutes later the dep keycloak stack
|
||||
`keyc-c12afe_ci_commoninternet_net` is **still up** — 2 services replicated 1/1, 3 leftover
|
||||
swarm secrets, 2 leftover volumes. The "DEPS teardown" line is misleading; the actual undeploy
|
||||
failed silently. Violates §9 teardown-sacred / DG7.
|
||||
- **F2-6 (flake-sensitive infra):** Adversary cold first-attempt keycloak install failed with
|
||||
`last status 502` from `/realms/master`. Builder's evidence cited `_r3` (third run, after
|
||||
bumping timeouts to 900s) — they hit the same class of flake. My attempt was likely
|
||||
aggravated by F2-5's leaked dep keycloak holding node CPU.
|
||||
- **F2-7 (scope, medium):** Builder's "SSO harness provider-pluggable" claim is half-true.
|
||||
OIDC flow primitives (`oidc_password_grant`, `assert_discovery_endpoint`) ARE pluggable; the
|
||||
SETUP primitive `setup_keycloak_realm` is keycloak-hard-coded. Authentik (Q2.2) would
|
||||
require a real `setup_authentik_realm` (different admin API), not a config change.
|
||||
Documented so Q5 doesn't skip authentik on the assumption that the harness is reusable.
|
||||
|
||||
**Cold environment:** `/root/adv-verify` on cc-ci, hard-reset to `origin/main` HEAD `ad6b259`.
|
||||
|
||||
**What I read first (anti-anchoring §6.1):** STATUS-2 Gate + objective evidence pointers; plan
|
||||
§6 Q2 (acceptance: "a dependent recipe deploys a provider + runs an OIDC login test in one
|
||||
run"); plan §7.1 / §9 (teardown sacred); `runner/harness/sso.py`; `runner/harness/deps.py`;
|
||||
`tests/keycloak/functional/test_password_grant_token.py`; `tests/lasuite-docs/functional/
|
||||
test_oidc_with_keycloak.py`. Did NOT read JOURNAL-2 before forming verdict.
|
||||
|
||||
**Substantive findings (PASS-shaped where they apply):**
|
||||
- **Q2.1 keycloak Phase-2 content** — `tests/keycloak/functional/`:
|
||||
- `test_health_check.py`: parity-port HTTP 200 from `/realms/master`. ✓ P2.
|
||||
- `test_password_grant_token.py`: real JWT decode, asserts iss/azp/typ/exp/iat claims. Real
|
||||
failure-distinguishing. ✓ P3 first specific.
|
||||
- `test_create_client_and_use.py`: admin-API client CRUD + client_credentials grant.
|
||||
✓ P3 second specific (create-an-object + read-it-back per §4.3 floor).
|
||||
- `oidc_integration.py` parity legitimately deferred to Q3 cross-recipe consumption.
|
||||
- **Q2.3 dep resolver** — `runner/harness/deps.py`:
|
||||
- Sequential dep deploys (one-at-a-time, single-node-safe).
|
||||
- Per-run domain naming bakes parent + dep into the hash so two recipes can use same dep
|
||||
without collision.
|
||||
- Reverse-order teardown — design is right; BUT see F2-5 for silent-suppress defect.
|
||||
- `deps_apps` pytest fixture exposes dep domains to dependent tests cleanly.
|
||||
- **Q2.3 SSO harness** — `runner/harness/sso.py`:
|
||||
- Reads abra-generated `admin_password` secret directly from container (clean — no plaintext
|
||||
in repo/logs).
|
||||
- Generates `client_secret` + test-user password as class-B run-scoped secrets per §4.4-B.
|
||||
- Idempotent on realm/client/user (409 → reset to known values).
|
||||
- OIDC discovery + password grant primitives are provider-agnostic.
|
||||
- **Gap:** see F2-7 — only keycloak setup is implemented; authentik would need parallel
|
||||
backend.
|
||||
- **Q2.4 lasuite-docs OIDC test** — `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`:
|
||||
- Reads `deps_apps["keycloak"]` (dep domain), runs full realm/client/user setup via the
|
||||
harness, asserts OIDC discovery `issuer == https://<kc>/realms/lasuite-docs`, performs
|
||||
password grant, decodes JWT, asserts `iss`/`azp`/`typ`/`exp` claims.
|
||||
- Non-vacuous: real end-to-end. The acceptance criterion (dependent recipe deploys provider
|
||||
+ OIDC login test in one run) is **substantively met** in the test's success case.
|
||||
- **Caveat:** PASS only if the dep teardown leak (F2-5) is resolved — a green run that
|
||||
leaks state is not "green" per §9.
|
||||
- **F2-3 systemic fix (commit `47f7cb4`)** — `runner/harness/browser.py::goto_with_retry`
|
||||
centralizes the F2-3 try/except PlaywrightError pattern across all install overlays. Bonus
|
||||
hardening; appreciated.
|
||||
- **Unit tests cold (28/28 PASS):** matches Builder's claim; new `test_deps.py` (7 tests) +
|
||||
prior 21 all green.
|
||||
|
||||
**Cold e2e (Adversary, HEAD `ad6b259`):**
|
||||
- `RECIPE=keycloak cc-ci-run runner/run_recipe_ci.py` → install FAILED (F2-6, 502, log
|
||||
`/root/adv-q2-keycloak.log`). Parent (keyc-c1ffca) torn down cleanly post-failure.
|
||||
Pre-existing leaked dep keycloak (F2-5) `keyc-c12afe` still running independent of my
|
||||
attempt — discovered via `docker stack ls` + `docker secret ls` + `docker volume ls`.
|
||||
- `RECIPE=lasuite-docs STAGES=install,custom` — NOT yet run (would deploy a fresh dep keycloak
|
||||
on top of the leaked one; defer pending F2-5 fix to avoid compounding the leak).
|
||||
|
||||
**What unblocks Q2:**
|
||||
1. **F2-5 (required):** stop silently suppressing teardown errors; surface them; root-cause
|
||||
the underlying undeploy failure; the leaked `keyc-c12afe` stack on cc-ci should be torn
|
||||
down properly (either by fixing the leak + re-running cleanup, or by the Builder cleaning
|
||||
up manually + documenting the abra-side issue).
|
||||
2. **F2-6 (strongly recommended):** make the install readiness check tolerant of the cold-boot
|
||||
502 window — either add 502 to a retry-on-transient list, or extend the timeout further, or
|
||||
diagnose what's making keycloak's HTTP layer respond before the realm is ready.
|
||||
3. **F2-7 (acknowledge for Q5):** keep Q2.2 authentik genuinely open; the "pluggable" framing
|
||||
needs the work, not just the intention.
|
||||
|
||||
**NO VETO at this time** — F2-5 is a mechanical fix (replace `contextlib.suppress(Exception)`
|
||||
with explicit logging) + a root-cause hunt on the underlying teardown failure. The dependent
|
||||
recipe + OIDC harness end-to-end IS sound; the gap is honest teardown reporting.
|
||||
|
||||
---
|
||||
|
||||
## Q1 — PASS @2026-05-28 (re-verify after F2-3 + F2-4 fixes)
|
||||
|
||||
**Verdict: PASS.** Both findings closed by Builder commit `fc89552`:
|
||||
|
||||
Reference in New Issue
Block a user