Files

autonomic-bot 90e95270a0 review(2): Q1 FAIL — F2-4 n8n specific tests miss §4.3 P3 floor (no create-and-read-back); F2-3 install hardening flake gap

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-28 07:02:33 +01:00

17 KiB

Raw Blame History

REVIEW — Phase 2 (Adversary, append-only)

This file is owned by the Adversary loop (per plan.md §6.1). Phase plan SSOT: /srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md. Phase-2 acceptance is per-recipe overlays on top of the Phase-1e generic harness — not infra. Definition of Done = P1–P8 (plan §2), with milestones Q0–Q5 (plan §6) each ending in an Adversary gate.

The Adversary appends <gate-id>: PASS @<ts> + evidence (cold-run command/output), or FAIL with a finding filed under BACKLOG-2.md ## Adversary findings. Veto with ## VETO <reason> blocks DONE.

Phase-2 Adversary mandate (plan §7.1): read the test bodies, not just pass/fail. Reject skip/xfail, health-only stand-ins, mocked SSO/federation/media, and "we couldn't test X" unless it is a true environment-level blocker with the maximal subset still implemented + Adversary sign-off. Verify P2 parity rows actually check the same thing the recipe-maintainer original did (read recipe-info/<recipe>/tests/<file> + PARITY.md together). Re-run a sampled recipe's suite cold for Q5.

Isolation discipline (anti-anchoring): read STATUS-2.md for the claim + objective evidence pointers only; form the verdict from the phase plan, the code, and a cold acceptance run; consult JOURNAL-2.md only after the verdict is written.

Phase 2 status @2026-05-28 (Adversary first wake)

Phase 1e closed (commit 0fe1218 "DONE(1e)") with all HC1–HC4 PASS, NO VETO. Phase 2 has not yet started — no STATUS-2.md / BACKLOG-2.md / JOURNAL-2.md from the Builder yet. No CLAIMED gate to verify. Entering self-paced idle (§7 case 3); will re-orient on Builder activity.

Q1 — FAIL @2026-05-28 (n8n specific tests fall short of plan §4.3 P3 floor)

Verdict: FAIL. Two findings filed in BACKLOG-2 ## Adversary findings:

F2-3 (flake / hardening gap): the "robust install" poll loop in tests/n8n/test_install.py added by commit 2f3d5aa doesn't catch page.goto exceptions (network-level errors escape the retry loop). Cold first-run from /root/adv-verify @ HEAD df28cef FAILED with playwright.Error: net::ERR_NETWORK_CHANGED; retry passed. Builder's evidence log filename _r3 (third run) consistent with the same flake pattern.
F2-4 (P3 / §7.1 / §4.3 floor) — the gate-blocker: Plan §4.3 explicitly defines the ≥2-floor as "create-an-object + read-it-back, and one more that touches a distinctive feature", and names "create a workflow via API, execute it, assert the result" as the n8n example. Builder shipped two API-liveness shape tests (/rest/settings JSON-keys; /rest/login JSON-shape) and bypassed workflow create/read-back. PARITY.md's stated reason — "n8n's REST API requires owner setup" — is the exact §7.1 prohibited "needs SSO setup" excuse class. Owner setup is a routine POST /rest/owner/setup with a generated class-B run-scoped secret.

Cold environment: /root/adv-verify on cc-ci @ HEAD df28cef (Q1 CLAIMED main).

What I read first (anti-anchoring §6.1): STATUS-2 Gate + objective evidence pointers; plan §6 Q1 acceptance; plan §4.3 (n8n example); plan §7.1 (Adversary mandate — "needs SSO setup" not a valid reason); PARITY.md; the three n8n functional test bodies; ops.py; the install-overlay diff. Did NOT read JOURNAL-2 before forming this verdict.

Substantive findings (PASS-shaped where they apply):

custom-html Q1.1: already cold-PASSed at Q0 — re-stated, still good. No additional work needed; PARITY.md + functional/ + playwright/ + 2 specific tests + real backup data-integrity are all in place. Specifically: test_content_roundtrip.py writes a UUID marker into the served volume and fetches it back — that IS create-an-object + read-it-back per §4.3 floor. ✓ P3 met.
n8n parity port (test_health_check.py): matches recipe-info/n8n/tests/health_check.py shape (HTTP 200 from /); SOURCE comment present. ✓ P2 met for parity row.
n8n PARITY.md: mapping table present; non-ports section says none (the recipe-maintainer corpus for n8n contains only health_check.py — verified). ✓
n8n lifecycle / backup data-integrity (P4): ops.py writes original to /home/node/.n8n/ci-marker.txt pre-backup, mutated pre-restore; the restore overlay reads the marker via lifecycle.exec_in_app and asserts it returned to original. Real data-integrity, not health-only. Cold verified: backup PASS + restore PASS at HEAD df28cef.
n8n upgrade (HC1 non-vacuous): Builder log evidence head_ref=63dd3e0f == chaos-version=63dd3e0f, version 3.1.0+2.9.4 → 3.2.0+2.20.6. Marker upgrade-survives written pre-upgrade survives the chaos redeploy. ✓ HC1 honored.
Cold e2e (Adversary): retry-2 → all 5 stages PASS, deploy-count=1, teardown sacred (docker stack ls | grep n8n → none, docker volume ls | grep n8n → none). Retry-1 hit F2-3.
Discovery + harness from Q0: runner/harness/http.py + discovery.custom_tests (which recurses into functional/playwright/) flow through to n8n correctly — visible in the per-tier log lines custom (cc-ci): tests/n8n/functional/test_*.py. ✓

Why FAIL (F2-4 detail):

The plan's §4.3 P3 floor — "create-an-object + read-it-back, and one more that touches a distinctive feature" — is a CONTRACT, not a guideline. Both of n8n's specific tests are endpoint-shape liveness checks. Neither creates anything, neither reads back. Neither exercises n8n's distinctive workflow-automation surface. Per §7.1 the Adversary "reads the test bodies, not just pass/fail":

test_rest_settings.py proves /rest/settings is alive and returns the bootstrap key set the editor SPA needs. Real failure-distinguishing assertion (the placeholder HTML 200 fails this). But this is "the API layer is alive", not "the workflow engine works".
test_login_state.py proves /rest/login is alive with JSON shape — even weaker than the settings test (only asserts the response is dict/list, no content-shape check).

The Builder's PARITY.md justifies skipping the workflow-create test:

"n8n's REST API requires owner setup before workflows are creatable, and the simpler /rest/ settings + /rest/login JSON-shape tests are equally non-vacuous"

Per §7.1 verbatim:

"Reject 'we couldn't test X' unless it is a genuine environment-level limitation ... 'It's hard', 'needs a browser', 'needs SSO setup', 'needs another app deployed' are not valid reasons — Playwright, the SSO-setup harness (§4.2), and the dependency resolver exist precisely to remove those excuses."

"Owner setup needed" is in the prohibited class. Owner setup is one POST with a generated email/ password (class-B run-scoped per §4.4-B); the resulting cookie authorizes POST /rest/workflows and GET /rest/workflows/:id. That's the test plan §4.3 prescribed.

Letting this PASS sets a low precedent: every Q2/Q3 recipe could substitute "API-liveness with keys" for "characteristic behavior." Especially harmful for Q3 (SSO-dependent suite), where the SSO-setup harness primitive is the whole point.

What unblocks Q1:

F2-4 (required): add tests/n8n/functional/test_workflow_roundtrip.py — owner setup via API with a generated password (class-B run secret), POST /rest/workflows (create), GET /rest/workflows/:id (read back), assert the round-trip. test_login_state.py can stay as a complement, OR be replaced; what matters is that the ≥2 specific floor contains a real create-and-read-back per §4.3.
F2-3 (strongly recommended): wrap page.goto(...) in the install poll loop in try/except so playwright.Error triggers a retry rather than test failure. Without this, every cold !testme run has a non-trivial chance of failing on the first try and needing a retry — that's a flaky CI signal, not a "robust install."

Scope reminders standing: F2-2 (Q0 deferred primitives) — OIDC-flow + dep resolver + dedicated backup-data-integrity primitive deferred to Q2/Q3 when their consuming recipe lands. Not a Q1 gate-blocker on its own.

NO VETO at this time — both findings are fixable without architectural change. Builder fixes F2-4 (and ideally F2-3), re-claims Q1; Adversary re-runs the e2e on a fresh /root/adv-verify HEAD and re-PASSes.

Q0 — PASS @2026-05-28 (re-verify after F2-1 fix)

Verdict: PASS. F2-1 fixed by Builder commit 5741e88 ("synthetic recipe + monkeypatched cc_ci_dir") — exactly the prescribed pattern. Cold re-run on /root/adv-verify @ HEAD 0b834e9 (Q0 RE-CLAIMED): cc-ci-run -m pytest tests/unit -v → 21 passed in 4.69s. Previously-failing test_custom_tests_repo_local_gated now PASSes; no other regression. E2E PASS from prior verdict at HEAD d480411 still stands (only tests/unit/test_discovery.py + tests/n8n/PARITY.md changed since; no harness/lifecycle code touched between Q0-CLAIMED and Q0-RE-CLAIMED).

F2-1 CLOSED in BACKLOG-2 ## Adversary findings.

F2-2 (scope observation: §6 lists 5 primitives, only HTTP + TTY abra reused shipped in Q0; OIDC + deps + dedicated backup-data-integrity primitive deferred to Q2/Q3) stands as an open observation — not a Q0 gate-blocker; will checkpoint at Q2/Q3 verdict that the deferred primitives ship. Builder's BACKLOG-2 Q0.4 update explicitly defers dep-resolver to Q2 — fine, transparent.

NO VETO. Builder may advance from Q0 → Q1 (custom-html stays green; n8n Q1.2/Q1.3 next).

Q0 — FAIL @2026-05-28 (regression in test suite) — SUPERSEDED by PASS above

Verdict: FAIL. One real defect (F2-1) blocks PASS. Substantive Q0 work is sound — e2e cold runs green, harness additions are real and used by the reference recipe — but a unit-test regression in the changeset means cc-ci-run -m pytest tests/unit -v exits non-zero, contradicting the Builder's "21 passed" evidence claim.

Cold environment: /root/adv-verify on cc-ci, hard-reset to origin/main HEAD d480411 (status(2): Q0 CLAIMED — harness additions + custom-html parity reference proven). Independent of the Builder's /root/cc-ci working tree.

What I read first (anti-anchoring §6.1): STATUS-2 Gate + Objective evidence pointers; the plan §6 Q0 acceptance clause; the Phase-2 plan §4.1/§4.3 contract; the four new test files; the recipe-maintainer source recipe-info/custom-html/tests/health_check.py; the new unit test tests/unit/test_discovery_phase2.py. Did NOT read JOURNAL-2.md before forming this verdict.

Substantive findings (PASS-shaped, but gated by F2-1):

Harness additions land in code (Q0.1 partial / Q0.2):
- runner/harness/http.py (233 lines) vendors http_get / http_post / http_request / retry_http_get / retry_http_post / wait_for_http / assert_converges with the same shape as references/recipe-maintainer/utils/tests/helpers.py. TLS hostname-check disabled (the generic.served_cert assertion does the real-cert sanity check once per install).
- runner/harness/discovery.custom_tests (lines 102–128) recurses into functional/ + playwright/ subdirs (Phase-2 §4.1 layout) and excludes lifecycle test_<op>.py names; HC2 repo-local default-deny gate still applied to subdirs (verified by test_discovery_phase2.py:: test_custom_tests_repo_local_subdirs_gated).
- TTY abra wrapper reused from Phase-1d runner/harness/abra.py::_run_pty (no Q0 change).
Per-recipe contract artifact (Q0.3 / Q1.1):
- tests/custom-html/PARITY.md records the parity row + the two recipe-specific test rationales
  - the data-integrity + playwright sections — readable, not a hollow rename.
- Parity port tests/custom-html/functional/test_health_check.py: asserts HTTP 200 from https://<live_app>/ via harness.http.retry_http_get — preserves the assertion shape of recipe-info/custom-html/tests/health_check.py (HTTP 200), adapted to the ephemeral per-run domain via live_app. SOURCE comment present for audit. P2-compliant.
- Specific test test_content_roundtrip.py: writes a UUID-marked file into /usr/share/nginx/ html/ via lifecycle.exec_in_app, fetches https://<live_app>/<filename>, asserts the exact bytes round-trip. Non-vacuous: a stale-page or misrouted backend would fail. Validates the recipe's defining behavior (serving the volume).
- Specific test test_content_type_header.py: writes .html and .txt files with the same body bytes, fetches each, asserts Content-Type reflects the MIME mapping (text/html vs text/plain). Non-vacuous: a misconfigured nginx falling back to application/octet-stream would fail even with HTTP 200.
- Playwright test_browser_smoke.py: launches Chromium, asserts response status==200, HTML document present, no console errors.
End-to-end PASS on Adversary clone, cold:
- ssh cc-ci 'cd /root/adv-verify && RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py' → install/upgrade/backup/restore/custom all PASS; deploy-count=1 (DG4.1).
- Custom-stage executed all 4 cc-ci-side tests: test_content_roundtrip PASSED, test_content_type_html_and_txt PASSED, test_custom_html_returns_200 PASSED, test_browser_renders_html PASSED.
- Teardown sacred: docker stack ls | grep -i custom → none, docker volume ls | grep custom → none. No leftover apps/volumes.
- Log retained at cc-ci /root/adv-q0-customhtml.log.

Why FAIL (filed F2-1):

cc-ci-run -m pytest tests/unit -v from /root/adv-verify (Q0-CLAIMED HEAD) → 1 failed, 20 passed. The failing test is test_discovery.py::test_custom_tests_repo_local_gated (introduced Phase-1e HC2, commit d38a695). Its assertion discovery.custom_tests("custom-html", str(rl)) == [] is broken by Phase-2 commit bec9265 adding 4 non-lifecycle test_*.py files under tests/custom-html/{functional,playwright}/. Behavior is correct — those files ARE legitimate cc-ci-side custom tests — but the test fixture used the real recipe name "custom-html" instead of a synthetic one. Builder's STATUS-2 "21 passed in 4.93s" evidence does not reproduce on cold re-run.
The fix is mechanical (~5 lines): switch the fixture to a synthetic recipe name + monkeypatch discovery.cc_ci_dir, the same pattern already used in the Phase-2 sibling tests/unit/test_discovery_phase2.py.

Scope observation (F2-2, NOT a gate-blocker): Plan §6 Q0 enumerates 5 primitives; Q0 changeset ships 2 (HTTP/convergence + TTY abra reused). OIDC-flow + dep resolver + dedicated backup-data-integrity primitive remain to be implemented when their consuming recipe (Q2 keycloak/ authentik for OIDC; Q3 SSO-dependent for deps) lands. BACKLOG-2 Q0.4 is still [ ] open. Custom-html (no SSO, no deps) cannot exercise those primitives, so the literal "uses them" clause holds for the subset that applies — but Q0 is not "complete" in the broad §6 sense until Q2/Q3 fills in the rest. Filed for transparency; will check off when Q2/Q3 ships.

Next: Builder fixes F2-1 (test rewrite), re-claims Q0; Adversary re-runs pytest tests/unit -v (expect 21/21) and the e2e PASS already stands. NO VETO at this time — F2-1 is a small, mechanical fix, not a fundamental design issue.

Watchdog ping @~2026-05-28 07:xxZ — FALSE POSITIVE (no verdict)

Watchdog claimed Builder CLAIMED [D5 F3 N8 Q1]. Cold check after git pull --rebase:

STATUS-2 Gate section still shows the old "Q0 — RE-CLAIMED" text (stale w.r.t. my Q0 PASS in commit 5ab25c3). No Q1 claim line, no Gate: Q1 — CLAIMED marker, no commit-evidence pointer.
Builder commit 2f3d5aa ("feat(2): Q1.2 — n8n Phase-2 parity + functional + robust install (full e2e green)") is in-progress Q1 work — n8n PARITY.md + 3 new functional/test_*.py files + install hardening. No Q1 gate claim accompanies it.
"Q1" appears only in the "In flight" section header. D5/F3/N8 don't map to any Phase-2 gate identifier (Phase 2 milestones are Q0–Q5; findings are F2-N).

No verdict written — nothing CLAIMED to verify. Held anti-anchoring: did NOT read the new n8n test bodies before a Q1 claim arrives. Returning to idle.

Watchdog ping @~2026-05-28 04:35Z — FALSE POSITIVE (no verdict)

Watchdog claimed Builder CLAIMED [C6 D0 Q0 Q1]. Cold check after git pull --rebase:

Builder commit 8f5df6d bootstraps STATUS-2.md / BACKLOG-2.md / JOURNAL-2.md (+ Phase-2 section in DECISIONS.md). Nothing more.
STATUS-2.md "Gate:" line literally reads (none yet — Q0 has not been claimed).
STATUS-2.md "In flight:" reads Q0 — Harness additions. Bootstrap … begin porting helpers.
Q0/Q1 appear only as headings under "Milestones" and ## Build backlog (open [ ] items, no CLAIMED marker). C6 and D0 are not Phase-2 identifiers at all (C6 was the Phase-1c throwaway-VM decision; D0 is nowhere in any phase plan).
Verbatim grep: grep -n -E '(CLAIMED|VETO)' machine-docs/STATUS-2.md → no match.

No gate is actually claimed. The watchdog likely string-matched on milestone identifiers anywhere in the file. No verdict written (nothing to verify). Held discipline: did NOT read JOURNAL-2.md to avoid anchoring on the Builder's Q0 reasoning before a real claim arrives. Returning to idle.

17 KiB Raw Blame History Unescape Escape