Files
cc-ci/REVIEW-1b.md

134 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — Phase 1b (review & lint pass) — Adversary ledger
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1b-review-lint.md`
**Loop state for THIS phase:** STATUS-1b / BACKLOG-1b / JOURNAL-1b (Builder) · **REVIEW-1b (Adversary, this file)** · DECISIONS.md shared.
Phase-1 STATUS.md/BACKLOG.md/REVIEW.md and the Phase-1c `*-1c.md` files are HISTORY — not this phase's state.
This phase the Adversary is **also the white-box reviewer** (§3 checklist), so this ledger holds both
white-box review findings and the eventual cold RL3 re-verification of D1D10.
DoD I must independently confirm (RL1 lint-in-CI-green · RL2 §3 checklist run, blocking fixed · **RL3
full cold D1D10 re-verify — the final gate** · RL4 docs). Order per §2: tooling → review fixes → *then*
RL3. **Cardinal rule:** never weaken a test to satisfy a lint/review nit; RL3 must confirm cleanup
softened/skipped/regressed nothing.
---
## Phase-1b orientation @2026-05-27 (Adversary cold start)
- Pulled clean; Phase 1c is signed-off DONE (commit 6d2bc3d). Phase 1b kicked off by operator (manual transition).
- Builder has **not yet seeded** STATUS-1b/BACKLOG-1b/JOURNAL-1b and has not claimed W0. No gate pending.
- I began the independent white-box §3 review immediately (it's my role this phase and needs no Builder gate).
## White-box §3 prep pass #1 @2026-05-27 — over post-1c codebase (PRE-cleanup baseline; advisory until RL3)
Recording the baseline state *before* any W0/W1 cleanup, so I can later confirm the cleanup regressed nothing.
- **Tests are real** — PASS (provisional). Swept all 6 recipe suites (custom-html, lasuite-docs, keycloak,
matrix-synapse, n8n, cryptpad) × install/upgrade/backup + conftest + runner/harness. No
`@pytest.mark.skip/xfail/skipif`, no commented-out asserts, no tautologies. Install tests assert real
app content (matrix: parses `versions` JSON non-empty; keycloak: admin DOM; others: Playwright body).
Upgrade tests deploy v(n-1) → write marker → upgrade → assert exact marker survives. Backup tests
establish+verify state → backup → mutate+verify → restore → assert exact pre-mutation state (keycloak
deletes a realm). **Watch-item (to re-check black-box at RL3):** every upgrade test has a *conditional*
`pytest.skip()` when no previous published version exists (e.g. custom-html test_upgrade.py:17-18). Valid
by design, but if it ALWAYS skips, the upgrade stage would be silently fake — RL3 must confirm the
upgrade stage actually RUNS (prev version found) for ≥1 recipe, not just skips. (1c E2E exercised this.)
- **Server state Nix-declared & idempotent** — PASS (provisional). No `.bootstrapped`/run-once sentinels in
modules/ or scripts/ (grep clean). Convergence/oneshot pattern per §9 to be re-read fully in pass #2.
- **No footguns / sleep** — PASS (provisional). All `time.sleep()` in runner/harness/lifecycle.py (147,157,
212,238) + bridge.py (280) are **poll-loop intervals inside `while time.time() < deadline:` loops**, not
bare readiness waits. `wait_healthy` polls converge-then-HTTP with timeouts. Teardown (lifecycle.py:215)
is correctly ordered (undeploy → `docker stack rm` fallback → volumes/secrets while .env exists → drop
.env last), retries volume removal, and **verifies residual is empty (raises TeardownError otherwise)**.
- **No secrets in code/committed files** — PASS (provisional). Grep for inline passwords/tokens/private-key
blocks across *.py/*.nix/*.sh/*.yml clean (only env/file references + generators). Full leak re-verify
(incl. published logs + dashboard, and generated app passwords) belongs to RL3 D6.
Still owed in white-box pass #2 (after I read the rest): **harness DRY** (recipe quirks in shared harness,
not per-recipe copy-paste), **log redaction real** (bridge/dashboard/log pipeline), **architecture matches
plan** (layout/§3, poll-primary trigger §4.1, traefik-is-coop-cloud-recipe §4.2; drift → DECISIONS.md).
## W0 (RL1 — lint/format tooling + green) : **PASS** @2026-05-27 (Adversary cold)
Gate claimed in STATUS-1b. Acceptance: clean checkout → `nix develop .#lint --command bash
scripts/lint.sh``lint: PASS`; lint stage wired in `.drone.yml` push pipeline. **Verified cold,
independently** (no nix on sandbox; ran on cc-ci over a *pristine* tree, not the Builder's working copy):
- **Cold checkout = exact reviewed SHA.** `git archive 233939a` (= my `origin/main` HEAD) piped to
cc-ci → `/tmp/ccci-cold` (clean tree, no untracked/cached state, secrets submodule empty as lint
excludes it). Not cloned from `/root/cc-ci` (that's a non-git plain copy) — archived from my own clone.
- **Lint PASS cold.** `HOME=/root nix develop .#lint --command bash scripts/lint.sh` → **exit 0,
`lint: PASS`.** All 8 linters ran clean: nixpkgs-fmt (0/14 reformat), statix, deadnix, ruff format
(32 files), ruff check (all passed), shfmt, shellcheck, yamllint.
- **Stage real, not rigged.** `scripts/lint.sh` genuinely invokes each linter in check mode and
accumulates a `fail` flag → `exit "$fail"` (correct `set -uo pipefail`, no `-e`, so all run). The
`.drone.yml` `self-test` push pipeline runs the *exact* command `nix develop .#lint --command bash
scripts/lint.sh` and FAILs the build on non-zero. Toolchain pinned from nixpkgs in `flake.nix`
(`devShells.lint`), so CI == local.
- **Gate has TEETH (break-it probe).** Injected violations into the cold tree (a `.py` with
`import os,sys` + `x=1+2`, and a mis-formatted `.nix`) → re-ran lint → **exit 1, `lint: FAIL`**
(ruff E401/I001/F401 + nixpkgs-fmt). So the stage is not vacuously green.
Verdict: **W0 PASS.** Builder may proceed to W1.
Advisory (not W0-blocking; re-confirm at RL3): Builder notes the Gitea→Drone *push* webhook is flaky
(§4.1), so the lint stage may not auto-fire as a real Drone build on every push — RL1's intent
("future commits stay clean") depends on that path actually firing. The stage IS wired and proven
green via its exact command; I'll confirm a real push triggers the Drone lint build when I re-verify
M2/D-gates at RL3 (it overlaps). Not filing a finding now — bounded phase, acceptance-as-stated is met.
## White-box §3 pass #2 @2026-05-27 (Adversary, post-W0 formatted code) — RL2 input
Remaining §3 checklist items. **No blocking findings.**
- **Harness is DRY** — PASS. Recipe quirks live in shared harness + per-recipe *declarative* metadata
(`tests/<recipe>/recipe_meta.py`: HEALTH_PATH/HEALTH_OK/timeouts/EXTRA_ENV), consumed uniformly by
`tests/conftest.py` (`_recipe_meta`, `deployed`/`deployed_app` fixtures) and
`runner/harness/lifecycle.py` (`_recipe_extra_env`). **No `if recipe == "..."` branches in the shared
harness** (the M6.5 no-surgery rule holds). Recipe-specific logic is isolated to that recipe's dir
(e.g. keycloak `kc_admin.py`, cryptpad's derived SANDBOX_DOMAIN). Only smell: the ~6-8-line `old_app`
upgrade fixture is copy-pasted across recipes — thin boilerplate over shared metadata; **advisory**,
not a violation (factoring it would just add another per-recipe injection point). → IDEAS, not blocking.
- **Architecture matches plan** — PASS. §4.1 trigger is **poll-primary** (`bridge/bridge.py` `poll_loop`
runs unconditionally every ≤60s; webhook is optional + dedup'd by comment id; exact trimmed `!testme`;
commenter-auth via read-level `GET /orgs/{owner}/members/{user}` 204=allow, fail-closed). §4.2 Traefik
is the **real coop-cloud/traefik recipe via abra** (`modules/proxy.nix`: `abra recipe fetch/app new
traefik`, `WILDCARDS_ENABLED=1`, `compose.wildcard.yml`, `LETS_ENCRYPT_ENV=""` → no ACME, cert as
`ssl_cert`/`ssl_key` swarm secrets) — no hand-rolled traefik.nix. §3 layout matches.
- **Server state Nix-declared & idempotent** — PASS. `modules/proxy.nix` `deploy-proxy` is
`Type=oneshot`+`RemainAfterExit`, re-runs every activation and converges (insert secret only if
absent, deploy). No `.bootstrapped`/run-once sentinels anywhere (grep clean, pass #1). Leans on 1c's
already-proven D8 (byte-identical closure + live throwaway rebuild, no manual post-step).
- **Log redaction is real** — PASS for infra secrets; **one advisory gap to verify behaviorally at
RL3/D6.** `runner/run_recipe_ci.py` `_redact_values()` reads `/run/secrets/*` (≥8-char values) and
`run_stage_redacted()` masks them in live-streamed stage output (sorted longest-first → no partial
leak). **But class-B *generated app passwords* are NOT under `/run/secrets/*`, so they are NOT in the
`_REDACT` list** — their non-leak rests entirely on the "harness never prints them / abra doesn't echo
generated ones" assumption (code comment, run_recipe_ci.py:59-60). Also: the runner's *own* stdout
(the `cc-ci-run …` Drone step) bypasses `run_stage_redacted`. This is exactly what my behavioral D6
leak test must catch at RL3 (grep published Drone logs **and** the dashboard for a known generated app
password). Phase-1 D6 passed that test once; recording the white-box shape so RL3 re-checks it, not a
new blocking finding. → **WATCH-ITEM for RL3/D6.**
- **Readability / docs accuracy** — advisory; defer to RL4 (docs) + the ruff/lint pass already covers
dead code / style deterministically.
**Net of §3 white-box review (RL2 input): no blocking findings; 2 advisories** (old_app copy-paste →
IDEAS; app-secret redaction → RL3/D6 watch-item). I expect Builder's W1 to be light. I have NOT filed
`[adversary]` BACKLOG items since nothing is blocking — will file if W1/RL3 surfaces a real defect.
## Operator added RL5 + RL6 (plan §7, 2026-05-27) — both BLOCKING for 1b DONE. Noted; verification plan:
- **RL5** (Builder moves; Adversary verifies cold): `modules/``nix/modules/`, `hosts/``nix/hosts/`;
`flake.nix`/`flake.lock` STAY at root so build ref `#cc-ci` is unchanged; fix flake internal paths +
`.drone.yml`/scripts refs; update `docs/architecture.md`. **Verification folds into RL3:** a fresh
recursive clone must still rebuild **byte-identical to the running system** (toplevel store hash WILL
change — expected; what must hold is build==running + reproducible). I'll re-confirm cold at RL3.
- **RL6** (coordinated near-END-of-1b): move `STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md`
`machine-docs/`; **README.md stays at root** (operator decision — human readme, not protocol). Update
ALL refs (cc-ci-plan plans, AGENTS.md, .drone.yml, scripts). I verify refs updated + nothing broken.
**CAVEAT affecting ME:** the watchdog (`launch.sh`) reads `STATUS-<id>.md`/`REVIEW-<id>.md` at repo
ROOT for handoffs/transitions — moving breaks it until launch.sh updated + watchdog restarted IN
LOCKSTEP (orchestrator handles that). So **I keep writing REVIEW-1b.md at root until the coordinated
cutover**, and at that moment I `git mv` my own REVIEW files (single-writer rule) in lockstep. Will NOT
move them unilaterally or while a phase transition is pending.
## Status: W0 PASS + §3 white-box pass complete (no blockers). New blocking items RL5/RL6 noted.
DoD for 1b is now **RL1RL6** (was RL1RL4). Awaiting Builder gates (W1 review+fixes; RL5 layout move;
then RL3 cold re-verify LAST, now also covering the RL5 byte-identical rebuild). Cardinal rule holds:
cleanup/refactor must not weaken/skip/regress any test — incl. the conditional-upgrade-skip watch-item.