cc-ci/REVIEW.md
2026-05-27 06:13:51 +01:00

# REVIEW — cc-ci Adversary (append-only)
This file is owned by the **Adversary** loop (§6.1). The Builder seeds this stub at bootstrap and
does not edit it afterward. Adversary appends milestone/D-item verdicts (`<id>: PASS @<ts>` +
evidence, or `FAIL` + a finding in `BACKLOG.md ## Adversary findings`), and may write `## VETO`.
<!-- Adversary verdicts below -->
## M0 — Foundations: PASS @2026-05-26T21:35Z
Verified cold (fresh shell, own clone `/srv/cc-ci/cc-ci-adv`, isolated host build dir
`/root/cc-ci-advverify`, no reuse of Builder's `/root/cc-ci`).
Acceptance — "`systemctl is-system-running` healthy after a rebuild from the repo" + Builder's
sops claim:
- **Repo rebuilds cc-ci:** synced M0 commit `deb4a0f` (git-archive, no .git) to host, ran
`nixos-rebuild build --flake .#cc-ci` → `BUILD EXIT 0`, produced
`…-nixos-system-nixos-24.11.20250630.50ab793`. Current HEAD also builds clean.
- **System health:** `systemctl is-system-running` → `running`; `systemctl --failed` → 0 units.
- **sops decrypt:** `/run/secrets/test_secret` present, mode `400 root:root`, 41 bytes, value
begins `cc-c…` (matches claimed generated `cc-ci-m0-…`). `secrets/secrets.yaml` is genuinely
encrypted (2× `ENC[…]` + sops metadata block).
- **D6 leak probe (early):** the decrypted plaintext value appears **0 times** across *all* git
history (`git grep -F` over `git rev-list --all`) and 0× in plaintext in `secrets.yaml`. No leak.
Note (not a finding; context for the M1 gate): the *running* system is already ahead of M0 — its
closure includes docker, `unit-swarm-init`, and **traefik** units (`traefik.yml`,
`traefik-stack.yml`, `unit-traefik-deploy`) that are **not yet committed** (HEAD `ab839ae` is
swarm-only, no traefik). Expected mid-M1 churn, but the Traefik config must be committed to the
repo before M1 is claimed or it fails D8 reproducibility — will check at the M1 gate.
## M1 — Swarm + abra target: PASS @2026-05-26T22:20Z
Verified cold from own clone; deployed my **own** probe recipe via abra (not trusting the Builder's
hand-test). Acceptance "a recipe deployed via abra is reachable over HTTPS at
`*.ci.commoninternet.net`, then fully torn down leaving no volumes" + orchestrator's M1 checklist
(a-d).
- **(a) Real coop-cloud/traefik recipe (not hand-rolled):** `docker service ls` →
`traefik_…_app` (`traefik:v3.6.15`) + `…_socket-proxy` (lscr.io socket-proxy) — the canonical
recipe layout, deployed via abra (`scripts/deploy-proxy.sh`). `modules/traefik.nix` is deleted.
- **(b) Wildcard on web-secure + proxy overlay:** static `traefik.yml` has `web-secure: :443`
(web→web-secure 301 redirect, verified live). File provider `/etc/traefik/file-provider.yml`:
`tls.certificates: [{certFile:/run/secrets/ssl_cert, keyFile:/run/secrets/ssl_key}]`; swarm
secrets `…_ssl_cert_v1`/`…_ssl_key_v1` mounted (2909 B / 227 B = the pre-issued cert). My probe
app `advm1probe_…_app` was attached to the `proxy` overlay.
- **E2E (cold deploy):** `abra app new custom-html -D advm1probe.ci.commoninternet.net` (forced
`LETS_ENCRYPT_ENV=""`) → `deploy succeeded 🟢`. Via SOCKS proxy: **HTTP 200**; served cert
`subject: CN=*.ci.commoninternet.net`, SAN-matched, `SSL certificate verify ok`, issuer LE E8 —
i.e. the **pre-issued wildcard**, NOT a per-host ACME cert.
- **(c) No Gandi/DNS token, no ACME credential:** repo (all history) clean; on host the only
gandi/dns-challenge strings are **commented-out** recipe-template options (`#GANDI_…`,
`#SECRET_GANDIV5_…`) holding no value. Active traefik env = `LETS_ENCRYPT_ENV=` (empty),
`WILDCARDS_ENABLED=1`, `compose.wildcard.yml`. `staging`/`production` certResolvers are *defined*
in traefik.yml (stock template) but **referenced by no router**; both acme.json are **0 bytes**;
**0 ACME lines in traefik logs**. No ACME ever fires. (Hardening risk filed — see findings.)
- **(d) Manual renewal documented:** DECISIONS.md — operator re-issues at same paths, then
`abra app secret rm … ssl_cert` + re-insert at bumped version; install.md "Renewed out-of-band;
never ACME here."
- **Teardown:** `abra app undeploy` + `volume remove` → post-teardown services/containers/volumes/
secrets for the probe **all 0**. Also independently confirmed the Builder's `cchtml1` test left 0
runtime resources (only its inert `.env` config file remains, harmless).
Verdict: **M1 PASS.** Not a hard fail on (c) — no token/credential exists and no ACME fires — but
the inert ACME resolvers + test-app default `LETS_ENCRYPT_ENV=production` are a latent hazard that
goes live when the harness deploys apps; filed as `[adversary]` for M4.
<!-- M2 live-trigger probe @2026-05-26T23:30Z: this push should create Drone build #4 -->
## M2 — Drone online: PASS @2026-05-26T23:32Z
Verified cold from own clone. Acceptance: "push to cc-ci triggers a visible green Drone build."
- **Drone server healthy:** `https://drone.ci.commoninternet.net/healthz` → HTTP 200 via gateway.
Exec runner (`drone-runner-exec.service`) active, `polling the remote server capacity=2 type=exec`.
- **Repo wired:** in Drone's DB the `recipe-maintainers/cc-ci` repo is `repo_active=1`,
`repo_config=.drone.yml`. Gitea↔Drone OAuth proven by the in-pipeline `clone` step succeeding
against the private repo (build can't clone without working OAuth/repo token).
- **Push→green, independently triggered:** I pushed my own commit `91a8e8d` (a REVIEW.md change) →
Drone created **build #4**, `build_event=push`, `build_trigger=@hook` (Gitea webhook), and it ran
**`success`**: stage `self-test` exit 0, steps `clone`+`hello` both exit 0. Builds #1-#3 (Builder
commits) likewise all `success` via `@hook`. (My earlier M0/M1 review pushes predate the
`.drone.yml`, so correctly produced no builds.)
- **Visible logs (D7 precondition):** `logs` table holds per-step log blobs for every build; Drone
UI/API serve them. Full D7 UX is M8.
Verdict: **M2 PASS.** No new findings.
## M3 — Comment bridge: PRE-CLAIM PROGRESS (not yet PASS) @2026-05-26T23:48Z
M3 is **Blocked** in STATUS (Gitea not delivering webhooks), so not a gate verdict yet. But the
bridge is deployed and I independently hammered its auth/filter logic — the part I can verify
regardless of the delivery leg (and which survives a pivot to API polling). Probes were live POSTs
to `https://ci.commoninternet.net/hook` via the SOCKS proxy, with HMAC signatures I computed from
the on-host secret (read with root; value never printed/committed):
| probe | expect | got |
|---|---|---|
| no `X-Gitea-Signature` | 401 | **401** |
| bad signature | 401 | **401** |
| valid sig, event=`ping` (not issue_comment) | 204 | **204** |
| valid sig, `!testmexyz` on a real PR | 204 (no trigger) | **204** |
| valid sig, `!testme` but issue is not a PR | 204 | **204** |
| valid sig, `!testme` on PR, action=`edited` | 204 | **204** |
| valid sig, `!testme` on real PR, **non-collaborator** | 403 | **403** |
So: HMAC fail-closed + timing-safe (`compare_digest`, verified before body parse), `!testmexyz`
correctly ignored (exact trimmed match), non-PR ignored, and a non-collaborator is rejected (403;
collaborator status re-checked via Gitea API, not trusted from the signed payload). Source review
of `bridge/bridge.py` found no auth bypass.
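
The ordering probed in the table above (signature first, then event, action, PR, exact trigger match, and finally the collaborator check) can be sketched as follows. This is a minimal illustration, not the code in `bridge/bridge.py`: the payload field names (`action`, `is_pull`, `comment.body`, `sender.login`), the `X-Gitea-Event` header name, the `is_collaborator` callable and the `202` success status are all assumptions made for the example.

```python
import hashlib
import hmac
import json

TRIGGER = "!testme"


def verify_signature(secret: bytes, body: bytes, signature_hex) -> bool:
    """Return True only if signature_hex is the hex HMAC-SHA256 of body."""
    if not signature_hex:
        return False  # missing header: fail closed
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ
    return hmac.compare_digest(expected.encode(), str(signature_hex).encode())


def handle_hook(secret, headers, body, is_collaborator):
    """Hypothetical handler; returns the HTTP status it would send."""
    # 1. Authenticate the raw bytes before any parsing happens.
    if not verify_signature(secret, body, headers.get("X-Gitea-Signature")):
        return 401
    # 2. Ignore every event type except issue comments.
    if headers.get("X-Gitea-Event") != "issue_comment":
        return 204
    payload = json.loads(body)
    # 3. Only newly created comments on pull requests count.
    if payload.get("action") != "created" or not payload.get("is_pull"):
        return 204
    # 4. Exact match after trimming, so "!testmexyz" is ignored.
    if payload.get("comment", {}).get("body", "").strip() != TRIGGER:
        return 204
    # 5. Re-check the sender out of band; never trust the payload for this.
    if not is_collaborator(payload.get("sender", {}).get("login", "")):
        return 403
    return 202  # assumed "accepted" status; the real value is not stated here
```

Passing `is_collaborator` in as a callable keeps the sketch free of network calls; in a live bridge it would wrap the Gitea API lookup.
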
**Blocker independently corroborated (operator-side):** the bridge hook *is* registered + active on
`recipe-maintainers/cc-ci` (id 210, events `[issue_comment]` → `ci.commoninternet.net/hook`), and
the bot is not a Gitea site-admin (`GET /admin/hooks` → 403) nor org owner, so it genuinely cannot
inspect/change Gitea's `[webhook] ALLOWED_HOST_LIST`. Endorse STATUS `## Blocked`: needs operator
allowlisting or the documented poll-the-API fallback.
**Still UNVERIFIED for an M3 PASS:** (1) the positive path — a valid collaborator `!testme` actually
starts a build + posts the PR comment end-to-end; (2) real Gitea→bridge delivery (or the polling
pivot). Will complete both when M3 is claimed.
**Noted for M7 (not a finding yet):** the Drone-managed Gitea webhook (id 209) carries its webhook
secret as a `?secret=` query param in the hook URL (Drone default; admin-only in Gitea, not in cc-ci
git / CI logs / dashboard). Will adjudicate against D6 at M7.
## M4 — Harness + install stage: VERIFICATION IN PROGRESS (no verdict yet) @2026-05-27T00:35Z
M4 is CLAIMED. Code review done; runtime checks so far:
- **A1 CLOSED** (see BACKLOG): harness forces `LETS_ENCRYPT_ENV=""` every deploy; live app
`cust-c95a69` served the wildcard cert, 0 ACME lines, no certresolver.
- **Happy-path teardown works:** a prior run's app `cust-e084bd` was fully torn down (gone) — not
an orphan; earlier ambiguity was a run cycling apps.
- **Two teardown-robustness defects filed (A2, A3):** janitor's `-pr` filter is dead code under the
`cust-<hex>` naming (no crash-orphan reaping); teardown is best-effort/unverified and deletes the
`.env` even on failed undeploy (silent orphan, run still green).
- **Deferred to next idle tick (a Builder harness run is active now; sequential-only):** my own
cold install run (green install + Playwright + clean teardown verification) and the §6 kill-mid-run
probe to test A3 empirically. Verdict (PASS/FAIL) follows that.
## M4 — Harness + install stage: PASS @2026-05-27T01:05Z
Verified by my **own** cold harness run (`RECIPE=custom-html REF=advcold… cc-ci-run
runner/run_recipe_ci.py`, app `cust-cfeb6a`, isolated from a Builder run that happened to run
concurrently as `cust-3c1970` — no collision, distinct domains/volumes/secrets):
- **Install stage green:** `test_install.py` → 2 passed (27s): `test_http_reachable` (HTTPS 200 via
gateway) + `test_playwright_page` (real Chromium loads the live app, status 200, served HTML).
- **Guaranteed teardown:** after the run, `cust-cfeb6a` left **0** services / volumes / secrets /
containers / `.env` — fully clean. Infra (traefik/drone/bridge/backups) untouched.
- A1 closed (no-ACME enforced). **Open robustness findings A2 (dead `-pr` janitor) + A3 (unverified
best-effort teardown)** concern the *crash* path (finalizer-skipped), not this happy-path run;
they don't block M4's literal acceptance but must be resolved before DONE (D2 teardown guarantee).
Kill-mid-run probe to substantiate A2/A3 deferred until the host is idle.
Verdict: **M4 PASS.**
## M5 — Upgrade + backup/restore stages: PASS @2026-05-27T01:05Z
Same cold run, stages 2 and 3 — both genuine end-to-end (no mocks; assertions reviewed in source
and not softened):
- **Upgrade green:** `test_upgrade.py` → 1 passed (41s). Deploys the **previous published version**
(`previous_version` = `recipe_versions[-2]`), writes a marker into the volume-backed html dir,
upgrades to latest (`abra upgrade`), then asserts HTTP 200 **and** the marker survives — a real
version change with data persistence across the volume (`cust-…_content`), not a no-op.
- **Backup/restore green:** `test_backup.py` → 1 passed (37s). Writes `original`, `abra backup`,
mutates to `mutated` (asserted), `abra restore`, then asserts the served content is back to
`original` ("restore did not return the pre-mutation state"). Real backup→mutate→restore cycle
via backup-bot-two.
- Teardown clean (same `cust-cfeb6a` 0-remnant check above covers all three stages — same domain
reused per stage).
Verdict: **M5 PASS.**
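
The shape of the two assertions checked above (marker survives a real version change; restore returns the pre-mutation state) can be illustrated with an in-memory stand-in. This is a sketch only: `FakeApp`, `check_upgrade` and `check_backup_restore` are hypothetical names, and nothing here calls abra or reflects the actual contents of `test_upgrade.py` / `test_backup.py`.

```python
class FakeApp:
    """In-memory stand-in for a deployed app (hypothetical, no abra calls)."""

    def __init__(self, version):
        self.version = version
        self.content = ""      # plays the role of the volume-backed html dir
        self._snapshot = None  # plays the role of the stored backup

    def upgrade(self, version):
        self.version = version  # volume content is left untouched

    def backup(self):
        self._snapshot = self.content

    def restore(self):
        if self._snapshot is None:
            raise RuntimeError("no backup taken")
        self.content = self._snapshot


def previous_version(recipe_versions):
    """Pick the second-newest published version, as in recipe_versions[-2]."""
    if len(recipe_versions) < 2:
        raise ValueError("need at least two published versions to test an upgrade")
    return recipe_versions[-2]


def check_upgrade(app, latest, marker="ci-marker"):
    app.content = marker
    old = app.version
    app.upgrade(latest)
    assert app.version != old, "upgrade was a no-op"
    assert app.content == marker, "marker did not survive the upgrade"


def check_backup_restore(app):
    app.content = "original"
    app.backup()
    app.content = "mutated"
    assert app.content == "mutated"  # prove the mutation really happened
    app.restore()
    assert app.content == "original", "restore did not return the pre-mutation state"
```

Asserting the intermediate `mutated` state matters: without it, a restore that does nothing would pass against a mutation that never landed.
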
## M6 — Recipe-local tests + second recipe: VERIFICATION IN PROGRESS (no verdict yet) @2026-05-27T01:48Z
M6 CLAIMED. Host has been continuously busy (Builder M6.5 ramp), so deploy-based checks are
deferred to an idle window; static + evidence review so far:
- **custom-html 3-stage:** already verified cold by me (see M5 PASS) — green + clean teardown.
- **D4 recipe-local discovery — code genuine:** `run_recipe_ci.snapshot_recipe_tests` copies the
recipe-shipped `tests/` before abra re-checkouts to a version tag, then `run_recipe_local` deploys
the app and runs those tests against the LIVE app via `CCCI_BASE_URL`/`CCCI_APP_DOMAIN`, merged as
a separate stage with guaranteed teardown. Demo branch `recipe-maintainers/custom-html@
ci/d4-recipe-local` confirmed to ship `tests/test_recipe_local.py` (Gitea API). Will run it cold to
confirm the stage executes+passes.
- **keycloak (#2) install — test genuine:** `/realms/master` 200 health + real Playwright admin
console login (waits for the username field). `recipe_meta.py` (HEALTH_PATH/timeouts) confirms D5
"no harness surgery". Empirical keycloak reproduction deferred (heavy deploy; idle window).
- **Filed [adversary] A4** (concurrency): same-recipe concurrent runs share `~/.abra/recipes/<recipe>`
with no isolation/lock/concurrency-cap — a collision vector for the §6 concurrency check; to
confirm empirically.
Pending for idle host: cold D4 run, keycloak reproduce, A2/A3 kill-probe re-test, A4 concurrency test.
## D6/M7 — preliminary leak scan of published Drone logs (PASS so far; M7 not yet claimed) @2026-05-27T02:05Z
Host-safe probe while the host was busy. Pulled Drone's `database.sqlite`, dumped all 42 `logs`
rows (~25.5k chars of published per-step build output), scanned:
- **Known infra secrets — 0 leaks:** webhook HMAC (64), drone token (32), gitea token (40) each
appear **0×** in the logs (exact `grep -F`).
- **No value patterns:** 0 matches for `password|secret|token = <value>`.
- The only long hex/base64 hits are **git commit SHAs** in `git clone/merge` output — benign.
Caveat: current Drone logs are hello-world + self-test; the full M7/D6 test must also cover
app-generated secrets (e.g. keycloak DB passwords) in recipe-run logs AND the dashboard (M8). This
is a clean baseline, not the final D6 verdict. (DB copy was scanned off-box and deleted; no secret
value printed or committed.)
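
A scan of this kind can be sketched as below, assuming the `logs` rows have already been decoded to text. The function name, the regular expression and the result shape are illustrative assumptions, not the commands actually run; the sketch reports only where a hit occurred, never the matched value, so the scan output cannot itself leak a secret.

```python
import re

# Assumed approximation of the "password|secret|token = <value>" pattern.
VALUE_PATTERN = re.compile(r"(password|secret|token)\s*[=:]\s*\S+", re.IGNORECASE)


def scan_logs(log_texts, known_secrets):
    """Scan decoded log texts for exact secret values and key=value patterns.

    log_texts:     list of log strings, one per row
    known_secrets: mapping of label -> secret value (values are never returned)
    Returns (exact_hits, pattern_hits) where exact_hits maps label -> count and
    pattern_hits is a list of (row_index, keyword) pairs.
    """
    exact_hits = {label: 0 for label in known_secrets}
    pattern_hits = []
    for index, text in enumerate(log_texts):
        for label, value in known_secrets.items():
            if value:  # an empty string would match everywhere
                exact_hits[label] += text.count(value)
        for match in VALUE_PATTERN.finditer(text):
            pattern_hits.append((index, match.group(1).lower()))
    return exact_hits, pattern_hits
```

Pattern hits still need manual triage, since a line such as `secret: generated` matches without exposing anything.
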
## M3 — Comment bridge: PASS @2026-05-27T03:13Z
Verified cold against the NEW design (orchestrator change: polling-PRIMARY + org-membership auth;
webhook now optional). Re-reviewed `bridge/bridge.py` (256 lines) — sound — then live-probed the
running bridge + Drone:
- **`!testme` triggers a run ≤60s:** I posted `!testme` (comment 13708) on PR #1 at epoch
1779847690 → bridge `[poll] triggered build 35` → Drone build 35 created at 1779847702 =
**12s** latency. (Build is `failure` only because `RECIPE=cc-ci` has no `tests/cc-ci/`; the
trigger + event=custom recipe-CI pipeline fired correctly — integration is live.)
- **Re-commenting re-runs:** my new comment 13708 → build 35, distinct from the earlier
comment 13705 → build 26. Distinct comment ids each fire once (dedup via `_claim`).
- **Other comments do NOT trigger:** I posted `!testmexyz` → **no** build created, no bridge
trigger log. Exact trimmed match enforced.
- **Auth enforced (org-membership, fail-closed):** `GET /orgs/recipe-maintainers/members/<u>`
autonomic-bot & notplants → 204 (allowed), `definitely-not-a-member-zzz9` → 404 (rejected).
`is_authorized` returns True only on 204/allowlist; anything else (incl. errors) → False.
- **Link back:** bridge posted run-link comment 13706 ("cc-ci: started CI run … → drone…/recip…").
- **Concurrency cap live:** runner `capacity=1` (`DRONE_RUNNER_CAPACITY=1`) + pipeline
`concurrency:limit:1` — recipe-CI builds serialize.
Verdict: **M3 PASS.** (Polling is outbound read+comment only — no repo-admin; webhook optional.)
Note: full bridge→3-stage-recipe-CI E2E on a *real recipe* PR is the Builder's in-flight
integration item / D10 — build 35 shows the pipeline wiring works; green-on-a-real-recipe is M10.
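
The two properties verified above, fail-closed authorization and once-only firing per comment id, can be sketched as follows. This is an assumption-laden illustration rather than the code in `bridge/bridge.py`: `get_membership_status` is a hypothetical callable standing in for the `GET /orgs/recipe-maintainers/members/<u>` request, and `ClaimStore` is a hypothetical in-memory stand-in for whatever `_claim` persists.

```python
def is_authorized(user, allowlist, get_membership_status):
    """Return True only for allowlisted users or a 204 membership response."""
    if user in allowlist:
        return True
    try:
        status = get_membership_status(user)
    except Exception:
        return False  # network or API errors must never grant access
    return status == 204  # 404, 403, 500, None, ... all reject


class ClaimStore:
    """Remember which comment ids have already triggered a build."""

    def __init__(self):
        self._seen = set()

    def claim(self, comment_id):
        """Return True the first time an id is seen, False on every repeat."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return True
```

Because a polling loop sees the same comments on every pass, the claim step is what keeps one `!testme` from starting a build per poll, while a new comment id still re-runs.
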
## D6 — leak scan extended to recipe-CI build logs (still clean) @2026-05-27T04:05Z
Followup to the earlier hello-world scan: scanned the logs of all 7 `event=custom` recipe-CI builds
(~26.7k chars — these ran real `abra app deploy` + `abra app secret generate`, so generated app
secrets *could* surface here). Result: **0** `password|secret = <value>` patterns, **0** "secret
generated/inserted" value lines (abra doesn't echo secret values), and every long hex/base64 hit is
benign — Nix store paths, git SHAs, Drone workspace dir names (`<rand16>/drone/src`), pytest
tracebacks. No app-secret leak in published recipe-run logs. (Full M7/D6 verdict still pending the
dashboard (M8) leak check + final M7 claim.)
## M6 — Recipe-local tests + second recipe: PASS @2026-05-27T04:43Z
Acceptance: "both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged",
plus D4/D5. Verified by a mix of my own cold runs + deep Drone-log corroboration (keycloak's 31-min
deploy made a self-rerun impractical on the contended host, so I read the actual build #39 logs, not
a Builder summary):
- **custom-html 3-stage:** my own cold run (see M5 PASS) — install/upgrade/backup green, 0 orphans.
- **keycloak (#2) full 3-stage — build #39 (event=custom, RECIPE=keycloak, success):** actual log
lines show `PASSED test_realm_endpoint_healthy`, `PASSED test_playwright_admin_login` (install,
510s), `PASSED test_upgrade_preserves_realm` (upgrade, 610s — DB realm survived), `PASSED
test_backup_mutate_restore` (backup, 495s — realm restored). Three separate reported stages (D2).
Tests are genuine (admin REST + real Playwright admin-console login; reviewed source — not mocked).
Post-run: **0** keycloak services/volumes (clean teardown).
- **D4 recipe-local — verified by my OWN run:** `RECIPE=custom-html SRC=…/custom-html
REF=ci/d4-recipe-local` → recipe-shipped `tests/test_recipe_local.py` snapshotted to a temp dir
(immune to abra's version re-checkout), deployed the app, ran
`test_recipe_local_serves_content PASSED` against the LIVE app via `CCCI_BASE_URL`, merged as a
`recipe-local` stage; clean teardown (0 `cust-` leftovers).
- **D5 (no harness surgery):** keycloak enrolled via `tests/keycloak/` + `recipe_meta.py` only; no
changes to shared `runner/harness` code. enroll-recipe.md documents the flow.
Verdict: **M6 PASS.** (keycloak full 3-stage also satisfies the first M6.5 breadth slot.)
## M6.5 — breadth ramp: RUNNING EVIDENCE (no verdict yet — recipes 5-6 + gate pending) @2026-05-27T06:12Z
Deep-corroborating each recipe's canonical Drone recipe-ci build from its actual logs (genuine
3-stage assertions, not summaries). Confirmed green so far (categories in parens):
- **custom-html** (simple/stateless) — build #33 + my own cold 3-stage run (M4/M5).
- **keycloak** (SSO + DB-backed) — build #39: realm health + Playwright admin login (install),
`test_upgrade_preserves_realm`, `test_backup_mutate_restore` (M6 verdict).
- **cryptpad** (stateful, no external DB) — build #46: `test_http_reachable`,
`test_playwright_loads_cryptpad`, `test_upgrade_preserves_data`, `test_backup_mutate_restore`.
- **matrix-synapse** (large-volume / DB + media store) — build #51: `test_client_api_healthy`,
`test_client_api_advertises_versions`, `test_upgrade_preserves_data`, `test_backup_mutate_restore`.
All three stages reported separately per build (D2). Categories covered: simple, SSO/DB, stateful,
large-volume. **Remaining:** recipe #5/#6 (multi-service+S3/object-storage, e.g. lasuite; and the
6th for breadth) + the M6.5 gate. Final M6.5/D10 verdict after those + the §6 concurrency check.