cc-ci/REVIEW.md
2026-05-27 00:52:35 +01:00

REVIEW — cc-ci Adversary (append-only)

This file is owned by the Adversary loop (§6.1). The Builder seeds this stub at bootstrap and does not edit it afterward. Adversary appends milestone/D-item verdicts (<id>: PASS @<ts> + evidence, or FAIL + a finding in BACKLOG.md ## Adversary findings), and may write ## VETO.

M0 — Foundations: PASS @2026-05-26T21:35Z

Verified cold (fresh shell, own clone /srv/cc-ci/cc-ci-adv, isolated host build dir /root/cc-ci-advverify, no reuse of Builder's /root/cc-ci).

Acceptance — "systemctl is-system-running healthy after a rebuild from the repo" + Builder's sops claim:

  • Repo rebuilds cc-ci: synced M0 commit deb4a0f (git-archive, no .git) to host, ran nixos-rebuild build --flake .#cc-ci → BUILD EXIT 0, produced …-nixos-system-nixos-24.11.20250630.50ab793. Current HEAD also builds clean.
  • System health: systemctl is-system-running → running; systemctl --failed → 0 units.
  • sops decrypt: /run/secrets/test_secret present, mode 400 root:root, 41 bytes, value begins cc-c… (matches claimed generated cc-ci-m0-…). secrets/secrets.yaml is genuinely encrypted (2× ENC[…] + sops metadata block).
  • D6 leak probe (early): the decrypted plaintext value appears 0 times across all git history (git grep -F over git rev-list --all) and 0× in plaintext in secrets.yaml. No leak.
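
The D6 probe above can be sketched as follows. This is a minimal illustration, not the exact command sequence that was run: the helper names `_git` and `secret_hits` are hypothetical, and the `git` callable is injected only so the logic can be exercised without a repository. It relies on `git grep` exiting with status 1 when nothing matches.

```python
import subprocess
from typing import Callable, List


def _git(args: List[str]) -> subprocess.CompletedProcess:
    """Run git in the current directory and capture its output as text."""
    return subprocess.run(["git", *args], capture_output=True, text=True)


def secret_hits(secret: str,
                git: Callable[[List[str]], subprocess.CompletedProcess] = _git
                ) -> List[str]:
    """Return every 'rev:path:line' hit for a fixed string across all history."""
    # Every commit reachable from any ref.
    revs = git(["rev-list", "--all"]).stdout.split()
    if not revs:
        return []
    # -F treats the secret as a literal string, not a regex.
    # Caveat: the secret is visible in argv while this runs, and a very long
    # history may need batching to stay under the argument-length limit.
    res = git(["grep", "-F", "-e", secret, *revs])
    if res.returncode == 1:  # git grep: 1 means "no match"
        return []
    if res.returncode != 0:
        raise RuntimeError(res.stderr)
    return res.stdout.splitlines()
```

An empty list corresponds to the "appears 0 times" result recorded above; any returned line names the commit and path where the plaintext leaked.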

Note (not a finding; context for the M1 gate): the running system is already ahead of M0 — its closure includes docker, unit-swarm-init, and traefik units (traefik.yml, traefik-stack.yml, unit-traefik-deploy) that are not yet committed (HEAD ab839ae is swarm-only, no traefik). Expected mid-M1 churn, but the Traefik config must be committed to the repo before M1 is claimed or it fails D8 reproducibility — will check at the M1 gate.

M1 — Swarm + abra target: PASS @2026-05-26T22:20Z

Verified cold from own clone; deployed my own probe recipe via abra (not trusting the Builder's hand-test). Acceptance "a recipe deployed via abra is reachable over HTTPS at *.ci.commoninternet.net, then fully torn down leaving no volumes" + orchestrator's M1 checklist (a–d).

  • (a) Real coop-cloud/traefik recipe (not hand-rolled): docker service ls → traefik_…_app (traefik:v3.6.15) + …_socket-proxy (lscr.io socket-proxy), i.e. the canonical recipe layout, deployed via abra (scripts/deploy-proxy.sh). modules/traefik.nix is deleted.
  • (b) Wildcard on web-secure + proxy overlay: static traefik.yml has web-secure: :443 (web→web-secure 301 redirect, verified live). File provider /etc/traefik/file-provider.yml: tls.certificates: [{certFile:/run/secrets/ssl_cert, keyFile:/run/secrets/ssl_key}]; swarm secrets …_ssl_cert_v1/…_ssl_key_v1 mounted (2909 B / 227 B = the pre-issued cert). My probe app advm1probe_…_app was attached to the proxy overlay.
  • E2E (cold deploy): abra app new custom-html -D advm1probe.ci.commoninternet.net (forced LETS_ENCRYPT_ENV="") → deploy succeeded 🟢. Via SOCKS proxy: HTTP 200; served cert subject: CN=*.ci.commoninternet.net, SAN-matched, SSL certificate verify ok, issuer LE E8 — i.e. the pre-issued wildcard, NOT a per-host ACME cert.
  • (c) No Gandi/DNS token, no ACME credential: repo (all history) clean; on host the only gandi/dns-challenge strings are commented-out recipe-template options (#GANDI_…, #SECRET_GANDIV5_…) holding no value. Active traefik env = LETS_ENCRYPT_ENV= (empty), WILDCARDS_ENABLED=1, compose.wildcard.yml. staging/production certResolvers are defined in traefik.yml (stock template) but referenced by no router; both acme.json are 0 bytes; 0 ACME lines in traefik logs. No ACME ever fires. (Hardening risk filed — see findings.)
  • (d) Manual renewal documented: DECISIONS.md — operator re-issues at same paths, then abra app secret rm … ssl_cert + re-insert at bumped version; install.md "Renewed out-of-band; never ACME here."
  • Teardown: abra app undeploy + volume remove → post-teardown services/containers/volumes/secrets for the probe all 0. Also independently confirmed the Builder's cchtml1 test left 0 runtime resources (only its inert .env config file remains, harmless).
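
The "all 0" teardown check can be expressed as a small filter over resource listings. The sketch below is illustrative only: `stack_name` and `leftovers` are hypothetical helpers, and it assumes the stack name is the app domain with dots replaced by underscores (consistent with the `advm1probe_…_app` service name above, but not confirmed here). The listings themselves would come from commands such as `docker service ls --format '{{.Name}}'` and `docker volume ls --format '{{.Name}}'`.

```python
from typing import Dict, List


def stack_name(domain: str) -> str:
    # Assumption: stack name = app domain with dots turned into underscores.
    return domain.replace(".", "_")


def leftovers(domain: str, listings: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per resource kind, the names that still belong to the app's stack."""
    prefix = stack_name(domain) + "_"
    found = {
        kind: [name for name in names if name.startswith(prefix)]
        for kind, names in listings.items()
    }
    # Drop kinds with nothing left so a clean teardown yields an empty dict.
    return {kind: names for kind, names in found.items() if names}
```

A clean teardown gives `{}`; anything else names the surviving services, containers, volumes, or secrets.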

Verdict: M1 PASS. Not a hard fail on (c) — no token/credential exists and no ACME fires — but the inert ACME resolvers + test-app default LETS_ENCRYPT_ENV=production are a latent hazard that goes live when the harness deploys apps; filed as [adversary] for M4.

M2 — Drone online: PASS @2026-05-26T23:32Z

Verified cold from own clone. Acceptance: "push to cc-ci triggers a visible green Drone build."

  • Drone server healthy: https://drone.ci.commoninternet.net/healthz → HTTP 200 via gateway. Exec runner (drone-runner-exec.service) active, polling the remote server capacity=2 type=exec.
  • Repo wired: in Drone's DB the recipe-maintainers/cc-ci repo is repo_active=1, repo_config=.drone.yml. Gitea↔Drone OAuth proven by the in-pipeline clone step succeeding against the private repo (build can't clone without working OAuth/repo token).
  • Push→green, independently triggered: I pushed my own commit 91a8e8d (a REVIEW.md change) → Drone created build #4, build_event=push, build_trigger=@hook (Gitea webhook), and it ran success: stage self-test exit 0, steps clone+hello both exit 0. Builds #1–#3 (Builder commits) likewise all success via @hook. (My earlier M0/M1 review pushes predate the .drone.yml, so correctly produced no builds.)
  • Visible logs (D7 precondition): logs table holds per-step log blobs for every build; Drone UI/API serve them. Full D7 UX is M8.
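
The push→green criterion can be stated as a predicate over a build record. This is a sketch under assumptions: the field names (`event`, `trigger`, `status`, `stages`, `steps`, `exit_code`) mirror the `build_event`/`build_trigger` columns quoted above and the shape I expect from Drone's build JSON, but they were not checked against the API here, and `is_hook_triggered_green` is a hypothetical name.

```python
from typing import Any, Dict


def is_hook_triggered_green(build: Dict[str, Any]) -> bool:
    """True only for a push build started by the webhook that finished cleanly."""
    # Must be a push delivered by the Gitea webhook, not a manual or API restart.
    if build.get("event") != "push" or build.get("trigger") != "@hook":
        return False
    if build.get("status") != "success":
        return False
    stages = build.get("stages", [])
    # Require at least one stage, and exit code 0 for every stage and step.
    return bool(stages) and all(
        stage.get("exit_code") == 0
        and all(step.get("exit_code") == 0 for step in stage.get("steps", []))
        for stage in stages
    )
```

Checking the trigger as well as the status is the point: a green build restarted by hand would not demonstrate that a push produces a build.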

Verdict: M2 PASS. No new findings.

M3 — Comment bridge: PRE-CLAIM PROGRESS (not yet PASS) @2026-05-26T23:48Z

M3 is Blocked in STATUS (Gitea not delivering webhooks), so not a gate verdict yet. But the bridge is deployed and I independently hammered its auth/filter logic — the part I can verify regardless of the delivery leg (and which survives a pivot to API polling). Probes were live POSTs to https://ci.commoninternet.net/hook via the SOCKS proxy, with HMAC signatures I computed from the on-host secret (read with root; value never printed/committed):

| probe | expect | got |
|---|---|---|
| no X-Gitea-Signature | 401 | 401 |
| bad signature | 401 | 401 |
| valid sig, event=ping (not issue_comment) | 204 | 204 |
| valid sig, !testmexyz on a real PR | 204 (no trigger) | 204 |
| valid sig, !testme but issue is not a PR | 204 | 204 |
| valid sig, !testme on PR, action=edited | 204 | 204 |
| valid sig, !testme on real PR, non-collaborator | 403 | 403 |

So: HMAC fail-closed + timing-safe (compare_digest, verified before body parse), !testmexyz correctly ignored (exact trimmed match), non-PR ignored, and a non-collaborator is rejected (403; collaborator status re-checked via Gitea API, not trusted from the signed payload). Source review of bridge/bridge.py found no auth bypass.
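
The ordering of checks described above can be sketched as a single decision function. This is not the code in bridge/bridge.py; it is a minimal reconstruction from the probe results. Assumptions: the signature is the hex HMAC-SHA256 of the raw body, the payload field names follow Gitea's issue_comment event, only `action=created` triggers, `is_collaborator` stands in for the out-of-band Gitea API lookup, and the 202 returned on success is a placeholder because the positive path has not been observed yet.

```python
import hashlib
import hmac
import json
from typing import Callable, Mapping

TRIGGER = "!testme"


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def decide(headers: Mapping[str, str], body: bytes, secret: bytes,
           is_collaborator: Callable[[str], bool]) -> int:
    """Return the HTTP status for one webhook delivery."""
    # 1. Authenticate the raw bytes before parsing anything; a missing
    #    header fails closed, and the comparison is constant-time.
    supplied = headers.get("X-Gitea-Signature", "")
    expected = sign(secret, body)
    if not supplied or not hmac.compare_digest(expected.encode(), supplied.encode()):
        return 401
    # 2. Cheap filters: wrong event, wrong action, not a PR, not the trigger.
    if headers.get("X-Gitea-Event") != "issue_comment":
        return 204
    payload = json.loads(body)
    if payload.get("action") != "created":
        return 204
    if not payload.get("issue", {}).get("pull_request"):
        return 204
    # Exact match after trimming, so "!testmexyz" is ignored.
    if payload.get("comment", {}).get("body", "").strip() != TRIGGER:
        return 204
    # 3. Re-check permission out of band instead of trusting the payload.
    if not is_collaborator(payload.get("sender", {}).get("login", "")):
        return 403
    return 202  # placeholder: the real success status is still unverified
```

Each row of the probe table maps to one early return, which is why the same logic would carry over unchanged to an API-polling pivot: only the delivery leg differs.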

Blocker independently corroborated (operator-side): the bridge hook is registered + active on recipe-maintainers/cc-ci (id 210, events [issue_comment] → ci.commoninternet.net/hook), and the bot is not a Gitea site-admin (GET /admin/hooks → 403) nor org owner, so it genuinely cannot inspect/change Gitea's [webhook] ALLOWED_HOST_LIST. Endorse STATUS ## Blocked: needs operator allowlisting or the documented poll-the-API fallback.

Still UNVERIFIED for an M3 PASS: (1) the positive path — a valid collaborator !testme actually starts a build + posts the PR comment end-to-end; (2) real Gitea→bridge delivery (or the polling pivot). Will complete both when M3 is claimed.

Noted for M7 (not a finding yet): the Drone-managed Gitea webhook (id 209) carries its webhook secret as a ?secret= query param in the hook URL (Drone default; admin-only in Gitea, not in cc-ci git / CI logs / dashboard). Will adjudicate against D6 at M7.

M4 — Harness + install stage: VERIFICATION IN PROGRESS (no verdict yet) @2026-05-27T00:35Z

M4 is CLAIMED. Code review done; runtime checks so far:

  • A1 CLOSED (see BACKLOG): harness forces LETS_ENCRYPT_ENV="" every deploy; live app cust-c95a69 served the wildcard cert, 0 ACME lines, no certresolver.
  • Happy-path teardown works: a prior run's app cust-e084bd was fully torn down (gone) — not an orphan; earlier ambiguity was a run cycling apps.
  • Two teardown-robustness defects filed (A2, A3): janitor's -pr filter is dead code under the cust-<hex> naming (no crash-orphan reaping); teardown is best-effort/unverified and deletes the .env even on failed undeploy (silent orphan, run still green).
  • Deferred to next idle tick (a Builder harness run is active now; sequential-only): my own cold install run (green install + Playwright + clean teardown verification) and the §6 kill-mid-run probe to test A3 empirically. Verdict (PASS/FAIL) follows that.