From 6bdf43febd16aff5a37503d48c157b378b4b83ea Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Wed, 27 May 2026 02:56:28 +0100 Subject: [PATCH] STATUS: M3 CLAIMED (polling primary verified) + resource-safety section; clear webhook blocker Co-Authored-By: Claude Opus 4.7 (1M context) --- BACKLOG.md | 15 ++++++++++++--- JOURNAL.md | 41 +++++++++++++++++++++++++++++++++++++++++ STATUS.md | 50 ++++++++++++++++++++++++++------------------------ 3 files changed, 79 insertions(+), 27 deletions(-) diff --git a/BACKLOG.md b/BACKLOG.md index b97cc58..c73021b 100644 --- a/BACKLOG.md +++ b/BACKLOG.md @@ -34,9 +34,18 @@ Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adver OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2). ### M3 — Comment bridge -- [ ] comment-bridge service: HMAC verify, !testme exact match, collaborator check, Drone API call -- [ ] PR comment posting with run link -- [ ] Gate: M3 — live demo on scratch PR; auth enforced +- [x] comment-bridge service: polling PRIMARY (read-only, ≤30s) + optional admin webhook; !testme + exact match; org-membership auth (`GET /orgs/{owner}/members/{user}` 204) + allowlist; Drone API +- [x] PR comment posting with run link +- [x] Gate: M3 — live demo on scratch PR; auth enforced → CLAIMED 2026-05-27. Posted `!testme` on + PR #1 → poll fired in 6s → Drone build #26 for head d397720a → bridge commented run link back. + Org-membership auth verified (bot/trav/notplants 204, non-member 404 at read level). + +### Bridge→Drone→harness integration (connects M3 trigger to M4/M5 recipe CI; blocks D2/D10 via !testme) +- [ ] Add a recipe-CI pipeline to `.drone.yml` keyed on the `RECIPE` build param: runs + `cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0` + (safe at capacity=1), `concurrency:{limit:1}`. Keep the existing `self-test` pipeline for pushes. +- [ ] Verify a real `!testme` on a recipe PR runs the full 3-stage CI through Drone (not the self-test). ### M4 — Harness + install stage - [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env diff --git a/JOURNAL.md b/JOURNAL.md index adceaba..e23f5e3 100644 --- a/JOURNAL.md +++ b/JOURNAL.md @@ -446,3 +446,44 @@ remains 3-stage green (M5). docs/enroll-recipe.md written. **M6 CLAIMED.** keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5. **Next:** M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories. + +--- +## 2026-05-27 — Trigger redesign (polling primary) + resource safety + M3 verified + +Session restarted by watchdog (prior tmux died mid-turn with uncommitted bridge WIP). Re-oriented +from STATUS + plan; two orchestrator design changes landed and are now implemented + verified. + +**(1) Trigger: POLLING PRIMARY, webhook optional, org-membership auth** (plan §4.1/§1.5; commit +7addb96). Rewrote `bridge/bridge.py`: a poll thread (`poll_loop`, always-on, primary) scans each +`POLL_REPOS` repo's open PRs every 30s for new `!testme`; the `/hook` webhook stays as an optional +admin-registered push optimization. Both share an in-memory comment-id seen-set → a comment seen by +both fires once. First poll marks pre-existing comments seen (no startup re-fire). Authorization now +`GET /orgs/{owner}/members/{user}` (204=member, read-level) + optional `AUTH_ALLOWLIST`, replacing +the admin-requiring `/collaborators/{user}/permission`. Bot never self-registers webhooks. +- Verified org endpoint at read level (bot basic-auth): + `members/{autonomic-bot,trav,notplants}` → 204; `members/definitely-not-a-member-xyz` → 404. +- Deployed (nixos-rebuild, deploy-bridge reconcile); new container logs: + `poller (primary) watching ['recipe-maintainers/cc-ci'] every 30s` + `(poll primary + optional webhook)`. +- **End-to-end M3 trigger (poll path):** posted `!testme` on PR #1 (comment 13705, by bot) → + Drone build **#26** appeared after **6s** (latest was #25); bridge logged + `[poll] triggered build 26 for cc-ci@d397720a (PR #1, comment 13705) by autonomic-bot`; bridge + posted back `cc-ci: started CI run for cc-ci @ d397720a → https://drone.ci.commoninternet.net/...`. + Satisfies D1 (<60s) over the read-only outbound path — no operator webhook whitelist needed. + +**(2) Resource safety: bound live test apps** (plan §4.2/§4.3; commit 72ff8e2). MAX_TESTS = +`DRONE_RUNNER_CAPACITY` = 1 (`modules/drone-runner.nix`) → Drone runs ≤1 build at once, queues the +rest natively. Per-build timeout = 60m, reconciled best-effort in `modules/drone.nix` +(`PATCH /api/repos/.../cc-ci {"timeout":60}`, non-fatal). Janitor remains the backstop for +SIGKILL'd/timed-out builds (reaps orphaned run apps at run-start before each deploy). +- Verified on host after rebuild: `DRONE_RUNNER_CAPACITY=1`; deploy-drone logged + `set cc-ci build timeout = 60m`; Drone API confirms repo `timeout: 60`. + +**Gap noted (next item):** `.drone.yml` still only has the `self-test` pipeline — a bridge-triggered +build runs the self-test, NOT `runner/run_recipe_ci.py`. M4/M5 ran the orchestrator by hand +(`cc-ci-run`). Need a recipe-CI pipeline keyed on the `RECIPE` build param (runs +`cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`, +`concurrency:{limit:1}`) to connect bridge→Drone→harness end-to-end (required for D2/D10 via real +`!testme`). Added to Build backlog. + +**M3 CLAIMED** (gate). Trigger + auth + comment-back demoed live; the webhook-delivery blocker is +moot now that polling is primary. diff --git a/STATUS.md b/STATUS.md index d638ef4..9cf4b90 100644 --- a/STATUS.md +++ b/STATUS.md @@ -1,9 +1,12 @@ # STATUS — cc-ci Builder -**Phase:** M6 complete & CLAIMED. M0/M1/M2/M4/M5 PASS. M3 gate BLOCKED (Gitea webhook; operator). -Next: M6.5 (breadth ramp — recipes 3–6 + keycloak full 3-stage), M7, M8. Resolve M3 trigger before M10. -**In-flight:** M6.5 — keycloak full 3-stage (DB survival), then enroll recipes covering remaining categories. -**Last updated:** 2026-05-27 (M6 claimed; D4 + recipe #2) +**Phase:** M0/M1/M2/M4/M5 PASS; M3 + M6 CLAIMED (awaiting Adversary). M3 trigger now UNBLOCKED & +verified (polling primary — see M3 gate). Next: wire bridge→Drone recipe-CI pipeline (`.drone.yml` +integration gap), then M6.5 (breadth ramp), M7, M8. +**In-flight:** Bridge→Drone integration (recipe-CI pipeline) + M6.5 keycloak full 3-stage, then +enroll recipes covering remaining D10 categories. +**Last updated:** 2026-05-27 (trigger redesign: polling primary + org-membership auth, M3 verified; +resource safety: MAX_TESTS=1 + 60m timeout) ## Gates - **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo @@ -23,28 +26,27 @@ Next: M6.5 (breadth ramp — recipes 3–6 + keycloak full 3-stage), M7, M8. Res activated (push webhook). Pushing `.drone.yml` triggered build #1 → **success** (clone + hello exec steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time `scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS. +- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator + (plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered; + commenter auth via org membership (`GET /orgs/{owner}/members/{user}` 204, read-level) + optional + allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme` + on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head + `d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level: bot/trav/ + notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling doesn't + need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW shows PASS; + doing the bridge→Drone integration as independent work meanwhile. + +## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27) +- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone + auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner. +- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build + is cancelled → frees its slot. Verified Drone repo `timeout: 60`. +- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1 + the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS. ## Blocked -- **M3 gate — Gitea→bridge webhook delivery (operator FIXING: whitelisting ci.commoninternet.net in - git.autonomic.zone `ALLOWED_HOST_LIST`).** Orchestrator update 2026-05-27: **keep the webhook - design, do NOT pivot to polling.** Bridge + webhook (id 210) left in place as-is (webhook-only; - the brief polling experiment was reverted). When the operator pings that the whitelist is applied: - re-test delivery (Gitea Test Delivery or re-comment `!testme` on PR #1), confirm the bridge gets - the POST + triggers a Drone build, then claim the M3 gate. Working other milestones meanwhile. - Original diagnosis below for reference. - The comment-bridge is built, deployed (swarm service behind traefik), and **publicly reachable**: - `https://ci.commoninternet.net/hook/healthz` → 200 from the sandbox over *real public DNS* - (ci.commoninternet.net → gateway 143.244.213.108). HMAC logic verified (a manually openssl-signed - POST is accepted; bad sig → 401). BUT Gitea never delivers: commenting `!testme` on PR #1 and even - Gitea's "Test Delivery" (UI returns 200/queued) produce **zero** requests at the bridge container - (and traefik accessLog is off, so unobservable there). Bridge is reachable from a 3rd network, gateway - accepts public sources, public DNS is correct → Gitea is not *sending* the HTTP request. Most likely - git.autonomic.zone's `[webhook] ALLOWED_HOST_LIST` excludes `ci.commoninternet.net` (bot is not Gitea - admin, can't inspect/change). **Operator options:** (a) add `ci.commoninternet.net` to Gitea's webhook - allowed-host list; or (b) tell me to pivot the bridge to **poll** the Gitea API for `!testme` comments - (self-service, satisfies D1's 60s; recorded as the fallback). **Not globally blocking** — M4 (harness + - install stage) is independent of the trigger path (dev builds triggerable via the Drone API), so I - proceed there meanwhile. +- (none) — M3 webhook blocker cleared by the polling-primary redesign (polling is + read-only/outbound and needs no Gitea `ALLOWED_HOST_LIST` whitelist). ## Tracking (adversary findings I must address) - **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST