cc-ci/STATUS.md
autonomic-bot 6232d2649c
STATUS: feature-complete except 6th D10 recipe; DONE gated on registry creds + Adversary
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:36:09 +01:00

# STATUS — cc-ci Builder
**Phase:** M0/M1/M2/M4/M5 PASS; M3 PASS (Adversary-verified); M6 CLAIMED (awaiting Adversary).
Bridge→Drone→harness integration DONE (recipe-ci pipeline). M6.5 underway: keycloak full 3-stage
GREEN through Drone (build #39). Next: enroll recipes 3–6 (remaining D10 categories), M7, M8.
**In-flight:** M6.5 gate CLAIMED — all 6 D10 recipes full 3-stage green (host + canonical Drone):
custom-html, keycloak(#39), cryptpad(#46), matrix-synapse(#51), lasuite-docs(#57), n8n(#63 in flight).
bluesky-pds (TLS-passthrough) swapped → n8n per DECISIONS (caddy self-ACME vs no-ACME design).
**M6.5 PASS + M7/D6 PASS** (Adversary). **M8/D7 CLAIMED** — dashboard overview+badges LIVE +
PR-comment outcome reflection (bridge edits comment to ✅/❌; verified). Remaining for DONE: M9
docs/reproducibility (D8 from-scratch rebuild + D9 docs) and **M10/D10** — the six recipes green via
**real `!testme` PRs** (currently proven via API-trigger; the Adversary-flagged gap). M10 = enroll
recipes in the bridge POLL_REPOS + open recipe-mirror PRs + `!testme` each.
**Last updated:** 2026-05-27 (M6.5 CLAIMED — 6/6 recipes 3-stage green across all D10 categories)
## Near-complete (2026-05-27)
Feature-complete except the 6th D10 recipe. Verified/claimed: M0–M6 PASS, M6.5 PASS, M7/D6 PASS,
M8/D7 CLAIMED. M9/D9 docs complete (architecture+runbook added). M10: **5/6 recipes green via real
`!testme`** (custom-html/keycloak/matrix-synapse/n8n/cryptpad). **DONE is gated on:** (1) operator
Docker Hub registry creds → lasuite-docs 6th green (A1 blocker, notified; retries halted); (2)
Adversary verification of M8/M9 + D8 from-scratch rebuild + the D10 runs. No unblocked Builder
implementation remains — awaiting operator creds + Adversary. On each wake: check `.testenv`/sops for
creds + rate-limit reset → if available, wire creds (or quota-retry) + run lasuite; else idle.
## Gate: M6.5 — CLAIMED, awaiting Adversary (2026-05-27)
All 6 D10 recipes have a full install/upgrade/backup green run, each verified on host AND via the
canonical Drone recipe-ci pipeline (build #s above), each with clean teardown (0 orphans). Categories:
custom-html=simple, keycloak=SSO/identity+DB, cryptpad=stateful/no-DB, matrix-synapse=DB+media/
large-volume, lasuite-docs=multi-service+S3/MinIO/object-storage, n8n=workflow automation. D5 held:
each recipe enrolled via `tests/<recipe>/` + `recipe_meta.py` (EXTRA_ENV for cryptpad SANDBOX_DOMAIN
/ lasuite TIMEOUT) only — no shared `runner/harness` changes per recipe. Repro: trigger a custom
Drone build with RECIPE=<r> (or `cc-ci-run runner/run_recipe_ci.py` with RECIPE/STAGES on host).
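
The custom-build repro above can be sketched against Drone's build-create endpoint (`POST /api/repos/{owner}/{repo}/builds`, where extra query parameters become build parameters). This is a minimal illustration, not the bridge's actual code: the server URL, owner, repo, branch, and token are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def build_trigger_url(server, owner, repo, branch, params):
    """Build the URL for a custom Drone build; `params` become build parameters."""
    query = urlencode({"branch": branch, **params})
    return f"{server.rstrip('/')}/api/repos/{quote(owner)}/{quote(repo)}/builds?{query}"


def trigger_build(url, token):
    """POST to the build-create URL with a Drone API token (placeholder value)."""
    req = urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical server/owner/repo; only RECIPE is taken from this document.
    print(build_trigger_url("https://drone.example", "org", "cc-ci", "main",
                            {"RECIPE": "keycloak"}))
```

Only the URL builder runs under `__main__`; `trigger_build` needs a real server and token.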
## Gates
- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo
(`switch --flake /root/cc-ci#cc-ci`, gen healthy, no failed units); sops-nix decrypts
`/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to
host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret.
Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work.
**M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm +
`proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html
deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the
wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro:
`scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will
not flip M2's gate until M1 shows PASS. → **M1 PASS** @2026-05-26T22:20Z.
- **Gate: M2 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Drone server (coop-cloud recipe,
reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo
activated (push webhook). Pushing `.drone.yml` triggered build #1 → **success** (clone + hello exec
steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time
`scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator
(plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered;
commenter auth via org membership (`GET /orgs/{owner}/members/{user}` → 204, read-level) + optional
allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme`
on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head
`d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level: bot/trav/
notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling doesn't
need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW shows PASS;
doing the bridge→Drone integration as independent work meanwhile.
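
The M3 commenter-auth rule (org membership returns 204, non-member returns 404) can be sketched as below. This is an assumption-laden illustration, not the bridge's code: the function names are hypothetical, the `/api/v1` prefix is the usual Gitea API root, and the allowlist is assumed to narrow the set of members rather than widen it.

```python
import urllib.error
import urllib.request


def membership_url(base_url, org, user):
    """URL of the read-level org-membership check."""
    return f"{base_url.rstrip('/')}/api/v1/orgs/{org}/members/{user}"


def fetch_status(url, token):
    """Return the HTTP status of the membership check (204 member, 404 not)."""
    req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def is_authorized(user, status, allowlist=None):
    """Authorize only on 204; if an allowlist is given, the user must be on it too.

    Assumption: the optional allowlist is an extra restriction on top of membership.
    """
    if status != 204:
        return False
    return allowlist is None or user in allowlist
```

Keeping the decision in `is_authorized` separate from the HTTP call makes the 204/404 rule testable without a Gitea instance.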
## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)
- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone
auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner.
- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build
is cancelled → frees its slot. Verified Drone repo `timeout: 60`.
- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1
the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS.
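
The janitor's selection step can be sketched as follows. This is a hypothetical sketch: it reuses the hashed run-app pattern quoted under Tracking (`lifecycle.RUN_APP_RE`), assumes the max age is in seconds, and takes app creation times as plain input instead of querying abra or Docker.

```python
import re
import time

# Hashed run-app naming scheme, as quoted for lifecycle.RUN_APP_RE in this file.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def select_orphans(apps, max_age, now=None):
    """Return run-app domains that are at least `max_age` seconds old.

    `apps` maps domain -> creation time (epoch seconds). Domains that do not
    match the run-app scheme are never selected, whatever their age.
    """
    now = time.time() if now is None else now
    return sorted(
        domain
        for domain, created in apps.items()
        if RUN_APP_RE.match(domain) and now - created >= max_age
    )
```

With `max_age=0` every matching run app is selected, which is only safe when no other run can be in progress, as at capacity=1.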
## Blocked
- **Docker Hub anonymous pull rate limit — registry pull creds needed (A1, operator).** During the
D10 real-`!testme` breadth runs, lasuite-docs (heaviest: 9 images) hit
`toomanyrequests: unauthenticated pull rate limit` on its upgrade stage (redis:8.2.6 task
Rejected "No such image" → couldn't pull). Confirmed: `docker pull redis:8.2.6` on the node →
rate-limited. This is the plan's flagged A1 input (§1.5/§4.4: "registry pull creds … rate-limit
failure traced to this is a finding, then request creds"). **Operator action:** provide Docker Hub
pull creds (store sops-encrypted in `secrets/`, wire into the docker daemon / swarm). NOT globally
blocking: **5/6 recipes already green via real `!testme`** (custom-html/keycloak/matrix-synapse/
n8n/cryptpad); lasuite-docs install+backup green too — only its upgrade (most pulls) is gated.
Contributing factor: my mid-breadth `docker image prune -af` evicted cached images → forced
re-pulls → tipped the limit (see DECISIONS). The anonymous limit resets in ~hours, so a retry may
also pass without creds, but creds are the durable fix. Working M9 (docs) meanwhile.
- (M3 webhook blocker previously here — cleared by the polling-primary redesign; polling is
read-only/outbound and needs no Gitea `ALLOWED_HOST_LIST` whitelist.)
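
To decide whether a quota retry is worth attempting, the anonymous Docker Hub limit can be read without pulling an image. The sketch below follows Docker's documented check (a token for `ratelimitpreview/test`, then a `HEAD` on its manifest, then the `ratelimit-remaining` header); the function names are hypothetical and this is not part of the harness.

```python
import json
import urllib.error
import urllib.request

TOKEN_URL = ("https://auth.docker.io/token?service=registry.docker.io"
             "&scope=repository:ratelimitpreview/test:pull")
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = None
    for part in rest.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "w" and val.isdigit():
            window = int(val)
    return int(count), window


def anonymous_remaining():
    """Return (remaining_pulls, window_seconds) for this IP, or None if absent."""
    with urllib.request.urlopen(TOKEN_URL, timeout=30) as resp:
        token = json.load(resp)["token"]
    req = urllib.request.Request(
        MANIFEST_URL, method="HEAD", headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            headers = resp.headers
    except urllib.error.HTTPError as err:
        headers = err.headers  # an error response may still carry the headers
    value = headers.get("ratelimit-remaining")
    return parse_ratelimit(value) if value else None
```

A result of `(0, 21600)` would mean the anonymous quota is spent for the current six-hour window.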
## Tracking (adversary findings I must address)
- **[adversary] A4 — concurrent same-recipe runs collide on shared `~/.abra/recipes/<recipe>`.**
Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now **eliminated**:
MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1 (resource-safety change). With ≤1 build at a time there is
**no concurrent run** on this single node, so the shared-recipe-dir race cannot occur. Builder side
addressed via the concurrency cap (per plan §4.2 "concurrency cap 1–2"); Adversary to re-test/close.
(Per-run `ABRA_DIR`/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
- **[adversary] A2 — janitor `-pr` filter dead.** Already fixed in code: `lifecycle.RUN_APP_RE` =
`^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$` (the hashed scheme), plus a stack-name regex
for `.env`-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
- **[adversary] A3 — teardown unverified; `.env` removed before confirmed undeploy.** Already fixed:
`lifecycle.teardown_app` undeploys → `docker stack rm` fallback if services remain → removes
volumes/secrets while `.env` exists → drops `.env` LAST → then `_residual()` check raises
`TeardownError` if anything is left. Awaiting Adversary kill-mid-run re-test.
- **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST
force `LETS_ENCRYPT_ENV=""` on every test-app deploy (already done in `scripts/deploy-proxy.sh` and
the M1 manual custom-html deploy; `scripts/deploy-drone.sh` will too). Considering a structural
belt-and-suspenders (drop the unused `certificatesResolvers` from cc-ci's traefik) — deferred,
needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary
re-tests + closes after M4. → **Now enforced**: `harness.lifecycle.deploy_app` sets
`LETS_ENCRYPT_ENV=""` on every test-app deploy (verified in the M4 custom-html run). Adversary can
re-test + close A1.
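
The A3 teardown ordering described above can be sketched as below. Only `teardown_app`, the residual check, and `TeardownError` are named in this file; the `ops` object and its method names are hypothetical stand-ins for the abra/Docker calls, so treat this as an illustration of the ordering, not the harness code.

```python
class TeardownError(RuntimeError):
    """Raised when anything is left behind after teardown."""


def teardown_app(ops):
    """Tear down one test app in the order the A3 fix describes.

    `ops` is a hypothetical adapter exposing: undeploy(), services_remain(),
    stack_rm(), remove_volumes_and_secrets(), remove_env(), residual().
    """
    ops.undeploy()
    if ops.services_remain():
        # Fallback when the normal undeploy leaves services behind.
        ops.stack_rm()
    # Volumes and secrets are removed while the .env still exists.
    ops.remove_volumes_and_secrets()
    # The .env is dropped last, so a failed run can still be identified.
    ops.remove_env()
    left = ops.residual()
    if left:
        raise TeardownError(f"residual resources after teardown: {left}")
```

Dropping `.env` only after the other removals means an interrupted teardown still leaves enough state for the janitor to find the app.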
## Notes
- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
nixpkgs fetch exhausted). Both byte + inode pressure gone.
- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
(scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
up; clean up later (pick networkd OR scripting). Tracked, non-blocking.