cc-ci/machine-docs/STATUS.md
autonomic-bot 992d87cfcd refactor(1b): RL6 — move Builder protocol files into machine-docs/ (README stays root)
git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator
decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and
docs/install.md -> machine-docs/...

Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by
every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so
it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 22:35:30 +01:00


STATUS — cc-ci Builder

DONE — 2026-05-27

The cc-ci Co-op Cloud recipe CI server is complete. Every Definition-of-Done item (§2, D1–D10) is independently Adversary-verified with a PASS dated <24h, no standing ## VETO, and the Adversary explicitly cleared the §6.1 DONE handshake ("Builder may flip STATUS → DONE", REVIEW.md).

| D | Item | Verdict | Evidence (Adversary REVIEW.md) |
|---|------|---------|--------------------------------|
| D1 | !testme trigger | PASS | M3 @03:13Z + D10 real-!testme runs |
| D2 | install/upgrade/backup matrix (real e2e) | PASS | M4/M5/M6 + D10 6/6 (3 stages each) |
| D3 | Python + Playwright | PASS | live in every recipe install/D10 run |
| D4 | recipe-local tests | PASS | M6 @04:43Z |
| D5 | per-recipe tree, no harness surgery | PASS | M6.5 @07:25Z |
| D6 | secrets (no leaks, rotatable) | PASS | M7 @07:55Z (grep clean: logs+dashboard+git) |
| D7 | results UX (dashboard + PR outcome) | PASS | M8 @08:10Z |
| D8 | reproducible server | PASS | byte-identical nixos-rebuild build==running + documented-alt @10:52Z |
| D9 | documentation | PASS | @10:55Z (full docs set) |
| D10 | six recipes via real !testme | PASS (6/6) | @11:57Z custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90, lasuite-docs #108 |

D10 set spans all required categories: simple (custom-html), SSO/identity+DB (keycloak), DB+media/large-volume (matrix-synapse), workflow (n8n), stateful/no-DB (cryptpad), multi-service + S3/object-storage (lasuite-docs). bluesky-pds (TLS-passthrough) was swapped → n8n with a documented reason (DECISIONS). Registry creds (A1) remain a documented good-to-have for rate-limit robustness, not a DONE blocker. Loop stopped.
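
The DONE condition stated above (every D1–D10 item has an Adversary PASS younger than 24h, no standing VETO, handshake cleared) can be sketched as a small predicate. This is an illustrative sketch only, not the project's code: the `Verdict` type, `done_gate`, and its parameters are hypothetical names, and how verdicts are parsed out of REVIEW.md is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical record for one Adversary verdict (not a type from the repo).
@dataclass(frozen=True)
class Verdict:
    item: str              # e.g. "D1"
    passed: bool
    verified_at: datetime  # timezone-aware

REQUIRED_ITEMS = tuple(f"D{i}" for i in range(1, 11))  # D1..D10
MAX_AGE = timedelta(hours=24)

def done_gate(verdicts, now, standing_veto, handshake_cleared):
    """Return (ok, reasons); ok is True only if nothing blocks DONE."""
    reasons = []
    by_item = {v.item: v for v in verdicts}
    for item in REQUIRED_ITEMS:
        verdict = by_item.get(item)
        if verdict is None:
            reasons.append(f"{item}: no verdict")
        elif not verdict.passed:
            reasons.append(f"{item}: not PASS")
        elif now - verdict.verified_at >= MAX_AGE:
            reasons.append(f"{item}: PASS is stale")
    if standing_veto:
        reasons.append("standing VETO")
    if not handshake_cleared:
        reasons.append("DONE handshake not cleared")
    return (not reasons, reasons)

if __name__ == "__main__":
    now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
    fresh = [Verdict(i, True, now - timedelta(hours=1)) for i in REQUIRED_ITEMS]
    print(done_gate(fresh, now, standing_veto=False, handshake_cleared=True))
```

Returning the list of reasons rather than a bare boolean keeps the gate auditable: a blocked DONE names the item that blocks it.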


Phase: ALL MILESTONES BUILDER-COMPLETE. Adversary-verified: M0–M6 PASS, M6.5 PASS, M7/D6 PASS, M8/D7 PASS, D8-core PASS, D9 PASS. Only D10 left to verify — M10/D10 CLAIMED: all 6 recipes green via real !testme (custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90, lasuite-docs #108; all 5 categories). D10 PASS (6/6) @11:57Z logged by Adversary. Docker Hub rate-limit blocker RESOLVED. DONE blocked on ONE item: D8 live blank-VM rebuild. Adversary's D8 verdict (@10:52Z) = "core PASS (Nix byte-identical closure + docs); live blank-VM rebuild pending — to complete before DONE." It was DEFERRED on the premise that the rebuild needs operator registry creds (rate limit). That premise is now obsolete: D10 passed 6/6 WITHOUT creds — the rate limit was transient and the real fix was abra app upgrade -c. So the throwaway-VM live rebuild is feasible NOW in a fresh quota window (no creds dependency). Surfacing for the Adversary to complete D8 → then all D1–D10 <24h PASS → DONE. I will NOT write ## DONE until REVIEW shows a full D8 PASS. No Builder implementation remains.

Gate: M6.5 — CLAIMED, awaiting Adversary (2026-05-27)

All 6 D10 recipes have a full install/upgrade/backup green run, each verified on host AND via the canonical Drone recipe-ci pipeline (build #s above), each with clean teardown (0 orphans). Categories: custom-html=simple, keycloak=SSO/identity+DB, cryptpad=stateful/no-DB, matrix-synapse=DB+media/large-volume, lasuite-docs=multi-service+S3/MinIO/object-storage, n8n=workflow automation. D5 held: each recipe enrolled via tests/<recipe>/ + recipe_meta.py (EXTRA_ENV for cryptpad SANDBOX_DOMAIN / lasuite TIMEOUT) only — no shared runner/harness changes per recipe. Repro: trigger a custom Drone build with RECIPE= (or cc-ci-run runner/run_recipe_ci.py with RECIPE/STAGES on host).

Gates

  • Gate: M0 — CLAIMED, awaiting Adversary (2026-05-26). Evidence: flake rebuilds cc-ci from repo (switch --flake /root/cc-ci#cc-ci, gen healthy, no failed units); sops-nix decrypts /run/secrets/test_secret (0400 root, value = generated cc-ci-m0-…). Repro: clone repo, sync to host, nixos-rebuild switch --flake .#cc-ci, then systemctl is-system-running + check the secret. Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work. → M0 PASS logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
  • Gate: M1 — CLAIMED, awaiting Adversary (2026-05-26). Evidence: Docker single-node swarm + proxy overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro: scripts/deploy-proxy.sh + abra app new/deploy/undeploy. Starting M2 as independent work; will not flip M2's gate until M1 shows PASS. → M1 PASS @2026-05-26T22:20Z.
  • Gate: M2 — CLAIMED, awaiting Adversary (2026-05-26). Evidence: Drone server (coop-cloud recipe, reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo activated (push webhook). Pushing .drone.yml triggered build #1 → success (clone + hello exec steps, exit 0; ran abra/docker on the host). Repro: nixos-rebuild switch + one-time scripts/bootstrap-drone-oauth.sh. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
  • Gate: M3 — CLAIMED, awaiting Adversary (2026-05-27). Trigger redesigned per orchestrator (plan §4.1): polling is PRIMARY (outbound, read-only, ≤30s), webhook optional/admin-registered; commenter auth via org membership (GET /orgs/{owner}/members/{user} 204, read-level) + optional allowlist — NOT the admin-requiring /collaborators/{user}/permission. Evidence: posted !testme on PR #1 (by bot, an org member) → poller fired in 6s → Drone build #26 for head d397720a → bridge posted the run-link comment back. Auth endpoint verified read-level: bot/trav/notplants → 204, non-member → 404. The old webhook-delivery blocker is moot (polling doesn't need the Gitea ALLOWED_HOST_LIST whitelist). Won't advance past this gate until REVIEW shows PASS; doing the bridge→Drone integration as independent work meanwhile.
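
The M3 commenter check described above maps the status of the read-level org-membership call (204 for a member, 404 for a non-member) onto a trigger decision. The sketch below shows that mapping under stated assumptions: the function names are hypothetical, the HTTP call itself is left to the caller, `!testme` is assumed to be matched as a whitespace-separated token, and the optional allowlist is assumed to narrow the set of members rather than widen it. The actual poller may differ on any of these points.

```python
from urllib.parse import quote

TRIGGER = "!testme"

def membership_path(owner: str, user: str) -> str:
    """Build the read-level membership path quoted in the M3 gate."""
    return f"/orgs/{quote(owner, safe='')}/members/{quote(user, safe='')}"

def is_member(status_code: int) -> bool:
    """Interpret the membership response: 204 = member, 404 = not a member."""
    if status_code == 204:
        return True
    if status_code == 404:
        return False
    # Anything else (401, 5xx, ...) is treated as an error, never as "allowed".
    raise RuntimeError(f"unexpected membership status: {status_code}")

def may_trigger(comment_body: str, user: str, status_code: int, allowlist=None) -> bool:
    """Decide whether a PR comment should start a run.

    ASSUMPTION: when an allowlist is configured it restricts org members
    further; it does not admit non-members.
    """
    if TRIGGER not in comment_body.split():
        return False
    if not is_member(status_code):
        return False
    return allowlist is None or user in allowlist

if __name__ == "__main__":
    print(membership_path("example-org", "bot"))
    print(may_trigger("!testme", "bot", 204))
    print(may_trigger("!testme", "stranger", 404))
```

Raising on unexpected status codes is a deliberate fail-closed choice in this sketch: an API outage should not be mistaken for either membership or non-membership.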

Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)

  • MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix): ≤1 build at once, Drone auto-queues the rest natively. Verified DRONE_RUNNER_CAPACITY=1 on the runner.
  • Per-build timeout = 60m (modules/drone.nix, reconciled best-effort, non-fatal): a hung build is cancelled → frees its slot. Verified Drone repo timeout: 60.
  • Janitor backstop for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1 the recipe-CI pipeline will set CCCI_JANITOR_MAX_AGE=0 (safe — no concurrent runs). See DECISIONS.
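
The janitor's selection step can be sketched as a pure function: keep only names that match the hashed run-app scheme (the `RUN_APP_RE` pattern quoted under Tracking, A2) and that are at least `CCCI_JANITOR_MAX_AGE` old. This is a sketch under assumptions, not the harness code: the unit (seconds), the fallback default, the helper names, and the input shape (a mapping of app domain to creation time) are all hypothetical.

```python
import os
import re

# Pattern as quoted in this document for lifecycle.RUN_APP_RE (hashed scheme).
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")

def janitor_max_age(environ=None, default=3600.0):
    """Read CCCI_JANITOR_MAX_AGE; ASSUMED to be seconds, default is hypothetical."""
    env = os.environ if environ is None else environ
    return float(env.get("CCCI_JANITOR_MAX_AGE", default))

def select_orphans(apps, now, max_age):
    """Return run-app domains eligible for reaping, sorted by name.

    apps: mapping of app domain -> creation time (epoch seconds).
    Non-run apps (proxy, Drone, hand-deployed apps) never match the
    pattern, so they are never selected regardless of age.
    """
    return sorted(
        name
        for name, created in apps.items()
        if RUN_APP_RE.match(name) and now - created >= max_age
    )

if __name__ == "__main__":
    apps = {
        "html-0a1b2c.ci.commoninternet.net": 100.0,  # old run app
        "kc-abcdef.ci.commoninternet.net": 990.0,    # recent run app
        "cchtml1.ci.commoninternet.net": 0.0,        # not a run app
    }
    print(select_orphans(apps, now=1000.0, max_age=60.0))
    print(select_orphans(apps, now=1000.0, max_age=janitor_max_age({"CCCI_JANITOR_MAX_AGE": "0"})))
```

With a maximum age of 0 every matching run app is eligible, which is only safe when no other run can be in flight, as the capacity=1 setting above guarantees.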

Blocked

  • (none) — all blockers resolved. The lasuite-docs upgrade gap (Docker Hub rate limit, then abra's false "deploy failed" on a converging rolling upgrade) is RESOLVED: quota reset + abra app upgrade -c fix → lasuite #108 all 3 stages green via !testme. Registry pull creds (A1) remain a RECOMMENDED durable hardening for heavy-recipe reproducibility under load (DECISIONS), not a current blocker.

Tracking (adversary findings I must address)

  • [adversary] A4 — concurrent same-recipe runs collide on shared ~/.abra/recipes/<recipe>. Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now eliminated: MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (resource-safety change). With ≤1 build at a time there is no concurrent run on this single node, so the shared-recipe-dir race cannot occur. Builder side addressed via the concurrency cap (per plan §4.2 "concurrency cap 1–2"); Adversary to re-test/close. (Per-run ABRA_DIR/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
  • [adversary] A2 — janitor -pr filter dead. Already fixed in code: lifecycle.RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$ (the hashed scheme), plus a stack-name regex for .env-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
  • [adversary] A3 — teardown unverified; .env removed before confirmed undeploy. Already fixed: lifecycle.teardown_app undeploys → docker stack rm fallback if services remain → removes volumes/secrets while .env exists → drops .env LAST → then _residual() check raises TeardownError if anything is left. Awaiting Adversary kill-mid-run re-test.
  • [adversary] A1 — no-ACME hazard for test apps. Acknowledged (valid). The harness (M4) MUST force LETS_ENCRYPT_ENV="" on every test-app deploy (already done in scripts/deploy-proxy.sh and the M1 manual custom-html deploy; scripts/deploy-drone.sh will too). Considering a structural belt-and-suspenders (drop the unused certificatesResolvers from cc-ci's traefik) — deferred, needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary re-tests + closes after M4. → Now enforced: harness.lifecycle.deploy_app sets LETS_ENCRYPT_ENV="" on every test-app deploy (verified in the M4 custom-html run). Adversary can re-test + close A1.
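
The teardown ordering described for A3 (undeploy, fall back to a stack removal if services remain, remove volumes and secrets while .env still exists, drop .env last, then verify nothing is left) can be sketched as follows. This is an illustration of the ordering only: `RecordingOps` is a hypothetical stand-in that records calls instead of invoking abra or docker, and the method names are invented for the example rather than taken from the lifecycle module.

```python
class TeardownError(RuntimeError):
    """Raised when resources are still present after teardown."""

class RecordingOps:
    """Hypothetical stand-in for the real abra/docker calls; records call order."""

    def __init__(self, services_left=False, residual=()):
        self.calls = []
        self._services_left = services_left
        self._residual = list(residual)

    def undeploy(self, app):
        self.calls.append("undeploy")

    def services_remaining(self, app):
        return self._services_left

    def stack_rm(self, app):
        self.calls.append("stack_rm")

    def remove_volumes(self, app):
        self.calls.append("remove_volumes")

    def remove_secrets(self, app):
        self.calls.append("remove_secrets")

    def remove_env(self, app):
        self.calls.append("remove_env")

    def residual(self, app):
        return list(self._residual)

def teardown_app(app, ops):
    """Tear down one run app in the order described above."""
    ops.undeploy(app)
    if ops.services_remaining(app):
        ops.stack_rm(app)      # fallback when undeploy left services behind
    ops.remove_volumes(app)    # .env still exists at this point
    ops.remove_secrets(app)
    ops.remove_env(app)        # .env is dropped LAST
    leftovers = ops.residual(app)
    if leftovers:
        raise TeardownError(f"{app}: residual resources: {leftovers}")

if __name__ == "__main__":
    ops = RecordingOps(services_left=True)
    teardown_app("html-0a1b2c.ci.commoninternet.net", ops)
    print(ops.calls)
```

Keeping the .env removal as the final mutating step means a run killed mid-teardown still leaves enough state behind for a later pass to find and finish the cleanup.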

Notes

  • Disk RESOLVED: operator grew the VM 8.9→28 GiB (22 GiB free) on 2026-05-26. Inodes 1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's nixpkgs fetch exhausted). Both byte + inode pressure gone.
  • M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first rebuild is no-op-then-base. Deployed via nixos-rebuild switch --flake /root/cc-ci#cc-ci run as a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
  • Open warning: incus module enables systemd.network while we set networking.useDHCP=true (scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is up; clean up later (pick networkd OR scripting). Tracked, non-blocking.