Files
cc-ci/machine-docs/STATUS.md
autonomic-bot 992d87cfcd refactor(1b): RL6 — move Builder protocol files into machine-docs/ (README stays root)
git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator
decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and
docs/install.md -> machine-docs/...

Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by
every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so
it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 22:35:30 +01:00

127 lines
9.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# STATUS — cc-ci Builder
## DONE — 2026-05-27
The cc-ci Co-op Cloud recipe CI server is **complete**. Every Definition-of-Done item (§2, D1D10)
is independently **Adversary-verified with a PASS dated <24h**, no standing `## VETO`, and the
Adversary explicitly cleared the §6.1 DONE handshake ("Builder may flip STATUS → DONE", REVIEW.md).
| D | Item | Verdict | Evidence (Adversary REVIEW.md) |
|---|---|---|---|
| D1 | `!testme` trigger | PASS | M3 @03:13Z + D10 real-`!testme` runs |
| D2 | install/upgrade/backup matrix (real e2e) | PASS | M4/M5/M6 + D10 6/6 (3 stages each) |
| D3 | Python + Playwright | PASS | live in every recipe install/D10 run |
| D4 | recipe-local tests | PASS | M6 @04:43Z |
| D5 | per-recipe tree, no harness surgery | PASS | M6.5 @07:25Z |
| D6 | secrets (no leaks, rotatable) | PASS | M7 @07:55Z (grep clean: logs+dashboard+git) |
| D7 | results UX (dashboard + PR outcome) | PASS | M8 @08:10Z |
| D8 | reproducible server | PASS | byte-identical `nixos-rebuild build`==running + documented-alt @10:52Z |
| D9 | documentation | PASS | @10:55Z (full docs set) |
| D10 | six recipes via real `!testme` | PASS (6/6) @11:57Z | custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90, lasuite-docs #108 |
D10 set spans all required categories: simple (custom-html), SSO/identity+DB (keycloak),
DB+media/large-volume (matrix-synapse), workflow (n8n), stateful/no-DB (cryptpad), multi-service +
S3/object-storage (lasuite-docs). bluesky-pds (TLS-passthrough) was swapped → n8n with a documented
reason (DECISIONS). Registry creds (A1) remain a documented good-to-have for rate-limit robustness,
not a DONE blocker. **Loop stopped.**
---
**Phase:** ALL MILESTONES BUILDER-COMPLETE. Adversary-verified: M0M6 PASS, M6.5 PASS, M7/D6 PASS,
**M8/D7 PASS, D8-core PASS, D9 PASS**. **Only D10 left to verify** — M10/D10 CLAIMED: all 6 recipes
green via real `!testme` (custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90,
lasuite-docs #108; all 5 categories). **D10 PASS (6/6) @11:57Z** logged by Adversary. Docker Hub
rate-limit blocker RESOLVED.
**DONE blocked on ONE item: D8 live blank-VM rebuild.** Adversary's D8 verdict (@10:52Z) = "core PASS
(Nix byte-identical closure + docs); live blank-VM rebuild pending — to complete before DONE." It was
DEFERRED on the premise that the rebuild needs operator registry creds (rate limit). **That premise
is now obsolete:** D10 passed 6/6 WITHOUT creds — the rate limit was transient and the real fix was
`abra app upgrade -c`. So the throwaway-VM live rebuild is feasible NOW in a fresh quota window
(no creds dependency). Surfacing for the Adversary to complete D8 → then all D1D10 <24h PASS DONE.
I will NOT write `## DONE` until REVIEW shows a full D8 PASS. No Builder implementation remains.
## Gate: M6.5 — CLAIMED, awaiting Adversary (2026-05-27)
All 6 D10 recipes have a full install/upgrade/backup green run, each verified on host AND via the
canonical Drone recipe-ci pipeline (build #s above), each with clean teardown (0 orphans). Categories:
custom-html=simple, keycloak=SSO/identity+DB, cryptpad=stateful/no-DB, matrix-synapse=DB+media/
large-volume, lasuite-docs=multi-service+S3/MinIO/object-storage, n8n=workflow automation. D5 held:
each recipe enrolled via `tests/<recipe>/` + `recipe_meta.py` (EXTRA_ENV for cryptpad SANDBOX_DOMAIN
/ lasuite TIMEOUT) only no shared `runner/harness` changes per recipe. Repro: trigger a custom
Drone build with RECIPE=<r> (or `cc-ci-run runner/run_recipe_ci.py` with RECIPE/STAGES on host).
## Gates
- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo
(`switch --flake /root/cc-ci#cc-ci`, gen healthy, no failed units); sops-nix decrypts
`/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to
host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret.
Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work.
**M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm +
`proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html
deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the
wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro:
`scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will
not flip M2's gate until M1 shows PASS. → **M1 PASS** @2026-05-26T22:20Z.
- **Gate: M2 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Drone server (coop-cloud recipe,
reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo
activated (push webhook). Pushing `.drone.yml` triggered build #1**success** (clone + hello exec
steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time
`scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator
(plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered;
commenter auth via org membership (`GET /orgs/{owner}/members/{user}` 204, read-level) + optional
allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme`
on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head
`d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level: bot/trav/
notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling doesn't
need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW shows PASS;
doing the bridge→Drone integration as independent work meanwhile.
## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)
- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone
auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner.
- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build
is cancelled → frees its slot. Verified Drone repo `timeout: 60`.
- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1
the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS.
## Blocked
- (none) — all blockers resolved. The lasuite-docs upgrade gap (Docker Hub rate limit, then abra's
false "deploy failed" on a converging rolling upgrade) is RESOLVED: quota reset + `abra app upgrade
-c` fix → lasuite #108 all 3 stages green via `!testme`. Registry pull creds (A1) remain a
RECOMMENDED durable hardening for heavy-recipe reproducibility under load (DECISIONS), not a
current blocker.
## Tracking (adversary findings I must address)
- **[adversary] A4 — concurrent same-recipe runs collide on shared `~/.abra/recipes/<recipe>`.**
Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now **eliminated**:
MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1 (resource-safety change). With ≤1 build at a time there is
**no concurrent run** on this single node, so the shared-recipe-dir race cannot occur. Builder side
addressed via the concurrency cap (per plan §4.2 "concurrency cap 12"); Adversary to re-test/close.
(Per-run `ABRA_DIR`/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
- **[adversary] A2 — janitor `-pr` filter dead.** Already fixed in code: `lifecycle.RUN_APP_RE` =
`^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$` (the hashed scheme), plus a stack-name regex
for `.env`-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
- **[adversary] A3 — teardown unverified; `.env` removed before confirmed undeploy.** Already fixed:
`lifecycle.teardown_app` undeploys → `docker stack rm` fallback if services remain → removes
volumes/secrets while `.env` exists → drops `.env` LAST → then `_residual()` check raises
`TeardownError` if anything is left. Awaiting Adversary kill-mid-run re-test.
- **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST
force `LETS_ENCRYPT_ENV=""` on every test-app deploy (already done in `scripts/deploy-proxy.sh` and
the M1 manual custom-html deploy; `scripts/deploy-drone.sh` will too). Considering a structural
belt-and-suspenders (drop the unused `certificatesResolvers` from cc-ci's traefik) — deferred,
needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary
re-tests + closes after M4. → **Now enforced**: `harness.lifecycle.deploy_app` sets
`LETS_ENCRYPT_ENV=""` on every test-app deploy (verified in the M4 custom-html run). Adversary can
re-test + close A1.
## Notes
- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
nixpkgs fetch exhausted). Both byte + inode pressure gone.
- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
(scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
up; clean up later (pick networkd OR scripting). Tracked, non-blocking.