git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and docs/install.md -> machine-docs/... Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# STATUS — cc-ci Builder
## DONE — 2026-05-27

The cc-ci Co-op Cloud recipe CI server is **complete**. Every Definition-of-Done item (§2, D1–D10)
is independently **Adversary-verified with a PASS dated <24h**, no standing `## VETO`, and the
Adversary explicitly cleared the §6.1 DONE handshake ("Builder may flip STATUS → DONE", REVIEW.md).

| D | Item | Verdict | Evidence (Adversary REVIEW.md) |
|---|---|---|---|
| D1 | `!testme` trigger | PASS | M3 @03:13Z + D10 real-`!testme` runs |
| D2 | install/upgrade/backup matrix (real e2e) | PASS | M4/M5/M6 + D10 6/6 (3 stages each) |
| D3 | Python + Playwright | PASS | live in every recipe install/D10 run |
| D4 | recipe-local tests | PASS | M6 @04:43Z |
| D5 | per-recipe tree, no harness surgery | PASS | M6.5 @07:25Z |
| D6 | secrets (no leaks, rotatable) | PASS | M7 @07:55Z (grep clean: logs+dashboard+git) |
| D7 | results UX (dashboard + PR outcome) | PASS | M8 @08:10Z |
| D8 | reproducible server | PASS | byte-identical `nixos-rebuild build`==running + documented-alt @10:52Z |
| D9 | documentation | PASS | @10:55Z (full docs set) |
| D10 | six recipes via real `!testme` | PASS (6/6) @11:57Z | custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90, lasuite-docs #108 |

D10 set spans all required categories: simple (custom-html), SSO/identity+DB (keycloak),
DB+media/large-volume (matrix-synapse), workflow (n8n), stateful/no-DB (cryptpad), multi-service +
S3/object-storage (lasuite-docs). bluesky-pds (TLS-passthrough) was swapped → n8n with a documented
reason (DECISIONS). Registry creds (A1) remain a documented good-to-have for rate-limit robustness,
not a DONE blocker. **Loop stopped.**

---

**Phase:** ALL MILESTONES BUILDER-COMPLETE. Adversary-verified: M0–M6 PASS, M6.5 PASS, M7/D6 PASS,
**M8/D7 PASS, D8-core PASS, D9 PASS**. **Only D10 left to verify** — M10/D10 CLAIMED: all 6 recipes
green via real `!testme` (custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90,
lasuite-docs #108; all 5 categories). **D10 PASS (6/6) @11:57Z** logged by Adversary. Docker Hub
rate-limit blocker RESOLVED.

**DONE blocked on ONE item: D8 live blank-VM rebuild.** Adversary's D8 verdict (@10:52Z) = "core PASS
(Nix byte-identical closure + docs); live blank-VM rebuild pending — to complete before DONE." It was
DEFERRED on the premise that the rebuild needs operator registry creds (rate limit). **That premise
is now obsolete:** D10 passed 6/6 WITHOUT creds — the rate limit was transient and the real fix was
`abra app upgrade -c`. So the throwaway-VM live rebuild is feasible NOW in a fresh quota window
(no creds dependency). Surfacing for the Adversary to complete D8 → then all D1–D10 <24h PASS → DONE.
I will NOT write `## DONE` until REVIEW shows a full D8 PASS. No Builder implementation remains.

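
The DONE rule stated above (every D1–D10 item has an Adversary PASS dated <24h, and no standing `## VETO`) can be sketched as a small check. This is an illustrative sketch only: the `Verdict` record and `done_blockers` helper are hypothetical names, not part of the cc-ci repo, and it assumes verdicts have already been parsed out of REVIEW.md.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record for one Adversary verdict parsed from REVIEW.md.
@dataclass(frozen=True)
class Verdict:
    item: str      # e.g. "D8"
    result: str    # "PASS", or anything else such as "core PASS"
    at: datetime   # when the Adversary logged it

REQUIRED = tuple(f"D{i}" for i in range(1, 11))  # D1..D10

def done_blockers(verdicts, now, veto_standing=False, max_age=timedelta(hours=24)):
    """Return the reasons DONE may not be written; an empty list means clear."""
    # Keep only the most recent verdict per item.
    latest = {}
    for v in verdicts:
        if v.item not in latest or v.at > latest[v.item].at:
            latest[v.item] = v

    blockers = []
    if veto_standing:
        blockers.append("standing VETO")
    for item in REQUIRED:
        v = latest.get(item)
        if v is None:
            blockers.append(f"{item}: no verdict")
        elif v.result != "PASS":
            # A partial verdict such as "core PASS" does not count as a full PASS.
            blockers.append(f"{item}: {v.result}")
        elif now - v.at >= max_age:
            blockers.append(f"{item}: PASS older than 24h")
    return blockers
```

With nine fresh PASS verdicts and a D8 verdict of `"core PASS"`, the function returns `["D8: core PASS"]`, which mirrors the single blocker described in the paragraph above.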
## Gate: M6.5 — CLAIMED, awaiting Adversary (2026-05-27)

All 6 D10 recipes have a full install/upgrade/backup green run, each verified on host AND via the
canonical Drone recipe-ci pipeline (build #s above), each with clean teardown (0 orphans). Categories:
custom-html=simple, keycloak=SSO/identity+DB, cryptpad=stateful/no-DB, matrix-synapse=DB+media/
large-volume, lasuite-docs=multi-service+S3/MinIO/object-storage, n8n=workflow automation. D5 held:
each recipe enrolled via `tests/<recipe>/` + `recipe_meta.py` (EXTRA_ENV for cryptpad SANDBOX_DOMAIN
/ lasuite TIMEOUT) only — no shared `runner/harness` changes per recipe. Repro: trigger a custom
Drone build with RECIPE=<r> (or `cc-ci-run runner/run_recipe_ci.py` with RECIPE/STAGES on host).

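
As a rough illustration of the D5 point (per-recipe settings live in the recipe's own tree, not in the shared harness), the sketch below loads an optional `EXTRA_ENV` dict from a per-recipe `recipe_meta.py` and overlays it on a base environment. The file location `tests/<recipe>/recipe_meta.py`, the helper names `load_extra_env` / `build_env`, and the merge behaviour are all assumptions made for illustration; the real harness may do this differently.

```python
import importlib.util
from pathlib import Path

def load_extra_env(tests_root, recipe):
    """Return EXTRA_ENV from <tests_root>/<recipe>/recipe_meta.py, or {} if absent.

    Assumption: recipe_meta.py sits inside the recipe's test directory.
    """
    meta = Path(tests_root) / recipe / "recipe_meta.py"
    if not meta.is_file():
        return {}
    # Load the file as an anonymous module so no package layout is required.
    spec = importlib.util.spec_from_file_location(f"recipe_meta_{recipe}", meta)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return dict(getattr(module, "EXTRA_ENV", {}))

def build_env(base_env, tests_root, recipe):
    """Overlay the recipe's EXTRA_ENV on a copy of the shared base environment."""
    env = dict(base_env)
    env.update(load_extra_env(tests_root, recipe))
    return env
```

Under this layout, enrolling a recipe that needs a non-default setting means adding one file under its own directory; a recipe without `recipe_meta.py` simply gets the base environment unchanged.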
## Gates

- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo
  (`switch --flake /root/cc-ci#cc-ci`, gen healthy, no failed units); sops-nix decrypts
  `/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to
  host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret.
  Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work.
  → **M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm +
  `proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html
  deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the
  wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro:
  `scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will
  not flip M2's gate until M1 shows PASS. → **M1 PASS** @2026-05-26T22:20Z.
- **Gate: M2 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Drone server (coop-cloud recipe,
  reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo
  activated (push webhook). Pushing `.drone.yml` triggered build #1 → **success** (clone + hello exec
  steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time
  `scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator
  (plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered;
  commenter auth via org membership (`GET /orgs/{owner}/members/{user}` 204, read-level) + optional
  allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme`
  on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head
  `d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level: bot/trav/
  notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling doesn't
  need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW shows PASS;
  doing the bridge→Drone integration as independent work meanwhile.

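
The M3 commenter check (org membership answers 204 for members and 404 for non-members) could look roughly like the sketch below. It is a sketch under assumptions, not the bridge's actual code: `membership_status` and `may_trigger` are hypothetical names, `base_url` is assumed to already include the API prefix, the `Authorization: token …` header is the Gitea-style token form, and the allowlist is assumed to narrow the set of members rather than widen it.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen

def membership_status(base_url, owner, user, token):
    """Return the HTTP status of GET {base_url}/orgs/{owner}/members/{user}."""
    url = f"{base_url}/orgs/{quote(owner, safe='')}/members/{quote(user, safe='')}"
    req = Request(url, headers={"Authorization": f"token {token}"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code

def may_trigger(user, owner, status_of, allowlist=None):
    """Decide whether `user` may trigger a run with `!testme`.

    `status_of(owner, user)` returns the membership endpoint's HTTP status.
    Assumption: when an allowlist is configured, the user must also be on it.
    """
    if allowlist is not None and user not in allowlist:
        return False
    return status_of(owner, user) == 204
```

Passing the status lookup in as `status_of` keeps the decision testable without network access; in production it would be a thin wrapper around `membership_status` with the bot's read-level token.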
## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)

- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone
  auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner.
- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build
  is cancelled → frees its slot. Verified Drone repo `timeout: 60`.
- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1
  the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS.

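
The janitor's reap decision can be sketched as "name matches the run-app pattern AND the app is at least `CCCI_JANITOR_MAX_AGE` old". The regex below is the `lifecycle.RUN_APP_RE` pattern quoted under A2 in the Tracking section; everything else (the helper names, the unit of seconds, and the fallback default of 3600) is an assumption for illustration.

```python
import os
import re

# Pattern for hashed run-app domains, as quoted for lifecycle.RUN_APP_RE.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")

def max_age_seconds(environ=None, default=3600):
    """Read CCCI_JANITOR_MAX_AGE (assumed to be seconds; the default is hypothetical)."""
    environ = os.environ if environ is None else environ
    return int(environ.get("CCCI_JANITOR_MAX_AGE", default))

def should_reap(app_domain, age_seconds, max_age):
    """Reap only apps that look like CI run apps and are at least max_age old."""
    return bool(RUN_APP_RE.match(app_domain)) and age_seconds >= max_age
```

With `CCCI_JANITOR_MAX_AGE=0` every matching run app is reaped at run-start regardless of age, which is only safe because capacity=1 guarantees no other run is live. Long-lived names that do not match the pattern are never candidates.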
## Blocked

- (none) — all blockers resolved. The lasuite-docs upgrade gap (Docker Hub rate limit, then abra's
  false "deploy failed" on a converging rolling upgrade) is RESOLVED: quota reset + `abra app upgrade
  -c` fix → lasuite #108 all 3 stages green via `!testme`. Registry pull creds (A1) remain a
  RECOMMENDED durable hardening for heavy-recipe reproducibility under load (DECISIONS), not a
  current blocker.

## Tracking (adversary findings I must address)

- **[adversary] A4 — concurrent same-recipe runs collide on shared `~/.abra/recipes/<recipe>`.**
  Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now **eliminated**:
  MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1 (resource-safety change). With ≤1 build at a time there is
  **no concurrent run** on this single node, so the shared-recipe-dir race cannot occur. Builder side
  addressed via the concurrency cap (per plan §4.2 "concurrency cap 1–2"); Adversary to re-test/close.
  (Per-run `ABRA_DIR`/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
- **[adversary] A2 — janitor `-pr` filter dead.** Already fixed in code: `lifecycle.RUN_APP_RE` =
  `^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$` (the hashed scheme), plus a stack-name regex
  for `.env`-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
- **[adversary] A3 — teardown unverified; `.env` removed before confirmed undeploy.** Already fixed:
  `lifecycle.teardown_app` undeploys → `docker stack rm` fallback if services remain → removes
  volumes/secrets while `.env` exists → drops `.env` LAST → then `_residual()` check raises
  `TeardownError` if anything is left. Awaiting Adversary kill-mid-run re-test.
- **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST
  force `LETS_ENCRYPT_ENV=""` on every test-app deploy (already done in `scripts/deploy-proxy.sh` and
  the M1 manual custom-html deploy; `scripts/deploy-drone.sh` will too). Considering a structural
  belt-and-suspenders (drop the unused `certificatesResolvers` from cc-ci's traefik) — deferred,
  needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary
  re-tests + closes after M4. → **Now enforced**: `harness.lifecycle.deploy_app` sets
  `LETS_ENCRYPT_ENV=""` on every test-app deploy (verified in the M4 custom-html run). Adversary can
  re-test + close A1.

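
The teardown ordering described for A3 can be sketched as below. This is not the repo's `lifecycle.teardown_app`; it is a minimal illustration of the same sequence, with the side-effecting steps injected through a hypothetical `ops` object (its method names are assumptions) so the order can be checked without Docker or abra.

```python
class TeardownError(RuntimeError):
    """Raised when resources remain after a teardown attempt."""

def teardown_app(app, ops):
    """Tear down `app` in the order described for A3.

    `ops` is a hypothetical adapter exposing: undeploy, services, stack_rm,
    remove_volumes, remove_secrets, remove_env, residual.
    """
    ops.undeploy(app)
    # Fallback: if services survived the undeploy, remove the stack directly.
    if ops.services(app):
        ops.stack_rm(app)
    # Volumes and secrets go while the .env still exists.
    ops.remove_volumes(app)
    ops.remove_secrets(app)
    # The .env is dropped LAST.
    ops.remove_env(app)
    # Verify instead of assuming: anything left over is an error.
    leftover = ops.residual(app)
    if leftover:
        raise TeardownError(f"{app}: residual resources: {sorted(leftover)}")
```

Keeping `.env` until after volume and secret removal matters because a run killed mid-teardown still leaves the app's config behind, which lets a later janitor pass find the app and finish the job.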
## Notes

- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
  1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
  nixpkgs fetch exhausted). Both byte + inode pressure gone.
- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
  rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
  a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
  (scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
  up; clean up later (pick networkd OR scripting). Tracked, non-blocking.