git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and docs/install.md -> machine-docs/... Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# BACKLOG — cc-ci

Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adversary edits only
`## Adversary findings`. Closing an item = checking the box in your own section.

## Build backlog

### M0 — Foundations
- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
- [x] sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml;
      decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
- [x] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
      → CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)

### M1 — Swarm + abra target
- [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy`
      overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix):
      wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV
      empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert
      served, 0 ACME log lines.
- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS
      (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
- [x] Gate: M1 — recipe reachable over HTTPS at `*.ci.commoninternet.net`, torn down clean →
      CLAIMED 2026-05-26, awaiting Adversary.

### M2 — Drone online
- [x] Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app.
      Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
- [x] hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone +
      hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
- [x] Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary.
      OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2).

### M3 — Comment bridge
- [x] comment-bridge service: polling PRIMARY (read-only, ≤30s) + optional admin webhook; !testme
      exact match; org-membership auth (`GET /orgs/{owner}/members/{user}` 204) + allowlist; Drone API
- [x] PR comment posting with run link
- [x] Gate: M3 — live demo on scratch PR; auth enforced → CLAIMED 2026-05-27. Posted `!testme` on
      PR #1 → poll fired in 6s → Drone build #26 for head d397720a → bridge commented run link back.
      Org-membership auth verified (bot/trav/notplants 204, non-member 404 at read level).

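The trigger gate in the M3 items can be sketched as a pure decision function. This is a minimal illustration, not the bridge's actual code: `should_trigger`, the injected `membership_status` callable and the reading of "+ allowlist" as "allowlisted users pass without the membership lookup" are all assumptions. Only the `!testme` exact match and the 204 = member convention come from the items above.

```python
from typing import Callable

TRIGGER = "!testme"


def should_trigger(
    comment_body: str,
    user: str,
    owner: str,
    membership_status: Callable[[str, str], int],
    allowlist: frozenset = frozenset(),
) -> bool:
    """Decide whether a PR comment may start a CI run.

    membership_status(owner, user) is a hypothetical callable returning the
    HTTP status of GET /orgs/{owner}/members/{user} (204 = member).
    """
    # Exact match only: "!testme please" must not trigger a run.
    # Stripping surrounding whitespace is an assumption of this sketch.
    if comment_body.strip() != TRIGGER:
        return False
    # Assumption: the allowlist is an alternative to org membership.
    if user in allowlist:
        return True
    return membership_status(owner, user) == 204
```

Injecting the membership lookup keeps the decision testable without a live Gitea instance.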
### Bridge→Drone→harness integration (connects M3 trigger to M4/M5 recipe CI; blocks D2/D10 via !testme)
- [x] Add a recipe-CI pipeline to `.drone.yml` keyed on `event=custom`: runs
      `cc-ci-run runner/run_recipe_ci.py` STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`,
      `concurrency:{limit:1}`, `HOME=/root`. Self-test pipeline now `event=push`. (commits 9d51cb6+)
- [x] Verify a recipe build runs the full 3-stage CI through Drone (not self-test): **build #33 →
      success**, install/upgrade/backup all green, clean teardown (0 orphans).
      HOME + backup `-C -o` + clean-reclone fixes applied.
- [ ] Full single-comment E2E: enroll a recipe in the bridge `POLL_REPOS` + open a recipe PR →
      `!testme` → full 3-stage CI + PR comment outcome (folds into M6.5/M10 breadth).

### M4 — Harness + install stage
- [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env
      (cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
- [x] Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary.
      Repro: `cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py`
      → 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.

### M5 — Upgrade + backup/restore stages
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
      reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
      Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.

### M6 — Recipe-local tests + second recipe
- [x] D4 recipe-local discovery: recipe-shipped tests/ snapshotted post-fetch + run against the live
      app as a `recipe-local` stage (contract CCCI_BASE_URL/CCCI_APP_DOMAIN). Demo'd via mirror branch
      recipe-maintainers/custom-html@ci/d4-recipe-local → recipe-local test PASSED against live app.
- [x] Enroll DB-backed recipe #2 (keycloak + mariadb) via per-recipe tests/keycloak/ only (no harness
      surgery): install green (realm health + Playwright admin login). docs/enroll-recipe.md written.
- [x] Gate: M6 — both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged →
      CLAIMED 2026-05-27. keycloak full 3-stage (DB data survival) folds into the M6.5 breadth ramp.

### M6.5 — Breadth ramp (recipes 3→6)
- [x] keycloak (SSO/DB-backed, recipe #2) full 3-stage green through the Drone recipe-ci pipeline:
      build #39 success (~31m): install 2✓ (realm health + Playwright admin login), upgrade 1✓
      (`test_upgrade_preserves_realm` — DB data survives), backup 1✓ (`test_backup_mutate_restore`).
      Clean teardown (0 keyc services/volumes). Proves DB-backed data survival + integration path.
- [x] cryptpad (stateful/no-DB, recipe #3) full 3-stage green on host (cc-ci-run): install 2✓
      (http + Playwright), upgrade 1✓ (marker in cryptpad_data survives), backup 1✓
      (`test_backup_mutate_restore`). No harness surgery — added generic per-recipe EXTRA_ENV
      (handles cryptpad's SANDBOX_DOMAIN). Fixed a real backup bug en route: set_env glued
      RESTIC_REPOSITORY onto a comment → backupbot had no restic repo (now newline-safe). Drone
      canonical run = **build #46 success** (~6m, all 3 stages green, clean teardown).
- [x] matrix-synapse (DB+media/large-volume, recipe #4) full 3-stage green on host: install 2✓
      (client API + versions JSON), upgrade 1✓ (postgres marker survives), backup 1✓ — exercises the
      recipe's pg_backup.sh DB-dump hook (not a plain volume copy). No harness surgery. Drone
      canonical run = **build #51 success** (~10.5m, all 3 stages green, clean teardown).
- [x] lasuite-docs (multi-service + S3/MinIO, recipe #5) full 3-stage green on host: install 2✓
      (9-service stack converges + SPA + Playwright), upgrade 1✓ (postgres marker survives), backup
      1✓ (pg_backup.sh hook). Fixed deploy timeout (cold-pull of ~9 images > abra 300s) via
      TIMEOUT=900 EXTRA_ENV; OIDC config-only so starts healthy w/ placeholder. Drone canonical run
      = **build #57 success** (all 3 stages green, clean teardown).
- [x] n8n (workflow automation, recipe #6 — bluesky-pds swapped out per DECISIONS) full 3-stage
      green on host: install 2✓ (/healthz + Playwright editor), upgrade 1✓ (marker in /home/node/.n8n
      survives), backup 1✓ (backupbot.backup.path file backup). Drone canonical run = **build #63
      success** (~5.5m, all 3 stages green, clean teardown).
- [ ] Re-verify keycloak backup post set_env fix (build #39 ran off an earlier backupbot deploy)
- [x] Gate: M6.5 — recipes 3–6 three-stage green → **CLAIMED 2026-05-27**. All 6 D10 recipes have a
      full 3-stage green run (host + canonical Drone): custom-html, keycloak(#39), cryptpad(#46),
      matrix-synapse(#51), lasuite-docs(#57), n8n(#63). All 5 categories covered; D5 no-harness-surgery
      held (per-recipe `tests/<recipe>/` + recipe_meta EXTRA_ENV only). Awaiting Adversary.

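The `set_env` bug noted under cryptpad (a new assignment glued onto a trailing comment because the file lacked a final newline) can be illustrated with a small pure-string sketch. This is not the harness's actual `set_env`; the signature and the `/backups/restic` value in the usage note are hypothetical.

```python
def set_env(text: str, key: str, value: str) -> str:
    """Return env-file text with KEY=value set exactly once.

    Working on split lines (rather than appending to the raw string) means a
    file that lacks a trailing newline cannot glue the new assignment onto
    its last line, e.g. onto a trailing comment.
    """
    assignment = f"{key}={value}"
    out = []
    replaced = False
    for line in text.splitlines():
        if line.startswith(f"{key}="):
            # Replace an existing assignment in place.
            out.append(assignment)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(assignment)
    # Always terminate the file with exactly one newline.
    return "\n".join(out) + "\n"
```

For example, `set_env("FOO=1\n# trailing comment", "RESTIC_REPOSITORY", "/backups/restic")` puts the assignment on its own line after the comment instead of appending it to the comment.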
### M7 — Secrets hardening (D6)
- [x] Full sops model + rotation doc (docs/secrets.md: 3 classes, decryption chain, rotation per
      class) + log redaction filter (run_recipe_ci masks /run/secrets/* values in stage output,
      live-streaming preserved). Adversary leak scans clean (baseline + recipe-CI logs).
- [x] Gate: M7 — secret-grep finds nothing → **CLAIMED 2026-05-27**. No-plaintext: harness never
      prints secrets, abra doesn't echo generated ones, reconciles redirect secret-gen to /dev/null,
      dashboard shows status only; redaction filter as belt-and-suspenders. Awaiting Adversary
      (re-grep published logs + dashboard; optionally follow a rotation procedure).

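A log redaction filter of the kind described in the first M7 item could look roughly like the sketch below. The function names, the `[REDACTED]` mask and the minimum-length cutoff are assumptions for illustration; the source only states that values under `/run/secrets/*` are masked in stage output while live streaming is preserved.

```python
from pathlib import Path


def load_secret_values(secrets_dir, min_len: int = 4) -> list:
    """Read every file in secrets_dir and return its stripped contents.

    Very short values are skipped (an assumption of this sketch) so that
    masking does not shred ordinary log text.
    """
    values = []
    for path in sorted(Path(secrets_dir).iterdir()):
        if path.is_file():
            value = path.read_text(errors="replace").strip()
            if len(value) >= min_len:
                values.append(value)
    return values


def redact(line: str, secret_values, mask: str = "[REDACTED]") -> str:
    """Mask every occurrence of a known secret value in one output line."""
    # Longest first, so a secret that contains another secret is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        line = line.replace(value, mask)
    return line
```

Because `redact` works one line at a time, it can be applied inside a streaming loop over a subprocess's output without buffering the whole log.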
### M8 — Dashboard (D7)
- [x] Overview page + badges: dashboard/dashboard.py + modules/dashboard.nix — live at
      ci.commoninternet.net/, lists the 6 recipes w/ pass/fail/running badges + run links, plus
      `/badge/<recipe>.svg`. Verified via gateway; /hook still routes to bridge. (content-hash image
      tag so the swarm service rolls on code change.)
- [x] PR-comment outcome reflection: bridge watcher polls the Drone build to completion + edits its
      run comment to ✅ passed / ❌ `<status>` (Gitea PATCH). Verified: fresh !testme on PR #1 → comment
      edited to "❌ failure → …/76" within ~20s.
- [x] [idea] gave the bridge image a content-hash tag (fixed latent `:latest` no-roll issue)
- [x] Gate: M8 — overview matches reality; outcomes mirrored → **CLAIMED 2026-05-27**. Dashboard
      overview lists the 6 recipes w/ correct status badges (live, gateway-verified); PR comments link
      back AND reflect final pass/fail. Awaiting Adversary.

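The content-hash image tag mentioned above exists so that the tag changes whenever the code changes, which a fixed `:latest` tag cannot do. A minimal sketch of deriving such a tag follows; in the repo this presumably happens on the Nix side, so `content_hash_tag`, SHA-256 and the 12-character length are all assumptions.

```python
import hashlib
from pathlib import Path


def content_hash_tag(paths, length: int = 12) -> str:
    """Derive a short image tag from the contents of the given source files.

    The same files always give the same tag; any edit gives a new tag, so a
    service pinned to image:<tag> is redeployed when the code changes.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        # Hash the file name as well, so a rename also changes the tag.
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```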
### M9 — Reproducibility + docs (D8/D9)
- [x] D9 docs complete: README + docs/{install,enroll-recipe,secrets,architecture,runbook,baseline}.
      Covers architecture, enroll a recipe, add/run tests locally, operate/rotate secrets, debug a
      failed run. install.md = from-scratch path (clone + nixos-rebuild + operator preconditions).
- [ ] Gate: M9 — Adversary rebuilds from docs on throwaway host (D8) — Adversary action; install.md
      ready. (Note: a from-scratch rebuild pulls images → needs the registry creds / quota too.)

### M10 — Proof (D10)
- [x] **All 6 recipes green via REAL !testme PRs** (full 3-stage install/upgrade/backup,
      comment-reflected ✅, clean teardown): custom-html #84, keycloak #86, matrix-synapse #87,
      n8n #89, cryptpad #90, **lasuite-docs #108**. All 5 D10 categories covered.
- [x] lasuite-docs (6th, object-storage/S3) unblocked: quota reset + `abra app upgrade -c` fix
      (abra false-failed a converging rolling upgrade) → #108 all 3 stages green.
- [x] Gate: M10 — six recipes green via !testme → **CLAIMED 2026-05-27**, awaiting Adversary D10
      verification.
- [ ] DONE: write `## DONE` only once REVIEW shows <24h PASS for ALL D1–D10 + no VETO (Adversary).

## Adversary findings
<!-- Adversary-only section. Builder must not edit below this line. -->

- [x] **[adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard).**
      **CLOSED @2026-05-27T00:35Z** by Adversary re-test. `runner/harness/lifecycle.deploy_app`
      calls `abra.env_set(domain, "LETS_ENCRYPT_ENV", "")` before every deploy. Verified on a live
      harness app (`cust-c95a69`): env `LETS_ENCRYPT_ENV=` empty, no `certresolver` label, **0 ACME
      log lines**, and the served cert is the **wildcard** `CN=*.ci.commoninternet.net` (verify ok)
      — not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders
      — dropping the unused `certificatesResolvers` from traefik — remains a nice-to-have, tracked
      under A3/M7, not required to close A1.)

- [x] **[adversary] A2 — Janitor never reaps current-scheme orphans (dead `-pr` filter).**
      **CLOSED @2026-05-27T10:45Z** by Adversary live re-test of the fix. Deployed a synthetic
      env-less orphan `advx-bbbbbb_ci_commoninternet_net` (docker stack, no `.env` — the case the old
      `-pr` filter AND abra-ls both miss). (1) `janitor()` at the default 2h age gate **spared** it
      (fresh) — concurrent runs protected. (2) `janitor(max_age_seconds=0)` **reaped** it fully
      (services 1→0, volumes 1→0) via the service-name reconstruction regex + docker-fallback
      teardown. Janitor now matches the real `<tag>-<6hex>` scheme and reaps even `.env`-gone orphans.
      Original finding below.
      Found during M4 review. `harness.lifecycle.janitor()` only tears down apps where
      `"-pr" in name`, but per DECISIONS the harness now names apps `<recipe[:4]>-<6hex>` (e.g.
      `cust-c95a69`) — **no `-pr` substring**. So the run-start crash-recovery sweep (§4.3: "nuke
      any orphaned `*-pr*` apps") matches **nothing** and is effectively a no-op. The happy-path
      finalizer in `conftest.deployed_app` does work (observed: `cust-e084bd` from a prior run was
      torn down), but a run that crashes/reboots *before* the finalizer runs leaves an orphan that
      no later run will reap. *Fix:* match the actual naming (e.g. regex `^[a-z]{1,4}-[0-9a-f]{6}\.`
      or a dedicated CI label/prefix) and gate on age. *Re-test:* deploy a harness app, simulate a
      crash (kill the run before teardown), then start a new run and confirm janitor reaps the
      orphan. Adversary closes after re-test.
      **Re-test progress @2026-05-27T05:00Z (fix b7a2d70):** the reaping *mechanism* is verified —
      janitor now matches the real naming via `RUN_APP_RE` (`^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci…`,
      matches `cust-c95a69`) AND reconstructs `.env`-gone orphans from orphaned *service* names
      (regex matches my synthetic `advx-aaaaaa_ci_commoninternet_net_app`), with an age gate to spare
      concurrent runs, then reaps via `teardown_app` (verified clean under A3). **Still pending:** one
      live `janitor()` end-to-end sweep — needs `CCCI_JANITOR_MAX_AGE=0`, which would also reap the
      Builder's live apps, so it must run on an **idle host**. Will close then.

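The matching logic behind the A2 fix can be sketched as follows. The patterns here are illustrative stand-ins, not the harness's actual `RUN_APP_RE` (whose tail is elided above): the domain suffix is taken from the app names quoted in this finding, and `domain_from_service` and `should_reap` are hypothetical helper names.

```python
import re

# Assumed stand-in for RUN_APP_RE: <tag of 1-4 chars>-<6 hex> under the CI domain.
APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")

# Swarm service names use underscores where the app domain uses dots.
SERVICE_RE = re.compile(
    r"^([a-z0-9]{1,4}-[0-9a-f]{6})_ci_commoninternet_net_[A-Za-z0-9_.-]+$"
)


def domain_from_service(service_name: str):
    """Rebuild the app domain from an orphaned service name, or return None."""
    match = SERVICE_RE.match(service_name)
    return f"{match.group(1)}.ci.commoninternet.net" if match else None


def should_reap(domain: str, age_seconds: float, max_age_seconds: float = 7200) -> bool:
    """Reap only CI-named apps at least max_age_seconds old (default 2h).

    The age gate spares apps that belong to a run still in progress.
    """
    return bool(APP_RE.match(domain)) and age_seconds >= max_age_seconds
```

With `max_age_seconds=0` every CI-named app qualifies, which is why the live sweep described above had to run on an idle host.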
- [x] **[adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans + run stays green.**
      **CLOSED @2026-05-27T05:00Z** by Adversary re-test of the Builder's fix (commit b7a2d70).
      `teardown_app` now: `undeploy` → if the service persists, `docker stack rm` **fallback** (needs
      no `.env`) → remove volumes/secrets *by stack name* (retry loop) → drop `.env` LAST → **verify**
      `_residual()` and raise `TeardownError` if anything remains. Empirical worst-case test: I
      `docker stack deploy`-ed a synthetic orphan `advx-aaaaaa_ci_commoninternet_net` (service +
      volume + network, **no `.env`** — exactly the crash-orphan that defeated the old code), then
      called `lifecycle.teardown_app("advx-aaaaaa.ci.commoninternet.net")` → returned OK (verify
      passed) and afterwards services/volumes/networks = **0**. So a `.env`-less orphan is fully
      reaped and teardown is now verified (would raise on residual). Original finding below.
      Found during M4 review (to confirm empirically with a kill-mid-run probe). `lifecycle.teardown_app`
      runs every abra call with `check=False` and "never raises"; the conftest finalizer never
      asserts teardown succeeded. Worse, `abra.app_config_remove` deletes the app `.env`
      **unconditionally**, even if `abra.undeploy` failed first — leaving the swarm service+volume
      running but with no `.env`, so the app can no longer be managed/undeployed via abra (and a
      fixed janitor that shells `abra app undeploy` couldn't reap it either). Net: a partial teardown
      leaves a silent orphan while pytest still reports the run **green**, so the M4/D2 guarantee
      "no orphaned app/volume afterward" is not actually *verified* by the harness. *Fix:* assert
      post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only
      remove the `.env` after a confirmed undeploy, or undeploy-by-stack-name as a fallback that
      doesn't need the `.env`. *Re-test:* run install, kill the process mid-deploy, verify the next
      run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.

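The ordering described in the A3 closing note (undeploy, stack-level fallback, cleanup by stack name, `.env` last, then verify) can be sketched against an injected backend. This is an illustration of the control flow only: `ops` and its method names are hypothetical, and the real `teardown_app` shells out to abra and docker rather than taking such an object.

```python
class TeardownError(RuntimeError):
    """Raised when resources still exist after teardown."""


def teardown_app(domain: str, ops, retries: int = 3) -> None:
    """Tear down one app and verify that nothing is left behind.

    ops is a hypothetical backend exposing undeploy(domain),
    service_exists(stack), stack_rm(stack), remove_volumes_and_secrets(stack),
    remove_env(domain) and residual(stack) -> list of leftover resources.
    """
    # The stack name mirrors the domain with dots replaced by underscores.
    stack = domain.replace(".", "_")

    ops.undeploy(domain)                # normal path, needs the app .env
    if ops.service_exists(stack):
        ops.stack_rm(stack)             # fallback that needs no .env

    for _ in range(retries):            # volumes can lag behind service removal
        ops.remove_volumes_and_secrets(stack)
        if not ops.residual(stack):
            break

    ops.remove_env(domain)              # drop the .env last

    leftover = ops.residual(stack)
    if leftover:
        raise TeardownError(f"{domain}: residual resources {leftover}")
```

Raising instead of returning quietly is what turns a partial teardown into a failed run rather than a green run with a hidden orphan.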
- [x] **[adversary] A4 — Concurrent same-recipe runs collide on the shared recipe checkout.**
      **CLOSED @2026-05-27T03:13Z — mitigated by the runtime concurrency cap.** The Builder's
      resource-safety change sets `DRONE_RUNNER_CAPACITY=1` (verified live: runner logs `capacity=1`)
      + the recipe-CI pipeline has `concurrency:limit:1`, so recipe-CI builds **serialize** — two
      runs never overlap, hence the shared `~/.abra/recipes/<recipe>` checkout collision cannot
      occur via the production trigger path. The §6 "two concurrent runs don't collide" guarantee
      holds by serialization (an explicitly endorsed design per plan §4.2). **Latent caveat:** the
      checkout is still *not* per-run isolated, so raising `DRONE_RUNNER_CAPACITY`>1 (the module
      comments allow it) would reintroduce the collision — fix the per-run abra home/checkout before
      ever doing so. (A positive "two triggers serialize & both complete" check folds into the M10
      concurrency verification.)
      Found by review (M6 verify); to confirm empirically. Per-run isolation is correct for the app
      **domain/volume/secret** (hashed `<recipe[:4]>-<6hex(recipe|pr|ref)>`), but the recipe *source
      checkout* is a single shared path `~/.abra/recipes/<recipe>`: `run_recipe_ci.fetch_recipe`
      does `rm -rf ~/.abra/recipes/<recipe>` then `git clone`+`checkout <ref>`, and abra itself
      re-checks-out the recipe to a version tag mid-deploy. There is **no per-run abra home
      (`ABRA_DIR`/`HOME`), no lock, and no Drone concurrency cap** (runner capacity=2). So two
      concurrent runs of the **same recipe at different refs** (e.g. `!testme` on two PRs of one
      recipe) race on that dir — one can deploy/test the other's code, or fail mid-fetch. (Benign
      when both want identical content, which is why an earlier accidental same-recipe overlap
      didn't visibly break — masking the bug.) This weakens the §6 "two concurrent runs don't
      collide" guarantee and matters for D10 (6 recipes via real PRs). *Repro:* start two runs of
      one recipe with different REFs simultaneously; check each deploys its own ref's code (add a
      per-ref marker) and neither errors mid-fetch. *Fix:* per-run abra home/recipe dir (e.g.
      `ABRA_DIR=$(mktemp -d)` or `~/.abra-runs/<app>`), or a per-recipe lock, or cap Drone to
      serialize same-recipe builds. Adversary confirms + closes after re-test.
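The per-run naming scheme `<recipe[:4]>-<6hex(recipe|pr|ref)>` and the per-run checkout directory proposed in the A4 fix can be sketched together. The choice of SHA-256 and both function names are assumptions; the finding only specifies six hex characters derived from `recipe|pr|ref` and suggests `~/.abra-runs/<app>` as one option.

```python
import hashlib
from pathlib import Path


def run_app_name(recipe: str, pr, ref: str) -> str:
    """Per-run app name: first 4 chars of the recipe plus 6 hex of a hash.

    SHA-256 is an assumption; any stable hash of "recipe|pr|ref" would do.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}"


def run_abra_dir(app_name: str, base: Path) -> Path:
    """Per-run abra directory under base, mirroring the ~/.abra-runs/<app> idea."""
    return Path(base) / app_name
```

Keying the checkout directory on the same name as the app would give two runs of one recipe at different refs separate directories, so neither could overwrite the other's checkout if the runner capacity were ever raised above 1.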