
autonomic-bot 992d87cfcd refactor(1b): RL6 — move Builder protocol files into machine-docs/ (README stays root)

git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator
decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and
docs/install.md -> machine-docs/...

Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by
every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update, so
it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own REVIEW*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 22:35:30 +01:00

23 KiB


DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

Settled

Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8 — fixed.)
Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)
Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time — no secret values stored in .git/config or commits.
Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box:
- WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates).
- LETS_ENCRYPT_ENV= empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
- Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
- Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
- abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot — no run-once sentinel — so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.
- Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
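The inspect → converge → no-op loop described above can be modelled in a few lines. This is only a Python model of the idea; the real units embed shell scripts via pkgs.writeShellApplication. The names `reconcile`, `PreconditionError`, and `preconditions_ok` are hypothetical.

```python
from typing import Callable, Dict, List


class PreconditionError(RuntimeError):
    """Raised when an operator-supplied precondition (e.g. the cert) is missing."""


def reconcile(
    desired: Dict[str, str],
    inspect: Callable[[], Dict[str, str]],
    converge: Callable[[str, str], None],
    preconditions_ok: Callable[[], bool] = lambda: True,
) -> List[str]:
    """Bring actual state to the desired state; return the names that changed.

    An empty return value means the run was a no-op. There is no run-once
    sentinel: every call inspects the live state again, so drift is repaired.
    """
    if not preconditions_ok():
        # Fail visibly instead of deploying a half-configured stack.
        raise PreconditionError("precondition missing")
    actual = inspect()
    changed = []
    for name, want in desired.items():
        if actual.get(name) != want:
            converge(name, want)
            changed.append(name)
    return changed
```

Running it twice against the same state shows the intended behaviour: the first call converges, the second does nothing, and removing an item from the live state causes only that item to be restored.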
Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer). Hard constraint: the bot/server runs at READ level, never repo-admin, and never self-registers a webhook.
- Polling is PRIMARY and the source of truth for D1. The bridge polls each enrolled repo's open PRs for new !testme comments every POLL_INTERVAL (30s ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
- Webhook is an OPTIONAL push optimization. The /hook endpoint stays live (HMAC-verified) so an admin-registered issue_comment webhook lowers latency, but the bridge never registers one. Manual registration is documented in docs/enroll-recipe.md. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
- Commenter authorization via org membership (read-level, no admin). Allowed iff GET /orgs/{owner}/members/{user} → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) or the user is in the optional AUTH_ALLOWLIST. Replaces the earlier /collaborators/{user}/permission check, which needs repo-admin. Fail-closed on any error.
- Enrollment = add the repo to the bridge POLL_REPOS csv + ensure tests/<recipe>/ exists. No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer matters, because polling makes it irrelevant: the operator was whitelisting ci.commoninternet.net in Gitea's ALLOWED_HOST_LIST, but D1 no longer depends on that.)
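The shared seen-set behaviour (first poll only records, later polls and webhook deliveries fire at most once per comment id) can be sketched as follows. The class `SeenSet`, the helper `is_trigger`, and the exact matching rule for `!testme` are assumptions of this sketch, not the bridge's actual code.

```python
from typing import Iterable, List, Set, Tuple

Comment = Tuple[int, str]  # (comment id, comment body)


def is_trigger(body: str) -> bool:
    """Assumed rule: the comment contains the bare token !testme."""
    return "!testme" in body.split()


class SeenSet:
    """In-memory dedupe shared by the polling path and the /hook path."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self._primed = False

    def poll(self, comments: Iterable[Comment]) -> List[int]:
        """Return ids of newly seen trigger comments.

        The first call after startup only records ids, so comments that
        existed before the bridge started never fire.
        """
        fresh = []
        for comment_id, body in comments:
            if comment_id in self._seen:
                continue
            self._seen.add(comment_id)
            if self._primed and is_trigger(body):
                fresh.append(comment_id)
        self._primed = True
        return fresh

    def hook(self, comment_id: int, body: str) -> bool:
        """Handle one webhook delivery; True means a build should be triggered."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return is_trigger(body)
```

Because the set lives in memory, a restart loses it; the first-poll priming step is what keeps a restart from re-firing old comments.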
Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3). Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
- MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix, maxTests let-binding). Drone runs at most MAX_TESTS builds at once and auto-queues the rest in its native pending queue — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
- Per-build TIMEOUT = 60 min (modules/drone.nix, buildTimeoutMinutes; reconciled best-effort via PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60} using the bridge's Drone admin token, local --resolve, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
- Teardown + janitor backstop. Each build deploys → runs the 3 stages → undeploys (guaranteed try/finally in conftest/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the run-start janitor (lifecycle.janitor, called before every deploy in both fixtures + run_recipe_ci) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set CCCI_JANITOR_MAX_AGE=0 (reap any orphan immediately — safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
- Optional concurrency: {limit: 1} in the recipe-CI .drone.yml is a redundant belt — primary mechanism is DRONE_RUNNER_CAPACITY. (Wired when the recipe-CI pipeline lands — see backlog.)
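The relationship between runner capacity and the janitor's age threshold can be made concrete with a small sketch. It assumes the threshold is expressed in seconds and that the caller supplies a start timestamp for each run app; how lifecycle.janitor actually discovers apps and their ages is not stated here, and the function names are hypothetical.

```python
from typing import Dict, List

DEFAULT_MAX_AGE_SECONDS = 2 * 60 * 60  # the 2h default mentioned above


def janitor_max_age(runner_capacity: int) -> int:
    """Pick the reap threshold for a given runner capacity.

    With capacity 1 there is never a concurrent run, so any leftover app is
    an orphan and can be reaped immediately. With capacity > 1 the janitor
    must stay age-based to avoid reaping a live concurrent run.
    """
    return 0 if runner_capacity == 1 else DEFAULT_MAX_AGE_SECONDS


def orphans_to_reap(started_at: Dict[str, float], now: float, max_age: float) -> List[str]:
    """Return the run apps whose age is at least max_age, sorted by name."""
    return sorted(app for app, started in started_at.items() if now - started >= max_age)
```

For example, with two leftover apps started 7200 s and 200 s ago, capacity 1 reaps both, while capacity 2 reaps only the older one.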
D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n, SETTLED (2026-05-27, plan §4.0 sanctions this swap-with-reason). bluesky-pds routes via a Traefik TCP router with tls.passthrough=true to an in-container caddy that terminates TLS itself and obtains its own cert via ACME. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through to cc-ci's Traefik, which terminates it with the pre-issued static wildcard cert, and ACME is hard-forbidden for commoninternet.net (no DNS token on the box; §4.0/§9). Serving bluesky-pds would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke proxy mode, not a clean shared-harness absorb). This is a genuine design conflict, not a harness gap. Per the plan's explicit allowance, bluesky-pds is a documented non-CI'd recipe (reason here), and n8n takes the 6th slot. The 5 required D10 categories are already covered by recipes 1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.
Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27). D10 real-!testme breadth runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage: toomanyrequests). Two lessons: (1) registry pull creds are an A1 operator input needed for reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2) Don't docker image prune -af mid-breadth — it evicts cached recipe images and forces re-pulls that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only dangling (not -a) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).

Open (defaults from §8, to confirm as reality lands)

Deploy mechanism — SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
- nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
Webhook scope: default per-repo via enroll script.
CI engine: Drone (per plan) — kept, with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19) — the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it — like the traefik pivot. Re-evaluate at the M2 gate.
Drone deployment shape — SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.
- Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
Drone runner type: exec (must drive host abra).
Secret tool — SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
Per-run app domain scheme: adapted (M4, deviates from plan §4.0). Plan §4.0 wanted <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, but Docker swarm config/secret names (<stackname>_<resource>_<version>) must be ≤ 64 chars and abra derives <stackname> from the domain (dots→_, hyphens kept). .ci.commoninternet.net alone is 22 chars, so long recipe names + config names overflow 64 (hit with custom-html-pr0-m4demo…_nginx_default_conf_v6 = 66). New scheme: <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net (e.g. cust-e084bd): short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
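A minimal sketch of the short domain scheme and the 64-character budget it protects is shown below. The hash function is not specified above, so SHA-256 is an assumption here and this sketch is not expected to reproduce the example id; the function names are hypothetical.

```python
import hashlib

BASE_DOMAIN = "ci.commoninternet.net"
SWARM_NAME_LIMIT = 64  # Docker swarm config/secret name limit cited above


def run_domain(recipe: str, pr: int, ref: str) -> str:
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)>.<base> for one CI run.

    The full recipe name goes into the hash, so two recipes sharing a
    four-character prefix still get different ids. SHA-256 is an assumption.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode("utf-8")).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """Model abra's derivation as described above: dots become _, hyphens stay."""
    return domain.replace(".", "_")


def resource_name(domain: str, resource: str, version: str) -> str:
    """Compose <stackname>_<resource>_<version> as swarm would see it."""
    return f"{stack_name(domain)}_{resource}_{version}"


def fits_swarm_limit(domain: str, resource: str, version: str) -> bool:
    """Check the composed name against the 64-character limit."""
    return len(resource_name(domain, resource, version)) <= SWARM_NAME_LIMIT
```

With this scheme the domain label is at most 11 characters, so the stack name is at most 33 characters and leaves 31 for the `_<resource>_<version>` suffix.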
abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6). Many abra commands (app ls, secret generate without flags, version resolution) silently git checkout <version-tag> in ~/.abra/recipes/<recipe>, discarding a PR branch's files. To test the PR head code (not a re-resolved tag): (1) fetch_recipe clones the mirror branch/ref (private → bot token via per-command http.extraHeader, never persisted/logged); (2) all harness abra calls that touch the recipe pass -C (chaos: use current checkout) -o (offline: no remote fetch); (3) recipe-shipped tests/ (D4) are snapshotted to a temp dir right after fetch, since later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

Risks

Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic docker image prune to avoid regressing during M6.5 breadth.

Dead-ends

(none yet)

Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27

Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default). cc-ci-secrets is mounted as a submodule at cc-ci/secrets/ rather than a flake inputs.secrets. Rationale: a private flake input must be re-fetched at every nix eval, requiring the bot token persistently in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band secret, which 1c forbids). A submodule makes secrets/secrets.yaml a plain path in the working tree → defaultSopsFile = ../secrets/secrets.yaml is unchanged (minimal diff, trivially byte-identical), and the only credential use is the one git clone --recursive at provisioning ("the two repos are given", Mission §1). Build invocation becomes nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' so the submodule tree is included. (Revisit if ?submodules=1 proves unreliable on cc-ci's nix version.)
Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via sops.age.keyFile. The recovery key (age1cmk26…, private at /srv/cc-ci/.sops/master-age.txt) is already a sops recipient, so a fresh host with a different ssh host key still decrypts every secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding". Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting via its host key (age.sshKeyPaths); secrets.nix will offer both identity sources. (Per-host re-encrypt is cleaner for a permanent new instance — documented as the alternative, not used for the throwaway test.)
Cert into git: wildcard cert+key become sops secrets in cc-ci-secrets, decrypted at activation back to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} via sops.secrets.<name>.path; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM. The freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
C6 teardown OVERRIDE (operator, 2026-05-27): do NOT destroy the FINAL throwaway VM after W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).

Phase 1b — lint/format tooling (open decisions §6, settled W0)

Formatters/linters (RL1): Nix = nixpkgs-fmt (format) + statix (lints) + deadnix (dead code); Python = ruff (lint + format); Shell = shellcheck + shfmt -i 2 -ci; YAML = yamllint. Kept nixpkgs-fmt over alejandra because it was already the repo formatter and devshell tool (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake lint devshell (nix develop .#lint) so CI and local use byte-identical tool versions.
Lint entrypoint: scripts/lint.sh (check-only by default; --fix auto-applies). The .drone.yml push pipeline runs it via nix develop .#lint --command bash scripts/lint.sh.
ruff strictness: select = [E,F,W,I,UP,B,C4,SIM], ignore = [E501] (line length is the formatter's job; only un-splittable strings would trip it). line-length=100, target=py311.
Drone lint stage = FAIL (not warn). The codebase is green now, so enforce from here on — an unclean commit fails the lint step. (Resolves the §6 open question.)
Python type-checking (mypy/pyright): DEFERRED to IDEAS, not added in 1b. The harness is small and dynamically typed around abra/subprocess JSON; gradual typing is a larger effort than this bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
blocking vs advisory split (§3): treated as in the phase plan — tests-real, Nix-idempotent, no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift = advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
cc-ci self-CI push trigger: the lint stage lives in the event: push pipeline. The Gitea→Drone push webhook on this instance is flaky (last_status: None; documented §4.1) and predates 1b — recipe CI uses polling as primary, but cc-ci's own self-test/lint relies on the push webhook. The lint stage is correctly wired and proven green via the identical nix develop .#lint command; reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.

Phase 1b — repo layout (operator review items RL5/RL6, plan §7)

RL5 — all Nix code under nix/. Moved modules/→nix/modules/ and hosts/→nix/hosts/. flake.nix/flake.lock STAY at the repo root (entry point) so the build ref #cc-ci and nixos-rebuild --flake '…#cc-ci' are unchanged — only flake.nix's internal ./hosts/cc-ci/configuration.nix → ./nix/hosts/cc-ci/configuration.nix changed. Root-relative refs inside the moved modules were re-based ../X → ../../X (secrets.nix → ../../secrets/, bridge.nix → ../../bridge/, dashboard.nix → ../../dashboard/); configuration.nix's ../../modules/* imports are unchanged (both dirs moved under nix/, so the relative path still resolves). Toplevel is byte-identical (8i3jcad9…) before/after the move — store derivations are content-addressed on the copied file contents, and the module .nix files aren't part of the runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash change; in practice it's stable, which is even stronger for reproducibility.) Living docs (README, architecture/install/secrets/enroll) + the .drone.yml comment updated to nix/…; append-only history logs left as the record of what was true then.
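The path arithmetic behind the RL5 re-basing can be checked mechanically. The sketch below uses plain POSIX path normalisation as a stand-in for Nix's relative path resolution (an approximation, labelled as such); the helper name `resolve` is hypothetical.

```python
import posixpath


def resolve(containing_dir: str, relative: str) -> str:
    """Resolve a relative reference against the directory of the referring file."""
    return posixpath.normpath(posixpath.join(containing_dir, relative))


# (directory of the referring file, relative reference) before and after the move
cases = [
    ("modules", "../secrets"),           # before: a module pointing at the repo-root secrets/
    ("nix/modules", "../../secrets"),    # after: one level deeper, so one more ".."
    ("hosts/cc-ci", "../../modules"),    # before: configuration.nix imports
    ("nix/hosts/cc-ci", "../../modules"),  # after: same text, both dirs moved together
]

for containing_dir, relative in cases:
    print(f"{containing_dir}/{relative} -> {resolve(containing_dir, relative)}")
```

The first two cases resolve to the same root-level directory, which is why the module references needed an extra `..`, while the last case lands in nix/modules without any edit to configuration.nix.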
RL6 — protocol files → machine-docs/: DEFERRED to the coordinated end of 1b. Will git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md into machine-docs/ (README.md STAYS at root — operator decision, it's the human readme, not a protocol file). The live watchdog (launch.sh) reads STATUS-<id>.md/REVIEW-<id>.md at the repo root for handoffs/transition, so this is done LAST, in lockstep with the orchestrator updating launch.sh + restarting the watchdog — not unilaterally and not while a phase transition is pending. The Adversary likewise git mvs its own REVIEW files at the cutover (single-writer rule).

Phase 1b — recorded deviation: no `tests/_template/` dir (enroll = copy an existing recipe)

Plan §3's repo layout lists a tests/_template/ "copy-to-add-a-recipe" dir. It was never created (pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in docs/enroll-recipe.md is "copy an existing recipe's tree, e.g. tests/custom-html/…, then adjust recipe_meta.py + the per-recipe test files." This satisfies D5's "small, repeatable, documented operation with no harness surgery" the same way (a concrete recipe is a better starting template than an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
