cc-ci/JOURNAL.md
autonomic-bot 1c81279fda
M3 start: comment-bridge source (stdlib) + bridge secrets in sops
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:17:30 +01:00


JOURNAL — cc-ci Builder (append-only)

2026-05-26 — Bootstrap (§1)

Access verification (all pass):

  • ssh cc-ci 'hostname && whoami && nixos-version' → nixos / root / 24.11.719113.50ab793786d9 (Vicuna)
  • curl https://git.autonomic.zone/api/v1/version → {"version":"1.24.2"}
  • Gitea bot auth (curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user) → login: autonomic-bot, id 64
  • getent hosts probe-$RANDOM.ci.commoninternet.net → 143.244.213.108 (the gateway IP, as expected: TLS passthrough)
  • Cert present: ls /var/lib/ci-certs/live/fullchain.pem (2909 b), privkey.pem (227 b, mode 640)
  • recipe-maintainers org exists (private); recipe-maintainers/cc-ci → 404 (created below)
  • Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n, keycloak, lasuite-meet, matrix-synapse, cryptpad

Baseline (docs/baseline.md): fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk (3.8 GiB free). No docker/swarm/abra. Channel-based /etc/nixos/configuration.nix (no flake).

Actions:

  • Created repo recipe-maintainers/cc-ci (private) via Gitea API.
  • git init in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no secrets stored in git config).
  • Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.

Next: commit + push bootstrap, then M0 (flake + base config + sops test secret).

2026-05-26 — M0: flake + base config rebuilt from repo

Authored flake.nix (pins nixpkgs rev 50ab793786d9…, the exact rev cc-ci ran), hosts/cc-ci/hardware.nix (incus VM module + cloud-init + DHCP/nameservers) and hosts/cc-ci/configuration.nix (faithful baseline repro: tailscale w/ hardcoded --hostname=cc-nix-test since builtins.readFile /etc/ts-hostname is impure under flakes; sshd root; firewall trust tailscale0 + tcp/22; base pkgs).

Disk/inode hiccup → resolved: first nix flake lock/build hit No space left on device — diagnosed as inode exhaustion (df -i → 6005 free of 586336; old 8.9 GiB fs). Operator grew the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.
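
For reference, a minimal sketch of the inode check behind that diagnosis, assuming Python is available wherever it runs. The helper names and the 5% threshold are illustrative and not part of the repo; on the host itself the measurement was plain df -i.

```python
import os


def inode_headroom(total_inodes: int, free_inodes: int, min_free_ratio: float = 0.05) -> dict:
    """Classify inode headroom; min_free_ratio is an arbitrary illustrative threshold."""
    # A filesystem reporting 0 total inodes is treated as having no inode limit.
    ratio = free_inodes / total_inodes if total_inodes else 1.0
    return {
        "free": free_inodes,
        "total": total_inodes,
        "free_ratio": ratio,
        "exhausted": ratio < min_free_ratio,
    }


def inode_headroom_for(path: str = "/") -> dict:
    """Read inode counts for the filesystem holding `path` via statvfs."""
    st = os.statvfs(path)
    # f_files = total inodes, f_favail = inodes available to unprivileged users.
    return inode_headroom(st.f_files, st.f_favail)
```

With the figures above, inode_headroom(586336, 6005) reports roughly 1% of inodes free and flags the filesystem as exhausted, even though free space in bytes looked adequate.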

Build + switch (commands + output):

  • ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci' → BUILD EXIT 0, produced nixos-system-nixos-24.11.20250630.50ab793.
  • ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci' (detached so it survives ssh drop) → unit Result=success ExecMainStatus=0.

Gate verification:

  • systemctl is-system-running → running
  • readlink /run/current-system → …-nixos-system-nixos-24.11.20250630.50ab793 (gen 3, from flake)
  • systemctl is-active tailscaled → active; sshd.socket → active (sshd is socket-activated, so sshd.service reads inactive; live ssh proves it works)
  • systemctl --failed → none
  • nixos-rebuild list-generations → gen 3 current @20:23, prior channel gen 2 retained for rollback.

Known warning (tracked, non-blocking): incus module enables systemd.network while we keep networking.useDHCP=true (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from baseline; networking is up. Clean up by choosing one stack later.

Deploy mechanism settled (DECISIONS.md): switch --flake on-host, repo synced via tar | ssh.

Next: sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then CLAIM the M0 gate for the Adversary.

2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)

Keys:

  • Host age recipient from ssh host key: ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub' → age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa.
  • Master recovery key generated on host (age-keygen), public age1cmk26t…; private moved off-box to /srv/cc-ci/.sops/master-age.txt (mode 600) and shred-ded from the host. Never in repo.

Files: .sops.yaml (both recipients, rule secrets/.*\.(yaml|json|env)$); modules/secrets.nix (sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key], secrets.test_secret={}); flake gains sops-nix input + sops-nix.nixosModules.sops; configuration.nix imports the module.

sops-nix version pin (dead-end avoided): master sops-nix wants buildGo125Module (Go 1.25), absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to 77c423a… (2025-06-17, last using plain buildGoModule). Verified the file at that rev uses buildGoModule. Build then OK.

Encrypt test secret: on host, printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml then nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml (run inside repo so .sops.yaml resolves) → rc=0, two age recipients in the file.

Build + switch (commands + output):

  • nixos-rebuild build --flake .#cc-ci → BUILD EXIT 0 (built sops-install-secrets w/ Go 1.23.8).
  • systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci → Result=success ExecMainStatus=0.

Gate verification (M0):

  • systemctl is-system-running → running; systemctl --failed → none.
  • ls -la /run/secrets/test_secret → -r-------- 1 root root 41; stat → root:root 400.
  • head -c9 → cc-ci-m0- (matches generated value), wc -c → 41 (9 + 32 hex). Decrypt path proven.
  • Pulled encrypted secrets/secrets.yaml + flake.lock back to clone; grep cc-ci-m0 secrets.yaml → no plaintext leak; lock inputs = nixpkgs, sops-nix.
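
The 41-byte figure is just the 9-character prefix plus 32 hex characters. A small sketch of that shape check, assuming the random part is lowercase hex; the function names are hypothetical and not taken from the repo.

```python
import re
import secrets

PREFIX = "cc-ci-m0-"  # 9 characters, as in the journal entry
# Assumption: the random part is exactly 32 lowercase hex characters.
SECRET_RE = re.compile(r"cc-ci-m0-[0-9a-f]{32}")


def make_test_secret() -> str:
    """Build a value of the same shape: fixed prefix + 16 random bytes as hex."""
    return PREFIX + secrets.token_hex(16)


def looks_like_test_secret(value: str) -> bool:
    """True when value is exactly prefix + 32 hex chars (41 characters total)."""
    return SECRET_RE.fullmatch(value) is not None
```

A trailing newline would make the decrypted file 42 bytes and fail the check, which is why the value was written with printf and no newline.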

Gate handshake: set Gate: M0 — CLAIMED, awaiting Adversary in STATUS.md. REVIEW.md still empty (no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed with M1 (independent infra build), without advancing to M2 until M0 shows PASS.

Next: M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider → /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.

2026-05-26 — M1: Docker + single-node swarm via Nix

modules/swarm.nix: virtualisation.docker.enable + daily autoprune (--all --volumes until=24h to protect the 28 GiB root), docker in systemPackages, and a swarm-init oneshot (docker swarm init --advertise-addr 127.0.0.1 if not active; docker network create --driver overlay --attachable proxy if absent). Imported into configuration.nix.

Build + switch: nixos-rebuild build --flake .#cc-ci → EXIT 0; systemd-run … switch → Result=success.

Verify (commands + output):

  • systemctl show swarm-init -p Result → Result=success
  • docker info --format ... → Swarm=active Managers=1 Nodes=1
  • docker network ls --filter name=proxy → proxy overlay swarm
  • systemctl is-system-running → running; --failed → none.

Next: Traefik as a swarm stack (Nix-declared compose + docker stack deploy oneshot): docker swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443, attached to proxy. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate). Rationale for swarm-service Traefik over a host services.traefik: a host process isn't on the proxy overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-proxy Traefik watching swarm labels.

2026-05-26 — M1: Traefik swarm stack + HTTPS path proven

modules/traefik.nix: Traefik v3.3 as a swarm service on proxy (so it reaches recipe VIPs). Config via Nix writeText store files bind-mounted into the container (real files, not /etc symlinks): static traefik.yml (entrypoints web/websecure; providers.swarm unix socket, exposedByDefault=false, network=proxy; providers.file dir /etc/traefik/dynamic; ping; no dashboard) and dynamic certs.yml (wildcard at /var/lib/ci-certs/live/* as stores.default.defaultCertificate + certificates, so any *.ci.commoninternet.net router with tls=true is covered, no ACME). Deployed by a traefik-deploy oneshot (docker stack deploy) after swarm-init. Opened firewall 80/443 (gateway forwards over enp5s0).

Build + switch: build EXIT 0; switch Result=success; traefik-deploy Result=success; docker service ls → traefik_traefik traefik:v3.3 1/1.

Verify (commands + output):

  • Local: curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/ → subject: CN=*.ci.commoninternet.net, issuer: …Let's Encrypt; CN=E8, TLSv1.3, HTTP 404.
  • End-to-end via gateway: curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/ → Connected to …(143.244.213.108) port 443, same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination. 404 is correct (no router for that host yet).

Next: install abra (M1 last task), abra app new a trivial recipe (custom-html) → deploy → reach it over HTTPS at a .ci.commoninternet.net subdomain → teardown leaving no volumes. That completes M1 → CLAIM M1 gate.

2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)

Orchestrator decision (mid-M1): replace the hand-rolled Traefik with the canonical Co-op Cloud traefik recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom modules/traefik.nix; moved firewall 80/443 into modules/swarm.nix. Recorded in DECISIONS.md.

Why the pivot also fixed a real bug: my custom Traefik used entrypoint websecure; coop-cloud recipes label entrypoints=web-secure. While chasing that I also hit a sharp systemd-run gotcha: systemd-run … nixos-rebuild switch --flake .#cc-ci runs with cwd /, so .# resolves to / and the rebuild fails with "could not find a flake.nix"; the switch silently failed while a systemctl show issued after --collect returned a stale Result=success. Fix: always use the absolute flake path /root/cc-ci#cc-ci, and read the result before resetting. (rebuild6/7 had silently not applied; rebuild25 used the absolute path.)
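
One way to make that fix hard to regress is to build the command through a guard that rejects cwd-relative flake refs. A sketch only: switch_command is a hypothetical Builder-side helper, not something in the repo, and the flags are the ones already used above.

```python
import posixpath


def switch_command(flake_ref: str, unit: str) -> list[str]:
    """Build the detached rebuild argv, refusing cwd-relative flake refs."""
    path, sep, attr = flake_ref.partition("#")
    if not sep or not attr:
        raise ValueError(f"flake ref needs a '#<attr>' suffix: {flake_ref!r}")
    if not posixpath.isabs(path):
        # systemd-run starts the unit with cwd=/, so '.' would not be the repo.
        raise ValueError(f"flake path must be absolute, got {path!r}")
    return [
        "systemd-run", f"--unit={unit}", "--collect", "--property=Type=oneshot",
        "nixos-rebuild", "switch", "--flake", flake_ref,
    ]
```

switch_command("/root/cc-ci#cc-ci", "ccci-rebuild") returns the argv used for the working rebuilds, while ".#cc-ci" raises before anything is sent over ssh. The guard does not address the stale-result half of the problem; that still needs the unit result read before the unit is reset.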

abra packaged (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd. abra --version → 0.13.0-beta-06a57de.

scripts/deploy-proxy.sh (idempotent, pure-bash — host has no python3): ensure local abra server, fetch traefik, write wildcard/no-ACME env (WILDCARDS_ENABLED=1, SECRET_WILDCARD_*_VERSION=v1, COMPOSE_FILE=compose.yml:compose.wildcard.yml, LETS_ENCRYPT_ENV= empty), insert cert secrets via abra app secret insert … -f from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line PEM must use -f (not arg); secret-presence must check docker secret ls (abra's recipe list always shows the name with created on server:false).

Traefik deploy: abra app deploy → deploy succeeded 🟢 (traefik v3.6.15 + socket-proxy). Verify: docker service ls → app+socket-proxy 1/1; via gateway curl --resolve probe.*:443:143.244.213.108 → CN=*.ci.commoninternet.net (LE E8); 0 ACME log lines.

M1 gate (recipe over HTTPS + teardown):

  • abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n then set LETS_ENCRYPT_ENV= and abra app deploy -n -C → 🟢 (nginx 1.29.0).
  • curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/ → http_code=200 size=615, served the nginx welcome page over HTTPS with the wildcard cert.
  • Teardown: abra app undeploy -n → 🟢; abra app volume remove -f -n → "1 volumes removed"; leak check → services 0 / volumes 0 / secrets 0 / containers 0. Clean.
  • Correct teardown syntax confirmed: secret remove <d> --all -n (not --all-secrets).
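
The leak check can be sketched as a count of docker objects whose name matches the app's stack name. This is an assumption about how such a check could be written, not a record of the commands actually run. The stack-name rule (dots to underscores) is inferred from service names like drone_ci_commoninternet_net_app, and the helper names are hypothetical.

```python
import subprocess
from typing import Callable

Runner = Callable[[list[str]], str]


def run_docker(argv: list[str]) -> str:
    """Default runner: execute the command and return its stdout."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def stack_name(domain: str) -> str:
    """Stack name as it shows up in docker object names: dots become underscores."""
    return domain.replace(".", "_")


def leak_counts(domain: str, run: Runner = run_docker) -> dict[str, int]:
    """Count leftover docker objects whose name matches the app's stack name."""
    name_filter = f"name={stack_name(domain)}"
    commands = {
        "services": ["docker", "service", "ls", "-q", "--filter", name_filter],
        "volumes": ["docker", "volume", "ls", "-q", "--filter", name_filter],
        "secrets": ["docker", "secret", "ls", "-q", "--filter", name_filter],
        "containers": ["docker", "ps", "-aq", "--filter", name_filter],
    }
    # -q prints one ID or name per line, so counting tokens counts objects.
    return {kind: len(run(argv).split()) for kind, argv in commands.items()}


def is_clean(counts: dict[str, int]) -> bool:
    """True when nothing is left behind for the app."""
    return all(n == 0 for n in counts.values())
```

Filtering by name matters here because the traefik stack stays deployed, so an unfiltered count would never reach zero. The runner is injectable so the logic can be exercised without a docker daemon.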

docs/install.md seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.

Next: M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.

2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets

Decision (DECISIONS.md): keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the abandoned drone-runner-exec (unstable-2020) — accepted (stable RPC), Woodpecker is the documented fallback. Deploy shape mirrors traefik: server via coop-cloud drone recipe (abra, swarm, traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.

Recipe recon: coop-cloud drone recipe = drone/drone:2.26.0, secrets rpc_secret + CLIENT_SECRET (Gitea OAuth), Gitea SSO via compose.gitea.yml (GITEA_CLIENT_ID, GITEA_DOMAIN). Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.

Done this tick:

  • Created Gitea OAuth app cc-ci-drone (bot): client_id ab4cdb9d-…, redirect https://drone.ci.commoninternet.net/login.
  • Generated DRONE_RPC_SECRET (openssl-equivalent /dev/urandom hex32) + stored client_secret; both added to secrets/secrets.yaml via sops set (needed SOPS_AGE_KEY from the host ssh key: ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key). Verified: decrypt shows keys test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).

Next: scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets), modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).

2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

Orchestrator steer (2×): collapse install to a single nixos-rebuild switch — convert the manual deploy scripts into idempotent-reconcile systemd oneshots (writeShellApplication, embedded in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.

Refactor done:

  • modules/packages.nix: pkgs.abra overlay (shared pinned build).
  • modules/proxy.nix: deploy-proxy oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
  • modules/drone.nix: deploy-drone oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from /run/secrets), after deploy-proxy.
  • modules/drone-runner.nix: exec runner (fixed PATH conflict via lib.mkForce; allowUnfree for drone-runner-exec — Polyform license).
  • modules/secrets.nix: declared drone_rpc_secret + drone_gitea_client_secret + a sops template drone-runner.env (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
  • Removed scripts/deploy-*.sh. install.md now = clone + nixos-rebuild switch + preconditions.

Build/switch: build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed). nixos-rebuild switch → all three units active/success:

  • deploy-proxy success (reconciled traefik), deploy-drone → deploy succeeded 🟢 (drone/drone 2.26.0, secrets client_secret+rpc_secret v1, drone_env config), drone-runner-exec active.

Verify (commands + output):

  • docker service ls → drone_ci_commoninternet_net_app 1/1, traefik app+socket-proxy 1/1.
  • Via gateway: …/healthz → 200; / → 303 (login redirect, correct).
  • Runner: journal shows a few startup "cannot ping the remote server" (404) messages (drone RPC not ready yet), then "successfully pinged the remote server" and "polling the remote server" capacity=2 endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec. Runner connected via RPC.

Remaining for M2 gate: push a hello-world .drone.yml to cc-ci + get a green build. Needs the cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot the admin.)

2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)

Drone↔Gitea OAuth (scripted, the one manual bootstrap): logged the bot into Gitea (CSRF cookie → form), drove Drone /login → Gitea authorize consent (POST /login/oauth/grant with _csrf+state+granted=true) → code callback → Drone _session_. Captured the whole flow in scripts/bootstrap-drone-oauth.sh (reads bot creds from env; documented in install.md §2; one-time, token persists in Drone's data volume).

Repo activation: GET /api/user → autonomic-bot admin=true; GET /api/user/repos?latest=true synced 12 repos; POST /api/repos/recipe-maintainers/cc-ci → active=true, config_path .drone.yml (sets the Gitea push webhook).

Green build: added .drone.yml (exec pipeline), pushed (0d89e28). Polled /api/repos/recipe-maintainers/cc-ci/builds → build #1 pending→running→success. Steps: clone success exit 0; hello success exit 0 — log shows whoami=root, abra 0.13.0-beta-06a57de, swarm=active (ran on the host via the exec runner). M2 gate met; CLAIMED.
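
The pending→running→success polling can be sketched as below. Assumptions: a single-build endpoint at /api/repos/<repo>/builds/<number> with a "status" field, Bearer-token auth, and a guessed set of terminal statuses. The function names are illustrative and this is not the exact loop used for build #1.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: statuses treated as final; adjust to the Drone version in use.
TERMINAL = {"success", "failure", "error", "killed"}


def fetch_build_status(server: str, repo: str, number: int, token: str) -> str:
    """Fetch one build's status from the Drone API."""
    req = urllib.request.Request(
        f"{server}/api/repos/{repo}/builds/{number}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["status"]


def wait_for_build(get_status: Callable[[], str], interval: float = 5.0,
                   max_polls: int = 120,
                   sleep: Callable[[float], None] = time.sleep) -> list[str]:
    """Poll until a terminal status; return the distinct statuses seen, in order."""
    seen: list[str] = []
    for _ in range(max_polls):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL:
            return seen
        sleep(interval)
    raise TimeoutError(f"build still {seen[-1] if seen else 'unknown'} after {max_polls} polls")
```

In use, get_status would be a closure over fetch_build_status for recipe-maintainers/cc-ci and the build number; a green run returns ["pending", "running", "success"], and a stuck build raises instead of blocking the tick forever.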

Next: M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + !testme exact + collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).

2026-05-26 — M3 start: bridge secrets + comment-bridge source

Secrets (sops): minted a Gitea API token (cc-ci-bridge, scopes read:org/user, write:repo/issue), a Drone API token (POST /api/user/token, the stable personal token; rotates on call), and a webhook HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via sops set (host age identity). secrets.yaml now holds 6 secrets.

bridge/bridge.py (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC (X-Gitea-Signature sha256), requires X-Gitea-Event: issue_comment, action=created, body trimmed == !testme, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204); resolves PR head sha+repo; triggers a parameterized Drone build (POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC, custom params → pipeline env); posts a PR comment linking the run. Secrets read from mounted files; config via env. /healthz GET.
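
A stdlib-only sketch of the three pure pieces of that handler: signature check, trigger gate, and build URL. It is not the contents of bridge/bridge.py. The function names and the payload field names (action, comment.body, issue.pull_request) are assumptions to verify against a real webhook delivery, and the collaborator lookup and HTTP calls are left out.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def signature_ok(secret: bytes, raw_body: bytes, header_sig: str) -> bool:
    """Compare X-Gitea-Signature with hex HMAC-SHA256(secret, raw body)."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; the header is untrusted input.
    return hmac.compare_digest(expected.encode(), header_sig.strip().lower().encode())


def should_trigger(event: str, payload: dict) -> bool:
    """Gate from the entry above: issue_comment, created, body == !testme, on a PR."""
    # Assumption: these field names; check them against a real delivery.
    issue = payload.get("issue") or {}
    comment = payload.get("comment") or {}
    return (
        event == "issue_comment"
        and payload.get("action") == "created"
        and (comment.get("body") or "").strip() == "!testme"
        and issue.get("pull_request") is not None
    )


def build_trigger_url(drone: str, ci_repo: str, recipe: str, ref: str,
                      pr: int, src: str) -> str:
    """URL for the parameterized build; extra query params become pipeline env."""
    query = urlencode({"branch": "main", "RECIPE": recipe, "REF": ref,
                       "PR": pr, "SRC": src})
    return f"{drone}/api/repos/{ci_repo}/builds?{query}"
```

The signature must be computed over the raw request bytes before any JSON parsing, since re-serialising the payload would change the digest. The exact-match gate means a comment like "!testme please" is ignored rather than treated as a trigger.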

Next: package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind traefik at ci.commoninternet.net/hook via a reconcile oneshot (modules/bridge.nix); register a per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab rejected). That's the M3 gate.