JOURNAL — cc-ci Builder (append-only)
2026-05-26 — Bootstrap (§1)
Access verification (all pass):
- ssh cc-ci 'hostname && whoami && nixos-version' → nixos/root/24.11.719113.50ab793786d9 (Vicuna)
- curl https://git.autonomic.zone/api/v1/version → {"version":"1.24.2"}
- Gitea bot auth (curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user) → login: autonomic-bot, id 64
- getent hosts probe-$RANDOM.ci.commoninternet.net → 143.244.213.108 (the gateway IP, as expected: TLS passthrough)
- Cert present: ls /var/lib/ci-certs/live/ → fullchain.pem (2909 b), privkey.pem (227 b, mode 640)
- recipe-maintainers org exists (private); recipe-maintainers/cc-ci → 404 (created below)
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n, keycloak, lasuite-meet, matrix-synapse, cryptpad
Baseline (docs/baseline.md): fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk
(3.8 GiB free). No docker/swarm/abra. Channel-based /etc/nixos/configuration.nix (no flake).
Actions:
- Created repo recipe-maintainers/cc-ci (private) via Gitea API.
- git init in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no secrets stored in git config).
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.
Next: commit + push bootstrap, then M0 (flake + base config + sops test secret).
2026-05-26 — M0: flake + base config rebuilt from repo
Authored flake.nix (pins nixpkgs rev 50ab793786d9…, the exact rev cc-ci ran),
hosts/cc-ci/hardware.nix (incus VM module + cloud-init + DHCP/nameservers) and
hosts/cc-ci/configuration.nix (faithful baseline repro: tailscale w/ hardcoded --hostname=cc-nix-test since builtins.readFile /etc/ts-hostname is impure under flakes; sshd root; firewall
trust tailscale0 + tcp/22; base pkgs).
Disk/inode hiccup → resolved: first nix flake lock/build hit No space left on device —
diagnosed as inode exhaustion (df -i → 6005 free of 586336; old 8.9 GiB fs). Operator grew
the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.
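"No space left on device" can mean either blocks or inodes have run out, so it helps to look at both at once. The sketch below is one way to do that; the function names pct_free and fs_headroom are hypothetical (not part of the cc-ci repo), and fs_headroom assumes a df that supports -P together with -k and -i and reports numeric inode counts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the cc-ci repo.

# Integer percentage of free units (blocks or inodes): pct_free FREE TOTAL
pct_free() {
  local free="$1" total="$2"
  if [ "$total" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( free * 100 / total ))
}

# Print free blocks and free inodes for a mount point, side by side.
# Assumption: in "df -P" output, column 2 is the total and column 4 the free count.
fs_headroom() {
  local mount="${1:-/}" b_total b_free i_total i_free
  b_total=$(df -Pk "$mount" | awk 'NR==2 {print $2}')
  b_free=$(df -Pk "$mount" | awk 'NR==2 {print $4}')
  i_total=$(df -Pi "$mount" | awk 'NR==2 {print $2}')
  i_free=$(df -Pi "$mount" | awk 'NR==2 {print $4}')
  printf '%s: blocks %s%% free, inodes %s%% free\n' "$mount" \
    "$(pct_free "$b_free" "$b_total")" "$(pct_free "$i_free" "$i_total")"
}
```

With the figures recorded above, pct_free 6005 586336 prints 1, i.e. roughly 1% of inodes were still free when the build failed.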
Build + switch (commands + output):
- ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci' → BUILD EXIT 0, produced nixos-system-nixos-24.11.20250630.50ab793.
- ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci' (detached so it survives ssh drop) → unit Result=success ExecMainStatus=0.
Gate verification:
- systemctl is-system-running → running
- readlink /run/current-system → …-nixos-system-nixos-24.11.20250630.50ab793 (gen 3, from flake)
- systemctl is-active tailscaled → active; sshd.socket → active (sshd is socket-activated, so sshd.service reads inactive; live ssh proves it works)
- systemctl --failed → none
- nixos-rebuild list-generations → gen 3 current @20:23, prior channel gen 2 retained for rollback.
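The two health checks that recur at every gate (is-system-running and --failed) could be folded into one helper. This is a minimal sketch, not something the repo contains; gate_healthy is a hypothetical name.

```shell
# Hypothetical helper: succeed only if systemd reports "running" and
# no unit is in the failed state.
gate_healthy() {
  local state failed
  state=$(systemctl is-system-running 2>/dev/null)
  # --no-legend drops the header and footer so an empty result means "none".
  failed=$(systemctl --failed --no-legend --plain 2>/dev/null)
  if [ "$state" = "running" ] && [ -z "$failed" ]; then
    echo "gate: OK"
    return 0
  fi
  echo "gate: FAIL (state=${state:-unknown})" >&2
  if [ -n "$failed" ]; then
    printf '%s\n' "$failed" >&2
  fi
  return 1
}
```

Run over ssh, the exit status can then drive a script: ssh cc-ci gate_healthy would need the function available on the host, so in practice it would live in a script file there.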
Known warning (tracked, non-blocking): incus module enables systemd.network while we keep
networking.useDHCP=true (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
baseline; networking is up. Clean up by choosing one stack later.
Deploy mechanism settled (DECISIONS.md): switch --flake on-host, repo synced via tar | ssh.
Next: sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then CLAIM the M0 gate for the Adversary.
2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)
Keys:
- Host age recipient from ssh host key: ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub' → age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa.
- Master recovery key generated on host (age-keygen), public age1cmk26t…; private moved off-box to /srv/cc-ci/.sops/master-age.txt (mode 600) and shred-ded from the host. Never in repo.
Files: .sops.yaml (both recipients, rule secrets/.*\.(yaml|json|env)$); modules/secrets.nix
(sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key], secrets.test_secret={}); flake gains
sops-nix input + sops-nix.nixosModules.sops; configuration.nix imports the module.
sops-nix version pin (dead-end avoided): master sops-nix wants buildGo125Module (Go 1.25),
absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to 77c423a… (2025-06-17, last using
plain buildGoModule). Verified the file at that rev uses buildGoModule. Build then OK.
Encrypt test secret: on host, printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml
then nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml (run inside repo so
.sops.yaml resolves) → rc=0, two age recipients in the file.
Build + switch (commands + output):
- nixos-rebuild build --flake .#cc-ci → BUILD EXIT 0 (built sops-install-secrets w/ Go 1.23.8).
- systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci → Result=success ExecMainStatus=0.
Gate verification (M0):
- systemctl is-system-running → running; systemctl --failed → none.
- ls -la /run/secrets/test_secret → -r-------- 1 root root 41; stat → root:root 400.
- head -c9 → cc-ci-m0- (matches generated value), wc -c → 41 (9 + 32 hex). Decrypt path proven.
- Pulled encrypted secrets/secrets.yaml + flake.lock back to clone; grep cc-ci-m0 secrets.yaml → no plaintext leak; lock inputs = nixpkgs, sops-nix.
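The grep-based leak check above can be wrapped so that it fails loudly when a plaintext marker survives encryption. A minimal sketch, assuming the marker is a fixed string such as the cc-ci-m0 prefix; assert_no_plaintext is a hypothetical name.

```shell
# Hypothetical helper: fail if a known plaintext marker appears in a file
# that is supposed to be encrypted.
assert_no_plaintext() {
  local marker="$1" file="$2"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  # -F treats the marker as a literal string, not a regex.
  if grep -qF -- "$marker" "$file"; then
    echo "plaintext leak: '$marker' found in $file" >&2
    return 1
  fi
  echo "no plaintext '$marker' in $file"
}
```

For example, assert_no_plaintext cc-ci-m0 secrets/secrets.yaml returns non-zero if the file was committed before sops encrypted it.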
Gate handshake: set Gate: M0 — CLAIMED, awaiting Adversary in STATUS.md. REVIEW.md still empty
(no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed
with M1 (independent infra build), without advancing to M2 until M0 shows PASS.
Next: M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider → /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.
2026-05-26 — M1: Docker + single-node swarm via Nix
modules/swarm.nix: virtualisation.docker.enable + daily autoprune (--all --volumes until=24h
to protect the 28 GiB root), docker in systemPackages, and a swarm-init oneshot
(docker swarm init --advertise-addr 127.0.0.1 if not active; docker network create --driver overlay --attachable proxy if absent). Imported into configuration.nix.
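The idempotency of the swarm-init oneshot comes down to two guards. The sketch below shows that logic as a plain shell function; ensure_swarm is a hypothetical name, and the actual unit in modules/swarm.nix may be written differently.

```shell
# Sketch of an idempotent swarm bootstrap (hypothetical function name).
ensure_swarm() {
  local state
  # LocalNodeState is "active" once this node is part of a swarm.
  state=$(docker info --format '{{.Swarm.LocalNodeState}}')
  if [ "$state" != "active" ]; then
    docker swarm init --advertise-addr 127.0.0.1
  fi
  # Create the shared overlay network only if it does not exist yet.
  if ! docker network inspect proxy >/dev/null 2>&1; then
    docker network create --driver overlay --attachable proxy
  fi
}
```

Because both steps are guarded, running the function again on an already initialised host issues no init or create calls.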
Build + switch: nixos-rebuild build --flake .#cc-ci → EXIT 0; systemd-run … switch →
Result=success.
Verify (commands + output):
- systemctl show swarm-init -p Result → Result=success
- docker info --format ... → Swarm=active Managers=1 Nodes=1
- docker network ls --filter name=proxy → proxy overlay swarm
- systemctl is-system-running → running; --failed → none.
Next: Traefik as a swarm stack (Nix-declared compose + docker stack deploy oneshot): docker
swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443,
attached to proxy. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate).
Rationale for swarm-service Traefik over a host services.traefik: a host process isn't on the
proxy overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-proxy
Traefik watching swarm labels.
2026-05-26 — M1: Traefik swarm stack + HTTPS path proven
modules/traefik.nix: Traefik v3.3 as a swarm service on proxy (so it reaches recipe VIPs).
Config via Nix writeText store files bind-mounted into the container (real files, not /etc
symlinks): static traefik.yml (entrypoints web/websecure; providers.swarm unix socket,
exposedByDefault=false, network=proxy; providers.file dir /etc/traefik/dynamic; ping; no
dashboard) and dynamic certs.yml (wildcard at /var/lib/ci-certs/live/* as stores.default.defaultCertificate + certificates, so any *.ci.commoninternet.net router with tls=true is covered,
no ACME). Deployed by a traefik-deploy oneshot (docker stack deploy) after swarm-init. Opened
firewall 80/443 (gateway forwards over enp5s0).
Build + switch: build EXIT 0; switch Result=success; traefik-deploy Result=success;
docker service ls → traefik_traefik traefik:v3.3 1/1.
Verify (commands + output):
- Local:
curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/→subject: CN=*.ci.commoninternet.net,issuer: …Let's Encrypt; CN=E8, TLSv1.3, HTTP 404. - End-to-end via gateway:
curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/→Connected to …(143.244.213.108) port 443, same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination. 404 is correct (no router for that host yet).
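The end-to-end probe pins DNS to the gateway IP with --resolve so the request takes the real external path regardless of local resolution. A small wrapper, sketched here under the assumption that only the status code matters, makes it reusable; probe_https is a hypothetical name.

```shell
# Hypothetical helper: request HOST over HTTPS while pinning its address to IP,
# and print only the HTTP status code.
probe_https() {
  local host="$1" ip="$2"
  # -k skips certificate verification, matching the journal's probes;
  # -o /dev/null discards the body, -w prints the status code.
  curl -ks -o /dev/null -w '%{http_code}' \
    --resolve "${host}:443:${ip}" "https://${host}/"
}
```

For instance, probe_https probe-test.ci.commoninternet.net 143.244.213.108 would be expected to print 404 while no router exists for that host.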
Next: install abra (M1 last task), abra app new a trivial recipe (custom-html) → deploy →
reach over HTTPS at .ci.commoninternet.net → teardown leaving no volumes. That completes M1
→ CLAIM M1 gate.
2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)
Orchestrator decision (mid-M1): replace the hand-rolled Traefik with the canonical Co-op Cloud
traefik recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom
modules/traefik.nix; moved firewall 80/443 into modules/swarm.nix. Recorded in DECISIONS.md.
Why the pivot also fixed a real bug: my custom Traefik used entrypoint websecure; coop-cloud
recipes label entrypoints=web-secure. While chasing that I also hit a sharp systemd-run gotcha:
systemd-run … nixos-rebuild switch --flake .#cc-ci runs with cwd /, so .# resolves to / and fails with "could not
find a flake.nix"; the switch silently failed while a systemctl show issued after --collect returned a
stale Result=success. Fix: always use the absolute flake path /root/cc-ci#cc-ci, and read the
result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)
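One way to make the relative-path mistake impossible is to validate the flake reference before handing it to systemd-run. The sketch below is an illustration only; require_abs_flake and switch_detached are hypothetical names, and the systemd-run flags are the ones recorded earlier in this journal.

```shell
# Hypothetical guard: accept only flake references with an absolute path,
# because a transient unit started by systemd-run has cwd "/".
require_abs_flake() {
  case "$1" in
    /*'#'*) printf '%s\n' "$1" ;;
    *) echo "refusing relative flake ref: $1" >&2; return 1 ;;
  esac
}

# Start the switch in a transient unit, but only after the guard passes.
switch_detached() {
  local unit="$1" flake
  flake=$(require_abs_flake "$2") || return 1
  systemd-run --unit="$unit" --collect --property=Type=oneshot \
    nixos-rebuild switch --flake "$flake"
}
```

With this in place, switch_detached ccci-rebuild '.#cc-ci' fails immediately instead of starting a unit that cannot find flake.nix.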
abra packaged (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd.
abra --version → 0.13.0-beta-06a57de.
scripts/deploy-proxy.sh (idempotent, pure-bash — host has no python3): ensure local abra server,
fetch traefik, write wildcard/no-ACME env (WILDCARDS_ENABLED=1, SECRET_WILDCARD_*_VERSION=v1,
COMPOSE_FILE=compose.yml:compose.wildcard.yml, LETS_ENCRYPT_ENV= empty), insert cert secrets via
abra app secret insert … -f from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line
PEM must use -f (not arg); secret-presence must check docker secret ls (abra's recipe list always
shows the name with created on server:false).
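The secret-presence fix amounts to asking docker rather than abra. A minimal sketch of such a check follows; secret_exists is a hypothetical name, and the exact secret names abra generates are not assumed here, so the name is passed in by the caller.

```shell
# Hypothetical helper: true if a swarm secret with exactly this name exists.
secret_exists() {
  # --format '{{.Name}}' prints one secret name per line;
  # -x requires a whole-line match, -F treats the name literally.
  docker secret ls --format '{{.Name}}' | grep -qxF -- "$1"
}
```

A deploy script can then skip the insert step when secret_exists succeeds, which keeps reruns idempotent.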
Traefik deploy: abra app deploy → deploy succeeded 🟢 (traefik v3.6.15 + socket-proxy).
Verify: docker service ls → app+socket-proxy 1/1; via gateway curl --resolve probe.*:443:143.244.213.108 → CN=*.ci.commoninternet.net (LE E8); 0 ACME log lines.
M1 gate (recipe over HTTPS + teardown):
- abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n, then set LETS_ENCRYPT_ENV= and abra app deploy -n -C → 🟢 (nginx 1.29.0).
- curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/ → http_code=200 size=615, served the nginx welcome page over HTTPS with the wildcard cert.
- Teardown: abra app undeploy -n → 🟢; abra app volume remove -f -n → "1 volumes removed"; leak check → services 0 / volumes 0 / secrets 0 / containers 0. Clean.
- Correct teardown syntax confirmed: secret remove <d> --all -n (not --all-secrets).
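The leak check after teardown can be expressed as a single function that counts leftover swarm objects. This is a sketch under assumptions: leak_check and count_lines are hypothetical names, and the stack name passed in (for example the app domain with dots replaced by underscores) is assumed rather than taken from the journal.

```shell
# Hypothetical helpers: count swarm objects whose name matches a stack name.
count_lines() { awk 'END {print NR}'; }

leak_check() {
  local stack="$1" s v k c
  s=$(docker service ls -q --filter "name=${stack}" | count_lines)
  v=$(docker volume ls -q --filter "name=${stack}" | count_lines)
  k=$(docker secret ls -q --filter "name=${stack}" | count_lines)
  c=$(docker ps -aq --filter "name=${stack}" | count_lines)
  echo "services $s / volumes $v / secrets $k / containers $c"
  # Succeed only when nothing is left behind.
  [ "$s" -eq 0 ] && [ "$v" -eq 0 ] && [ "$k" -eq 0 ] && [ "$c" -eq 0 ]
}
```

The non-zero exit status on any leftover makes the check usable as a gate in a later automated teardown.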
docs/install.md seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.
Next: M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.