Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# JOURNAL — cc-ci Builder (append-only)

## 2026-05-26 — Bootstrap (§1)

**Access verification (all pass):**
- `ssh cc-ci 'hostname && whoami && nixos-version'` → `nixos` / `root` / `24.11.719113.50ab793786d9 (Vicuna)`
- `curl https://git.autonomic.zone/api/v1/version` → `{"version":"1.24.2"}`
- Gitea bot auth (`curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user`) → `login: autonomic-bot`, id 64
- `getent hosts probe-$RANDOM.ci.commoninternet.net` → `143.244.213.108` (the gateway IP, as expected — TLS passthrough)
- Cert present: `ls /var/lib/ci-certs/live/` → `fullchain.pem` (2909 b), `privkey.pem` (227 b, mode 640)
- recipe-maintainers org exists (private); `recipe-maintainers/cc-ci` → 404 (created below)
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n,
  keycloak, lasuite-meet, matrix-synapse, cryptpad

**Baseline (docs/baseline.md):** fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk
(3.8 GiB free). No docker/swarm/abra. Channel-based `/etc/nixos/configuration.nix` (no flake).

**Actions:**
- Created repo `recipe-maintainers/cc-ci` (private) via Gitea API.
- `git init` in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no
  secrets stored in git config).
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.

**Next:** commit + push bootstrap, then M0 (flake + base config + sops test secret).

## 2026-05-26 — M0: flake + base config rebuilt from repo

**Authored** `flake.nix` (pins nixpkgs rev `50ab793786d9…`, the exact rev cc-ci ran),
`hosts/cc-ci/hardware.nix` (incus VM module + cloud-init + DHCP/nameservers) and
`hosts/cc-ci/configuration.nix` (faithful baseline repro: tailscale w/ hardcoded
`--hostname=cc-nix-test` since `builtins.readFile /etc/ts-hostname` is impure under flakes; sshd
root; firewall trust tailscale0 + tcp/22; base pkgs).

**Disk/inode hiccup → resolved:** first `nix flake lock`/build hit `No space left on device` —
diagnosed as **inode** exhaustion (`df -i` → 6005 free of 586336; old 8.9 GiB fs). Operator grew
the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.

**Build + switch (commands + output):**
- `ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci'` → `BUILD EXIT 0`,
  produced `nixos-system-nixos-24.11.20250630.50ab793`.
- `ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci'`
  (detached so it survives ssh drop) → unit `Result=success ExecMainStatus=0`.

**Gate verification:**
- `systemctl is-system-running` → `running`
- `readlink /run/current-system` → `…-nixos-system-nixos-24.11.20250630.50ab793` (gen 3, from flake)
- `systemctl is-active tailscaled` → `active`; `sshd.socket` → `active` (sshd is socket-activated, so
  `sshd.service` reads inactive — live ssh proves it works)
- `systemctl --failed` → none
- `nixos-rebuild list-generations` → gen 3 current @20:23, prior channel gen 2 retained for rollback.

**Known warning (tracked, non-blocking):** incus module enables `systemd.network` while we keep
`networking.useDHCP=true` (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
baseline; networking is up. Clean up by choosing one stack later.

**Deploy mechanism settled** (DECISIONS.md): `switch --flake` on-host, repo synced via `tar | ssh`.

**Next:** sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then
CLAIM the M0 gate for the Adversary.

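
**Sketch (illustrative, not part of the repo):** the inode diagnosis in the disk hiccup above came from `df -i`. The same check can be approximated with Python's `os.statvfs`, which reports block and inode counts side by side, so a "No space left on device" error can be attributed to one or the other. The function names and the 5% threshold are assumptions made for this example.

```python
import os


def fs_headroom(path="/"):
    """Return free/total bytes and inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "bytes_free": st.f_bavail * st.f_frsize,
        "bytes_total": st.f_blocks * st.f_frsize,
        "inodes_free": st.f_favail,
        "inodes_total": st.f_files,
    }


def exhausted(free, total, min_fraction=0.05):
    """True when less than `min_fraction` of a resource is still free."""
    if total == 0:  # some filesystems report no inode limit at all
        return False
    return free / total < min_fraction


if __name__ == "__main__":
    h = fs_headroom("/")
    print("blocks low:", exhausted(h["bytes_free"], h["bytes_total"]))
    print("inodes low:", exhausted(h["inodes_free"], h["inodes_total"]))
```

With the figures recorded above (6005 free of 586336 inodes, about 1%), `exhausted(6005, 586336)` is true even though 3.8 GiB of the disk was reported free at baseline.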
## 2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)

**Keys:**
- Host age recipient from ssh host key:
  `ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub'` →
  `age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa`.
- Master recovery key generated on host (`age-keygen`), public `age1cmk26t…`; private moved off-box
  to `/srv/cc-ci/.sops/master-age.txt` (mode 600) and `shred`-ded from the host. Never in repo.

**Files:** `.sops.yaml` (both recipients, rule `secrets/.*\.(yaml|json|env)$`); `modules/secrets.nix`
(`sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key]`, `secrets.test_secret={}`); flake gains
`sops-nix` input + `sops-nix.nixosModules.sops`; configuration.nix imports the module.

**sops-nix version pin (dead-end avoided):** master sops-nix wants `buildGo125Module` (Go 1.25),
absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to `77c423a…` (2025-06-17, last using
plain `buildGoModule`). Verified the file at that rev uses `buildGoModule`. Build then OK.

**Encrypt test secret:** on host, `printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml`
then `nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml` (run inside repo so
`.sops.yaml` resolves) → rc=0, two age recipients in the file.

**Build + switch (commands + output):**
- `nixos-rebuild build --flake .#cc-ci` → `BUILD EXIT 0` (built sops-install-secrets w/ Go 1.23.8).
- `systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci` →
  `Result=success ExecMainStatus=0`.

**Gate verification (M0):**
- `systemctl is-system-running` → `running`; `systemctl --failed` → none.
- `ls -la /run/secrets/test_secret` → `-r-------- 1 root root 41` ; `stat` → `root:root 400`.
- `head -c9` → `cc-ci-m0-` (matches generated value), `wc -c` → 41 (9 + 32 hex). Decrypt path proven.
- Pulled encrypted `secrets/secrets.yaml` + `flake.lock` back to clone; `grep cc-ci-m0 secrets.yaml`
  → no plaintext leak; lock inputs = nixpkgs, sops-nix.

**Gate handshake:** set `Gate: M0 — CLAIMED, awaiting Adversary` in STATUS.md. REVIEW.md still empty
(no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed
with M1 (independent infra build), without advancing to M2 until M0 shows PASS.

**Next:** M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider
→ /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.

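
**Sketch (illustrative, not part of the repo):** the `wc -c` → 41 check in the gate verification above is simple arithmetic: a 9-character prefix plus 32 hex characters. The journal does not record how `<rand>` was generated, so `secrets.token_hex(16)` below is an assumption that happens to produce the same shape.

```python
import secrets

PREFIX = "cc-ci-m0-"  # 9 characters, as checked with `head -c9`


def make_test_secret():
    """Build a value shaped like the M0 test secret: prefix + 32 hex chars."""
    return PREFIX + secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def looks_like_test_secret(value):
    """Check prefix, total length (9 + 32 = 41), and that the tail is hex."""
    tail = value[len(PREFIX):]
    return (
        value.startswith(PREFIX)
        and len(value) == 41
        and all(c in "0123456789abcdef" for c in tail)
    )


if __name__ == "__main__":
    print(looks_like_test_secret(make_test_secret()))
```

The length only comes out to 41 because the value was written with `printf` and no trailing newline; a trailing newline in the decrypted file would make `wc -c` report 42.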
## 2026-05-26 — M1: Docker + single-node swarm via Nix

**modules/swarm.nix:** `virtualisation.docker.enable` + daily autoprune (--all --volumes until=24h
to protect the 28 GiB root), `docker` in systemPackages, and a `swarm-init` oneshot
(`docker swarm init --advertise-addr 127.0.0.1` if not active;
`docker network create --driver overlay --attachable proxy` if absent). Imported into configuration.nix.

**Build + switch:** `nixos-rebuild build --flake .#cc-ci` → EXIT 0; `systemd-run … switch` →
`Result=success`.

**Verify (commands + output):**
- `systemctl show swarm-init -p Result` → `Result=success`
- `docker info --format ...` → `Swarm=active Managers=1 Nodes=1`
- `docker network ls --filter name=proxy` → `proxy overlay swarm`
- `systemctl is-system-running` → `running`; `--failed` → none.

**Next:** Traefik as a swarm stack (Nix-declared compose + `docker stack deploy` oneshot): docker
swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443,
attached to `proxy`. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate).
Rationale for swarm-service Traefik over a host `services.traefik`: a host process isn't on the
`proxy` overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-`proxy`
Traefik watching swarm labels.

## 2026-05-26 — M1: Traefik swarm stack + HTTPS path proven

**modules/traefik.nix:** Traefik v3.3 as a swarm service on `proxy` (so it reaches recipe VIPs).
Config via Nix `writeText` store files bind-mounted into the container (real files, not /etc
symlinks): static `traefik.yml` (entrypoints web/websecure; `providers.swarm` unix socket,
exposedByDefault=false, network=proxy; `providers.file` dir /etc/traefik/dynamic; ping; no
dashboard) and dynamic `certs.yml` (wildcard at `/var/lib/ci-certs/live/*` as
`stores.default.defaultCertificate` + certificates — so any `*.ci.commoninternet.net` router with
tls=true is covered, no ACME). Deployed by a `traefik-deploy` oneshot (`docker stack deploy`) after
swarm-init. Opened firewall 80/443 (gateway forwards over enp5s0).

**Build + switch:** build EXIT 0; switch `Result=success`; `traefik-deploy` `Result=success`;
`docker service ls` → `traefik_traefik traefik:v3.3 1/1`.

**Verify (commands + output):**
- Local: `curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/` →
  `subject: CN=*.ci.commoninternet.net`, `issuer: …Let's Encrypt; CN=E8`, TLSv1.3, HTTP 404.
- **End-to-end via gateway:**
  `curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/`
  → `Connected to …(143.244.213.108) port 443`,
  same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination.
  404 is correct (no router for that host yet).

**Next:** install abra (M1 last task), `abra app new` a trivial recipe (custom-html) → deploy →
reach over HTTPS at `<app>.ci.commoninternet.net` → teardown leaving no volumes. That completes M1
→ CLAIM M1 gate.

## 2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)

**Orchestrator decision (mid-M1):** replace the hand-rolled Traefik with the canonical Co-op Cloud
`traefik` recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom
`modules/traefik.nix`; moved firewall 80/443 into `modules/swarm.nix`. Recorded in DECISIONS.md.

**Why the pivot also fixed a real bug:** my custom Traefik used entrypoint `websecure`; coop-cloud
recipes label `entrypoints=web-secure`. While chasing that I also hit a sharp **systemd-run gotcha**:
`systemd-run … nixos-rebuild switch --flake .#cc-ci` runs with cwd `/`, so `.#` → `/` → "could not
find a flake.nix"; the switch silently failed while a post-`--collect` `systemctl show` returned a
stale `Result=success`. Fix: always use the **absolute** flake path `/root/cc-ci#cc-ci`, and read the
result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)

**abra packaged** (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd.
`abra --version` → `0.13.0-beta-06a57de`.

**scripts/deploy-proxy.sh** (idempotent, pure-bash — host has no python3): ensure local abra server,
fetch traefik, write wildcard/no-ACME env (`WILDCARDS_ENABLED=1`, `SECRET_WILDCARD_*_VERSION=v1`,
`COMPOSE_FILE=compose.yml:compose.wildcard.yml`, `LETS_ENCRYPT_ENV=` empty), insert cert secrets via
`abra app secret insert … -f` from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line
PEM must use `-f` (not arg); secret-presence must check `docker secret ls` (abra's recipe list always
shows the name with `created on server:false`).

**Traefik deploy:** `abra app deploy` → `deploy succeeded 🟢` (traefik v3.6.15 + socket-proxy).
Verify: `docker service ls` → app+socket-proxy 1/1; via gateway
`curl --resolve probe.*:443:143.244.213.108` → `CN=*.ci.commoninternet.net` (LE E8); **0 ACME log lines**.

**M1 gate (recipe over HTTPS + teardown):**
- `abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n` then set
  `LETS_ENCRYPT_ENV=` and `abra app deploy -n -C` → `🟢` (nginx 1.29.0).
- `curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/` →
  `http_code=200 size=615`, served the nginx welcome page over HTTPS with the wildcard cert.
- Teardown: `abra app undeploy -n` → 🟢; `abra app volume remove -f -n` → "1 volumes removed";
  leak check → services 0 / volumes 0 / secrets 0 / containers 0. **Clean.**
- Correct teardown syntax confirmed: `secret remove <d> --all -n` (not `--all-secrets`).

**docs/install.md** seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.

**Next:** M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.

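
**Sketch (illustrative, not part of the repo):** the systemd-run gotcha above comes down to a relative flake reference being resolved against the unit's working directory (`/`) instead of the caller's. One way to guard against it is to build the command in a helper that rejects anything but an absolute path. The helper name is hypothetical; the `systemd-run` and `nixos-rebuild` arguments are the ones already recorded in this journal.

```python
import os


def switch_command(flake_dir, host, unit):
    """Build the detached `nixos-rebuild switch` argv, refusing relative flake paths.

    systemd-run starts the unit with cwd `/`, so a reference like `.#cc-ci`
    would point at `/` rather than the repo checkout.
    """
    if not os.path.isabs(flake_dir):
        raise ValueError(f"flake path must be absolute, got {flake_dir!r}")
    return [
        "systemd-run",
        f"--unit={unit}",
        "--collect",
        "--property=Type=oneshot",
        "nixos-rebuild",
        "switch",
        "--flake",
        f"{flake_dir}#{host}",
    ]


if __name__ == "__main__":
    print(" ".join(switch_command("/root/cc-ci", "cc-ci", "ccci-rebuild")))
```

The helper only covers the path half of the fix. Reading the unit's result before it is reset still has to happen on the host.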
## 2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets

**Decision (DECISIONS.md):** keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the
abandoned `drone-runner-exec` (unstable-2020). Accepted because the RPC protocol is stable;
Woodpecker is the documented fallback. Deploy shape mirrors traefik: server via coop-cloud `drone`
recipe (abra, swarm, traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host
Nix systemd service.

**Recipe recon:** coop-cloud `drone` recipe = drone/drone:2.26.0, secrets `rpc_secret` +
`CLIENT_SECRET` (Gitea OAuth), Gitea SSO via `compose.gitea.yml` (`GITEA_CLIENT_ID`, `GITEA_DOMAIN`).
Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.

**Done this tick:**
- Created Gitea OAuth app `cc-ci-drone` (bot): client_id `ab4cdb9d-…`, redirect
  `https://drone.ci.commoninternet.net/login`.
- Generated `DRONE_RPC_SECRET` (openssl-equivalent /dev/urandom hex32) + stored client_secret;
  both added to `secrets/secrets.yaml` via `sops set` (needed `SOPS_AGE_KEY` from the host ssh key:
  `ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key`). Verified: decrypt shows keys
  test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).

**Next:** scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets),
modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the
runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).

## 2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

**Orchestrator steer (2×):** collapse install to a single `nixos-rebuild switch` — convert the
manual deploy scripts into **idempotent-reconcile systemd oneshots** (writeShellApplication, embedded
in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every
activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.

**Refactor done:**
- `modules/packages.nix`: `pkgs.abra` overlay (shared pinned build).
- `modules/proxy.nix`: `deploy-proxy` oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
- `modules/drone.nix`: `deploy-drone` oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from
  /run/secrets), after deploy-proxy.
- `modules/drone-runner.nix`: exec runner (fixed PATH conflict via `lib.mkForce`; allowUnfree for
  drone-runner-exec — Polyform license).
- `modules/secrets.nix`: declared drone_rpc_secret + drone_gitea_client_secret + a sops *template*
  `drone-runner.env` (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
- Removed `scripts/deploy-*.sh`. install.md now = clone + nixos-rebuild switch + preconditions.

**Build/switch:** build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed).
`nixos-rebuild switch` → all three units `active`/`success`:
- `deploy-proxy` success (reconciled traefik), `deploy-drone` → `deploy succeeded 🟢` (drone/drone
  2.26.0, secrets client_secret+rpc_secret v1, drone_env config), `drone-runner-exec` active.

**Verify (commands + output):**
- `docker service ls` → `drone_ci_commoninternet_net_app 1/1`, traefik app+socket-proxy 1/1.
- Via gateway: `…/healthz` → **200**; `/` → **303** (login redirect, correct).
- Runner: journal shows a few startup `cannot ping the remote server (404)` (drone RPC not ready
  yet) then `successfully pinged the remote server` + `polling the remote server capacity=2
  endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec`. **Runner connected via RPC.**

**Remaining for M2 gate:** push a hello-world `.drone.yml` to cc-ci + get a green build. Needs the
cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant
Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint
a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot
the admin.)

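
**Sketch (illustrative, not part of the repo):** the reconcile oneshots above re-derive what is missing on every run instead of trusting a run-once sentinel. For secrets, that means comparing the wanted names against what `docker secret ls` reports. The real units are shell (`writeShellApplication`); this Python version only illustrates the decision step, and the function name and secret names are hypothetical.

```python
def missing_secrets(wanted, docker_secret_ls_output):
    """Return the wanted secret names that Docker does not currently hold.

    `docker_secret_ls_output` is assumed to be the text printed by
    `docker secret ls --format '{{.Name}}'`: one secret name per line.
    """
    present = {
        line.strip()
        for line in docker_secret_ls_output.splitlines()
        if line.strip()
    }
    # Preserve the caller's order so inserts happen deterministically.
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    listing = "example_ssl_cert_v1\nunrelated_secret_v1\n"
    print(missing_secrets(["example_ssl_cert_v1", "example_ssl_key_v1"], listing))
```

Running the check twice against an unchanged listing yields the same plan, and once every wanted secret exists the plan is empty, which is what makes repeating it on every activation safe.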
## 2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)

**Drone↔Gitea OAuth (scripted, the one manual bootstrap):** logged the bot into Gitea (CSRF cookie
→ form), drove Drone `/login` → Gitea authorize consent (POST `/login/oauth/grant` with
_csrf+state+granted=true) → code callback → Drone `_session_`. Captured the whole flow in
`scripts/bootstrap-drone-oauth.sh` (reads bot creds from env; documented in install.md §2; one-time,
token persists in Drone's data volume).

**Repo activation:** `GET /api/user` → autonomic-bot admin=true; `GET /api/user/repos?latest=true`
synced 12 repos; `POST /api/repos/recipe-maintainers/cc-ci` → active=true, config_path .drone.yml
(sets the Gitea push webhook).

**Green build:** added `.drone.yml` (exec pipeline), pushed (0d89e28). Polled
`/api/repos/recipe-maintainers/cc-ci/builds` → build #1 pending→running→**success**. Steps:
clone success exit 0; hello success exit 0 — log shows `whoami=root`, `abra 0.13.0-beta-06a57de`,
`swarm=active` (ran on the host via the exec runner). **M2 gate met; CLAIMED.**

**Next:** M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + `!testme` exact +
collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with
the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).

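
**Sketch (illustrative, not part of the repo):** the green-build check above polled the builds endpoint until the status settled. A minimal polling loop is shown below. The status fetcher is injected so the loop stays independent of HTTP details, and the set of terminal statuses is an assumption about Drone's status vocabulary, not something recorded in this journal.

```python
import time

# Assumed terminal statuses; anything else is treated as "still in progress".
TERMINAL = {"success", "failure", "error", "killed"}


def wait_for_build(fetch_status, interval=2.0, timeout=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until a terminal status appears.

    Returns the list of distinct consecutive statuses observed,
    e.g. ["pending", "running", "success"].
    """
    deadline = clock() + timeout
    seen = []
    while True:
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL:
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"build still {status!r} after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Canned statuses stand in for real API responses.
    canned = iter(["pending", "pending", "running", "success"])
    print(wait_for_build(lambda: next(canned), sleep=lambda _: None))
```

In real use `fetch_status` would read the `status` field of the build JSON returned by the Drone API; that wiring is left out here.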
## 2026-05-26 — M3 start: bridge secrets + comment-bridge source

**Secrets (sops):** minted a Gitea API token (`cc-ci-bridge`, scopes read:org/user, write:repo/issue),
a Drone API token (`POST /api/user/token`, the stable personal token; rotates on call), and a webhook
HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via
`sops set` (host age identity). secrets.yaml now holds 6 secrets.

**bridge/bridge.py** (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC
(`X-Gitea-Signature` sha256), requires `X-Gitea-Event: issue_comment`, action=created, body trimmed
== `!testme`, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204);
resolves PR head sha+repo; triggers a parameterized Drone build
(`POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC`, custom params → pipeline env);
posts a PR comment linking the run. Secrets read from mounted files; config via env. `/healthz` GET.

**Next:** package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind
traefik at `ci.commoninternet.net/hook` via a reconcile oneshot (modules/bridge.nix); register a
per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab
rejected). That's the M3 gate.

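
**Sketch (illustrative, not the actual bridge.py):** three of the bridge checks described above are small enough to show with the standard library only: verifying the `X-Gitea-Signature` header as a hex HMAC-SHA256 of the raw request body, matching the trimmed comment against `!testme`, and building the parameterized trigger URL. The function names and example values are hypothetical, and the real handler may be structured differently.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Compare the header against HMAC-SHA256(secret, raw body) in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header_sig.strip().lower()
    return hmac.compare_digest(expected.encode(), received.encode())


def is_testme(comment_body: str) -> bool:
    """Accept only a comment whose trimmed body is exactly `!testme`."""
    return comment_body.strip() == "!testme"


def trigger_url(drone_base, ci_repo, recipe, ref, pr, src):
    """Build the build-trigger URL with custom params passed as query arguments."""
    query = urlencode(
        {"branch": "main", "RECIPE": recipe, "REF": ref, "PR": pr, "SRC": src}
    )
    return f"{drone_base}/api/repos/{ci_repo}/builds?{query}"


if __name__ == "__main__":
    secret, body = b"example-hmac", b'{"action":"created"}'
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    print(signature_ok(secret, body, sig), is_testme(" !testme\n"))
    print(trigger_url("https://drone.ci.commoninternet.net",
                      "recipe-maintainers/cc-ci",
                      "custom-html", "abc123", 7, "fork/custom-html"))
```

The signature has to be computed over the raw bytes as received, before any JSON parsing, since re-serializing the payload can change the bytes and invalidate the HMAC.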