# JOURNAL — cc-ci Builder (append-only)

## 2026-05-26 — Bootstrap (§1)

**Access verification (all pass):**
- `ssh cc-ci 'hostname && whoami && nixos-version'` → `nixos` / `root` / `24.11.719113.50ab793786d9 (Vicuna)`
- `curl https://git.autonomic.zone/api/v1/version` → `{"version":"1.24.2"}`
- Gitea bot auth (`curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user`) → `login: autonomic-bot`, id 64
- `getent hosts probe-$RANDOM.ci.commoninternet.net` → `143.244.213.108` (the gateway IP, as expected — TLS passthrough)
- Cert present: `ls /var/lib/ci-certs/live/` → `fullchain.pem` (2909 b), `privkey.pem` (227 b, mode 640)
- recipe-maintainers org exists (private); `recipe-maintainers/cc-ci` → 404 (created below)
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n,
  keycloak, lasuite-meet, matrix-synapse, cryptpad

**Baseline (docs/baseline.md):** fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk
(3.8 GiB free). No docker/swarm/abra. Channel-based `/etc/nixos/configuration.nix` (no flake).

**Actions:**
- Created repo `recipe-maintainers/cc-ci` (private) via Gitea API.
- `git init` in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no
  secrets stored in git config).
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.

**Next:** commit + push bootstrap, then M0 (flake + base config + sops test secret).

## 2026-05-26 — M0: flake + base config rebuilt from repo

**Authored** `flake.nix` (pins nixpkgs rev `50ab793786d9…`, the exact rev cc-ci ran),
`hosts/cc-ci/hardware.nix` (incus VM module + cloud-init + DHCP/nameservers) and
`hosts/cc-ci/configuration.nix` (faithful baseline repro: tailscale w/ hardcoded
`--hostname=cc-nix-test` since `builtins.readFile /etc/ts-hostname` is impure under flakes; sshd root;
firewall trust tailscale0 + tcp/22; base pkgs).

**Disk/inode hiccup → resolved:** first `nix flake lock`/build hit `No space left on device` —
diagnosed as **inode** exhaustion (`df -i` → 6005 free of 586336; old 8.9 GiB fs). Operator grew
the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.

**Build + switch (commands + output):**
- `ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci'` → `BUILD EXIT 0`,
  produced `nixos-system-nixos-24.11.20250630.50ab793`.
- `ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch
  --flake /root/cc-ci#cc-ci'` (detached so it survives ssh drop) → unit `Result=success
  ExecMainStatus=0`.

**Gate verification:**
- `systemctl is-system-running` → `running`
- `readlink /run/current-system` → `…-nixos-system-nixos-24.11.20250630.50ab793` (gen 3, from flake)
- `systemctl is-active tailscaled` → `active`; `sshd.socket` → `active` (sshd is socket-activated, so
  `sshd.service` reads inactive — live ssh proves it works)
- `systemctl --failed` → none
- `nixos-rebuild list-generations` → gen 3 current @20:23, prior channel gen 2 retained for rollback.

**Known warning (tracked, non-blocking):** incus module enables `systemd.network` while we keep
`networking.useDHCP=true` (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
baseline; networking is up. Clean up by choosing one stack later.

**Deploy mechanism settled** (DECISIONS.md): `switch --flake` on-host, repo synced via `tar | ssh`.

**Next:** sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then
CLAIM the M0 gate for the Adversary.

## 2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)

**Keys:**
- Host age recipient from ssh host key: `ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i
  /etc/ssh/ssh_host_ed25519_key.pub'` → `age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa`.
- Master recovery key generated on host (`age-keygen`), public `age1cmk26t…`; private moved off-box
  to `/srv/cc-ci/.sops/master-age.txt` (mode 600) and `shred`-ded from the host. Never in repo.

**Files:** `.sops.yaml` (both recipients, rule `secrets/.*\.(yaml|json|env)$`); `modules/secrets.nix`
(`sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key]`, `secrets.test_secret={}`); flake gains
`sops-nix` input + `sops-nix.nixosModules.sops`; configuration.nix imports the module.

**sops-nix version pin (dead-end avoided):** master sops-nix wants `buildGo125Module` (Go 1.25),
absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to `77c423a…` (2025-06-17, last using
plain `buildGoModule`). Verified the file at that rev uses `buildGoModule`. Build then OK.

**Encrypt test secret:** on host, `printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml`
then `nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml` (run inside repo so
`.sops.yaml` resolves) → rc=0, two age recipients in the file.

**Build + switch (commands + output):**
- `nixos-rebuild build --flake .#cc-ci` → `BUILD EXIT 0` (built sops-install-secrets w/ Go 1.23.8).
- `systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci` →
  `Result=success ExecMainStatus=0`.

**Gate verification (M0):**
- `systemctl is-system-running` → `running`; `systemctl --failed` → none.
- `ls -la /run/secrets/test_secret` → `-r-------- 1 root root 41` ; `stat` → `root:root 400`.
- `head -c9` → `cc-ci-m0-` (matches generated value), `wc -c` → 41 (9 + 32 hex). Decrypt path proven.
- Pulled encrypted `secrets/secrets.yaml` + `flake.lock` back to clone; `grep cc-ci-m0 secrets.yaml`
  → no plaintext leak; lock inputs = nixpkgs, sops-nix.

**Gate handshake:** set `Gate: M0 — CLAIMED, awaiting Adversary` in STATUS.md. REVIEW.md still empty
(no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed
with M1 (independent infra build), without advancing to M2 until M0 shows PASS.

**Next:** M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider
→ /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.

## 2026-05-26 — M1: Docker + single-node swarm via Nix

**modules/swarm.nix:** `virtualisation.docker.enable` + daily autoprune (--all --volumes until=24h
to protect the 28 GiB root), `docker` in systemPackages, and a `swarm-init` oneshot
(`docker swarm init --advertise-addr 127.0.0.1` if not active; `docker network create --driver
overlay --attachable proxy` if absent). Imported into configuration.nix.

**Build + switch:** `nixos-rebuild build --flake .#cc-ci` → EXIT 0; `systemd-run … switch` →
`Result=success`.

**Verify (commands + output):**
- `systemctl show swarm-init -p Result` → `Result=success`
- `docker info --format ...` → `Swarm=active Managers=1 Nodes=1`
- `docker network ls --filter name=proxy` → `proxy overlay swarm`
- `systemctl is-system-running` → `running`; `--failed` → none.

**Next:** Traefik as a swarm stack (Nix-declared compose + `docker stack deploy` oneshot): docker
swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443,
attached to `proxy`. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate).
Rationale for swarm-service Traefik over a host `services.traefik`: a host process isn't on the
`proxy` overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-`proxy`
Traefik watching swarm labels.

## 2026-05-26 — M1: Traefik swarm stack + HTTPS path proven

**modules/traefik.nix:** Traefik v3.3 as a swarm service on `proxy` (so it reaches recipe VIPs).
Config via Nix `writeText` store files bind-mounted into the container (real files, not /etc
symlinks): static `traefik.yml` (entrypoints web/websecure; `providers.swarm` unix socket,
exposedByDefault=false, network=proxy; `providers.file` dir /etc/traefik/dynamic; ping; no
dashboard) and dynamic `certs.yml` (wildcard at /var/lib/ci-certs/live/* as
`stores.default.defaultCertificate` + certificates — so any *.ci.commoninternet.net router with
tls=true is covered, no ACME). Deployed by a `traefik-deploy` oneshot (`docker stack deploy`) after
swarm-init. Opened firewall 80/443 (gateway forwards over enp5s0).

**Build + switch:** build EXIT 0; switch `Result=success`; `traefik-deploy` `Result=success`;
`docker service ls` → `traefik_traefik traefik:v3.3 1/1`.

**Verify (commands + output):**
- Local: `curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/` →
  `subject: CN=*.ci.commoninternet.net`, `issuer: …Let's Encrypt; CN=E8`, TLSv1.3, HTTP 404.
- **End-to-end via gateway:** `curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108
  https://probe-test.ci.commoninternet.net/` → `Connected to …(143.244.213.108) port 443`,
  same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination.
  404 is correct (no router for that host yet).

**Next:** install abra (M1 last task), `abra app new` a trivial recipe (custom-html) → deploy →
reach over HTTPS at <app>.ci.commoninternet.net → teardown leaving no volumes. That completes M1
→ CLAIM M1 gate.

## 2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)

**Orchestrator decision (mid-M1):** replace the hand-rolled Traefik with the canonical Co-op Cloud
`traefik` recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom
`modules/traefik.nix`; moved firewall 80/443 into `modules/swarm.nix`. Recorded in DECISIONS.md.

**Why the pivot also fixed a real bug:** my custom Traefik used entrypoint `websecure`; coop-cloud
recipes label `entrypoints=web-secure`. While chasing that I also hit a sharp **systemd-run gotcha**:
`systemd-run … nixos-rebuild switch --flake .#cc-ci` runs with cwd `/`, so `.#` → `/` → "could not
find a flake.nix"; the switch silently failed while a post-`--collect` `systemctl show` returned a
stale `Result=success`. Fix: always use the **absolute** flake path `/root/cc-ci#cc-ci`, and read the
result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)

**abra packaged** (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd.
`abra --version` → `0.13.0-beta-06a57de`.

**scripts/deploy-proxy.sh** (idempotent, pure-bash — host has no python3): ensure local abra server,
fetch traefik, write wildcard/no-ACME env (`WILDCARDS_ENABLED=1`, `SECRET_WILDCARD_*_VERSION=v1`,
`COMPOSE_FILE=compose.yml:compose.wildcard.yml`, `LETS_ENCRYPT_ENV=` empty), insert cert secrets via
`abra app secret insert … -f` from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line
PEM must use `-f` (not arg); secret-presence must check `docker secret ls` (abra's recipe list always
shows the name with `created on server:false`).

**Traefik deploy:** `abra app deploy` → `deploy succeeded 🟢` (traefik v3.6.15 + socket-proxy).
Verify: `docker service ls` → app+socket-proxy 1/1; via gateway
`curl --resolve probe.*:443:143.244.213.108` → `CN=*.ci.commoninternet.net` (LE E8); **0 ACME log lines**.

**M1 gate (recipe over HTTPS + teardown):**
- `abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n` then set
  `LETS_ENCRYPT_ENV=` and `abra app deploy -n -C` → `🟢` (nginx 1.29.0).
- `curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/` →
  `http_code=200 size=615`, served the nginx welcome page over HTTPS with the wildcard cert.
- Teardown: `abra app undeploy -n` → 🟢; `abra app volume remove -f -n` → "1 volumes removed";
  leak check → services 0 / volumes 0 / secrets 0 / containers 0. **Clean.**
- Correct teardown syntax confirmed: `secret remove <d> --all -n` (not `--all-secrets`).

**docs/install.md** seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.

**Next:** M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.

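The systemd-run gotcha recorded in this entry (a relative `.#cc-ci` resolving against cwd `/`) can be guarded against in tooling. The following is a minimal sketch, not the repo's code: `flake_ref` and `detached_switch_argv` are hypothetical helper names, and the argv mirrors the detached switch command quoted in the M0 entry.

```python
from pathlib import PurePosixPath


def flake_ref(repo_dir: str, host: str) -> str:
    """Build a flake reference like '/root/cc-ci#cc-ci', refusing relative paths.

    systemd-run starts the unit with cwd '/', so a relative path such as '.'
    would point at '/' rather than at the repo checkout.
    """
    path = PurePosixPath(repo_dir)
    if not path.is_absolute():
        raise ValueError(f"flake path must be absolute, got {repo_dir!r}")
    return f"{path}#{host}"


def detached_switch_argv(unit: str, repo_dir: str, host: str) -> list[str]:
    """Return the argv for a detached `nixos-rebuild switch` (hypothetical helper)."""
    return [
        "systemd-run",
        f"--unit={unit}",
        "--collect",
        "--property=Type=oneshot",
        "nixos-rebuild",
        "switch",
        "--flake",
        flake_ref(repo_dir, host),
    ]
```

Calling `flake_ref(".", "cc-ci")` raises immediately, which surfaces the mistake before a unit is ever started and before a stale `Result=success` can be misread.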
## 2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets

**Decision (DECISIONS.md):** keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the
abandoned `drone-runner-exec` (unstable-2020) — accepted (stable RPC), Woodpecker is the documented
fallback. Deploy shape mirrors traefik: server via coop-cloud `drone` recipe (abra, swarm,
traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.

**Recipe recon:** coop-cloud `drone` recipe = drone/drone:2.26.0, secrets `rpc_secret` +
`CLIENT_SECRET` (Gitea OAuth), Gitea SSO via `compose.gitea.yml` (`GITEA_CLIENT_ID`, `GITEA_DOMAIN`).
Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.

**Done this tick:**
- Created Gitea OAuth app `cc-ci-drone` (bot): client_id `ab4cdb9d-…`, redirect
  `https://drone.ci.commoninternet.net/login`.
- Generated `DRONE_RPC_SECRET` (openssl-equivalent /dev/urandom hex32) + stored client_secret;
  both added to `secrets/secrets.yaml` via `sops set` (needed `SOPS_AGE_KEY` from the host ssh key:
  `ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key`). Verified: decrypt shows keys
  test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).

**Next:** scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets),
modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the
runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).

## 2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

**Orchestrator steer (2×):** collapse install to a single `nixos-rebuild switch` — convert the
manual deploy scripts into **idempotent-reconcile systemd oneshots** (writeShellApplication, embedded
in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every
activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.

**Refactor done:**
- `modules/packages.nix`: `pkgs.abra` overlay (shared pinned build).
- `modules/proxy.nix`: `deploy-proxy` oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
- `modules/drone.nix`: `deploy-drone` oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from
  /run/secrets), after deploy-proxy.
- `modules/drone-runner.nix`: exec runner (fixed PATH conflict via `lib.mkForce`; allowUnfree for
  drone-runner-exec — Polyform license).
- `modules/secrets.nix`: declared drone_rpc_secret + drone_gitea_client_secret + a sops *template*
  `drone-runner.env` (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
- Removed `scripts/deploy-*.sh`. install.md now = clone + nixos-rebuild switch + preconditions.

**Build/switch:** build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed).
`nixos-rebuild switch` → all three units `active`/`success`:
- `deploy-proxy` success (reconciled traefik), `deploy-drone` → `deploy succeeded 🟢` (drone/drone
  2.26.0, secrets client_secret+rpc_secret v1, drone_env config), `drone-runner-exec` active.

**Verify (commands + output):**
- `docker service ls` → `drone_ci_commoninternet_net_app 1/1`, traefik app+socket-proxy 1/1.
- Via gateway: `…/healthz` → **200**; `/` → **303** (login redirect, correct).
- Runner: journal shows a few startup `cannot ping the remote server (404)` (drone RPC not ready
  yet) then `successfully pinged the remote server` + `polling the remote server capacity=2
  endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec`. **Runner connected via RPC.**

**Remaining for M2 gate:** push a hello-world `.drone.yml` to cc-ci + get a green build. Needs the
cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant
Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint
a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot
the admin.)

## 2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)

**Drone↔Gitea OAuth (scripted, the one manual bootstrap):** logged the bot into Gitea (CSRF cookie
→ form), drove Drone `/login` → Gitea authorize consent (POST `/login/oauth/grant` with _csrf+state+
granted=true) → code callback → Drone `_session_`. Captured the whole flow in
`scripts/bootstrap-drone-oauth.sh` (reads bot creds from env; documented in install.md §2; one-time,
token persists in Drone's data volume).

**Repo activation:** `GET /api/user` → autonomic-bot admin=true; `GET /api/user/repos?latest=true`
synced 12 repos; `POST /api/repos/recipe-maintainers/cc-ci` → active=true, config_path .drone.yml
(sets the Gitea push webhook).

**Green build:** added `.drone.yml` (exec pipeline), pushed (0d89e28). Polled
`/api/repos/recipe-maintainers/cc-ci/builds` → build #1 pending→running→**success**. Steps:
clone success exit 0; hello success exit 0 — log shows `whoami=root`, `abra 0.13.0-beta-06a57de`,
`swarm=active` (ran on the host via the exec runner). **M2 gate met; CLAIMED.**

**Next:** M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + `!testme` exact +
collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with
the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).

## 2026-05-26 — M3 start: bridge secrets + comment-bridge source

**Secrets (sops):** minted a Gitea API token (`cc-ci-bridge`, scopes read:org/user, write:repo/issue),
a Drone API token (`POST /api/user/token`, the stable personal token; rotates on call), and a webhook
HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via
`sops set` (host age identity). secrets.yaml now holds 6 secrets.

**bridge/bridge.py** (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC
(`X-Gitea-Signature` sha256), requires `X-Gitea-Event: issue_comment`, action=created, body trimmed
== `!testme`, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204);
resolves PR head sha+repo; triggers a parameterized Drone build
(`POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC`, custom params → pipeline env);
posts a PR comment linking the run. Secrets read from mounted files; config via env. `/healthz` GET.

**Next:** package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind
traefik at `ci.commoninternet.net/hook` via a reconcile oneshot (modules/bridge.nix); register a
per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab
rejected). That's the M3 gate.

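The first two checks in the `/hook` handler described above (signature, then trigger filter) can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `bridge/bridge.py`: the function names are hypothetical, and the `is_pull` payload field is assumed to be how the handler distinguishes a PR from a plain issue.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Check an X-Gitea-Signature value: hex HMAC-SHA256 over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; a non-ASCII header simply fails to match.
    return hmac.compare_digest(expected.encode(), header_sig.strip().lower().encode())


def is_trigger(event: str, payload: dict) -> bool:
    """True only for a newly created PR comment whose trimmed body is exactly '!testme'."""
    return (
        event == "issue_comment"
        and payload.get("action") == "created"
        and payload.get("is_pull") is True  # assumed field name
        and payload.get("comment", {}).get("body", "").strip() == "!testme"
    )
```

Verifying the signature over the raw bytes before parsing any JSON keeps unauthenticated input away from the rest of the handler; authorization and the Drone trigger would only run once both checks pass.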
## 2026-05-26 — M3: bridge deployed + verified; webhook DELIVERY blocked (Gitea-side)

**Deployed** the comment-bridge as a Nix-built OCI image (no Docker Hub pull) → swarm service on
`proxy`, behind traefik at `ci.commoninternet.net/hook`, via reconcile oneshot `modules/bridge.nix`.
Swarm secrets (webhook_hmac/drone_token/gitea_token) materialised from /run/secrets.

**Verified working (bridge side):**
- `docker service ls` → ccci-bridge_app 1/1.
- `GET /hook/healthz` → 200 **from the sandbox over real public DNS** (ci.commoninternet.net →
  143.244.213.108); also 200 via gateway from cc-ci.
- HMAC logic: bad sig → 401; a manually openssl-HMAC-signed body → 204 (passes sig, ignored as
  non-trigger); wrong event → 204. (Debug log added: `got=/want=/bodylen/seclen`.)
- Registered per-repo `issue_comment` webhook (id 210) on recipe-maintainers/cc-ci → ci.../hook with
  the HMAC. Created scratch PR #1.

**Blocker found:** commenting `!testme` (×several) and Gitea's "Test Delivery" (UI returns 200) yield
ZERO requests at the bridge container. Bridge is publicly reachable by hostname from a 3rd network;
gateway accepts public sources; public DNS correct → Gitea is not *sending* the delivery. Deliveries
panel is AJAX (uninspectable via curl); bot is not Gitea admin (can't read `ALLOWED_HOST_LIST`).
Conclusion: git.autonomic.zone webhook policy (likely `ALLOWED_HOST_LIST`) blocks ci.commoninternet.net.
Recorded in STATUS ## Blocked with operator options (whitelist host, or I pivot bridge to polling).

**Plan:** surface to operator; meanwhile proceed to M4 (harness + install stage) which doesn't depend
on the webhook (dev recipe-CI builds triggerable directly via the Drone API). Revisit M3 gate once the
host is whitelisted or via the polling fallback.

## 2026-05-27 — M4: harness + install stage green (custom-html), guaranteed teardown

**Built the harness:** `runner/harness/abra.py` (abra wrappers w/ gotchas: no --chaos on
undeploy/volume-remove, `-n` everywhere, parse `app ls -S -m` nested {server:{apps}}, timeouts),
`runner/harness/lifecycle.py` (deploy_app forcing `LETS_ENCRYPT_ENV=""` [A1], wait_healthy =
services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, janitor for orphans),
`tests/conftest.py` (`deployed_app` session fixture with finalizer teardown; short unique domain),
`tests/custom-html/test_install.py` (HTTP 200 + Playwright/Chromium content assertion),
`runner/run_recipe_ci.py` (orchestrator: fetch recipe@REF, run stage pytest), `modules/harness.nix`
(`cc-ci-run` = Nix python3+pytest+playwright with PLAYWRIGHT_BROWSERS_PATH from nixpkgs).

**Bugs fixed en route (3):**
1. Swarm config name > 64 chars (long domain) → switched to short `<recipe[:4]>-<6hex>` domain
   scheme (DECISIONS.md).
2. `services_converged` used wrong stack name (replaced hyphens) → abra keeps hyphens, only dots→_.
3. `http_get` connected to the gateway IP (drops SNI, gateway routes by SNI) → use the real URL
   (resolves to gateway on cc-ci, correct SNI). Also teardown now removes the app .env config.

**Green run + teardown (commands + output):**
- `RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py` →
  `tests/custom-html/test_install.py::test_http_reachable PASSED`,
  `::test_playwright_page PASSED` — **2 passed in 57.99s**.
- Leak check after: services 0 / volumes 0 / secrets 0 / containers 0 / env config removed. Clean.

**A1 addressed:** deploy_app forces `LETS_ENCRYPT_ENV=""` (no ACME) on every deploy. M4 CLAIMED.

**M3 still blocked** (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade +
backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).

## 2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)

**Upgrade stage** (tests/custom-html/test_upgrade.py): deploy previous published version
(git-tag sort, second-newest), write a data marker into the served volume (nginx serves
/usr/share/nginx/html, so the marker is HTTP-fetchable), `abra app upgrade` to current, assert
healthy + marker survived. Fix: `upgrade` has no `--chaos` flag (used `-f -D -n`).

**backup-bot-two** deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local
`backups` volume, restic_password abra-generated (only if missing). Fixes: `abra app secret generate`
needs `-m` (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never
hit the journal (D6). `abra app backup create`/`restore` need a real PTY ('input device is not a
TTY') → run via util-linux `script -qec` (harness `_run_pty`; util-linux added to cc-ci-run).

**Backup stage** (test_backup.py): write "original" → `abra app backup create` → mutate to
"mutated" → `abra app restore` → assert state back to "original".

**Full 3-stage run** (`STAGES=install,upgrade,backup`):
- install: 2 passed (http 200 + playwright)
- upgrade: 1 passed (data survives upgrade)
- backup: 1 passed (restore returns pre-mutation state)
- teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1.
M5 CLAIMED.

**M3 still blocked** (webhook; no operator response across several ticks). Plan: if still blocked,
pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.

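The PTY workaround above wraps a command in util-linux `script` so the child sees a terminal. A sketch of how such a wrapper might build its argv is shown here; `pty_argv` is a hypothetical name (the harness's own helper is `_run_pty`, whose body is not shown in this journal), and sending the typescript to `/dev/null` is an assumption.

```python
import shlex


def pty_argv(cmd: list[str]) -> list[str]:
    """Wrap a command so it runs under a pseudo-terminal via util-linux `script`.

    -q suppresses script's banner, -e makes script return the child's exit code,
    and -c takes the command as a single shell string.
    """
    return ["script", "-qec", shlex.join(cmd), "/dev/null"]
```

The resulting list can be passed straight to `subprocess.run(...)`. Because of `-e`, the wrapped command's exit status still decides pass or fail, which matters when a backup or restore step fails partway.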
## 2026-05-27 — Fix adversary findings A2 (dead janitor) + A3 (unverified teardown)

**A2 (janitor matched dead `-pr` filter):** rewrote `harness.lifecycle.janitor` to match the real
run-app naming (`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`), reap via
docker primitives, AND scan `docker service ls` to catch orphans whose `.env` is already gone
(reconstructs the domain from the service name). Age-gated (default 2h, env `CCCI_JANITOR_MAX_AGE`)
so concurrent in-flight runs are never killed.

**A3 (teardown unverified + unconditional .env removal):** `teardown_app` now (1) `docker stack rm`
fallback if `abra undeploy` leaves services, (2) removes volumes/secrets *before* the `.env` and
only drops the `.env` after the stack is confirmed gone, (3) retries docker volume rm (a stopped
task briefly holds the volume), (4) **verifies** no residual services/volumes/secrets and raises
`TeardownError` otherwise — so a partial teardown FAILS the run instead of silently orphaning.

**Re-test (commands + output):**
- Normal install run → 2 passed, verified teardown clean.
- Orphan (deploy, no teardown) → `janitor(CCCI_JANITOR_MAX_AGE=0)` → services/volumes/secrets/env 0.
- **Env-less orphan** (deploy then `rm` the .env, the A3 bad state) → janitor reaps via docker stack
  rm → services/volumes/secrets 0.
- Full 3-stage run (install/upgrade/backup) still green with verified teardown, no TeardownError.

A2/A3 fixed; left for the Adversary to re-test + close.

## 2026-05-27 — M6 (part 1): harness enhancements for recipe #2 + D4 discovery

Before enrolling recipe #2, made the shared harness recipe-agnostic so enrolling a recipe needs no
harness-code change (D5):
- **Per-recipe meta** (`tests/<recipe>/recipe_meta.py`, optional): HEALTH_PATH, HEALTH_OK,
  DEPLOY_TIMEOUT, HTTP_TIMEOUT. conftest reads it; `wait_healthy` gained a `path` param (e.g.
  keycloak `/realms/master`). Defaults preserve custom-html behaviour (verified: install still green).
- **Shared naming** (`harness/naming.py`): single source for the `<recipe[:4]>-<6hex>` domain, used
  by conftest + the orchestrator.
- **D4 recipe-local discovery** (`run_recipe_ci.run_recipe_local`): if a recipe ships `tests/` with
  `test_*.py`, deploy the app, run those tests against the LIVE deployment (contract: env
  `CCCI_BASE_URL` + `CCCI_APP_DOMAIN`), merge as another reported stage, guaranteed teardown. Real
  recipes ship tests/ committed in their repo (clean checkout) → discovered on clone/fetch.
  (custom-html via catalogue is an awkward case — abra refuses an unstaged recipe and
  `abra recipe fetch` resets local commits — so D4 is demonstrated end-to-end with recipe #2
  hedgedoc, which ships committed tests/.)

**Next:** mirror hedgedoc (postgres+hedgedoc, DB-backed) via the mirror+PR flow with a committed
tests/ dir, write tests/hedgedoc/ (install/upgrade/backup + recipe_meta), run all stages + D4 green.

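The optional per-recipe meta file described above amounts to "load a module if it exists, otherwise fall back to defaults". One way this could look is sketched below. The loader name is hypothetical and the default values are placeholders chosen for illustration; the journal does not state the harness's real defaults.

```python
import importlib.util
from pathlib import Path

# Placeholder defaults (assumed); the keys are the ones named in the journal.
DEFAULTS = {
    "HEALTH_PATH": "/",
    "HEALTH_OK": (200,),
    "DEPLOY_TIMEOUT": 300,
    "HTTP_TIMEOUT": 120,
}


def load_recipe_meta(tests_root: str, recipe: str) -> dict:
    """Return DEFAULTS overlaid with any values set in tests/<recipe>/recipe_meta.py."""
    meta = dict(DEFAULTS)
    path = Path(tests_root) / recipe / "recipe_meta.py"
    if path.is_file():
        module_name = f"recipe_meta_{recipe.replace('-', '_')}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for key in DEFAULTS:
            if hasattr(module, key):
                meta[key] = getattr(module, key)
    return meta
```

A recipe without a meta file gets the defaults untouched, which is the property that lets a new recipe be enrolled by adding files under `tests/<recipe>/` only.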
## 2026-05-27 — M6 (part 2): recipe #2 keycloak install green (DB-backed, no harness surgery)

Enrolled keycloak (recipe #2): keycloak 26.6.2 **+ mariadb 12.2** — genuinely DB-backed/multi-service
(vs custom-html stateless). Added only `tests/keycloak/recipe_meta.py` (HEALTH_PATH=/realms/master,
HEALTH_OK=(200,), 600s timeouts) + `tests/keycloak/test_install.py` (realm-endpoint health +
Playwright admin-console login). **No change to runner/harness code** — the recipe-agnostic harness
(per-recipe meta) handled it (D5 evidence).

Run: `RECIPE=keycloak STAGES=install cc-ci-run runner/run_recipe_ci.py` → 2 passed in 545s (keycloak
is slow: image pull + JVM + mariadb migration). Teardown clean (0 keyc-* services/volumes after).

**Next:** D4 demo via a mirror shipping committed tests/ (recipe-local run against live app); then
keycloak upgrade + backup/restore (DB data survival via a realm marker through the admin API).

## 2026-05-27 — M6: D4 recipe-local discovery + recipe #2 enrolled (CLAIMED)

**D4 recipe-local discovery working.** Demo: pushed a committed `tests/test_recipe_local.py` to the
mirror on branch `recipe-maintainers/custom-html@ci/d4-recipe-local`; ran
`RECIPE=custom-html SRC=recipe-maintainers/custom-html REF=ci/d4-recipe-local STAGES=install` →
install 2 passed, then `===== STAGE: recipe-local (D4) =====` ran the recipe-shipped test against
the LIVE app (CCCI_BASE_URL) → 1 passed. Clean teardown (0 orphans).

**Hard-won abra behaviour (DECISIONS.md):** private mirror clone needs the bot token (per-command
`http.extraHeader`, not persisted/logged). abra commands (`app ls`, `secret generate`, version
resolution) silently `git checkout <tag>` the recipe, dropping a PR branch's files — so (1) all
harness abra calls use `-C -o` (chaos+offline = current checkout, no remote fetch), and (2) D4
snapshots the recipe's tests/ to a temp dir right after fetch (later abra cmds still reset it).
Traced the drop step-by-step: app_new ok, deploy ok, but `secret generate` (no flags) and `app ls`
each reset the checkout.

**Recipe #2 = keycloak** (keycloak + mariadb, DB-backed) install green with only
`tests/keycloak/recipe_meta.py` + `test_install.py` — **no runner/harness change** (D5). custom-html
remains 3-stage green (M5). docs/enroll-recipe.md written.

**M6 CLAIMED.** keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5.
**Next:** M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories.

---

## 2026-05-27 — Trigger redesign (polling primary) + resource safety + M3 verified

Session restarted by watchdog (prior tmux died mid-turn with uncommitted bridge WIP). Re-oriented
from STATUS + plan; two orchestrator design changes landed and are now implemented + verified.

**(1) Trigger: POLLING PRIMARY, webhook optional, org-membership auth** (plan §4.1/§1.5; commit
7addb96). Rewrote `bridge/bridge.py`: a poll thread (`poll_loop`, always-on, primary) scans each
`POLL_REPOS` repo's open PRs every 30s for new `!testme`; the `/hook` webhook stays as an optional
admin-registered push optimization. Both share an in-memory comment-id seen-set → a comment seen by
both fires once. First poll marks pre-existing comments seen (no startup re-fire). Authorization now
`GET /orgs/{owner}/members/{user}` (204=member, read-level) + optional `AUTH_ALLOWLIST`, replacing
the admin-requiring `/collaborators/{user}/permission`. Bot never self-registers webhooks.
- Verified org endpoint at read level (bot basic-auth):
  `members/{autonomic-bot,trav,notplants}` → 204; `members/definitely-not-a-member-xyz` → 404.
- Deployed (nixos-rebuild, deploy-bridge reconcile); new container logs:
  `poller (primary) watching ['recipe-maintainers/cc-ci'] every 30s` + `(poll primary + optional webhook)`.
- **End-to-end M3 trigger (poll path):** posted `!testme` on PR #1 (comment 13705, by bot) →
  Drone build **#26** appeared after **6s** (latest was #25); bridge logged
  `[poll] triggered build 26 for cc-ci@d397720a (PR #1, comment 13705) by autonomic-bot`; bridge
  posted back `cc-ci: started CI run for cc-ci @ d397720a → https://drone.ci.commoninternet.net/...`.
  Satisfies D1 (<60s) over the read-only outbound path — no operator webhook whitelist needed.

**(2) Resource safety: bound live test apps** (plan §4.2/§4.3; commit 72ff8e2). MAX_TESTS =
|
||
`DRONE_RUNNER_CAPACITY` = 1 (`modules/drone-runner.nix`) → Drone runs ≤1 build at once, queues the
|
||
rest natively. Per-build timeout = 60m, reconciled best-effort in `modules/drone.nix`
|
||
(`PATCH /api/repos/.../cc-ci {"timeout":60}`, non-fatal). Janitor remains the backstop for
|
||
SIGKILL'd/timed-out builds (reaps orphaned run apps at run-start before each deploy).
|
||
- Verified on host after rebuild: `DRONE_RUNNER_CAPACITY=1`; deploy-drone logged
|
||
`set cc-ci build timeout = 60m`; Drone API confirms repo `timeout: 60`.
|
||
|
||
**Gap noted (next item):** `.drone.yml` still only has the `self-test` pipeline — a bridge-triggered
|
||
build runs the self-test, NOT `runner/run_recipe_ci.py`. M4/M5 ran the orchestrator by hand
|
||
(`cc-ci-run`). Need a recipe-CI pipeline keyed on the `RECIPE` build param (runs
|
||
`cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`,
|
||
`concurrency:{limit:1}`) to connect bridge→Drone→harness end-to-end (required for D2/D10 via real
|
||
`!testme`). Added to Build backlog.
|
||
|
||
**M3 CLAIMED** (gate). Trigger + auth + comment-back demoed live; the webhook-delivery blocker is
|
||
moot now that polling is primary.
|
||
|
||
---
|
||
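The shared seen-set from the trigger entry above can be sketched as follows. This is an illustration under stated assumptions, not the code in `bridge/bridge.py`: the class name `SeenSet` and its methods `prime` and `claim` are hypothetical. The idea is that the first poll primes the set with pre-existing comment ids, and afterwards both the poll thread and the webhook handler call `claim`, which succeeds only once per comment id.

```python
import threading
from typing import Iterable, Set


class SeenSet:
    """Thread-safe set of comment ids; each id can be claimed exactly once."""

    def __init__(self) -> None:
        self._ids: Set[int] = set()
        self._lock = threading.Lock()

    def prime(self, comment_ids: Iterable[int]) -> None:
        """Mark pre-existing comments as seen so startup does not re-fire them."""
        with self._lock:
            self._ids.update(comment_ids)

    def claim(self, comment_id: int) -> bool:
        """Return True for the first caller of a given id, False afterwards."""
        with self._lock:
            if comment_id in self._ids:
                return False
            self._ids.add(comment_id)
            return True
```

A caller would trigger a build only when `claim(comment_id)` returns `True`, so a comment delivered by both the poller and the webhook fires once. The lock matters because the two paths run on different threads.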
## 2026-05-27 — Bridge→Drone→harness integration (recipe-ci pipeline) wired & green

Closed the gap where a bridge-triggered build ran only the self-test. Split `.drone.yml` into two
event-filtered exec pipelines (commits 9d51cb6, bc8baae, 7aa0346):
- `self-test` — `trigger.event: [push]` (M2 sanity on pushes).
- `recipe-ci` — `trigger.event: [custom]` (bridge fires event=custom builds): runs
  `cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`
  (safe at capacity=1), `concurrency:{limit:1}`, and `HOME=/root` (the exec runner otherwise points
  HOME at an empty per-build workspace → abra `FATA directory is empty: .../.abra/servers`).

Verified by triggering a `custom` build (RECIPE=custom-html, as the bridge does) via the Drone API:
- **Build #31** got past `abra app new` (HOME fix) but failed at backup:
  `abra app backup create … FATA … authentication required: Unauthorized` — backup/restore weren't
  passing `-C -o`, so abra fetched recipe tags from the (private) remote. Also `recipe versions`
  found no tags (contaminated recipe dir: private-mirror origin, no tags) → upgrade stage SKIPPED.
- Fixes: `abra.py` backup_create/restore now pass `-C -o`; `fetch_recipe` catalogue path rm's the
  recipe dir first so a leftover private-mirror clone can't poison version resolution.
- **Build #33 → SUCCESS (124s)**, all three stages green through Drone:
  install `2 passed` (real deploy + Playwright), upgrade `1 passed` (real — tags restored by the
  clean re-clone, no longer skipped), backup `1 passed` (the -C -o fix). Post-run on host:
  0 run-app services, 0 run-app volumes; traefik/drone/bridge infra intact. Event filtering works
  (only recipe-ci ran, not self-test).

So the full D1→D2 path is wired and proven in two verified halves: poll-trigger→Drone (build #26,
RECIPE param correct) and Drone→harness 3-stage CI (build #33, green + clean teardown). Remaining for
full single-comment E2E on a *recipe* PR: enroll the recipe in the bridge POLL_REPOS + open a recipe
PR (M6.5/M10 breadth work).

**Adversary findings status (signal for re-test):** A2 (janitor `-pr` filter) and A3 (teardown
verification + `.env`-last ordering) are both already fixed in the current code
(`lifecycle.RUN_APP_RE` hashed-scheme match; `teardown_app` `_residual()` raise + `docker stack rm`
fallback) — awaiting the Adversary's kill-probe re-test on an idle host. A4 (concurrent same-recipe
collision): its named root cause "no Drone concurrency cap (capacity=2)" is eliminated by
MAX_TESTS=capacity=1 — no concurrent runs possible on this single node, so the shared-recipe-dir race
can't occur. No Builder fix outstanding on findings; next milestone work is M6.5 breadth.

---
## 2026-05-27 — M6.5: keycloak full 3-stage GREEN through the Drone recipe-ci pipeline

Ran keycloak (DB-backed, SSO/identity category) end-to-end via the integrated recipe-ci pipeline
(triggered `custom` build #39, RECIPE=keycloak). **Build #39 → success (~31m)**, all three stages
green as separate reported stages:
- install `2 passed` (8m30s): `test_realm_endpoint_healthy` (/realms/master 200) + Playwright admin
  console login.
- upgrade `1 passed` (10m10s): `test_upgrade_preserves_realm` — realm marker written pre-upgrade
  survives the previous→latest upgrade (DB data survival).
- backup `1 passed` (8m15s): `test_backup_mutate_restore` — backup→mutate→restore returns original.

Clean teardown verified on host: 0 keyc services, 0 keyc volumes. keycloak cold start is slow on
this VM (Quarkus augmentation ~80s + Liquibase schema init), so each deploy is ~5-8m — well within
the 60m build timeout; that's why the run took ~31m. No harness surgery (D5): keycloak runs off
`tests/keycloak/{recipe_meta,test_install,test_upgrade,test_backup}.py` + `kc_admin.py` only.

This both advances M6.5 (first DB-backed recipe full 3-stage) and confirms the recipe-ci integration
works on a heavy DB-backed recipe (Drone→harness→3 stages→teardown). Next M6.5: enroll recipes 3–6
covering the remaining D10 categories (stateful-no-DB, multi-service+S3, large-volume, etc.).

---
## 2026-05-27 — M6.5: cryptpad (recipe #3) enrolled + full 3-stage green; fixed a real backup bug

Enrolled **cryptpad** (stateful, no external DB — the D10 "stateful/no-DB" category). No shared-harness
surgery beyond a *generic* feature: added per-recipe **EXTRA_ENV** (recipe_meta.py dict or
domain-callable) applied in `deploy_app` at every deploy path. cryptpad uses it for its required
distinct `SANDBOX_DOMAIN` (a sibling subdomain under the wildcard, so no cert work). Data-survival
tests write a marker into the backed-up `cryptpad_data` volume and read it via `exec_in_app`
(cryptpad's datastore isn't HTTP-served like custom-html).

Host runs (HOME=/root, cc-ci-run): install **2 passed** (~2m; http 200 + Playwright loads cryptpad),
upgrade **1 passed** (~1m; marker survives previous→current), backup **1 passed** after a fix
(below). Clean teardown (0 cryp services/volumes).

**Real bug found+fixed — backups were silently mis-wired (set_env newline).** cryptpad backup first
failed: `abra app backup create` → backup-bot-two's `/usr/bin/backup` raised
`KeyError: 'RESTIC_REPOSITORY'`. Root cause: backup-bot-two's `.env.sample` ends with a *newline-less*
comment line, and the reconcile's `set_env` did a bare `printf >> .env`, gluing
`RESTIC_REPOSITORY=/backups/restic` onto that comment → commented out. abra `--debug` confirmed the
backupbot env map lacked `RESTIC_REPOSITORY`, and `docker exec backupbot printenv RESTIC_REPOSITORY`
was empty. Fix: `set_env` now ensures a trailing newline before appending (modules/backupbot.nix +
modules/drone.nix, same latent bug). After rebuild: `.env` has a clean `RESTIC_REPOSITORY=` line, the
backupbot container has `RESTIC_REPOSITORY=/backups/restic`, and cryptpad backup→mutate→restore
passes. NOTE: keycloak backup (build #39) passed off an *earlier, non-corrupted* backupbot deploy;
worth a re-verify, but the mechanism is now correct/reproducible. Triggered Drone build #46 (cryptpad)
as the canonical recipe-ci run.

---
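The `set_env` newline bug from the cryptpad entry above is easy to reproduce in isolation. The sketch below is a Python illustration of the fixed logic only; the real `set_env` lives in the Nix modules as shell, and this version appends without checking whether the key already exists, which is a simplifying assumption.

```python
from pathlib import Path


def set_env(env_file: str, key: str, value: str) -> None:
    """Append KEY=value to env_file on its own line.

    If the file's last line has no trailing newline (for example a
    newline-less comment), terminate it first so the new assignment is
    not glued onto that line.
    """
    path = Path(env_file)
    existing = path.read_text() if path.exists() else ""
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with path.open("a") as fh:
        fh.write(f"{prefix}{key}={value}\n")
```

With a file ending in an unterminated comment line, a bare append would yield a single commented-out line containing `RESTIC_REPOSITORY=/backups/restic`; with the prefix check the assignment lands on a line of its own.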
## 2026-05-27 — M6.5: matrix-synapse (recipe #4, DB+media/large-volume) full 3-stage green

Enrolled matrix-synapse (synapse `app` + postgres `db` + nginx `web`) — the large-volume/DB+media
D10 category. No harness surgery (server_name = DOMAIN; no EXTRA_ENV needed). Host runs (cc-ci-run):
install **2 passed** (~2.7m; client API 200 + real `/_matrix/client/versions` JSON), upgrade
**1 passed** (~2.3m; postgres marker survives previous→current), backup **1 passed** (~1.5m). Clean
teardown (0 matr services). The data-survival tests use a `ci_marker` postgres row exec'd via
`psql` in the `db` service — this exercises the recipe's real DB-dump backup hook
(`backupbot.backup.pre-hook=/pg_backup.sh backup` / `restore.post-hook`), the meaningful matrix data
path (not a plain volume copy). Worked first try (the set_env/RESTIC fix holds for hook-based
backups too). Triggering the canonical Drone recipe-ci run.

4 of 6 D10 recipes now green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB),
matrix-synapse (DB+media/large-volume). Remaining categories: multi-service+S3 (lasuite-docs) and
TLS-passthrough (bluesky-pds).

---
## 2026-05-27 — M6.5: lasuite-docs (recipe #5, multi-service + S3/MinIO) full 3-stage green

Enrolled lasuite-docs (the object-storage/S3 + multi-service D10 category): a 9-service stack
(frontend app + Django backend + celery + y-provider + docspec + postgres + redis + minio + nginx).
Host runs (cc-ci-run): install **2 passed** (~2.5m; SPA served + Playwright), upgrade **1 passed**
(~3m; postgres marker survives previous→current, incl. cold-pulling the older images), backup
**1 passed** (~2.3m; pg_backup.sh dump/restore). Clean teardown.

Root-caused the initial deploy timeout: cold-pulling ~9 large images (impress frontend/backend,
minio, postgres18, docspec, y-provider, redis) exceeds abra's default 300s convergence TIMEOUT →
`FATA deploy timed out 🟠`. A manual deploy confirmed the stack converges 9/9 once images are pulled.
Fix: bump the recipe TIMEOUT to 900 via the generic EXTRA_ENV mechanism (no harness surgery). OIDC is
config-only (Django `manage.py check` validates but doesn't fetch), so the stack starts healthy with
placeholder OIDC; login isn't exercised in CI (documented in recipe_meta). Data-survival uses a
postgres marker (docs/docs) via the pg_backup hook.

5 of 6 D10 recipes green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB),
matrix-synapse (DB+media/large-volume), lasuite-docs (multi-service + S3/MinIO). Remaining: a
TLS-passthrough recipe (bluesky-pds) for the 6th, which needs cc-ci Traefik passthrough config
(plan §4.0 caveat) — the hardest infra-wise.

---
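The EXTRA_ENV mechanism used above (a dict, or a callable that receives the app domain) could be resolved roughly like this. It is a sketch under assumptions: `resolve_extra_env` is a hypothetical helper name, and the `sandbox-` prefix in the usage note is an invented example value, not cryptpad's real subdomain scheme.

```python
from typing import Callable, Dict, Optional, Union

# EXTRA_ENV in a recipe_meta.py may be a plain dict or a function of the domain.
ExtraEnv = Union[Dict[str, str], Callable[[str], Dict[str, str]]]


def resolve_extra_env(extra_env: Optional[ExtraEnv], domain: str) -> Dict[str, str]:
    """Return the env overrides for one deploy as a fresh dict."""
    if extra_env is None:
        return {}
    if callable(extra_env):
        # Domain-dependent values are computed per run.
        return dict(extra_env(domain))
    return dict(extra_env)
```

A static override such as `{"TIMEOUT": "900"}` passes through unchanged, while a callable such as `lambda d: {"SANDBOX_DOMAIN": "sandbox-" + d}` lets a recipe derive a value from the per-run domain.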
## 2026-05-27 — M6.5 COMPLETE: n8n (recipe #6) full 3-stage green — all 6 D10 recipes done

Enrolled n8n (workflow automation; single `app` service, stateful via the /home/node/.n8n volume,
normal terminate-at-Traefik). Host runs: install **2 passed** (~3.8m; /healthz 200 + Playwright
editor), upgrade **1 passed** (~1.3m; marker in /home/node/.n8n survives), backup **1 passed**
(~0.8m; backupbot.backup.path file backup). Clean teardown. (Caught a sync gap first: committed the
tests but forgot to tar tests/n8n to the host → run skipped "no stage test files"; synced + re-ran.)

n8n is recipe #6 in place of bluesky-pds (TLS-passthrough), swapped per DECISIONS (caddy self-ACME
conflicts with cc-ci's no-ACME/static-wildcard design).

**All 6 D10 recipes now have a full 3-stage green run (host):**
1. custom-html — simple/stateless
2. keycloak — SSO/identity + DB (Drone #39)
3. cryptpad — stateful/no-DB (Drone #46)
4. matrix-synapse — DB+media/large-volume (Drone #51)
5. lasuite-docs — multi-service + S3/MinIO/object-storage (Drone #57)
6. n8n — workflow automation (Drone canonical run triggering now)

All 5 required D10 categories covered. Triggering n8n canonical Drone run, then claiming the M6.5 gate.

---
## 2026-05-27 — M8/D7: results dashboard live (overview + badges)

Built the results dashboard (dashboard/dashboard.py + modules/dashboard.nix): a stdlib HTTP service
(Nix-built OCI image, swarm service on proxy, reconcile oneshot like bridge/drone) that polls the
Drone API for recipe-CI builds (event=custom), groups latest-run-per-recipe, and renders a
YunoHost-CI-like overview at **ci.commoninternet.net/** with pass/fail/running badges, last ref,
when, and a link to the canonical Drone run. Plus /badge/<recipe>.svg embeddable badges.

Verified live via the public gateway: overview lists exactly the 6 enrolled recipes (cryptpad,
custom-html, keycloak, lasuite-docs, matrix-synapse, n8n) each **success**; `/badge/keycloak.svg` →
200 image/svg+xml; `/healthz` → 200; **`/hook` still routes to the bridge** (200) — the bridge's
Host && PathPrefix(`/hook`) rule keeps priority over the dashboard's Host-only rule.

Two fixes en route: (1) filter out the cc-ci repo's own name as a recipe row (Adversary !testme on
the cc-ci PR showed a spurious cc-ci=failure); (2) **content-hash image tag** — a fixed `:latest`
tag + unchanged stack spec does NOT roll the swarm service on a code change, so the tag is now
derived from a hash of dashboard.py → `docker stack deploy` rolls reliably (reproducible/self-heal).
NOTE: the bridge image has the same latent `:latest` issue (only rolled this session because its
.nix env also changed) — worth the same content-tag treatment (backlog).

Remaining M8 piece: PR-comment **outcome reflection** — the bridge posts the start/run-link comment
but doesn't yet update it with the final pass/fail (needs a Drone build-completion hook or the
bridge polling build status). Overview + badges (the core of D7) are done.

---
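The content-hash image tag from the dashboard entry above can be illustrated in a few lines. The real tag is derived inside the Nix build, so this Python version is only a sketch of the idea; the function name `content_tag`, the choice of SHA-256, and the 12-character truncation are assumptions.

```python
import hashlib
from pathlib import Path


def content_tag(*source_files: str, length: int = 12) -> str:
    """Derive a short image tag from the bytes of the given source files.

    Any change to a file changes the tag, so the stack spec changes and
    the swarm service is rolled on the next deploy.
    """
    digest = hashlib.sha256()
    # Sort so the tag does not depend on argument order.
    for name in sorted(source_files):
        digest.update(Path(name).read_bytes())
    return digest.hexdigest()[:length]
```

The image would then be referenced by this tag rather than `:latest`, which makes an unchanged source produce an identical spec (no roll) and a changed source produce a new one (roll).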
## 2026-05-27 — M8/D7 complete: PR-comment outcome reflection + gate claim

Added outcome reflection to the bridge: after triggering, a daemon watcher polls the Drone build to
completion and edits the run-link PR comment to ✅ passed / ❌ <status> (Gitea PATCH
issues/comments/{id}). Gave the bridge image a content-hash tag so the swarm service actually rolls
on bridge.py changes (same latent :latest no-roll issue the dashboard had).

Verified end-to-end: posted a fresh `!testme` on PR #1 → poller fired → "started" comment posted →
build #76 (RECIPE=cc-ci, fails fast: no tests/cc-ci) → within ~20s the **same comment was edited to
`cc-ci: run for cc-ci @ d397720a ❌ failure → …/76`**. The pass/fail now mirrors onto the PR comment.

D7 fully met: per-run logs (Drone UI) + overview page with badges (dashboard, live) + PR comment
links back AND reflects the outcome. Claiming the M8 gate.

---
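A rough sketch of the outcome watcher described above: poll a build's status until it leaves the in-progress states, then format the edited comment. The status fetcher is injected so the HTTP details stay out of the example. The names `watch_build` and `outcome_comment`, the in-progress set, and the exact wording of the passed case are assumptions; only the failure wording is taken from the verified comment quoted in the entry.

```python
import time
from typing import Callable

# Assumption: any status outside this set is treated as final.
IN_PROGRESS = {"pending", "running"}


def watch_build(
    get_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until the build finishes; return the final status.

    Returns "timeout" if the build is still in progress at the deadline.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval)


def outcome_comment(recipe: str, sha: str, status: str, url: str) -> str:
    """Render the comment body that replaces the 'started' comment."""
    verdict = "✅ passed" if status == "success" else f"❌ {status}"
    return f"cc-ci: run for {recipe} @ {sha} {verdict} → {url}"
```

Running `watch_build` in a daemon thread per triggered build keeps the poll loop free, and injecting `sleep` and `clock` makes the watcher testable without waiting.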
## 2026-05-27 — M10/D10: real !testme path proven on custom-html; enrolling the breadth set

Wired the real-PR path end-to-end and proved it on custom-html. `!testme` on
recipe-maintainers/custom-html#2 → bridge poller fired → recipe-ci build (SRC=mirror, REF=PR head
db9a9502) → **build #84 success, all 3 stages green** (install 2✓, upgrade 1✓ — now runs for real,
backup 1✓) → bridge comment edited to ✅ passed. Clean teardown.

Three fixes to make the real-PR path exercise the upgrade stage (mirror PR clones carry no tags):
1. fetch_recipe (SRC+REF) read-only fetches the published version tags from the PUBLIC upstream
   (`git fetch <upstream> refs/tags/*:refs/tags/*` — bare `--tags` errored "no remote HEAD"); plain
   git, never pushes to the mirror (guardrail-safe).
2. abra.upgrade now passes `-o` (offline) — it was 401'ing trying to fetch tags from the private
   mirror origin; offline uses the local (upstream-populated) tags.
3. (earlier) backup/restore already pass `-C -o`.

Now firing !testme on the other recipes' open PRs (keycloak#1, matrix-synapse#1, lasuite-docs#1,
n8n#1) — they queue at MAX_TESTS=1. cryptpad has no open PR → opening one next.

---
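The read-only tag fetch from fix 1 above amounts to a single `git fetch` with an explicit tag refspec. A minimal sketch follows, assuming the recipe is already cloned into `recipe_dir`; the helper names are hypothetical and the upstream URL is supplied by the caller.

```python
import subprocess
from typing import List


def tag_fetch_argv(recipe_dir: str, upstream_url: str) -> List[str]:
    """Build the git command that copies upstream tags into the local clone."""
    # The explicit refspec maps every upstream tag to the same local tag name.
    return ["git", "-C", recipe_dir, "fetch", upstream_url, "refs/tags/*:refs/tags/*"]


def fetch_upstream_tags(recipe_dir: str, upstream_url: str) -> None:
    """Run the fetch; raises CalledProcessError if git exits non-zero."""
    subprocess.run(tag_fetch_argv(recipe_dir, upstream_url), check=True)
```

Because this only fetches, it never writes to the mirror, which matches the guardrail noted in the entry.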
## 2026-05-27 — M10/D10: real !testme breadth runs — 5/6 green, lasuite-docs upgrade retry

Fired !testme on all 6 recipe PRs (capacity=1, sequential). Results (real PR-triggered, full 3-stage):
- custom-html #84 ✅ (PR head db9a9502)
- keycloak #86 ✅ (DB realm marker survives upgrade)
- matrix-synapse #87 ✅ (postgres marker, pg_backup hook)
- n8n #89 ✅
- cryptpad #90 ✅ (test PR #2 opened via Gitea API: branch ci/testme + .ci-testme marker)
- **lasuite-docs #88 ❌** — install ✅ + backup ✅, but UPGRADE failed: `abra app upgrade … -o`
  → `FATA deploy failed` (a convergence failure during the 9-service rolling upgrade prev→latest,
  not a timeout). It PASSED on the host/catalogue run, and ran right after the heavy matrix build,
  so likely transient resource contention. Re-fired !testme on lasuite-docs#1 to test
  transient-vs-persistent.

So the real-!testme path + the upgrade fixes (upstream tags + `upgrade -o`) work across simple, DB,
DB+media, workflow, and stateful recipes. lasuite-docs (the object-storage/S3 category, required)
needs its upgrade to pass on the real path for the 6/6 D10 proof.

---
## 2026-05-27 — M10: 5/6 real-!testme green; lasuite-docs blocked on Docker Hub rate limit (A1)

lasuite-docs #88/#92 upgrade failed "deploy failed" → diagnosed: node disk at 90% (2.7G free) — a
9-service rolling upgrade couldn't converge. Pruned 30 unused images (reclaimed 12GB → 15G free).
Retry #93: got further (5/8 services up) but redis task Rejected "No such image: redis:8.2.6" →
`docker pull redis:8.2.6` on the node = `toomanyrequests: unauthenticated pull rate limit`. So the
prune fixed disk but forced re-pulls that hit Docker Hub's anonymous limit (A1 registry-creds
finding, §1.5/§4.4). Recorded in STATUS ## Blocked + DECISIONS; surfaced to operator (provide Docker
Hub creds). 5/6 recipes green via real !testme; lasuite install+backup green, upgrade gated.
Pivoting to M9 (docs/reproducibility, unblocked) while the limit resets / creds arrive.

---
## 2026-05-27 — lasuite quota-window retry insufficient; halting retries pending creds (3rd attempt)

Re-fired lasuite-docs !testme during the apparently-eased window (#96). The cached image redis:8.2.6
gave "up to date", but the LATEST version's uncached redis:8.6.3 → `toomanyrequests` again. So the
anonymous quota isn't reset enough for a full 9-service × 2-version deploy. Cancelled #96 + tore down
clean. This is the 3rd confirmation the blocker is the Docker Hub rate limit. Per anti-thrash:
**halting lasuite retries until the operator provides Docker Hub creds** (A1, STATUS ## Blocked).
5/6 D10 recipes remain green via real !testme. Pivoting to M9 (docs/reproducibility) — fully
unblocked, no image pulls.

---
## 2026-05-27 — M10/D10 BUILDER-COMPLETE: all 6 recipes green via real !testme

Diagnosed the lasuite-docs upgrade failure with an instrumented host run: `abra app upgrade` reported
`FATA deploy failed` while all 9 services were actually 1/1 healthy — abra's convergence poll gives
up too early on the slow stop-first rolling upgrade (pulling new images). Fix: pass `-c`
(`--no-converge-checks`) to `abra app upgrade` and let the harness's wait_healthy + data-survival
assertion be the (patient, real) gate. (Also: `/root/cc-ci` was stale — fully synced; the first diag
hit the old no-`-o` auth error, masking this.)

**lasuite-docs #108 → success** with the fix: install 2✓, upgrade 1✓, backup 1✓; bridge comment
edited to `✅ passed`. So **all 6 D10 recipes are green via REAL `!testme` on a PR**, full 3-stage,
comment-reflected, clean teardown:

| recipe | category | build |
|---|---|---|
| custom-html | simple/stateless | #84 |
| keycloak | SSO/identity + DB | #86 |
| matrix-synapse | DB + media / large-volume | #87 |
| n8n | workflow automation | #89 |
| cryptpad | stateful / no external DB | #90 |
| lasuite-docs | multi-service + S3/MinIO/object-storage | #108 |

All 5 required D10 categories covered. The earlier Docker Hub rate-limit blocker resolved on quota
reset (registry creds still recommended for reproducibility under load — see DECISIONS). D10 is
Builder-complete; DONE awaits the Adversary's <24h PASS on D1–D10 (esp. independent D10 verification).

---
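The "patient gate" idea from the entry above (skip abra's own convergence check and wait until every service reports all replicas running) can be sketched as below. This is not the harness's `wait_healthy`: the names `is_converged` and `wait_converged` are hypothetical, the replica strings are assumed to look like the `1/1` values in the REPLICAS column of `docker service ls`, and the 900s default is an illustrative value.

```python
import time
from typing import Callable, List


def is_converged(replicas: str) -> bool:
    """True when a 'running/desired' string such as '1/1' shows all replicas up."""
    running, _, desired = replicas.split()[0].partition("/")
    return desired.isdigit() and desired != "0" and running == desired


def wait_converged(
    list_replicas: Callable[[], List[str]],
    timeout: float = 900.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every service is converged; False if the deadline passes."""
    deadline = clock() + timeout
    while True:
        states = list_replicas()
        if states and all(is_converged(r) for r in states):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A generous timeout here is what lets a slow stop-first rolling upgrade finish pulling images before the run is judged, with the data-survival assertion as the final check.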
## 2026-05-27 — D10 PASS (6/6); DONE now blocked only on D8 live VM rebuild (Adversary)

Adversary logged **D10 PASS (6/6) @11:57Z** (all 6 recipes green via real !testme, all categories) +
**M8/D7 PASS, D8-core PASS, D9 PASS** this window. No VETO, no open findings. Audited REVIEW: D1
(=M3), D2 (=M5/M6.5/D10), D3 (=M4), D4/D5 (=M6/M6.5), D6 (=M7), D7 (=M8), D9, D10 all have <24h PASS.
**The sole remaining item is D8's full live blank-VM rebuild** — the Adversary marked it "core PASS;
live rebuild pending — to complete before DONE," deferred on the premise that it needs operator
registry creds. That premise is now OBSOLETE (D10 went 6/6 with no creds; rate limit was transient;
real fix was `-c`). Surfaced in STATUS that the VM rebuild is feasible now without creds. NOT writing
`## DONE` until REVIEW shows a full D8 PASS (per §6.1 — every D1–D10 needs a <24h PASS). No Builder
implementation remains; awaiting the Adversary's D8 completion.

---
## 2026-05-27 — ## DONE

Adversary logged **D8 PASS** (byte-identical `nixos-rebuild build` closure == `/run/current-system`
+ documented-alternative: a throwaway-VM live rebuild is infeasible by design — its host key isn't a
sops recipient and the A1 preconditions bind to cc-ci itself) and the **DONE-readiness table: all
D1–D10 PASS <24h, no VETO**, explicitly: "the DONE handshake (§6.1) is CLEARED — Builder may flip
STATUS → DONE."

Wrote `## DONE` to STATUS.md with the D1–D10 evidence table. Every Definition-of-Done item is
independently Adversary-verified. **Stopping the self-paced loop** (no further ScheduleWakeup).

Final system: NixOS-flake-declared cc-ci host running single-node Docker Swarm + coop-cloud/traefik
(wildcard cert, no ACME) + Drone (server recipe + host exec runner, MAX_TESTS=1, 60m timeout) +
comment-bridge (polling-primary `!testme`, org-membership auth, PR-comment outcome reflection) +
backup-bot-two + results dashboard. `!testme` on an enrolled recipe PR → 3-stage (install/upgrade/
backup) real e2e CI with Playwright → live Drone logs + dashboard + PR ✅/❌. Six recipes proven.