refactor(1b): RL6 — move Builder protocol files into machine-docs/ (README stays root)
git mv STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md -> machine-docs/. README.md kept at root (operator
decision). Updated in-repo refs: README (status line + lint section + Loop-state section) and
docs/install.md -> machine-docs/...

Safe to move now: launch.sh already has resolve_state() (prefers machine-docs/ else root) used by
every STATUS/REVIEW read, and the running watchdog (pid 133191) was restarted AFTER that update,
so it is location-agnostic. scripts/lint.sh -> lint: PASS post-move. Adversary moves its own
REVIEW*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
47
machine-docs/BACKLOG-1b.md
Normal file
@@ -0,0 +1,47 @@
# BACKLOG — Phase 1b (review & lint pass)

Phase-namespaced backlog. Builder owns `## Build backlog`; Adversary owns `## Adversary findings`.

## Build backlog

### W0 — Tooling + format (RL1) — DONE (Adversary PASS @2026-05-27)
- [x] Add lint tooling to the flake: a `lint` devshell (nixpkgs-fmt, statix, deadnix, ruff,
  shellcheck, shfmt, yamllint) built from the pinned nixpkgs.
- [x] Add a `lint` entrypoint script (`scripts/lint.sh`) with check + `--fix` modes; tool configs
  (ruff, yamllint, etc.).
- [x] Auto-format the codebase (nix + python + shell).
- [x] Fix remaining lint findings (statix/deadnix/ruff-lint/shellcheck) without weakening any test.
- [x] Wire a `lint` stage into `.drone.yml` (push event); verified green from a clean checkout
  (Adversary cold PASS + break-it probe).

### W1 — Review checklist + fixes (RL2)
- [x] Run the §3 white-box checklist (Builder side): all blocking invariants hold (tests-real,
  harness-DRY, nix-idempotent, no-footguns, no-secrets, log-redaction); no fix needed; no advisory
  to file. Recorded in JOURNAL-1b. Awaiting Adversary's own §3 pass #2 to confirm RL2.

### W2 — Re-verify + document (RL3/RL4)
- [x] RL4 docs: README "Linting & formatting" (local + CI-enforced); architecture.md `nix/` layout;
  decisions in DECISIONS.md (lint tooling, RL5/RL6).
- [x] Rebuild canonical cc-ci to the cleaned+RL5 closure (`8i3jcad9`) so `build == running`; healthy
  (0 failed, stacks up, public dashboard 200).
- [ ] **RL3**: Adversary cold re-verification of all D1–D10 (now also covers the RL5 byte-identical
  rebuild). Gate claimed in STATUS-1b.
- [ ] On full PASS handshake, write `## DONE` to STATUS-1b.md.

### RL5 — Nix-folder consolidation (operator §7) — DONE
- [x] `modules/`→`nix/modules/`, `hosts/`→`nix/hosts/`; flake at root (#cc-ci unchanged); paths fixed;
  docs updated; builds byte-identical `8i3jcad9`; lint PASS; canonical switched + healthy.

### RL6 — protocol files → machine-docs/ (operator §7) — DEFERRED (coordinated, LAST)
- [ ] `git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md machine-docs/` (README stays root);
  update refs. MUST be lockstep with orchestrator (launch.sh + watchdog restart). Do as the final
  1b step; flag the orchestrator first. Not while a phase transition is pending.

### Advisories triaged (from Adversary §3 pass #2)
- [idea] Share the `old_app` upgrade fixture across recipe suites instead of per-recipe copy-paste —
  advisory only (per-recipe upgrade tests are by design; not a harness-DRY blocker). Defer to Phase 2.
- App-secret redaction (`cc-ci-run` Drone step not wrapped by `run_stage_redacted`) — Adversary RL3/D6
  behavioral leak test re-checks published logs + dashboard. Adversary-owned watch-item.

## Adversary findings
(empty — Adversary owns this section)
56
machine-docs/BACKLOG-1c.md
Normal file
@@ -0,0 +1,56 @@
# BACKLOG — Phase 1c

Single-writer rule (§6.1): Builder edits `## Build backlog`; Adversary edits `## Adversary findings`.

## Build backlog

Method W1–W6 from the phase plan §5. Each milestone ends with an Adversary gate.

- [x] **W2 — Secrets repo + cert into git.** (build items done; awaiting Adversary gate)
  - [x] Create private repo `recipe-maintainers/cc-ci-secrets` (bot admin, private).
  - [x] Move secrets + add wildcard cert+key as sops secrets (root `secrets.yaml`; sha256 verified).
  - [x] Wire base flake to consume `cc-ci-secrets` — **git submodule** at `secrets/` (DECISIONS).
  - [x] secrets.nix: `wildcard_cert`/`wildcard_key` → `path=/var/lib/ci-certs/live/*`.
  - [x] proxy.nix: cert reframed as sops-from-git.
  - [x] Verify byte-identical `build`==`/run/current-system` (`vh6vwxbl…`); git-clone `?submodules=1` matches too.
  - [x] Verify clean switch on cc-nix-test; live TLS served from git cert (ssl_verify=0).
  - [x] **Gate W2 CLAIMED** → Adversary verifies byte-identical + TLS-from-git-cert.
- [x] **W1 — Headroom.** Resized `cc-nix-test` 6→4 GB (stop→PATCH→start via Incus API); healthy at 4 GB,
  0 failed units, all stacks 1/1, cert survived reboot via sops, TLS 200. Running RAM 8 GB.
- [x] **W3 — Throwaway VM.** `ccci-throwaway` (incus-base, 4 GB/20 GB) reachable at 100.126.124.86
  (used live TS_AUTH_KEY; workspace key stale). Bootstrap age key provisioned in W4.
- [x] **W4 — Reproducible live rebuild.** Fresh blank VM + recovery age key only → `git clone
  --recursive` + ONE `nixos-rebuild switch ?submodules=1` → running/0-failed, byte-identical
  `ld19aj2`==cc-ci, 6 stacks 1/1, all secrets+cert decrypt, TLS leaf==git cert. Found+fixed a
  concurrent-abra race (serialized reconcilers). **Gate W4 CLAIMED** (awaiting Adversary W5).
- [ ] **W5.5 — Functional-acceptance e2e (E2E-TESTME, operator-gated).** Authority:
  `cc-ci-plan/test-e2e-testme-acceptance.md`. After C4/C5 PASS + orchestrator renames rebuilt VM→
  cc-nix-test + confirms public gateway + SIGNALS: `!testme` (bot) on a fast enrolled recipe
  (custom-html); verify E1–E6 (self-check 200/cert → new Drone build via bridge → app reachable
  EXTERNALLY at `<app>.ci.commoninternet.net` w/ valid cert+content → real assertions pass → clean
  undeploy → reported). Evidence→JOURNAL-1c, verdict→STATUS/REVIEW-1c. Fail⇒fix in git, re-run.
  Do NOT start before the signal; keep VM stack up. Adversary independently verifies.
- [ ] **W5 — Adversary cold proof + honest D8.** Adversary repeats W4 independently; rewrites D8
  evidence (static+live), removes "infeasible by design". Accept: Adversary D8 live-rebuild PASS
  (or narrow signed-off limitation per C5).
- [ ] **W6 — Cleanup + docs + final sizing.** Destroy throwaway VM; update docs (C7); decide+apply
  final cc-nix-test sizing. Accept: no leftover; docs match; flip STATUS-1c → `## DONE`.

## Adversary findings

- [x] **ADV-1c-1 [adversary] — `docs/architecture.md` not updated to the 1c model (blocks C7). CLOSED @2026-05-27 20:10Z (Adversary re-verified).**
  Fixed by Builder (`6276bfd`/`2a5affc`). Re-read at HEAD: secrets row now = "`secrets/` = **cc-ci-secrets submodule** … ALL secrets incl. wildcard cert+key sops-encrypted in git … base holds **no** secret material … decrypted by the bootstrap age key (`sops.age.keyFile`), host-derived or **off-box recovery key on a fresh/cloned host**; one age key the only secret not in git"; Network/TLS + swarm rows now say the cert is "**sops-decrypted from git** (`cc-ci-secrets`) to `/var/lib/ci-certs/live/`". No stale pre-1c phrasing remains. → C7 met. (Minor non-blocking note: the *external* orchestrator doc `/srv/cc-ci/cc-ci-plan/plan.md §1.5/§4.0/§4.4` still has pre-1c cert wording, but it's outside the repo / not loop-git-managed and not the doc a new engineer installs from — the repo docs install/secrets/architecture are authoritative and correct.)

  ~~Original finding:~~
  C7 requires `architecture.md` reflect the new model, but it still describes the **pre-1c** layout:
  - Line ~17 (secrets row): "`modules/secrets.nix` + `secrets/secrets.yaml` (sops-nix) | Infra secrets,
    decrypted at activation **via the host SSH key** as the age identity" — no mention of the private
    **`cc-ci-secrets` repo / git submodule** split, the **recovery age key** bootstrap for a fresh host,
    or that the **wildcard cert+key are sops secrets in git** (C1/C2/C3 — the core of 1c).
  - §Network/TLS (lines ~40–41): cert described as "**pre-issued** wildcard cert at
    `/var/lib/ci-certs/live/`" (out-of-band), not **sops-decrypted-from-git** to that path.
  Repro: `grep -n "host SSH key\|secrets/secrets.yaml\|pre-issued wildcard" docs/architecture.md`.
  A new engineer reading it gets the wrong mental model of where secrets/cert live. **Fix:** update the
  secrets row + Network/TLS section to the 1c model (cc-ci-secrets submodule, cert sops-in-git decrypted
  at activation, recovery-key as the one out-of-band bootstrap secret), consistent with install.md/secrets.md.
  Only the Adversary closes this, after re-reading the updated doc. (Doc gap — not a VETO.)
231
machine-docs/BACKLOG.md
Normal file
@@ -0,0 +1,231 @@
# BACKLOG — cc-ci

Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adversary edits only
`## Adversary findings`. Closing an item = checking the box in your own section.

## Build backlog

### M0 — Foundations
- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
- [x] sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml;
  decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
- [x] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
  → CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)

### M1 — Swarm + abra target
- [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy`
  overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix):
  wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV
  empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert
  served, 0 ACME log lines.
- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS
  (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
- [x] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean →
  CLAIMED 2026-05-26, awaiting Adversary.

### M2 — Drone online
- [x] Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app.
  Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
- [x] hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone +
  hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
- [x] Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary.
  OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2).

### M3 — Comment bridge
- [x] comment-bridge service: polling PRIMARY (read-only, ≤30s) + optional admin webhook; !testme
  exact match; org-membership auth (`GET /orgs/{owner}/members/{user}` 204) + allowlist; Drone API
- [x] PR comment posting with run link
- [x] Gate: M3 — live demo on scratch PR; auth enforced → CLAIMED 2026-05-27. Posted `!testme` on
  PR #1 → poll fired in 6s → Drone build #26 for head d397720a → bridge commented run link back.
  Org-membership auth verified (bot/trav/notplants 204, non-member 404 at read level).

### Bridge→Drone→harness integration (connects M3 trigger to M4/M5 recipe CI; blocks D2/D10 via !testme)
- [x] Add a recipe-CI pipeline to `.drone.yml` keyed on `event=custom`: runs
  `cc-ci-run runner/run_recipe_ci.py` STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`,
  `concurrency:{limit:1}`, `HOME=/root`. Self-test pipeline now `event=push`. (commits 9d51cb6+)
- [x] Verify a recipe build runs the full 3-stage CI through Drone (not self-test): **build #33 →
  success**, install/upgrade/backup all green, clean teardown (0 orphans). HOME + backup `-C -o`
  + clean-reclone fixes applied.
- [ ] Full single-comment E2E: enroll a recipe in the bridge `POLL_REPOS` + open a recipe PR →
  `!testme` → full 3-stage CI + PR comment outcome (folds into M6.5/M10 breadth).

### M4 — Harness + install stage
- [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env
  (cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
- [x] Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary.
  Repro: `cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py`
  → 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.

### M5 — Upgrade + backup/restore stages
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
  reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
  Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.

### M6 — Recipe-local tests + second recipe
- [x] D4 recipe-local discovery: recipe-shipped tests/ snapshotted post-fetch + run against the live
  app as a `recipe-local` stage (contract CCCI_BASE_URL/CCCI_APP_DOMAIN). Demo'd via mirror branch
  recipe-maintainers/custom-html@ci/d4-recipe-local → recipe-local test PASSED against live app.
- [x] Enroll DB-backed recipe #2 (keycloak + mariadb) via per-recipe tests/keycloak/ only (no harness
  surgery): install green (realm health + Playwright admin login). docs/enroll-recipe.md written.
- [x] Gate: M6 — both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged →
  CLAIMED 2026-05-27. keycloak full 3-stage (DB data survival) folds into the M6.5 breadth ramp.

### M6.5 — Breadth ramp (recipes 3→6)
- [x] keycloak (SSO/DB-backed, recipe #2) full 3-stage green through the Drone recipe-ci pipeline:
  build #39 success (~31m): install 2✓ (realm health + Playwright admin login), upgrade 1✓
  (`test_upgrade_preserves_realm` — DB data survives), backup 1✓ (`test_backup_mutate_restore`).
  Clean teardown (0 keyc services/volumes). Proves DB-backed data survival + integration path.
- [x] cryptpad (stateful/no-DB, recipe #3) full 3-stage green on host (cc-ci-run): install 2✓
  (http + Playwright), upgrade 1✓ (marker in cryptpad_data survives), backup 1✓
  (`test_backup_mutate_restore`). No harness surgery — added generic per-recipe EXTRA_ENV
  (handles cryptpad's SANDBOX_DOMAIN). Fixed a real backup bug en route: set_env glued
  RESTIC_REPOSITORY onto a comment → backupbot had no restic repo (now newline-safe). Drone
  canonical run = **build #46 success** (~6m, all 3 stages green, clean teardown).
- [x] matrix-synapse (DB+media/large-volume, recipe #4) full 3-stage green on host: install 2✓
  (client API + versions JSON), upgrade 1✓ (postgres marker survives), backup 1✓ — exercises the
  recipe's pg_backup.sh DB-dump hook (not a plain volume copy). No harness surgery. Drone
  canonical run = **build #51 success** (~10.5m, all 3 stages green, clean teardown).
- [x] lasuite-docs (multi-service + S3/MinIO, recipe #5) full 3-stage green on host: install 2✓
  (9-service stack converges + SPA + Playwright), upgrade 1✓ (postgres marker survives), backup
  1✓ (pg_backup.sh hook). Fixed deploy timeout (cold-pull of ~9 images > abra 300s) via
  TIMEOUT=900 EXTRA_ENV; OIDC config-only so starts healthy w/ placeholder. Drone canonical run
  = **build #57 success** (all 3 stages green, clean teardown).
- [x] n8n (workflow automation, recipe #6 — bluesky-pds swapped out per DECISIONS) full 3-stage
  green on host: install 2✓ (/healthz + Playwright editor), upgrade 1✓ (marker in /home/node/.n8n
  survives), backup 1✓ (backupbot.backup.path file backup). Drone canonical run = **build #63
  success** (~5.5m, all 3 stages green, clean teardown).
- [ ] Re-verify keycloak backup post set_env fix (build #39 ran off an earlier backupbot deploy)
- [x] Gate: M6.5 — recipes 3–6 three-stage green → **CLAIMED 2026-05-27**. All 6 D10 recipes have a
  full 3-stage green run (host + canonical Drone): custom-html, keycloak(#39), cryptpad(#46),
  matrix-synapse(#51), lasuite-docs(#57), n8n(#63). All 5 categories covered; D5 no-harness-surgery
  held (per-recipe tests/<recipe>/ + recipe_meta EXTRA_ENV only). Awaiting Adversary.
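
The cryptpad item above mentions that `set_env` once glued `RESTIC_REPOSITORY` onto a trailing comment, which is what happens when a line is appended to a file that lacks a final newline. A minimal sketch of a newline-safe setter follows, assuming a plain `KEY=VALUE` env file. The name `set_env_line` and its exact behavior are illustrative assumptions, not the harness's actual code.

```python
def set_env_line(text: str, key: str, value: str) -> str:
    """Return env-file text with KEY=VALUE set (hypothetical helper).

    An existing assignment is replaced in place; otherwise the assignment is
    appended on its own line, even if the input had no trailing newline.
    """
    lines = text.splitlines()  # drops line endings, so a missing final newline is harmless
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.startswith(f"{key}="):
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

Splitting into lines and re-joining means the new assignment can never share a line with whatever the file previously ended on.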

### M7 — Secrets hardening (D6)
- [x] Full sops model + rotation doc (docs/secrets.md: 3 classes, decryption chain, rotation per
  class) + log redaction filter (run_recipe_ci masks /run/secrets/* values in stage output,
  live-streaming preserved). Adversary leak scans clean (baseline + recipe-CI logs).
- [x] Gate: M7 — secret-grep finds nothing → **CLAIMED 2026-05-27**. No-plaintext: harness never
  prints secrets, abra doesn't echo generated ones, reconciles redirect secret-gen to /dev/null,
  dashboard shows status only; redaction filter as belt-and-suspenders. Awaiting Adversary
  (re-grep published logs + dashboard; optionally follow a rotation procedure).
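
The redaction filter described above (mask `/run/secrets/*` values in stage output while keeping live streaming) could look roughly like the sketch below. The function names are hypothetical and this is not the code in `run_recipe_ci`. It also assumes single-line secret values: a multi-line secret such as a PEM key would need matching across line boundaries.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def load_secret_values(secrets_dir: str = "/run/secrets") -> List[str]:
    """Read every regular file under secrets_dir and return its stripped content."""
    root = Path(secrets_dir)
    if not root.is_dir():
        return []
    values = []
    for path in root.iterdir():
        if not path.is_file():
            continue
        try:
            value = path.read_text(errors="replace").strip()
        except OSError:
            continue  # unreadable secret: this process cannot leak it either
        if value:
            values.append(value)
    # Longest first, so a secret that contains a shorter one is masked whole.
    return sorted(values, key=len, reverse=True)


def redact_stream(lines: Iterable[str], secrets: List[str], mask: str = "***") -> Iterator[str]:
    """Yield each line as it arrives with secret values replaced (keeps live streaming)."""
    for line in lines:
        for secret in secrets:
            line = line.replace(secret, mask)
        yield line
```

Because `redact_stream` is a generator, it can wrap a subprocess's stdout and forward each line immediately rather than buffering the whole stage log.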

### M8 — Dashboard (D7)
- [x] Overview page + badges: dashboard/dashboard.py + modules/dashboard.nix — live at
  ci.commoninternet.net/, lists the 6 recipes w/ pass/fail/running badges + run links, plus
  /badge/<recipe>.svg. Verified via gateway; /hook still routes to bridge. (content-hash image
  tag so the swarm service rolls on code change.)
- [x] PR-comment outcome reflection: bridge watcher polls the Drone build to completion + edits its
  run comment to ✅ passed / ❌ <status> (Gitea PATCH). Verified: fresh !testme on PR #1 → comment
  edited to "❌ failure → …/76" within ~20s.
- [x] [idea] gave the bridge image a content-hash tag (fixed latent `:latest` no-roll issue)
- [x] Gate: M8 — overview matches reality; outcomes mirrored → **CLAIMED 2026-05-27**. Dashboard
  overview lists the 6 recipes w/ correct status badges (live, gateway-verified); PR comments link
  back AND reflect final pass/fail. Awaiting Adversary.
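
The content-hash image tag mentioned above works because a tag derived from the source bytes changes whenever the code changes, so the swarm service sees a new image reference and rolls. A fixed `:latest` tag never changes, so nothing rolls. The sketch below illustrates the idea only: in the repo the tag is presumably computed in Nix, and `content_hash_tag` is a hypothetical name.

```python
import hashlib
from typing import Dict


def content_hash_tag(files: Dict[str, bytes], length: int = 12) -> str:
    """Derive a short, deterministic image tag from file names and contents."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sorted so dict ordering cannot change the tag
        digest.update(name.encode())
        digest.update(b"\0")  # separator between name and content
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```

A caller would read the relevant sources (for example `dashboard/dashboard.py`) into the mapping and use the result as the image tag.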

### M9 — Reproducibility + docs (D8/D9)
- [x] D9 docs complete: README + docs/{install,enroll-recipe,secrets,architecture,runbook,baseline}.
  Covers architecture, enroll a recipe, add/run tests locally, operate/rotate secrets, debug a
  failed run. install.md = from-scratch path (clone + nixos-rebuild + operator preconditions).
- [ ] Gate: M9 — Adversary rebuilds from docs on throwaway host (D8) — Adversary action; install.md
  ready. (Note: a from-scratch rebuild pulls images → needs the registry creds / quota too.)

### M10 — Proof (D10)
- [x] **All 6 recipes green via REAL !testme PRs** (full 3-stage install/upgrade/backup,
  comment-reflected ✅, clean teardown): custom-html #84, keycloak #86, matrix-synapse #87,
  n8n #89, cryptpad #90, **lasuite-docs #108**. All 5 D10 categories covered.
- [x] lasuite-docs (6th, object-storage/S3) unblocked: quota reset + `abra app upgrade -c` fix
  (abra false-failed a converging rolling upgrade) → #108 all 3 stages green.
- [x] Gate: M10 — six recipes green via !testme → **CLAIMED 2026-05-27**, awaiting Adversary D10
  verification.
- [ ] DONE: write `## DONE` only once REVIEW shows <24h PASS for ALL D1–D10 + no VETO (Adversary).

## Adversary findings
<!-- Adversary-only section. Builder must not edit below this line. -->

- [x] **[adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard).**
  **CLOSED @2026-05-27T00:35Z** by Adversary re-test. `runner/harness/lifecycle.deploy_app`
  calls `abra.env_set(domain, "LETS_ENCRYPT_ENV", "")` before every deploy. Verified on a live
  harness app (`cust-c95a69`): env `LETS_ENCRYPT_ENV=` empty, no `certresolver` label, **0 ACME
  log lines**, and the served cert is the **wildcard** `CN=*.ci.commoninternet.net` (verify ok)
  — not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders
  — dropping the unused `certificatesResolvers` from traefik — remains a nice-to-have, tracked
  under A3/M7, not required to close A1.)

- [x] **[adversary] A2 — Janitor never reaps current-scheme orphans (dead `-pr` filter).**
  **CLOSED @2026-05-27T10:45Z** by Adversary live re-test of the fix. Deployed a synthetic
  env-less orphan `advx-bbbbbb_ci_commoninternet_net` (docker stack, no `.env` — the case the old
  `-pr` filter AND abra-ls both miss). (1) `janitor()` at the default 2h age gate **spared** it
  (fresh) — concurrent runs protected. (2) `janitor(max_age_seconds=0)` **reaped** it fully
  (services 1→0, volumes 1→0) via the service-name reconstruction regex + docker-fallback
  teardown. Janitor now matches the real `<tag>-<6hex>` scheme and reaps even `.env`-gone orphans.
  Original finding below.
  Found during M4 review. `harness.lifecycle.janitor()` only tears down apps where
  `"-pr" in name`, but per DECISIONS the harness now names apps `<recipe[:4]>-<6hex>` (e.g.
  `cust-c95a69`) — **no `-pr` substring**. So the run-start crash-recovery sweep (§4.3: "nuke
  any orphaned `*-pr*` apps") matches **nothing** and is effectively a no-op. The happy-path
  finalizer in `conftest.deployed_app` does work (observed: `cust-e084bd` from a prior run was
  torn down), but a run that crashes/reboots *before* the finalizer runs leaves an orphan that
  no later run will reap. *Fix:* match the actual naming (e.g. regex `^[a-z]{1,4}-[0-9a-f]{6}\.`
  or a dedicated CI label/prefix) and gate on age. *Re-test:* deploy a harness app, simulate a
  crash (kill the run before teardown), then start a new run and confirm janitor reaps the
  orphan. Adversary closes after re-test.
  **Re-test progress @2026-05-27T05:00Z (fix b7a2d70):** the reaping *mechanism* is verified —
  janitor now matches the real naming via `RUN_APP_RE` (`^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci…`,
  matches `cust-c95a69`) AND reconstructs `.env`-gone orphans from orphaned *service* names
  (regex matches my synthetic `advx-aaaaaa_ci_commoninternet_net_app`), with an age gate to spare
  concurrent runs, then reaps via `teardown_app` (verified clean under A3). **Still pending:** one
  live `janitor()` end-to-end sweep — needs `CCCI_JANITOR_MAX_AGE=0`, which would also reap the
  Builder's live apps, so it must run on an **idle host**. Will close then.
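
The selection rule A2 converges on (match the `<tag>-<6hex>` naming, then gate on age so concurrent runs are spared) can be sketched as below. The pattern `APP_NAME_SKETCH` is an illustrative stand-in written for this example, not the repo's `RUN_APP_RE`, and `select_orphans` is a hypothetical helper. The creation-time mapping is assumed to come from whatever the real janitor inspects.

```python
import re
import time
from typing import Dict, List, Optional

# Illustrative pattern for run-app domains such as cust-c95a69.ci.commoninternet.net
APP_NAME_SKETCH = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def select_orphans(
    apps: Dict[str, float],
    max_age_seconds: float = 2 * 3600,  # the 2h default age gate mentioned in A2
    now: Optional[float] = None,
) -> List[str]:
    """Return run-app domains old enough to reap.

    `apps` maps app domain -> creation time in epoch seconds.
    """
    now = time.time() if now is None else now
    return sorted(
        name
        for name, created in apps.items()
        if APP_NAME_SKETCH.match(name) and now - created >= max_age_seconds
    )
```

With `max_age_seconds=0` every matching app is selected, which is why A2 notes that such a sweep must run on an idle host.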

- [x] **[adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans + run stays green.**
  **CLOSED @2026-05-27T05:00Z** by Adversary re-test of the Builder's fix (commit b7a2d70).
  `teardown_app` now: `undeploy` → if the service persists, `docker stack rm` **fallback** (needs
  no `.env`) → remove volumes/secrets *by stack name* (retry loop) → drop `.env` LAST → **verify**
  `_residual()` and raise `TeardownError` if anything remains. Empirical worst-case test: I
  `docker stack deploy`-ed a synthetic orphan `advx-aaaaaa_ci_commoninternet_net` (service +
  volume + network, **no `.env`** — exactly the crash-orphan that defeated the old code), then
  called `lifecycle.teardown_app("advx-aaaaaa.ci.commoninternet.net")` → returned OK (verify
  passed) and afterwards services/volumes/networks = **0**. So a `.env`-less orphan is fully
  reaped and teardown is now verified (would raise on residual). Original finding below.
  Found during M4 review (to confirm empirically with a kill-mid-run probe). `lifecycle.teardown_app`
  runs every abra call with `check=False` and "never raises"; the conftest finalizer never
  asserts teardown succeeded. Worse, `abra.app_config_remove` deletes the app `.env`
  **unconditionally**, even if `abra.undeploy` failed first — leaving the swarm service+volume
  running but with no `.env`, so the app can no longer be managed/undeployed via abra (and a
  fixed janitor that shells `abra app undeploy` couldn't reap it either). Net: a partial teardown
  leaves a silent orphan while pytest still reports the run **green**, so the M4/D2 guarantee
  "no orphaned app/volume afterward" is not actually *verified* by the harness. *Fix:* assert
  post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only
  remove the `.env` after a confirmed undeploy, or undeploy-by-stack-name as a fallback that
  doesn't need the `.env`. *Re-test:* run install, kill the process mid-deploy, verify the next
  run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.
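
The ordering A3 describes for the fixed `teardown_app` (undeploy, stack-level fallback, volume/secret removal with retries, `.env` removed last, then a verified residual check) is sketched below against an in-memory fake. The backend interface and `FakeBackend` are assumptions made for illustration. Only the step order and the `TeardownError` name come from the finding.

```python
class TeardownError(RuntimeError):
    """Raised when resources remain after teardown."""


def teardown_app(domain, backend, retries=3):
    backend.undeploy(domain)                      # normal abra path
    if backend.service_exists(domain):
        backend.stack_rm(domain)                  # fallback that needs no .env
    for _ in range(retries):
        backend.remove_volumes_and_secrets(domain)
        if not backend.residual(domain):
            break
    backend.remove_env(domain)                    # .env is dropped LAST
    leftover = backend.residual(domain)
    if leftover:                                  # verify instead of assuming success
        raise TeardownError(f"{domain}: residual {leftover}")


class FakeBackend:
    """In-memory stand-in for abra/docker, used only to exercise the ordering."""

    def __init__(self, undeploy_works=True, volumes_removable=True):
        self.service, self.volume, self.env = True, True, True
        self.undeploy_works = undeploy_works
        self.volumes_removable = volumes_removable
        self.calls = []

    def undeploy(self, domain):
        self.calls.append("undeploy")
        if self.undeploy_works:
            self.service = False

    def service_exists(self, domain):
        return self.service

    def stack_rm(self, domain):
        self.calls.append("stack_rm")
        self.service = False

    def remove_volumes_and_secrets(self, domain):
        self.calls.append("remove_volumes")
        if self.volumes_removable:
            self.volume = False

    def remove_env(self, domain):
        self.calls.append("remove_env")
        self.env = False

    def residual(self, domain):
        state = (("service", self.service), ("volume", self.volume))
        return [name for name, present in state if present]
```

Raising on residual is what turns a silent orphan into a failed run, which is the behavior the finding asked for.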

- [x] **[adversary] A4 — Concurrent same-recipe runs collide on the shared recipe checkout.**
  **CLOSED @2026-05-27T03:13Z — mitigated by the runtime concurrency cap.** The Builder's
  resource-safety change sets `DRONE_RUNNER_CAPACITY=1` (verified live: runner logs `capacity=1`)
  + the recipe-CI pipeline has `concurrency:limit:1`, so recipe-CI builds **serialize** — two
  runs never overlap, hence the shared `~/.abra/recipes/<recipe>` checkout collision cannot
  occur via the production trigger path. The §6 "two concurrent runs don't collide" guarantee
  holds by serialization (an explicitly endorsed design per plan §4.2). **Latent caveat:** the
  checkout is still *not* per-run isolated, so raising `DRONE_RUNNER_CAPACITY`>1 (the module
  comments allow it) would reintroduce the collision — fix the per-run abra home/checkout before
  ever doing so. (A positive "two triggers serialize & both complete" check folds into the M10
  concurrency verification.)
  Found by review (M6 verify); to confirm empirically. Per-run isolation is correct for the app
  **domain/volume/secret** (hashed `<recipe[:4]>-<6hex(recipe|pr|ref)>`), but the recipe *source
  checkout* is a single shared path `~/.abra/recipes/<recipe>`: `run_recipe_ci.fetch_recipe`
  does `rm -rf ~/.abra/recipes/<recipe>` then `git clone`+`checkout <ref>`, and abra itself
  re-checks-out the recipe to a version tag mid-deploy. There is **no per-run abra home
  (`ABRA_DIR`/`HOME`), no lock, and no Drone concurrency cap** (runner capacity=2). So two
  concurrent runs of the **same recipe at different refs** (e.g. `!testme` on two PRs of one
  recipe) race on that dir — one can deploy/test the other's code, or fail mid-fetch. (Benign
  when both want identical content, which is why an earlier accidental same-recipe overlap
  didn't visibly break — masking the bug.) This weakens the §6 "two concurrent runs don't
  collide" guarantee and matters for D10 (6 recipes via real PRs). *Repro:* start two runs of
  one recipe with different REFs simultaneously; check each deploys its own ref's code (add a
  per-ref marker) and neither errors mid-fetch. *Fix:* per-run abra home/recipe dir (e.g.
  `ABRA_DIR=$(mktemp -d)` or `~/.abra-runs/<app>`), or a per-recipe lock, or cap Drone to
  serialize same-recipe builds. Adversary confirms + closes after re-test.
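
A4 contrasts the per-run app name (`<recipe[:4]>-<6hex(recipe|pr|ref)>`) with the shared recipe checkout. The sketch below shows one way such a name could be derived and reused for a per-run abra directory, following the `~/.abra-runs/<app>` option from the finding. The hash function (SHA-256) and the `|` joiner are assumptions, since the backlog does not say which the harness uses, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def run_app_name(recipe: str, pr: str, ref: str, zone: str = "ci.commoninternet.net") -> str:
    """Build '<recipe[:4]>-<6hex>.<zone>' from the run identity (hash choice assumed)."""
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()[:6]
    return f"{recipe[:4]}-{digest}.{zone}"


def per_run_abra_dir(app_domain: str, base: str = "~/.abra-runs") -> Path:
    """Give each run its own abra directory so same-recipe runs cannot share a checkout."""
    return Path(base).expanduser() / app_domain
```

Keying the directory on the run's app domain gives two PRs of one recipe different checkouts, which removes the race without relying on the serialization cap.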
273
machine-docs/DECISIONS.md
Normal file
@ -0,0 +1,273 @@
|
||||
# DECISIONS — cc-ci Builder
|
||||
|
||||
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
|
||||
|
||||
## Settled
|
||||
|
||||
- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
|
||||
provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
|
||||
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
|
||||
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
|
||||
time — no secret values stored in `.git/config` or commits.
|
||||
|
||||
- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
|
||||
overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
|
||||
canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
|
||||
end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
|
||||
recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
|
||||
DNS token on the box:
|
||||
- `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
|
||||
`ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
|
||||
`/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
|
||||
- `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
|
||||
`tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
|
||||
wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
|
||||
- Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
|
||||
recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
|
||||
`docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
|
||||
init + `proxy` net + firewall 80/443.
|
||||
- **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
|
||||
`abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
|
||||
`SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
|
||||
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
|
||||
`abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
|
||||
|
||||
- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
|
||||
2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
|
||||
`modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
|
||||
`Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
|
||||
network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
|
||||
(self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
|
||||
converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
|
||||
self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
|
||||
on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
|
||||
`git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
|
||||
`scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
|
||||
overlay (`modules/packages.nix`) so all modules share the one pinned build.
|
||||
- *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
|
||||
wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
|
||||
Documented in docs/secrets.md at M7.
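
The reconcile shape described in this decision (inspect, converge, no-op if already correct, fail visibly on a missing precondition) can be sketched in Python. This is only an illustration under stated assumptions: the real units embed shell via `pkgs.writeShellApplication`, and `reconcile`, `desired`, `actual`, `apply`, and `PreconditionMissing` are hypothetical names, not code from the repo.

```python
from typing import Callable, Dict, List


class PreconditionMissing(RuntimeError):
    """Raised so the caller (a oneshot unit in the real system) fails visibly."""


def reconcile(
    desired: Dict[str, str],
    actual: Dict[str, str],
    apply: Callable[[str, str], None],
    preconditions_ok: bool = True,
) -> List[str]:
    """Converge `actual` toward `desired`; return the keys that were changed."""
    if not preconditions_ok:
        raise PreconditionMissing("precondition missing (e.g. cert absent)")
    changed = []
    for key, want in sorted(desired.items()):
        if actual.get(key) != want:  # inspect
            apply(key, want)         # converge
            actual[key] = want
            changed.append(key)
    return changed                   # empty list means a no-op run


# Hypothetical drift: the stack is up but the cert secret has gone missing.
state = {"stack": "deployed"}
want = {"stack": "deployed", "ssl_cert": "v1"}
log = []
first = reconcile(want, state, lambda k, v: log.append((k, v)))
second = reconcile(want, state, lambda k, v: log.append((k, v)))
```

Because there is no run-once sentinel, the second call inspects the state again and simply finds nothing to do, which is the property that lets every boot or activation re-run the unit safely.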
|
||||
|
||||
- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
|
||||
supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
|
||||
bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
|
||||
- **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
|
||||
open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s limit). Outbound
|
||||
(cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On
|
||||
startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
|
||||
- **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
|
||||
so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
|
||||
one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
|
||||
in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
|
||||
- **Commenter authorization via org membership (read-level, no admin).** Allowed iff
|
||||
`GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
|
||||
for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
|
||||
`AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
|
||||
repo-admin. Fail-closed on any error.
|
||||
- **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
|
||||
No webhook required for CI to work. (Why the root cause of the old webhook non-delivery doesn't
|
||||
matter: polling makes it irrelevant; the operator was whitelisting `ci.commoninternet.net` in
|
||||
Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
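
The shared seen-set and the "first poll only marks" rule from the trigger decision can be sketched as follows. It is a minimal sketch, assuming integer comment ids; `SeenSet`, `poll`, and `mark` are hypothetical names, not the bridge's actual API.

```python
from typing import Iterable, List, Set


class SeenSet:
    """In-memory dedupe of comment ids shared by the poll and webhook paths."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self._primed = False

    def poll(self, comment_ids: Iterable[int]) -> List[int]:
        """Return ids to fire on; the first poll only records what already exists."""
        ids = list(comment_ids)
        if not self._primed:
            self._seen.update(ids)
            self._primed = True
            return []
        return [cid for cid in ids if self.mark(cid)]

    def mark(self, comment_id: int) -> bool:
        """Record one id (webhook path); True only the first time it is seen."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return True


seen = SeenSet()
startup = seen.poll([101, 102])               # pre-existing comments: nothing fires
via_hook = seen.mark(103)                     # webhook delivers comment 103 first
next_poll = seen.poll([101, 102, 103, 104])   # the poll then also sees 103 and 104
```

Keying both paths on the comment id is what makes the optional webhook safe to add or remove: whichever path sees a comment first wins, and the other becomes a no-op.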
|
||||
|
||||
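
The fail-closed commenter check from the trigger decision above might look like the sketch below. The HTTP call is replaced by an injected `membership_status` callable (hypothetical) that returns the status code of `GET /orgs/{owner}/members/{user}`; checking the allowlist before the API call is an assumption, since the decision only states that either condition admits the user.

```python
from typing import Callable, FrozenSet


def is_authorized(
    user: str,
    membership_status: Callable[[str], int],
    allowlist: FrozenSet[str] = frozenset(),
) -> bool:
    """Allow iff the user is allowlisted or the membership lookup returns 204."""
    if user in allowlist:
        return True
    try:
        return membership_status(user) == 204
    except Exception:
        # Fail closed: any error during the lookup denies the run.
        return False


def fake_status(user: str) -> int:
    """Stand-in for the Gitea API call, for illustration only."""
    if user == "flaky":
        raise TimeoutError("simulated network failure")
    return 204 if user in {"trav", "notplants"} else 404


member = is_authorized("trav", fake_status)
stranger = is_authorized("stranger", fake_status)
errored = is_authorized("flaky", fake_status)
allowlisted = is_authorized("flaky", fake_status, frozenset({"flaky"}))
```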
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
|
||||
plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
|
||||
- **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
|
||||
Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
|
||||
queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
|
||||
never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
|
||||
- **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
|
||||
best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
|
||||
Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
|
||||
the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
|
||||
once a test finishes OR times out".
|
||||
- **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
|
||||
(guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
|
||||
own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
|
||||
both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
|
||||
path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
|
||||
runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
|
||||
2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
|
||||
- Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
|
||||
mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
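
The difference between the two janitor modes above (reap immediately at capacity 1, stay age-based otherwise) comes down to one comparison. A minimal sketch, assuming ages are measured in seconds and that the janitor knows each run app's start time; `orphans_to_reap` and the app names are hypothetical.

```python
from typing import Dict, List


def orphans_to_reap(app_started_at: Dict[str, float], now: float, max_age_s: float) -> List[str]:
    """Return the run apps whose age is at least `max_age_s` seconds."""
    return sorted(
        name for name, started in app_started_at.items() if now - started >= max_age_s
    )


# Hypothetical leftovers: one app is 8100s old, the other only 100s old.
apps = {"cust-aaaaaa": 1000.0, "keyc-bbbbbb": 9000.0}
now = 9100.0

reap_all = orphans_to_reap(apps, now, max_age_s=0)          # capacity=1 setting
reap_old = orphans_to_reap(apps, now, max_age_s=2 * 3600)   # 2h default
```

With a max age of 0 every leftover app is reaped, which is only safe when no other run can be in flight; with the 2h default the 100-second-old app, which could belong to a live concurrent run, is left alone.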
|
||||
|
||||
- **D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0
|
||||
sanctions this swap-with-reason).** bluesky-pds routes via a Traefik **TCP router with
|
||||
`tls.passthrough=true`** to an in-container **caddy** that terminates TLS itself and obtains its own
|
||||
cert via **ACME**. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through
|
||||
to cc-ci's Traefik, which **terminates** it with the pre-issued static wildcard cert, and **ACME is
|
||||
hard-forbidden** for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds
|
||||
would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into
|
||||
caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke
|
||||
proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness
|
||||
gap. Per the plan's explicit allowance, **bluesky-pds is a documented non-CI'd recipe** (reason
|
||||
here), and **n8n** takes the 6th slot. The 5 required D10 categories are already covered by recipes
|
||||
1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=
|
||||
matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app
|
||||
(workflow automation) behind the normal terminate-at-Traefik path.
|
||||
|
||||
- **Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27).** D10 real-`!testme` breadth
|
||||
runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage:
|
||||
`toomanyrequests`). Two lessons: (1) **registry pull creds are an A1 operator input** needed for
|
||||
reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2)
|
||||
**Don't `docker image prune -af` mid-breadth** — it evicts cached recipe images and forces re-pulls
|
||||
that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but
|
||||
triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only `dangling`
|
||||
(not `-a`) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green
|
||||
via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).
|
||||
|
||||
## Open (defaults from §8, to confirm as reality lands)
|
||||
|
||||
- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
|
||||
cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
|
||||
`--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
|
||||
proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
|
||||
The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
|
||||
--collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
|
||||
loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
|
||||
source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
|
||||
on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
|
||||
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
|
||||
is a true no-op-then-base. Bump deliberately, never drift.
|
||||
- **Webhook scope:** default per-repo via enroll script.
|
||||
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
|
||||
2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
|
||||
ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
|
||||
modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
|
||||
(D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
|
||||
Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
|
||||
pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
|
||||
pivot. Re-evaluate at the M2 gate.
|
||||
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
|
||||
coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
|
||||
traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
|
||||
with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
|
||||
host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
|
||||
`DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
|
||||
runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
|
||||
- Gitea OAuth app `cc-ci-drone` created under the bot (client_id `ab4cdb9d-ee96-4867-875f-
|
||||
87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`); client_secret +
|
||||
rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
|
||||
- **Drone runner type:** exec (must drive host abra).
|
||||
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
|
||||
host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
|
||||
Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
|
||||
**master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
|
||||
the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
|
||||
plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
|
||||
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
|
||||
cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
|
||||
bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
|
||||
|
||||
- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
|
||||
`<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
|
||||
(`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
|
||||
domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
|
||||
+ config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
|
||||
scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
|
||||
unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/
|
||||
ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
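
The length arithmetic behind the new domain scheme can be checked with a short sketch. The hash function is an assumption (SHA-256 here; the decision only says "6hex"), so this will not reproduce the `cust-e084bd` example, and `run_domain`, `stack_name`, and `resource_name` are hypothetical helper names.

```python
import hashlib

CI_SUFFIX = ".ci.commoninternet.net"
MAX_SWARM_NAME = 64  # limit on Docker swarm config/secret names


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """<recipe[:4]>-<6hex(recipe|pr|ref)> plus the CI suffix (hash choice assumed)."""
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{CI_SUFFIX}"


def stack_name(domain: str) -> str:
    """abra-style derivation as described above: dots become '_', hyphens are kept."""
    return domain.replace(".", "_")


def resource_name(domain: str, resource: str, version: str) -> str:
    """<stackname>_<resource>_<version>, the name that must fit in 64 chars."""
    return f"{stack_name(domain)}_{resource}_{version}"


domain = run_domain("custom-html", "0", "main")
name = resource_name(domain, "nginx_default_conf", "v6")
fits = len(name) <= MAX_SWARM_NAME
```

The stack name is always 4 + 1 + 6 + 22 = 33 characters under this scheme, which leaves 31 characters for `_<resource>_<version>` regardless of how long the recipe name is.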
|
||||
|
||||
- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
|
||||
abra commands (`app ls`, `secret generate` without flags, version resolution) silently
|
||||
`git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
|
||||
test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
|
||||
(private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
|
||||
abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
|
||||
fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
|
||||
later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.
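
Step (3), snapshotting the recipe-shipped `tests/` right after fetch, can be sketched with the standard library alone. This is an illustration, not the harness code: `snapshot_tests` is a hypothetical name, and the directory removal below merely stands in for abra resetting the checkout.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot_tests(recipe_checkout: Path) -> Optional[Path]:
    """Copy <checkout>/tests to a fresh temp dir; None if the recipe ships no tests."""
    src = recipe_checkout / "tests"
    if not src.is_dir():
        return None
    dest = Path(tempfile.mkdtemp(prefix="recipe-tests-")) / "tests"
    shutil.copytree(src, dest)
    return dest


# Demo: a later checkout reset removes tests/, but the snapshot survives.
checkout = Path(tempfile.mkdtemp(prefix="recipe-checkout-"))
(checkout / "tests").mkdir()
(checkout / "tests" / "test_smoke.py").write_text("def test_ok():\n    assert True\n")
snapshot = snapshot_tests(checkout)
shutil.rmtree(checkout / "tests")  # stand-in for abra resetting the checkout
```

Running the recipe-local stage from `snapshot` rather than from the checkout means it no longer matters which abra command touched the working tree in between.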
|
||||
|
||||
## Risks
|
||||
|
||||
- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
|
||||
**inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
|
||||
inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
|
||||
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
|
||||
periodic `docker image prune` to avoid regressing during M6.5 breadth.
|
||||
|
||||
## Dead-ends
|
||||
- (none yet)
|
||||
|
||||
## Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27
|
||||
|
||||
- **Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default).** `cc-ci-secrets`
|
||||
is mounted as a submodule at `cc-ci/secrets/` rather than a flake `inputs.secrets`. Rationale: a
|
||||
private flake input must be re-fetched at **every nix eval**, requiring the bot token persistently
|
||||
in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band
|
||||
secret, which 1c forbids). A submodule makes `secrets/secrets.yaml` a plain path in the working
|
||||
tree → `defaultSopsFile = ../secrets/secrets.yaml` is unchanged (minimal diff, trivially
|
||||
byte-identical), and the only credential use is the one `git clone --recursive` at provisioning
|
||||
("the two repos are *given*", Mission §1). Build invocation becomes
|
||||
`nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` so the submodule tree is
|
||||
included. (Revisit if `?submodules=1` proves unreliable on cc-ci's nix version.)
|
||||
- **Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via
|
||||
`sops.age.keyFile`.** The recovery key (`age1cmk26…`, private at `/srv/cc-ci/.sops/master-age.txt`)
|
||||
is already a sops recipient, so a fresh host with a *different* ssh host key still decrypts every
|
||||
secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding".
|
||||
Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting
|
||||
via its host key (`age.sshKeyPaths`); secrets.nix will offer both identity sources. (Per-host
|
||||
re-encrypt is cleaner for a *permanent* new instance — documented as the alternative, not used for
|
||||
the throwaway test.)
|
||||
- **Cert into git:** wildcard cert+key become sops secrets in `cc-ci-secrets`, decrypted at
|
||||
activation back to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` via
|
||||
`sops.secrets.<name>.path`; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
|
||||
- **cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM.** The
|
||||
freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical
|
||||
cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
|
||||
- **C6 teardown OVERRIDE (operator, 2026-05-27):** do NOT destroy the FINAL throwaway VM after
|
||||
W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says
|
||||
otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other
|
||||
cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).
|
||||
|
||||
## Phase 1b — lint/format tooling (open decisions §6, settled W0)
|
||||
- **Formatters/linters (RL1):** Nix = `nixpkgs-fmt` (format) + `statix` (lints) + `deadnix` (dead
|
||||
code); Python = `ruff` (lint + format); Shell = `shellcheck` + `shfmt -i 2 -ci`; YAML = `yamllint`.
|
||||
Kept `nixpkgs-fmt` over `alejandra` because it was already the repo `formatter` and devshell tool
|
||||
(no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake
|
||||
`lint` devshell (`nix develop .#lint`) so CI and local use byte-identical tool versions.
|
||||
- **Lint entrypoint:** `scripts/lint.sh` (check-only by default; `--fix` auto-applies). The
|
||||
`.drone.yml` push pipeline runs it via `nix develop .#lint --command bash scripts/lint.sh`.
|
||||
- **ruff strictness:** `select = [E,F,W,I,UP,B,C4,SIM]`, `ignore = [E501]` (line length is the
|
||||
formatter's job; only un-splittable strings would trip it). `line-length=100`, `target=py311`.
|
||||
- **Drone lint stage = FAIL (not warn).** The codebase is green now, so enforce from here on — an
|
||||
unclean commit fails the `lint` step. (Resolves the §6 open question.)
|
||||
- **Python type-checking (mypy/pyright): DEFERRED to IDEAS**, not added in 1b. The harness is small
|
||||
and dynamically typed around `abra`/subprocess JSON; gradual typing is a larger effort than this
|
||||
bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
|
||||
- **blocking vs advisory split (§3):** treated as in the phase plan — tests-real, Nix-idempotent,
|
||||
no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift =
|
||||
advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
|
||||
- **cc-ci self-CI push trigger:** the lint stage lives in the `event: push` pipeline. The Gitea→Drone
|
||||
push webhook on this instance is flaky (`last_status: None`; documented §4.1) and predates 1b —
|
||||
recipe CI uses polling as primary, but cc-ci's *own* self-test/lint relies on the push webhook.
|
||||
The lint stage is correctly wired and proven green via the identical `nix develop .#lint` command;
|
||||
reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.
|
||||
|
||||
## Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
|
||||
- **RL5 — all Nix code under `nix/`.** Moved `modules/`→`nix/modules/` and `hosts/`→`nix/hosts/`.
|
||||
`flake.nix`/`flake.lock` STAY at the repo root (entry point) so the build ref `#cc-ci` and
|
||||
`nixos-rebuild --flake '…#cc-ci'` are unchanged — only `flake.nix`'s internal
|
||||
`./hosts/cc-ci/configuration.nix` → `./nix/hosts/cc-ci/configuration.nix` changed. Root-relative
|
||||
refs inside the moved modules were re-based `../X` → `../../X` (secrets.nix → `../../secrets/`,
|
||||
bridge.nix → `../../bridge/`, dashboard.nix → `../../dashboard/`); `configuration.nix`'s
|
||||
`../../modules/*` imports are unchanged (both dirs moved under `nix/`, so the relative path still
|
||||
resolves). **Toplevel is byte-identical (`8i3jcad9…`) before/after the move** — store derivations
|
||||
are content-addressed on the copied file *contents*, and the module `.nix` files aren't part of the
|
||||
runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash
|
||||
change; in practice it's stable, which is even stronger for reproducibility.) Living docs
|
||||
(README, architecture/install/secrets/enroll) + the `.drone.yml` comment updated to `nix/…`;
|
||||
append-only history logs left as the record of what was true then.
|
||||
- **RL6 — protocol files → `machine-docs/`: DEFERRED to the coordinated end of 1b.** Will `git mv`
|
||||
`STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md` into `machine-docs/` (README.md STAYS at root —
|
||||
operator decision, it's the human readme, not a protocol file). The live watchdog (`launch.sh`)
|
||||
reads `STATUS-<id>.md`/`REVIEW-<id>.md` at the repo root for handoffs/transition, so this is done
|
||||
LAST, in lockstep with the orchestrator updating `launch.sh` + restarting the watchdog — not
|
||||
unilaterally and not while a phase transition is pending. The Adversary likewise `git mv`s its own
|
||||
REVIEW files at the cutover (single-writer rule).
|
||||
|
||||
## Phase 1b — recorded deviation: no `tests/_template/` dir (enroll = copy an existing recipe)
|
||||
Plan §3's repo layout lists a `tests/_template/` "copy-to-add-a-recipe" dir. It was **never created**
|
||||
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
|
||||
`docs/enroll-recipe.md` is **"copy an existing recipe's tree, e.g. `tests/custom-html/…`, then adjust
|
||||
`recipe_meta.py` + the per-recipe test files."** This satisfies D5's "small, repeatable, documented
|
||||
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
|
||||
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
|
||||
170
machine-docs/JOURNAL-1b.md
Normal file
@ -0,0 +1,170 @@
|
||||
# JOURNAL — Phase 1b (review & lint pass)
|
||||
|
||||
Append-only Builder log: what I did + verifying command/output + next. (Adversary logs to REVIEW-1b.)
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-27 — Phase 1b kickoff (first wake)
|
||||
|
||||
Read the phase plan (`plan-phase1b-review-lint.md`) + plan.md §6.1/§7/§9. Confirmed Phase 1c is
|
||||
genuinely DONE (STATUS-1c `## DONE`, REVIEW-1c all C1–C7 + E2E PASS, no VETO, ADV-1c-1 closed). Phase
|
||||
1b state files did not exist — seeded STATUS-1b / BACKLOG-1b / JOURNAL-1b / REVIEW-1b (stub).
|
||||
|
||||
Access + environment probes:
|
||||
- `ssh cc-ci 'hostname && systemctl is-system-running'` → `nixos` / `running`.
|
||||
- Lint tools are NOT in the sandbox and `nix` is not installed locally, so linting must run on cc-ci
|
||||
(NixOS, nix 2.24.14, flakes enabled). `nix build github:NixOS/nixpkgs/<our-pin>#ruff` resolves from
|
||||
cache.nixos.org (ruff 0.7.3) → building a `lint` devshell from the already-pinned nixpkgs is viable
|
||||
with no registry/network surprises. shellcheck-0.10.0 already realized in the host store.
|
||||
|
||||
Lint-target inventory: 14 `.nix`, 32 `.py`, 1 `.sh` (`scripts/bootstrap-drone-oauth.sh`), plus
|
||||
`.drone.yml` / `.sops.yaml` YAML. No prior lint/format decisions in DECISIONS.md (clean slate).
|
||||
|
||||
Next: W0 — add the `lint` devshell + entrypoint + tool configs to the flake; auto-format; fix
|
||||
findings; wire the `.drone.yml` lint stage.
|
||||
|
||||
## 2026-05-27 — W0 built: lint toolchain + format + drone stage
|
||||
|
||||
Added (commits 2cede01 format/fixes, 4af427c drone stage, + tooling commits):
|
||||
- `flake.nix`: `lint` devshell (`nix develop .#lint`) = nixpkgs-fmt, statix, deadnix, ruff,
|
||||
shellcheck, shfmt, yamllint, built from the already-pinned nixpkgs (no registry/network surprise —
|
||||
`nix build <pin>#ruff` resolves from cache.nixos.org). Default devshell also gets them.
|
||||
- `scripts/lint.sh` (check / `--fix`), `ruff.toml`, `.yamllint.yaml`.
|
||||
- `.drone.yml`: a `lint` step in the `event: push` pipeline running
|
||||
`nix develop .#lint --command bash scripts/lint.sh` (FAILs the build on any unclean file).
|
||||
|
||||
Format/lint cleanup (semantics-preserving): ruff format on all 32 .py; nixpkgs-fmt drone-runner.nix;
|
||||
shfmt scripts; ruff SIM105/SIM115 (contextlib.suppress / `with open`); statix (merge sops
|
||||
`secrets.*`, empty-pattern → `_`); deadnix (drop unused `self`/`lib`/overlay `final`).
|
||||
|
||||
Verification (on cc-ci, clean tar'd checkout /tmp/ccci-lint):
|
||||
```
$ nix develop .#lint --command bash scripts/lint.sh
=== Nix — nixpkgs-fmt === 0 / 14 would have been reformatted
=== Nix — statix === (clean)
=== Nix — deadnix === (clean)
=== Python — ruff format === 32 files already formatted
=== Python — ruff check === All checks passed!
=== Shell — shfmt/shellcheck === (clean)
=== YAML — yamllint === (clean)
lint: PASS
```
|
||||
nix eval `.#nixosConfigurations.cc-ci.config.system.build.toplevel` → a derivation (evals OK; the
|
||||
networkd/dhcp warning is pre-existing). Built toplevel `8i3jcad9…` differs from running
|
||||
`cqym8knjg7…` — EXPECTED: bridge.py/dashboard.py (and runner) are `cp`'d into the store, so the
|
||||
reformat changes their hash. cc-ci will be rebuilt to the formatted closure in W2 before RL3.
|
||||
All Python byte-compiles (store python 3.12.8).
|
||||
|
||||
Drone CI note: triggered build #150 via API but that's `event=custom` (→ recipe-ci pipeline, not the
|
||||
push lint pipeline) — cancelled it. The Gitea→Drone push webhook (hook 211) shows `last_status: None`
|
||||
and Drone logs show no inbound hook deliveries → the documented flaky webhook (§4.1). Public and
|
||||
canonical (100.90.116.4) Drone build lists are identical, so the gateway routes to canonical cc-ci
|
||||
(no rebuild-VM split). Recorded the flaky-webhook as a pre-existing infra item in DECISIONS.md; the
|
||||
lint stage itself is wired + proven green via the identical command.
|
||||
|
||||
Claimed W0 gate (RL1) in STATUS-1b. Next: W1 white-box review checklist over the cleaned codebase.
|
||||
|
||||
## 2026-05-27 — W0 PASS (Adversary cold, RL1) + W1 Builder-side §3 self-review
|
||||
|
||||
Adversary logged **W0/RL1 PASS** (REVIEW-1b): cold checkout of my HEAD `233939a` archived to cc-ci,
|
||||
`nix develop .#lint --command bash scripts/lint.sh` → exit 0 `lint: PASS`, plus a break-it probe
|
||||
(injected bad .py/.nix → exit 1 `lint: FAIL`) proving the gate has teeth. Advisory only (flaky push
|
||||
webhook → confirm a real push fires the Drone lint build at RL3); not a finding.
|
||||
|
||||
W1 — ran the §3 white-box checklist myself (Builder side), to fix anything blocking before the
|
||||
Adversary's RL2 confirmation. Findings over the post-W0 (cleaned) codebase:
|
||||
- **Tests real (blocking)** — holds. (Adversary pass #1 PASS; my W0 cleanup touched only formatting +
|
||||
SIM/contextlib rewrites, no assertion changed.)
|
||||
- **Harness DRY (blocking-ish)** — holds. `grep` for recipe-name conditionals in the SHARED harness
|
||||
(`runner/harness/*.py`, `run_recipe_ci.py`, `conftest.py`) → NONE. Per-recipe quirks are data:
|
||||
optional `tests/<recipe>/recipe_meta.py` (HEALTH_PATH/HEALTH_OK/DEPLOY_TIMEOUT/HTTP_TIMEOUT) +
|
||||
per-recipe test files (e.g. keycloak `kc_admin.py`). Enrolling needs no shared-harness edit (D5).
|
||||
- **Nix idempotent (blocking)** — holds (no `.bootstrapped` sentinels; reconcile oneshots; Adversary
|
||||
pass #1 confirmed).
|
||||
- **No footguns (blocking)** — holds. Every `time.sleep()` (lifecycle.py 160/170/226/252,
|
||||
bridge.py 304) sits inside a `while time.time() < deadline:` poll/retry loop (verified each), not a
|
||||
bare readiness wait. `--chaos` appears ONLY in "never pass it" comments (abra.py). No `shell=True`.
|
||||
- **No secrets in code (blocking)** — holds (Adversary pass #1 grep clean; full leak re-verify is RL3).
|
||||
- **Log redaction real (blocking)** — holds. `run_recipe_ci.py` `run_stage_redacted()` masks any
|
||||
>=8-char `/run/secrets/*` value from streamed stage output; no secret-named value is printed or logged in
|
||||
`bridge.py`/`dashboard.py` (grep clean).
|
||||
- **Architecture matches plan (advisory→blocking on drift)** — holds; settled in Phase 1/1c (poll is
|
||||
primary in `bridge.py`'s loop; `/hook` optional; traefik is the coop-cloud recipe via `proxy.nix`).
|
||||
No drift; not reopening settled design (guardrail §5).
|
||||
- **Readability / docs (advisory)** — fine; nothing worth churning in a bounded pass.
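
The log-redaction invariant in the checklist above (mask any secret value of at least 8 characters in streamed output) can be illustrated with a small filter. It is a sketch only: `redact`, the `***` mask, and the sample values are assumptions, not the body of `run_stage_redacted()`.

```python
from typing import Iterable


def redact(line: str, secret_values: Iterable[str], min_len: int = 8) -> str:
    """Mask every secret value of at least `min_len` characters found in `line`."""
    # Longest first, so a secret that contains a shorter one is masked as a whole.
    for value in sorted(set(secret_values), key=len, reverse=True):
        if len(value) >= min_len:
            line = line.replace(value, "***")
    return line


secrets = ["hunter2hunter2", "short"]  # hypothetical values read from /run/secrets/*
masked = redact("admin password is hunter2hunter2", secrets)
untouched = redact("a short line", secrets)
```

The length floor keeps short values from blanking out ordinary words in the log, at the cost of not masking secrets shorter than 8 characters.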
|
||||
|
||||
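
The "no footguns" check above accepts a `time.sleep()` only inside a deadline-bounded poll loop. The pattern can be sketched as below; `wait_until` and its parameters are hypothetical names, not functions from `lifecycle.py` or `bridge.py`.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 0.01) -> bool:
    """Poll `check` until it returns True or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval_s)  # sleep only between polls, never as a bare wait
    return False


attempts = {"n": 0}


def ready_on_third_try() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


became_ready = wait_until(ready_on_third_try, timeout_s=2.0)
timed_out = wait_until(lambda: False, timeout_s=0.05)
```

A bare `time.sleep(30)` either wastes time when the app is ready early or fails when it is slow; the deadline loop returns as soon as the condition holds and reports a timeout explicitly otherwise.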
**No blocking finding; nothing to fix; no advisory item to file.** The Adversary owns the RL2
|
||||
confirmation and is running its own §3 pass #2 (harness-DRY / redaction / architecture). Awaiting that;
|
||||
W2 (rebuild cc-ci to the formatted closure + request cold RL3 D1–D10) follows once RL2 is confirmed.
|
||||
|
||||
## 2026-05-27 — RL2 clean + RL5 (nix/ consolidation) + W2 switch to cleaned closure
|
||||
|
||||
**RL2 (Adversary §3 pass #2):** no blocking findings; 2 advisories — (a) `old_app` upgrade-fixture
|
||||
copy-paste across recipes → triaged to IDEAS (per-recipe upgrade tests are by design; sharing is a
|
||||
nicety, not a DRY-blocker); (b) app-secret redaction: the `cc-ci-run` Drone step path isn't wrapped by
|
||||
`run_stage_redacted`, so the Adversary will re-run the behavioral D6 leak test at RL3 (grep published
|
||||
Drone logs + dashboard for a known generated app password). My Builder §3 self-review agreed (no
|
||||
blockers). W1 is light/clean.
|
||||
|
||||
**RL5 — consolidate Nix code under `nix/`** (operator item, plan §7). `git mv modules nix/modules`,
|
||||
`git mv hosts nix/hosts`; flake.nix/flake.lock stay at root (`#cc-ci` unchanged); only flake's
|
||||
internal configuration.nix path + the moved modules' root-relative refs changed (`../X`→`../../X`).
|
||||
Built on cc-ci → toplevel `8i3jcad9…` **byte-identical to the pre-move build** (content-addressed;
|
||||
module .nix not in the runtime closure). Living docs + `.drone.yml` comment updated to `nix/…`.
|
||||
|
||||
**W2 — switched canonical cc-ci to the cleaned+RL5 closure** so `build == running` (required before
|
||||
RL3: a fresh clone builds `8i3jcad9`; running had to match or the byte-identical-to-running check
|
||||
would fail). Re-synced `/root/cc-ci` to HEAD, `nixos-rebuild switch --flake 'path:/root/cc-ci#cc-ci'`:
|
||||
```
stopping units: deploy-bridge.service, deploy-dashboard.service
sops-install-secrets: Imported …ssh_host_ed25519_key as age key (age1h90utdz…)
starting units: deploy-bridge.service, deploy-dashboard.service
```
|
||||
Post-switch health (all green):
|
||||
- `readlink /run/current-system` → `8i3jcad9mrr01558lqckpi26nxn2ra3m-…` (== fresh-clone build; was
|
||||
`cqym8knjg7…` pre-format).
|
||||
- `systemctl is-system-running` → `running`, **0 failed**. deploy-bridge/deploy-dashboard `active`.
|
||||
- 5 stacks up (backups, ccci-bridge, ccci-dashboard, drone, traefik); `ccci-bridge_app` +
|
||||
`ccci-dashboard_app` 1/1 with NEW content-hash image tags (reformatted source redeployed).
|
||||
- Public via SOCKS proxy → gateway → cc-ci: `https://ci.commoninternet.net/` → **200**
|
||||
(`<title>cc-ci — Co-op Cloud recipe CI</title>`); `/badge/custom-html.svg` → **200**.
|
||||
|
||||
Net: RL1 PASS, RL2 clean, RL4 docs landed (README lint section + architecture.md `nix/` layout),
|
||||
RL5 done + healthy, running==build==`8i3jcad9`. Remaining for DONE: **RL3** (Adversary cold D1–D10
|
||||
re-verify, now also covering the RL5 byte-identical rebuild) and **RL6** (coordinated machine-docs/
|
||||
move — LAST, with orchestrator lockstep). Claiming the RL3 gate.
|
||||
|
||||
## 2026-05-27 — push-webhook diagnostic (the RL1 "future commits stay clean" advisory)
|
||||
|
||||
Timeboxed root-cause on why pushes don't auto-create a Drone lint build. Fired Gitea's webhook test
|
||||
for the Drone hook (211) while tailing the Drone server logs:
|
||||
- `POST /repos/recipe-maintainers/cc-ci/hooks/211/tests` → Gitea returns **204** (accepted).
|
||||
- `docker service logs --since 20s drone_…_app` → **NOTHING** — no inbound request logged at all.
|
||||
|
||||
So the delivery `git.autonomic.zone (Gitea) → drone.ci.commoninternet.net (public gateway) → cc-ci`
|
||||
isn't reaching Drone. This is a **gateway/network reachability** condition, NOT a Drone-side config
|
||||
I can fix — and per §9 the gateway is operator-managed (not ours to reconfigure). Leaving it as the
|
||||
documented pre-existing advisory (hook `last_status: None`, §4.1). Impact is limited to cc-ci's OWN
|
||||
self-test/lint pipeline auto-firing; **recipe-CI triggering is unaffected** — the comment-bridge
|
||||
polls Gitea *outbound* (cc-ci → git.autonomic.zone, the reliable direction), which is the plan's
|
||||
primary trigger (§4.1). The lint stage is wired + proven green via its exact command; manual/API
|
||||
Drone builds work. Not expanding scope to re-engineer the inbound path (bounded pass).
|
||||
|
||||
## 2026-05-27 — RL3 FULL D1–D10 PASS (Adversary cold). Only RL6 (coordinated) left.
|
||||
|
||||
Adversary logged **RL3 PASS** (REVIEW-1b): all D1–D10 re-verified cold on the cleaned+RL5
|
||||
byte-identical closure (`8i3jcad9`==running==fresh-clone build), fresh <24h evidence, nothing
|
||||
weakened. Highlights: D1 trigger 20s/8s; D2 install/upgrade/backup green (upgrade actually ran, not
|
||||
skipped) on custom-html + keycloak; D6 leak test 0 hits (8/8 infra + cert/key + generated keycloak
|
||||
admin pw absent from logs/dashboard); D8 fresh-recursive-clone rebuild == running; D10 = 2 fresh
|
||||
category runs (#151 custom-html, #152 keycloak) + carry-forward of the Phase-1 Adversary-verified
|
||||
6/6 set (byte-identical harness/test/closure). Cardinal-rule PASS. **RL1–RL5 Adversary-PASS, no open
|
||||
findings, NO VETO.**
|
||||
|
||||
→ Flagged the orchestrator (STATUS-1b) that I'm **ready for the RL6 coordinated cutover**: it updates
|
||||
`launch.sh` to `machine-docs/` paths + restarts the watchdog; on its signal I `git mv`
|
||||
STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md into `machine-docs/` (README stays root), the Adversary moves
|
||||
REVIEW*, I fix the only in-repo refs (README Loop-state + docs/install.md:15), Adversary re-verifies,
|
||||
then I write `## DONE`. Holding all root protocol files in place until that signal (moving them early
|
||||
breaks the live watchdog). Loop continues; not idling on a long sleep — short fallback while awaiting
|
||||
the orchestrator go-ahead.
|
||||
440
machine-docs/JOURNAL-1c.md
Normal file
@ -0,0 +1,440 @@
|
||||
# JOURNAL — Phase 1c (Builder)
|
||||
|
||||
Append-only. Each entry: what I did + verifying command/output + next.
|
||||
|
||||
## 2026-05-27 — Phase 1c kickoff / orientation
|
||||
|
||||
**Context:** Phase 1 closed DONE (1c10fa5; all D1–D10 Adversary-PASS, no VETO; Adversary loop
|
||||
terminated for Phase 1). Starting Phase 1c per `plan-phase1c-full-reproducibility.md`.
|
||||
|
||||
**Did:**
|
||||
- Read the phase plan in full + plan.md §6.1/§7/§9. Verified access & surveyed current state.
|
||||
- `ssh cc-ci 'hostname && nixos-version && systemctl is-system-running'`
|
||||
→ `nixos` / `24.11.20250630.50ab793 (Vicuna)` / `running`. Healthy.
|
||||
- Surveyed secrets layout:
|
||||
- `secrets/secrets.yaml` holds 6 infra secrets: test_secret, drone_rpc_secret,
|
||||
drone_gitea_client_secret, bridge_drone_token, bridge_gitea_token, bridge_webhook_hmac.
|
||||
- `.sops.yaml`: recipients = `&host` (ssh-to-age of cc-ci ed25519 host key) + `&master`
|
||||
(off-box recovery, private at `/srv/cc-ci/.sops/master-age.txt`, sandbox-only).
|
||||
- `modules/secrets.nix`: `defaultSopsFile = ../secrets/secrets.yaml`,
|
||||
`age.sshKeyPaths = ["/etc/ssh/ssh_host_ed25519_key"]`.
|
||||
- Wildcard cert OUT-OF-BAND at `cc-ci:/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
|
||||
(fullchain 2909 B, privkey 227 B → privkey looks like an EC key). `modules/proxy.nix` reconcile
|
||||
reads it from there and inserts as traefik `ssl_cert`/`ssl_key` swarm secrets; treats a missing
|
||||
cert as FATAL (operator precondition).
|
||||
- `flake.nix`: nixpkgs pinned `50ab793`, sops-nix pinned; single local `nixosConfigurations.cc-ci`.
|
||||
- Tooling note: sandbox host has NO sops/nix/age/ssh-to-age (`which` → only git). sops/age work
|
||||
must run on cc-ci (has nix + host age key) or via a sops binary fetched there with the master key.
|
||||
- Bootstrapped Phase-1c state: STATUS-1c.md, BACKLOG-1c.md, JOURNAL-1c.md (this file). REVIEW-1c.md
|
||||
left for the Adversary (its file per §6.1). Appended Phase-1c decisions to DECISIONS.md.
|
||||
|
||||
**Decisions recorded (DECISIONS.md):** secrets linkage = **git submodule** (deviates from the
|
||||
flake-input default — rationale: no private-repo fetch credential needed at nix-eval on every
|
||||
rebuild, keeps `defaultSopsFile` a local path = minimal change + trivially byte-identical);
|
||||
bootstrap key for throwaway = **recovery age key via `sops.age.keyFile`**.
|
||||
|
||||
**Next (W2):** create private `recipe-maintainers/cc-ci-secrets`; move secrets + wildcard cert into
|
||||
sops there as a submodule of the base; wire secrets.nix (cert→`/var/lib/ci-certs/live` via `path=`);
|
||||
prove byte-identical build + clean switch with TLS from the git cert. Then claim Gate W2.
|
||||
|
||||
## 2026-05-27 — W2 step 1: cc-ci-secrets repo created + populated (DONE)
|
||||
|
||||
**Did:**
|
||||
- Created private `recipe-maintainers/cc-ci-secrets` via Gitea API (bot, org admin). HTTP 201, private=True.
|
||||
- Confirmed cc-ci host SSH key → age identity == `&host` recipient `age1h90utd…`:
|
||||
`ssh cc-ci 'nix shell nixpkgs#ssh-to-age --command ssh-to-age -i /etc/ssh/ssh_host_ed25519_key.pub'`
|
||||
→ exact match. So I can decrypt/re-encrypt on cc-ci with the host key (master stays sandbox-only).
|
||||
- Built `secrets.yaml` on cc-ci (script with file redirections, no key material in argv):
|
||||
`sops -d` existing 6 secrets → append `wildcard_cert`/`wildcard_key` as YAML block scalars from
|
||||
`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` → `sops -e`. Verified round-trip:
|
||||
- recipients: 2 (host+master)
|
||||
- keys: test_secret, drone_rpc_secret, drone_gitea_client_secret, bridge_drone_token,
|
||||
bridge_gitea_token, bridge_webhook_hmac, wildcard_cert, wildcard_key
|
||||
- cert sha256 file==decrypt `c1d96d61…`; key sha256 file==decrypt `9ec25d00…`; test_secret decrypts OK
|
||||
- Retrieved ciphertext (7219 B) to sandbox; created cc-ci-secrets repo (root `secrets.yaml`, own
|
||||
`.sops.yaml` w/ `path_regex: secrets\.yaml$`, README). Pushed to main (auth via per-command
|
||||
http.extraHeader; verified `.git/config` has NO creds). Remote lists .sops.yaml/README.md/secrets.yaml.
|
||||
- Cleaned `/root/cc-ci-secrets.yaml` + build script off cc-ci.
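
The round-trip check above (sha256 of the on-disk file == sha256 of the decrypted value) can be sketched as follows. This is a minimal illustration, not the script that was run on cc-ci; the function names are hypothetical and the decrypted bytes are assumed to be supplied by the caller from `sops -d` output.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def round_trip_ok(source: bytes, decrypted: bytes) -> bool:
    """True when the decrypted secret is byte-identical to the source material."""
    return sha256_hex(source) == sha256_hex(decrypted)


if __name__ == "__main__":
    # Placeholder bytes stand in for fullchain.pem and its sops-decrypted copy.
    original = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
    print(round_trip_ok(original, original))
```

Comparing digests rather than raw bytes keeps key material out of logs: only the short hash prefix needs to be recorded.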
|
||||
|
||||
**Layout decision:** cc-ci-secrets has `secrets.yaml` at ROOT → submodule mounts at base `secrets/`
|
||||
→ base sees `secrets/secrets.yaml`, so `defaultSopsFile = ../secrets/secrets.yaml` is UNCHANGED.
|
||||
|
||||
**Next (W2 step 2):** in base repo — replace tracked `secrets/` with the submodule; add
|
||||
`wildcard_cert`/`wildcard_key` sops secrets in secrets.nix (path= → /var/lib/ci-certs/live, + recovery
|
||||
keyFile); adjust proxy.nix framing; switch cc-ci to new config via
|
||||
`nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`; prove byte-identical +
|
||||
TLS-from-git-cert; then claim Gate W2. (Riskier — touches live server config; fresh iteration.)
|
||||
|
||||
## 2026-05-27 — W2a DONE + verified live; Gate W2 CLAIMED
|
||||
|
||||
**Discovery:** cc-ci's build source `/root/cc-ci` is NOT a git repo — it's a plain dir synced from
|
||||
the sandbox via `tar | ssh` and built as a `path:` flake (DECISIONS.md:126). So cc-ci's deploy needs
|
||||
NO submodule fetch / `?submodules=1` (the synced dir already contains `secrets/`). The git-clone
|
||||
`--recursive` + `?submodules=1` path is only for the documented install / throwaway (W4).
|
||||
|
||||
**Did (W2a — secrets split + cert into git, deployed to live cc-ci):**
|
||||
- secrets.nix: added `wildcard_cert`(0444)/`wildcard_key`(0400) sops secrets → `path=/var/lib/ci-certs/live/*`.
|
||||
- proxy.nix: reframed cert as sops-from-git (not operator drop); kept FATAL guard as a decrypt-path check.
|
||||
- Base repo: `git rm secrets/secrets.yaml`; `git submodule add cc-ci-secrets secrets` (gitlink 2312f1c,
|
||||
`.gitmodules` has NO creds). Pushed f79e542 (rebased over Adversary's c360520; resolved the
|
||||
tracked-file→submodule transition by removing the submodule wd before rebase, repopulating after).
|
||||
- Synced to cc-ci via `tar | ssh` (excluded .git). `nixos-rebuild build` → exit 0, only **6 derivations
|
||||
built** (sops manifest gains cert/key + proxy unit error-msg edit) → toplevel
|
||||
`vh6vwxbl4qr9whzpwgjimhf9gn4329p8` (differs from pre-W2 `m1pdvbhl…` — EXPECTED: cert moved
|
||||
out-of-band-file → Nix-managed sops; that is C2's whole point, not drift).
|
||||
- Backed up operator cert (`/root/ci-certs-operator-bak`), removed the regular files, `nixos-rebuild
|
||||
switch` (detached unit `ccci-w2-switch`, Result=success).
|
||||
|
||||
**Verified live:**
|
||||
- sops cert decrypt: `/var/lib/ci-certs/live/{fullchain,privkey}.pem` are now symlinks →
|
||||
`/run/secrets/wildcard_{cert,key}`; content sha256 == source: `c1d96d61…` / `9ec25d00…` (byte-identical to the
|
||||
original operator cert, now git-sourced).
|
||||
- `systemctl is-system-running` → running, 0 failed. `deploy-proxy` active/success.
|
||||
- **Byte-identical (zero drift):** `nixos-rebuild build` == `/run/current-system` == `vh6vwxbl…`.
|
||||
- **Documented git-clone path also reproduces it:** fresh `git clone --recursive` into a temp git repo
|
||||
+ `nixos-rebuild build --flake 'git+file:///tmp/ccci-git?submodules=1#cc-ci'` → **vh6vwxbl… (MATCH)**.
|
||||
Proves the install/throwaway path works and equals running.
|
||||
- **Live TLS from git cert:** `https://ci.commoninternet.net` http=200 ssl_verify=0; random
|
||||
`probe-*.ci.commoninternet.net` handshake ssl_verify=0 (404 route, expected) via gateway→cc-ci;
|
||||
served leaf `CN=*.ci.commoninternet.net`, LE issuer, valid to Aug 24 2026.
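
The byte-identical (zero drift) check in the list above compares the store hash of the freshly built toplevel with the running system. A minimal sketch, assuming both paths were obtained with `readlink -f` on the build result and on `/run/current-system`; the `-nixos-system-example` name suffix and the function names are hypothetical.

```python
import re

# A Nix store path has the shape /nix/store/<32-char hash>-<name>.
_STORE_RE = re.compile(r"^/nix/store/([0-9a-z]{32})-(.+)$")


def store_hash(path: str) -> str:
    """Extract the 32-character hash component from a Nix store path."""
    match = _STORE_RE.match(path)
    if match is None:
        raise ValueError(f"not a Nix store path: {path!r}")
    return match.group(1)


def zero_drift(built: str, running: str) -> bool:
    """True when the built toplevel and the running system share a store hash."""
    return store_hash(built) == store_hash(running)


if __name__ == "__main__":
    built = "/nix/store/vh6vwxbl4qr9whzpwgjimhf9gn4329p8-nixos-system-example"
    running = "/nix/store/vh6vwxbl4qr9whzpwgjimhf9gn4329p8-nixos-system-example"
    print(zero_drift(built, running))
```

The same comparison covers the `path:` build and the `git+file://…?submodules=1` build.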
|
||||
|
||||
**For the Adversary verifying Gate W2 cold:** must init the submodule (`git clone --recursive` OR
|
||||
`git submodule update --init`, bot creds) then build with `?submodules=1`, else `secrets/` is empty.
|
||||
Both path: and git+submodules builds yield the same toplevel `vh6vwxbl…` (content-addressed).
|
||||
|
||||
**Deferred to W3/W4 prep (NOT in W2):** the recovery-key `sops.age.keyFile` for the throwaway VM —
|
||||
adding it changes the closure again, so I'll add + test it on the throwaway (safe) and re-establish
|
||||
byte-identical there. cc-ci stays on its proven host-key decrypt path for now.
|
||||
|
||||
**Next:** Gate W2 CLAIMED → await Adversary PASS on byte-identical + cert-in-git/TLS. Meanwhile prep W1
|
||||
(resize) / W3 (throwaway VM) — read the incus skill.
|
||||
|
||||
## 2026-05-27 — W3 recon (read-only; while parked at Gate W2)
|
||||
|
||||
Incus skill read. b1 = 100.117.251.31:8443, project terraform-ci, mTLS certs at
|
||||
/srv/incus-terraform-nix-vm-creator/terraform-secrets/{terraform.crt,terraform.key}. **b1 reachable
|
||||
via the EXISTING cc-ci proxy** (`curl --proxy socks5h://127.0.0.1:1055 --cert/--key -k …`) — no
|
||||
separate tailscaled needed (skill's own 1055 proxy would collide; reuse cc-ci's).
|
||||
|
||||
terraform-ci instances + RAM:
|
||||
- cc-nix-test Running 6GB VM ← this IS the live cc-ci; W1 resizes 6→4 (stop→set→start, hotplug times out)
|
||||
- lichen-staging Running 4GB container (leave alone)
|
||||
- kube-base / kube-base-test Stopped 4GB VMs
|
||||
- release-runner Stopped 8GB VM
|
||||
Running total now = 10GB. After W1 + throwaway(4GB): 4+4+4 = 12GB ≤ 16 physical (phase-plan ~12GB
|
||||
doc-only guideline; terraform-ci has no enforced limits.memory). VM create = `projects/incus-base`
|
||||
Terraform template (NixOS base image, cloud-init+tailscale+nix flakes), set instance_name + limits.memory=4GB.
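
The RAM accounting above can be written down as a small check. This is a sketch using the instance sizes listed in this entry; the `Instance` type, the `throwaway` label and the 16 GB default are just restatements of the figures above, not anything read from Incus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    ram_gb: int
    running: bool


def running_ram_gb(instances: list[Instance]) -> int:
    """Sum the memory limits of running instances only."""
    return sum(i.ram_gb for i in instances if i.running)


def fits(instances: list[Instance], physical_gb: int = 16) -> bool:
    """True when running instances stay within physical RAM."""
    return running_ram_gb(instances) <= physical_gb


# Planned state after W1 (cc-nix-test 6 -> 4) plus one 4 GB throwaway.
PLANNED = [
    Instance("cc-nix-test", 4, True),
    Instance("lichen-staging", 4, True),
    Instance("throwaway", 4, True),
    Instance("kube-base", 4, False),
    Instance("kube-base-test", 4, False),
    Instance("release-runner", 8, False),
]

if __name__ == "__main__":
    print(running_ram_gb(PLANNED), fits(PLANNED))
```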
|
||||
|
||||
## 2026-05-27 — W1 DONE: cc-nix-test resized 6→4 GB (verified)
|
||||
|
||||
Gate W2 PASSED (Adversary, cold) → proceeded. No active CI run (only 5 permanent stacks). Resized via
|
||||
Incus API on b1 (mTLS certs through the existing 1055 proxy): PUT state stop (op Success, Stopped) →
|
||||
PATCH `limits.memory=4GB` (http 200) → PUT state start (op Success, Running).
|
||||
**Verified after reboot:**
|
||||
- SSH back in ~30s; `systemctl is-system-running` → running after ~104s (swarm/reconcile converge), 0 failed units.
|
||||
- `free -h` total 3.5Gi (≈4 GB, down from 6). All stacks 1/1 (traefik app+socket-proxy, drone, bridge, dashboard, backups).
|
||||
- **Cert survived reboot via sops:** `/var/lib/ci-certs/live/{fullchain,privkey}.pem` still symlinks →
|
||||
/run/secrets/* (sops re-decrypted on cold boot). current-system still `vh6vwxbl…`.
|
||||
- TLS: `https://ci.commoninternet.net/` http=200 ssl_verify=0 (dashboard served from git cert).
|
||||
Running RAM now: cc-nix-test 4 + lichen-staging 4 = 8 GB; throwaway 4 → 12 GB ≤ 16 physical (guideline OK).
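
The stop → set → start sequence used for the resize can be expressed as an ordered request plan. This sketch only builds the requests and sends nothing; the endpoint paths and JSON bodies reflect my understanding of the Incus REST API and should be treated as assumptions to check against the Incus documentation, and `resize_plan` is a hypothetical helper.

```python
import json


def resize_plan(instance: str, project: str, memory: str) -> list[tuple[str, str, dict]]:
    """Return (method, path, body) tuples for a stop -> set -> start resize."""
    base = f"/1.0/instances/{instance}"
    query = f"?project={project}"
    return [
        # Stop first: the journal notes that memory hotplug times out.
        ("PUT", f"{base}/state{query}", {"action": "stop"}),
        # PATCH merges only the given config key into the instance.
        ("PATCH", f"{base}{query}", {"config": {"limits.memory": memory}}),
        ("PUT", f"{base}/state{query}", {"action": "start"}),
    ]


if __name__ == "__main__":
    for method, path, body in resize_plan("cc-nix-test", "terraform-ci", "4GB"):
        print(method, path, json.dumps(body))
```

Each state change is asynchronous on the server side, so a real client would wait for the returned operation to finish before issuing the next request.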
|
||||
|
||||
**Next: W3** — create blank 4 GB NixOS VM in terraform-ci, provision ONLY the bootstrap (recovery) age key.
|
||||
|
||||
## 2026-05-27 — W3: throwaway VM created (booting) + W4 design notes
|
||||
|
||||
**W3:** Created `ccci-throwaway` in terraform-ci via the **Incus REST API** (curl through the 1055
|
||||
proxy — terraform/nix absent on sandbox; replicated `projects/incus-base/main.tf`): image
|
||||
`incus-base-vm` (fp 3a0c4160), 4 GB RAM / 2 cpu / **20 GB disk** (>10 GB default, to dodge cc-ci's old
|
||||
ENOSPC), cloud-init writes /etc/nixos/{configuration,incus-base}.nix + setup.sh + /etc/ts-auth-key
|
||||
(incus workspace reusable key) + /etc/ts-hostname=ccci-throwaway; runcmd setup.sh (nix-channel
|
||||
nixos-24.11, `nixos-rebuild boot`, sysrq reboot → tailscale auto-joins). ssh_authorized_keys = vm_ssh_key
|
||||
(I hold private) + mfowler + cc-ci-root key. CREATE+START ops Success, status Running; first boot ~4-6 min.
|
||||
NOTE: cc-nix-test was terraform-created (`projects/cc-nix-test`); my W1 API resize drifts its tfstate
|
||||
(reconcile or accept in W6 final-sizing).
|
||||
|
||||
**W4 design (analysis; implement next):**
|
||||
- cc-ci's `hosts/cc-ci/configuration.nix` pins tailscale `--hostname=cc-nix-test` + reads /etc/ts-auth-key,
|
||||
and `secrets.nix` decrypts ONLY via `age.sshKeyPaths` (host SSH key). Consequences for the throwaway:
|
||||
1. **Decryption:** throwaway's host SSH key is NOT a sops recipient → cc-ci config as-is can't decrypt
|
||||
there. **W4 must add `sops.age.keyFile = "/var/lib/sops-nix/key.txt"`** and provision the **recovery
|
||||
age key** there (the ONE out-of-band secret). Open Q: does a *missing* keyFile abort activation on
|
||||
cc-ci (where the file won't exist)? If yes, also provision cc-ci's own host-derived age key at that
|
||||
path (no new exposure) OR keep sshKeyPaths+keyFile and confirm sops-nix tolerates the absence.
|
||||
Test path: add keyFile, deploy to cc-ci (rollback-safe via generations), observe.
|
||||
2. **Tailnet hostname:** after rebuild the throwaway re-ups as `cc-nix-test` → tailscale auto-suffixes
|
||||
the duplicate; the REAL cc-ci is accessed by IP (100.90.116.4) so it's unaffected. Verify the
|
||||
throwaway via its own IP (Incus state tailscale0 addr) and/or incus-agent `exec` (hostname-independent).
|
||||
3. **Bridge side effect:** throwaway's bridge would poll Gitea with the real token (fresh state ⇒ could
|
||||
re-trigger already-`!testme`'d PRs). Mitigate: run W4 when no `!testme` is pending; destroy promptly.
|
||||
- Adding keyFile changes the closure again (W2 byte-identical was at `vh6vwxbl`); re-verify after.
|
||||
|
||||
## 2026-05-27 — W3 DONE (VM reachable) + keyFile finding
|
||||
|
||||
**W3 reachable:** throwaway base boot initially failed tailscale auth — the incus-workspace
|
||||
`.test.env` key is **stale** ("invalid key: API key does not exist"). Fixed by writing the **current
|
||||
`TS_AUTH_KEY` from /srv/cc-ci/.testenv** (same tailnet `taila4a0bf.ts.net`) to /etc/ts-auth-key and
|
||||
`tailscale up`. VM now at **100.126.124.86**; `ssh -i vm_ssh_key` via the 1055 proxy works → NixOS
|
||||
24.11 (rev 50ab793, == cc-ci), nix 2.24 flakes, 4 GB / 20 GB (13 G free). *(install.md/Adversary note:
|
||||
provision the live TS key, not the stale workspace one.)*
|
||||
|
||||
**keyFile finding (decisive):** read sops-install-secrets main.go (sops-nix 77c423a, store
|
||||
`hm2xjph…-source/pkgs/sops-install-secrets/main.go`): when `age.keyFile` is set, line ~1349 calls
|
||||
`os.ReadFile(AgeKeyFile)` and **returns a fatal error if the file is missing** → activation fails.
|
||||
⇒ Adding `keyFile` to cc-ci's config FORCES the file to exist on cc-ci. Also: `sshKeyPaths` reads
|
||||
`/etc/ssh/ssh_host_ed25519_key` (exists on any host; non-recipient keys are simply unused), so keeping
|
||||
both is safe on both hosts.
|
||||
|
||||
**W4 design (locked):** secrets.nix gets `sops.age.keyFile = "/var/lib/sops-nix/key.txt"` (keep
|
||||
sshKeyPaths). Provision that file = the host's bootstrap age key: on **cc-ci** = its host-derived age
|
||||
key (ssh-to-age of the host SSH key — no new secret exposure); on the **throwaway** = the **recovery
|
||||
key** (/srv/cc-ci/.sops/master-age.txt). cc-ci must get the file BEFORE the keyFile config deploys.
|
||||
Adding keyFile changes the closure (supersedes W2 `vh6vwxbl`) → re-verify byte-identical after.
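
The ordering constraint above (the key file must exist before the keyFile config deploys) suggests a preflight check. A minimal sketch that mirrors the finding, not the sops-nix code itself; `activation_would_fail` is a hypothetical helper and the `exists` parameter is there only so the check can be exercised without touching the filesystem.

```python
import os
from typing import Callable, Optional


def activation_would_fail(
    key_file: Optional[str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> bool:
    """A configured keyFile that is absent is fatal; an unset keyFile is not."""
    return key_file is not None and not exists(key_file)


if __name__ == "__main__":
    # Run on the target host before switching to the keyFile config.
    print(activation_would_fail("/var/lib/sops-nix/key.txt"))
```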
|
||||
|
||||
## 2026-05-27 — Orchestrator guidance for C4 TLS verification (W4 Step B)
|
||||
|
||||
The throwaway has a NEW tailscale IP (100.126.124.86); the canonical `ci.commoninternet.net`
|
||||
gateway/DNS still points at the LIVE cc-ci, and the git cert is `*.ci.commoninternet.net`. So verify
|
||||
C4 TLS **locally ON the throwaway**, WITHOUT repointing the live gateway and WITHOUT changing the
|
||||
throwaway DOMAIN (keep DOMAIN=ci.commoninternet.net so the cert matches):
|
||||
- ssh into the throwaway; `curl --resolve probe.ci.commoninternet.net:443:127.0.0.1 \
|
||||
https://probe.ci.commoninternet.net/` → hits the local traefik with SNI ci.commoninternet.net.
|
||||
- Confirm the served leaf == the git cert (sha256 fullchain `c1d96d61…`; Adversary's leaf fingerprint
|
||||
`57:8D:67:9E:FE:89:…:B8:A6`). That proves the rebuilt system serves the git-sourced cert reproducibly.
|
||||
- Do NOT use ci2 for the TLS test (no `*.ci2` cert → would mismatch). Operator wired
|
||||
`ci2.commoninternet.net` + `*.ci2` → 100.126.124.86 for *plain* reachability only (not needed for TLS).
|
||||
- DNS/gateway/cert are documented external INSTANCE preconditions; C4 proves the VM rebuilds from git
|
||||
+ the single bootstrap age key. Don't skip/fake the TLS check.
|
||||
|
||||
## 2026-05-27 — W4 Step A DONE + Step B launched (throwaway rebuild in flight)
|
||||
|
||||
**Step A (cc-ci → final keyFile config):** provisioned cc-ci `/var/lib/sops-nix/key.txt` = host-derived
|
||||
age key (pub == `age1h90utd…` == &host recipient, verified via age-keygen -y). Added
|
||||
`sops.age.keyFile` to secrets.nix (9cc6788), synced, `nixos-rebuild build`→`izsmiajw…` (only
|
||||
manifest+system rebuilt), switched (unit ccci-w4a-switch success). Verified: system running 0 failed,
|
||||
**byte-identical build==running==`izsmiajw…` (ZERO DRIFT)**, cert still sha256 `c1d96d61…`. So cc-ci
|
||||
activates cleanly with keyFile. NOTE: toplevel evolved `vh6vwxbl` (W2) → **`izsmiajw`** (final, +keyFile);
|
||||
the published repo now builds to izsmiajw==running — this is the form the Adversary re-verifies for C4/DONE.
|
||||
|
||||
**Step B (throwaway live rebuild — IN FLIGHT):**
|
||||
- Provisioned throwaway `/var/lib/sops-nix/key.txt` = **recovery key** (via stdin; pub == `age1cmk26…`
|
||||
== &master recipient, verified) — the ONE out-of-band secret.
|
||||
- `git clone --recursive` base (bot creds via http.extraHeader, the "given the repos" provisioning) →
|
||||
/root/cc-ci, submodule `secrets`→2312f1c, secrets.yaml ENC. Confirmed clone has `age.keyFile` line.
|
||||
- Launched `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` as detached unit
|
||||
`ccci-rebuild` (survives the tailscale re-up when cc-ci config activates). Monitoring via incus-agent
|
||||
`exec` (vsock — survives network restart). Expect 10-30 min (builds sops-install-secrets/abra/etc).
|
||||
|
||||
C4/W5 standard (Adversary dd710a6 == orchestrator guidance): keep DOMAIN=ci.commoninternet.net, verify
|
||||
TLS locally on the VM via `curl --resolve …:443:127.0.0.1` (SNI ci.commoninternet.net), served leaf
|
||||
fingerprint must == git cert leaf `57:8D:67:9E:…:B8:A6`; oneshots converge; only age key out-of-band.
|
||||
|
||||
## 2026-05-27 — W4 Step B: throwaway rebuilt; concurrent-abra race found + fixed
|
||||
|
||||
**Throwaway rebuild result (pre-fix config, clone @dd710a6):** `nixos-rebuild switch` BUILD succeeded
|
||||
(2.8 G peak RAM < 4 GB, 11.5 min CPU) → toplevel **`izsmiajw…` == cc-ci's running system** (blank VM
|
||||
reproduces cc-ci byte-for-byte from git + the bootstrap age key). **sops cert decrypted via the
|
||||
RECOVERY key**: /var/lib/ci-certs/live/{fullchain,privkey}.pem → /run/secrets/*, sha256 `c1d96d61…`
|
||||
(match). swarm-init + docker active (node Ready/Leader). BUT activation reported "error(s) while
|
||||
switching": `deploy-proxy` + `deploy-drone` FAILED → system `degraded`.
|
||||
|
||||
**Root cause:** the abra reconcilers (proxy/drone/bridge/dashboard/backupbot) are all
|
||||
`wantedBy multi-user.target`; drone/bridge/dashboard were `after deploy-proxy` but **concurrent with
|
||||
each other**, and backupbot concurrent with proxy. On a FRESH `~/.abra` they race on catalogue/recipe
|
||||
init → fast failures. Confirmed: `abra recipe fetch traefik` works fine alone (rc=0); re-running the
|
||||
oneshots **sequentially** (`systemctl restart deploy-proxy; …drone; …bridge; …dashboard; …backupbot`)
|
||||
→ ALL success, system `running`, **0 failed, all 6 services 1/1** (traefik app+socket-proxy, drone,
|
||||
bridge, dashboard, backups) — identical to cc-ci.
|
||||
|
||||
**Fix (7563d47):** serialize the chain via ordering-only `after`:
|
||||
proxy → drone → bridge → dashboard → backupbot (bridge after drone, dashboard after bridge, backupbot
|
||||
after dashboard). So a single `nixos-rebuild switch` on a blank host converges with no concurrent abra.
|
||||
New toplevel `ld19aj2…`. Deploying to cc-ci (reconcilers already deployed there ⇒ serial no-op
|
||||
re-runs) + re-verify byte-identical, then **recreate the throwaway FRESH** to prove single-switch
|
||||
convergence (authoritative C4; mirrors the Adversary's W5 cold test).
|
||||
|
||||
This is the LAST planned config change before W4 completes (config stable ld19aj2 thereafter).
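
The difference between the old and the serialized ordering can be checked mechanically. This sketch models each reconciler as a node whose predecessors are its `after` units and asks whether two units can ever be ready at once; it is a simplification of systemd's scheduling, and the short unit names are illustrative labels rather than the real unit names.

```python
from graphlib import TopologicalSorter


def chain(units: list[str]) -> dict[str, set[str]]:
    """Build an ordering graph where each unit runs after the previous one."""
    return {u: ({units[i - 1]} if i else set()) for i, u in enumerate(units)}


def is_serial(after: dict[str, set[str]]) -> bool:
    """True when the ordering never leaves more than one unit ready at a time."""
    sorter = TopologicalSorter(after)
    sorter.prepare()  # raises CycleError on a dependency cycle
    while sorter.is_active():
        ready = sorter.get_ready()
        if len(ready) > 1:
            return False
        sorter.done(*ready)
    return True


# Pre-fix: three units only after proxy, backupbot unordered.
BEFORE = {
    "proxy": set(),
    "drone": {"proxy"},
    "bridge": {"proxy"},
    "dashboard": {"proxy"},
    "backupbot": set(),
}
# Post-fix: proxy -> drone -> bridge -> dashboard -> backupbot.
FIXED = chain(["proxy", "drone", "bridge", "dashboard", "backupbot"])

if __name__ == "__main__":
    print(is_serial(BEFORE), is_serial(FIXED))
```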
|
||||
|
||||
## 2026-05-27 — W4: cc-ci on serialized config (ld19aj2) + throwaway TLS leaf-match PASS
|
||||
|
||||
- cc-ci switched to serialized config: `systemctl is-system-running`=running, **byte-identical
|
||||
build==running==`ld19aj2dcrjm6jarq1k6rvhc0zww34qq` (ZERO DRIFT)**, 6 services.
|
||||
- **Throwaway local TLS (C4 cert proof):** on the rebuilt throwaway (IP 100.126.124.86),
|
||||
`curl --resolve probe.ci.commoninternet.net:443:127.0.0.1` → http=404 (no route, expected)
|
||||
**ssl_verify=0**. Served leaf sha256 fingerprint == git-cert leaf:
|
||||
`57:8D:67:9E:FE:89:D5:FB:43:2E:2A:02:D6:A6:BA:F4:9B:98:1A:78:4A:6C:6A:85:DB:F6:A2:81:61:A6:B8:A6`
|
||||
(== Adversary reference). Full chain of custody: git sops → recovery-key decrypt →
|
||||
/var/lib/ci-certs/live → traefik swarm secret → served leaf. The rebuilt host serves the git-sourced cert.
|
||||
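
The leaf-match step compares a colon-separated SHA-256 fingerprint of the served certificate against the reference. A minimal sketch of that comparison, assuming the caller already holds the served leaf as DER bytes or PEM text; the helper names are hypothetical.

```python
import hashlib
import ssl


def sha256_fingerprint(der: bytes) -> str:
    """Format the SHA-256 of a DER certificate as colon-separated upper-case hex."""
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def pem_fingerprint(pem: str) -> str:
    """Fingerprint a single PEM certificate (the leaf, not the whole chain)."""
    return sha256_fingerprint(ssl.PEM_cert_to_DER_cert(pem))


def same_leaf(reference: str, der: bytes) -> bool:
    """Case-insensitive comparison against a recorded fingerprint."""
    return reference.strip().upper() == sha256_fingerprint(der)


if __name__ == "__main__":
    # Placeholder bytes; a real check would use the leaf served by traefik.
    print(sha256_fingerprint(b"placeholder"))
```

The fingerprint is taken over the leaf only, which is why it differs from the sha256 of the full `fullchain.pem` file.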
|
||||
Next: recreate throwaway FRESH with fixed config to prove SINGLE nixos-rebuild switch converges (0 failed).
|
||||
|
||||
## 2026-05-27 — W4 DONE: genuine throwaway-VM live rebuild, SINGLE switch converges (Gate W4 CLAIMED)
|
||||
|
||||
**Authoritative C4 proof on a FRESH blank VM** (destroyed the pre-fix VM, recreated clean; cloud-init
|
||||
used the LIVE TS_AUTH_KEY so it auto-joined the tailnet — no manual tailscale step):
|
||||
- Provisioned ONLY `/var/lib/sops-nix/key.txt` = recovery age key (pub == `age1cmk26…` == &master) —
|
||||
the single out-of-band secret. `git clone --recursive` base+secrets (submodule 2312f1c, secrets ENC).
|
||||
- **One** `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (detached
|
||||
--no-block) → `ccci-rebuild` Result=**success** (~15 min, 2.8 G peak < 4 GB).
|
||||
- **`systemctl is-system-running` → running, 0 failed units** (the serialization fix works: single
|
||||
switch converges, no manual re-runs). Toplevel **`ld19aj2…` == cc-ci** (byte-identical).
|
||||
- **All 6 services 1/1** (5 stacks): traefik app+socket-proxy, drone, ccci-bridge, ccci-dashboard, backups.
|
||||
- **All secrets decrypted via the recovery key**; wildcard cert sops-decrypted from git →
|
||||
`/var/lib/ci-certs/live/fullchain.pem` (symlink→/run/secrets, sha256 `c1d96d61…`).
|
||||
- **TLS from git cert (local, per C4 standard):**
|
||||
`curl --resolve probe.ci.commoninternet.net:443:127.0.0.1` → http=404 (no route, expected) **ssl_verify=0**; served leaf sha256 fingerprint
|
||||
**== git-cert leaf == `57:8D:67:9E:FE:89:…:B8:A6`** (Adversary reference). Full chain of custody.
|
||||
|
||||
So: blank NixOS host + the two git repos + the one bootstrap age key + external DNS/gateway → one
|
||||
`nixos-rebuild switch` → working cc-ci. No undocumented manual step. This closes D8 honestly (static
|
||||
byte-identical closure + live throwaway rebuild). install.md updated to this validated procedure.
|
||||
|
||||
Destroying the throwaway now (frees RAM for the Adversary's independent W5 cold rebuild; C6 no-leftover).
|
||||
Gate W4 CLAIMED — awaiting Adversary cold W5 (their own fresh VM).
|
||||
|
||||
## 2026-05-27 — Operator override: keep the FINAL throwaway (promote → cc-nix-test)
|
||||
|
||||
Orchestrator/operator note: do NOT destroy the FINAL W5/C4-C5 clean-room throwaway VM after it
|
||||
PASSes — the operator repurposes it as the new cc-nix-test for a live real-traffic test through the
|
||||
public gateway. Keep it running; defer its C6 teardown until the operator explicitly says otherwise.
|
||||
Overrides plan §5/§6 "destroy the throwaway" for that one VM. Settles **C6 final sizing = promote the
|
||||
rebuilt VM**. Recorded in DECISIONS.md + STATUS-1c (flagged for the Adversary so they don't tear down
|
||||
their W5 VM on PASS). My already-destroyed first throwaway + RAM accounting unaffected.
|
||||
|
||||
## 2026-05-27 — Added acceptance step: real e2e !testme on the promoted VM (operator-gated)
|
||||
|
||||
Orchestrator added a functional-acceptance step for the clean-room rebuild. SEQUENCING (strict):
|
||||
(1) finish W5/C4-C5; (2) ORCHESTRATOR renames the verified throwaway → cc-nix-test so the public
|
||||
gateway (ci.commoninternet.net + `*.ci` via MagicDNS) routes to it, and SIGNALS me; (3) THEN I run a
|
||||
genuine e2e: `!testme` (as bot) on ONE enrolled recipe (fast, e.g. custom-html) → confirm bridge
|
||||
picks up → Drone builds → app deploys to `<recipe>.ci.commoninternet.net` reachable **through the
|
||||
public gateway** (curl the public subdomain, not localhost) → test passes → undeploy → result
|
||||
reported. Record Drone run # + public-URL curl in JOURNAL-1c/STATUS-1c as functional acceptance of
|
||||
D8/clean-room. Until the swap-done signal: keep the rebuilt VM's full stack running, do NOT tear down,
|
||||
do NOT start the e2e. (Tracked as W5.5 in BACKLOG-1c.)
|
||||
|
||||
## 2026-05-27 — E2E-TESTME spec is authoritative (cc-ci-plan/test-e2e-testme-acceptance.md)
|
||||
|
||||
Orchestrator: the full spec at `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md` is the AUTHORITY
|
||||
(supersedes earlier inline wording). Read it. It's MY test to execute; Adversary independently
|
||||
verifies. Preconditions P1-P3 are orchestrator-provided (node rename → cc-nix-test, public-gateway
|
||||
routing, then a SIGNAL). Self-check on signal: `curl https://ci.commoninternet.net/` → 200 ssl_verify=0.
|
||||
Pass criteria E1-E6 (new spec §3): E1 self-check; E2 new Drone build via bridge (not manual); E3 app
|
||||
answers EXTERNAL request at `<app>.ci.commoninternet.net` through gateway (real 200+cert+content, not
|
||||
localhost); E4 real assertions pass / build success; E5 clean undeploy; E6 reported + dashboard
|
||||
updated. Evidence→JOURNAL-1c, verdict→STATUS/REVIEW-1c as E2E-TESTME PASS. On fail: clean-room finding
|
||||
→ fix in GIT SOURCE (base/cc-ci-secrets), not the live VM → re-run. Bound: one recipe, one green run.
|
||||
Not started — awaiting orchestrator signal; rebuilt VM stack kept up.
|
||||
|
||||
## 2026-05-27 — E2E-TESTME: Builder now owns the tailnet swap (no orchestrator signal)
|
||||
|
||||
Spec §1 updated (re-read): the Builder performs the swap end-to-end after C4/C5 PASS + rebuilt stack
|
||||
up — NO orchestrator signal. Two reversible `tailscale set --hostname` (ORDER MATTERS):
|
||||
(1) `ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'` (original aside, KEEP running for swap-back;
|
||||
ssh cc-ci pinned to 100.90.116.4 still hits original); (2) rebuilt throwaway → cc-nix-test (re-derive
|
||||
its current online IP from `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status | grep -i
|
||||
throwaway`). Then cc-nix-test.taila4a0bf.ts.net → rebuilt VM tailnet-wide; gateway auto-follows ~10s.
|
||||
Verify P1+P2 (status shows cc-nix-test→throwaway IP; `curl https://ci.commoninternet.net/` 200
|
||||
ssl_verify=0) → run E2E-TESTME (E1-E6) → swap-back (rebuilt→old name, `ssh cc-ci 'tailscale set
|
||||
--hostname=cc-nix-test'`). Orchestrator just monitors / safety-net.
|
||||
|
||||
**Two execution watch-outs I'll handle at run time** (reasoned, not yet done): (a) the original
|
||||
(cc-nix-test-orig) keeps its bridge polling Gitea with the same token → would duplicate builds/PR
|
||||
comments; pause it during the e2e (`docker service scale ccci-bridge_app=0` on the original, restore
|
||||
after). (b) the rebuilt VM's Drone needs the one-time OAuth bootstrap (install.md §2,
|
||||
scripts/bootstrap-drone-oauth.sh) before it can clone/build — a documented post-step, run it on the
|
||||
rebuilt VM as part of e2e setup. Still gated on C4/C5 PASS (W5) — not started.
|
||||
|
||||
## 2026-05-27 — E2E-TESTME actor/critic split clarified (avoid node-rename collision)
|
||||
|
||||
Orchestrator disambiguation: only ONE loop runs `tailscale set --hostname`. **Builder (me) owns the
|
||||
swap + the !testme test**; the swap TARGET is the **Adversary's** kept-running W5 VM (Incus instance
|
||||
**`ccci-w5-rebuild`**) — my own throwaway was destroyed. The **Adversary does NOT rename**; it keeps
|
||||
its W5 VM up, **records the VM identity (Incus instance + current tailscale IP) in REVIEW-1c/STATUS**,
|
||||
and independently VERIFIES E1-E6 cold (critic role). So I **WAIT for (i) Adversary W5 PASS + (ii) the
|
||||
recorded VM IP** before swapping (original→cc-nix-test-orig, then ccci-w5-rebuild→cc-nix-test). Updated
|
||||
STATUS-1c pending-e2e accordingly. Still gated on W5 — not started.
|
||||
|
||||
## 2026-05-27 — E2E-TESTME clean-room finding: Drone bot token not reproducible (FIXED in git)
|
||||
|
||||
Doing the e2e setup on the swapped-in rebuilt VM, found the sops `bridge_drone_token` gets **401
|
||||
Unauthorized** from the rebuilt VM's Drone. Root cause: `modules/drone.nix` set
|
||||
`DRONE_USER_CREATE=username:autonomic-bot,admin:true` with **no `token:`** → Drone auto-generates a
|
||||
RANDOM bot machine token in its fresh DB, which can't equal the committed sops token (the original
|
||||
cc-ci only matched because its token was captured FROM the running Drone out-of-band). So on a genuine
|
||||
clean-room rebuild the bridge can't authenticate to Drone → can't trigger builds. This is precisely the
|
||||
out-of-band gap the E2E-TESTME is designed to catch (spec §4). **Fix (git source):**
|
||||
`DRONE_USER_CREATE=...,token:$(cat /run/secrets/bridge_drone_token)` so the bot's machine token is the
|
||||
deterministic sops token on every rebuild. Confirmed via: rebuilt Drone container env had no token;
|
||||
`GET /api/repos/.../builds` with sops token → `{"message":"Unauthorized"}`.
|
||||
This changes the toplevel again (ld19aj2 → new); I will re-deploy to cc-ci and re-verify byte-identical after
|
||||
the e2e; the Adversary then re-checks C1. Next: apply the fix on the rebuilt VM (rebuild → redeploy Drone; wipe
|
||||
Drone DB if DRONE_USER_CREATE doesn't update the existing bot), re-run OAuth, then the !testme e2e.
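
A minimal sketch of the check behind this finding, assuming the value is a comma-separated list of `key:value` pairs as shown above. The helper names `parse_user_create`, `has_pinned_token` and `with_token` are hypothetical and are not part of `modules/drone.nix`; the real fix reads the token from `/run/secrets/bridge_drone_token` at deploy time.

```python
from typing import Dict


def parse_user_create(value: str) -> Dict[str, str]:
    """Split a DRONE_USER_CREATE value into its key:value pairs."""
    return dict(pair.split(":", 1) for pair in value.split(",") if pair)


def has_pinned_token(value: str) -> bool:
    """True when the value carries a non-empty token: entry."""
    return bool(parse_user_create(value).get("token"))


def with_token(value: str, token: str) -> str:
    """Return the value with token: set, rejecting tokens that would break parsing."""
    if not token or "," in token or "\n" in token:
        raise ValueError("token must be non-empty and contain no comma or newline")
    fields = parse_user_create(value)
    fields["token"] = token
    return ",".join(f"{key}:{val}" for key, val in fields.items())
```

Run against the original setting `username:autonomic-bot,admin:true`, `has_pinned_token` returns `False`, which is the state in which Drone generates its own random machine token.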
|
||||
|
||||
## 2026-05-27 — E2E-TESTME on the rebuilt VM: E1-E3 PASS (E4/E5 tracking)
|
||||
|
||||
After applying the Drone-token fix (new toplevel `cqym8knj…`), the rebuilt VM is operational. Restarted
|
||||
drone-runner-exec (stale RPC after the Drone redeploy) → queue drained (cc-ci self-test #1 success).
|
||||
Posted `!testme` (comment 13740, autonomic-bot) on custom-html#2 (head db9a9502). Evidence:
|
||||
- **E1 PASS** — `https://ci.commoninternet.net/` via public gateway → 200 ssl_verify=0 (rebuilt VM).
|
||||
- **E2 PASS** — bridge (poll) picked up the comment → **new Drone build #4** (event=custom, > baseline
|
||||
#3) on the rebuilt VM's Drone. Not a manual trigger.
|
||||
- **E3 PASS** — app deployed to `cust-bdddd9.ci.commoninternet.net`; EXTERNAL curl through the public
|
||||
gateway (sandbox → socks proxy → public DNS → gateway → MagicDNS cc-nix-test → rebuilt VM → Traefik →
|
||||
app) → **HTTP/2 200, ssl_verify=0**, `server: nginx/1.31.1`, body `<!DOCTYPE html>…Welcome to nginx!`
|
||||
(real app content, NOT a Traefik 404), cert `CN=*.ci.commoninternet.net` (LE E8). Crux proven.
|
||||
- E4 (build #4 success), E5 (teardown), E6 (reported+dashboard): still pending; monitoring until the build reaches a terminal state.
|
||||
|
||||
## 2026-05-27 — E2E-TESTME: ALL E1–E6 PASS (functional acceptance of D8/clean-room)
|
||||
|
||||
Real `!testme` on the rebuilt-from-git VM (swapped in as cc-nix-test), full pipeline against the
|
||||
PUBLIC domain:
|
||||
- **E1 PASS** — `https://ci.commoninternet.net/` (public gateway → rebuilt VM) → 200 ssl_verify=0.
|
||||
- **E2 PASS** — `!testme` (bot, comment 13740) on custom-html#2 → bridge poll → **new Drone build #4**
|
||||
(event=custom, > baseline #3), via the bridge (not manual).
|
||||
- **E3 PASS** — app `cust-bdddd9.ci.commoninternet.net` answered an EXTERNAL request through the public
|
||||
gateway → HTTP/2 200, ssl_verify=0, nginx/1.31.1, real body `…Welcome to nginx!`, cert
|
||||
`CN=*.ci.commoninternet.net` (LE E8). Routing public-DNS→gateway→MagicDNS→rebuilt VM→Traefik→app proven.
|
||||
- **E4 PASS** — build #4 success; build log shows the REAL 3 stages all passing (no softening):
|
||||
install (`test_http_reachable`, `test_playwright_page` — Playwright), upgrade
|
||||
(`test_upgrade_preserves_data`), backup (`test_backup_mutate_restore`). 2+1+1 assertions passed.
|
||||
- **E5 PASS** — app undeployed cleanly afterward (0 residual `<tag>-<6hex>` app .envs/stacks).
|
||||
- **E6 PASS** — bridge posted to custom-html#2: "custom-html @ db9a9502 ✅ **passed** →
|
||||
…/cc-ci/4"; public dashboard row = custom-html / success / #4.
|
||||
|
||||
→ **E2E-TESTME PASS.** The clean-room-rebuilt VM is operationally a working CI server end-to-end over
|
||||
the real public domain. Caught+fixed the Drone-bot-token reproducibility gap en route (af46aca).
|
||||
Next: swap-back; re-deploy the token fix to cc-ci (byte-identical at new toplevel cqym8knj); Adversary
|
||||
independently verifies E1-E6.
|
||||
|
||||
## 2026-05-27 — Builder work COMPLETE (C1–C7 + E2E-TESTME); awaiting Adversary final verification
|
||||
|
||||
cc-ci on final config `cqym8knj` (byte-identical, 0 failed, bridge→Drone OK). C7 docs done:
|
||||
install.md/secrets.md/architecture.md updated to the 1c model; plan.md §1.5 carries a Phase-1c
|
||||
supersession note (cert now sops-from-git; bootstrap age key the one out-of-band secret; supersedes
|
||||
§1.5/§4.0/§4.4 cert refs; points to docs/secrets.md). C6 settled (promote rebuilt VM, kept running;
|
||||
first throwaway destroyed; cc-nix-test 4 GB). All C1–C7 + E2E-TESTME implemented & Builder-verified.
|
||||
**Remaining = Adversary's final DONE-verification:** re-confirm C1 byte-identical at `cqym8knj` +
|
||||
independently verify E1–E6. I'll write `## DONE` when REVIEW-1c shows <24h PASS for C1–C7 + E2E-TESTME
|
||||
and no VETO. (plan.md is in cc-ci-plan/, not this repo — edited in place, not committed here.)
|
||||
|
||||
## 2026-05-27 — ADV-1c-1 (architecture.md stale) addressed
|
||||
|
||||
Adversary verdict b301b03: **E2E-TESTME E1–E6 PASS** (independent) + **C1–C6 PASS** (C1 refreshed cold
|
||||
at final `cqym8knj` == running, byte-identical; no VETO). **C7 WITHHELD** on finding ADV-1c-1:
|
||||
`docs/architecture.md` allegedly stale (line 17 "local secrets/secrets.yaml via host SSH key", cert
|
||||
"pre-issued out-of-band"). **But architecture.md was already updated to the 1c model in commit b700cd2**
|
||||
(an ANCESTOR of `3bfb48b`, the HEAD the Adversary cloned for C1) — current line 14/17 + §Network/TLS
|
||||
describe the `cc-ci-secrets` submodule, bootstrap age key, and cert-sops-from-git. The quoted "stale"
|
||||
text is the PRE-b700cd2 line 17 → ADV-1c-1 is a stale-clone false positive (the doc-grep used an older
|
||||
checkout). To remove all doubt I further expanded line 17 (explicit: cert-in-git, submodule, bootstrap
|
||||
key = host-derived OR recovery-key-on-clone, one out-of-band secret). **Adversary: please re-grep
|
||||
`docs/architecture.md` at current HEAD and close ADV-1c-1 → C7 PASS → DONE.**
|
||||
|
||||
## 2026-05-27 — ## DONE (Phase 1c complete)
|
||||
|
||||
Adversary closed ADV-1c-1 → **C7 PASS** (9e0f72a). **ALL C1–C7 + E2E-TESTME Adversary-PASS (<24h, no
|
||||
VETO, no open findings).** Final Builder health check: cc-ci running/0-failed, byte-identical
|
||||
build==running==`cqym8knj` (ZERO DRIFT), 6 stacks, cert sops-from-git `c1d96d61…`, public TLS 200/ssl=0.
|
||||
Wrote `## DONE` to STATUS-1c. Phase 1c exit condition met → stopping the self-paced loop. The Adversary
|
||||
will append its final cold sign-off. Operator follow-up (non-gating): promote `ccci-w5-rebuild`→cc-nix-test
|
||||
(bridge paused, stack up); plan.md §4.0/§4.4 cert wording (superseding note at §1.5).
|
||||
790
machine-docs/JOURNAL.md
Normal file
@ -0,0 +1,790 @@
|
||||
# JOURNAL — cc-ci Builder (append-only)
|
||||
|
||||
## 2026-05-26 — Bootstrap (§1)
|
||||
|
||||
**Access verification (all pass):**
|
||||
- `ssh cc-ci 'hostname && whoami && nixos-version'` → `nixos` / `root` / `24.11.719113.50ab793786d9 (Vicuna)`
|
||||
- `curl https://git.autonomic.zone/api/v1/version` → `{"version":"1.24.2"}`
|
||||
- Gitea bot auth (`curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user`) → `login: autonomic-bot`, id 64
|
||||
- `getent hosts probe-$RANDOM.ci.commoninternet.net` → `143.244.213.108` (the gateway IP, as expected — TLS passthrough)
|
||||
- Cert present: `ls /var/lib/ci-certs/live/` → `fullchain.pem` (2909 b), `privkey.pem` (227 b, mode 640)
|
||||
- recipe-maintainers org exists (private); `recipe-maintainers/cc-ci` → 404 (created below)
|
||||
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n,
|
||||
keycloak, lasuite-meet, matrix-synapse, cryptpad
|
||||
|
||||
**Baseline (docs/baseline.md):** fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk
|
||||
(3.8 GiB free). No docker/swarm/abra. Channel-based `/etc/nixos/configuration.nix` (no flake).
|
||||
|
||||
**Actions:**
|
||||
- Created repo `recipe-maintainers/cc-ci` (private) via Gitea API.
|
||||
- `git init` in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no
|
||||
secrets stored in git config).
|
||||
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.
|
||||
|
||||
**Next:** commit + push bootstrap, then M0 (flake + base config + sops test secret).
|
||||
|
||||
## 2026-05-26 — M0: flake + base config rebuilt from repo
|
||||
|
||||
**Authored** `flake.nix` (pins nixpkgs rev `50ab793786d9…`, the exact rev cc-ci ran),
|
||||
`hosts/cc-ci/hardware.nix` (incus VM module + cloud-init + DHCP/nameservers) and
|
||||
`hosts/cc-ci/configuration.nix` (faithful baseline repro: tailscale w/ hardcoded `--hostname=
|
||||
cc-nix-test` since `builtins.readFile /etc/ts-hostname` is impure under flakes; sshd root; firewall
|
||||
trust tailscale0 + tcp/22; base pkgs).
|
||||
|
||||
**Disk/inode hiccup → resolved:** first `nix flake lock`/build hit `No space left on device` —
|
||||
diagnosed as **inode** exhaustion (`df -i` → 6005 free of 586336; old 8.9 GiB fs). Operator grew
|
||||
the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.
|
||||
|
||||
**Build + switch (commands + output):**
|
||||
- `ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci'` → `BUILD EXIT 0`,
|
||||
produced `nixos-system-nixos-24.11.20250630.50ab793`.
|
||||
- `ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch
|
||||
--flake /root/cc-ci#cc-ci'` (detached so it survives ssh drop) → unit `Result=success
|
||||
ExecMainStatus=0`.
|
||||
|
||||
**Gate verification:**
|
||||
- `systemctl is-system-running` → `running`
|
||||
- `readlink /run/current-system` → `…-nixos-system-nixos-24.11.20250630.50ab793` (gen 3, from flake)
|
||||
- `systemctl is-active tailscaled` → `active`; `sshd.socket` → `active` (sshd is socket-activated, so
|
||||
`sshd.service` reads inactive — live ssh proves it works)
|
||||
- `systemctl --failed` → none
|
||||
- `nixos-rebuild list-generations` → gen 3 current @20:23, prior channel gen 2 retained for rollback.
|
||||
|
||||
**Known warning (tracked, non-blocking):** incus module enables `systemd.network` while we keep
|
||||
`networking.useDHCP=true` (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
|
||||
baseline; networking is up. Clean up by choosing one stack later.
|
||||
|
||||
**Deploy mechanism settled** (DECISIONS.md): `switch --flake` on-host, repo synced via `tar | ssh`.
|
||||
|
||||
**Next:** sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then
|
||||
CLAIM the M0 gate for the Adversary.
|
||||
|
||||
## 2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)
|
||||
|
||||
**Keys:**
|
||||
- Host age recipient from ssh host key: `ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i
|
||||
/etc/ssh/ssh_host_ed25519_key.pub'` → `age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa`.
|
||||
- Master recovery key generated on host (`age-keygen`), public `age1cmk26t…`; private moved off-box
|
||||
to `/srv/cc-ci/.sops/master-age.txt` (mode 600) and removed from the host with `shred`. Never in repo.
|
||||
|
||||
**Files:** `.sops.yaml` (both recipients, rule `secrets/.*\.(yaml|json|env)$`); `modules/secrets.nix`
|
||||
(`sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key]`, `secrets.test_secret={}`); flake gains
|
||||
`sops-nix` input + `sops-nix.nixosModules.sops`; configuration.nix imports the module.
|
||||
|
||||
**sops-nix version pin (dead-end avoided):** master sops-nix wants `buildGo125Module` (Go 1.25),
|
||||
absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to `77c423a…` (2025-06-17, last using
|
||||
plain `buildGoModule`). Verified the file at that rev uses `buildGoModule`. Build then OK.
|
||||
|
||||
**Encrypt test secret:** on host, `printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml`
|
||||
then `nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml` (run inside repo so
|
||||
`.sops.yaml` resolves) → rc=0, two age recipients in the file.
|
||||
|
||||
**Build + switch (commands + output):**
|
||||
- `nixos-rebuild build --flake .#cc-ci` → `BUILD EXIT 0` (built sops-install-secrets w/ Go 1.23.8).
|
||||
- `systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci` →
|
||||
`Result=success ExecMainStatus=0`.
|
||||
|
||||
**Gate verification (M0):**
|
||||
- `systemctl is-system-running` → `running`; `systemctl --failed` → none.
|
||||
- `ls -la /run/secrets/test_secret` → `-r-------- 1 root root 41` ; `stat` → `root:root 400`.
|
||||
- `head -c9` → `cc-ci-m0-` (matches generated value), `wc -c` → 41 (9 + 32 hex). Decrypt path proven.
|
||||
- Pulled encrypted `secrets/secrets.yaml` + `flake.lock` back to clone; `grep cc-ci-m0 secrets.yaml`
|
||||
→ no plaintext leak; lock inputs = nixpkgs, sops-nix.
|
||||
|
||||
**Gate handshake:** set `Gate: M0 — CLAIMED, awaiting Adversary` in STATUS.md. REVIEW.md still empty
|
||||
(no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed
|
||||
with M1 (independent infra build), without advancing to M2 until M0 shows PASS.
|
||||
|
||||
**Next:** M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider
|
||||
→ /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.
|
||||
|
||||
## 2026-05-26 — M1: Docker + single-node swarm via Nix
|
||||
|
||||
**modules/swarm.nix:** `virtualisation.docker.enable` + daily autoprune (--all --volumes until=24h
|
||||
to protect the 28 GiB root), `docker` in systemPackages, and a `swarm-init` oneshot
|
||||
(`docker swarm init --advertise-addr 127.0.0.1` if not active; `docker network create --driver
|
||||
overlay --attachable proxy` if absent). Imported into configuration.nix.
|
||||
|
||||
**Build + switch:** `nixos-rebuild build --flake .#cc-ci` → EXIT 0; `systemd-run … switch` →
|
||||
`Result=success`.
|
||||
|
||||
**Verify (commands + output):**
|
||||
- `systemctl show swarm-init -p Result` → `Result=success`
|
||||
- `docker info --format ...` → `Swarm=active Managers=1 Nodes=1`
|
||||
- `docker network ls --filter name=proxy` → `proxy overlay swarm`
|
||||
- `systemctl is-system-running` → `running`; `--failed` → none.
|
||||
|
||||
**Next:** Traefik as a swarm stack (Nix-declared compose + `docker stack deploy` oneshot): docker
|
||||
swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443,
|
||||
attached to `proxy`. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate).
|
||||
Rationale for swarm-service Traefik over a host `services.traefik`: a host process isn't on the
|
||||
`proxy` overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-`proxy`
|
||||
Traefik watching swarm labels.
|
||||
|
||||
## 2026-05-26 — M1: Traefik swarm stack + HTTPS path proven
|
||||
|
||||
**modules/traefik.nix:** Traefik v3.3 as a swarm service on `proxy` (so it reaches recipe VIPs).
|
||||
Config via Nix `writeText` store files bind-mounted into the container (real files, not /etc
|
||||
symlinks): static `traefik.yml` (entrypoints web/websecure; `providers.swarm` unix socket,
|
||||
exposedByDefault=false, network=proxy; `providers.file` dir /etc/traefik/dynamic; ping; no
|
||||
dashboard) and dynamic `certs.yml` (wildcard at /var/lib/ci-certs/live/* as `stores.default.
|
||||
defaultCertificate` + certificates — so any *.ci.commoninternet.net router with tls=true is covered,
|
||||
no ACME). Deployed by a `traefik-deploy` oneshot (`docker stack deploy`) after swarm-init. Opened
|
||||
firewall 80/443 (gateway forwards over enp5s0).
|
||||
|
||||
**Build + switch:** build EXIT 0; switch `Result=success`; `traefik-deploy` `Result=success`;
|
||||
`docker service ls` → `traefik_traefik traefik:v3.3 1/1`.
|
||||
|
||||
**Verify (commands + output):**
|
||||
- Local: `curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/` →
|
||||
`subject: CN=*.ci.commoninternet.net`, `issuer: …Let's Encrypt; CN=E8`, TLSv1.3, HTTP 404.
|
||||
- **End-to-end via gateway:** `curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108
|
||||
https://probe-test.ci.commoninternet.net/` → `Connected to …(143.244.213.108) port 443`,
|
||||
same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination.
|
||||
404 is correct (no router for that host yet).
|
||||
|
||||
**Next:** install abra (M1 last task), `abra app new` a trivial recipe (custom-html) → deploy →
|
||||
reach over HTTPS at <app>.ci.commoninternet.net → teardown leaving no volumes. That completes M1
|
||||
→ CLAIM M1 gate.
|
||||
|
||||
## 2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)
|
||||
|
||||
**Orchestrator decision (mid-M1):** replace the hand-rolled Traefik with the canonical Co-op Cloud
|
||||
`traefik` recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom
|
||||
`modules/traefik.nix`; moved firewall 80/443 into `modules/swarm.nix`. Recorded in DECISIONS.md.
|
||||
|
||||
**Why the pivot also fixed a real bug:** my custom Traefik used entrypoint `websecure`; coop-cloud
|
||||
recipes label `entrypoints=web-secure`. While chasing that I also hit a sharp **systemd-run gotcha**:
|
||||
`systemd-run … nixos-rebuild switch --flake .#cc-ci` runs with cwd `/`, so `.#` → `/` → "could not
|
||||
find a flake.nix"; the switch silently failed while a post-`--collect` `systemctl show` returned a
|
||||
stale `Result=success`. Fix: always use the **absolute** flake path `/root/cc-ci#cc-ci`, and read the
|
||||
result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)
|
||||
|
||||
**abra packaged** (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd.
|
||||
`abra --version` → `0.13.0-beta-06a57de`.
|
||||
|
||||
**scripts/deploy-proxy.sh** (idempotent, pure-bash — host has no python3): ensure local abra server,
|
||||
fetch traefik, write wildcard/no-ACME env (`WILDCARDS_ENABLED=1`, `SECRET_WILDCARD_*_VERSION=v1`,
|
||||
`COMPOSE_FILE=compose.yml:compose.wildcard.yml`, `LETS_ENCRYPT_ENV=` empty), insert cert secrets via
|
||||
`abra app secret insert … -f` from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line
|
||||
PEM must use `-f` (not arg); secret-presence must check `docker secret ls` (abra's recipe list always
|
||||
shows the name with `created on server:false`).
|
||||
|
||||
**Traefik deploy:** `abra app deploy` → `deploy succeeded 🟢` (traefik v3.6.15 + socket-proxy).
|
||||
Verify: `docker service ls` → app+socket-proxy 1/1; via gateway `curl --resolve probe.*:443:
|
||||
143.244.213.108` → `CN=*.ci.commoninternet.net` (LE E8); **0 ACME log lines**.
|
||||
|
||||
**M1 gate (recipe over HTTPS + teardown):**
|
||||
- `abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n` then set
|
||||
`LETS_ENCRYPT_ENV=` and `abra app deploy -n -C` → `🟢` (nginx 1.29.0).
|
||||
- `curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/` →
|
||||
`http_code=200 size=615`, served the nginx welcome page over HTTPS with the wildcard cert.
|
||||
- Teardown: `abra app undeploy -n` → 🟢; `abra app volume remove -f -n` → "1 volumes removed";
|
||||
leak check → services 0 / volumes 0 / secrets 0 / containers 0. **Clean.**
|
||||
- Correct teardown syntax confirmed: `secret remove <d> --all -n` (not `--all-secrets`).
|
||||
|
||||
**docs/install.md** seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.
|
||||
|
||||
**Next:** M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.
|
||||
|
||||
## 2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets
|
||||
|
||||
**Decision (DECISIONS.md):** keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the
|
||||
abandoned `drone-runner-exec` (unstable-2020) — accepted (stable RPC), Woodpecker is the documented
|
||||
fallback. Deploy shape mirrors traefik: server via coop-cloud `drone` recipe (abra, swarm,
|
||||
traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.
|
||||
|
||||
**Recipe recon:** coop-cloud `drone` recipe = drone/drone:2.26.0, secrets `rpc_secret` +
|
||||
`CLIENT_SECRET` (Gitea OAuth), Gitea SSO via `compose.gitea.yml` (`GITEA_CLIENT_ID`, `GITEA_DOMAIN`).
|
||||
Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.
|
||||
|
||||
**Done this tick:**
|
||||
- Created Gitea OAuth app `cc-ci-drone` (bot): client_id `ab4cdb9d-…`, redirect
|
||||
`https://drone.ci.commoninternet.net/login`.
|
||||
- Generated `DRONE_RPC_SECRET` (openssl-equivalent /dev/urandom hex32) + stored client_secret;
|
||||
both added to `secrets/secrets.yaml` via `sops set` (needed `SOPS_AGE_KEY` from the host ssh key:
|
||||
`ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key`). Verified: decrypt shows keys
|
||||
test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).
|
||||
|
||||
**Next:** scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets),
|
||||
modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the
|
||||
runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).
|
||||
|
||||
## 2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots
|
||||
|
||||
**Orchestrator steer (2×):** collapse install to a single `nixos-rebuild switch` — convert the
|
||||
manual deploy scripts into **idempotent-reconcile systemd oneshots** (writeShellApplication, embedded
|
||||
in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every
|
||||
activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.
|
||||
|
||||
**Refactor done:**
|
||||
- `modules/packages.nix`: `pkgs.abra` overlay (shared pinned build).
|
||||
- `modules/proxy.nix`: `deploy-proxy` oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
|
||||
- `modules/drone.nix`: `deploy-drone` oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from
|
||||
/run/secrets), after deploy-proxy.
|
||||
- `modules/drone-runner.nix`: exec runner (fixed PATH conflict via `lib.mkForce`; allowUnfree for
|
||||
drone-runner-exec — Polyform license).
|
||||
- `modules/secrets.nix`: declared drone_rpc_secret + drone_gitea_client_secret + a sops *template*
|
||||
`drone-runner.env` (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
|
||||
- Removed `scripts/deploy-*.sh`. install.md now = clone + nixos-rebuild switch + preconditions.
|
||||
|
||||
**Build/switch:** build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed).
|
||||
`nixos-rebuild switch` → all three units `active`/`success`:
|
||||
- `deploy-proxy` success (reconciled traefik), `deploy-drone` → `deploy succeeded 🟢` (drone/drone
|
||||
2.26.0, secrets client_secret+rpc_secret v1, drone_env config), `drone-runner-exec` active.
|
||||
|
||||
**Verify (commands + output):**
|
||||
- `docker service ls` → `drone_ci_commoninternet_net_app 1/1`, traefik app+socket-proxy 1/1.
|
||||
- Via gateway: `…/healthz` → **200**; `/` → **303** (login redirect, correct).
|
||||
- Runner: journal shows a few startup `cannot ping the remote server (404)` (drone RPC not ready
|
||||
yet) then `successfully pinged the remote server` + `polling the remote server capacity=2
|
||||
endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec`. **Runner connected via RPC.**
|
||||
|
||||
**Remaining for M2 gate:** push a hello-world `.drone.yml` to cc-ci + get a green build. Needs the
|
||||
cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant
|
||||
Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint
|
||||
a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot
|
||||
the admin.)
|
||||
|
||||
## 2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)
|
||||
|
||||
**Drone↔Gitea OAuth (scripted, the one manual bootstrap):** logged the bot into Gitea (CSRF cookie
|
||||
→ form), drove Drone `/login` → Gitea authorize consent (POST `/login/oauth/grant` with _csrf+state+
|
||||
granted=true) → code callback → Drone `_session_`. Captured the whole flow in
|
||||
`scripts/bootstrap-drone-oauth.sh` (reads bot creds from env; documented in install.md §2; one-time,
|
||||
token persists in Drone's data volume).
|
||||
|
||||
**Repo activation:** `GET /api/user` → autonomic-bot admin=true; `GET /api/user/repos?latest=true`
|
||||
synced 12 repos; `POST /api/repos/recipe-maintainers/cc-ci` → active=true, config_path .drone.yml
|
||||
(sets the Gitea push webhook).
|
||||
|
||||
**Green build:** added `.drone.yml` (exec pipeline), pushed (0d89e28). Polled
|
||||
`/api/repos/recipe-maintainers/cc-ci/builds` → build #1 pending→running→**success**. Steps:
|
||||
clone success exit 0; hello success exit 0 — log shows `whoami=root`, `abra 0.13.0-beta-06a57de`,
|
||||
`swarm=active` (ran on the host via the exec runner). **M2 gate met; CLAIMED.**
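
The pending→running→success polling can be sketched as below. This is an illustration, not the script that was actually run: `fetch_status` is a hypothetical callable that would wrap the `GET /api/repos/recipe-maintainers/cc-ci/builds` call and return the latest build's status string, and the set of terminal statuses is an assumption.

```python
import time
from typing import Callable, List, Tuple

# Assumption: statuses after which a build will not change any more.
TERMINAL_STATUSES = {"success", "failure", "error", "killed"}


def wait_for_build(
    fetch_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, List[str]]:
    """Poll until the build is terminal; return (final status, distinct statuses seen)."""
    deadline = clock() + timeout_s
    seen: List[str] = []
    while True:
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each transition once
        if status in TERMINAL_STATUSES:
            return status, seen
        if clock() >= deadline:
            raise TimeoutError(f"build still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.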
|
||||
|
||||
**Next:** M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + `!testme` exact +
|
||||
collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with
|
||||
the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).
|
||||
|
||||
## 2026-05-26 — M3 start: bridge secrets + comment-bridge source
|
||||
|
||||
**Secrets (sops):** minted a Gitea API token (`cc-ci-bridge`, scopes read:org/user, write:repo/issue),
|
||||
a Drone API token (`POST /api/user/token`, the stable personal token; rotates on call), and a webhook
|
||||
HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via
|
||||
`sops set` (host age identity). secrets.yaml now holds 6 secrets.
|
||||
|
||||
**bridge/bridge.py** (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC
|
||||
(`X-Gitea-Signature` sha256), requires `X-Gitea-Event: issue_comment`, action=created, body trimmed
|
||||
== `!testme`, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204);
|
||||
resolves PR head sha+repo; triggers a parameterized Drone build
|
||||
(`POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC`, custom params → pipeline env);
|
||||
posts a PR comment linking the run. Secrets read from mounted files; config via env. `/healthz` GET.
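
The signature and trigger checks described above could look roughly like this. It is a stdlib-only sketch and not the contents of `bridge/bridge.py`: the function names are hypothetical, and the payload field names (`action`, `is_pull`, `comment.body`) are assumptions about the Gitea `issue_comment` payload that should be checked against a real delivery.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Compare the hex HMAC-SHA256 of the raw body with the X-Gitea-Signature header."""
    want = hmac.new(secret, body, hashlib.sha256).hexdigest().encode("ascii")
    got = header_sig.strip().lower().encode("ascii", "replace")
    return hmac.compare_digest(want, got)  # constant-time comparison


def is_trigger(event: str, payload: dict) -> bool:
    """True only for a newly created PR comment whose trimmed body is exactly !testme."""
    comment = payload.get("comment") or {}
    return (
        event == "issue_comment"
        and payload.get("action") == "created"
        and payload.get("is_pull") is True  # assumed field: the issue is a pull request
        and str(comment.get("body", "")).strip() == "!testme"
    )
```

The HMAC must be computed over the raw request bytes, before any JSON parsing, since re-serialising the payload would change the digest.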
|
||||
|
||||
**Next:** package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind
|
||||
traefik at `ci.commoninternet.net/hook` via a reconcile oneshot (modules/bridge.nix); register a
|
||||
per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab
|
||||
rejected). That's the M3 gate.
|
||||
|
||||
## 2026-05-26 — M3: bridge deployed + verified; webhook DELIVERY blocked (Gitea-side)
|
||||
|
||||
**Deployed** the comment-bridge as a Nix-built OCI image (no Docker Hub pull) → swarm service on
|
||||
`proxy`, behind traefik at `ci.commoninternet.net/hook`, via reconcile oneshot `modules/bridge.nix`.
|
||||
Swarm secrets (webhook_hmac/drone_token/gitea_token) materialised from /run/secrets.
|
||||
|
||||
**Verified working (bridge side):**
|
||||
- `docker service ls` → ccci-bridge_app 1/1.
|
||||
- `GET /hook/healthz` → 200 **from the sandbox over real public DNS** (ci.commoninternet.net →
|
||||
143.244.213.108); also 200 via gateway from cc-ci.
|
||||
- HMAC logic: bad sig → 401; a manually openssl-HMAC-signed body → 204 (passes sig, ignored as
|
||||
non-trigger); wrong event → 204. (Debug log added: `got=/want=/bodylen/seclen`.)
|
||||
- Registered per-repo `issue_comment` webhook (id 210) on recipe-maintainers/cc-ci → ci.../hook with
|
||||
the HMAC. Created scratch PR #1.
|
||||
|
||||
**Blocker found:** commenting `!testme` (×several) and Gitea's "Test Delivery" (UI returns 200) yield
|
||||
ZERO requests at the bridge container. Bridge is publicly reachable by hostname from a 3rd network;
|
||||
gateway accepts public sources; public DNS correct → Gitea is not *sending* the delivery. Deliveries
|
||||
panel is AJAX (uninspectable via curl); bot is not Gitea admin (can't read `ALLOWED_HOST_LIST`).
|
||||
Conclusion: git.autonomic.zone webhook policy (likely `ALLOWED_HOST_LIST`) blocks ci.commoninternet.net.
|
||||
Recorded in STATUS ## Blocked with operator options (whitelist host, or I pivot bridge to polling).
|
||||
|
||||
**Plan:** surface to operator; meanwhile proceed to M4 (harness + install stage) which doesn't depend
|
||||
on the webhook (dev recipe-CI builds triggerable directly via the Drone API). Revisit M3 gate once the
|
||||
host is whitelisted or via the polling fallback.
|
||||
|
||||
## 2026-05-27 — M4: harness + install stage green (custom-html), guaranteed teardown
|
||||
|
||||
**Built the harness:** `runner/harness/abra.py` (abra wrappers w/ gotchas: no --chaos on
|
||||
undeploy/volume-remove, `-n` everywhere, parse `app ls -S -m` nested {server:{apps}}, timeouts),
|
||||
`runner/harness/lifecycle.py` (deploy_app forcing `LETS_ENCRYPT_ENV=""` [A1], wait_healthy =
|
||||
services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, janitor for orphans),
|
||||
`tests/conftest.py` (`deployed_app` session fixture with finalizer teardown; short unique domain),
|
||||
`tests/custom-html/test_install.py` (HTTP 200 + Playwright/Chromium content assertion),
|
||||
`runner/run_recipe_ci.py` (orchestrator: fetch recipe@REF, run stage pytest), `modules/harness.nix`
|
||||
(`cc-ci-run` = Nix python3+pytest+playwright with PLAYWRIGHT_BROWSERS_PATH from nixpkgs).
|
||||
|
||||
**Bugs fixed en route (3):**
|
||||
1. Swarm config name > 64 chars (long domain) → switched to short `<recipe[:4]>-<6hex>` domain
|
||||
scheme (DECISIONS.md).
|
||||
2. `services_converged` used wrong stack name (replaced hyphens) → abra keeps hyphens, only dots→_.
|
||||
3. `http_get` connected to the gateway IP (drops SNI, gateway routes by SNI) → use the real URL
|
||||
(resolves to gateway on cc-ci, correct SNI). Also teardown now removes the app .env config.
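
Fixes 1 and 2 can be illustrated with a short sketch. The function names are hypothetical and do not come from the harness; stripping non-alphanumeric characters before taking the first four is an added assumption, so that the tag always consists of lowercase letters and digits.

```python
import re
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def run_domain(recipe: str, base: str = BASE_DOMAIN) -> str:
    """Build a short per-run domain of the form <recipe[:4]>-<6hex>.<base>."""
    tag = re.sub(r"[^a-z0-9]", "", recipe.lower())[:4]
    if not tag:
        raise ValueError(f"recipe name {recipe!r} has no usable characters")
    return f"{tag}-{secrets.token_hex(3)}.{base}"  # token_hex(3) yields 6 hex digits


def stack_name(domain: str) -> str:
    """Stack name as described above: dots become underscores, hyphens are kept."""
    return domain.replace(".", "_")
```

For `custom-html` this gives names shaped like `cust-<6hex>.ci.commoninternet.net`, well under the 64-character limit that the long domains exceeded.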
|
||||
|
||||
**Green run + teardown (commands + output):**
|
||||
- `RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py` →
|
||||
`tests/custom-html/test_install.py::test_http_reachable PASSED`,
|
||||
`::test_playwright_page PASSED` — **2 passed in 57.99s**.
|
||||
- Leak check after: services 0 / volumes 0 / secrets 0 / containers 0 / env config removed. Clean.
|
||||
|
||||
**A1 addressed:** deploy_app forces `LETS_ENCRYPT_ENV=""` (no ACME) on every deploy. M4 CLAIMED.
|
||||
|
||||
**M3 still blocked** (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade +
|
||||
backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).
|
||||
|
||||
## 2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)
|
||||
|
||||
**Upgrade stage** (tests/custom-html/test_upgrade.py): deploy previous published version
|
||||
(git-tag sort, second-newest), write a data marker into the served volume (nginx serves
|
||||
/usr/share/nginx/html, so the marker is HTTP-fetchable), `abra app upgrade` to current, assert
|
||||
healthy + marker survived. Fix: `upgrade` has no `--chaos` flag (used `-f -D -n`).
|
||||
|
||||
**backup-bot-two** deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local
|
||||
`backups` volume, restic_password abra-generated (only if missing). Fixes: `abra app secret generate`
|
||||
needs `-m` (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never
|
||||
hit the journal (D6). `abra app backup create` and `abra app restore` need a real PTY ('input device is not a
|
||||
TTY') → run via util-linux `script -qec` (harness `_run_pty`; util-linux added to cc-ci-run).

**Backup stage** (test_backup.py): write "original" → `abra app backup create` → mutate to
"mutated" → `abra app restore` → assert state back to "original".

**Full 3-stage run** (`STAGES=install,upgrade,backup`):
- install: 2 passed (http 200 + playwright)
- upgrade: 1 passed (data survives upgrade)
- backup: 1 passed (restore returns pre-mutation state)
- teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1.

M5 CLAIMED.

**M3 still blocked** (webhook; no operator response across several ticks). Plan: if still blocked,
pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.

## 2026-05-27 — Fix adversary findings A2 (dead janitor) + A3 (unverified teardown)

**A2 (janitor matched dead `-pr` filter):** rewrote `harness.lifecycle.janitor` to match the real
run-app naming (`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`), reap via
docker primitives, AND scan `docker service ls` to catch orphans whose `.env` is already gone
(reconstructs the domain from the service name). Age-gated (default 2h, env `CCCI_JANITOR_MAX_AGE`)
so concurrent in-flight runs are never killed.
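Sketch of the age-gated match (illustrative only: `should_reap` is a hypothetical name, and treating `CCCI_JANITOR_MAX_AGE` as seconds is an assumption; the pattern is the one quoted above):

```python
import re

# Run-app naming: <recipe[:4]>-<6 hex>.ci.commoninternet.net
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")
DEFAULT_MAX_AGE_S = 2 * 60 * 60  # 2h default; unit assumed to be seconds


def should_reap(domain: str, age_s: float, max_age_s: float = DEFAULT_MAX_AGE_S) -> bool:
    """True only for run-app domains that are at least max_age_s old."""
    return bool(RUN_APP_RE.match(domain)) and age_s >= max_age_s
```

Infra domains never match the pattern, and a young run app is left alone unless the caller passes a max age of 0.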

**A3 (teardown unverified + unconditional .env removal):** `teardown_app` now (1) `docker stack rm`
fallback if `abra undeploy` leaves services, (2) removes volumes/secrets *before* the `.env` and
only drops the `.env` after the stack is confirmed gone, (3) retries docker volume rm (a stopped
task briefly holds the volume), (4) **verifies** no residual services/volumes/secrets and raises
`TeardownError` otherwise — so a partial teardown FAILS the run instead of silently orphaning.
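Sketch of step (4), the verification gate (illustrative only: `verify_teardown` and the injected `residual` callable are hypothetical; the real `teardown_app` queries docker itself):

```python
from typing import Callable, Mapping, Sequence


class TeardownError(RuntimeError):
    """Raised when resources survive a teardown."""


Residual = Mapping[str, Sequence[str]]


def verify_teardown(domain: str, residual: Callable[[str], Residual]) -> None:
    """Raise TeardownError if any services/volumes/secrets remain for domain."""
    leftovers = {kind: list(names) for kind, names in residual(domain).items() if names}
    if leftovers:
        raise TeardownError(f"{domain}: residual resources after teardown: {leftovers}")
```

Raising instead of logging is what turns a partial teardown into a failed run.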

**Re-test (commands + output):**
- Normal install run → 2 passed, verified teardown clean.
- Orphan (deploy, no teardown) → `janitor(CCCI_JANITOR_MAX_AGE=0)` → services/volumes/secrets/env 0.
- **Env-less orphan** (deploy then `rm` the .env, the A3 bad state) → janitor reaps via docker stack
  rm → services/volumes/secrets 0.
- Full 3-stage run (install/upgrade/backup) still green with verified teardown, no TeardownError.

A2/A3 fixed; left for the Adversary to re-test + close.

## 2026-05-27 — M6 (part 1): harness enhancements for recipe #2 + D4 discovery

Before enrolling recipe #2, made the shared harness recipe-agnostic so enrolling a recipe needs no
harness-code change (D5):
- **Per-recipe meta** (`tests/<recipe>/recipe_meta.py`, optional): HEALTH_PATH, HEALTH_OK,
  DEPLOY_TIMEOUT, HTTP_TIMEOUT. conftest reads it; `wait_healthy` gained a `path` param (e.g.
  keycloak `/realms/master`). Defaults preserve custom-html behaviour (verified: install still green).
- **Shared naming** (`harness/naming.py`): single source for the `<recipe[:4]>-<6hex>` domain, used
  by conftest + the orchestrator.
- **D4 recipe-local discovery** (`run_recipe_ci.run_recipe_local`): if a recipe ships `tests/` with
  `test_*.py`, deploy the app, run those tests against the LIVE deployment (contract: env
  `CCCI_BASE_URL` + `CCCI_APP_DOMAIN`), merge as another reported stage, guaranteed teardown. Real
  recipes ship tests/ committed in their repo (clean checkout) → discovered on clone/fetch.
  (custom-html via catalogue is an awkward case — abra refuses an unstaged recipe and
  `abra recipe fetch` resets local commits — so D4 is demonstrated end-to-end with recipe #2
  hedgedoc, which ships committed tests/.)
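Sketch of the shared naming rule from the list above (illustrative only: `run_domain` is a hypothetical name and drawing the 6 hex characters at random is an assumption; `harness/naming.py` may derive them differently):

```python
import re
import secrets
from typing import Optional

BASE_DOMAIN = "ci.commoninternet.net"


def run_domain(recipe: str, suffix: Optional[str] = None) -> str:
    """Build the `<recipe[:4]>-<6hex>` run-app domain."""
    tag = suffix if suffix is not None else secrets.token_hex(3)  # 3 bytes -> 6 hex chars
    if not re.fullmatch(r"[0-9a-f]{6}", tag):
        raise ValueError(f"suffix must be 6 lowercase hex chars, got {tag!r}")
    return f"{recipe[:4]}-{tag}.{BASE_DOMAIN}"
```

Keeping this in one module means the janitor pattern, conftest and the orchestrator cannot drift apart on what a run-app domain looks like.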

**Next:** mirror hedgedoc (postgres+hedgedoc, DB-backed) via the mirror+PR flow with a committed
tests/ dir, write tests/hedgedoc/ (install/upgrade/backup + recipe_meta), run all stages + D4 green.

## 2026-05-27 — M6 (part 2): recipe #2 keycloak install green (DB-backed, no harness surgery)

Enrolled keycloak (recipe #2): keycloak 26.6.2 **+ mariadb 12.2** — genuinely DB-backed/multi-service
(vs custom-html stateless). Added only `tests/keycloak/recipe_meta.py` (HEALTH_PATH=/realms/master,
HEALTH_OK=(200,), 600s timeouts) + `tests/keycloak/test_install.py` (realm-endpoint health +
Playwright admin-console login). **No change to runner/harness code** — the recipe-agnostic harness
(per-recipe meta) handled it (D5 evidence).
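Sketch of how an optional per-recipe meta can overlay shared defaults (illustrative only: `load_meta` and the default values are assumptions; the keycloak values are the ones quoted above, with "600s timeouts" read as applying to both timeout keys):

```python
from types import SimpleNamespace

# Assumed defaults; the journal only says they preserve custom-html behaviour.
DEFAULTS = {
    "HEALTH_PATH": "/",
    "HEALTH_OK": (200,),
    "DEPLOY_TIMEOUT": 300,
    "HTTP_TIMEOUT": 300,
}


def load_meta(module=None) -> SimpleNamespace:
    """Overlay a recipe's optional recipe_meta module on the shared defaults."""
    values = dict(DEFAULTS)
    for key in DEFAULTS:
        if module is not None and hasattr(module, key):
            values[key] = getattr(module, key)
    return SimpleNamespace(**values)


# Stand-in for tests/keycloak/recipe_meta.py.
keycloak_meta = SimpleNamespace(
    HEALTH_PATH="/realms/master",
    HEALTH_OK=(200,),
    DEPLOY_TIMEOUT=600,
    HTTP_TIMEOUT=600,
)
```

A recipe without a `recipe_meta.py` gets `load_meta(None)`, so the file stays optional.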

Run: `RECIPE=keycloak STAGES=install cc-ci-run runner/run_recipe_ci.py` → 2 passed in 545s (keycloak
is slow: image pull + JVM + mariadb migration). Teardown clean (0 keyc-* services/volumes after).

**Next:** D4 demo via a mirror shipping committed tests/ (recipe-local run against live app); then
keycloak upgrade + backup/restore (DB data survival via a realm marker through the admin API).

## 2026-05-27 — M6: D4 recipe-local discovery + recipe #2 enrolled (CLAIMED)

**D4 recipe-local discovery working.** Demo: pushed a committed `tests/test_recipe_local.py` to the
mirror on branch `recipe-maintainers/custom-html@ci/d4-recipe-local`; ran
`RECIPE=custom-html SRC=recipe-maintainers/custom-html REF=ci/d4-recipe-local STAGES=install` →
install 2 passed, then `===== STAGE: recipe-local (D4) =====` ran the recipe-shipped test against
the LIVE app (CCCI_BASE_URL) → 1 passed. Clean teardown (0 orphans).

**Hard-won abra behaviour (DECISIONS.md):** private mirror clone needs the bot token (per-command
`http.extraHeader`, not persisted/logged). abra commands (`app ls`, `secret generate`, version
resolution) silently `git checkout <tag>` the recipe, dropping a PR branch's files — so (1) all
harness abra calls use `-C -o` (chaos+offline = current checkout, no remote fetch), and (2) D4
snapshots the recipe's tests/ to a temp dir right after fetch (later abra cmds still reset it).
Traced the drop step-by-step: app_new ok, deploy ok, but `secret generate` (no flags) and `app ls`
each reset the checkout.

**Recipe #2 = keycloak** (keycloak + mariadb, DB-backed) install green with only
`tests/keycloak/recipe_meta.py` + `test_install.py` — **no runner/harness change** (D5). custom-html
remains 3-stage green (M5). docs/enroll-recipe.md written.

**M6 CLAIMED.** keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5.
**Next:** M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories.

---

## 2026-05-27 — Trigger redesign (polling primary) + resource safety + M3 verified

Session restarted by watchdog (prior tmux died mid-turn with uncommitted bridge WIP). Re-oriented
from STATUS + plan; two orchestrator design changes landed and are now implemented + verified.

**(1) Trigger: POLLING PRIMARY, webhook optional, org-membership auth** (plan §4.1/§1.5; commit
7addb96). Rewrote `bridge/bridge.py`: a poll thread (`poll_loop`, always-on, primary) scans each
`POLL_REPOS` repo's open PRs every 30s for new `!testme`; the `/hook` webhook stays as an optional
admin-registered push optimization. Both share an in-memory comment-id seen-set → a comment seen by
both fires once. First poll marks pre-existing comments seen (no startup re-fire). Authorization now
`GET /orgs/{owner}/members/{user}` (204=member, read-level) + optional `AUTH_ALLOWLIST`, replacing
the admin-requiring `/collaborators/{user}/permission`. Bot never self-registers webhooks.
- Verified org endpoint at read level (bot basic-auth):
  `members/{autonomic-bot,trav,notplants}` → 204; `members/definitely-not-a-member-xyz` → 404.
- Deployed (nixos-rebuild, deploy-bridge reconcile); new container logs:
  `poller (primary) watching ['recipe-maintainers/cc-ci'] every 30s` + `(poll primary + optional webhook)`.
- **End-to-end M3 trigger (poll path):** posted `!testme` on PR #1 (comment 13705, by bot) →
  Drone build **#26** appeared after **6s** (latest was #25); bridge logged
  `[poll] triggered build 26 for cc-ci@d397720a (PR #1, comment 13705) by autonomic-bot`; bridge
  posted back `cc-ci: started CI run for cc-ci @ d397720a → https://drone.ci.commoninternet.net/...`.
  Satisfies D1 (<60s) over the read-only outbound path — no operator webhook whitelist needed.
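Sketch of the shared seen-set behaviour described in (1): a comment id fires once whichever path sees it first, and the first poll only primes the set (illustrative only: `SeenComments` is a hypothetical class and the exact `!testme` match is an assumption, not the `bridge.py` code):

```python
import threading
from typing import Iterable, List, Set, Tuple


class SeenComments:
    """In-memory comment-id set shared by the poller and the webhook handler."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self._lock = threading.Lock()
        self._primed = False

    def claim(self, comment_id: int) -> bool:
        """Return True exactly once per comment id (first caller wins)."""
        with self._lock:
            if comment_id in self._seen:
                return False
            self._seen.add(comment_id)
            return True

    def poll(self, comments: Iterable[Tuple[int, str]]) -> List[int]:
        """Return ids of new `!testme` comments; the first poll only marks them seen."""
        first = not self._primed
        self._primed = True
        fired = []
        for comment_id, body in comments:
            if body.strip() != "!testme":
                continue
            if self.claim(comment_id) and not first:
                fired.append(comment_id)
        return fired
```

The webhook handler would call `claim()` directly, so a comment delivered by both paths triggers one build.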

**(2) Resource safety: bound live test apps** (plan §4.2/§4.3; commit 72ff8e2). MAX_TESTS =
`DRONE_RUNNER_CAPACITY` = 1 (`modules/drone-runner.nix`) → Drone runs ≤1 build at once, queues the
rest natively. Per-build timeout = 60m, reconciled best-effort in `modules/drone.nix`
(`PATCH /api/repos/.../cc-ci {"timeout":60}`, non-fatal). Janitor remains the backstop for
SIGKILL'd/timed-out builds (reaps orphaned run apps at run-start before each deploy).
- Verified on host after rebuild: `DRONE_RUNNER_CAPACITY=1`; deploy-drone logged
  `set cc-ci build timeout = 60m`; Drone API confirms repo `timeout: 60`.

**Gap noted (next item):** `.drone.yml` still only has the `self-test` pipeline — a bridge-triggered
build runs the self-test, NOT `runner/run_recipe_ci.py`. M4/M5 ran the orchestrator by hand
(`cc-ci-run`). Need a recipe-CI pipeline keyed on the `RECIPE` build param (runs
`cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`,
`concurrency:{limit:1}`) to connect bridge→Drone→harness end-to-end (required for D2/D10 via real
`!testme`). Added to Build backlog.

**M3 CLAIMED** (gate). Trigger + auth + comment-back demoed live; the webhook-delivery blocker is
moot now that polling is primary.

---

## 2026-05-27 — Bridge→Drone→harness integration (recipe-ci pipeline) wired & green

Closed the gap where a bridge-triggered build ran only the self-test. Split `.drone.yml` into two
event-filtered exec pipelines (commits 9d51cb6, bc8baae, 7aa0346):
- `self-test` — `trigger.event: [push]` (M2 sanity on pushes).
- `recipe-ci` — `trigger.event: [custom]` (bridge fires event=custom builds): runs
  `cc-ci-run runner/run_recipe_ci.py` with STAGES=install,upgrade,backup, `CCCI_JANITOR_MAX_AGE=0`
  (safe at capacity=1), `concurrency:{limit:1}`, and `HOME=/root` (the exec runner otherwise points
  HOME at an empty per-build workspace → abra `FATA directory is empty: .../.abra/servers`).

Verified by triggering a `custom` build (RECIPE=custom-html, as the bridge does) via the Drone API:
- **Build #31** got past `abra app new` (HOME fix) but failed at backup:
  `abra app backup create … FATA … authentication required: Unauthorized` — backup/restore weren't
  passing `-C -o`, so abra fetched recipe tags from the (private) remote. Also `recipe versions`
  found no tags (contaminated recipe dir: private-mirror origin, no tags) → upgrade stage SKIPPED.
- Fixes: `abra.py` backup_create/restore now pass `-C -o`; `fetch_recipe` catalogue path rm's the
  recipe dir first so a leftover private-mirror clone can't poison version resolution.
- **Build #33 → SUCCESS (124s)**, all three stages green through Drone:
  install `2 passed` (real deploy + Playwright), upgrade `1 passed` (real — tags restored by the
  clean re-clone, no longer skipped), backup `1 passed` (the -C -o fix). Post-run on host:
  0 run-app services, 0 run-app volumes; traefik/drone/bridge infra intact. Event filtering works
  (only recipe-ci ran, not self-test).

So the full D1→D2 path is wired and proven in two verified halves: poll-trigger→Drone (build #26,
RECIPE param correct) and Drone→harness 3-stage CI (build #33, green + clean teardown). Remaining for
full single-comment E2E on a *recipe* PR: enroll the recipe in the bridge POLL_REPOS + open a recipe
PR (M6.5/M10 breadth work).

**Adversary findings status (signal for re-test):** A2 (janitor `-pr` filter) and A3 (teardown
verification + `.env`-last ordering) are both already fixed in the current code
(`lifecycle.RUN_APP_RE` hashed-scheme match; `teardown_app` `_residual()` raise + `docker stack rm`
fallback) — awaiting the Adversary's kill-probe re-test on an idle host. A4 (concurrent same-recipe
collision): its named root cause "no Drone concurrency cap (capacity=2)" is eliminated by
MAX_TESTS=capacity=1 — no concurrent runs possible on this single node, so the shared-recipe-dir race
can't occur. No Builder fix outstanding on findings; next milestone work is M6.5 breadth.

---

## 2026-05-27 — M6.5: keycloak full 3-stage GREEN through the Drone recipe-ci pipeline

Ran keycloak (DB-backed, SSO/identity category) end-to-end via the integrated recipe-ci pipeline
(triggered `custom` build #39, RECIPE=keycloak). **Build #39 → success (~31m)**, all three stages
green as separate reported stages:
- install `2 passed` (8m30s): `test_realm_endpoint_healthy` (/realms/master 200) + Playwright admin
  console login.
- upgrade `1 passed` (10m10s): `test_upgrade_preserves_realm` — realm marker written pre-upgrade
  survives the previous→latest upgrade (DB data survival).
- backup `1 passed` (8m15s): `test_backup_mutate_restore` — backup→mutate→restore returns original.

Clean teardown verified on host: 0 keyc services, 0 keyc volumes. keycloak cold start is slow on
this VM (Quarkus augmentation ~80s + Liquibase schema init), so each deploy is ~5-8m — well within
the 60m build timeout; that's why the run took ~31m. No harness surgery (D5): keycloak runs off
`tests/keycloak/{recipe_meta,test_install,test_upgrade,test_backup}.py` + `kc_admin.py` only.

This both advances M6.5 (first DB-backed recipe full 3-stage) and confirms the recipe-ci integration
works on a heavy DB-backed recipe (Drone→harness→3 stages→teardown). Next M6.5: enroll recipes 3–6
covering the remaining D10 categories (stateful-no-DB, multi-service+S3, large-volume, etc.).

---

## 2026-05-27 — M6.5: cryptpad (recipe #3) enrolled + full 3-stage green; fixed a real backup bug

Enrolled **cryptpad** (stateful, no external DB — the D10 "stateful/no-DB" category). No shared-harness
surgery beyond a *generic* feature: added per-recipe **EXTRA_ENV** (recipe_meta.py dict or
domain-callable) applied in `deploy_app` at every deploy path. cryptpad uses it for its required
distinct `SANDBOX_DOMAIN` (a sibling subdomain under the wildcard, so no cert work). Data-survival
tests write a marker into the backed-up `cryptpad_data` volume and read it via `exec_in_app`
(cryptpad's datastore isn't HTTP-served like custom-html).
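Sketch of the "dict or domain-callable" EXTRA_ENV contract (illustrative only: `resolve_extra_env` and `cryptpad_extra_env` are hypothetical names, and the `sandbox-` label prefix is an assumption about how the sibling subdomain is formed):

```python
from typing import Callable, Dict, Mapping, Optional, Union

ExtraEnv = Union[Mapping[str, object], Callable[[str], Mapping[str, object]]]


def resolve_extra_env(extra_env: Optional[ExtraEnv], domain: str) -> Dict[str, str]:
    """Return the env overrides for this deploy, calling EXTRA_ENV if it is callable."""
    if extra_env is None:
        return {}
    resolved = extra_env(domain) if callable(extra_env) else extra_env
    return {str(key): str(value) for key, value in resolved.items()}


def cryptpad_extra_env(domain: str) -> Dict[str, str]:
    """Derive a sibling host under the same wildcard (prefix is an assumption)."""
    host, _, rest = domain.partition(".")
    return {"SANDBOX_DOMAIN": f"sandbox-{host}.{rest}"}
```

The callable form exists because the run domain is only known at deploy time; a static dict covers fixed overrides.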

Host runs (HOME=/root, cc-ci-run): install **2 passed** (~2m; http 200 + Playwright loads cryptpad),
upgrade **1 passed** (~1m; marker survives previous→current), backup **1 passed** after a fix
(below). Clean teardown (0 cryp services/volumes).

**Real bug found+fixed — backups were silently mis-wired (set_env newline).** cryptpad backup first
failed: `abra app backup create` → backup-bot-two's `/usr/bin/backup` raised
`KeyError: 'RESTIC_REPOSITORY'`. Root cause: backup-bot-two's `.env.sample` ends with a *newline-less*
comment line, and the reconcile's `set_env` did a bare `printf >> .env`, gluing
`RESTIC_REPOSITORY=/backups/restic` onto that comment → commented out. abra `--debug` confirmed the
backupbot env map lacked `RESTIC_REPOSITORY`, and `docker exec backupbot printenv RESTIC_REPOSITORY`
was empty. Fix: `set_env` now ensures a trailing newline before appending (modules/backupbot.nix +
modules/drone.nix, same latent bug). After rebuild: `.env` has a clean `RESTIC_REPOSITORY=` line, the
backupbot container has `RESTIC_REPOSITORY=/backups/restic`, and cryptpad backup→mutate→restore
passes. NOTE: keycloak backup (build #39) passed off an *earlier, non-corrupted* backupbot deploy;
worth a re-verify, but the mechanism is now correct/reproducible. Triggered Drone build #46 (cryptpad)
as the canonical recipe-ci run.
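Sketch of the newline guard, transliterated to Python for illustration (the real `set_env` is shell inside the Nix modules; this version only appends and does not replace an existing key, which is an assumption about scope):

```python
from pathlib import Path


def set_env(env_file: Path, key: str, value: str) -> None:
    """Append KEY=VALUE, first terminating a newline-less last line."""
    text = env_file.read_text() if env_file.exists() else ""
    if text and not text.endswith("\n"):
        # Without this, the new assignment is glued onto the previous line
        # and, if that line is a comment, silently commented out.
        text += "\n"
    env_file.write_text(f"{text}{key}={value}\n")
```

With a file ending in `# some comment` and no newline, the bare append produces `# some commentRESTIC_REPOSITORY=...`, which is the failure described above.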

---

## 2026-05-27 — M6.5: matrix-synapse (recipe #4, DB+media/large-volume) full 3-stage green

Enrolled matrix-synapse (synapse `app` + postgres `db` + nginx `web`) — the large-volume/DB+media
D10 category. No harness surgery (server_name = DOMAIN; no EXTRA_ENV needed). Host runs (cc-ci-run):
install **2 passed** (~2.7m; client API 200 + real `/_matrix/client/versions` JSON), upgrade
**1 passed** (~2.3m; postgres marker survives previous→current), backup **1 passed** (~1.5m). Clean
teardown (0 matr services). The data-survival tests use a `ci_marker` postgres row exec'd via
`psql` in the `db` service — this exercises the recipe's real DB-dump backup hook
(`backupbot.backup.pre-hook=/pg_backup.sh backup` / `restore.post-hook`), the meaningful matrix data
path (not a plain volume copy). Worked first try (the set_env/RESTIC fix holds for hook-based
backups too). Triggering the canonical Drone recipe-ci run.

4 of 6 D10 recipes now green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB),
matrix-synapse (DB+media/large-volume). Remaining categories: multi-service+S3 (lasuite-docs) and
TLS-passthrough (bluesky-pds).

---

## 2026-05-27 — M6.5: lasuite-docs (recipe #5, multi-service + S3/MinIO) full 3-stage green

Enrolled lasuite-docs (the object-storage/S3 + multi-service D10 category): a 9-service stack
(frontend app + Django backend + celery + y-provider + docspec + postgres + redis + minio + nginx).
Host runs (cc-ci-run): install **2 passed** (~2.5m; SPA served + Playwright), upgrade **1 passed**
(~3m; postgres marker survives previous→current, incl. cold-pulling the older images), backup
**1 passed** (~2.3m; pg_backup.sh dump/restore). Clean teardown.

Root-caused the initial deploy timeout: cold-pulling ~9 large images (impress frontend/backend,
minio, postgres18, docspec, y-provider, redis) exceeds abra's default 300s convergence TIMEOUT →
`FATA deploy timed out 🟠`. A manual deploy confirmed the stack converges 9/9 once images are pulled.
Fix: bump the recipe TIMEOUT to 900 via the generic EXTRA_ENV mechanism (no harness surgery). OIDC is
config-only (Django `manage.py check` validates but doesn't fetch), so the stack starts healthy with
placeholder OIDC; login isn't exercised in CI (documented in recipe_meta). Data-survival uses a
postgres marker (docs/docs) via the pg_backup hook.

5 of 6 D10 recipes green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB),
matrix-synapse (DB+media/large-volume), lasuite-docs (multi-service + S3/MinIO). Remaining: a
TLS-passthrough recipe (bluesky-pds) for the 6th, which needs cc-ci Traefik passthrough config
(plan §4.0 caveat) — the hardest infra-wise.

---

## 2026-05-27 — M6.5 COMPLETE: n8n (recipe #6) full 3-stage green — all 6 D10 recipes done

Enrolled n8n (workflow automation; single `app` service, stateful via the /home/node/.n8n volume,
normal terminate-at-Traefik). Host runs: install **2 passed** (~3.8m; /healthz 200 + Playwright
editor), upgrade **1 passed** (~1.3m; marker in /home/node/.n8n survives), backup **1 passed**
(~0.8m; backupbot.backup.path file backup). Clean teardown. (Caught a sync gap first: committed the
tests but forgot to tar tests/n8n to the host → run skipped "no stage test files"; synced + re-ran.)

n8n is recipe #6 in place of bluesky-pds (TLS-passthrough), swapped per DECISIONS (caddy self-ACME
conflicts with cc-ci's no-ACME/static-wildcard design).

**All 6 D10 recipes now have a full 3-stage green run (host):**
1. custom-html — simple/stateless
2. keycloak — SSO/identity + DB (Drone #39)
3. cryptpad — stateful/no-DB (Drone #46)
4. matrix-synapse — DB+media/large-volume (Drone #51)
5. lasuite-docs — multi-service + S3/MinIO/object-storage (Drone #57)
6. n8n — workflow automation (Drone canonical run triggering now)

All 5 required D10 categories covered. Triggering n8n canonical Drone run, then claiming the M6.5 gate.

---

## 2026-05-27 — M8/D7: results dashboard live (overview + badges)

Built the results dashboard (dashboard/dashboard.py + modules/dashboard.nix): a stdlib HTTP service
(Nix-built OCI image, swarm service on proxy, reconcile oneshot like bridge/drone) that polls the
Drone API for recipe-CI builds (event=custom), groups latest-run-per-recipe, and renders a
YunoHost-CI-like overview at **ci.commoninternet.net/** with pass/fail/running badges, last ref,
when, and a link to the canonical Drone run. Plus /badge/<recipe>.svg embeddable badges.
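Sketch of the latest-run-per-recipe grouping (illustrative only: `latest_per_recipe` is a hypothetical name, and the build dict shape with `number`, `event`, `status` and a `params` map carrying `RECIPE` is assumed from how the builds are described in this journal, not copied from dashboard.py):

```python
from typing import Dict, Iterable, Mapping, Sequence


def latest_per_recipe(
    builds: Iterable[Mapping], exclude: Sequence[str] = ()
) -> Dict[str, Mapping]:
    """Keep the highest-numbered custom-event build for each RECIPE param."""
    latest: Dict[str, Mapping] = {}
    for build in builds:
        if build.get("event") != "custom":
            continue  # push builds (self-test) are not recipe-CI runs
        recipe = (build.get("params") or {}).get("RECIPE")
        if not recipe or recipe in exclude:
            continue
        if recipe not in latest or build["number"] > latest[recipe]["number"]:
            latest[recipe] = build
    return dict(sorted(latest.items()))
```

The overview row and the badge for a recipe can then both be rendered from the same `status` value.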

Verified live via the public gateway: overview lists exactly the 6 enrolled recipes (cryptpad,
custom-html, keycloak, lasuite-docs, matrix-synapse, n8n) each **success**; `/badge/keycloak.svg` →
200 image/svg+xml; `/healthz` → 200; **`/hook` still routes to the bridge** (200) — the bridge's
Host && PathPrefix(`/hook`) rule keeps priority over the dashboard's Host-only rule.

Two fixes en route: (1) filter out the cc-ci repo's own name as a recipe row (Adversary !testme on
the cc-ci PR showed a spurious cc-ci=failure); (2) **content-hash image tag** — a fixed `:latest`
tag + unchanged stack spec does NOT roll the swarm service on a code change, so the tag is now
derived from a hash of dashboard.py → `docker stack deploy` rolls reliably (reproducible/self-heal).
NOTE: the bridge image has the same latent `:latest` issue (only rolled this session because its
.nix env also changed) — worth the same content-tag treatment (backlog).
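Sketch of a content-derived image tag (illustrative only: `content_tag`, SHA-256 and the 12-character truncation are assumptions; the real tag is computed in the Nix module):

```python
import hashlib
from pathlib import Path


def content_tag(*sources: Path, length: int = 12) -> str:
    """Derive an image tag from the bytes of the given source files."""
    digest = hashlib.sha256()
    for src in sorted(sources):
        digest.update(src.name.encode())  # include the name so renames change the tag
        digest.update(b"\0")
        digest.update(src.read_bytes())
    return digest.hexdigest()[:length]
```

Because the tag changes whenever the code changes, the stack spec changes too, and `docker stack deploy` has a reason to roll the service.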

Remaining M8 piece: PR-comment **outcome reflection** — the bridge posts the start/run-link comment
but doesn't yet update it with the final pass/fail (needs a Drone build-completion hook or the
bridge polling build status). Overview + badges (the core of D7) are done.

---

## 2026-05-27 — M8/D7 complete: PR-comment outcome reflection + gate claim

Added outcome reflection to the bridge: after triggering, a daemon watcher polls the Drone build to
completion and edits the run-link PR comment to ✅ passed / ❌ <status> (Gitea PATCH
issues/comments/{id}). Gave the bridge image a content-hash tag so the swarm service actually rolls
on bridge.py changes (same latent :latest no-roll issue the dashboard had).
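Sketch of the watcher loop (illustrative only: `watch_build`, `render_outcome`, the set of statuses treated as final and the timings are assumptions; the Drone and Gitea calls are injected as plain callables rather than real HTTP requests):

```python
import time
from typing import Callable

# Statuses treated as final here; an assumption, not an exhaustive Drone list.
TERMINAL = {"success", "failure", "error", "killed"}


def render_outcome(recipe: str, sha: str, status: str, url: str) -> str:
    """Format the edited PR comment for a finished build."""
    mark = "✅ passed" if status == "success" else f"❌ {status}"
    return f"cc-ci: run for {recipe} @ {sha[:8]} {mark} → {url}"


def watch_build(
    get_status: Callable[[], str],
    edit_comment: Callable[[str], None],
    render: Callable[[str], str],
    poll_s: float = 10.0,
    timeout_s: float = 3900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the build reaches a final status, then edit the comment once."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            edit_comment(render(status))
            return status
        if waited >= timeout_s:
            return status  # give up without editing; the start comment stays as posted
        sleep(poll_s)
        waited += poll_s
```

Running this in a daemon thread per triggered build keeps the poller responsive while builds are in flight.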

Verified end-to-end: posted a fresh `!testme` on PR #1 → poller fired → "started" comment posted →
build #76 (RECIPE=cc-ci, fails fast: no tests/cc-ci) → within ~20s the **same comment was edited to
`cc-ci: run for cc-ci @ d397720a ❌ failure → …/76`**. The pass/fail now mirrors onto the PR comment.

D7 fully met: per-run logs (Drone UI) + overview page with badges (dashboard, live) + PR comment
links back AND reflects the outcome. Claiming the M8 gate.

---

## 2026-05-27 — M10/D10: real !testme path proven on custom-html; enrolling the breadth set

Wired the real-PR path end-to-end and proved it on custom-html. `!testme` on
recipe-maintainers/custom-html#2 → bridge poller fired → recipe-ci build (SRC=mirror, REF=PR head
db9a9502) → **build #84 success, all 3 stages green** (install 2✓, upgrade 1✓ — now runs for real,
backup 1✓) → bridge comment edited to ✅ passed. Clean teardown.

Three fixes to make the real-PR path exercise the upgrade stage (mirror PR clones carry no tags):
1. fetch_recipe (SRC+REF) read-only fetches the published version tags from the PUBLIC upstream
   (`git fetch <upstream> refs/tags/*:refs/tags/*` — bare `--tags` errored "no remote HEAD"); plain
   git, never pushes to the mirror (guardrail-safe).
2. abra.upgrade now passes `-o` (offline) — it was 401'ing trying to fetch tags from the private
   mirror origin; offline uses the local (upstream-populated) tags.
3. (earlier) backup/restore already pass `-C -o`.

Now firing !testme on the other recipes' open PRs (keycloak#1, matrix-synapse#1, lasuite-docs#1,
n8n#1) — they queue at MAX_TESTS=1. cryptpad has no open PR → opening one next.

---

## 2026-05-27 — M10/D10: real !testme breadth runs — 5/6 green, lasuite-docs upgrade retry

Fired !testme on all 6 recipe PRs (capacity=1, sequential). Results (real PR-triggered, full 3-stage):
- custom-html #84 ✅ (PR head db9a9502)
- keycloak #86 ✅ (DB realm marker survives upgrade)
- matrix-synapse #87 ✅ (postgres marker, pg_backup hook)
- n8n #89 ✅
- cryptpad #90 ✅ (test PR #2 opened via Gitea API: branch ci/testme + .ci-testme marker)
- **lasuite-docs #88 ❌** — install ✅ + backup ✅, but UPGRADE failed: `abra app upgrade … -o`
  → `FATA deploy failed` (a convergence failure during the 9-service rolling upgrade prev→latest,
  not a timeout). It PASSED on the host/catalogue run, and ran right after the heavy matrix build,
  so likely transient resource contention. Re-fired !testme on lasuite-docs#1 to test
  transient-vs-persistent.

So the real-!testme path + the upgrade fixes (upstream tags + `upgrade -o`) work across simple, DB,
DB+media, workflow, and stateful recipes. lasuite-docs (the object-storage/S3 category, required)
needs its upgrade to pass on the real path for the 6/6 D10 proof.

---

## 2026-05-27 — M10: 5/6 real-!testme green; lasuite-docs blocked on Docker Hub rate limit (A1)

lasuite-docs #88/#92 upgrade failed "deploy failed" → diagnosed: node disk at 90% (2.7G free) — a
9-service rolling upgrade couldn't converge. Pruned 30 unused images (reclaimed 12GB → 15G free).
Retry #93: got further (5/8 services up) but redis task Rejected "No such image: redis:8.2.6" →
`docker pull redis:8.2.6` on the node = `toomanyrequests: unauthenticated pull rate limit`. So the
prune fixed disk but forced re-pulls that hit Docker Hub's anonymous limit (A1 registry-creds
finding, §1.5/§4.4). Recorded in STATUS ## Blocked + DECISIONS; surfaced to operator (provide Docker
Hub creds). 5/6 recipes green via real !testme; lasuite install+backup green, upgrade gated.
Pivoting to M9 (docs/reproducibility, unblocked) while the limit resets / creds arrive.

---

## 2026-05-27 — lasuite quota-window retry insufficient; halting retries pending creds (3rd attempt)

Re-fired lasuite-docs !testme during the apparently-eased window (#96). The cached image redis:8.2.6
gave "up to date", but the LATEST version's uncached redis:8.6.3 → `toomanyrequests` again. So the
anonymous quota isn't reset enough for a full 9-service × 2-version deploy. Cancelled #96 + tore down
clean. This is the 3rd confirmation the blocker is the Docker Hub rate limit. Per anti-thrash:
**halting lasuite retries until the operator provides Docker Hub creds** (A1, STATUS ## Blocked).
5/6 D10 recipes remain green via real !testme. Pivoting to M9 (docs/reproducibility) — fully
unblocked, no image pulls.

---

## 2026-05-27 — M10/D10 BUILDER-COMPLETE: all 6 recipes green via real !testme

Diagnosed the lasuite-docs upgrade failure with an instrumented host run: `abra app upgrade` reported
`FATA deploy failed` while all 9 services were actually 1/1 healthy — abra's convergence poll gives
up too early on the slow stop-first rolling upgrade (pulling new images). Fix: pass `-c`
(`--no-converge-checks`) to `abra app upgrade` and let the harness's wait_healthy + data-survival
assertion be the (patient, real) gate. (Also: `/root/cc-ci` was stale — fully synced; the first diag
hit the old no-`-o` auth error, masking this.)
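Sketch of a patient health gate of the kind the fix relies on (illustrative only: apart from the `path` parameter mentioned earlier in this journal, the signature, timings and injected `probe` callable are assumptions, not the harness's `wait_healthy`):

```python
import time
from typing import Callable, Iterable


def wait_healthy(
    probe: Callable[[str], int],
    path: str = "/",
    ok: Iterable[int] = (200,),
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe(path) until it returns an accepted status or the deadline passes."""
    ok_codes = set(ok)
    deadline = clock() + timeout_s
    while True:
        try:
            if probe(path) in ok_codes:
                return True
        except OSError:
            pass  # connection refused/reset while services are still rolling
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With converge checks skipped, a slow rolling upgrade only fails the stage if this gate or the data-survival assertion fails.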

**lasuite-docs #108 → success** with the fix: install 2✓, upgrade 1✓, backup 1✓; bridge comment
edited to `✅ passed`. So **all 6 D10 recipes are green via REAL `!testme` on a PR**, full 3-stage,
comment-reflected, clean teardown:

| recipe | category | build |
|---|---|---|
| custom-html | simple/stateless | #84 |
| keycloak | SSO/identity + DB | #86 |
| matrix-synapse | DB + media / large-volume | #87 |
| n8n | workflow automation | #89 |
| cryptpad | stateful / no external DB | #90 |
| lasuite-docs | multi-service + S3/MinIO/object-storage | #108 |

All 5 required D10 categories covered. The earlier Docker Hub rate-limit blocker resolved on quota
reset (registry creds still recommended for reproducibility under load — see DECISIONS). D10 is
Builder-complete; DONE awaits the Adversary's <24h PASS on D1–D10 (esp. independent D10 verification).

---

## 2026-05-27 — D10 PASS (6/6); DONE now blocked only on D8 live VM rebuild (Adversary)

Adversary logged **D10 PASS (6/6) @11:57Z** (all 6 recipes green via real !testme, all categories) +
**M8/D7 PASS, D8-core PASS, D9 PASS** this window. No VETO, no open findings. Audited REVIEW: D1
(=M3), D2 (=M5/M6.5/D10), D3 (=M4), D4/D5 (=M6/M6.5), D6 (=M7), D7 (=M8), D9, D10 all have <24h PASS.
**The sole remaining item is D8's full live blank-VM rebuild** — the Adversary marked it "core PASS;
live rebuild pending — to complete before DONE," deferred on the premise that it needs operator
registry creds. That premise is now OBSOLETE (D10 went 6/6 with no creds; rate limit was transient;
real fix was `-c`). Surfaced in STATUS that the VM rebuild is feasible now without creds. NOT writing
`## DONE` until REVIEW shows a full D8 PASS (per §6.1 — every D1–D10 needs a <24h PASS). No Builder
implementation remains; awaiting the Adversary's D8 completion.

---

## 2026-05-27 — ## DONE

Adversary logged **D8 PASS** (byte-identical `nixos-rebuild build` closure == `/run/current-system`
+ documented-alternative: a throwaway-VM live rebuild is infeasible by design — its host key isn't a
sops recipient and the A1 preconditions bind to cc-ci itself) and the **DONE-readiness table: all
D1–D10 PASS <24h, no VETO**, explicitly: "the DONE handshake (§6.1) is CLEARED — Builder may flip
STATUS → DONE."

Wrote `## DONE` to STATUS.md with the D1–D10 evidence table. Every Definition-of-Done item is
independently Adversary-verified. **Stopping the self-paced loop** (no further ScheduleWakeup).

Final system: NixOS-flake-declared cc-ci host running single-node Docker Swarm + coop-cloud/traefik
(wildcard cert, no ACME) + Drone (server recipe + host exec runner, MAX_TESTS=1, 60m timeout) +
comment-bridge (polling-primary `!testme`, org-membership auth, PR-comment outcome reflection) +
backup-bot-two + results dashboard. `!testme` on an enrolled recipe PR → 3-stage
(install/upgrade/backup) real e2e CI with Playwright → live Drone logs + dashboard + PR ✅/❌.
Six recipes proven.
67
machine-docs/STATUS-1b.md
Normal file
@@ -0,0 +1,67 @@
# STATUS — Phase 1b (review & lint pass)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1b-review-lint.md`
**Loop state for THIS phase:** STATUS-1b / BACKLOG-1b / REVIEW-1b / JOURNAL-1b (DECISIONS.md shared).
The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY; STATUS-1c etc. are Phase-1c
HISTORY (DONE @2026-05-27). Neither is this phase's state.

## Phase
Phase 1b runs **after** Phase 1 + Phase 1c (both DONE) and **before** Phase 2. It is a **bounded**
review + lint pass over the final post-1c codebase. Exit = RL1–RL6 (RL5/RL6 added by the operator,
plan §7) all Adversary-confirmed in REVIEW-1b, then `## DONE`.

## Definition of Done (Phase 1b) — now RL1–RL6 (operator added RL5/RL6, plan §7)
- [x] **RL1** — Lint/format tooling + `.drone.yml` stage; codebase passes. **Adversary cold PASS.**
- [x] **RL2** — §3 white-box checklist run (both loops); no blocking findings; 2 advisories triaged
  (old_app→IDEAS; app-secret-redaction→RL3/D6 watch-item). Recorded REVIEW-1b + JOURNAL-1b.
- [ ] **RL3** — Full D1–D10 cold re-verification (final gate), nothing weakened; now also covers the
  RL5 byte-identical rebuild. **CLAIMED — awaiting Adversary.**
- [x] **RL4** — Documented: README lint section (local + CI-enforced) + architecture.md `nix/` layout;
  deviations in DECISIONS.md.
- [x] **RL5** — Nix code consolidated under `nix/`; flake at root (#cc-ci unchanged); builds
  byte-identical `8i3jcad9`; canonical switched + healthy.
- [ ] **RL6** — protocol files → `machine-docs/`: DEFERRED to the coordinated end (orchestrator
  lockstep on launch.sh + watchdog). README stays at root.

## In flight
**W0 (RL1) — DONE, Adversary cold PASS @2026-05-27** (REVIEW-1b: clean checkout → `lint: PASS` +
break-it probe → `lint: FAIL`). Advisory (non-blocking): confirm a real push fires the Drone lint
build at RL3 (flaky push webhook, §4.1).

**W1 (RL2) — Builder §3 self-review complete, clean.** All blocking invariants hold (tests-real,
harness-DRY [no recipe conditionals in shared harness; quirks are data via `recipe_meta.py`],
nix-idempotent, no-footguns [all sleeps are poll-loop intervals], no-secrets, log-redaction); no
fix needed, no advisory filed. **Awaiting the Adversary's own §3 pass #2 to confirm RL2.**

**W2 (RL3/RL4) — next.** RL4 docs already landed (README lint section). After RL2 confirms: rebuild
cc-ci to the formatted closure (running == cleaned source) and request the cold D1–D10 re-verify.

## Gate — RL3 PASS; ONLY RL6 (coordinated) remains before DONE
**RL3 ✅ PASS @2026-05-27** (Adversary cold, REVIEW-1b): full D1–D10 re-verified on the cleaned+RL5
byte-identical closure (`8i3jcad9`==running==fresh-clone build), fresh evidence <24h, **nothing
weakened**; cardinal-rule PASS; 2 fresh category-spanning green runs (custom-html #151, keycloak #152) +
carry-forward of the Phase-1 Adversary-verified 6/6 set. **RL1–RL5 all Adversary-PASS, no open
`[adversary]` findings, NO VETO.**

### RL6 — Builder part DONE (machine-docs/ move executed). Adversary: move REVIEW* + re-verify.
Verified the orchestrator's enabling condition is already in place: `launch.sh` (mtime 21:28:03) has
`resolve_state()` (prefers `machine-docs/$base`, else root), used by EVERY STATUS/REVIEW read
(`phase_done` L70, handoff watcher L147); the **running watchdog (pid 133191) was restarted at
21:28:36 — after that update** → it is location-agnostic and "survives the move whenever it happens"
(its own comment). So the move is safe now (no strict-lockstep instant required; `resolve_state` is
per-file).

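The per-file lookup described above (prefer `machine-docs/<base>`, otherwise fall back to the repo root) can be sketched as follows. The real `resolve_state()` lives in `launch.sh` and is not shown in this commit, so this Python rendering is an assumption about its behaviour; in particular, returning the root path even when the file is missing there is a guess.

```python
import tempfile
from pathlib import Path


def resolve_state(repo_root, base):
    """Return machine-docs/<base> when it exists, otherwise <base> at the repo root."""
    root = Path(repo_root)
    moved = root / "machine-docs" / base
    # Decided file by file, so a half-finished move still resolves every file.
    return moved if moved.is_file() else root / base


# Demo of the intermediate state: STATUS already moved, REVIEW still at root.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "machine-docs").mkdir()
    (repo / "machine-docs" / "STATUS-1b.md").write_text("# STATUS\n")
    (repo / "REVIEW-1b.md").write_text("# REVIEW\n")
    status_path = resolve_state(repo, "STATUS-1b.md").relative_to(repo).as_posix()
    review_path = resolve_state(repo, "REVIEW-1b.md").relative_to(repo).as_posix()
    print(status_path, review_path)
```

Because the choice is made per file, the Builder's files and the Adversary's `REVIEW*` files can move at different times without breaking any reader.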
Builder executed:
- `git mv STATUS*.md BACKLOG*.md JOURNAL*.md DECISIONS.md machine-docs/` (README.md STAYS at root).
- Updated in-repo refs: `README.md` (status line + lint section + Loop-state section) and
  `docs/install.md` → `machine-docs/…`. `scripts/lint.sh` → **lint: PASS** post-move.
- (No `AGENTS.md`/`.drone.yml`/`scripts` protocol-file refs in-repo. The `cc-ci-plan/` plans are the
  orchestrator's — not edited from here.)

**Adversary:** please `git mv REVIEW*.md machine-docs/` (yours to move, single-writer rule) and
re-verify (a) in-repo refs updated + (b) the watchdog handoff still works via `resolve_state`. REVIEW*
at root + my files in `machine-docs/` is a valid intermediate. On your RL6 PASS (RL1–RL5 still PASS,
no VETO), Builder writes `## DONE`.

## Blocked
(none)
195
machine-docs/STATUS-1c.md
Normal file
@@ -0,0 +1,195 @@
# STATUS — Phase 1c (full git reproducibility + genuine D8 live rebuild)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase1c-full-reproducibility.md`
**Loop state for THIS phase:** STATUS-1c / BACKLOG-1c / REVIEW-1c / JOURNAL-1c (DECISIONS.md shared).
The repo's STATUS.md / BACKLOG.md / REVIEW.md are Phase-1 HISTORY — not this phase's state.

## DONE
**Phase 1c COMPLETE @2026-05-27.** All Definition-of-Done items **C1–C7 + E2E-TESTME** are
Adversary-PASS within 24h (REVIEW-1c: W2 16:55Z, W5/C4/C5 18:55Z, E2E + C1–C6 b301b03, C7 9e0f72a),
**no standing VETO, no open `[adversary]` findings** (ADV-1c-1 closed). Final Builder health check:
cc-ci `running`/0-failed, **byte-identical build==running==`cqym8knjg7nkly1wdgwkyr873fm8scfl` (ZERO
DRIFT)**, 6 stacks, cert sops-from-git `c1d96d61…`, public TLS `ci.commoninternet.net` 200/ssl_verify=0.

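The "build==running" check above boils down to comparing the store-path hash of a fresh build with the one the host is running. A minimal sketch, assuming both sides are available as `/nix/store/<hash>-<name>` strings (for example from resolving the build's `result` symlink and `/run/current-system`); the helper names and the `-nixos-system-cc-ci` name suffix are hypothetical, only the hash comes from this file.

```python
import os


def store_hash(store_path):
    """Extract the hash part of a /nix/store/<hash>-<name> path."""
    name = os.path.basename(store_path.rstrip("/"))
    return name.split("-", 1)[0]


def zero_drift(built, running, expected_prefix=""):
    """True when both paths carry the same hash (optionally matching a short prefix)."""
    built_hash, running_hash = store_hash(built), store_hash(running)
    return built_hash == running_hash and running_hash.startswith(expected_prefix)


# The hash is the one recorded above; the name suffix is illustrative only.
built = "/nix/store/cqym8knjg7nkly1wdgwkyr873fm8scfl-nixos-system-cc-ci"
running = "/nix/store/cqym8knjg7nkly1wdgwkyr873fm8scfl-nixos-system-cc-ci"
drifted = "/nix/store/" + "0" * 32 + "-nixos-system-cc-ci"  # made-up hash

print(zero_drift(built, running, expected_prefix="cqym8knj"))
print(zero_drift(built, drifted))
```

The optional prefix mirrors how this file abbreviates toplevels (`cqym8knj…`), so a recorded short form can still be checked against the full path.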
The VM is now fully reproducible from git: blank NixOS host + the two repos (`cc-ci` +
`cc-ci-secrets` submodule) + the one bootstrap age key → a single `nixos-rebuild switch` → a
working cc-ci that serves a real `!testme` run end-to-end over the public domain (proven on a
throwaway VM, cold, by both loops). D8 closed honestly (static byte-identical closure + live rebuild;
"infeasible by design" withdrawn). Found+fixed two real reproducibility gaps en route: the
concurrent-`abra` reconcile race (serialized) and the non-deterministic Drone bot token
(`DRONE_USER_CREATE token:`).

- [x] C1 secrets-repo split · [x] C2 cert-in-git · [x] C3 all-secrets-in-git (1 bootstrap key) ·
  [x] C4 throwaway live rebuild · [x] C5 honest D8 · [x] C6 resize+sizing (promote rebuilt VM) ·
  [x] C7 docs · [x] E2E-TESTME (E1–E6).

Open items handed to the operator (not 1c-gating): physical promotion of `ccci-w5-rebuild` → cc-nix-test
(its bridge paused, stack up — restore at promotion); plan.md §4.0/§4.4 still carry pre-1c cert wording
(out-of-repo; superseding note added at §1.5). Adversary will append its final cold sign-off.

<details><summary>pre-DONE phase note</summary>
**1c — Builder COMPLETE; only ADV-1c-1 (C7 re-verify) between here and DONE.** All addressed.</details>

## In flight — W4 DONE, Gate W4 CLAIMED
- W1 DONE (cc-nix-test 6→4 GB). W2 PASS (Adversary cold). W3 DONE (VM reachable).
- W4 DONE — genuine throwaway-VM live rebuild proven on a FRESH blank VM: only
  `/var/lib/sops-nix/key.txt`=recovery key provisioned; `git clone --recursive` + **ONE**
  `nixos-rebuild switch` (flake URL with `?submodules=1`) → **running, 0 failed**, byte-identical
  **`ld19aj2`==cc-ci**, all 6 stacks 1/1, all secrets+cert decrypted via recovery key, **TLS leaf ==
  git cert** (`57:8D:…:B8:A6`), no manual step.
  (Final config = ld19aj2: `sops.age.keyFile` + serialized abra reconcilers fixing a fresh-host race.)
- Throwaway destroyed (frees RAM for Adversary W5; C6 no-leftover). install.md updated to this procedure.
- Remaining: W5 (Adversary cold rebuild + honest D8 rewrite), W6 (docs C7 + final cc-nix-test sizing).

<details><summary>W2 detail (PASS)</summary>
## In flight — W2 (secrets repo + cert into git) — COMPLETE, gate claimed
- [x] **W2 step 1:** private `recipe-maintainers/cc-ci-secrets` created + populated (6 infra secrets +
  wildcard cert/key, sops, both recipients; sha256 byte-perfect) + pushed.
- [x] **W2 step 2:** base repo — `secrets/` is now the cc-ci-secrets submodule (gitlink 2312f1c);
  secrets.nix adds `wildcard_cert`/`wildcard_key` → `/var/lib/ci-certs/live/*`; proxy.nix reframed.
  Pushed f79e542. Switched live cc-ci (toplevel `vh6vwxbl…`). **Verified:** cert sops-decrypts from
  git (symlinks, sha256 match), system running 0 failed, byte-identical (build==running), git-clone
  `?submodules=1` path also reproduces `vh6vwxbl…`, live TLS valid (LE wildcard, ssl_verify=0).
- (Recovery-key `sops.age.keyFile` for the throwaway deferred to W3/W4 — re-verify byte-identical there.)
</details>

## 🟢 CONFIG FINAL @2026-05-27 ~20:05Z — toplevel `cqym8knjg7nkly1wdgwkyr873fm8scfl`
cc-ci switched to the FINAL config (secrets-split + cert-in-git + `sops.age.keyFile` + serialized abra
reconcilers + Drone-token fix). **Byte-identical: build==running==`cqym8knj…` (ZERO DRIFT)**, system
running 0 failed, bridge→Drone token OK. **No more config changes planned.**
**For the Adversary's final DONE verification:** (a) re-confirm **C1 byte-identical at `cqym8knj`**
(supersedes the ld19aj2 18:00Z / 18:55Z clocks — the only delta is the Drone-token fix af46aca);
(b) independently verify **E1–E6** (E2E-TESTME — real `!testme`; note: requires the swap, OR verify
against the run #4 evidence + a fresh trigger; the rebuilt VM `ccci-w5-rebuild` is up with bridge
paused). C4/C5 hold (the rebuilt VM is also at `cqym8knj`; a fresh rebuild from the current repo
reproduces it). No VETO expected.

## Gate
**Gate: W4 — PASS @2026-05-27 18:55Z (Adversary, cold independent rebuild).** C4 + C5 verified on the
Adversary's own fresh blank VM `ccci-w5-rebuild`: single switch → `ld19aj2` byte-identical, 0 failed,
6/6 stacks, all secrets+cert from git via recovery key, TLS leaf == git cert. **C1–C5 all
Adversary-PASS, no VETO.** D8 honest (infeasible superseded). Narrow signed-off limitation: Drone↔Gitea
OAuth grant (install.md §2 manual post-step) — validated functionally by E2E-TESTME next.
**Now (Builder): swap (`ccci-w5-rebuild @ 100.97.167.73` → cc-nix-test) + run E2E-TESTME (E1–E6).**

<details><summary>prior W4 CLAIMED</summary>
**Gate: W4 — CLAIMED, awaiting Adversary @2026-05-27 ~18:45Z.** Genuine throwaway-VM live rebuild
(C4/C5/D8). For the Adversary's cold W5 (own fresh Incus VM in terraform-ci, ~4 GB; RAM is free — my
throwaway destroyed): provision ONLY `/var/lib/sops-nix/key.txt` = recovery age key (`age1cmk26…`
private half, from `/srv/cc-ci/.sops/master-age.txt`); `git clone --recursive` base+secrets (bot
creds); `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` (per docs/install.md).
Expect: running/0-failed, toplevel `ld19aj2…`==cc-ci, 6 stacks 1/1, cert sha256 `c1d96d61…`, local
`curl --resolve …:127.0.0.1` ssl_verify=0 with served leaf == git cert `57:8D:…:B8:A6`. Then rewrite
the D8 evidence (static byte-identical + live rebuild; drop "infeasible by design"). My evidence:
JOURNAL-1c 2026-05-27 W4 entry. (Note: throwaway base VM = Incus image; live TS_AUTH_KEY in cloud-init.)
</details>

**Gate: W2 — PASS @2026-05-27 16:55Z (Adversary, cold).** C1/C2/C3 verified (byte-identical, cert
from git + TLS leaf-match, no plaintext leak). Config has since evolved vh6vwxbl→izsmiajw→**ld19aj2**
(keyFile + serialized reconcilers); Adversary refreshed C1 against izsmiajw @18:00Z; ld19aj2 is final.

<details><summary>prior</summary>
**Gate: W2 — CLAIMED, awaiting Adversary @2026-05-27 ~16:45Z.**
Acceptance to verify (cold): (1) byte-identical `nixos-rebuild build .#cc-ci` == `/run/current-system`
(`vh6vwxbl4qr9whzpwgjimhf9gn4329p8`) — **must init the submodule** (`git clone --recursive` / `git
submodule update --init`, bot creds) then build `--flake 'git+file://<clone>?submodules=1#cc-ci'`, else
`secrets/` is empty; (2) cert sops-decrypted from git to `/var/lib/ci-certs/live/` (symlinks → /run/secrets,
sha256 `c1d96d61…`/`9ec25d00…`) + live TLS served (`https://ci.commoninternet.net`); (3) no plaintext
secret in base repo or Nix store (all 8 secrets ENC in cc-ci-secrets; cert decrypts to tmpfs, not store).
See JOURNAL-1c 2026-05-27 W2a entry for full evidence.
</details>

## Definition of Done (C1–C7 — see phase plan §3)
- [x] C1 — Secrets-repo split (Adversary-PASS 16:55Z; re-exercised cold on blank host at C4)
- [x] C2 — Cert in git (Adversary-PASS 16:55Z; re-exercised at C4)
- [x] C3 — All secrets in git, one exception = bootstrap age key (Adversary-PASS 16:55Z; keyFile-on-throwaway at W4)
- [x] C4 — Genuine throwaway-VM live rebuild (Adversary-PASS W5 18:55Z, cold; rebuilt VM at cqym8knj)
- [x] C5 — Honest D8 (Adversary-PASS W5; static+live, "infeasible" superseded; narrow OAuth limitation signed off)
- [x] C6 — cc-nix-test 6→4 GB; first throwaway destroyed; final sizing = PROMOTE rebuilt VM (operator override, kept)
- [~] C7 — install.md/secrets.md/architecture.md + plan.md done; Adversary re-verify of architecture.md pending (ADV-1c-1, addressed 6276bfd)

## ✅ E2E-TESTME — PASS @2026-05-27 (functional acceptance of D8/clean-room)
Real `!testme` on the rebuilt-from-git VM (swapped in as cc-nix-test) over the PUBLIC domain:
**E1** public 200/ssl_verify=0; **E2** bridge→new Drone build #4 (>baseline #3, not manual); **E3**
app `cust-bdddd9.ci.commoninternet.net` EXTERNAL via gateway → HTTP/2 200, ssl_verify=0, real nginx
body, `CN=*.ci.commoninternet.net` cert; **E4** build #4 success, log shows real install/upgrade/backup
(Playwright incl.) all passed, no softening; **E5** clean undeploy (0 residual); **E6** bridge PR
comment "✅ passed →…/cc-ci/4" + dashboard custom-html/success/#4. Evidence: JOURNAL-1c. Caught+fixed
the Drone-bot-token reproducibility gap (af46aca) en route. **Adversary independently verifies E1–E6.**
Remaining: swap-back; re-deploy af46aca to cc-ci (byte-identical at new toplevel `cqym8knj…`).

## SWAP REVERTED (2026-05-27 ~20:00Z) — public back on the ORIGINAL cc-ci
E2E-TESTME passed; swapped back: `cc-nix-test` (MagicDNS) → `100.90.116.4` (original), public
`ci.commoninternet.net` → 200 ssl_verify=0 (original); original bridge restored 1/1, healthy. The
rebuilt VM `ccci-w5-rebuild` @ `100.97.167.73` is **kept running** (C6 override, operator promotes it)
with its **bridge paused** (`ccci-bridge_app` 0) to avoid dual-trigger on real PRs (operator restores
at promotion). Remaining: re-deploy af46aca (Drone-token fix, toplevel `cqym8knj…`) to the original cc-ci
→ re-verify byte-identical; Adversary re-checks C1 + verifies E1–E6.
<details><summary>swap-active history</summary>
Public gateway pointed at the rebuilt VM (`100.97.167.73`) during the e2e; original was cc-nix-test-orig.</details>
**E2E progress (2026-05-27 ~19:45Z):** E1 PASS (public 200/ssl_verify=0). Original's bridge PAUSED
(`ccci-bridge_app` 1/0 on cc-nix-test-orig). Rebuilt VM Drone OAuth done (admin=true, cc-ci active) —
needed a script fix (auto-approve, committed ee585ef). **Clean-room finding (committed af46aca):**
`DRONE_USER_CREATE` lacked `token:` → rebuilt Drone's bot token ≠ sops `bridge_drone_token` → bridge
401. Fix injects the sops token. **NOT yet applied to the rebuilt VM** (a no-op rebuild ran with old
config first). **NEXT:** (1) git pull af46aca on rebuilt VM + `nixos-rebuild switch` (applies token);
(2) verify bot token == sops (else `docker volume rm` Drone DB + redeploy so DRONE_USER_CREATE recreates
the bot w/ token; then re-run OAuth bootstrap); (3) run `!testme` on custom-html#2 (head db9a9502) →
verify E2–E6; (4) swap-back; (5) re-deploy af46aca to cc-ci + re-verify byte-identical (Adversary re-checks C1).
**`ssh cc-ci` (pinned 100.90.116.4) = the ORIGINAL** (cc-nix-test-orig); reach the rebuilt VM via
`100.97.167.73` or `cc-nix-test` MagicDNS.
**SWAP-BACK when e2e done:** rebuilt VM → `tailscale set --hostname=ccci-w5-rebuild`; then
`ssh cc-ci 'tailscale set --hostname=cc-nix-test'`; restore original's bridge (`docker service scale
ccci-bridge_app=1` on the original — paused during e2e to avoid dual-trigger). Keep both VMs running.

## ⚠️ Operator override — do NOT destroy the FINAL throwaway VM (read before W5/W6 cleanup)
The operator (2026-05-27) will **repurpose the final W5/C4-C5 clean-room throwaway VM as the new
cc-nix-test** for a live real-traffic test. So: **KEEP that VM running after W5 PASSes — do NOT tear
it down in C6/W6.** Defer its teardown until the operator explicitly says otherwise. This overrides the
plan's "destroy the throwaway" for that one VM. (Adversary: please do not destroy your W5 VM on PASS.)
This also settles C6 final sizing = **promote the rebuilt VM**. All other cleanup is normal (Builder's
first throwaway already destroyed). See DECISIONS.md Phase-1c.

### Pending functional-acceptance e2e — E2E-TESTME (BUILDER owns swap+test; gated on C4/C5 PASS)
**Authority: `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md`** (supersedes inline wording).
MY test to execute end-to-end (incl. the tailnet swap — **no orchestrator signal**); Adversary
independently verifies but does **NOT** rename nodes (actor/critic split — only ONE loop renames).
**Target VM = the ADVERSARY's kept-running W5 VM** (Incus instance `ccci-w5-rebuild`; mine was
destroyed). **WAIT for: (i) Adversary W5 PASS in REVIEW-1c, AND (ii) the Adversary records that VM's
Incus instance + CURRENT tailscale IP** in REVIEW-1c/STATUS — only then swap. Sequence:
1. **Swap (Builder, 2 reversible `tailscale set --hostname`, ORDER MATTERS):**
   (a) `ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'` — original aside, **keep running** (swap-back);
   `ssh cc-ci` (pinned IP 100.90.116.4) keeps hitting the ORIGINAL.
   (b) Adversary's W5 VM (`ccci-w5-rebuild`) → `cc-nix-test`, using the IP the Adversary recorded
   (re-confirm online via `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status`), then
   `ssh -i …/vm_ssh_key -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<ip> 'tailscale set --hostname=cc-nix-test'`.
   After swap, `cc-nix-test.taila4a0bf.ts.net` → that VM tailnet-wide (gateway auto-follows ~10s);
   target !testme/deploys by MagicDNS name, NOT raw IP (raw IP = original).
2. **Verify P1+P2:** `tailscale … status | grep cc-nix-test` → throwaway IP; `curl https://ci.commoninternet.net/` → `200 ssl_verify=0`.
3. **Run E2E-TESTME** (spec §2; E1–E6 below).
4. **Swap-back when done** (reversible): rebuilt VM → its old name, then
   `ssh cc-ci 'tailscale set --hostname=cc-nix-test'` (restores original; gateway re-follows).

Watch-out (handle at execution): the ORIGINAL (cc-nix-test-orig) stays up with its bridge polling
Gitea — to avoid duplicate builds/PR-comments, pause its bridge during the e2e (`docker service
scale ccci-bridge_app=0` on the original, restore after); and the rebuilt VM's Drone needs the
one-time OAuth bootstrap (install.md §2) before it can clone/build.
Then: `!testme` as the bot on one fast enrolled recipe (e.g. `custom-html`) and verify the real path.
Pass criteria (all): **E1** self-check 200/valid cert on rebuilt VM; **E2** new Drone build via the
bridge (run# > baseline, not a manual trigger); **E3** app answers an **EXTERNAL** request at
`<app>.ci.commoninternet.net` through the gateway (real 200 + valid cert + app content, NOT localhost,
NOT a Traefik 404); **E4** real test assertions pass, build success (no softening); **E5** clean
undeploy (no residual stack); **E6** result reported back + dashboard updated. Evidence → JOURNAL-1c,
verdict → STATUS-1c/REVIEW-1c as **E2E-TESTME PASS**. On failure: it's a clean-room finding — fix in
**git source** (base / cc-ci-secrets), NOT the live VM, then re-run.

## Blocked
(none)

## Notes
- Current secret layout: `secrets/secrets.yaml` (6 infra secrets), recipients = host age key
  (ssh-to-age of cc-ci's ed25519 host key) + off-box master recovery key
  (`/srv/cc-ci/.sops/master-age.txt`, sandbox-only). `.sops.yaml` at repo root.
- Wildcard cert currently out-of-band at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
  (operator-provided, LE, next renewal ~2026-08-24); proxy.nix reads it from there. 1c moves it
  into sops-in-git, decrypted back to that path at activation.
- Sandbox host has NO sops/nix/age — sops ops run on cc-ci (has nix + host age key) or via the master
  key with a sops binary fetched on cc-ci.
- cc-nix-test == the live cc-ci server (100.90.116.4); resizing it (W1) briefly stops it.
126
machine-docs/STATUS.md
Normal file
@@ -0,0 +1,126 @@
# STATUS — cc-ci Builder

## DONE — 2026-05-27

The cc-ci Co-op Cloud recipe CI server is **complete**. Every Definition-of-Done item (§2, D1–D10)
is independently **Adversary-verified with a PASS dated <24h**, no standing `## VETO`, and the
Adversary explicitly cleared the §6.1 DONE handshake ("Builder may flip STATUS → DONE", REVIEW.md).

| D | Item | Verdict | Evidence (Adversary REVIEW.md) |
|---|---|---|---|
| D1 | `!testme` trigger | PASS | M3 @03:13Z + D10 real-`!testme` runs |
| D2 | install/upgrade/backup matrix (real e2e) | PASS | M4/M5/M6 + D10 6/6 (3 stages each) |
| D3 | Python + Playwright | PASS | live in every recipe install/D10 run |
| D4 | recipe-local tests | PASS | M6 @04:43Z |
| D5 | per-recipe tree, no harness surgery | PASS | M6.5 @07:25Z |
| D6 | secrets (no leaks, rotatable) | PASS | M7 @07:55Z (grep clean: logs+dashboard+git) |
| D7 | results UX (dashboard + PR outcome) | PASS | M8 @08:10Z |
| D8 | reproducible server | PASS | byte-identical `nixos-rebuild build`==running + documented-alt @10:52Z |
| D9 | documentation | PASS | @10:55Z (full docs set) |
| D10 | six recipes via real `!testme` | PASS (6/6) @11:57Z | custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90, lasuite-docs #108 |

D10 set spans all required categories: simple (custom-html), SSO/identity+DB (keycloak),
DB+media/large-volume (matrix-synapse), workflow (n8n), stateful/no-DB (cryptpad), multi-service +
S3/object-storage (lasuite-docs). bluesky-pds (TLS-passthrough) was swapped → n8n with a documented
reason (DECISIONS). Registry creds (A1) remain a documented good-to-have for rate-limit robustness,
not a DONE blocker. **Loop stopped.**

---

**Phase:** ALL MILESTONES BUILDER-COMPLETE. Adversary-verified: M0–M6 PASS, M6.5 PASS, M7/D6 PASS,
**M8/D7 PASS, D8-core PASS, D9 PASS**. **Only D10 left to verify** — M10/D10 CLAIMED: all 6 recipes
green via real `!testme` (custom-html #84, keycloak #86, matrix-synapse #87, n8n #89, cryptpad #90,
lasuite-docs #108; all 5 categories). **D10 PASS (6/6) @11:57Z** logged by Adversary. Docker Hub
rate-limit blocker RESOLVED.
**DONE blocked on ONE item: D8 live blank-VM rebuild.** Adversary's D8 verdict (@10:52Z) = "core PASS
(Nix byte-identical closure + docs); live blank-VM rebuild pending — to complete before DONE." It was
DEFERRED on the premise that the rebuild needs operator registry creds (rate limit). **That premise
is now obsolete:** D10 passed 6/6 WITHOUT creds — the rate limit was transient and the real fix was
`abra app upgrade -c`. So the throwaway-VM live rebuild is feasible NOW in a fresh quota window
(no creds dependency). Surfacing for the Adversary to complete D8 → then all D1–D10 <24h PASS → DONE.
I will NOT write `## DONE` until REVIEW shows a full D8 PASS. No Builder implementation remains.
## Gate: M6.5 — CLAIMED, awaiting Adversary (2026-05-27)
All 6 D10 recipes have a full install/upgrade/backup green run, each verified on host AND via the
canonical Drone recipe-ci pipeline (build #s above), each with clean teardown (0 orphans). Categories:
custom-html=simple, keycloak=SSO/identity+DB, cryptpad=stateful/no-DB,
matrix-synapse=DB+media/large-volume, lasuite-docs=multi-service+S3/MinIO/object-storage,
n8n=workflow automation. D5 held: each recipe enrolled via `tests/<recipe>/` + `recipe_meta.py`
(EXTRA_ENV for cryptpad SANDBOX_DOMAIN / lasuite TIMEOUT) only — no shared `runner/harness` changes
per recipe. Repro: trigger a custom Drone build with RECIPE=<r> (or `cc-ci-run
runner/run_recipe_ci.py` with RECIPE/STAGES on host).

## Gates
- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo
  (`switch --flake /root/cc-ci#cc-ci`, gen healthy, no failed units); sops-nix decrypts
  `/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to
  host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret.
  Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work.
  → **M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm +
  `proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html
  deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the
  wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro:
  `scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will
  not flip M2's gate until M1 shows PASS. → **M1 PASS** @2026-05-26T22:20Z.
- **Gate: M2 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Drone server (coop-cloud recipe,
  reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo
  activated (push webhook). Pushing `.drone.yml` triggered build #1 → **success** (clone + hello exec
  steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time
  `scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator
  (plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered;
  commenter auth via org membership (`GET /orgs/{owner}/members/{user}` 204, read-level) + optional
  allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme`
  on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head
  `d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level:
  bot/trav/notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling
  doesn't need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW
  shows PASS; doing the bridge→Drone integration as independent work meanwhile.

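The M3 commenter check above (org-membership endpoint answering 204 for members and 404 for non-members, plus an optional allowlist) can be sketched as a pure decision function. This is a sketch under assumptions: the function names are hypothetical, `example-org` is a placeholder owner, and treating the allowlist as a further restriction on top of membership is a guess about how the bridge combines the two.

```python
def membership_path(owner, user):
    """Read-level endpoint named in the M3 gate note."""
    return f"/orgs/{owner}/members/{user}"


def is_authorized(status_code, user, allowlist=None):
    """Decide whether a `!testme` commenter may trigger a build.

    204 = org member, anything else (e.g. 404) = not a member.
    Assumption: when an allowlist is configured, the user must also be on it.
    """
    if status_code != 204:
        return False
    return allowlist is None or user in allowlist


# Status codes as reported in the gate evidence; the owner name is a placeholder.
print(membership_path("example-org", "trav"))
print(is_authorized(204, "trav"))
print(is_authorized(404, "outsider"))
print(is_authorized(204, "notplants", allowlist={"trav"}))
```

Keeping the HTTP call outside the decision makes the rule easy to test without a live Gitea instance.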
## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)
- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone
  auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner.
- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build
  is cancelled → frees its slot. Verified Drone repo `timeout: 60`.
- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1
  the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS.

## Blocked
- (none) — all blockers resolved. The lasuite-docs upgrade gap (Docker Hub rate limit, then abra's
  false "deploy failed" on a converging rolling upgrade) is RESOLVED: quota reset + `abra app upgrade
  -c` fix → lasuite #108 all 3 stages green via `!testme`. Registry pull creds (A1) remain a
  RECOMMENDED durable hardening for heavy-recipe reproducibility under load (DECISIONS), not a
  current blocker.

## Tracking (adversary findings I must address)
- **[adversary] A4 — concurrent same-recipe runs collide on shared `~/.abra/recipes/<recipe>`.**
  Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now **eliminated**:
  MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1 (resource-safety change). With ≤1 build at a time there is
  **no concurrent run** on this single node, so the shared-recipe-dir race cannot occur. Builder side
  addressed via the concurrency cap (per plan §4.2 "concurrency cap 1–2"); Adversary to re-test/close.
  (Per-run `ABRA_DIR`/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
- **[adversary] A2 — janitor `-pr` filter dead.** Already fixed in code: `lifecycle.RUN_APP_RE` =
  `^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$` (the hashed scheme), plus a stack-name regex
  for `.env`-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
- **[adversary] A3 — teardown unverified; `.env` removed before confirmed undeploy.** Already fixed:
  `lifecycle.teardown_app` undeploys → `docker stack rm` fallback if services remain → removes
  volumes/secrets while `.env` exists → drops `.env` LAST → then `_residual()` check raises
  `TeardownError` if anything is left. Awaiting Adversary kill-mid-run re-test.
- **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST
  force `LETS_ENCRYPT_ENV=""` on every test-app deploy (already done in `scripts/deploy-proxy.sh` and
  the M1 manual custom-html deploy; `scripts/deploy-drone.sh` will too). Considering a structural
  belt-and-suspenders (drop the unused `certificatesResolvers` from cc-ci's traefik) — deferred,
  needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary
  re-tests + closes after M4. → **Now enforced**: `harness.lifecycle.deploy_app` sets
  `LETS_ENCRYPT_ENV=""` on every test-app deploy (verified in the M4 custom-html run). Adversary can
  re-test + close A1.

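The A2 fix above hinges on the hashed run-app pattern plus an age gate. The snippet below exercises the regex exactly as quoted; the `is_reapable` helper and its "age at least max age" rule are assumptions for illustration, not the harness's actual janitor code.

```python
import re

# Pattern as quoted in the A2 note (the hashed run-app naming scheme).
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def is_reapable(domain, age_seconds, max_age_seconds):
    """Assumed rule: reap only run-app domains that are at least max_age old."""
    return bool(RUN_APP_RE.match(domain)) and age_seconds >= max_age_seconds


candidates = [
    "cust-bdddd9.ci.commoninternet.net",  # run-app name recorded in STATUS-1c (E3)
    "cchtml1.ci.commoninternet.net",      # hand-deployed M1 app, no hash suffix
    "ci.commoninternet.net",              # the CI host itself
]
for domain in candidates:
    # max_age_seconds=0 mirrors CCCI_JANITOR_MAX_AGE=0 at capacity=1.
    print(domain, is_reapable(domain, age_seconds=120, max_age_seconds=0))
```

Anchoring the pattern at both ends is what keeps infrastructure domains and hand-deployed apps out of the janitor's reach, even when the age gate is set to zero.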
## Notes
- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
  1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
  nixpkgs fetch exhausted). Both byte + inode pressure gone.
- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
  rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
  a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
  (scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
  up; clean up later (pick networkd OR scripting). Tracked, non-blocking.