Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# JOURNAL — Phase 1c (Builder)

Append-only. Each entry: what I did + verifying command/output + next.

## 2026-05-27 — Phase 1c kickoff / orientation

**Context:** Phase 1 closed DONE (1c10fa5; all D1–D10 Adversary-PASS, no VETO; Adversary loop
terminated for Phase 1). Starting Phase 1c per `plan-phase1c-full-reproducibility.md`.

**Did:**
- Read the phase plan in full + plan.md §6.1/§7/§9. Verified access & surveyed current state.
- `ssh cc-ci 'hostname && nixos-version && systemctl is-system-running'`
  → `nixos` / `24.11.20250630.50ab793 (Vicuna)` / `running`. Healthy.
- Surveyed secrets layout:
  - `secrets/secrets.yaml` holds 6 infra secrets: test_secret, drone_rpc_secret,
    drone_gitea_client_secret, bridge_drone_token, bridge_gitea_token, bridge_webhook_hmac.
  - `.sops.yaml`: recipients = `&host` (ssh-to-age of cc-ci ed25519 host key) + `&master`
    (off-box recovery, private at `/srv/cc-ci/.sops/master-age.txt`, sandbox-only).
  - `modules/secrets.nix`: `defaultSopsFile = ../secrets/secrets.yaml`,
    `age.sshKeyPaths = ["/etc/ssh/ssh_host_ed25519_key"]`.
  - Wildcard cert OUT-OF-BAND at `cc-ci:/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
    (fullchain 2909 B, privkey 227 B → privkey looks like an EC key). `modules/proxy.nix` reconcile
    reads it from there and inserts as traefik `ssl_cert`/`ssl_key` swarm secrets; treats a missing
    cert as FATAL (operator precondition).
- `flake.nix`: nixpkgs pinned `50ab793`, sops-nix pinned; single local `nixosConfigurations.cc-ci`.
- Tooling note: sandbox host has NO sops/nix/age/ssh-to-age (`which` → only git). sops/age work
  must run on cc-ci (has nix + host age key) or via a sops binary fetched there with the master key.
- Bootstrapped Phase-1c state: STATUS-1c.md, BACKLOG-1c.md, JOURNAL-1c.md (this file). REVIEW-1c.md
  left for the Adversary (its file per §6.1). Appended Phase-1c decisions to DECISIONS.md.

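
A minimal sketch of the tooling check noted above, assuming a POSIX shell. `missing_tools` is a hypothetical helper name, not a script in the repo; it only reports which of the named tools are absent from `PATH`.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is NOT on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call for the tools the sops/age work needs:
#   missing_tools sops nix age ssh-to-age git
```

Empty output means every listed tool was found, so the same call can be run on the sandbox and on cc-ci to decide where a step has to execute.
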
**Decisions recorded (DECISIONS.md):** secrets linkage = **git submodule** (deviates from the
flake-input default; rationale: no private-repo fetch credential needed at nix-eval on every
rebuild, keeps `defaultSopsFile` a local path = minimal change + trivially byte-identical);
bootstrap key for throwaway = **recovery age key via `sops.age.keyFile`**.

**Next (W2):** create private `recipe-maintainers/cc-ci-secrets`; move secrets + wildcard cert into
sops there as a submodule of the base; wire secrets.nix (cert→`/var/lib/ci-certs/live` via `path=`);
prove byte-identical build + clean switch with TLS from the git cert. Then claim Gate W2.

## 2026-05-27 — W2 step 1: cc-ci-secrets repo created + populated (DONE)

**Did:**
- Created private `recipe-maintainers/cc-ci-secrets` via Gitea API (bot, org admin). HTTP 201, private=True.
- Confirmed cc-ci host SSH key → age identity == `&host` recipient `age1h90utd…`:
  `ssh cc-ci 'nix shell nixpkgs#ssh-to-age --command ssh-to-age -i /etc/ssh/ssh_host_ed25519_key.pub'`
  → exact match. So I can decrypt/re-encrypt on cc-ci with the host key (master stays sandbox-only).
- Built `secrets.yaml` on cc-ci (script with file redirections, no key material in argv):
  `sops -d` existing 6 secrets → append `wildcard_cert`/`wildcard_key` as YAML block scalars from
  `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` → `sops -e`. Verified round-trip:
  - recipients: 2 (host+master)
  - keys: test_secret, drone_rpc_secret, drone_gitea_client_secret, bridge_drone_token,
    bridge_gitea_token, bridge_webhook_hmac, wildcard_cert, wildcard_key
  - cert sha256 file==decrypt `c1d96d61…`; key sha256 file==decrypt `9ec25d00…`; test_secret decrypts OK
- Retrieved ciphertext (7219 B) to sandbox; created cc-ci-secrets repo (root `secrets.yaml`, own
  `.sops.yaml` w/ `path_regex: secrets\.yaml$`, README). Pushed to main (auth via per-command
  http.extraHeader; verified `.git/config` has NO creds). Remote lists .sops.yaml/README.md/secrets.yaml.
- Cleaned `/root/cc-ci-secrets.yaml` + build script off cc-ci.

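
The "sha256 file==decrypt" round-trip check above can be sketched as a digest comparison that never writes the decrypted value to disk. This is an illustration, not the script that was run: `sha256_stdin` and `file_matches_stdin` are hypothetical names, and it assumes GNU coreutils `sha256sum` is available (as on a NixOS host).

```shell
#!/bin/sh
# Print the SHA-256 hex digest of whatever arrives on stdin.
sha256_stdin() {
  sha256sum | cut -d' ' -f1
}

# Hypothetical helper: succeed only if FILE has the same digest as stdin.
# Returns 2 if FILE is unreadable, so a missing file never counts as a match.
file_matches_stdin() {
  [ -r "$1" ] || return 2
  want=$(sha256_stdin < "$1")
  got=$(sha256_stdin)
  [ "$want" = "$got" ]
}

# Possible use on cc-ci (the exact sops invocation is an assumption):
#   sops -d --extract '["wildcard_cert"]' secrets.yaml \
#     | file_matches_stdin /var/lib/ci-certs/live/fullchain.pem && echo MATCH
```

Piping the decrypted value keeps key material out of argv and out of temp files, in line with how the build script was written.
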
**Layout decision:** cc-ci-secrets has `secrets.yaml` at ROOT → submodule mounts at base `secrets/`
→ base sees `secrets/secrets.yaml`, so `defaultSopsFile = ../secrets/secrets.yaml` is UNCHANGED.

**Next (W2 step 2):** in base repo: replace tracked `secrets/` with the submodule; add
`wildcard_cert`/`wildcard_key` sops secrets in secrets.nix (path= → /var/lib/ci-certs/live, + recovery
keyFile); adjust proxy.nix framing; switch cc-ci to new config via
`nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`; prove byte-identical +
TLS-from-git-cert; then claim Gate W2. (Riskier: touches live server config; fresh iteration.)

## 2026-05-27 — W2a DONE + verified live; Gate W2 CLAIMED

**Discovery:** cc-ci's build source `/root/cc-ci` is NOT a git repo; it's a plain dir synced from
the sandbox via `tar | ssh` and built as a `path:` flake (DECISIONS.md:126). So cc-ci's deploy needs
NO submodule fetch / `?submodules=1` (the synced dir already contains `secrets/`). The git-clone
`--recursive` + `?submodules=1` path is only for the documented install / throwaway (W4).

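
A minimal sketch of a `tar`-pipe sync that leaves `.git` out, under the assumption that GNU tar or bsdtar is in use. `sync_tree` is a hypothetical helper; here the receiving side is a local `tar` so the example is self-contained, whereas the real deploy pipes into `ssh`.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC into DST as a plain tree, leaving out .git entries.
# For a remote target the right-hand tar would be wrapped in ssh, for example:
#   tar -C "$src" --exclude=.git -cf - . | ssh cc-ci 'tar -C /root/cc-ci -xf -'
sync_tree() {
  src=$1
  dst=$2
  mkdir -p "$dst" || return 1
  tar -C "$src" --exclude=.git -cf - . | tar -C "$dst" -xf -
}
```

Unlike `rsync --delete`, this never removes files that already exist on the receiving side, so a file deleted in the source tree stays on the target until it is cleaned up by hand.
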
**Did (W2a: secrets split + cert into git, deployed to live cc-ci):**
- secrets.nix: added `wildcard_cert`(0444)/`wildcard_key`(0400) sops secrets → `path=/var/lib/ci-certs/live/*`.
- proxy.nix: reframed cert as sops-from-git (not operator drop); kept FATAL guard as a decrypt-path check.
- Base repo: `git rm secrets/secrets.yaml`; `git submodule add cc-ci-secrets secrets` (gitlink 2312f1c,
  `.gitmodules` has NO creds). Pushed f79e542 (rebased over Adversary's c360520; resolved the
  tracked-file→submodule transition by removing the submodule wd before rebase, repopulating after).
- Synced to cc-ci via `tar | ssh` (excluded .git). `nixos-rebuild build` → exit 0, only **6 derivations
  built** (sops manifest gains cert/key + proxy unit error-msg edit) → toplevel
  `vh6vwxbl4qr9whzpwgjimhf9gn4329p8` (differs from pre-W2 `m1pdvbhl…`. EXPECTED: cert moved
  out-of-band-file → Nix-managed sops; that is C2's whole point, not drift).
- Backed up operator cert (`/root/ci-certs-operator-bak`), removed the regular files,
  `nixos-rebuild switch` (detached unit `ccci-w2-switch`, Result=success).

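
The "NO creds" checks on `.gitmodules` and `.git/config` can be approximated with a grep. This is a heuristic sketch only: `has_embedded_creds` is a hypothetical helper, and the two patterns it looks for (a `user:password@` URL and a persisted `extraHeader` setting) are my assumption about what would count as a leak here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains something that looks like an
# embedded credential: a URL of the form scheme://user:secret@host, or an
# http extraHeader setting (which may carry an Authorization header).
has_embedded_creds() {
  grep -Eiq '://[^/@[:space:]]+:[^/@[:space:]]+@|extraheader' "$1"
}

# Example:
#   has_embedded_creds .gitmodules && echo "LEAK in .gitmodules"
#   has_embedded_creds .git/config && echo "LEAK in .git/config"
```

A plain `https://host:port/path` URL does not match, because the pattern requires an `@` after the colon-separated pair.
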
**Verified live:**
- sops cert decrypt: `/var/lib/ci-certs/live/{fullchain,privkey}.pem` are now symlinks →
  `/run/secrets/wildcard_{cert,key}`; content sha256 == source: `c1d96d61…` / `9ec25d00…`
  (byte-identical to the original operator cert, now git-sourced).
- `systemctl is-system-running` → running, 0 failed. `deploy-proxy` active/success.
- **Byte-identical (zero drift):** `nixos-rebuild build` == `/run/current-system` == `vh6vwxbl…`.
- **Documented git-clone path also reproduces it:** fresh `git clone --recursive` into a temp git repo
  + `nixos-rebuild build --flake 'git+file:///tmp/ccci-git?submodules=1#cc-ci'` → **vh6vwxbl… (MATCH)**.
  Proves the install/throwaway path works and equals the running system.
- **Live TLS from git cert:** `https://ci.commoninternet.net` http=200 ssl_verify=0; random
  `probe-*.ci.commoninternet.net` handshake ssl_verify=0 (404 route, expected) via gateway→cc-ci;
  served leaf `CN=*.ci.commoninternet.net`, LE issuer, valid to Aug 24 2026.

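
A small sketch of the symlink check from the first item above, assuming `readlink` is available. `symlink_under` is a hypothetical helper; it looks only at the link's immediate target rather than the fully resolved path.

```shell
#!/bin/sh
# Hypothetical helper: succeed if LINK is a symlink whose immediate target
# lies under PREFIX (for example /run/secrets).
symlink_under() {
  [ -L "$1" ] || return 1
  case "$(readlink "$1")" in
    "$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   symlink_under /var/lib/ci-certs/live/fullchain.pem /run/secrets && echo "sops-managed"
```

A regular file at the cert path (the old operator drop) fails the `-L` test, which makes this a quick way to tell the two provisioning modes apart.
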
**For the Adversary verifying Gate W2 cold:** must init the submodule (`git clone --recursive` OR
`git submodule update --init`, bot creds) then build with `?submodules=1`, else `secrets/` is empty.
Both path: and git+submodules builds yield the same toplevel `vh6vwxbl…` (content-addressed).

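
For the cold check, comparing a build against the running system comes down to resolving two symlinks. A minimal sketch, assuming the build leaves a `./result` symlink in the working directory and that `readlink -f` is available; `same_toplevel` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed if two paths resolve to the same target,
# e.g. a build's ./result symlink and /run/current-system.
same_toplevel() {
  a=$(readlink -f "$1") || return 2
  b=$(readlink -f "$2") || return 2
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example, run on cc-ci after `nixos-rebuild build`:
#   same_toplevel ./result /run/current-system && echo "zero drift"
```

Because store paths are derived from their inputs, equal resolved paths are enough to claim the two systems are the same closure; no file-by-file diff is needed.
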
**Deferred to W3/W4 prep (NOT in W2):** the recovery-key `sops.age.keyFile` for the throwaway VM:
adding it changes the closure again, so I'll add + test it on the throwaway (safe) and re-establish
byte-identical there. cc-ci stays on its proven host-key decrypt path for now.

**Next:** Gate W2 CLAIMED → await Adversary PASS on byte-identical + cert-in-git/TLS. Meanwhile prep W1
(resize) / W3 (throwaway VM): read the incus skill.

## 2026-05-27 — W3 recon (read-only; while parked at Gate W2)

Incus skill read. b1 = 100.117.251.31:8443, project terraform-ci, mTLS certs at
`/srv/incus-terraform-nix-vm-creator/terraform-secrets/{terraform.crt,terraform.key}`. **b1 reachable
via the EXISTING cc-ci proxy** (`curl --proxy socks5h://127.0.0.1:1055 --cert/--key -k …`); no
separate tailscaled needed (skill's own 1055 proxy would collide; reuse cc-ci's).

terraform-ci instances + RAM:
- cc-nix-test Running 6GB VM ← this IS the live cc-ci; W1 resizes 6→4 (stop→set→start, hotplug times out)
- lichen-staging Running 4GB container (leave alone)
- kube-base / kube-base-test Stopped 4GB VMs
- release-runner Stopped 8GB VM

Running total now = 10GB. After W1 + throwaway(4GB): 4+4+4 = 12GB ≤ 16 physical (phase-plan ~12GB
doc-only guideline; terraform-ci has no enforced limits.memory). VM create = `projects/incus-base`
Terraform template (NixOS base image, cloud-init+tailscale+nix flakes), set instance_name + limits.memory=4GB.

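
The RAM arithmetic above can be sketched as a tiny budget check. `ram_budget` is a hypothetical helper that takes a limit and a list of per-instance sizes in whole GB; only Running instances should be passed in, since Stopped ones do not count against physical memory.

```shell
#!/bin/sh
# Hypothetical helper: sum per-instance RAM (whole GB) and compare to a limit.
# usage: ram_budget LIMIT_GB SIZE_GB...
# Prints the total; exit status 0 if total <= limit, 1 otherwise.
ram_budget() {
  limit=$1
  shift
  total=0
  for gb in "$@"; do
    total=$((total + gb))
  done
  echo "$total"
  [ "$total" -le "$limit" ]
}

ram_budget 16 6 4     # now: cc-nix-test 6 + lichen-staging 4, prints 10
ram_budget 16 4 4 4   # after W1 + throwaway, prints 12
```

Both calls exit 0 against the 16 GB physical limit; passing 12 as the limit instead would test the phase-plan guideline, which the post-W1 layout meets exactly.
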
## 2026-05-27 — W1 DONE: cc-nix-test resized 6→4 GB (verified)

Gate W2 PASSED (Adversary, cold) → proceeded. No active CI run (only 5 permanent stacks). Resized via
Incus API on b1 (mTLS certs through the existing 1055 proxy): PUT state stop (op Success, Stopped) →
PATCH `limits.memory=4GB` (http 200) → PUT state start (op Success, Running).

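
The stop → set → start sequence can be sketched as three REST calls. This is a reconstruction under assumptions, not the commands as run: the endpoint, project, proxy and cert paths are the ones recorded in this journal, the helper names `incus_req` and `resize_memory` are hypothetical, and the JSON bodies are my reading of the Incus instance-state and instance-config endpoints. With `DRY_RUN=1` nothing is sent.

```shell
#!/bin/sh
# Sketch only. Defaults are taken from this journal; override via environment.
INCUS_URL=${INCUS_URL:-https://100.117.251.31:8443}
INCUS_PROJECT=${INCUS_PROJECT:-terraform-ci}
CERT_DIR=${CERT_DIR:-/srv/incus-terraform-nix-vm-creator/terraform-secrets}
PROXY=${PROXY:-socks5h://127.0.0.1:1055}

# incus_req METHOD PATH JSON_BODY
# With DRY_RUN=1 the request is printed instead of sent.
incus_req() {
  url="$INCUS_URL$2?project=$INCUS_PROJECT"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s %s %s\n' "$1" "$url" "$3"
    return 0
  fi
  curl -sS -k --proxy "$PROXY" \
    --cert "$CERT_DIR/terraform.crt" --key "$CERT_DIR/terraform.key" \
    -X "$1" -H 'Content-Type: application/json' -d "$3" "$url"
}

# resize_memory INSTANCE SIZE   (stop, set limits.memory, start)
# NOTE: state changes return an async operation. Waiting for each operation
# to finish before sending the next request is left out of this sketch.
resize_memory() {
  incus_req PUT "/1.0/instances/$1/state" '{"action":"stop"}' &&
  incus_req PATCH "/1.0/instances/$1" "{\"config\":{\"limits.memory\":\"$2\"}}" &&
  incus_req PUT "/1.0/instances/$1/state" '{"action":"start"}'
}

# Preview the requests without touching b1:
#   DRY_RUN=1 resize_memory cc-nix-test 4GB
```

A full stop/start is used rather than a live change because, as noted in the W3 recon, memory hotplug on this VM times out.
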
**Verified after reboot:**
- SSH back in ~30s; `systemctl is-system-running` → running after ~104s (swarm/reconcile converge), 0 failed units.
- `free -h` total 3.5Gi (≈4 GB, down from 6). All stacks 1/1 (traefik app+socket-proxy, drone, bridge, dashboard, backups).
- **Cert survived reboot via sops:** `/var/lib/ci-certs/live/{fullchain,privkey}.pem` still symlinks →
  `/run/secrets/*` (sops re-decrypted on cold boot). current-system still `vh6vwxbl…`.
- TLS: `https://ci.commoninternet.net/` http=200 ssl_verify=0 (dashboard served from git cert).

Running RAM now: cc-nix-test 4 + lichen-staging 4 = 8 GB; throwaway 4 → 12 GB ≤ 16 physical (guideline OK).

**Next (W3):** create blank 4 GB NixOS VM in terraform-ci, provision ONLY the bootstrap (recovery) age key.

## 2026-05-27 — W3: throwaway VM created (booting) + W4 design notes

**W3:** Created `ccci-throwaway` in terraform-ci via the **Incus REST API** (curl through the 1055
proxy, since terraform/nix are absent on the sandbox; replicated `projects/incus-base/main.tf`): image
`incus-base-vm` (fp 3a0c4160), 4 GB RAM / 2 cpu / **20 GB disk** (>10 GB default, to dodge cc-ci's old
ENOSPC), cloud-init writes /etc/nixos/{configuration,incus-base}.nix + setup.sh + /etc/ts-auth-key
(incus workspace reusable key) + /etc/ts-hostname=ccci-throwaway; runcmd setup.sh (nix-channel
nixos-24.11, `nixos-rebuild boot`, sysrq reboot → tailscale auto-joins). ssh_authorized_keys = vm_ssh_key
(I hold the private key) + mfowler + cc-ci-root key. CREATE+START ops Success, status Running; first boot ~4-6 min.

NOTE: cc-nix-test was terraform-created (`projects/cc-nix-test`); my W1 API resize drifts its tfstate
(reconcile or accept in W6 final-sizing).

**W4 design (analysis; implement next):**
- cc-ci's `hosts/cc-ci/configuration.nix` pins tailscale `--hostname=cc-nix-test` + reads /etc/ts-auth-key,
  and `secrets.nix` decrypts ONLY via `age.sshKeyPaths` (host SSH key). Consequences for the throwaway:
  1. **Decryption:** throwaway's host SSH key is NOT a sops recipient → cc-ci config as-is can't decrypt
     there. **W4 must add `sops.age.keyFile = "/var/lib/sops-nix/key.txt"`** and provision the **recovery
     age key** there (the ONE out-of-band secret). Open Q: does a *missing* keyFile abort activation on
     cc-ci (where the file won't exist)? If yes, also provision cc-ci's own host-derived age key at that
     path (no new exposure) OR keep sshKeyPaths+keyFile and confirm sops-nix tolerates the absence.
     Test path: add keyFile, deploy to cc-ci (rollback-safe via generations), observe.
  2. **Tailnet hostname:** after rebuild the throwaway re-ups as `cc-nix-test` → tailscale auto-suffixes
     the duplicate; the REAL cc-ci is accessed by IP (100.90.116.4) so it's unaffected. Verify the
     throwaway via its own IP (Incus state tailscale0 addr) and/or incus-agent `exec` (hostname-independent).
  3. **Bridge side effect:** throwaway's bridge would poll Gitea with the real token (fresh state ⇒ could
     re-trigger already-`!testme`'d PRs). Mitigate: run W4 when no `!testme` is pending; destroy promptly.
- Adding keyFile changes the closure again (W2 byte-identical was at `vh6vwxbl`); re-verify after.