1c/W4 DONE: genuine throwaway-VM live rebuild (single switch, 0 failed, byte-identical, TLS leaf==git cert); Gate W4 CLAIMED + install.md updated
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,20 +1,37 @@
# Installing cc-ci from scratch
> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8).
> The full from-scratch rebuild is **verified** (Phase-1c / D8): a blank NixOS Incus VM, given the two
> repos + the single bootstrap age key, becomes a fully-converged cc-ci via one `nixos-rebuild switch`.
cc-ci is declared **entirely** as a NixOS flake (this repo). Bringing up the box is just
**clone + `nixos-rebuild switch`** + the operator preconditions — no manual post-steps. The proxy
(traefik) and Drone server are deployed by **idempotent-reconcile systemd oneshots** (`modules/
proxy.nix`, `modules/drone.nix`) that converge the swarm to the desired state on every activation
and boot (and self-heal drift), mirroring `swarm-init`. Target: a NixOS 24.11 host reachable as
`cc-ci` over SSH (root).
cc-ci is declared **entirely** as a NixOS flake — base config in this repo (`cc-ci`) and **all
secrets (incl. the wildcard TLS cert) sops-encrypted in a private companion repo `cc-ci-secrets`,
mounted as a git submodule at `secrets/`**. Bringing up the box is: **clone `--recursive` + provision
the one bootstrap age key + `nixos-rebuild switch`** + the external DNS/gateway — no manual
post-steps. The proxy (traefik), Drone, comment-bridge, dashboard and backupbot are deployed by
**idempotent-reconcile systemd oneshots** that converge the swarm on every activation/boot (and
self-heal drift), mirroring `swarm-init`; they are **serialized** (proxy→drone→bridge→dashboard→
backupbot) so a single switch converges on a blank host. Target: a NixOS 24.11 host reachable over SSH (root).
*(Verified on a throwaway Incus VM: blank host + the two repos + the age key → one `nixos-rebuild
switch` → fully converged cc-ci, 0 failed units — see DECISIONS.md Phase-1c / D8.)*
## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md)
## Preconditions
- Wildcard TLS cert at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}`
(`*.ci.commoninternet.net` + `ci.commoninternet.net`). **Renewed out-of-band; never ACME here.**
**The one out-of-band secret (provision before the first rebuild):**
- The **bootstrap age key** at `/var/lib/sops-nix/key.txt` (mode 0600). It must be a sops recipient
of `cc-ci-secrets/secrets.yaml`. Two cases:
- **Canonical cc-ci:** its SSH host key is already a recipient — also works via `age.sshKeyPaths`;
the keyFile holds the host-derived age identity (`ssh-to-age -private-key -i
/etc/ssh/ssh_host_ed25519_key`).
- **A fresh/cloned host** (different SSH host key, not a recipient): provision the **off-box
recovery age key** (`age1cmk26…`'s private half) there — it decrypts every secret incl. the cert.
Everything else (cert, Drone OAuth/RPC, webhook HMAC) is sops-encrypted **in git** — nothing else
is provisioned out-of-band.
**External infra (operator-owned, not on the box — class-A1):**
- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci.
- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
- The wildcard cert is **renewed out-of-band** by the operator, who then re-encrypts it into
`cc-ci-secrets` (sops) and rebuilds — the Gandi DNS token never touches the box; **never ACME here.**
## 1. Apply the NixOS flake (this is the whole install)
@@ -25,29 +42,39 @@ host SSH key), Docker + single-node Swarm + the `proxy` overlay + firewall 80/44
**Drone exec runner** (`modules/drone-runner.nix`).
```sh
# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech)
# e.g. git clone <repo> /root/cc-ci (or sync it)
nixos-rebuild switch --flake /root/cc-ci#cc-ci
# 1. Clone base + the private secrets submodule (bot/deploy creds for cc-ci-secrets).
# The submodule provides secrets/secrets.yaml (sops). Use a credential that can read
# recipe-maintainers/cc-ci-secrets, e.g. a per-command header (never persisted):
git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /root/cc-ci
# (if cloned non-recursively: git -C /root/cc-ci submodule update --init)
# 2. Provision the bootstrap age key (see Preconditions) — the ONE out-of-band secret:
install -m700 -d /var/lib/sops-nix
install -m600 /path/to/bootstrap-age-key /var/lib/sops-nix/key.txt
# 3. One nixos-rebuild switch. NOTE: ?submodules=1 so the git flake includes secrets/.
nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'
```
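Steps 1 and 2 leave two things on disk that step 3 depends on: a populated `secrets/` submodule and the age key with mode 0600. A minimal sketch of a pre-rebuild check follows. The helper name `ccci_preflight` and its arguments are hypothetical (not part of the repo), and GNU `stat` is assumed.

```sh
# Hypothetical helper (not in the repo): check the inputs the rebuild relies on.
# Usage: ccci_preflight <repo-dir> <age-key-file>
ccci_preflight() {
    repo=$1
    key=$2
    # The submodule must be populated, otherwise secrets/secrets.yaml is missing.
    if [ ! -s "$repo/secrets/secrets.yaml" ]; then
        echo "missing $repo/secrets/secrets.yaml (run: git submodule update --init)" >&2
        return 1
    fi
    # The bootstrap age key must exist as a regular file.
    if [ ! -f "$key" ]; then
        echo "missing age key: $key" >&2
        return 1
    fi
    # The key must have mode 0600 (stat -c is GNU coreutils syntax).
    mode=$(stat -c '%a' "$key")
    if [ "$mode" != "600" ]; then
        echo "age key $key has mode $mode, expected 600" >&2
        return 1
    fi
    echo "preflight ok"
}

# Example call with the paths used above:
# ccci_preflight /root/cc-ci /var/lib/sops-nix/key.txt
```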
On activation, the reconcile oneshots (`deploy-proxy`, `deploy-drone`) run automatically and converge
the swarm. Verify:
On activation sops-nix decrypts every secret (incl. the wildcard cert → `/var/lib/ci-certs/live/`),
then the serialized reconcile oneshots converge the swarm. Verify:
```sh
systemctl is-system-running # -> running
docker info --format '{{.Swarm.LocalNodeState}}' # -> active
docker service ls # traefik (app+socket-proxy) + drone, all 1/1
systemctl is-active deploy-proxy deploy-drone drone-runner-exec # -> active x3
# wildcard cert served end-to-end via the gateway:
curl -ksv --resolve probe.ci.commoninternet.net:443:<gateway-ip> https://probe.ci.commoninternet.net/ \
2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
curl -ks --resolve drone.ci.commoninternet.net:443:<gateway-ip> \
-o /dev/null -w '%{http_code}\n' https://drone.ci.commoninternet.net/healthz # -> 200
systemctl is-system-running # -> running (0 failed units)
docker service ls # traefik app+socket-proxy, drone, bridge, dashboard, backups — all 1/1
# cert is sops-decrypted FROM GIT to the path traefik serves:
sha256sum /var/lib/ci-certs/live/fullchain.pem # symlink -> /run/secrets/wildcard_cert
# TLS served from the git cert, verified locally on the host (SNI ci.commoninternet.net):
curl -s --resolve probe.ci.commoninternet.net:443:127.0.0.1 \
-o /dev/null -w 'ssl_verify=%{ssl_verify_result}\n' https://probe.ci.commoninternet.net/ # -> 0
# (the served leaf fingerprint == the cert in cc-ci-secrets)
```
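The last comment in the block (served leaf fingerprint equals the cert in `cc-ci-secrets`) can be checked by hand. The sketch below is one way to do it, assuming `openssl` is available on the host and that the first certificate in `fullchain.pem` is the leaf. The function names `normalize_fp` and `served_leaf_matches` are hypothetical, not something the repo provides.

```sh
# Reduce `openssl x509 -fingerprint` output to bare lowercase hex, so the
# comparison does not depend on the label text or the colon separators.
normalize_fp() {
    printf '%s\n' "$1" | sed 's/^.*=//' | tr -d ':' | tr 'A-F' 'a-f'
}

# Hypothetical helper. Usage: served_leaf_matches <fullchain.pem> <sni-host> [<addr:port>]
served_leaf_matches() {
    pem=$1
    host=$2
    addr=${3:-127.0.0.1:443}
    # openssl x509 reads only the first certificate in the file (assumed leaf).
    on_disk=$(openssl x509 -in "$pem" -noout -fingerprint -sha256) || return 1
    # s_client prints the served leaf as PEM; x509 picks it out of that output.
    served=$(openssl s_client -connect "$addr" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256) || return 1
    [ "$(normalize_fp "$on_disk")" = "$(normalize_fp "$served")" ]
}

# Example, run on the host:
# served_leaf_matches /var/lib/ci-certs/live/fullchain.pem probe.ci.commoninternet.net && echo match
```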
> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`):
> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci`
> it survives the tailscale restart during activation, and use the absolute flake ref:
> `systemd-run --no-block --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'`
> *(On the canonical cc-ci the build source is synced from the admin's clone via `tar | ssh` and built
> as a `path:` flake — no submodule fetch needed there; the `?submodules=1` form is for a git clone.)*
## 2. One-time: link Drone ↔ Gitea (OAuth grant)