From 9bffb55b28a24de29e5d6edfd0d71a696508e476 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Tue, 26 May 2026 21:25:48 +0100
Subject: [PATCH] M0: flake + base NixOS config, rebuilt from repo on cc-ci

Pins nixpkgs to the rev cc-ci already ran (no-op-then-base); deploy via
switch --flake on-host. System healthy (gen 3) post-switch.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 BACKLOG.md                    |  7 +++---
 DECISIONS.md                  | 23 +++++++++++++------
 JOURNAL.md                    | 36 ++++++++++++++++++++++++++++++
 STATUS.md                     | 19 ++++++++++------
 flake.lock                    | 27 ++++++++++++++++++++++
 flake.nix                     | 28 +++++++++++++++++++++++
 hosts/cc-ci/.gitkeep          |  0
 hosts/cc-ci/configuration.nix | 42 +++++++++++++++++++++++++++++++++++
 hosts/cc-ci/hardware.nix      | 21 ++++++++++++++++++
 9 files changed, 186 insertions(+), 17 deletions(-)
 create mode 100644 flake.lock
 create mode 100644 flake.nix
 delete mode 100644 hosts/cc-ci/.gitkeep
 create mode 100644 hosts/cc-ci/configuration.nix
 create mode 100644 hosts/cc-ci/hardware.nix

diff --git a/BACKLOG.md b/BACKLOG.md
index f65d2e5..4ea21fc 100644
--- a/BACKLOG.md
+++ b/BACKLOG.md
@@ -6,10 +6,11 @@ Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adver
 ## Build backlog
 
 ### M0 — Foundations
-- [ ] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
-- [ ] Deploy mechanism decision + first rebuild from repo (DECISIONS.md)
+- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
+- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
 - [ ] sops-nix wiring: host age key, secrets/secrets.yaml, decrypt a test secret on host
-- [ ] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
+- [ ] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
+  (base + rebuild verified healthy 2026-05-26; will CLAIM gate once sops test-secret also lands)
 
 ### M1 — Swarm + abra target
 - [ ] Docker + single-node swarm via Nix
diff --git a/DECISIONS.md b/DECISIONS.md
index eddcb64..4049f12 100644
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -12,10 +12,17 @@ Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
 
 ## Open (defaults from §8, to confirm as reality lands)
 
-- **Deploy mechanism:** TBD in M0. Leaning `nixos-rebuild switch --flake` run *on cc-ci itself*
-  (repo cloned on host) rather than `--target-host`/deploy-rs from the sandbox, to avoid copying
-  large Nix closures over the userspace-tailscaled SOCKS proxy. Atomic-rollback is preserved by
-  Nix generations. Will record final choice + rationale when M0 lands.
+- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
+  cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
+  `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
+  proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
+  The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
+  --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
+  loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
+  source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
+  on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
+  - **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
+    is a true no-op-then-base. Bump deliberately, never drift.
 - **Webhook scope:** default per-repo via enroll script.
 - **Drone runner type:** default exec (must drive host abra).
 - **Secret tool:** default sops-nix.
@@ -25,9 +32,11 @@ Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
 
 ## Risks
 
-- **Disk:** cc-ci has only ~3.8 GiB free on an 8.9 GiB root. Multiple recipe images + volumes may
-  exhaust it during M6.5 breadth. Mitigation: aggressive teardown + image prune; if insufficient,
-  request operator grow the VM disk (Incus, recreatable per the incus skill). Not yet blocking.
+- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
+  **inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
+  inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
+  the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
+  periodic `docker image prune` to avoid regressing during M6.5 breadth.
 
 ## Dead-ends
 - (none yet)
diff --git a/JOURNAL.md b/JOURNAL.md
index 83410c7..a9e715b 100644
--- a/JOURNAL.md
+++ b/JOURNAL.md
@@ -22,3 +22,39 @@
 - Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.
 
 **Next:** commit + push bootstrap, then M0 (flake + base config + sops test secret).
+
+## 2026-05-26 — M0: flake + base config rebuilt from repo
+
+**Authored** `flake.nix` (pins nixpkgs rev `50ab793786d9…`, the exact rev cc-ci ran),
+`hosts/cc-ci/hardware.nix` (incus VM module + cloud-init + DHCP/nameservers) and
+`hosts/cc-ci/configuration.nix` (faithful baseline repro: tailscale w/ hardcoded
+`--hostname=cc-nix-test` since `builtins.readFile /etc/ts-hostname` is impure under flakes; sshd root; firewall
+trust tailscale0 + tcp/22; base pkgs).
+
+**Disk/inode hiccup → resolved:** first `nix flake lock`/build hit `No space left on device` —
+diagnosed as **inode** exhaustion (`df -i` → 6005 free of 586336; old 8.9 GiB fs). Operator grew
+the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.
+
+**Build + switch (commands + output):**
+- `ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci'` → `BUILD EXIT 0`,
+  produced `nixos-system-nixos-24.11.20250630.50ab793`.
+- `ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch
+  --flake /root/cc-ci#cc-ci'` (detached so it survives ssh drop) → unit `Result=success
+  ExecMainStatus=0`.
+
+**Gate verification:**
+- `systemctl is-system-running` → `running`
+- `readlink /run/current-system` → `…-nixos-system-nixos-24.11.20250630.50ab793` (gen 3, from flake)
+- `systemctl is-active tailscaled` → `active`; `sshd.socket` → `active` (sshd is socket-activated, so
+  `sshd.service` reads inactive — live ssh proves it works)
+- `systemctl --failed` → none
+- `nixos-rebuild list-generations` → gen 3 current @20:23, prior channel gen 2 retained for rollback.
+
+**Known warning (tracked, non-blocking):** incus module enables `systemd.network` while we keep
+`networking.useDHCP=true` (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
+baseline; networking is up. Clean up by choosing one stack later.
+
+**Deploy mechanism settled** (DECISIONS.md): `switch --flake` on-host, repo synced via `tar | ssh`.
+
+**Next:** sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then
+CLAIM the M0 gate for the Adversary.
diff --git a/STATUS.md b/STATUS.md
index 7ab8747..f3a45a3 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -1,17 +1,22 @@
 # STATUS — cc-ci Builder
 
 **Phase:** M0 — Foundations
-**In-flight:** Bootstrap complete; starting M0 (flake + base config + sops test secret).
-**Last updated:** 2026-05-26 (bootstrap)
+**In-flight:** Base flake config deployed + verified. Next M0 task: sops-nix + decrypt a test secret.
+**Last updated:** 2026-05-26 (M0 base config live)
 
 ## Gates
-- (none claimed yet)
+- (none claimed yet — M0 gate pends sops wiring)
 
 ## Blocked
 - (none)
 
 ## Notes
-- cc-ci baseline: Incus VM, 2 vCPU, 3.5 GiB RAM, **3.8 GiB free disk** — tight for multi-recipe
-  docker deploys; watch disk pressure, may need operator to grow the VM disk before M6.5 breadth.
-- Server config is currently channel-based `/etc/nixos/configuration.nix` (no flake). M0 converts
-  to a flake checked out from this repo on the host.
+- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
+  1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
+  nixpkgs fetch exhausted). Both byte + inode pressure gone.
+- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
+  rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
+  a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
+- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
+  (scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
+  up; clean up later (pick networkd OR scripting). Tracked, non-blocking.
diff --git a/flake.lock b/flake.lock
new file mode 100644
index 0000000..8eeab1b
--- /dev/null
+++ b/flake.lock
@@ -0,0 +1,27 @@
+{
+  "nodes": {
+    "nixpkgs": {
+      "locked": {
+        "lastModified": 1751274312,
+        "narHash": "sha256-/bVBlRpECLVzjV19t5KMdMFWSwKLtb5RyXdjz3LJT+g=",
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
+        "type": "github"
+      },
+      "original": {
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
+        "type": "github"
+      }
+    },
+    "root": {
+      "inputs": {
+        "nixpkgs": "nixpkgs"
+      }
+    }
+  },
+  "root": "root",
+  "version": 7
+}
diff --git a/flake.nix b/flake.nix
new file mode 100644
index 0000000..a8fddfb
--- /dev/null
+++ b/flake.nix
@@ -0,0 +1,28 @@
+{
+  description = "cc-ci — Co-op Cloud recipe CI server (NixOS)";
+
+  inputs = {
+    # Pinned to the exact revision cc-ci already runs, so the first rebuild from
+    # this repo is a true no-op-then-base (M0). Bump deliberately, not drift.
+    nixpkgs.url = "github:NixOS/nixpkgs/50ab793786d9de88ee30ec4e4c24fb4236fc2674";
+  };
+
+  outputs = { self, nixpkgs }:
+    let
+      system = "x86_64-linux";
+      pkgs = nixpkgs.legacyPackages.${system};
+    in
+    {
+      nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
+        inherit system;
+        modules = [ ./hosts/cc-ci/configuration.nix ];
+      };
+
+      # Devshell for working on the harness/bridge locally.
+      devShells.${system}.default = pkgs.mkShell {
+        packages = with pkgs; [ git jq curl nixpkgs-fmt ];
+      };
+
+      formatter.${system} = pkgs.nixpkgs-fmt;
+    };
+}
diff --git a/hosts/cc-ci/.gitkeep b/hosts/cc-ci/.gitkeep
deleted file mode 100644
index e69de29..0000000
diff --git a/hosts/cc-ci/configuration.nix b/hosts/cc-ci/configuration.nix
new file mode 100644
index 0000000..ab5605c
--- /dev/null
+++ b/hosts/cc-ci/configuration.nix
@@ -0,0 +1,42 @@
+# cc-ci machine config. M0 = faithful reproduction of the baseline (docs/baseline.md)
+# so the first flake rebuild is a no-op-then-base. Services (swarm/Traefik/Drone/
+# bridge/dashboard) are layered in via ./modules/* in later milestones.
+{ pkgs, lib, ... }:
+{
+  imports = [
+    ./hardware.nix
+  ];
+
+  # --- Tailscale (ACCESS-CRITICAL: do not break, this is the only route in) ---
+  # Baseline read the hostname from /etc/ts-hostname at eval time; that is impure
+  # under flakes, so we pin the known hostname. The reusable auth-key file persists.
+  services.tailscale = {
+    enable = true;
+    authKeyFile = "/etc/ts-auth-key";
+    extraUpFlags = [ "--hostname=cc-nix-test" ];
+  };
+
+  # --- SSH (root login over tailscale) ---
+  services.openssh = {
+    enable = true;
+    settings.PermitRootLogin = "yes";
+  };
+
+  # --- Firewall: trust tailscale, allow SSH ---
+  networking.firewall = {
+    enable = true;
+    trustedInterfaces = [ "tailscale0" ];
+    allowedTCPPorts = [ 22 ];
+  };
+
+  environment.systemPackages = with pkgs; [
+    curl
+    git
+    jq
+    openssh
+  ];
+
+  nix.settings.experimental-features = [ "nix-command" "flakes" ];
+
+  system.stateVersion = "24.11";
+}
diff --git a/hosts/cc-ci/hardware.nix b/hosts/cc-ci/hardware.nix
new file mode 100644
index 0000000..c95a82b
--- /dev/null
+++ b/hosts/cc-ci/hardware.nix
@@ -0,0 +1,21 @@
+# Hardware / platform for cc-ci: an Incus VM (x86_64) on the autonomic infra.
+# Mirrors the pre-flake baseline (docs/baseline.md).
+{ modulesPath, ... }:
+{
+  imports = [
+    "${modulesPath}/virtualisation/incus-virtual-machine.nix"
+  ];
+
+  # incus-agent for `incus exec`
+  virtualisation.incus.agent.enable = true;
+
+  # cloud-init seeded the VM (network + /etc/ts-* files); keep it enabled.
+  services.cloud-init = {
+    enable = true;
+    network.enable = true;
+  };
+
+  # DHCP from the incus bridge; bridge provides no resolver, so set our own.
+  networking.useDHCP = true;
+  networking.nameservers = [ "1.1.1.1" "8.8.8.8" ];
+}
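Note on the disk risk in DECISIONS.md and JOURNAL.md: the ENOSPC came from inodes, not bytes (`df -i` showed 6005 free of 586336). A check along the lines below could surface inode pressure before M6.5 breadth. It is a sketch that is not part of this commit. It assumes POSIX-format `df -P -i` output on stdin. The function name `inode_report` and the `MIN_FREE_PCT` threshold (default 5) are hypothetical.

```shell
#!/bin/sh
# Sketch: summarise inode headroom from `df -P -i` output read on stdin.
# Assumptions (not from this repo): POSIX awk is available, and any filesystem
# with fewer than MIN_FREE_PCT percent of its inodes free should be flagged LOW.
MIN_FREE_PCT=${MIN_FREE_PCT:-5}

# Prints "<mount> <free inodes> <total inodes> <percent free> <ok|LOW>" per filesystem.
inode_report() {
  awk -v minpct="$MIN_FREE_PCT" '
    NR == 1 { next }            # skip the "Filesystem Inodes IUsed IFree ..." header
    $2 > 0 {                    # skip filesystems that report no inode total
      pct = 100 * $4 / $2       # IFree as a percentage of Inodes
      printf "%s %d %d %.1f %s\n", $6, $4, $2, pct, (pct < minpct ? "LOW" : "ok")
    }'
}
```

Usage: `ssh cc-ci 'df -P -i /' | inode_report` runs `df` on the host and the `awk` summary locally. Fed the pre-resize figures above, it prints `/ 6005 586336 1.0 LOW`. A percentage threshold, rather than a fixed count, keeps the check meaningful across the 8.9 GiB and 28 GiB layouts.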
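Note on the deploy loop in DECISIONS.md: the host copy at `/root/cc-ci` is synced from the sandbox clone via `tar | ssh`, because rsync is absent on cc-ci. Below is a minimal sketch of that step. It assumes the sandbox's `ssh cc-ci` alias works and that `tar` exists on both ends. `sync_tree` and the `TRANSPORT` override are hypothetical. The override is there so the transport can be swapped, for example for `sh -c` when trying the function locally.

```shell
#!/bin/sh
# Sketch of the tar-over-ssh sync described in DECISIONS.md (rsync is absent on cc-ci).
# Hypothetical helper: sync_tree SRC DEST streams the contents of SRC into DEST on the
# far side of $TRANSPORT. TRANSPORT defaults to "ssh cc-ci" and must accept a single
# shell-command string as its final argument. Paths containing single quotes are not handled.
sync_tree() {
  src=$1
  dest=$2
  transport=${TRANSPORT:-"ssh cc-ci"}
  # Create DEST, then unpack the stream there. Extraction adds and overwrites files
  # but never deletes files that were removed from SRC.
  # $transport is deliberately unquoted so "ssh cc-ci" splits into command + host.
  tar -C "$src" -cf - . | $transport "mkdir -p '$dest' && tar -C '$dest' -xf -"
}
```

Usage: run `sync_tree "$HOME/cc-ci" /root/cc-ci` (the sandbox path is a placeholder), then run the detached `systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci` over ssh, as recorded in JOURNAL.md. Extraction never deletes, so files removed from the repo can linger on the host. Because `tar` of `.` also carries `.git`, a git-backed flake should still see only tracked files.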