JOURNAL — phase nixenv (Builder)

2026-06-17 — M1: single-source the harness runtime env

Why this design

The phase plan §2 wants ONE definition of "what's needed to run a recipe test", referenced from three places, so DEFECT-3 (a dep present for one path, missing for another) becomes structurally impossible. I put the single source in nix/modules/packages.nix because it is the existing "shared pkgs" overlay module already imported by both host configs — so pkgs.ccciRuntimeTools and pkgs.cc-ci-run are reachable from every module/host without a fragile cross-module let.

Three overlay defs:

  • ccciPyEnv (let-bound, internal) — python3.withPackages [pytest playwright], the ONLY pyEnv now.
  • ccciRuntimeTools (overlay attr) — the union tool set.
  • cc-ci-run (overlay attr) — writeShellApplication with runtimeInputs = [ccciPyEnv] ++ ccciRuntimeTools.

Consumers:

  • harness.nix → environment.systemPackages = [ pkgs.cc-ci-run ] (installs the entrypoint).
  • nightly-sweep.nix → wrapper execs cc-ci-run (same binary the Drone pipeline runs), so pyEnv + tooling + PLAYWRIGHT env are identical to the Drone path by construction. Dropped: the duplicate pyEnv, the parallel runtimeInputs tool list, and the DEFECT-3 export PATH=/run/current-system/sw/bin… prepend — git-lfs/bash/util-linux/openssl now come from cc-ci-run's runtimeInputs.
  • both host configuration.nix → systemPackages = pkgs.ccciRuntimeTools ++ [ pkgs.openssh ].

Why the union is a superset (nothing dropped)

  • old cc-ci-run: abra docker git coreutils util-linux ⊂ set.
  • old sweep: bash abra docker git curl jq gnused gnugrep gnutar coreutils util-linux procps ⊂ set; its host-PATH-derived git-lfs/openssl are now EXPLICIT in the set.
  • old host PATH: curl git jq (+ git-lfs on hetzner only) ⊂ set; openssh kept as host-only add.
  • pyEnv (python3+pytest+playwright) + playwright browsers (via PLAYWRIGHT_BROWSERS_PATH) preserved.

Additions vs any single prior list: git-lfs, openssl (plan §2). The cc-ci host GAINS git-lfs, which kills the one-off hetzner-only divergence; both host configs are now byte-identical.
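The superset argument above can also be checked mechanically. A minimal sketch, assuming the union set consists of exactly the tools named in the bullets above; the authoritative list is the one in packages.nix, which this journal does not reproduce, so RUNTIME_TOOLS below is a reconstruction and not a copy. pyEnv is tracked separately and left out.

```python
# Prior tool lists, copied from the bullets above.
OLD_CC_CI_RUN = {"abra", "docker", "git", "coreutils", "util-linux"}
OLD_SWEEP = {
    "bash", "abra", "docker", "git", "curl", "jq", "gnused",
    "gnugrep", "gnutar", "coreutils", "util-linux", "procps",
}
OLD_SWEEP_HOST_DERIVED = {"git-lfs", "openssl"}   # used to leak in via the host PATH
OLD_HOST_PATH = {"curl", "git", "jq", "git-lfs"}  # git-lfs was hetzner-only

# ASSUMPTION: reconstruction of ccciRuntimeTools from this journal,
# not the literal contents of packages.nix.
RUNTIME_TOOLS = {
    "abra", "bash", "coreutils", "curl", "docker", "git", "git-lfs",
    "gnugrep", "gnused", "gnutar", "jq", "openssl", "procps", "util-linux",
}


def dropped(old, new):
    """Return the tools present in `old` but missing from `new`, sorted."""
    return sorted(old - new)


PRIOR_LISTS = {
    "old cc-ci-run": OLD_CC_CI_RUN,
    "old sweep": OLD_SWEEP,
    "old sweep (host-derived)": OLD_SWEEP_HOST_DERIVED,
    "old host PATH": OLD_HOST_PATH,
}

# An empty list for every entry means nothing was dropped.
report = {name: dropped(tools, RUNTIME_TOOLS) for name, tools in PRIOR_LISTS.items()}

for name, missing in report.items():
    print(f"{name}: {'OK' if not missing else 'MISSING ' + ', '.join(missing)}")
```

A check like this only guards the package names; it says nothing about versions or about whether a given binary is actually on PATH at run time, which is what the live witnesses are for.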

Why writeShellApplication makes this work

writeShellApplication emits export PATH="<runtimeInputs>:$PATH" (confirmed on the live wrapper). So cc-ci-run's full tool set is the PATH prefix regardless of caller. Under Drone the inherited suffix is /run/current-system/sw/bin:/run/wrappers/bin; under the sweep it's the systemd-minimal PATH — but the harness tools all resolve from the shared prefix either way, which is the parity the plan wants. The host systemPackages reference is the belt-and-suspenders path for direct .drone.yml shell-outs (abra --version, docker info) that don't go through cc-ci-run.
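The prefix argument can be illustrated without Nix. A minimal sketch in Python, assuming a Linux host: temporary directories stand in for the runtimeInputs prefix and for the two inherited suffixes, and shutil.which plays the role of the shell's PATH lookup. The directory names and the helper functions are hypothetical; only the lookup order is the point.

```python
import os
import shutil
import stat
import tempfile


def make_tool(directory, name):
    """Create a stub executable called `name` in `directory` and return its path."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\n")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def resolve(tool, runtime_prefix, inherited_path):
    """Look up `tool` on a PATH built as runtime_prefix followed by inherited_path."""
    effective = os.pathsep.join(runtime_prefix + inherited_path)
    return shutil.which(tool, path=effective)


with tempfile.TemporaryDirectory() as root:
    prefix_dir = os.path.join(root, "runtime-inputs")   # stands in for runtimeInputs
    drone_dir = os.path.join(root, "drone-suffix")      # stands in for the host sw/bin
    systemd_dir = os.path.join(root, "systemd-suffix")  # stands in for the minimal unit PATH
    for d in (prefix_dir, drone_dir, systemd_dir):
        os.mkdir(d)

    lfs = make_tool(prefix_dir, "git-lfs")
    make_tool(drone_dir, "git-lfs")  # a host copy exists on the Drone side only

    # Same answer under both callers, because the prefix is searched first.
    under_drone = resolve("git-lfs", [prefix_dir], [drone_dir])
    under_sweep = resolve("git-lfs", [prefix_dir], [systemd_dir])
    # Without the prefix, the minimal PATH has no git-lfs at all.
    without_prefix = resolve("git-lfs", [], [systemd_dir])

print(under_drone == lfs, under_sweep == lfs, without_prefix)
```

The last lookup is the DEFECT-3 shape in miniature: a tool that resolves for one caller only because that caller's inherited PATH happens to carry it.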

buildEnv collision watch (resolved)

Worry: adding coreutils/util-linux/procps/bash/gnu* to host systemPackages could collide with the NixOS base requiredPackages. It did not — base requiredPackages are lowPrio, so the normal-prio additions override cleanly. Both #cc-ci and #cc-ci-hetzner built with no collision error.

Note on other modules' tool lists

backupbot/docker-prune/drone/proxy/warm-keycloak.nix still list gnused/gnugrep/etc. in their OWN runtimeInputs; those are independent reconcile-service scripts, never part of the harness/recipe-test env, never part of the DEFECT-3 divergence. Single-sourcing is scoped to the harness env (pyEnv + recipe-test tooling consumed by cc-ci-run / sweep / host PATH), which is now packages.nix only.

Verification (local; dirty tree needs ?submodules=1 because secrets/ is a submodule)

  • nixos-rebuild build --flake '.?submodules=1#cc-ci-hetzner' → built nixos-system-…dhmpm232….
  • nixos-rebuild build --flake '.?submodules=1#cc-ci' → built OK.
  • cc-ci-run store zxlx9jnylh7la5m48bsqb1wfm5l9r0bd; PATH carries all 15 tools incl git-lfs-3.6.1 + openssl-3.3.3.
  • sweep wrapper gh02w1kc… execs the SAME zxlx9j…/bin/cc-ci-run.
  • cc-ci host sw/bin now lists git-lfs + openssl (was missing git-lfs pre-refactor).
  • grep -rn withPackages nix/ → 1 hit (packages.nix:17).

2026-06-17T18:17Z — M2 claim (both live parity witnesses green)

Drone-path witness (build #871)

Why REF=357926f2 PR=1 SRC=recipe-maintainers/gitea: this is the lfs-plain-gitea capstone ref (the gtea-phase Build #685 ref). PR #1 is now merged so compose.lfs.yml is also on main, but pinning the PR head guarantees _lfs_enabled() is true (compose.lfs.yml in checkout + RECIPE=gitea) so the LFS test RUNS rather than skips. fetch_recipe takes the SRC+REF mirror-clone path; EXTRA_ENV adds compose.lfs.yml to install+custom tiers so the deployed gitea has LFS on for the round-trip. Triggered via the Drone API with the bridge's drone token (kept on-host). Build went green in ~3 min; test_lfs_roundtrip PASSED. This is the SAME cc-ci-run store path the timer sweep execs, so the two witnesses prove parity by both construction (M1) and observation (M2).

Why the timer fire is the harder witness

The systemd unit PATH is systemd-minimal (coreutils/findutils/gnugrep/gnused/systemd) — NO git-lfs, NO /run/current-system/sw/bin. So a green LFS test there can ONLY come from cc-ci-run's runtimeInputs prepending git-lfs-3.6.1 to PATH. Confirmed by reading /proc/<run_recipe_ci pid>/environ live: PATH starts with the cc-ci-run tool prefix incl git-lfs. This is exactly the DEFECT-3 condition the phase set out to make structurally impossible.
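The /proc check can be scripted. A minimal sketch in Python, assuming Linux and permission to read the target process's environ file; the helper names are hypothetical and the sample blob uses a placeholder store hash, not a real path. Note that /proc/<pid>/environ shows the environment the process was started with, which is what matters here because run_recipe_ci inherits PATH from the wrapper.

```python
import os


def parse_environ(raw):
    """Decode the NUL-separated KEY=VALUE records of a /proc/<pid>/environ blob."""
    env = {}
    for record in raw.split(b"\0"):
        if b"=" not in record:
            continue  # skips the empty record after the trailing NUL
        key, _, value = record.partition(b"=")
        env[os.fsdecode(key)] = os.fsdecode(value)
    return env


def path_entries(pid):
    """Return the PATH entries of a live process, in lookup order."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read()).get("PATH", "").split(":")


def first_index(entries, needle):
    """Index of the first PATH entry containing `needle`, or -1 if none does."""
    for i, entry in enumerate(entries):
        if needle in entry:
            return i
    return -1


# Placeholder blob: HASH is not a real store hash.
sample = b"PATH=/nix/store/HASH-git-lfs-3.6.1/bin:/usr/bin\0HOME=/root\0"
sample_entries = parse_environ(sample)["PATH"].split(":")
print(first_index(sample_entries, "git-lfs"))  # position of git-lfs in the sample PATH
```

Against a live run, something like first_index(path_entries(pid), "git-lfs") with the run_recipe_ci pid would show whether git-lfs sits in the wrapper-provided prefix.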

GREEN-BUT-PROMOTE-FAILED is not mine

Spent effort confirming the gitea promote-fail (abra app deploy warm-gitea -o -n → "already deployed") is pre-existing: it appears identically in the two pre-deploy sweep fires (14:28Z, 15:56Z, OLD env) and the promote path (runner/nightly_sweep.py) is unchanged by nixenv (last touched canon f94de22). It's an abra deploy-idempotency limitation on the persistent warm canonical (warm-gitea up since 08:39Z), non-fatal, known-good unchanged. discourse/mattermost-lts reds are likewise recipe-level and pre-existing (mattermost: postgres restore marker assertion; docker resolved fine → not a dropped tool). nixenv changes only WHICH tools are on PATH; it dropped nothing (M1 superset proof), so it cannot have caused an app-level red.