STATUS — phase nixenv (Builder)

Phase plan: /srv/cc-ci/cc-ci-plan/plan-phase-nixenv-shared-runtime-env.md

Phase

Single-source the harness/recipe-test runtime env so the Drone runner, the nightly/weekly sweep timer, and host systemPackages share ONE declaration (no duplicate pyEnv, no divergent runtimeInputs, DEFECT-3 host-PATH patch removed/subsumed).

M1 — PASS @ 2026-06-17T17:40Z (REVIEW-nixenv.md, claim 8b8fc1f). No VETO.

Gate: M2 — CLAIMED @ 2026-06-17T18:17Z, awaiting Adversary (claim commit below)

WHAT (M2 DoD). (1) Deployed via nixos-rebuild switch, host verified healthy. (2) Live parity: gitea test_lfs_roundtrip GREEN under BOTH a real timer fire AND the Drone path, from the shared env (git-lfs resolves on both — DEFECT-3 condition met live). (3) A canon-style sweep still promotes/SKIPs correctly under the unified env — no regression to canon's result.

WHERE (inputs). Deployed system from /etc/cc-ci @ d11f8f5 (= M1-reviewed tree). nixenv diff dd6712c..d11f8f5 = nix/ modules + machine-docs ONLY; zero runner/ or tests/ changes (verify: git diff --name-only dd6712c..d11f8f5 | grep -E 'runner/|tests/' → empty). runner/nightly_sweep.py (the promote path) was last touched by canon commit f94de22 — byte-identical to canon.
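
The diff-scope check above can be wrapped so that it yields an exit code instead of relying on a reader eyeballing empty output. This is a minimal sketch: scope_ok is a hypothetical helper name, not something in the cc-ci repo, and it anchors the pattern at the start of each path, which is slightly stricter than the unanchored grep quoted above.

```shell
# scope_ok: read changed paths on stdin (one per line) and succeed only if
# none of them lives under runner/ or tests/.
# Hypothetical helper for illustration; not part of the cc-ci repo.
scope_ok() {
    # grep exits 0 when at least one path matches, so a match means failure.
    if grep -E '^(runner|tests)/' >/dev/null; then
        return 1
    fi
    return 0
}

# Usage, from a checkout of /etc/cc-ci:
#   git diff --name-only dd6712c..d11f8f5 | scope_ok && echo "scope OK"
```

An empty diff also passes, which matches the "→ empty" expectation of the original one-liner.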

M2 result summary (both witnesses PASS, host healthy, no regression)

  • (2a) Drone-path witness — PASS. Drone build #871 (event=custom, RECIPE=gitea REF=357926f2 PR=1 SRC=recipe-maintainers/gitea), status=success, 18:11→18:14Z. The Drone exec pipeline runs cc-ci-run runner/run_recipe_ci.py (.drone.yml:83). compose.lfs.yml present at that ref → _lfs_enabled() true → LFS test RAN (not skipped): tests/gitea/custom/test_lfs_roundtrip.py::test_lfs_roundtrip PASSED; all install/upgrade/backup/restore/custom tiers PASSED.
    • HOW (Adversary re-run): ssh cc-ci 'TOK=$(cat /run/secrets/bridge_drone_token); curl -s -H "Authorization: Bearer $TOK" https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/871/logs/1/2 | jq -r ".[].out"' | grep test_lfs_roundtrip. EXPECTED: test_lfs_roundtrip PASSED. (Or trigger your OWN build with the same params and re-run.)
  • (2b) Real timer fire witness — PASS (details retained in the block below): test_lfs_roundtrip PASSED @17:57:54Z under systemctl start nightly-sweep.service, git-lfs resolved from cc-ci-run's runtimeInputs while the systemd unit PATH has NO git-lfs / no /run/current-system/sw/bin.
  • (3) No regression. Sweep (PID 2743890, 17:35→18:0xZ) completed all 20 enrolled recipes; SKIPs all correct (cryptpad/ghost/drone/hedgedoc/immich/lasuite-*/mailu/matrix-synapse/n8n/plausible/uptime-kuma no-new-version SKIP), promotes correct (custom-html→1.13.0+1.31.1, mumble→1.0.0+v1.6.870-0). Three results need explicit non-regression context, ALL pre-existing (identical in the pre-deploy fires PID 2149231@14:xx / 2248547@15:xx, OLD env):
    • gitea rc=0 GREEN-BUT-PROMOTE-FAILED — tests green; WC5 promote fails FATA warm-gitea… is already deployed (abra deploy-idempotency on the persistent warm canonical, up since 08:39Z; non-fatal). promote path = canon nightly_sweep.py f94de22, unchanged by nixenv.
    • discourse rc=1 and mattermost-lts rc=1 — recipe-level red (mattermost: test_restore_returns_state fails at docker exec … postgres … with relation "ci_marker" does not exist; docker resolved fine → NOT a missing-tool/dropped-dep failure). Both failed identically pre-deploy → not caused by the env change.
  • Host health (re-verified post-sweep @18:16Z). systemctl --failed empty; nightly-sweep.timer + deploy-proxy/deploy-drone/deploy-bridge/drone-runner-exec/swarm-init/warm-keycloak all active; drone /healthz 200, ci.commoninternet.net 200; live cc-ci-run = zxlx9jnylh7la5m48bsqb1wfm5l9r0bd (M1-reviewed path).
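
Both witnesses reduce to the same question about a log stream: did test_lfs_roundtrip report PASSED, and not SKIPPED or FAILED? A plain grep for the test name would also match a SKIPPED line, so a slightly stricter filter may be useful. The sketch below assumes pytest-style verdict lines like the one quoted in (2a); lfs_witness is a hypothetical name.

```shell
# lfs_witness: read a log on stdin; succeed only if test_lfs_roundtrip has at
# least one PASSED verdict and no FAILED, ERROR or SKIPPED verdict.
# Hypothetical helper; assumes pytest-style "<test id> <VERDICT>" lines.
lfs_witness() (
    # Collect every line that names the test (empty if there is none).
    lines=$(grep 'test_lfs_roundtrip' || true)
    # The test must have actually run and passed.
    printf '%s\n' "$lines" | grep -q 'PASSED' || exit 1
    # A skipped or failed occurrence anywhere disqualifies the witness.
    if printf '%s\n' "$lines" | grep -Eq 'FAILED|ERROR|SKIPPED'; then
        exit 1
    fi
    exit 0
)

# Usage: pipe the output of either HOW command into it, for example
#   ssh cc-ci 'journalctl -u nightly-sweep.service -o short-iso' | lfs_witness
```

The function body runs in a subshell, so its variable and its exit calls do not leak into the calling shell.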

M2 deploy + timer-fire details (retained for the record)

Deploy DONE @ 2026-06-17T17:34Z. nixos-rebuild switch --flake 'git+file:///etc/cc-ci?submodules=1#cc-ci-hetzner' (live host = hetzner; /etc/cc-ci @ d11f8f5). Deployed system /nix/store/dhmpm232r6m0sq3s7y5r5jpyv5kxgzwi-nixos-system-… is BYTE-IDENTICAL to the M1-reviewed local build. Health: systemctl --failed empty; deploy-proxy / warm-keycloak / swarm-init / drone-runner-exec all active; nightly-sweep.timer active; drone healthz + ci.commoninternet.net → 200. Live cc-ci-run = zxlx9jnylh7la5m48bsqb1wfm5l9r0bd (the M1-reviewed path); git-lfs/openssl/script/bash resolve on host PATH (openssl was MISSING pre-deploy).

Live parity witness — BOTH paths GREEN (Drone #871 + timer fire; summarised above). Diff scope: ONLY nix/ changed (dd6712c..d11f8f5: 5 nix files, zero runner/tests) → sweep SKIP/promote logic byte-identical to canon's PASSed sweep.

  • Real timer fire — PASS @ 2026-06-17T17:57:54Z. systemctl start nightly-sweep.service @ 17:35:38Z (PID 2743890; child run_recipe_ci PID 2808444). The unit's systemd PATH contains ONLY coreutils/findutils/gnugrep/gnused/systemd — NOT git-lfs, NOT /run/current-system/sw/bin — so git-lfs resolved from cc-ci-run's runtimeInputs (the DEFECT-3 condition). Verified live: the running run_recipe_ci process PATH (/proc/<pid>/environ) carries …-git-lfs-3.6.1/bin from cc-ci-run. gitea RUN (canonical 3.5.3+1.24.2 < tag 3.6.0+1.24.2) exercised LFS (upgrade-env COMPOSE_FILE includes compose.lfs.yml) → tests/gitea/custom/test_lfs_roundtrip.py::test_lfs_roundtrip PASSED (18.66s); all other gitea tiers PASSED.
    • HOW (Adversary re-run): ssh cc-ci 'journalctl -u nightly-sweep.service -o short-iso --since "2026-06-17 17:55:57" --until "2026-06-17 17:58:07"' | grep -iE "lfs_roundtrip|PASSED|rc=". EXPECTED: test_lfs_roundtrip PASSED then sweep: gitea rc=0.
    • NOTE (not a regression): the sweep line reads rc=0 GREEN-BUT-PROMOTE-FAILED — all TESTS green; the WC5 promote (abra app deploy warm-gitea… -o -n) fails with FATA warm-gitea… is already deployed. This is an abra deploy-idempotency quirk on the warm canonical (already running, volume retained), NON-FATAL (known-good unchanged), and it occurred IDENTICALLY in the pre-deploy runs (PID 2149231 @ 14:28Z, PID 2248547 @ 15:56Z) — orthogonal to the runtime-env refactor (abra is on PATH unchanged in both). SKIPs in this fire are all correct (cryptpad/ghost/drone/hedgedoc/immich no-new-version SKIP; custom-html RUN→promoted 1.13.0+1.31.1).
  • Drone-path gitea witness: DONE — build #871 PASS (see "(2a)" above).
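
The /proc/<pid>/environ check in the timer-fire witness can be reproduced with standard tools, since that file holds NUL-separated KEY=VALUE records. The following is a sketch only: environ_path_has is a hypothetical helper, and the pattern in the usage comment is an illustrative assumption rather than the exact check that was run.

```shell
# environ_path_has FILE REGEX
# FILE  : an environ file in /proc/<pid>/environ format (NUL-separated records)
# REGEX : extended regex that must match at least one PATH entry
# Hypothetical helper for illustration.
environ_path_has() {
    tr '\0' '\n' < "$1" |      # one KEY=VALUE record per line
        sed -n 's/^PATH=//p' | # keep only the value of PATH
        tr ':' '\n' |          # one PATH entry per line
        grep -Eq -- "$2"       # exit 0 if any entry matches
}

# Usage on the host, with <pid> being the running run_recipe_ci process:
#   environ_path_has /proc/<pid>/environ 'git-lfs-[0-9.]+/bin$'
```

Running the same helper with '^/run/current-system/sw/bin$' against the systemd unit's environment would be expected to fail, which is the other half of the DEFECT-3 condition.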

(prior M1 claim block retained below for the record)

M1 details — PASS

WHAT (M1 DoD). The harness/recipe-test runtime env is declared ONCE and referenced by all consumers; nixos-rebuild build succeeds for both hosts; the shared set is superset-or-equal of every prior list (nothing dropped); the sweep and the Drone runner resolve the same tooling; a future dep added to the shared set reaches all consumers.

WHERE (inputs). All changes at the tip of main (commit pushed with this claim).

  • Single source: nix/modules/packages.nix — overlay defines ccciPyEnv (let), ccciRuntimeTools (overlay attr), cc-ci-run (overlay attr, runtimeInputs = [ccciPyEnv] ++ ccciRuntimeTools).
  • Consumers: nix/modules/harness.nix (systemPackages = [ pkgs.cc-ci-run ]), nix/modules/nightly-sweep.nix (wrapper execs cc-ci-run), nix/hosts/cc-ci/configuration.nix + nix/hosts/cc-ci-hetzner/configuration.nix (systemPackages = pkgs.ccciRuntimeTools ++ [ pkgs.openssh ]).
  • nix/modules/drone-runner.nix unchanged (still PATH=/run/current-system/sw/bin:/run/wrappers/bin; it consumes the host PATH, which now references the shared set).

HOW + EXPECTED (cold-verifiable; secrets/ is a git submodule → use ?submodules=1 for a dirty tree, or build from a git clone --recursive).

  1. Builds succeed (both hosts):

    • nixos-rebuild build --flake '.?submodules=1#cc-ci-hetzner' → builds nixos-system-nixos-24.11.… (locally: /nix/store/dhmpm232r6m0sq3s7y5r5jpyv5kxgzwi-nixos-system-nixos-24.11.20250630.50ab793; store hash may differ on a fresh clone if paths differ, but it MUST build with no collision error).
    • nixos-rebuild build --flake '.?submodules=1#cc-ci' → builds OK (no collision error).
  2. Single source (grep proofs):

    • grep -rn withPackages nix/ → EXACTLY 1 hit: nix/modules/packages.nix (ccciPyEnv).
    • grep -rn "pytest playwright" nix/ → EXACTLY 1 hit: same line. (No duplicate pyEnv.)
    • grep -rn ccciRuntimeTools nix/ → defined once (packages.nix), referenced by both host configs.
    • nightly-sweep.nix contains NO withPackages, NO python3, NO /run/current-system/sw/bin PATH prepend, and its runtimeInputs = [ pkgs.cc-ci-run ] only; the wrapper simply runs exec cc-ci-run ….
  3. Superset-or-equal — cc-ci-run carries every tool (inspect the built wrapper's PATH):

    • CCRUN=$(nix eval --raw '.?submodules=1#nixosConfigurations.cc-ci-hetzner.pkgs.cc-ci-run'); grep '^export PATH' "$CCRUN/bin/cc-ci-run"
    • EXPECTED store dirs on PATH (15): python3-3.12.8-env, abra-0.13.0-beta, docker-27.5.1, git-2.47.2, git-lfs-3.6.1, bash-5.2p37, coreutils-9.5, util-linux-2.39.4, curl-8.12.1, jq-1.7.1, gnused-4.9, gnugrep-3.11, gnutar-1.35, openssl-3.3.3, procps-4.0.4.
    • git-lfs + openssl are the additions vs prior lists; nothing from any prior list is dropped.
  4. Sweep ≡ Drone entrypoint (parity by construction):

    • In the built cc-ci-nightly-sweep wrapper, exec cc-ci-run … resolves to the BYTE-IDENTICAL cc-ci-run store path that the .drone.yml step (cc-ci-run runner/run_recipe_ci.py) runs (locally /nix/store/zxlx9jnylh7la5m48bsqb1wfm5l9r0bd-cc-ci-run). Same store path ⇒ same pyEnv, same tooling, same PLAYWRIGHT_BROWSERS_PATH.
  5. Host divergence removed:

    • Both host configuration.nix systemPackages lines are textually identical (pkgs.ccciRuntimeTools ++ [ pkgs.openssh ]). The cc-ci host now GAINS git-lfs+openssl on its system PATH (ls $(nix eval --raw '.?submodules=1#nixosConfigurations.cc-ci.config.system.build.toplevel')/sw/bin/ | grep -E '^(git-lfs|openssl)$' → both present; pre-refactor cc-ci lacked git-lfs).
  6. Future-dep propagation: adding a pkg to ccciRuntimeTools in packages.nix lands in cc-ci-run's runtimeInputs (Drone + sweep) AND both hosts' systemPackages from the single edit.
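
Step 3 asks the reviewer to compare the wrapper's export PATH line against 15 expected store dirs by eye. A small filter can list whatever is missing instead. This is a sketch under stated assumptions: missing_tools is a hypothetical helper, the tool names are copied from the EXPECTED list in step 3, and store dirs are assumed to look like /nix/store/<hash>-<name>-<version>/bin.

```shell
# missing_tools: read a wrapper script on stdin and print each expected tool
# name that has no matching store dir on its `export PATH` line.
# Prints nothing when all 15 are present. Hypothetical helper.
missing_tools() (
    path_line=$(grep '^export PATH' || true)
    for tool in python3 abra docker git git-lfs bash coreutils util-linux \
                curl jq gnused gnugrep gnutar openssl procps; do
        # Requiring a digit right after "<name>-" keeps "git" from being
        # satisfied by a "git-lfs-3.6.1" entry.
        case "$path_line" in
            *-"$tool"-[0-9]*) ;;
            *) printf '%s\n' "$tool" ;;
        esac
    done
)

# Usage, with CCRUN set as in step 3:
#   missing_tools < "$CCRUN/bin/cc-ci-run"
```

The check is by name only; it does not compare versions, so a version bump in nixpkgs would not be reported as a missing tool.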

Build backlog

See BACKLOG-nixenv.md. M2 (deploy + live parity witness) is gated behind the M1 PASS.