Commit Graph

16 Commits

Author SHA1 Message Date
fa56f6bcaa feat(3 U2.3): serve per-run artifacts at /runs/<id>/<file> (whitelisted, traversal-guarded) + bind-mount runs dir RO into dashboard 2026-05-31 07:12:32 +00:00
a2163951e9 fix(cc-ci-hetzner): drop empty IPv6 gateway/route (network-addresses-eth0 failure)
nixos-infect emitted defaultGateway6.address="" and ipv6.routes=[{address="";
prefixLength=128}] for this v4-only Hetzner instance, so network-addresses-eth0.service
failed at boot ("ip route add  /128 ... any valid prefix is expected rather than /128").
The box has no real IPv6 (link-local only, kernel-managed), so remove the empty IPv6
gateway, address, and route. IPv4 unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 03:58:08 +00:00
4237cc03f5 nix: add cc-ci-hetzner host (cpx32, nixos-infect hardware, all root SSH keys)
Port from terraform-hetzner branch. Adds the Hetzner cc-ci flake host with
all 3 root authorized keys so nixos-rebuild doesn't lock out SSH access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 03:00:36 +00:00
3bde76f239 fix(2): cc-ci host — declare /etc/timezone (gitea + Debian-image recipes bind it)
gitea (drone's SCM dep) binds /etc/timezone:ro; NixOS time.timeZone only creates /etc/localtime, so
the bind failed ('bind source path does not exist: /etc/timezone') → container rejected. Declare
environment.etc.timezone=UTC. Enables drone Q4.10's gitea dep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 22:16:24 +01:00
d4eae4ee49 fix(2): set time.timeZone=UTC on cc-ci → create /etc/localtime (immich bind-mount)
immich's compose bind-mounts the host /etc/localtime into the app container; NixOS without a set
timezone leaves /etc/localtime absent → 'bind source path does not exist: /etc/localtime' → app
service rejected (never converges). time.timeZone=UTC creates /etc/localtime (UTC = deterministic CI
timestamps). Nix-declared, reversible; helps any recipe binding /etc/localtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 14:51:33 +01:00
b9bbd253eb fix(2pc): rename unit docker-prune -> ci-docker-prune (NixOS docker module reserves docker-prune)
The committed module used systemd.services.docker-prune, which conflicts with the NixOS docker
module's own docker-prune unit (`nixos-rebuild build` error: conflicting definition values). The
deployed+verified host already runs ci-docker-prune; this syncs the repo so a cold build matches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:43:09 +01:00
16d177e73a feat(2pc): PC1 conservative prune — drop autoPrune --all, add gated surgical docker-prune
Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base
images → cold re-pull → Hub rate-limit churn, JOURNAL-2). Adds modules/docker-prune.nix: daily
timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app
live AND no swarm service converging; never --all, never --volumes. Teardown unchanged (never
removes images). Registry pull-through cache dropped per operator scope correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:30:07 +01:00
465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
e678d2e006 claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof
warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik]
(stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/
file-provider config, health on routed dashboard host). keycloak path unchanged.
proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption
migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good →
clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed).
65 unit pass. Per operator out: code+converge delivered; destructive rollback
(brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:50:32 +01:00
e73e4393ed fix(2w): docker autoPrune drop --volumes (was failing daily + would wipe warm vols) [WC8]
The autoPrune flags passed '--volumes' WITH '--filter until=24h', which docker
rejects ('until filter not supported with --volumes') — so docker-prune.service
FAILED every day (system 'degraded') and never reclaimed anything (a cause of the
disk creeping to 96%). Worse, '--volumes' prunes volumes with no running
container — which would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by
design). Removed '--volumes': now prunes images/containers/networks/build-cache
older than 24h only; warm volumes survive and are pruned deliberately by the warm
reconcilers (WC8).

Verified: nixos-rebuild switch -> docker-prune.service runs clean, system
'running' (0 failed units), warm keycloak still 200.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:26:24 +01:00
a044abb298 feat(2w): W0.6 unpinned warm reconciler + WC1.2 safety gate + WC1.1 scaffold
runner/warm_reconcile.py (python, packaged into nix store, replaces bash
reconcile): UNPIN keycloak (deploy latest published version TAG; recipe fetched
at runtime -> D8 closure byte-identical). WC1.2 pre-deploy safety gate (runs
FIRST): major recipe/app-version bump OR releaseNotes manual-migration marker
-> hold-on-current + alert sentinel (no deploy churn). WC1.1 health-gated
upgrade-with-rollback: record last-good -> [keycloak: undeploy->warmsnap.snapshot
->deploy latest] -> health-gate -> commit-or-(restore+redeploy-prior+alert).
Alerts = /var/lib/ci-warm/alerts/*.json (Builder loop relays). current version
read from abra TYPE=<recipe>:<version>. CCCI_SKIP_FETCH test hook.
+8 unit tests for the version gate (56 unit pass).

Proven on cc-ci: nixos-rebuild switch -> warm-keycloak.service runs the python
reconciler -> noop-healthy (system 0-failed, /realms/master=200). WC1.2 holds
proven live: MAJOR bump -> held-major (keycloak untouched); minor+manual-
migration notes -> held-manual-migration (alert carries notes); no deploy churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:42:02 +01:00
88c11142de fix(2w): W0.3 warm-keycloak reconciler — newline bite + skip-if-healthy
- set_env: ensure trailing newline before append (keycloak .env.sample ends
  with a newline-less #COMPOSE_FILE comment, so a bare append glued DOMAIN onto
  it -> DOMAIN unset -> KC_HOSTNAME=https:// -> crash-loop). Same bite fixed in
  backupbot.nix.
- converge skips the (forced) redeploy when keycloak already serves 200, so an
  activation/boot is a true no-op (no JVM-restart blip) and only redeploys when
  down/crash-looping. Health-wait extended to 15min.

Verified on cc-ci: nixos-rebuild switch -> warm-keycloak.service active,
'no-op converge', system running (0 failed), /realms/master=200.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:52:01 +01:00
c8e9ddb681 feat(2w): W0.3 declarative warm-keycloak reconciler (WC1)
nix/modules/warm-keycloak.nix: idempotent systemd oneshot (like deploy-proxy)
that converges a live-warm shared keycloak at warm-keycloak.ci.commoninternet.net
pinned to  10.7.1+26.6.2, secrets generated only-if-missing (never
rotate a live provider), waits /realms/master=200. Re-warmable from scratch
(D8/WC8). Wired into hosts/cc-ci/configuration.nix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:28:44 +01:00
5e14963d51 feat(2): declarative Docker Hub auth — sops dockerhub_auth + config.json template (rate-limit fix)
- secrets submodule -> cdd5e0a (adds sops dockerhub_auth = base64 nptest2:PAT).
- nix/modules/secrets.nix: sops.secrets.dockerhub_auth + sops.templates."docker-config.json"
  renders /root/.docker/config.json (0600 root) so abra/docker pulls authenticate (200/6h
  per-account) instead of the exhausted 100/6h shared-IP anon limit. Survives 1c rebuild.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 22:05:10 +01:00
8262912015 feat(1d): enroll hedgedoc in bridge POLL_REPOS (DG6 unconfigured-recipe target)
All checks were successful
continuous-integration/drone Build is passing
hedgedoc mirrored to recipe-maintainers/hedgedoc with probe PR #1; add it to the bridge poll list so
!testme triggers the full generic suite (no cc-ci/repo-local overlay -> pure generic). Rebuild pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:47:29 +01:00
433ec9de30 refactor(1b): RL5 — consolidate Nix code under nix/ (modules->nix/modules, hosts->nix/hosts)
flake.nix/flake.lock STAY at root so the build ref #cc-ci is unchanged; only flake's internal
configuration.nix path updated. Root-relative refs inside moved modules re-based ../X -> ../../X
(secrets/bridge/dashboard); configuration.nix's ../../modules imports unchanged (both dirs under nix/).
Living docs (README, architecture/install/secrets/enroll) + .drone.yml comment updated to nix/...;
append-only history logs left as-is. DECISIONS.md records RL5 + the deferred-coordinated RL6.

Verified on cc-ci: nixos-rebuild build 'path:#cc-ci' -> toplevel 8i3jcad9 (BYTE-IDENTICAL to the
pre-move build — store derivations are content-addressed on file contents, module .nix not in the
runtime closure); scripts/lint.sh -> lint: PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 21:19:09 +01:00