cc-ci/nix/modules/swarm.nix
autonomic-bot e6349a9dfe
Some checks failed
continuous-integration/drone/push Build is failing
claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing
Docker networks (ingress 10.0.0.0/24, per-stack overlays
10.0.1-4.0/24) and of host routes. The exact maintenance procedure is
in STATUS-pvfix.md: pre-checks, stack teardown order, drain wait,
remove/recreate proxy, nixos-rebuild, deploy-* restart chain, and
health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:31:21 +00:00
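
The headroom figure and the no-overlap claim in the commit message can be sanity-checked offline. What follows is a minimal sketch in plain shell arithmetic, not part of the patch: the helper names `usable_hosts`, `ip_to_int`, and `cidrs_overlap` are hypothetical, the per-stack list assumes "10.0.1-4.0/24" means four consecutive /24s, and the usable-address count ignores whatever addresses Docker reserves for itself (for example a gateway).

```shell
#!/bin/sh
# Hypothetical helpers for checking the commit's subnet arithmetic.

# Usable host addresses in an IPv4 prefix of length $1
# (network and broadcast addresses excluded).
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Exit 0 if the two CIDR blocks share any address, 1 otherwise.
# Two blocks overlap exactly when they agree under the shorter prefix.
cidrs_overlap() (
  a_ip=${1%/*}; a_len=${1#*/}
  b_ip=${2%/*}; b_len=${2#*/}
  len=$a_len
  [ "$b_len" -lt "$len" ] && len=$b_len
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$a_ip") & mask )) -eq $(( $(ip_to_int "$b_ip") & mask )) ]
)

# Headroom of a /16 over a /24, by usable addresses.
echo "headroom: $(( $(usable_hosts 16) / $(usable_hosts 24) ))x"   # prints: headroom: 258x

# Networks named in the commit message (assumed expansion of 10.0.1-4.0/24).
for net in 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24; do
  if cidrs_overlap 10.10.0.0/16 "$net"; then
    echo "OVERLAP: $net"
  fi
done
```

On a live host the list of subnets would come from the survey itself rather than being hard-coded; the loop prints nothing when every listed network is clear of 10.10.0.0/16.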

52 lines
2.5 KiB
Nix

# Docker + single-node Swarm — the deploy target for recipes under test (M1).
# Traefik (modules/traefik.nix) and abra layer on top; recipes attach to the `proxy`
# overlay network, exactly as a real Co-op Cloud host expects.
{ pkgs, ... }:
{
  virtualisation.docker = {
    enable = true;
    # Image pruning is handled by modules/docker-prune.nix (Phase 2pc / PC1), NOT by
    # `virtualisation.docker.autoPrune`. The old autoPrune ran `docker system prune --all`
    # daily; `--all` evicts every image not used by a *running* container, so between runs
    # it wiped the cached recipe base images and forced a cold re-pull → the
    # Docker-Hub-rate-limit churn in JOURNAL-2. The replacement keeps Docker's local store
    # warm (it IS our cache on this single host) and prunes only dangling+old layers, gated
    # on genuine disk pressure and nothing in flight. NEVER --volumes either: Phase-2w keeps
    # DATA-WARM undeployed canonical volumes, reaped only by the warm reconcilers.
    # autoPrune is left OFF (the default) on purpose.
  };

  environment.systemPackages = [ pkgs.docker ];

  # Gateway forwards 80/443 to cc-ci over the public interface (enp5s0); the coop-cloud
  # traefik stack (deployed via abra, see docs/install.md) publishes these ports.
  networking.firewall.allowedTCPPorts = [ 80 443 ];

  # Bring up a single-node swarm + the shared `proxy` overlay network. Idempotent:
  # safe to re-run every boot/rebuild. advertise-addr 127.0.0.1 is fine for a lone node.
  systemd.services.swarm-init = {
    description = "Initialise single-node Docker Swarm + proxy overlay network";
    after = [ "docker.service" ];
    requires = [ "docker.service" ];
    wantedBy = [ "multi-user.target" ];
    path = [ pkgs.docker ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };
    script = ''
      set -eu
      state="$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null || echo error)"
      if [ "$state" != "active" ]; then
        docker swarm init --advertise-addr 127.0.0.1
      fi
      if ! docker network inspect proxy >/dev/null 2>&1; then
        # Explicit /16 (~65 534 VIPs) prevents the /24-exhaustion class seen 2026-06-12:
        # endpoints leaked by a concurrent stack-GC race exhausted the default 254-VIP pool.
        # 10.10.0.0/16 is clear of ingress (10.0.0.0/24) and existing per-stack overlays
        # (10.0.1-4.0/24). Runbook: cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
        docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
      fi
    '';
  };
}
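
Because the script only creates `proxy` when the network is absent, a host that already carries the old /24 network keeps it across rebuilds until the network is removed and recreated. A minimal sketch of a post-maintenance check follows; `proxy_subnet_ok` is a hypothetical helper, not part of swarm.nix or the runbook, and it assumes the network has a single IPAM config entry.

```shell
#!/bin/sh
# Hypothetical check: exit 0 only if Docker network $1 exists and its
# (single) IPAM subnet equals $2.
proxy_subnet_ok() (
  actual="$(docker network inspect \
    --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}' "$1" 2>/dev/null)" || exit 1
  [ "$actual" = "$2" ]
)

# Usage (run on the host after the recreate step):
#   if proxy_subnet_ok proxy 10.10.0.0/16; then
#     echo "proxy is on the /16"
#   else
#     echo "proxy missing or still on the old subnet" >&2
#   fi
```

Keeping the check as a function that only reads state means it can run before and after the maintenance window without side effects.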