claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof

warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik]
(stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/
file-provider config, health on routed dashboard host). keycloak path unchanged.
proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption
migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good →
clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed).
65 unit pass. Per operator out: code+converge delivered; destructive rollback
(brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:50:32 +01:00
parent aec6911c68
commit e678d2e006
5 changed files with 145 additions and 45 deletions

@@ -308,3 +308,24 @@ Plan for next: (a) W0.10a traefik health-gated reconciler migration (quiet window
serves all TLS); (b) W3 WC5 promote-on-green-cold (extend cold-run teardown to re-seed the canonical
on green-latest, reusing seed_canonical); (c) W3 WC6 nightly sweep (systemd timer: rebuild-then-cold-
sweep). traefik first (use the window) or interleave; W0.10b alert-relay is a small loop step.
## 2026-05-29 — W0.10a traefik WC1.1 migrated (quiet window) — code + no-op converge; rollback = Adversary proof
Used the post-W2 quiet window (Adversary idle) for the tracked traefik WC1.1 migration. Generalized
warm_reconcile.py: per-spec `setup` hook + `health_domain`; added SPECS["traefik"] (stateful=False →
stateless version-rollback-only, NO snapshot; setup=_traefik_setup preserving the wildcard-cert/
file-provider config EXACTLY via the proven newline-safe abra.env_set; health on the routed dashboard
host). keycloak's path is unchanged (no `setup` key → default). proxy.nix migrated:
deploy-proxy.service now execs `warm_reconcile.py traefik` (runner/ packaged in the store, D8-clean).
ZERO-DISRUPTION migration: traefik was already at the latest tag (5.1.1+v3.6.15, image v3.6.15, chaos
commit 005f023 = the tag commit). I pre-seeded the .env TYPE + last_good to 5.1.1+v3.6.15 (accurate —
traefik IS at that version), so the health-gated reconcile is a clean no-op (current==latest==healthy)
→ NO redeploy, NO TLS blip. Verified via nixos-rebuild switch: deploy-proxy.service → "no-op",
traefik 200 + keycloak-through-traefik 200 + 0 failed units. 65 unit pass.
Per the operator's explicit out (a destructive traefik test risks ALL TLS), I delivered the code +
safe no-op converge and left the DESTRUCTIVE rollback as the Adversary's required cold proof (staged
broken traefik tag → reconcile → rollback to last-good, brief TLS blip + manual recovery ready). The
rollback logic is the proven keycloak pattern, stateless variant. Claiming W0.10a so the Adversary
runs that cold proof. After this clears, WC1.1 is fully closed (keycloak + traefik).
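
The no-op condition described above (current == latest == healthy) can be sketched as a small decision function. This is an illustrative sketch only: `AppState` and `plan` are hypothetical names, not functions from `warm_reconcile.py`, and the real reconciler may distinguish more cases than these two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppState:
    """Hypothetical view of what the reconciler knows before acting."""
    current: Optional[str]  # version currently recorded for the app, if any
    healthy: bool           # result of the health probe on the routed host


def plan(state: AppState, latest: str) -> str:
    """Return "no-op" only when the app is already on `latest` and healthy."""
    if state.current == latest and state.healthy:
        return "no-op"   # nothing to redeploy, so no TLS blip
    return "deploy"      # anything else goes through the health-gated deploy


# Pre-seeding the recorded version to the tag that is already running
# is what makes the first reconcile a no-op.
print(plan(AppState(current="5.1.1+v3.6.15", healthy=True), "5.1.1+v3.6.15"))
```

Under these assumptions, a missing or stale recorded version would have produced "deploy" and therefore a redeploy, which is why the pre-seed had to be accurate.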

@@ -15,9 +15,10 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
- [x] **WC1** — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent
distinct realms; orphan realms reaped. **Adversary PASS @2026-05-29** (REVIEW-2w, gate 985686f).
- [~] **WC1.1** — Health-gated deploy-with-rollback. **keycloak (stateful) — Adversary PASS
@2026-05-29** (marquee: broken latest → snapshot→restore→prior, data intact, last_good held,
alert). **traefik (stateless, version-rollback-only) — NOT yet migrated = W0.10**, MUST close
before Phase-2w DONE (Adversary will require a cold proof).
@2026-05-29** (marquee). **traefik (stateless, version-rollback-only) — reconciler MIGRATED
(W0.10a): proxy.nix now drives `warm_reconcile.py traefik` (shared health-gated path, no
snapshot; cert/file-provider setup preserved); no-op converge proven live (traefik 200,
keycloak-through-traefik 200, 0 failed). CLAIMED — destructive rollback = Adversary cold proof.**
- [x] **WC1.2** — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no
churn, short-circuits before WC1.1). **Adversary PASS @2026-05-29**.
- [x] **WC2** — Data-warm canonical model: per-recipe canonical at stable domain `warm-<recipe>`,
@@ -125,6 +126,38 @@ headline e2e is green (below). No recipe/harness change needed.
## Gate
### Gate: W0.10a traefik WC1.1 — CLAIMED, awaiting Adversary (@2026-05-29)
**WHAT.** traefik migrated onto the shared health-gated reconciler (WC1.1, stateless =
version-rollback-only, NO snapshot): record last-good → deploy latest tag → health-gate (routed host
ci.commoninternet.net = 200) → healthy commit / unhealthy roll back to last-good + alert. Closes the
W0.10a tracked-open item from the W0 gate. traefik's wildcard-cert/file-provider config preserved.
**WHERE.** `runner/warm_reconcile.py` (SPECS["traefik"] stateful=False + `_traefik_setup` + health_domain;
reconcile() per-app setup hook; the stateless path skips snapshot/restore — version rollback only),
`nix/modules/proxy.nix` (deploy-proxy.service now execs `python3 …/warm_reconcile.py traefik`).
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **65 passed** (incl. test_warm_reconcile traefik
spec: stateful=False, callable setup, health_domain=ci.commoninternet.net; keycloak unchanged).
2. **No-op converge (delivered, proven live):** `systemctl is-active deploy-proxy.service` → active;
`journalctl -u deploy-proxy.service` → `[traefik] already on latest 5.1.1+v3.6.15 and healthy —
no-op`; traefik serving (ci.commoninternet.net=200) + keycloak-through-traefik=200 + system
`running` (0 failed). The migration was zero-disruption (traefik was already at the latest tag; I
pre-seeded TYPE+last_good to 5.1.1+v3.6.15 so the reconcile is a clean no-op).
3. **Destructive rollback (the Adversary's required cold proof):** stage a fake newer traefik tag with
a broken config → `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py traefik` → broken deploy
fails health → reconciler rolls back to last-good 5.1.1+v3.6.15 (version-only, no snapshot — traefik
is stateless) → traefik healthy again + a `*-rollback.json` alert. NOTE: a destructive traefik test
briefly drops TLS for ALL routes during the broken-deploy window until rollback — run it knowing
that + with manual recovery ready (`abra app deploy traefik.ci.commoninternet.net 5.1.1+v3.6.15
-o -n -f`). The rollback logic is the SAME proven keycloak pattern, stateless variant (no snapshot).
Per operator guidance, I delivered the code + the safe no-op converge this iteration and left the
destructive rollback as the Adversary's cold proof (a live destructive traefik test risks all TLS).
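
The stateless flow in WHAT and step 3 (deploy latest, health-gate, roll back to last-good on failure, alert) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in `warm_reconcile.py`: `reconcile_stateless` and its injected `deploy`, `healthy` and `alert` callables are hypothetical stand-ins for the abra deploy, the routed-host probe and the alert writer.

```python
from typing import Callable


def reconcile_stateless(
    latest: str,
    last_good: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
) -> str:
    """Deploy `latest`; keep it if healthy, else redeploy `last_good` and alert.

    Returns the version left running. There is no snapshot/restore step
    because the app is assumed to be stateless.
    """
    deploy(latest)
    if healthy():
        return latest  # the caller would record this as the new last-good
    deploy(last_good)  # version-only rollback
    alert(f"rollback {latest} -> {last_good}")
    if not healthy():
        # Rollback did not help: surface it so manual recovery can start.
        raise RuntimeError(f"rollback to {last_good} did not restore health")
    return last_good
```

In a dry run with fakes, passing a deliberately broken tag as `latest` should leave `last_good` running and produce exactly one alert, which is the shape of evidence the cold proof is expected to show.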
---
### Gate: WC4 + WC7 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6)
Cold-verified from the Adversary's own clone: 64 units; WC7 adversarial trigger battery (all negatives
rejected, live bridge); WC4 never-promote (snapshot byte-identical, registry unchanged); WC4

@@ -4,55 +4,31 @@
# Phase-1c: the cert at CERT_DIR is sops-decrypted from git (cc-ci-secrets) at activation
# (modules/secrets.nix wildcard_cert/wildcard_key), NOT an out-of-band operator file drop.
#
# Declared as an idempotent-RECONCILE systemd oneshot (like swarm-init): it inspects current
# state and converges every activation/boot, self-healing drift (redeploys if the stack is gone,
# re-inserts secrets if missing). No run-once sentinel. So a from-scratch install is just
# `nixos-rebuild switch` + operator preconditions (D8) — no manual post-steps.
# Phase-2w / WC1.1: traefik is now UNPINNED + health-gated like keycloak — the deploy is driven by
# the shared `runner/warm_reconcile.py traefik` (STATELESS = version-rollback-only, NO snapshot):
# record last-good version → deploy latest tag → health-gate (a ROUTED host, the dashboard
# ci.commoninternet.net, returns 200) → healthy commits last-good / unhealthy rolls back to last-good
# + alert. traefik's wildcard-cert/file-provider config (ssl_cert/ssl_key secrets, WILDCARDS_ENABLED,
# COMPOSE_FILE) is preserved EXACTLY by the spec's `setup` (warm_reconcile._traefik_setup). The
# runner/ tree is copied into the nix store → D8-clean; recipe fetched at runtime → closure stable.
#
# Idempotent-RECONCILE systemd oneshot (unchanged unit name `deploy-proxy` — other modules order
# after it): converges every activation/boot, self-healing drift. No run-once sentinel.
{ pkgs, ... }:
let
runnerSrc = ../../runner;
reconcile = pkgs.writeShellApplication {
name = "cc-ci-reconcile-proxy";
runtimeInputs = with pkgs; [ abra docker jq gnused gnugrep coreutils git ];
runtimeInputs = with pkgs; [ abra docker git curl jq gnused gnugrep gnutar coreutils ];
text = ''
PROXY_DOMAIN="traefik.ci.commoninternet.net"
CERT_DIR="/var/lib/ci-certs/live"
ENV_FILE="$HOME/.abra/servers/default/$PROXY_DOMAIN.env"
# Fail visibly (failed unit) if the cert is missing; do NOT silently skip. It is
# sops-decrypted from git (cc-ci-secrets) at activation; a miss here means the sops decrypt
# path is broken (e.g. age identity not present), which must surface, not be papered over.
if [ ! -r "$CERT_DIR/fullchain.pem" ] || [ ! -r "$CERT_DIR/privkey.pem" ]; then
echo "FATAL: wildcard cert missing at $CERT_DIR (sops decrypt from cc-ci-secrets failed?)" >&2
exit 1
fi
abra server ls -m -n >/dev/null 2>&1 || abra server add --local -n || true
abra recipe fetch traefik -n >/dev/null
[ -f "$ENV_FILE" ] || abra app new traefik -s default -D "$PROXY_DOMAIN" -n
set_env() {
sed -i -E "/^[[:space:]]*#?[[:space:]]*$1=/d" "$ENV_FILE"
printf '%s=%s\n' "$1" "$2" >> "$ENV_FILE"
}
set_env LETS_ENCRYPT_ENV ""
set_env WILDCARDS_ENABLED "1"
set_env SECRET_WILDCARD_CERT_VERSION "v1"
set_env SECRET_WILDCARD_KEY_VERSION "v1"
set_env COMPOSE_FILE '"compose.yml:compose.wildcard.yml"'
have_secret() { docker secret ls --format '{{.Name}}' | grep -q "_$1_v1$"; }
have_secret ssl_cert || abra app secret insert "$PROXY_DOMAIN" ssl_cert v1 "$CERT_DIR/fullchain.pem" -f -n
have_secret ssl_key || abra app secret insert "$PROXY_DOMAIN" ssl_key v1 "$CERT_DIR/privkey.pem" -f -n
# Converge the stack (idempotent: no-op if already at desired state).
abra app deploy "$PROXY_DOMAIN" -n -C
export HOME=/root
exec ${pkgs.python3}/bin/python3 ${runnerSrc}/warm_reconcile.py traefik
'';
};
in
{
systemd.services.deploy-proxy = {
description = "Reconcile the Co-op Cloud traefik proxy (wildcard/no-ACME) via abra";
description = "Reconcile the Co-op Cloud traefik proxy (wildcard/no-ACME, health-gated) via abra";
after = [ "swarm-init.service" "docker.service" "network-online.target" ];
requires = [ "swarm-init.service" "docker.service" ];
wants = [ "network-online.target" ];
@@ -61,6 +37,8 @@ in
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
# Generous: a traefik (re)deploy + health-gate; rollback on an unhealthy upgrade.
TimeoutStartSec = "900";
ExecStart = "${reconcile}/bin/cc-ci-reconcile-proxy";
};
};

@@ -36,6 +36,38 @@ from harness import abra, lifecycle, warmsnap # noqa: E402
# --------------------------------------------------------------------------- specs
def _traefik_setup(recipe: str, domain: str, version: str) -> None:
"""Per-app config for the traefik reverse-proxy reconcile — preserves EXACTLY what the prior
proxy.nix bash reconcile did (wildcard/file-provider mode serving the pre-issued cert as
ssl_cert/ssl_key swarm secrets; NO ACME). Uses the proven abra.env_set (newline-safe, unlike the
bash set_env that bit keycloak)."""
cert_dir = "/var/lib/ci-certs/live"
if not (os.path.isfile(f"{cert_dir}/fullchain.pem") and os.path.isfile(f"{cert_dir}/privkey.pem")):
raise RuntimeError(f"FATAL: wildcard cert missing at {cert_dir} (sops decrypt broken?)")
if not os.path.isfile(env_file(domain)):
_run(["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
timeout=120, check=True)
abra.env_set(domain, "DOMAIN", domain)
abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
abra.env_set(domain, "WILDCARDS_ENABLED", "1")
abra.env_set(domain, "SECRET_WILDCARD_CERT_VERSION", "v1")
abra.env_set(domain, "SECRET_WILDCARD_KEY_VERSION", "v1")
abra.env_set(domain, "COMPOSE_FILE", '"compose.yml:compose.wildcard.yml"')
stack = lifecycle._stack_name(domain) # noqa: SLF001
have = set(lifecycle._docker_names("secret", stack)) # noqa: SLF001
def _has(name):
return any(s.endswith(f"_{name}_v1") for s in have)
if not _has("ssl_cert"):
_run(["abra", "app", "secret", "insert", domain, "ssl_cert", "v1",
f"{cert_dir}/fullchain.pem", "-f", "-n"], timeout=120, check=True)
if not _has("ssl_key"):
_run(["abra", "app", "secret", "insert", domain, "ssl_key", "v1",
f"{cert_dir}/privkey.pem", "-f", "-n"], timeout=120, check=True)
SPECS: dict[str, dict] = {
"keycloak": {
"recipe": "keycloak",
@@ -46,6 +78,20 @@ SPECS: dict[str, dict] = {
"deploy_timeout": 900,
"health_timeout": 900,
},
# traefik = the reverse proxy: STATELESS (version-rollback-only, NO snapshot). Health is probed
# on a ROUTED host (the dashboard) since traefik's own domain has no route. `setup` preserves the
# wildcard cert / file-provider config.
"traefik": {
"recipe": "traefik",
"domain": "traefik.ci.commoninternet.net",
"health_domain": "ci.commoninternet.net",
"health_path": "/",
"health_ok": (200,),
"stateful": False,
"deploy_timeout": 600,
"health_timeout": 300,
"setup": _traefik_setup,
},
}
ALERTS_DIR = os.path.join(warmsnap.DEFAULT_WARM_ROOT, "alerts")
@@ -166,7 +212,10 @@ def is_deployed(domain: str) -> bool:
def health_code(spec: dict) -> int:
domain = spec["domain"]
# health is probed on `health_domain` (defaults to the app domain). For traefik the app domain
# (traefik.ci…) has no route of its own — health is a ROUTED host (e.g. the dashboard
# ci.commoninternet.net), so a 200 proves traefik is up + routing + TLS-terminating.
domain = spec.get("health_domain", spec["domain"])
r = _run(
[
"curl", "-sk", "-o", "/dev/null", "-w", "%{http_code}", "--max-time", "10",
@@ -300,8 +349,14 @@ def reconcile(app: str) -> str:
latest = latest_version(tags)
if not latest:
raise RuntimeError(f"no version tags for {recipe}")
ensure_app_config(recipe, domain, latest)
ensure_secrets(domain)
# Per-app config/secrets: a spec may provide its own `setup` (traefik's cert/file-provider wiring);
# otherwise the default keycloak-shaped path (app new + DOMAIN/LETS_ENCRYPT + generate secrets).
setup = spec.get("setup")
if setup:
setup(recipe, domain, latest)
else:
ensure_app_config(recipe, domain, latest)
ensure_secrets(domain)
current = current_version(domain)
deployed = is_deployed(domain)

@@ -54,6 +54,19 @@ def test_app_major_bump_held_even_if_no_plus_on_current():
assert wr.is_major_bump("0", "11.0.0+1.0.0") is True
def test_traefik_spec_is_stateless_with_setup():
# WC1.1 traefik = stateless (version-rollback-only, NO snapshot) + its own cert/file-provider
# setup + health probed on a ROUTED host (the dashboard), not traefik's own domain.
t = wr.SPECS["traefik"]
assert t["stateful"] is False
assert callable(t.get("setup"))
assert t["health_domain"] == "ci.commoninternet.net"
assert t["domain"] == "traefik.ci.commoninternet.net"
# keycloak stays stateful with no custom setup (default path)
assert wr.SPECS["keycloak"]["stateful"] is True
assert "setup" not in wr.SPECS["keycloak"]
def test_manual_migration_markers():
assert wr.notes_flag_manual_migration("This release requires a MANUAL MIGRATION of the DB.")
assert wr.notes_flag_manual_migration("Breaking change: action required before upgrade.")