feat(2w): W0.6 unpinned warm reconciler + WC1.2 safety gate + WC1.1 scaffold

runner/warm_reconcile.py (python, packaged into nix store, replaces bash reconcile):

UNPIN keycloak (deploy latest published version TAG; recipe fetched at runtime -> D8 closure byte-identical).

WC1.2 pre-deploy safety gate (runs FIRST): major recipe/app-version bump OR releaseNotes manual-migration marker -> hold-on-current + alert sentinel (no deploy churn).

WC1.1 health-gated upgrade-with-rollback: record last-good -> [keycloak: undeploy -> warmsnap.snapshot -> deploy latest] -> health-gate -> commit-or-(restore+redeploy-prior+alert).

Alerts = /var/lib/ci-warm/alerts/*.json (Builder loop relays). Current version read from abra TYPE=<recipe>:<version>. CCCI_SKIP_FETCH test hook. +8 unit tests for the version gate (56 unit pass).

Proven on cc-ci: nixos-rebuild switch -> warm-keycloak.service runs the python reconciler -> noop-healthy (system 0-failed, /realms/master=200).

WC1.2 holds proven live: MAJOR bump -> held-major (keycloak untouched); minor+manual-migration notes -> held-manual-migration (alert carries notes); no deploy churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
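
The WC1.2 gate described above (major recipe/app-version bump or a manual-migration marker in the release notes -> hold) could look roughly like the sketch below. This is not the code from runner/warm_reconcile.py, which this page does not show. It assumes version tags of the form `<recipe>+<app>` (as in the old pin `10.7.1+26.6.2`); the marker strings, the function names `majors` and `gate`, and the `proceed` result are hypothetical. Only `held-major` and `held-manual-migration` come from the commit message.

```python
import re

# Hypothetical marker strings; the real reconciler's marker list is not shown in this commit.
MANUAL_MIGRATION_MARKERS = ("manual migration", "manual-migration")


def majors(tag):
    """Return (recipe_major, app_major) for a tag like '10.7.1+26.6.2'.

    app_major is None when the tag has no '+<app>' part.
    """
    recipe, _, app = tag.partition("+")

    def major(part):
        match = re.match(r"v?(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unparseable version component: {part!r}")
        return int(match.group(1))

    return major(recipe), (major(app) if app else None)


def gate(current, latest, release_notes=""):
    """Decide whether an upgrade from `current` to `latest` may be deployed."""
    cur_recipe, cur_app = majors(current)
    new_recipe, new_app = majors(latest)

    # A major bump of either the recipe or the app version holds on the current version.
    if new_recipe > cur_recipe:
        return "held-major"
    if cur_app is not None and new_app is not None and new_app > cur_app:
        return "held-major"

    # A minor bump whose notes mention a manual migration also holds.
    notes = release_notes.lower()
    if any(marker in notes for marker in MANUAL_MIGRATION_MARKERS):
        return "held-manual-migration"

    return "proceed"
```

Checking the major bump before the release notes matches the order of the live results reported above: a MAJOR bump is held as `held-major`, and a minor bump with migration notes is held as `held-manual-migration`.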
@@ -1,90 +1,35 @@
-# Phase 2w / WC1 — a live-warm, shared keycloak SSO provider, deployed via abra at a STABLE domain
-# (distinct from cold per-run `<recipe[:4]>-<6hex>`; see DECISIONS.md Phase-2w). SSO-dependent
-# recipe runs use this one instance (creating + deleting a per-run namespaced realm) instead of
-# co-deploying a fresh keycloak each run — the highest-ROI warm layer (W0).
+# Phase 2w / WC1+WC1.1+WC1.2 — a live-warm, shared keycloak SSO provider, auto-updating to LATEST
+# with a pre-deploy safety gate + post-deploy health-gated rollback. Deployed via abra at a STABLE
+# domain (distinct from cold per-run `<recipe[:4]>-<6hex>`; see DECISIONS.md Phase-2w). SSO-dependent
+# recipe runs use this one instance (per-run namespaced realm, created+deleted) instead of
+# co-deploying a fresh keycloak each run.
 #
-# Declared as an idempotent-RECONCILE systemd oneshot (like deploy-proxy / swarm-init): it inspects
-# current state and converges every activation/boot, self-healing drift (redeploys if the stack is
-# gone). No run-once sentinel. So a from-scratch install re-warms keycloak with just
-# `nixos-rebuild switch` (D8 / WC8 "re-warmable from scratch"). The keycloak is declarative INFRA
-# (in the D8 closure); only warm *volumes/snapshots* (W1+) are cache excluded from D8. Its realm
-# data is ephemeral per-run.
+# The reconcile logic lives in `runner/warm_reconcile.py` (Python — reuses warmsnap/abra/lifecycle so
+# there is ONE snapshot impl, also used by the runner for WC5). The runner/ tree is copied into the
+# nix store, so this is D8-clean (no dependence on the /root/cc-ci checkout) and the recipe is fetched
+# at *runtime* → the nix closure stays byte-identical regardless of which keycloak version is live
+# (UNPINNED; the kcVersion pin is gone).
 #
-# Secrets are generated ONLY if missing — never rotated — so a reconcile against a running provider
-# does not invalidate the admin/db creds the harness reads from inside the container.
+# Idempotent RECONCILE oneshot (like deploy-proxy / swarm-init): converges every activation/boot.
+# WC1.2 safety gate (major / manual-migration → hold + alert, no churn) runs BEFORE WC1.1's
+# health-gated upgrade-with-rollback (snapshot keycloak's data volume before upgrade; restore +
+# redeploy prior version on an unhealthy upgrade). Alerts are sentinel JSON under
+# /var/lib/ci-warm/alerts/ relayed by the Builder loop (see DECISIONS).
 { pkgs, ... }:
 let
-  # Pinned known-good keycloak version (latest published as of 2026-05-28). Bump deliberately.
-  kcVersion = "10.7.1+26.6.2";
+  runnerSrc = ../../runner;
   reconcile = pkgs.writeShellApplication {
     name = "cc-ci-reconcile-warm-keycloak";
-    runtimeInputs = with pkgs; [ abra docker jq gnused gnugrep coreutils git curl ];
+    runtimeInputs = with pkgs; [ abra docker git curl jq gnused gnugrep gnutar coreutils ];
     text = ''
-      DOMAIN="warm-keycloak.ci.commoninternet.net"
-      VERSION="${kcVersion}"
-      ENV_FILE="$HOME/.abra/servers/default/$DOMAIN.env"
-      RECIPE_DIR="$HOME/.abra/recipes/keycloak"
-
-      abra server ls -m -n >/dev/null 2>&1 || abra server add --local -n || true
-      abra recipe fetch keycloak -n >/dev/null
-
-      # Create the app config once (records ENV VERSION). No -S here: secrets are generated below,
-      # guarded, so a reconcile never rotates a running provider's creds.
-      [ -f "$ENV_FILE" ] || abra app new keycloak -s default -D "$DOMAIN" "$VERSION" -o -n
-
-      set_env() {
-        sed -i -E "/^[[:space:]]*#?[[:space:]]*$1=/d" "$ENV_FILE"
-        # Ensure the file ends in a newline before appending — keycloak's .env.sample ends with a
-        # newline-less comment line (#COMPOSE_FILE=...), so a bare append would glue the var onto
-        # that comment (commenting it out → KC_HOSTNAME=https:// with no host → crash). `$(tail -c1)`
-        # is empty iff the last byte is already a newline. (Same bite as backupbot.nix.)
-        if [ -s "$ENV_FILE" ] && [ -n "$(tail -c1 "$ENV_FILE")" ]; then printf '\n' >> "$ENV_FILE"; fi
-        printf '%s=%s\n' "$1" "$2" >> "$ENV_FILE"
-      }
-      set_env DOMAIN "$DOMAIN"
-      set_env LETS_ENCRYPT_ENV ""
-
-      # Pin the on-disk recipe to the version tag so a non-chaos deploy genuinely deploys VERSION
-      # (a chaos deploy would ignore ENV VERSION and use the current checkout — see abra.recipe_checkout).
-      git -C "$RECIPE_DIR" checkout --quiet "$VERSION"
-
-      # Generate secrets only if absent (idempotent; never rotate a live provider).
-      have_secret() { docker secret ls --format '{{.Name}}' | grep -q "_$1_v1$"; }
-      if ! have_secret admin_password; then
-        abra app secret generate "$DOMAIN" --all -m -o -n
-      fi
-
-      health() {
-        curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
-          --resolve "$DOMAIN:443:127.0.0.1" "https://$DOMAIN/realms/master" 2>/dev/null || true
-      }
-
-      # Converge WITHOUT churning a healthy provider: only (re)deploy if it is not already serving.
-      # This makes every activation/boot a true no-op when keycloak is up (no JVM restart blip), and
-      # self-heals when the stack is gone or crash-looping. (To roll a new kcVersion, `abra app
-      # undeploy` first so this redeploys — a deliberate, rare op; keycloak is the SSO dep, not under
-      # test.) `-f` because a plain non-chaos deploy FATALs "already deployed".
-      stack="warm-keycloak_ci_commoninternet_net"
-      if [ "$(health)" = "200" ] && docker service ls --format '{{.Name}}' | grep -q "^''${stack}_app$"; then
-        echo "warm keycloak already healthy ($DOMAIN) — no-op converge"
-        exit 0
-      fi
-      abra app deploy "$DOMAIN" -o -n -f
-
-      # Wait until keycloak actually answers /realms/master (JVM + DB migration is slow). Surface a
-      # failed unit if it never comes up rather than reporting success on a half-booted provider.
-      for _ in $(seq 1 90); do
-        [ "$(health)" = "200" ] && { echo "warm keycloak healthy ($DOMAIN)"; exit 0; }
-        sleep 10
-      done
-      echo "FATAL: warm keycloak $DOMAIN did not become healthy" >&2
-      exit 1
+      export HOME=/root
+      exec ${pkgs.python3}/bin/python3 ${runnerSrc}/warm_reconcile.py keycloak
     '';
   };
 in
 {
   systemd.services.warm-keycloak = {
-    description = "Reconcile the live-warm shared keycloak SSO provider (WC1) via abra";
+    description = "Reconcile the live-warm shared keycloak SSO provider (WC1/WC1.1/WC1.2) via abra";
     after = [ "deploy-proxy.service" "swarm-init.service" "docker.service" "network-online.target" ];
     requires = [ "swarm-init.service" "docker.service" ];
     wants = [ "deploy-proxy.service" "network-online.target" ];
@@ -93,8 +38,9 @@ in
     serviceConfig = {
       Type = "oneshot";
       RemainAfterExit = true;
-      # Generous: a cold keycloak boot (JVM + DB migration) can take ~10min on this 2-vCPU node.
-      TimeoutStartSec = "1200";
+      # Generous: a cold keycloak boot (JVM + DB migration) can take ~10min, and a health-gated
+      # upgrade may snapshot + deploy + (rollback) within one run.
+      TimeoutStartSec = "1800";
       ExecStart = "${reconcile}/bin/cc-ci-reconcile-warm-keycloak";
     };
   };
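
The WC1.1 flow named in the commit message (record last-good -> undeploy -> snapshot -> deploy latest -> health-gate -> commit, or restore + redeploy prior + alert) can be sketched as below. This is an illustration under stated assumptions, not the contents of runner/warm_reconcile.py: every callable (`undeploy`, `snapshot`, `deploy`, `restore`, `probe`, `alert`), the function names, the alert payload keys, and the `committed` / `rolled-back` results are hypothetical. The defaults of 90 attempts, 10 seconds apart, mirror the polling loop of the removed bash script; whether the Python reconciler keeps them is not shown here.

```python
import time


def wait_healthy(probe, attempts=90, delay=10.0, sleep=time.sleep):
    """Poll `probe` until it returns HTTP 200 or the attempts run out."""
    for _ in range(attempts):
        if probe() == 200:
            return True
        sleep(delay)
    return False


def upgrade_with_rollback(current, latest, *, undeploy, snapshot, deploy, restore,
                          probe, alert, attempts=90, delay=10.0, sleep=time.sleep):
    """Health-gated upgrade: commit on a healthy deploy, otherwise roll back."""
    last_good = current      # record last-good before touching anything
    undeploy()               # stop the stack so the data volume is quiescent
    snapshot()               # snapshot the data volume before the upgrade
    deploy(latest)

    if wait_healthy(probe, attempts=attempts, delay=delay, sleep=sleep):
        return "committed"

    # Unhealthy upgrade: restore the snapshot, redeploy the prior version, raise an alert.
    restore()
    deploy(last_good)
    alert({"kind": "rolled-back", "from": latest, "to": last_good})  # hypothetical payload
    return "rolled-back"
```

Passing the side effects in as callables keeps the ordering testable with fakes, without abra, docker or a live keycloak.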