Commit Graph

349 Commits

Author SHA1 Message Date
b8b698e2f5 review(2w): WC6 nightly full-cold sweep — PASS @2026-05-29 (declarative timer Persistent + orchestration + live systemd-service run: infra roll health-gated → serial cold sweep → canonical advanced, infra healthy, no leftovers) 2026-05-29 04:38:51 +01:00
465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
1e40a460ba status(2w): WC5 ADVERSARY PASS @2026-05-29 (8 WC items verified); building WC6 nightly sweep 2026-05-29 04:14:16 +01:00
5bbc47cb02 review(2w): WC5 promote-on-green-cold — PASS @2026-05-29 (gate predicate anti-poison verified + live advancement 1.10.0→1.11.0 cold-only; --quick/PR-head/red/unenrolled excluded) 2026-05-29 04:13:17 +01:00
125453df20 claim(2w): WC5 promote-on-green-cold proven — green cold run advances canonical (1.10.0→1.11.0); --quick never promotes; only cold advances
should_promote_canonical (enrolled+green+cold+latest) + promote_canonical
(re-seed canonical at green-verified latest, snapshot+registry, old known-good
replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced
1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle;
per-run app torn down. WC6 nightly sweep next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:08:14 +01:00
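
Note on the commit above: the promotion gate it names (enrolled + green + cold + latest) can be read as a single predicate. The following is a minimal sketch under stated assumptions: the RunResult type and its field names are hypothetical and are not taken from the runner; only the four gate conditions come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one CI run (field names are assumptions)."""
    enrolled: bool        # recipe opted in to a data-warm canonical
    green: bool           # every tier passed
    mode: str             # "cold" or "quick"
    tested_version: str   # version the run exercised
    latest_version: str   # newest published version


def should_promote_canonical(run: RunResult) -> bool:
    """Promote only when all four gate conditions hold."""
    return (
        run.enrolled
        and run.green
        and run.mode == "cold"
        and run.tested_version == run.latest_version
    )
```

Under this reading a green --quick run, a red cold run, or a run on an unenrolled recipe all evaluate to False, which matches the exclusions listed in the WC5 review commit.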
cf5999cdda decisions(2w): W3 WC5 promote-on-green-cold mechanism (re-seed canonical from fresh green-latest deploy; never lose known-good; gate=enrolled+green+cold+latest) 2026-05-29 04:01:59 +01:00
f2cfee5c32 status+journal(2w): W0.10a traefik WC1.1 ADVERSARY PASS — WC1.1 fully closed (both reconcilers); building W3 WC5 2026-05-29 03:59:37 +01:00
e3b08a9bdf review(2w): traefik WC1.1 (W0.10a) — PASS @2026-05-29 (stateless rollback proven, no TLS outage); CLOSES W0.10 tracked-open → WC1.1 fully verified both reconcilers 2026-05-29 03:58:33 +01:00
e678d2e006 claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof
warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik]
(stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/
file-provider config, health on routed dashboard host). keycloak path unchanged.
proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption
migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good →
clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed).
65 unit pass. Per operator out: code+converge delivered; destructive rollback
(brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:50:32 +01:00
aec6911c68 status+journal(2w): W2 gate WC4+WC7 ADVERSARY PASS @2026-05-29; advance to W3 (WC5/WC6) + traefik W0.10a quiet window 2026-05-29 03:34:29 +01:00
31f0e426c4 review(2w): WC4 + WC7 — PASS @2026-05-29 (gate 3ff2bf6; --quick never-promote + FAIL-rollback-to-exact-known-good + no-canonical→cold fallback, all cold-verified; live-bridge trigger battery) 2026-05-29 03:31:57 +01:00
3ff2bf6c48 claim(2w): Gate WC4+WC7 CLAIMED — --quick fast lane proven live (PASS keeps known-good, FAIL restores) + bridge !testme --quick deployed
WC4 run_quick: reattach canonical → upgrade-to-PR-head → assert → PASS
undeploy-keep-volume (known-good UNCHANGED, never promote) / FAIL restore
last-known-good snapshot + undeploy. Live PASS+FAIL proof on custom-html: ALL
PASS (canonical left clean idle@1.11.0+1.29.0). WC7: bridge parse_trigger
(!testme / !testme --quick / reject !testmexyz) → CCCI_QUICK param, deployed +
live-verified; default !testme stays cold; never gates merge; mode-labeled;
no-canonical fallback to cold. 64 unit pass. Full HOW/EXPECTED/WHERE in STATUS-2w.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:17:29 +01:00
9afc7f64b9 feat(2w): W2 WC7 trigger surface — bridge parses !testme --quick
bridge/bridge.py: parse_trigger(body) → (is_trigger, quick); accepts exactly
'!testme' (cold, default) and '!testme --quick' (opt-in fast lane), rejects
'!testmexyz'/'!testme foo'/etc. Threaded through both poll + webhook paths and
process_testme → trigger_build adds the CCCI_QUICK=1 Drone param (auto-exposed
to run_recipe_ci). PR comment labels a quick run lower-confidence. .drone.yml
echoes quick=. +3 unit tests (incl. the !testmexyz negative). 64 unit pass.
WC7: default !testme stays full cold; --quick opt-in, never gates merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:10:56 +01:00
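
Note on the commit above: the described parsing rule (accept exactly '!testme' and '!testme --quick', reject anything else) might look roughly like the sketch below. It assumes the whole comment body is the trigger and that surrounding whitespace is ignored; how bridge/bridge.py actually treats multi-line comments or extra whitespace is not stated in the commit, so treat those details as assumptions.

```python
def parse_trigger(body: str) -> tuple[bool, bool]:
    """Return (is_trigger, quick) for a PR comment body.

    Tokenising on whitespace means '!testmexyz' and '!testme foo'
    can never match, because the token list must be an exact match.
    """
    tokens = body.split()
    if tokens == ["!testme"]:
        return (True, False)   # default: full cold run
    if tokens == ["!testme", "--quick"]:
        return (True, True)    # opt-in fast lane
    return (False, False)
```

Comparing whole token lists rather than using a prefix check is what keeps '!testmexyz' from being treated as a trigger.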
191ebde466 fix(2w): W2 --quick live-proof fixes (time import + stale-TYPE reset)
3 bugs found by the live PASS+FAIL proof on the custom-html canonical:
- import time (run_quick._wait_undeployed used it → the FAIL rollback crashed
  with NameError before restore ran).
- canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before
  redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a
  since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'.
- run_quick FAIL rollback resets TYPE to known-good after restore (idle .env
  agrees with the registry).

LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy
keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken
image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1,
known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0).
61 unit pass; cold path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:05:39 +01:00
f68e9d463f feat(2w): W2 --quick mode in run_recipe_ci.py (WC4+WC7)
run_quick(): opt-in fast lane (CCCI_QUICK=1 / MODE=quick) — reattach the
data-warm canonical (canonical.deploy_canonical, known-good volume) → deps wiring
(warm keycloak + per-run realm) → UPGRADE to PR head (chaos, run_lifecycle_tier
'upgrade': reconverge+moved+serving + overlay) → custom tier. PASS →
undeploy_keep_volume, known-good UNCHANGED (NEVER promote); FAIL → warmsnap.restore
last-known-good + undeploy (roll back, data safe). Always deletes per-run warm
realm. mode=quick labelled lower-confidence (WC7); skips install/backup/restore;
no deploy-count guard (no deploy_app). main() dispatches to run_quick when a
canonical exists, else clean no-canonical fallback to COLD. Cold path byte-identical
(deps wiring intentionally mirrored, not refactored). 61 unit pass; cold untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:45:44 +01:00
307269b5c6 status+journal(2w): W1 gate WC2+WC3 ADVERSARY PASS @2026-05-29; advance to W2 (--quick mode) 2026-05-29 02:35:55 +01:00
0246296370 review(2w): WC2 + WC3 — PASS @2026-05-29 (gate 4ce80f8; data-warm round-trip + restore round-trip cold-verified from own clone, canonical left idle+clean) 2026-05-29 02:33:35 +01:00
62f03191ed chore(2w): consume ADVERSARY-INBOX — WC2+WC3 formally claimed (4ce80f8); running cold reproduce 2026-05-29 02:26:03 +01:00
99d1a64ac2 inbox(2w): notify Adversary — WC2+WC3 gate IS claimed (4ce80f8); W1.2 data-warm proof done; custom-html canonical idle for cold reproduce 2026-05-29 02:25:27 +01:00
b56a15403c review(2w): watchdog [C2 C3] premature — no formal WC2/WC3 claim (W1.2 live data-warm proof pending); read-only glance at canonical.py, await formal claim 2026-05-29 02:24:41 +01:00
4ce80f8751 claim(2w): W1 gate WC2+WC3 CLAIMED — data-warm canonical proven (custom-html round-trip: undeploy-keep-volume → reattach → data survives)
W1.2: enrolled custom-html (recipe_meta.WARM_CANONICAL); live proof ALL PASS
(seed canonical → idle-with-volume-retained → re-warm → marker survived).
WC2 (registry+data-warm model) + WC3 (snapshot+restore) proven. 61 unit pass.
custom-html now the first real data-warm canonical (idle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:23:22 +01:00
9144eeac2f status(2w): W1.1 registry module done; next W1.2 enroll custom-html + live data-warm proof 2026-05-29 02:15:35 +01:00
b6ef83ab0b feat(2w): W1 canonical registry module (WC2) + alerts archived
runner/harness/canonical.py: data-warm canonical registry + lifecycle —
is_enrolled (recipe_meta.WARM_CANONICAL), canonical_domain (warm.stable_domain
warm-<recipe>), registry read/write (/var/lib/ci-warm/<recipe>/canonical.json),
has_canonical (record + retained volume), deploy_canonical (reattach volume at
known-good version), undeploy_keep_volume (idle data-warm), seed_canonical
(record + warmsnap snapshot). warm.stable_domain helper added (keycloak path
unchanged). +4 unit tests (61 unit pass).

Also archived the Adversary's verification alert sentinels to alerts/seen/
(simulated rollback + 2 holds — evidentiary, gate PASSED; dir clean for real alerts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:15:11 +01:00
563156ae7e decisions(2w): W1 canonical registry design (recipe_meta.WARM_CANONICAL enrollment, warm-<recipe> data-warm lifecycle, canonical.json registry) 2026-05-29 02:11:58 +01:00
56a95c68ef status+journal(2w): W0 gate WC1+WC1.2+WC1.1(keycloak) ADVERSARY PASS @2026-05-29; advance to W1 (canonical registry); traefik W0.10 tracked before DONE 2026-05-29 02:10:55 +01:00
31ac86d644 review(2w): WC1 + WC1.2 + WC1.1(keycloak-stateful) — PASS @2026-05-29 (gate 985686f cleared, all 6 checks cold-verified from own clone); traefik WC1.1/W0.10 tracked open before DONE 2026-05-29 02:08:49 +01:00
3f566436a4 review(2w): recovery OK (kc canonical) + check6 WC1.2 holds PASS; check3 headline e2e in progress 2026-05-29 02:04:11 +01:00
95ada595aa review(2w): WC1 checks 1/2/4 PASS + WC1.1 MARQUEE rollback PASS (data intact, last_good held, alert correct); test-script cleanup bug noted, recovery in flight 2026-05-29 01:59:12 +01:00
eb54c95bfa chore(2w): consume ADVERSARY-INBOX — gate-claim confirmed, alerts-dir flag resolved (intentional cleanup), keycloak parked for my reproduce 2026-05-29 01:45:44 +01:00
d87cb8eee9 inbox(2w): consume BUILDER-INBOX; reply — gate IS claimed (985686f), pull+reproduce; alerts-dir cleaned test artifact intentionally 2026-05-29 01:45:22 +01:00
38ba153e90 review(2w): watchdog [C1] ping — no formal gate yet; read-only pre-review (reconciler clean, alerts-dir flag) + inbox heads-up to coordinate live reproduce 2026-05-29 01:44:05 +01:00
0f6e7d75e3 status(2w): gate scope note — WC1.1 proven for keycloak (stateful); traefik WC1.1 = W0.10 follow-up 2026-05-29 01:41:27 +01:00
985686f60e claim(2w): Gate WC1+WC1.1+WC1.2 CLAIMED — warm keycloak headline e2e GREEN + concurrency/reaping + rollback/holds proven
W0.7 (lasuite-docs race was transient) + W0.8 headline e2e: lasuite-docs custom
pass (3 SSO tests incl. oidc_login + password_grant) vs WARM keycloak,
deploy-count=1 (keycloak NOT co-deployed), per-run realm lasuite-docs-4c0858
created+deleted; warm kc left with only master realm. Concurrency+reaping proven
(distinct realms for concurrent same-recipe runs; reap keeps-live/deletes-orphans).
Gate claim in STATUS-2w carries full WHAT/HOW/EXPECTED/WHERE for cold verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:40:32 +01:00
cbc193e535 journal(2w): record docker-prune WC8 fix 2026-05-29 01:26:42 +01:00
e73e4393ed fix(2w): docker autoPrune drop --volumes (was failing daily + would wipe warm vols) [WC8]
The autoPrune flags passed '--volumes' WITH '--filter until=24h', which docker
rejects ('until filter not supported with --volumes') — so docker-prune.service
FAILED every day (system 'degraded') and never reclaimed anything (a cause of the
disk creeping to 96%). Worse, '--volumes' prunes volumes with no running
container — which would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by
design). Removed '--volumes': now prunes images/containers/networks/build-cache
older than 24h only; warm volumes survive and are pruned deliberately by the warm
reconcilers (WC8).

Verified: nixos-rebuild switch -> docker-prune.service runs clean, system
'running' (0 failed units), warm keycloak still 200.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:26:24 +01:00
819c1bc0fd status+journal(2w): W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback); reconciler-side WC1/WC1.1/WC1.2 proven 2026-05-29 01:21:59 +01:00
32f00717ac fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback)
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
  <recipe>/ dir — so the reconciler's sibling last_good file survives a
  snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
  stdout) in the error; add wait_undeployed() after every undeploy so
  snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
  deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
  rollback. (57 unit pass.)

LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
    committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
    HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
    last_good NOT advanced, rollback alert written (attempted=10.7.10,
    last_good=10.7.9, recovered=True). keycloak recovered to canonical
    10.7.1+26.6.2 healthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:21:05 +01:00
07ea951f31 fix(2w): WC1.1 reconcile rolls back on deploy FAILURE too (not just unhealthy)
A broken 'latest' can fail abra's converge (deploy_version raises) rather than
deploy-then-be-unhealthy; wrap the upgrade deploy so BOTH paths trigger the
snapshot-restore rollback instead of crashing the reconcile unit.
2026-05-29 01:01:28 +01:00
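
Note on the commit above: the point of the fix is that a deploy that raises and a deploy that succeeds but is unhealthy must both end in the same rollback path. A minimal sketch of that control flow follows, with every side effect passed in as a callable. All parameter names are hypothetical; the real warm_reconcile.py also undeploys before snapshotting and waits for the stack to be removed, which this sketch omits. The alert keys (attempted, last_good, recovered) are the ones quoted in the W0.9 live-proof commit.

```python
from typing import Callable


def reconcile_upgrade(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
    alert: Callable[[dict], None],
) -> str:
    """Try to move to `latest`; return the version to record as last-good."""
    if latest == last_good:
        return last_good            # nothing to do
    snapshot()                      # keep known-good data before touching anything
    try:
        deploy(latest)              # may raise if the converge itself fails
        ok = is_healthy()
    except Exception:
        ok = False                  # deploy FAILURE is handled like "unhealthy"
    if ok:
        return latest               # commit: last-good advances
    restore()                       # roll data back to the snapshot
    deploy(last_good)               # redeploy the prior version
    alert({"attempted": latest, "last_good": last_good,
           "recovered": is_healthy()})
    return last_good                # last-good is NOT advanced
```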
0812132452 review(2w): standing WC8 probe — lasu-0a6fb2 fully torn down (no app/svc/vol/secret), disk 63% 2026-05-29 00:55:49 +01:00
4808d0354a status(2w): W0.6 reconciler delivered + WC1.2 holds proven; next W0.9 WC1.1 live proofs 2026-05-29 00:43:10 +01:00
a044abb298 feat(2w): W0.6 unpinned warm reconciler + WC1.2 safety gate + WC1.1 scaffold
runner/warm_reconcile.py (python, packaged into nix store, replaces bash
reconcile): UNPIN keycloak (deploy latest published version TAG; recipe fetched
at runtime -> D8 closure byte-identical). WC1.2 pre-deploy safety gate (runs
FIRST): major recipe/app-version bump OR releaseNotes manual-migration marker
-> hold-on-current + alert sentinel (no deploy churn). WC1.1 health-gated
upgrade-with-rollback: record last-good -> [keycloak: undeploy->warmsnap.snapshot
->deploy latest] -> health-gate -> commit-or-(restore+redeploy-prior+alert).
Alerts = /var/lib/ci-warm/alerts/*.json (Builder loop relays). current version
read from abra TYPE=<recipe>:<version>. CCCI_SKIP_FETCH test hook.
+8 unit tests for the version gate (56 unit pass).

Proven on cc-ci: nixos-rebuild switch -> warm-keycloak.service runs the python
reconciler -> noop-healthy (system 0-failed, /realms/master=200). WC1.2 holds
proven live: MAJOR bump -> held-major (keycloak untouched); minor+manual-
migration notes -> held-manual-migration (alert carries notes); no deploy churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:42:02 +01:00
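
Note on the commit above: the WC1.2 pre-deploy gate holds on a major recipe or app-version bump, or when the release notes carry a manual-migration marker. The sketch below illustrates that decision for version strings of the form <recipe-semver>+<app-version> (for example 10.7.1+26.6.2). The marker text "manual migration", the function names, and the "deploy" return value are assumptions; the two hold outcomes are the names used in the commit message.

```python
import re


def _major(version: str):
    """Leading integer of a version string ('v3.6.15' -> 3); None if absent."""
    match = re.match(r"v?(\d+)", version)
    return int(match.group(1)) if match else None


def safety_gate(current: str, candidate: str, release_notes: str,
                marker: str = "manual migration") -> str:
    """Decide whether an unpinned upgrade may proceed."""
    cur_recipe, _, cur_app = current.partition("+")
    new_recipe, _, new_app = candidate.partition("+")
    # A major bump of either the recipe or the packaged app holds the upgrade.
    if (_major(cur_recipe) != _major(new_recipe)
            or _major(cur_app) != _major(new_app)):
        return "held-major"
    # Minor/patch bump, but the notes ask for operator action.
    if marker.lower() in release_notes.lower():
        return "held-manual-migration"
    return "deploy"
```

Running the gate before any undeploy or snapshot is what gives the "no deploy churn" property: a held upgrade never touches the running service.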
aff50aac0a journal(2w): W0.5 proven + WC8 disk reclaim (96%->62%); checkpoint before W0.6 2026-05-29 00:29:42 +01:00
67240dca92 decisions+status(2w): W0.5 done (WC3 snapshot proven); W0.6 reconciler version model (deploy-by-tag, recipe-semver pre-+, python entrypoint in store) 2026-05-29 00:15:38 +01:00
4cc1e15a53 feat(2w): W0.5 WC3 snapshot/restore helper (warmsnap.py)
runner/harness/warmsnap.py: raw per-volume tar of an app's stack volumes while
UNDEPLOYED, under /var/lib/ci-warm/<recipe>/ (meta.json + volumes/<vol>.tar);
one last-good, atomic dir swap; restore clears+untars each volume back. Asserts
undeployed (consistency). Reused by WC1.1 (pre-upgrade keycloak snapshot) + WC5.
+5 unit tests (48 unit pass).

LIVE round-trip PROVEN on warm keycloak: create marker realm -> undeploy ->
snapshot (mariadb+providers vols) -> deploy -> delete marker (mutate DB) ->
undeploy -> restore -> deploy -> marker realm BACK; keycloak healthy. WC3 core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:12:46 +01:00
ceacd0e6de backlog+decisions(2w): re-sequence W0 (WC3 helper first); unpin/snapshot/alert decisions 2026-05-29 00:05:13 +01:00
740d7bac4c status(2w): W0 core mechanism proven + reconciler up; absorb design update (unpin+WC1.1+WC1.2); re-sequence to WC3 snapshot helper first 2026-05-29 00:04:12 +01:00
b127078516 review(2w): add WC1.2 pre-deploy safety gate (major/manual-migration hold + alert-with-notes) to verification map 2026-05-29 00:02:59 +01:00
2dc1e6edc7 review(2w): absorb design update — WC1 unpin + new WC1.1 health-gated rollback proof + WC6 reorder into verification map 2026-05-29 00:00:09 +01:00
88c11142de fix(2w): W0.3 warm-keycloak reconciler — newline bite + skip-if-healthy
- set_env: ensure trailing newline before append (keycloak .env.sample ends
  with a newline-less #COMPOSE_FILE comment, so a bare append glued DOMAIN onto
  it -> DOMAIN unset -> KC_HOSTNAME=https:// -> crash-loop). Same bite fixed in
  backupbot.nix.
- converge skips the (forced) redeploy when keycloak already serves 200, so an
  activation/boot is a true no-op (no JVM-restart blip) and only redeploys when
  down/crash-looping. Health-wait extended to 15min.

Verified on cc-ci: nixos-rebuild switch -> warm-keycloak.service active,
'no-op converge', system running (0 failed), /realms/master=200.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:52:01 +01:00
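
Note on the commit above: the "newline bite" arises when a KEY=value line is appended to a file whose last line has no trailing newline, so the new assignment is glued onto the preceding comment. A minimal Python sketch of the fix, operating on the file contents as a string, is shown below. The function signature is an assumption; the commit does not show how set_env is actually implemented.

```python
def set_env(text: str, key: str, value: str) -> str:
    """Append KEY=value to env-file contents, never gluing onto the last line."""
    if text and not text.endswith("\n"):
        text += "\n"                # terminate a newline-less final line first
    return f"{text}{key}={value}\n"
```

Without the guard, appending DOMAIN to contents ending in a newline-less "#COMPOSE_FILE=..." comment would yield a single commented-out line, leaving DOMAIN unset.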
c8e9ddb681 feat(2w): W0.3 declarative warm-keycloak reconciler (WC1)
nix/modules/warm-keycloak.nix: idempotent systemd oneshot (like deploy-proxy)
that converges a live-warm shared keycloak at warm-keycloak.ci.commoninternet.net
pinned to 10.7.1+26.6.2, secrets generated only-if-missing (never
rotate a live provider), waits /realms/master=200. Re-warmable from scratch
(D8/WC8). Wired into hosts/cc-ci/configuration.nix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:28:44 +01:00