Commit Graph

78 Commits

Author SHA1 Message Date
e6d55b53c7 fix(harness): a paused swarm update is settled — only active states block convergence
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.

Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.

lint: PASS, unit tests: 138 passed.
2026-06-09 23:07:36 +00:00
68ef0f84fb fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409)
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone Build was killed
services_converged() accepted N/N replicas as converged — but a chaos redeploy
that changes a non-app service image (immich PR #2 moves the db to the
vectorchord pin) registers a stop-first rolling update that swarm may not have
STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies
seconds later. Build 238: backupbot resolved the db hook container, the task
was killed in the gap, and the pre-hook exec crashed the whole backup with a
409 -> no dump in the snapshot -> restore had nothing -> RED.

- services_converged() now also requires every service's swarm UpdateStatus to
  be settled ('', completed, rollback_completed) — updating/paused/rollback in
  flight is NOT converged. Strictly stricter: no gate is weakened.
- backup_app() gains a bounded (300s) settle-wait before 'abra app backup
  create' as defence in depth; on timeout the backup still runs and the tier's
  assertion delivers the verdict.

lint: PASS, unit tests: 138 passed.
2026-06-09 22:10:55 +00:00
c0df77d0d9 fix(harness): make concurrent recipe runs safe (per-recipe flock + active-run registry)
All checks were successful
continuous-integration/drone/push Build is passing
capacity=2 went live with three stale capacity=1-era assumptions that corrupted
concurrent runs (immich 229/230 '/pg_backup.sh: No such file'):

- ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/
  reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now
  serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken
  in main() BEFORE fetch_recipe and held for the whole run; the kernel releases
  it on any process death, so there is no stale-lock failure mode. Different
  recipes still run in parallel.

- CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every
  run now registers its app domain + pid in /run/cc-ci-active/<domain> before
  app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci
  process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or
  post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone
  from .drone.yml.

- .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the
  'safe because capacity=1' comments now describe the flock+registry model.

lint: PASS, unit tests: 138 passed.
2026-06-09 21:56:25 +00:00
9a7772563a style: repo-wide lint pass — make the lint gate green again
Push builds have been RED on the lint step since ~build 209 from accumulated
formatting drift. This is the mechanical cleanup: ruff format + ruff --fix
(UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115
tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged
attrsets, dropped unused lib args), yamllint, and shell quoting fixes in
tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended;
lint: PASS, unit tests: 138 passed.
2026-06-09 21:56:15 +00:00
c51cd84159 feat(harness): intentional skips + custom-html-tiny functional test; 4-rung ladder (#6)
Some checks failed
continuous-integration/drone/push Build is failing
Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder

- recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any
  essential rung skipped and not listed is unintentional. Skips still cap the level
  (never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung.
- Level ladder = the four essential rungs (install, upgrade, backup/restore,
  functional; top = L4). integration & recipe-local are optional, not leveled
  (SSO still enforced for the run verdict, unchanged).
- Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL
  SKIP (amber); level badge gains an expected/gap? third segment.
- custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares
  backup_restore intentionally skipped (stateless static server).

Independently verified by the adversary: 138 unit tests pass cold; live full-stage
run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge);
clean teardown.
2026-06-09 03:12:11 +00:00
799cceb54a fix(3 U5.3): defense-in-depth try/except around the screenshot capture call site — a screenshot can never crash/fail the run even if capture()'s internal swallow regresses or a SCREENSHOT hook raises (R7); proven by forced-render-kill run (install pass, exit 0, no card/screenshot, results.json intact)
Some checks failed
continuous-integration/drone/push Build is failing
2026-05-31 10:13:30 +00:00
afe5e51057 feat(3 U2-wiring): render summary card PNG + level badge SVG into run artifact dir (best-effort, R7; not yet served) 2026-05-31 07:03:10 +00:00
5fa15d4949 feat(3 U1): wire app screenshot capture into run_recipe_ci (best-effort, post-healthy, secret-safe; sets results.json screenshot) 2026-05-31 06:56:20 +00:00
8179d3f3f9 fix(3 U2): inline-SVG sunflower + font-safe cap line for headless card render
Headless chromium has no colour-emoji font, so 🌻/🏆/⚑ rendered as tofu boxes in the PNG card.
Replace with a self-contained inline-SVG sunflower + plain-text 'capped:'/'full clean climb' markers.
The U3 PR comment keeps the real 🌻 emoji (Gitea markdown renders it). Pure render change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 06:23:13 +00:00
7217e0c98c feat(3 U2-scaffold): summary card + level/status SVG badge renderers (offline; pure)
harness/card.py: render_badge_svg/level_badge_svg (shields-style SVG, colour-by-level, R6) +
render_card_html (recipe+version, level badge, per-stage/per-test ✔/✘ table, embedded screenshot,
invariant flags — REPORTS results.json verbatim, never recomputes; cardinal no-inflation guardrail)
+ render_card_png (best-effort Playwright HTML->PNG, R7). 8 pure unit tests. Orchestrator wiring +
stable-URL serving + live PNG demo come after U0 PASSes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 06:11:47 +00:00
daa7edd3a7 feat(3 U1-scaffold): app screenshot capture module (offline; not yet wired)
harness/screenshot.py: best-effort Playwright capture of the live app (reuses harness browser).
Default = landing page (credential-free, secret-safe R7); recipes needing post-login opt into a
recipe-meta SCREENSHOT hook responsible for avoiding secret pages. Every failure swallowed -> None
(cosmetics never block, R7). Pure helpers unit-tested. Orchestrator wiring + live demo come after U0
PASSes (avoid deploy contention with the Adversary's cold U0 re-runs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 06:05:39 +00:00
52e5d210d8 feat(3 U0.2+U0.3): per-test results + results.json with computed level
harness/results.py: JUnit-XML parsing (stdlib) → per-stage/per-test rows; derive_rungs (documented
tier+deps/SSO → rung mapping); build_results assembles results.json {recipe,version,pr,ref,run_id,
stages[],level,level_cap_reason,rungs,flags{clean_teardown,no_secret_leak},screenshot,summary_card};
write_results (atomic). run_recipe_ci.py: tiers emit --junitxml + append {tier,source,file,rc,junit}
records; main() assembles+writes results.json wrapped so a failure NEVER changes the verdict (R7),
incl. a narrow leak-scan of the serialised artifact. 17 new unit tests (test_results.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 05:55:58 +00:00
9773e3ff63 feat(3 U0.1): pure level() ladder mapper (L0-L6, gap-caps) + unit tests
Phase-3 R1 foundation. harness.level.compute_level(rungs)->(level,cap_reason) with YunoHost
gap-caps semantics: level = highest rung 1..L all clean PASS; first non-PASS (FAIL or N/A) caps,
recorded in cap_reason. N/A caps like fail but distinctly (L5 'no integration surface' example).
Helpers backup_restore_status + tier_to_rung. 16 unit tests incl U0 gate cases (L4-pass, L2-cap).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 05:46:23 +00:00
4bf9e1d43d feat(mumble F2-14c): drop cc-ci compose.host-ports.yml fork; deploy 0.2.0 base minimally, add native host-ports on upgrade-to-latest via new UPGRADE_EXTRA_ENV harness hook + COMPOSE_FILE-aware READY_PROBE/install skip 2026-05-31 05:07:55 +00:00
2f6a6842b0 fix(2): echo abra backup output (backupbot pre-hook) into run log for diagnosis 2026-05-31 00:04:05 +00:00
4a29ca6a55 fix(2): echo abra restore output (backupbot post-hook) into run log for diagnosis 2026-05-30 23:37:55 +00:00
68a7c79668 fix(2): ghost F2-14b — harness BACKUP_VERIFY hook + retry; close the backup-capture race
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook,
but if the DB container cycles mid-dump (intermittent on the loaded CI node — full5/6/7 RED, full8
green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path —
abra app backup 'succeeds' yet a later restore silently loses the data (ghost ci_marker).

Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain)
-> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole
backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's
BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion —
it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:30:25 +00:00
aebe93c299 fix(2): _load_meta whitelist UPGRADE_BASE_VERSION (override was silently dropped → base fell back to [-2])
The override added in a750937 had no effect: _load_meta only copies a fixed
key whitelist into the meta dict, and UPGRADE_BASE_VERSION wasn't in it, so
meta.get(...) returned None and the upgrade base fell back to previous_version()
= recipe_versions[-2] (0.6.3+3.1.2). Add it to the whitelist so discourse's
honest 0.7.0 base is selected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:30:39 +01:00
a750937fb0 feat(2): discourse Q4.6 honest upgrade crossover — UPGRADE_BASE_VERSION override (base-on-[-1]) + uniform bitnamilegacy image overlay
Implements the real 0.7.0+3.3.1 -> 0.8.0+3.3.1 upgrade crossover instead of a
§7.1 skip-with-sign-off (Adversary leans DENY on the deferral; agreed):
- recipe_meta UPGRADE_BASE_VERSION=0.7.0+3.3.1 + generic support in
  run_recipe_ci (prev = meta override or previous_version). Harness default
  [-2]=0.6.3+3.1.2 is a hollow base (img 3.1.2 != head 3.3.1); [-1]=0.7.0+3.3.1
  is the PR's true predecessor and shares head's servable 3.3.1 image.
- compose.ccci-health.yml re-pins services.{app,sidekiq}.image to
  bitnamilegacy/discourse:3.3.1 so the 0.7.0 base (compose pins 404 bitnami:3.3.1)
  is servable; idempotent on the head (PR already bitnamilegacy).
Consumes Adversary BUILDER-INBOX (deleted), leaves ADVERSARY-INBOX ack; STATUS-2
discourse section updated. Full lifecycle run launching next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:20:06 +01:00
a7e2af444a fix(2): assert_upgraded tolerate abra's '+U' working-tree marker on chaos-version
A cc-ci deploy overlay sitting in the recipe checkout as an untracked file (ghost's
compose.ccci-health.yml via install_steps) makes abra stamp chaos-version='<commit>+U' (U=untracked).
The commit still equals head_ref (HC1 satisfied) but the '+U' broke the exact-prefix match → spurious
upgrade-tier FAIL. Strip the working-tree-state marker before the commit match; HC1 preserved (commit
must still equal head_ref — a stale checkout's commit would not match even after stripping). General:
benefits every future cc-ci overlay recipe.
2026-05-30 05:49:27 +01:00
dd45e9555e revert(2): drop adversary scratch probe scripts accidentally staged by git add -A (runner/adv_*.py are local-only adversary scratch, not Builder code) 2026-05-29 23:37:48 +01:00
af94708de4 review(2): resume checkpoint — no gate pending; drone block genuine (/etc/timezone still absent on host); leftover drone smoke stack flagged (housekeeping); immich P4-restore still OPEN, unsigned 2026-05-29 23:37:17 +01:00
ec76072489 fix(2): Q4.2 mumble — TCP voice-server READY_PROBE gates backup past upgrade host-port churn
Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green
on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not
running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by
the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy
passed while the app churned → backup-bot execed a not-running container. Fix: extend
lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects});
mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up
after install AND upgrade before backup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 20:19:07 +01:00
1890cb58f3 fix(2): recipe_checkout force (-f) — fixes mumble upgrade-tier checkout collision with cc-ci overlay
git checkout <head_ref> aborted on the untracked install_steps-provided compose.host-ports.yml (which
head_ref tracks). Force-checkout yields the exact ref tree. Also fixes the mumble restore tier: backup
labels exist only in 1.0.0+, so backup/restore are meaningful only after the (now-working) upgrade moves
the app to head_ref. DECISIONS.md updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 20:03:41 +01:00
999dd0d564 fix(2): Q4.2 mumble — CHAOS_BASE_DEPLOY meta flag for chaos base deploy (clean-tree gate)
mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because
install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True +
lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the
checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md
records the full mumble enrollment design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 19:32:48 +01:00
2bf40d69d6 feat(2): HQ1 image pre-pull (plan-prepull-images.md) — warm local store before deploy
lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE
from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present
(zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND
in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a
clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale).
Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→
pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:02:21 +01:00
72719fe0d7 fix(2): R014 — chaos base deploy for recipes with lightweight tags (replaces fragile origin-repoint)
The origin-repoint approach hit go-git 'reference not found' (mirror HEAD→master vs main). Simpler +
robust: detect lightweight version tags (has_lightweight_version_tags, read-only) and, for the pinned
base deploy of such a recipe, use chaos — which SKIPS abra lint (so no R014 FATA) and deploys the
EXPLICITLY-checked-out pinned version (recipe_checkout already ran; chaos uses the current checkout,
so it's the prev version, NOT LATEST — F1d-2's hazard was the missing checkout). No-op / stays pinned
for all-annotated recipes. The upgrade tier's prev→PR-head crossover + HC1 (chaos-version==head_ref)
still hold (verified by the run's upgrade-tier log).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 14:15:07 +01:00
ad06a5dd3f fix(2): R014 normalize — use git clone --mirror (not --bare) so abra's later fetches find refs/heads/main
--bare lacked refs/heads/main, so abra's post-normalize git ops (app secret insert / deploy) failed
'unable to fetch tags: reference not found' when fetching from the repointed local origin. --mirror
copies all refs (heads+tags) → abra fetch OK + R014 passes (both verified).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 14:05:26 +01:00
da44e2ca8a fix(2): R014 normalize — repoint recipe origin to local bare with annotated tag (abra force-fetches tags before lint, reverting in-place re-annotation)
Diagnosed: abra runs git fetch --tags --force from origin before its pinned-deploy lint, so
re-annotating the lightweight tag in place is reverted before R014 runs. Fix: after re-annotating,
clone the recipe to a local bare repo (carrying the annotated tag) and repoint origin at it, so
abra's force-fetch pulls the annotated tag. Verified: abra recipe lint R014 then PASSES and the
annotation sticks. Deployed commit unchanged. No-op for all-annotated recipes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 13:59:03 +01:00
8c19b1fadc fix(2): normalize lightweight recipe tags to annotated before pinned deploy (R014)
lasuite-meet upgrade tier failed at the prev-version base deploy: abra's pinned-deploy lint FATA'd on
R014 'only annotated tags used for recipe version' because upstream coop-cloud lasuite-meet ships a
stray LIGHTWEIGHT tag (0.3.0+v1.16.0). chaos deploys skip lint (so install,custom passed) but the
upgrade tier's pinned prev-version deploy lints. New abra.normalize_recipe_tags() re-creates each
lightweight version tag as annotated at the SAME commit (no deployed content changes); called in
lifecycle.deploy_app after recipe_checkout when version is pinned. Idempotent; no-op for all-annotated
recipes (lasuite-drive etc.). Helps any recipe with a stray upstream lightweight tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 13:48:55 +01:00
e1147b5fe3 fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE
Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor
FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init),
even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora
being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold).

Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
  impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
  N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
  recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
  discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
  the upgrade redeploy. No-op for recipes without a READY_PROBE.

NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 11:55:53 +01:00
4b38b66fa5 fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT
Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):

- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
  on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
  plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
  while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
  collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
a151489996 feat(2): lasuite-drive Q3.2a Part A — wire OIDC at INSTALL, eliminate flaky redeploy
Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a
post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply
the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since
the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm
BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge).

- runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write
  $CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to
  _load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install
  tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the
  MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for
  all other dep recipes (gated on `not oidc_at_install`).
- tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on
  empty deps file (recipe still boots, OIDC test skips → F2-11 RED).
- tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out).
- tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True.
- JOURNAL-2: Step-0 root-cause failure logs captured before the fix.

NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:10:05 +01:00
fc6e35d617 feat(2): mattermost-lts create-message round-trip (§4.3 P3) — first-user→login→team→channel→post→read-back; harness http.post_with_headers (returns response headers, for mattermost login Token) 2026-05-29 08:31:37 +01:00
40b03a9bf1 claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the
nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS
serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred;
warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md
documents the full warm/quick model; --quick rollback proof already proven live
(W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS,
all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:43:34 +01:00
465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
125453df20 claim(2w): WC5 promote-on-green-cold proven — green cold run advances canonical (1.10.0→1.11.0); --quick never promotes; only cold advances
should_promote_canonical (enrolled+green+cold+latest) + promote_canonical
(re-seed canonical at green-verified latest, snapshot+registry, old known-good
replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced
1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle;
per-run app torn down. WC6 nightly sweep next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:08:14 +01:00
e678d2e006 claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof
warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik]
(stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/
file-provider config, health on routed dashboard host). keycloak path unchanged.
proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption
migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good →
clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed).
65 unit pass. Per operator out: code+converge delivered; destructive rollback
(brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:50:32 +01:00
191ebde466 fix(2w): W2 --quick live-proof fixes (time import + stale-TYPE reset)
3 bugs found by the live PASS+FAIL proof on the custom-html canonical:
- import time (run_quick._wait_undeployed used it → the FAIL rollback crashed
  with NameError before restore ran).
- canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before
  redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a
  since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'.
- run_quick FAIL rollback resets TYPE to known-good after restore (idle .env
  agrees with the registry).

LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy
keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken
image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1,
known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0).
61 unit pass; cold path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:05:39 +01:00
f68e9d463f feat(2w): W2 --quick mode in run_recipe_ci.py (WC4+WC7)
run_quick(): opt-in fast lane (CCCI_QUICK=1 / MODE=quick) — reattach the
data-warm canonical (canonical.deploy_canonical, known-good volume) → deps wiring
(warm keycloak + per-run realm) → UPGRADE to PR head (chaos, run_lifecycle_tier
'upgrade': reconverge+moved+serving + overlay) → custom tier. PASS →
undeploy_keep_volume, known-good UNCHANGED (NEVER promote); FAIL → warmsnap.restore
last-known-good + undeploy (roll back, data safe). Always deletes per-run warm
realm. mode=quick labelled lower-confidence (WC7); skips install/backup/restore;
no deploy-count guard (no deploy_app). main() dispatches to run_quick when a
canonical exists, else clean no-canonical fallback to COLD. Cold path byte-identical
(deps wiring intentionally mirrored, not refactored). 61 unit pass; cold untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:45:44 +01:00
b6ef83ab0b feat(2w): W1 canonical registry module (WC2) + alerts archived
runner/harness/canonical.py: data-warm canonical registry + lifecycle —
is_enrolled (recipe_meta.WARM_CANONICAL), canonical_domain (warm.stable_domain
warm-<recipe>), registry read/write (/var/lib/ci-warm/<recipe>/canonical.json),
has_canonical (record + retained volume), deploy_canonical (reattach volume at
known-good version), undeploy_keep_volume (idle data-warm), seed_canonical
(record + warmsnap snapshot). warm.stable_domain helper added (keycloak path
unchanged). +4 unit tests (61 unit pass).

Also archived the Adversary's verification alert sentinels to alerts/seen/
(simulated rollback + 2 holds — evidentiary, gate PASSED; dir clean for real alerts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:15:11 +01:00
32f00717ac fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback)
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
  <recipe>/ dir — so the reconciler's sibling last_good file survives a
  snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
  stdout) in the error; add wait_undeployed() after every undeploy so
  snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
  deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
  rollback. (57 unit pass.)

LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
    committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
    HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
    last_good NOT advanced, rollback alert written (attempted=10.7.10,
    last_good=10.7.9, recovered=True). keycloak recovered to canonical
    10.7.1+26.6.2 healthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:21:05 +01:00
07ea951f31 fix(2w): WC1.1 reconcile rolls back on deploy FAILURE too (not just unhealthy)
A broken 'latest' can fail abra's converge (deploy_version raises) rather than
deploy-then-be-unhealthy; wrap the upgrade deploy so BOTH paths trigger the
snapshot-restore rollback instead of crashing the reconcile unit.
2026-05-29 01:01:28 +01:00
a044abb298 feat(2w): W0.6 unpinned warm reconciler + WC1.2 safety gate + WC1.1 scaffold
runner/warm_reconcile.py (python, packaged into nix store, replaces bash
reconcile): UNPIN keycloak (deploy latest published version TAG; recipe fetched
at runtime -> D8 closure byte-identical). WC1.2 pre-deploy safety gate (runs
FIRST): major recipe/app-version bump OR releaseNotes manual-migration marker
-> hold-on-current + alert sentinel (no deploy churn). WC1.1 health-gated
upgrade-with-rollback: record last-good -> [keycloak: undeploy->warmsnap.snapshot
->deploy latest] -> health-gate -> commit-or-(restore+redeploy-prior+alert).
Alerts = /var/lib/ci-warm/alerts/*.json (Builder loop relays). current version
read from abra TYPE=<recipe>:<version>. CCCI_SKIP_FETCH test hook.
+8 unit tests for the version gate (56 unit pass).

Proven on cc-ci: nixos-rebuild switch -> warm-keycloak.service runs the python
reconciler -> noop-healthy (system 0-failed, /realms/master=200). WC1.2 holds
proven live: MAJOR bump -> held-major (keycloak untouched); minor+manual-
migration notes -> held-manual-migration (alert carries notes); no deploy churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:42:02 +01:00
4cc1e15a53 feat(2w): W0.5 WC3 snapshot/restore helper (warmsnap.py)
runner/harness/warmsnap.py: raw per-volume tar of an app's stack volumes while
UNDEPLOYED, under /var/lib/ci-warm/<recipe>/ (meta.json + volumes/<vol>.tar);
one last-good, atomic dir swap; restore clears+untars each volume back. Asserts
undeployed (consistency). Reused by WC1.1 (pre-upgrade keycloak snapshot) + WC5.
+5 unit tests (48 unit pass).

LIVE round-trip PROVEN on warm keycloak: create marker realm -> undeploy ->
snapshot (mariadb+providers vols) -> deploy -> delete marker (mutate DB) ->
undeploy -> restore -> deploy -> marker realm BACK; keycloak healthy. WC3 core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:12:46 +01:00
1b8d26b504 feat(2w): W0.2 live-warm keycloak dep mode in orchestrator (WC1)
- runner/harness/warm.py: stable-domain scheme (warm-<recipe>), is_warm_up
  probe, live_app_hexes scan, per-run realm_for naming, reap_orphan_realms.
- run_recipe_ci.py: split declared deps into live-warm (shared provider +
  per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy).
  Warm path used only when provider is up; cold fallback otherwise. Reap
  orphan realms at run start (concurrency-safe). deploy-count excludes warm
  deps. Realm naming now per-run namespaced (<parent>-<6hex>).
- dependent tests assert the namespaced realm pattern (stronger than ==parent).

Live proof on warm keycloak: realm create -> password-grant JWT -> discovery
issuer -> delete(idempotent) -> reap(keeps live hex, deletes orphan): PASS.
43 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:26:02 +01:00
74bf8c1723 feat(2w): W0.1 keycloak realm lifecycle primitives (WC1)
sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master),
realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms.
The per-run realm is the isolation unit on a shared live-warm keycloak;
orphans (crashed runs) reaped by hex not mapping to a live app stack.
+8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:16:48 +01:00
5b34496557 fix(2): F2-11 — SSO-dep deps-not-ready SKIP no longer yields GREEN !testme
When a DEPS-declaring recipe's setup_custom_tests fails, its @requires_deps (SSO/OIDC)
tests skip; a skip-only pytest file exits 0 so the run previously reported overall=0
(GREEN) while the only SSO test never ran (violates P7). Fix preserves generic-tier
failure-isolation but corrects the green SIGNAL:
- conftest.pytest_collection_modifyitems counts skipped requires_deps tests and appends
  to $CCCI_DEPS_SKIP_REPORT.
- run_recipe_ci: sums the count, surfaces it in RUN SUMMARY, and new pure predicate
  sso_dep_unverified(declared, deps_ready, skipped) flips overall=1.
- 7 new unit tests (tests/unit/test_f211_sso_skip.py).

Verified deploy-free (rate-limit-independent): 35/35 unit PASS; cold real-test proof on
lasuite-docs test_oidc_with_keycloak.py -> 1 skipped + skip-report==1 -> orchestrator
would set overall=1. Full e2e deferred until Docker Hub rate limit lifts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 21:25:27 +01:00
f59d8e6996 feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes
- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00
41ede13042 feat(2): refactor — SSO-dep plan refinement (deps AFTER generic + setup_custom_tests + failure isolation)
Per operator-2026-05-28 SSO-dep plan (plan-sso-dep-testing.md). Substantial orchestrator
restructuring:

NEW LIFECYCLE ORDER:
  1. Recipe deploy ALONE (no deps).
  2. install / upgrade / backup / restore — recipe-only generic tiers.
  3. setup_custom_tests step (NEW):
     a. Deploy each declared dep + provision realm/client/test-user via harness.sso.
     b. Write $CCCI_DEPS_FILE in dict shape {dep_recipe: {domain, realm, client_id, client_secret,
        admin_user, admin_password, discovery_url, token_url, ...}}.
     c. Run tests/<recipe>/setup_custom_tests.sh hook (jq-readable; wires OIDC env via abra
        secret insert + .env edits + in-place 'abra app deploy --force --chaos').
  4. CUSTOM tier with deps-ready flag; @pytest.mark.requires_deps tests skip with
     'deps-not-ready: <reason>' when setup_custom_tests fails. NON-deps custom tests still run
     normally — FAILURE ISOLATION (a DoD item per plan).
  5. Teardown: recipe first, deps in reverse declaration order.

Harness changes:
- runner/run_recipe_ci.py: deps deploy moves from BEFORE recipe deploy to AFTER restore tier.
  Adds _enrich_deps_with_sso() + _run_setup_custom_tests_hook(). DG4.1 generalised to
  'one abra app new per app' (recipe + each dep); in-place redeploys (\--force) don't count.
- runner/harness/deps.py: write_run_state + load_run_state accept dict OR list shape;
  deps_as_dict() coerces either to a recipe→entry map.
- runner/harness/sso.py: admin_password_inside() public re-export.
- tests/conftest.py: deps_creds fixture (full creds dict); deps_apps fixture flattens to
  recipe→domain string. pytest_collection_modifyitems hook skips
  \@pytest.mark.requires_deps tests when CCCI_DEPS_READY=0.
  pytest_configure registers the marker.

Recipe content:
- tests/lasuite-docs/setup_custom_tests.sh: NEW hook reads $CCCI_DEPS_FILE via jq;
  inserts oidc_rpcs secret at BUMPED version (v1→v2) since abra app new -S generates v1 first
  and Swarm forbids overwriting; updates SECRET_OIDC_RPCS_VERSION in .env; writes 9 OIDC env
  vars (REALM/DISCOVERY/AUTH/TOKEN/USERINFO/LOGOUT/JWKS/CLIENT_ID/SCOPES); ensures trailing
  newline on .env so writes don't concatenate (caught a 'TIMEOUT=900OIDC_REALM=...' bug);
  triggers in-place 'abra app deploy --force --chaos --no-input'.
- tests/lasuite-docs/functional/test_oidc_with_keycloak.py: refactored to consume deps_creds
  fixture (no longer calls setup_keycloak_realm itself — the orchestrator does it in
  setup_custom_tests). Marked \@pytest.mark.requires_deps.

Cold-verifiable on cc-ci (log /root/ccci-refactor-lasuite-r5.log):
  RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
  install: PASS, custom: 3 PASS incl. test_oidc_password_grant_against_dep_keycloak.
  deploy-count = 2 (expect 2) — DG4.1 generalised holds.
  Smoke regression: RECIPE=custom-html STAGES=install,custom → 5 PASS, deploy-count=1.

Closes DEFERRED.md #5 (lasuite-docs OIDC parity ports via this plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 19:11:42 +01:00