cc-ci

Author	SHA1	Message	Date
autonomic-bot	e6d55b53c7	fix(harness): a paused swarm update is settled — only active states block convergence `68ef0f8` made services_converged() require UpdateStatus settled, treating 'paused' as in flight. But swarm's default update-failure-action pauses the update on a single task flicker and the flag persists FOREVER (until the next update): immich CI 241 had the app service 'paused' from a restart during restore while the service was back at 1/1 and healthy — every subsequent wait hung to its deadline and the run had to be killed. Only 'updating' and 'rollback_started' now block convergence: those are the states swarm is actively driving (the 238 stop-first race lives in 'updating'). 'paused'/'rollback_paused' make no progress without intervention, so waiting on them is pointless — N/N replicas is already required, and the HTTP-health and tier assertions still gate whether the app actually works. lint: PASS, unit tests: 138 passed.	2026-06-09 23:07:36 +00:00
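The convergence rule described in e6d55b53c7 can be sketched roughly as below. This is an illustration only, not the harness code: the `ServiceState` record and the `is_converged` helper are hypothetical names, and the state strings are the ones quoted in the commit message.

```python
from dataclasses import dataclass

# States swarm is actively driving; per the commit message, only these block convergence.
ACTIVE_UPDATE_STATES = {"updating", "rollback_started"}


@dataclass
class ServiceState:  # hypothetical record, not the harness's own type
    name: str
    running: int
    desired: int
    update_state: str = ""  # '' when no update has been recorded


def is_converged(services):
    """True when every service is at N/N replicas and no update is actively in flight."""
    for svc in services:
        if svc.running != svc.desired:
            return False  # replica count is still required
        if svc.update_state in ACTIVE_UPDATE_STATES:
            return False  # swarm is still replacing tasks
    return True
```

Under this sketch a service at 1/1 with a lingering 'paused' flag counts as settled, while a service at 1/1 that is still 'updating' does not.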
autonomic-bot	68ef0f84fb	fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409) services_converged() accepted N/N replicas as converged — but a chaos redeploy that changes a non-app service image (immich PR #2 moves the db to the vectorchord pin) registers a stop-first rolling update that swarm may not have STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies seconds later. Build 238: backupbot resolved the db hook container, the task was killed in the gap, and the pre-hook exec crashed the whole backup with a 409 -> no dump in the snapshot -> restore had nothing -> RED. - services_converged() now also requires every service's swarm UpdateStatus to be settled ('', completed, rollback_completed) — updating/paused/rollback in flight is NOT converged. Strictly stricter: no gate is weakened. - backup_app() gains a bounded (300s) settle-wait before 'abra app backup create' as defence in depth; on timeout the backup still runs and the tier's assertion delivers the verdict. lint: PASS, unit tests: 138 passed.	2026-06-09 22:10:55 +00:00
autonomic-bot	c0df77d0d9	fix(harness): make concurrent recipe runs safe (per-recipe flock + active-run registry) capacity=2 went live with three stale capacity=1-era assumptions that corrupted concurrent runs (immich 229/230 '/pg_backup.sh: No such file'): - ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/ reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken in main() BEFORE fetch_recipe and held for the whole run; the kernel releases it on any process death, so there is no stale-lock failure mode. Different recipes still run in parallel. - CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every run now registers its app domain + pid in /run/cc-ci-active/<domain> before app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone from .drone.yml. - .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the 'safe because capacity=1' comments now describe the flock+registry model. lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:25 +00:00
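The per-recipe lock from c0df77d0d9 can be approximated with the standard-library `fcntl` module. The sketch below is an assumption about shape, not the harness implementation: `recipe_lock` and its `lock_dir` parameter are hypothetical, and only the lock file name pattern comes from the commit message.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def recipe_lock(recipe, lock_dir):
    """Hold an exclusive flock for `recipe` for the duration of the with-block."""
    path = os.path.join(lock_dir, f"cc-ci-recipe-{recipe}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until any other holder of the same lock file releases it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield path
    finally:
        # Closing the descriptor drops the lock; the kernel does the same
        # when the process dies, so no stale lock can be left behind.
        os.close(fd)
```

A run would wrap everything from the recipe fetch onwards in `with recipe_lock(recipe, "/run/lock"):`. Runs of different recipes use different lock files and therefore do not wait on each other.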
autonomic-bot	9a7772563a	style: repo-wide lint pass — make the lint gate green again Push builds have been RED on the lint step since ~build 209 from accumulated formatting drift. This is the mechanical cleanup: ruff format + ruff --fix (UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115 tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged attrsets, dropped unused lib args), yamllint, and shell quoting fixes in tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended; lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:15 +00:00
autonomic-bot	c51cd84159	feat(harness): intentional skips + custom-html-tiny functional test; 4-rung ladder (#6) Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder - recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any essential rung skipped and not listed is unintentional. Skips still cap the level (never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung. - Level ladder = the four essential rungs (install, upgrade, backup/restore, functional; top = L4). integration & recipe-local are optional, not leveled (SSO still enforced for the run verdict, unchanged). - Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL SKIP (amber); level badge gains an expected/gap? third segment. - custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares backup_restore intentionally skipped (stateless static server). Independently verified by the adversary: 138 unit tests pass cold; live full-stage run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge); clean teardown.	2026-06-09 03:12:11 +00:00
autonomic-bot	8179d3f3f9	fix(3 U2): inline-SVG sunflower + font-safe cap line for headless card render Headless chromium has no colour-emoji font, so 🌻/🏆/⚑ rendered as tofu boxes in the PNG card. Replace with a self-contained inline-SVG sunflower + plain-text 'capped:'/'full clean climb' markers. The U3 PR comment keeps the real 🌻 emoji (Gitea markdown renders it). Pure render change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:23:13 +00:00
autonomic-bot	7217e0c98c	feat(3 U2-scaffold): summary card + level/status SVG badge renderers (offline; pure) harness/card.py: render_badge_svg/level_badge_svg (shields-style SVG, colour-by-level, R6) + render_card_html (recipe+version, level badge, per-stage/per-test ✔/✘ table, embedded screenshot, invariant flags — REPORTS results.json verbatim, never recomputes; cardinal no-inflation guardrail) + render_card_png (best-effort Playwright HTML->PNG, R7). 8 pure unit tests. Orchestrator wiring + stable-URL serving + live PNG demo come after U0 PASSes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:11:47 +00:00
autonomic-bot	daa7edd3a7	feat(3 U1-scaffold): app screenshot capture module (offline; not yet wired) harness/screenshot.py: best-effort Playwright capture of the live app (reuses harness browser). Default = landing page (credential-free, secret-safe R7); recipes needing post-login opt into a recipe-meta SCREENSHOT hook responsible for avoiding secret pages. Every failure swallowed -> None (cosmetics never block, R7). Pure helpers unit-tested. Orchestrator wiring + live demo come after U0 PASSes (avoid deploy contention with the Adversary's cold U0 re-runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:05:39 +00:00
autonomic-bot	52e5d210d8	feat(3 U0.2+U0.3): per-test results + results.json with computed level harness/results.py: JUnit-XML parsing (stdlib) → per-stage/per-test rows; derive_rungs (documented tier+deps/SSO → rung mapping); build_results assembles results.json {recipe,version,pr,ref,run_id, stages[],level,level_cap_reason,rungs,flags{clean_teardown,no_secret_leak},screenshot,summary_card}; write_results (atomic). run_recipe_ci.py: tiers emit --junitxml + append {tier,source,file,rc,junit} records; main() assembles+writes results.json wrapped so a failure NEVER changes the verdict (R7), incl. a narrow leak-scan of the serialised artifact. 17 new unit tests (test_results.py). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 05:55:58 +00:00
autonomic-bot	9773e3ff63	feat(3 U0.1): pure level() ladder mapper (L0-L6, gap-caps) + unit tests Phase-3 R1 foundation. harness.level.compute_level(rungs)->(level,cap_reason) with YunoHost gap-caps semantics: level = highest rung 1..L all clean PASS; first non-PASS (FAIL or N/A) caps, recorded in cap_reason. N/A caps like fail but distinctly (L5 'no integration surface' example). Helpers backup_restore_status + tier_to_rung. 16 unit tests incl U0 gate cases (L4-pass, L2-cap). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 05:46:23 +00:00
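The gap-caps rule from 9773e3ff63 (level is the highest rung reached with every rung below it a clean PASS; the first non-PASS caps and is recorded) can be sketched as follows. The function name, the input shape (an ordered list of `(name, status)` pairs) and the reason string format are assumptions made for illustration; the actual `harness.level.compute_level` may differ.

```python
def compute_level_sketch(rungs):
    """Return (level, cap_reason) for an ordered ladder of (name, status) pairs.

    status is one of 'PASS', 'FAIL' or 'N/A'. The first rung that is not a
    clean PASS caps the level, and later rungs are not counted even if they
    passed.
    """
    level = 0
    for name, status in rungs:
        if status != "PASS":
            # N/A caps like FAIL, but the reason keeps the two distinguishable.
            return level, f"{name}: {status}"
        level += 1
    return level, None
```

For example, a ladder of PASS, PASS, N/A, PASS yields level 2 with the third rung recorded as the cap reason, rather than level 3.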
autonomic-bot	4bf9e1d43d	feat(mumble F2-14c): drop cc-ci compose.host-ports.yml fork; deploy 0.2.0 base minimally, add native host-ports on upgrade-to-latest via new UPGRADE_EXTRA_ENV harness hook + COMPOSE_FILE-aware READY_PROBE/install skip	2026-05-31 05:07:55 +00:00
autonomic-bot	2f6a6842b0	fix(2): echo abra backup output (backupbot pre-hook) into run log for diagnosis	2026-05-31 00:04:05 +00:00
autonomic-bot	4a29ca6a55	fix(2): echo abra restore output (backupbot post-hook) into run log for diagnosis	2026-05-30 23:37:55 +00:00
autonomic-bot	a7e2af444a	fix(2): assert_upgraded tolerate abra's '+U' working-tree marker on chaos-version A cc-ci deploy overlay sitting in the recipe checkout as an untracked file (ghost's compose.ccci-health.yml via install_steps) makes abra stamp chaos-version='<commit>+U' (U=untracked). The commit still equals head_ref (HC1 satisfied) but the '+U' broke the exact-prefix match → spurious upgrade-tier FAIL. Strip the working-tree-state marker before the commit match; HC1 preserved (commit must still equal head_ref — a stale checkout's commit would not match even after stripping). General: benefits every future cc-ci overlay recipe.	2026-05-30 05:49:27 +01:00
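The marker handling in a7e2af444a amounts to stripping the working-tree suffix before comparing commits. A minimal sketch, assuming the label has the form `<commit>` or `<commit>+U` as quoted in the message, and assuming either side may be an abbreviated hash; `commit_matches_head` is a hypothetical helper, not the harness's `assert_upgraded`.

```python
def commit_matches_head(chaos_version, head_ref):
    """Compare the commit part of a chaos-version label against head_ref.

    Everything from the first '+' onwards (e.g. the '+U' untracked-files
    marker) is treated as working-tree state and ignored.
    """
    commit = chaos_version.partition("+")[0]
    if not commit or not head_ref:
        return False  # an empty commit must never match
    # Assumption: one side may be a shortened hash of the other.
    return head_ref.startswith(commit) or commit.startswith(head_ref)
```

A stale checkout still fails this check, because its commit differs from `head_ref` whether or not the marker is present.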
autonomic-bot	ec76072489	fix(2): Q4.2 mumble — TCP voice-server READY_PROBE gates backup past upgrade host-port churn Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy passed while the app churned → backup-bot execed a not-running container. Fix: extend lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects}); mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up after install AND upgrade before backup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 20:19:07 +01:00
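A TCP readiness probe with a "stable = N consecutive connects" rule, as described in ec76072489, could look roughly like this. The function name, the timing parameters and their defaults are assumptions; only the idea of requiring N successful connects in a row comes from the commit message.

```python
import socket
import time


def wait_tcp_stable(host, port, stable=3, deadline_s=60.0,
                    interval_s=1.0, connect_timeout_s=2.0):
    """Return True once `stable` consecutive TCP connects succeed, else False at the deadline."""
    streak = 0
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                streak += 1
        except OSError:
            streak = 0  # any failed connect resets the run of successes
        if streak >= stable:
            return True
        time.sleep(interval_s)
    return False
```

Requiring a streak rather than a single success is what guards against a port that answers briefly while the old task is being replaced.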
autonomic-bot	1890cb58f3	fix(2): recipe_checkout force (-f) — fixes mumble upgrade-tier checkout collision with cc-ci overlay git checkout <head_ref> aborted on the untracked install_steps-provided compose.host-ports.yml (which head_ref tracks). Force-checkout yields the exact ref tree. Also fixes the mumble restore tier: backup labels exist only in 1.0.0+, so backup/restore are meaningful only after the (now-working) upgrade moves the app to head_ref. DECISIONS.md updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 20:03:41 +01:00
autonomic-bot	999dd0d564	fix(2): Q4.2 mumble — CHAOS_BASE_DEPLOY meta flag for chaos base deploy (clean-tree gate) mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True + lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md records the full mumble enrollment design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 19:32:48 +01:00
autonomic-bot	2bf40d69d6	feat(2): HQ1 image pre-pull (plan-prepull-images.md) — warm local store before deploy lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present (zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale). Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→ pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 16:02:21 +01:00
autonomic-bot	72719fe0d7	fix(2): R014 — chaos base deploy for recipes with lightweight tags (replaces fragile origin-repoint) The origin-repoint approach hit go-git 'reference not found' (mirror HEAD→master vs main). Simpler + robust: detect lightweight version tags (has_lightweight_version_tags, read-only) and, for the pinned base deploy of such a recipe, use chaos — which SKIPS abra lint (so no R014 FATA) and deploys the EXPLICITLY-checked-out pinned version (recipe_checkout already ran; chaos uses the current checkout, so it's the prev version, NOT LATEST — F1d-2's hazard was the missing checkout). No-op / stays pinned for all-annotated recipes. The upgrade tier's prev→PR-head crossover + HC1 (chaos-version==head_ref) still hold (verified by the run's upgrade-tier log). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 14:15:07 +01:00
autonomic-bot	ad06a5dd3f	fix(2): R014 normalize — use git clone --mirror (not --bare) so abra's later fetches find refs/heads/main --bare lacked refs/heads/main, so abra's post-normalize git ops (app secret insert / deploy) failed 'unable to fetch tags: reference not found' when fetching from the repointed local origin. --mirror copies all refs (heads+tags) → abra fetch OK + R014 passes (both verified). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 14:05:26 +01:00
autonomic-bot	da44e2ca8a	fix(2): R014 normalize — repoint recipe origin to local bare with annotated tag (abra force-fetches tags before lint, reverting in-place re-annotation) Diagnosed: abra runs git fetch --tags --force from origin before its pinned-deploy lint, so re-annotating the lightweight tag in place is reverted before R014 runs. Fix: after re-annotating, clone the recipe to a local bare repo (carrying the annotated tag) and repoint origin at it, so abra's force-fetch pulls the annotated tag. Verified: abra recipe lint R014 then PASSES and the annotation sticks. Deployed commit unchanged. No-op for all-annotated recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 13:59:03 +01:00
autonomic-bot	8c19b1fadc	fix(2): normalize lightweight recipe tags to annotated before pinned deploy (R014) lasuite-meet upgrade tier failed at the prev-version base deploy: abra's pinned-deploy lint FATA'd on R014 'only annotated tags used for recipe version' because upstream coop-cloud lasuite-meet ships a stray LIGHTWEIGHT tag (0.3.0+v1.16.0). chaos deploys skip lint (so install,custom passed) but the upgrade tier's pinned prev-version deploy lints. New abra.normalize_recipe_tags() re-creates each lightweight version tag as annotated at the SAME commit (no deployed content changes); called in lifecycle.deploy_app after recipe_checkout when version is pinned. Idempotent; no-op for all-annotated recipes (lasuite-drive etc.). Helps any recipe with a stray upstream lightweight tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 13:48:55 +01:00
autonomic-bot	e1147b5fe3	fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init), even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold). Fix (cc-ci-side, stronger verification — not weaker): - abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's impatient monitor no longer FATAs (the stack spec is applied regardless). - perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade. - recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after the upgrade redeploy. No-op for recipes without a READY_PROBE. NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 11:55:53 +01:00
autonomic-bot	4b38b66fa5	fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom + OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C readiness-gating, no test weakened): - tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly. - runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op — plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default, while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing. Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md + BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section (241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated IDEAS.md/plan-sso-dep-testing.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 10:37:55 +01:00
autonomic-bot	fc6e35d617	feat(2): mattermost-lts create-message round-trip (§4.3 P3) — first-user→login→team→channel→post→read-back; harness http.post_with_headers (returns response headers, for mattermost login Token)	2026-05-29 08:31:37 +01:00
autonomic-bot	40b03a9bf1	claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred; warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md documents the full warm/quick model; --quick rollback proof already proven live (W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS, all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:43:34 +01:00
autonomic-bot	465e1059b0	claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik → serial full-cold over enrolled on latest → green promotes; skip if test active; operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix (timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live service run (repo-relative enrolled scan; util-linux for backup PTY). Live SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:33:08 +01:00
autonomic-bot	191ebde466	fix(2w): W2 --quick live-proof fixes (time import + stale-TYPE reset) 3 bugs found by the live PASS+FAIL proof on the custom-html canonical: - import time (run_quick._wait_undeployed used it → the FAIL rollback crashed with NameError before restore ran). - canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'. - run_quick FAIL rollback resets TYPE to known-good after restore (idle .env agrees with the registry). LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1, known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0). 61 unit pass; cold path untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 03:05:39 +01:00
autonomic-bot	b6ef83ab0b	feat(2w): W1 canonical registry module (WC2) + alerts archived runner/harness/canonical.py: data-warm canonical registry + lifecycle — is_enrolled (recipe_meta.WARM_CANONICAL), canonical_domain (warm.stable_domain warm-<recipe>), registry read/write (/var/lib/ci-warm/<recipe>/canonical.json), has_canonical (record + retained volume), deploy_canonical (reattach volume at known-good version), undeploy_keep_volume (idle data-warm), seed_canonical (record + warmsnap snapshot). warm.stable_domain helper added (keycloak path unchanged). +4 unit tests (61 unit pass). Also archived the Adversary's verification alert sentinels to alerts/seen/ (simulated rollback + 2 holds — evidentiary, gate PASSED; dir clean for real alerts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 02:15:11 +01:00
autonomic-bot	32f00717ac	fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback) Bugs found by the live proof, fixed: - warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole <recipe>/ dir — so the reconciler's sibling last_good file survives a snapshot swap (was being clobbered). - warm_reconcile: deploy_version captures abra's stdout (it writes FATA to stdout) in the error; add wait_undeployed() after every undeploy so snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers rollback. (57 unit pass.) LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good committed=10.7.9, marker realm preserved. (b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore), last_good NOT advanced, rollback alert written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak recovered to canonical 10.7.1+26.6.2 healthy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 01:21:05 +01:00
autonomic-bot	4cc1e15a53	feat(2w): W0.5 WC3 snapshot/restore helper (warmsnap.py) runner/harness/warmsnap.py: raw per-volume tar of an app's stack volumes while UNDEPLOYED, under /var/lib/ci-warm/<recipe>/ (meta.json + volumes/<vol>.tar); one last-good, atomic dir swap; restore clears+untars each volume back. Asserts undeployed (consistency). Reused by WC1.1 (pre-upgrade keycloak snapshot) + WC5. +5 unit tests (48 unit pass). LIVE round-trip PROVEN on warm keycloak: create marker realm -> undeploy -> snapshot (mariadb+providers vols) -> deploy -> delete marker (mutate DB) -> undeploy -> restore -> deploy -> marker realm BACK; keycloak healthy. WC3 core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 00:12:46 +01:00
autonomic-bot	1b8d26b504	feat(2w): W0.2 live-warm keycloak dep mode in orchestrator (WC1) - runner/harness/warm.py: stable-domain scheme (warm-<recipe>), is_warm_up probe, live_app_hexes scan, per-run realm_for naming, reap_orphan_realms. - run_recipe_ci.py: split declared deps into live-warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy). Warm path used only when provider is up; cold fallback otherwise. Reap orphan realms at run start (concurrency-safe). deploy-count excludes warm deps. Realm naming now per-run namespaced (<parent>-<6hex>). - dependent tests assert the namespaced realm pattern (stronger than ==parent). Live proof on warm keycloak: realm create -> password-grant JWT -> discovery issuer -> delete(idempotent) -> reap(keeps live hex, deletes orphan): PASS. 43 unit pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 23:26:02 +01:00
autonomic-bot	74bf8c1723	feat(2w): W0.1 keycloak realm lifecycle primitives (WC1) sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master), realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms. The per-run realm is the isolation unit on a shared live-warm keycloak; orphans (crashed runs) reaped by hex not mapping to a live app stack. +8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 23:16:48 +01:00
autonomic-bot	f59d8e6996	feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes - harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md. - recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md. - lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md. - DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration (probe-before-assert: prove the ~10-service base deploy converges first). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 19:54:31 +01:00
autonomic-bot	41ede13042	feat(2): refactor — SSO-dep plan refinement (deps AFTER generic + setup_custom_tests + failure isolation) Per operator-2026-05-28 SSO-dep plan (plan-sso-dep-testing.md). Substantial orchestrator restructuring: NEW LIFECYCLE ORDER: 1. Recipe deploy ALONE (no deps). 2. install / upgrade / backup / restore — recipe-only generic tiers. 3. setup_custom_tests step (NEW): a. Deploy each declared dep + provision realm/client/test-user via harness.sso. b. Write $CCCI_DEPS_FILE in dict shape {dep_recipe: {domain, realm, client_id, client_secret, admin_user, admin_password, discovery_url, token_url, ...}}. c. Run tests/<recipe>/setup_custom_tests.sh hook (jq-readable; wires OIDC env via abra secret insert + .env edits + in-place 'abra app deploy --force --chaos'). 4. CUSTOM tier with deps-ready flag; @pytest.mark.requires_deps tests skip with 'deps-not-ready: <reason>' when setup_custom_tests fails. NON-deps custom tests still run normally — FAILURE ISOLATION (a DoD item per plan). 5. Teardown: recipe first, deps in reverse declaration order. Harness changes: - runner/run_recipe_ci.py: deps deploy moves from BEFORE recipe deploy to AFTER restore tier. Adds _enrich_deps_with_sso() + _run_setup_custom_tests_hook(). DG4.1 generalised to 'one abra app new per app' (recipe + each dep); in-place redeploys (--force) don't count. - runner/harness/deps.py: write_run_state + load_run_state accept dict OR list shape; deps_as_dict() coerces either to a recipe→entry map. - runner/harness/sso.py: admin_password_inside() public re-export. - tests/conftest.py: deps_creds fixture (full creds dict); deps_apps fixture flattens to recipe→domain string. pytest_collection_modifyitems hook skips @pytest.mark.requires_deps tests when CCCI_DEPS_READY=0. pytest_configure registers the marker. Recipe content: - tests/lasuite-docs/setup_custom_tests.sh: NEW hook reads $CCCI_DEPS_FILE via jq; inserts oidc_rpcs secret at BUMPED version (v1→v2) since abra app new -S generates v1 first and Swarm forbids overwriting; updates SECRET_OIDC_RPCS_VERSION in .env; writes 9 OIDC env vars (REALM/DISCOVERY/AUTH/TOKEN/USERINFO/LOGOUT/JWKS/CLIENT_ID/SCOPES); ensures trailing newline on .env so writes don't concatenate (caught a 'TIMEOUT=900OIDC_REALM=...' bug); triggers in-place 'abra app deploy --force --chaos --no-input'. - tests/lasuite-docs/functional/test_oidc_with_keycloak.py: refactored to consume deps_creds fixture (no longer calls setup_keycloak_realm itself — the orchestrator does it in setup_custom_tests). Marked @pytest.mark.requires_deps. Cold-verifiable on cc-ci (log /root/ccci-refactor-lasuite-r5.log): RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py install: PASS, custom: 3 PASS incl. test_oidc_password_grant_against_dep_keycloak. deploy-count = 2 (expect 2) — DG4.1 generalised holds. Smoke regression: RECIPE=custom-html STAGES=install,custom → 5 PASS, deploy-count=1. Closes DEFERRED.md #5 (lasuite-docs OIDC parity ports via this plan). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 19:11:42 +01:00
autonomic-bot	1bd7c7a1d3	feat(2): Q4.4 ghost + DEPLOY_TIMEOUT plumb-through for heavy recipes Harness change (small, surgical): - runner/harness/lifecycle.deploy_app gains a deploy_timeout param (default 900s); passes through to abra.deploy(timeout=...). For heavy recipes (ghost, matrix-synapse, lasuite-meet), the orchestrator + dep resolver now read recipe_meta.DEPLOY_TIMEOUT and pass it so the Python subprocess wrapping abra deploy doesn't SIGKILL it before the recipe's INTERNAL TIMEOUT (via EXTRA_ENV) finishes swarm convergence. - runner/run_recipe_ci.py + runner/harness/deps.py: thread recipe_meta.DEPLOY_TIMEOUT into the per-recipe deploy_app call. Q4.4 ghost enrollment: - recipe_meta.py: HEALTH_PATH=/, DEPLOY_TIMEOUT=1200 (subprocess), EXTRA_ENV={TIMEOUT: 1200} (recipe internal). Ghost cold-start with theme + DB migration runs ~12-15min on cc-ci. - functional/test_health_check.py: GET / returns 200 (themed site). - functional/test_content_api.py: GET /ghost/api/content/settings/ returns 200 (settings JSON) or 401/403 (Ghost error envelope) — distinguishes ghost-server up + JSON API working from static fallback. - functional/test_admin_redirect.py: GET /ghost/ returns 200 or 302 + Ghost branding; proves admin route is wired through nginx proxy. - PARITY.md: recipe-maintainer corpus has no ghost tests/, Phase-2 health_check is the parity baseline; create-a-post deeper test deferred (DEFERRED.md, --extra-tests linked). Cold-verifiable (log /root/ccci-q44-ghost-r3.log): RECIPE=ghost STAGES=install,custom cc-ci-run runner/run_recipe_ci.py install + 3 functional tests PASS, deploy-count=1. 28/28 unit tests still PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 17:23:40 +01:00
autonomic-bot	c6e94af766	fix(2): F2-5 — dep teardown verify=True, errors propagate to run-fail (Adversary cold) Per REVIEW-2 ## Q2 FAIL: runner/harness/deps.py::teardown_deps suppressed ALL exceptions via contextlib.suppress(Exception), silently swallowing teardown failures. The 'DEPS teardown' print fired even when undeploy actually raised — leaving leftover swarm services/volumes/secrets that broke the NEXT run targeting the same deterministic dep domain (this is what caused the Q3.1 dep flake I saw immediately after the Q2.4 acceptance run). Fix: - runner/harness/deps.py: teardown_deps now uses lifecycle.teardown_app(..., verify=True) so residuals raise TeardownError. Errors are LOGGED LOUDLY per-dep but we continue to other deps so one failure doesn't strand the rest. After all attempts: raise a combined TeardownError if any dep failed. - runner/run_recipe_ci.py: orchestrator catches the dep TeardownError in finally, prints it, captures into dep_teardown_error; the run summary surfaces it and the exit code is non-zero. The run STILL prints the diagnosable summary so a leak doesn't hide other failures. Per §9 teardown sacred / DG7: a green run that leaks state is not 'green'. F2-5 now correctly fails the run instead of silently passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 09:00:37 +01:00
autonomic-bot	47f7cb47c2	fix(2): F2-3 systemic — harness.browser.goto_with_retry; applied to all install overlays Phase 2 lesson from F2-3 (n8n install Playwright flake on net::ERR_NETWORK_CHANGED): every install overlay that does page.goto needs the same try/except PlaywrightError + status retry. Centralize in runner/harness/browser.py::goto_with_retry; apply to ALL install overlays. - runner/harness/browser.py: shared helper. Polls page.goto until status in accept_statuses; catches PlaywrightError (net::ERR_*) as a retryable signal, not a failure. Raises AssertionError with last_status + last_err diagnostic only on deadline expiry. - tests/custom-html/test_install.py: now uses goto_with_retry (200 only, wait_until=load). - tests/custom-html/playwright/test_browser_smoke.py: same. - tests/n8n/test_install.py: replaced inline retry loop with goto_with_retry (200, 304). - tests/keycloak/test_install.py: goto_with_retry for admin console (200, 302, 303; 45s goto). - tests/cryptpad/test_install.py: goto_with_retry (200, 304; 60s goto, wait_until=load). - tests/lasuite-docs/test_install.py: goto_with_retry (200, 301, 302; 60s goto). Cold-verifiable: ssh cc-ci 'RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py' all 5 stages PASS (including the install overlay that flaked in the deps_smoke run), deploy-count=1, head_ref=8a026066==chaos-version=8a026066 (HC1 non-vacuous). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 07:46:34 +01:00
autonomic-bot	4d6b040ba7	feat(2): Q2.3 — dep resolver + SSO-setup harness primitives - runner/harness/deps.py: dep resolver primitive (Phase 2 §4.2 / Q2.3). - declared_deps(recipe) reads DEPS list from tests/<recipe>/recipe_meta.py - dep_domain(parent, pr, ref, dep) — per-run domain per (parent, dep) pair so two recipes' deps of the same kind don't collide on a host - deploy_deps / teardown_deps — sequential deploy + reverse-order teardown - read/write of run-scoped $CCCI_DEPS_FILE - runner/harness/sso.py: SSO-setup / OIDC-flow primitive (Phase 2 §4.2 / Q2.3). - setup_keycloak_realm: idempotent realm + confidential OIDC client + test user with generated 25-char alphanumeric password (class-B per §4.4-B); returns SsoCreds dict with discovery_url, token_url, all identifiers. - oidc_password_grant: exercises the password-grant OIDC flow; returns access_token (a JWT) or raises. - assert_discovery_endpoint: GET /.well-known/openid-configuration; asserts issuer matches the per-run provider domain+realm. - runner/run_recipe_ci.py: wired in dep deploy BEFORE recipe-under-test, dep teardown LAST in finally (reverse order). DG4.1 deploy-count guard now expects 1 + len(deps_state) — accommodates declared deps without breaking the no-extra-deploys invariant. - tests/conftest.py: deps_apps fixture reads $CCCI_DEPS_FILE -> dict mapping dep_recipe -> dep_domain. - tests/unit/test_deps.py: 7 unit tests covering declared_deps parsing, per-(parent,dep) domain distinctness, run-state JSON write/load, env-var no-op semantics. 28/28 unit tests PASS on cc-ci. Smoke test confirmed deploy_count == expected (1) when no deps declared (custom-html install run, log /root/ccci-q2-deps-smoke.log). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 07:41:56 +01:00
autonomic-bot	0d0fc6c4bc	feat(2): Q0.1/Q0.2 — harness.http + discovery recurses functional/playwright (Phase 2) - runner/harness/http.py: canonical Phase-2 recipe-test HTTP API (vendored from recipe-maintainer/utils/tests/helpers.py): http_get/http_post, retry variants, wait_for_http, assert_converges. JSON-parsing, header support, form/JSON POST bodies, transport-failure -> status=0. Self-contained (cc-ci does not import recipe-maintainer at runtime per DECISIONS Phase 2). - harness.discovery.custom_tests now also recurses into tests/<recipe>/{functional,playwright}/test_*.py (Phase 2 §4.1 layout) while excluding lifecycle test_<op>.py names and honoring the HC2 repo-local gate. - Unit tests: tests/unit/test_http.py — in-process http.server fixture; deterministic proofs of parsing/retry/convergence semantics, no network egress. tests/unit/test_discovery_phase2.py — functional/+playwright/ recursion + HC2 gate still applies to subdirs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:36:49 +01:00
autonomic-bot	6eabfdc0fb	fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data. No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS. HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race; production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:41:42 +01:00
autonomic-bot	b7e6cbd7be	feat(1e): HC3 additive generic + op/assertion split (orchestrator owns the op) - orchestrator: per mutating tier, run optional pre-op seed hook (ops.py pre_<op>) → perform the op ONCE (harness-owned) → run generic assertion (unless opted out) AND overlay assertion, both against the shared post-op deployment. Op results passed op→assertion via run-scoped CCCI_OP_STATE_FILE. - opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC (declarative). - generic.py: split do_* into op primitives (perform_upgrade/backup/restore) + assertions (assert_upgraded/backup_artifact/restore_healthy) reading op_state(); deployed_identity now returns {version,image,chaos} (chaos label ready for HC1). - generic test_<op>.py + all 6 recipe overlays migrated to assertion-only; pre-op seeding moved to per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op). - deploy-count stays 1 (op primitives never call deploy_app). lint PASS; 8 unit tests PASS on cc-ci. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:12:04 +01:00
autonomic-bot	d38a695fa3	feat(1e): HC2 repo-local approval allowlist (default-deny) + discovery gate - tests/repo-local-approved.txt (empty ⇒ default-deny); CCCI_REPO_LOCAL_APPROVED_FILE override. - discovery: repo_local_approved()/_gated() centralize the gate; resolve_overlay_op + generic_op (HC3 additive split); custom_tests/install_steps/pre_op_hook all honor the gate. - unit tests rewritten for approved-vs-not + the generic floor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 02:55:58 +01:00
autonomic-bot	feb6f80d50	fix(1d): bounded retry in _app_container (backup briefly cycles the app container) abra app backup create (backup-bot-two) stops/cycles the app container, so a mutate exec_in_app right after backup hit an empty docker ps and raised. _app_container now polls (no bare sleep) for the container to reappear within a timeout. Recipe-agnostic harness robustness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 00:06:28 +01:00
autonomic-bot	81e26a1bdc	fix(1d): F1d-2 — pinned base deploys the pinned version; upgrade is non-vacuous - deploy_app: checkout the pinned tag + deploy NON-chaos when a version is pinned (chaos only for version=None / PR-head). Was always -C, which ignored the pin and deployed LATEST -> upgrade no-op. - do_upgrade: assert the deployment actually MOVED (coop-cloud version label and/or image changed) via lifecycle.deployed_identity -> a vacuous no-op upgrade can no longer pass (DG2). - G2: migrate custom-html overlays to the assertion-only contract (override + extend-by-composition + data-continuity; split backup/restore). tests/unit/test_discovery.py proves precedence (5/5). Probe (Adversary's F1d-2 test): hedgedoc deploy-prev=1.10.7 -> upgrade=1.10.8, CHANGED=True. hedgedoc full generic lifecycle green (install/upgrade/backup/restore, deploy-count=1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 00:02:59 +01:00
autonomic-bot	6c5d8f28ea	fix(1d): G1 backup/restore + F1d-1 cert-check reframe - backup artifact: read snapshot_id from 'abra app backup create' output (snapshots needs a TTY); generic.parse_snapshot_id + do_backup assert it - restore serving race: lifecycle.http_fetch (one request -> status+body, never raises) + assert_serving is now a bounded poll (settles a post-op reconverge, no bare sleep); drop wait_serving - F1d-1 (Adversary, low): reframe served_cert/assert_serving honestly as an INFRA TLS sanity check (catches a lapsed/mis-rotated wildcard cert), NOT app-vs-fallback (Traefik serves the wildcard zone-wide); the genuine serving proof is services_converged + non-404 status. Awaiting re-test. DG1 Adversary PASS @ef44d46. G1 full-lifecycle re-verification in flight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:39:45 +01:00
autonomic-bot	ef44d4658b	feat(1d): G0 — generic install + deploy-once orchestrator (DG1 green on hedgedoc) - harness/generic.py: recipe-agnostic assert_serving (converged + real HTTP, 404-excluded + not Traefik 404 body + CA-verified trusted wildcard cert), op helpers, backup_capable detect - harness/discovery.py: per-op overlay resolution (repo-local > cc-ci > generic), custom + hook - tests/_generic/: assertion-only tiers (install/upgrade/backup/restore) on the shared deployment - run_recipe_ci.py: deploy-ONCE orchestrator, per-op summary, deploy-count guard (DG4.1) - conftest live_app fixture; lifecycle deploy-count + install-steps hook + pin DOMAIN to run domain DG1 cold-verified green on hedgedoc (pure generic, deploy-count=1, clean teardown). G0 CLAIMED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:27:55 +01:00
autonomic-bot	2cede01ed7	style(1b): auto-format + lint-clean the whole codebase (RL1) Mechanical, semantics-preserving cleanup so the codebase passes the new lint stage: - ruff format: all 32 Python files (wraps long signatures, normalizes quotes/blank lines). - nixpkgs-fmt: modules/drone-runner.nix. - shfmt (-i 2 -ci): scripts/.sh. Lint fixes (reviewed, behavior-preserving — no test weakened): - ruff SIM105: try/except-pass -> contextlib.suppress (abra.py app_config rm; lifecycle.py janitor). - ruff SIM115: open().read() -> with open() (run_recipe_ci.py redaction-values + gitea-token). - statix: merge repeated sops `secrets.` keys into one `secrets = { ... }` (comments kept); empty fn pattern `{ ... }:` -> `_:` (packages.nix). - deadnix: drop unused lambda args (flake `self`; configuration.nix `lib`; overlay `final` -> `_`). Verified on cc-ci: `scripts/lint.sh` -> lint: PASS; nixosConfigurations.cc-ci evaluates; all Python byte-compiles. The deployed bridge/dashboard/runner source changes hash (reformat), so cc-ci will be rebuilt to the new closure in W2 before the cold D1-D10 re-verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 20:52:05 +01:00
autonomic-bot	575efb5054	fix: abra app upgrade -c (no-converge-checks) — abra false-fails slow heavy rolling upgrades Diagnosed via instrumented diag: lasuite-docs upgrade reported 'FATA deploy failed' while all 9 services converged 1/1 — abra's convergence poll gives up too early on the slow stop-first roll (pulling new images). Disable abra's check; the harness wait_healthy + data-survival assertion is the real, more-patient gate (a genuine failure still fails the test: app never gets healthy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:34:59 +01:00
autonomic-bot	4d5f7e25c6	fix: abra app upgrade -o (offline) — was 401'ing fetching tags from the private mirror origin Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 08:31:40 +01:00

59 Commits