cc-ci

Author	SHA1	Message	Date
autonomic-bot	68954be53e	feat(harness): P5 — customization manifest (rcust) All checks were successful continuous-integration/drone/push Build is passing Details One block at run start answering "what does this recipe customize?" across every surface (non-default recipe_meta keys, ops.py pre-ops, install_steps.sh, compose.ccci.yml, lifecycle overlays by source, custom-test counts, active CCCI_SKIP_GENERIC* env overrides — !!-flagged when riding a CI run, P2c), printed to the run log and embedded verbatim in results.json under "customization". Pure presentation — building/printing it never influences a verdict; the manifest honors the HC2 repo-local gate so it never advertises code the run will not execute. Unit tests: synthetic recipe exercising every surface -> complete + deterministic + JSON-clean; HC2 invisibility; env-override flagging; render golden lines; build_results threads the dict verbatim (key always present, None when absent). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:57:26 +00:00
autonomic-bot	29a28e2028	feat(harness): P4 — custom-test ergonomics (rcust) All checks were successful continuous-integration/drone/push Build is passing Details Placement RULE: discovery.custom_tests covers ONLY functional/ + playwright/ — the top-level test_*.py glob for recipe dirs is removed (top level is reserved for lifecycle overlays; zero in-repo users of top-level custom tests, verified by sweep). Lifecycle-name exclusion inside the subdirs stays as the double-run safety net. HC2 default-deny unchanged (repo-local custom now pinned via functional/ in the gate test). New conftest fixture op_state: parses $CCCI_OP_STATE_FILE (op context: versions, artifact paths), skipping with a clear reason when unset/absent/unparseable — overlay tests read op facts from the fixture instead of hand-parsing env (zero existing hand-parsers found; the fixture is the documented path forward). deps fixture landed in P2d. Unit tests: placement-rule discovery tests (top-level custom NOT discovered; functional/playwright are; misfiled lifecycle names excluded), op_state fixture contract (reads file / skips without env / skips on missing file), deps fixture attribute sugar. Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; scripts/lint.sh -> PASS.	2026-06-10 17:14:21 +00:00
autonomic-bot	fd02d9f4b8	feat(harness): P3 — uniform ctx hook convention (rcust) All checks were successful continuous-integration/drone/push Build is passing Details harness.meta.HookCtx (frozen): .domain, .base_url, .meta (RecipeMeta), .deps (provisioned dep creds from $CCCI_DEPS_FILE or None), .op (current lifecycle op or None); built via meta.hook_ctx() at each hook call site. All recipe callables now take ctx: EXTRA_ENV(ctx), UPGRADE_EXTRA_ENV(ctx), READY_PROBE(ctx), BACKUP_VERIFY(ctx), SCREENSHOT(page, ctx), ops.py pre_<op>(ctx). Dict-valued EXTRA_ENV/UPGRADE_EXTRA_ENV unchanged (only the callable signature moved). Call sites converted: deploy_app env shaping, perform_upgrade, wait_ready_probes (gains op=), _perform_op BACKUP_VERIFY, screenshot.capture, _run_pre_hook. Legacy signatures fail FAST with a clear migration message: the registry carries hook_params per hook key, enforced at meta.load() (MetaError names the old vs new signature); ops.py pre-op hooks get the same check at the orchestrator call site (meta.check_hook_signature) — no silent TypeError mid-run. Migrated every in-repo user mechanically (17 ops.py files; cryptpad/lasuite-*/ mailu EXTRA_ENV; mumble+lasuite-drive READY_PROBE; ghost/discourse BACKUP_VERIFY) — seeded values, probes and assertions byte-identical (domain -> ctx.domain; keycloak pre_restore's meta arg -> ctx.meta). Unit tests: hook_ctx field contract, ctx.deps from the run deps file, legacy- signature MetaError (READY_PROBE/EXTRA_ENV/SCREENSHOT + pre-op checker), ctx signatures accepted. Docs table regenerated (signature docs in key docs). Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; scripts/lint.sh -> PASS.	2026-06-10 17:10:26 +00:00
autonomic-bot	8cd72fd78d	feat(harness): P2 — delete legacy customization keys & paths (rcust) All checks were successful continuous-integration/drone/push Build is passing Details a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/ compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle. provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence (kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry. b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC wiring (same env names/values as the old hook — only the timing moved); lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed from drive/meet metas + the registry. c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays as the documented dev-only escape hatch; when active in a drone CI run the orchestrator prints a loud !! warning (manifest embedding lands in P5). d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted (zero users), app_domain + _short + _wait_healthy dropped (only users were the deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture (entries expose .domain etc. as attributes; dict access intact); the 6 lasuite test files renamed deps_creds->deps (fixture name only — assertions and flows byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged. Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES updated (no assert logic or expected value touched). Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.	2026-06-10 17:01:33 +00:00
autonomic-bot	472a68b32c	feat(harness): P1 — single registry-backed meta loader (rcust) All checks were successful continuous-integration/drone/push Build is passing Details One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass, attribute access), backed by the declarative KEYS registry (14 final keys + 3 P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError (hard error at load); underscore-prefixed names recipe-private; callables only on hook-typed keys. Migrated all six legacy loaders (spec §4 L1–L6): - run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down - tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3) - lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta - deps.py::declared_deps deleted; callers read meta.DEPS - canonical.py::is_enrolled reads through meta.load() - screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2 fix; proven by unit test through the real load path) Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) + importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate, MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs §4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift fails CI. Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.	2026-06-10 16:46:58 +00:00
autonomic-bot	b6e12ef428	fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count) All checks were successful continuous-integration/drone/push Build is passing Details The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed by app domain in shared /tmp. A second run of the same domain executes its main() preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock, so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2, build 279) and the first run's end-of-run os.remove crashed the second (FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe flock. Now keyed by run id + harness pid via _run_state_path(); children receive exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing. tests/concurrency/test_run_state.py: path-invariant cases + a real-process regression (helpers.py deploy-count-run) reproducing the live interleaving — verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.	2026-06-10 08:16:09 +00:00
autonomic-bot	17ebdf39ac	feat(harness): P3 per-run ABRA_DIR — structural recipe-tree isolation, recipe flock deleted All checks were successful continuous-integration/drone/push Build is passing Details - run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared canonical path, so janitor discovery and env-based teardown are unchanged; per-domain filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ — each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation. - fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock. CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging workflow, run reads staged state). - abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions, generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but follows the per-run tree when imported by promote_canonical inside a run). - deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the must-lock-before-fetch ordering rule. - tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix required by per-run trees; no assertion/gate touched — see DECISIONS.md). - .drone.yml comments updated (HOME=/root rationale now via the servers symlink).	2026-06-10 04:18:33 +00:00
autonomic-bot	b302f3ab63	feat(harness): P2 flock-probe janitor — the kernel flock IS the liveness oracle All checks were successful continuous-integration/drone/push Build is passing Details - acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line when another run of the same domain is in flight (double-!testme serialisation). The file object is retained in module-level _held_app_locks so GC can never close the fd and silently release the lock. mtime is touched at acquisition (lock age for the long-held flag). - janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the probe lock (a new same-domain run blocks until the reap finishes), then unlink before release. Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log. - unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on the live path, and a probe that won one skips (unlinking now would hit a newer run's file). - deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard. teardown_app no longer unregisters (release is process exit). janitor() takes no args now. - post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate reap (improvement over the old 2h age fallback).	2026-06-10 04:11:31 +00:00
autonomic-bot	b492f995bd	feat(harness): P1 lock-lifetime hardening — PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard deadline All checks were successful continuous-integration/drone/push Build is passing Details - new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged and ignored (begin_teardown guard) so a second signal can't abort the running cleanup. - run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown finally: blocks (cold + quick) mark begin_teardown(). - .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of leaking it (docs/concurrency.md §8.1). - PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.	2026-06-10 04:04:28 +00:00
autonomic-bot	e6d55b53c7	fix(harness): a paused swarm update is settled — only active states block convergence All checks were successful continuous-integration/drone/push Build is passing Details continuous-integration/drone Build is passing Details `68ef0f8` made services_converged() require UpdateStatus settled, treating 'paused' as in flight. But swarm's default update-failure-action pauses the update on a single task flicker and the flag persists FOREVER (until the next update): immich CI 241 had the app service 'paused' from a restart during restore while the service was back at 1/1 and healthy — every subsequent wait hung to its deadline and the run had to be killed. Only 'updating' and 'rollback_started' now block convergence: those are the states swarm is actively driving (the 238 stop-first race lives in 'updating'). 'paused'/'rollback_paused' make no progress without intervention, so waiting on them is pointless — N/N replicas is already required, and the HTTP-health and tier assertions still gate whether the app actually works. lint: PASS, unit tests: 138 passed.	2026-06-09 23:07:36 +00:00
autonomic-bot	68ef0f84fb	fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409) Some checks reported errors continuous-integration/drone/push Build is passing Details continuous-integration/drone Build was killed Details services_converged() accepted N/N replicas as converged — but a chaos redeploy that changes a non-app service image (immich PR #2 moves the db to the vectorchord pin) registers a stop-first rolling update that swarm may not have STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies seconds later. Build 238: backupbot resolved the db hook container, the task was killed in the gap, and the pre-hook exec crashed the whole backup with a 409 -> no dump in the snapshot -> restore had nothing -> RED. - services_converged() now also requires every service's swarm UpdateStatus to be settled ('', completed, rollback_completed) — updating/paused/rollback in flight is NOT converged. Strictly stricter: no gate is weakened. - backup_app() gains a bounded (300s) settle-wait before 'abra app backup create' as defence in depth; on timeout the backup still runs and the tier's assertion delivers the verdict. lint: PASS, unit tests: 138 passed.	2026-06-09 22:10:55 +00:00
autonomic-bot	c0df77d0d9	fix(harness): make concurrent recipe runs safe (per-recipe flock + active-run registry) All checks were successful continuous-integration/drone/push Build is passing Details capacity=2 went live with three stale capacity=1-era assumptions that corrupted concurrent runs (immich 229/230 '/pg_backup.sh: No such file'): - ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/ reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken in main() BEFORE fetch_recipe and held for the whole run; the kernel releases it on any process death, so there is no stale-lock failure mode. Different recipes still run in parallel. - CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every run now registers its app domain + pid in /run/cc-ci-active/<domain> before app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone from .drone.yml. - .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the 'safe because capacity=1' comments now describe the flock+registry model. lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:25 +00:00
autonomic-bot	9a7772563a	style: repo-wide lint pass — make the lint gate green again Push builds have been RED on the lint step since ~build 209 from accumulated formatting drift. This is the mechanical cleanup: ruff format + ruff --fix (UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115 tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged attrsets, dropped unused lib args), yamllint, and shell quoting fixes in tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended; lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:15 +00:00
autonomic-bot	c51cd84159	feat(harness): intentional skips + custom-html-tiny functional test; 4-rung ladder (#6 ) Some checks failed continuous-integration/drone/push Build is failing Details Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder - recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any essential rung skipped and not listed is unintentional. Skips still cap the level (never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung. - Level ladder = the four essential rungs (install, upgrade, backup/restore, functional; top = L4). integration & recipe-local are optional, not leveled (SSO still enforced for the run verdict, unchanged). - Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL SKIP (amber); level badge gains an expected/gap? third segment. - custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares backup_restore intentionally skipped (stateless static server). Independently verified by the adversary: 138 unit tests pass cold; live full-stage run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge); clean teardown.	2026-06-09 03:12:11 +00:00
autonomic-bot	799cceb54a	fix(3 U5.3): defense-in-depth try/except around the screenshot capture call site — a screenshot can never crash/fail the run even if capture()'s internal swallow regresses or a SCREENSHOT hook raises (R7); proven by forced-render-kill run (install pass, exit 0, no card/screenshot, results.json intact) Some checks failed continuous-integration/drone/push Build is failing Details	2026-05-31 10:13:30 +00:00
autonomic-bot	afe5e51057	feat(3 U2-wiring): render summary card PNG + level badge SVG into run artifact dir (best-effort, R7; not yet served)	2026-05-31 07:03:10 +00:00
autonomic-bot	5fa15d4949	feat(3 U1): wire app screenshot capture into run_recipe_ci (best-effort, post-healthy, secret-safe; sets results.json screenshot)	2026-05-31 06:56:20 +00:00
autonomic-bot	8179d3f3f9	fix(3 U2): inline-SVG sunflower + font-safe cap line for headless card render Headless chromium has no colour-emoji font, so 🌻/🏆/⚑ rendered as tofu boxes in the PNG card. Replace with a self-contained inline-SVG sunflower + plain-text 'capped:'/'full clean climb' markers. The U3 PR comment keeps the real 🌻 emoji (Gitea markdown renders it). Pure render change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:23:13 +00:00
autonomic-bot	7217e0c98c	feat(3 U2-scaffold): summary card + level/status SVG badge renderers (offline; pure) harness/card.py: render_badge_svg/level_badge_svg (shields-style SVG, colour-by-level, R6) + render_card_html (recipe+version, level badge, per-stage/per-test ✔/✘ table, embedded screenshot, invariant flags — REPORTS results.json verbatim, never recomputes; cardinal no-inflation guardrail) + render_card_png (best-effort Playwright HTML->PNG, R7). 8 pure unit tests. Orchestrator wiring + stable-URL serving + live PNG demo come after U0 PASSes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:11:47 +00:00
autonomic-bot	daa7edd3a7	feat(3 U1-scaffold): app screenshot capture module (offline; not yet wired) harness/screenshot.py: best-effort Playwright capture of the live app (reuses harness browser). Default = landing page (credential-free, secret-safe R7); recipes needing post-login opt into a recipe-meta SCREENSHOT hook responsible for avoiding secret pages. Every failure swallowed -> None (cosmetics never block, R7). Pure helpers unit-tested. Orchestrator wiring + live demo come after U0 PASSes (avoid deploy contention with the Adversary's cold U0 re-runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 06:05:39 +00:00
autonomic-bot	52e5d210d8	feat(3 U0.2+U0.3): per-test results + results.json with computed level harness/results.py: JUnit-XML parsing (stdlib) → per-stage/per-test rows; derive_rungs (documented tier+deps/SSO → rung mapping); build_results assembles results.json {recipe,version,pr,ref,run_id, stages[],level,level_cap_reason,rungs,flags{clean_teardown,no_secret_leak},screenshot,summary_card}; write_results (atomic). run_recipe_ci.py: tiers emit --junitxml + append {tier,source,file,rc,junit} records; main() assembles+writes results.json wrapped so a failure NEVER changes the verdict (R7), incl. a narrow leak-scan of the serialised artifact. 17 new unit tests (test_results.py). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 05:55:58 +00:00
autonomic-bot	9773e3ff63	feat(3 U0.1): pure level() ladder mapper (L0-L6, gap-caps) + unit tests Phase-3 R1 foundation. harness.level.compute_level(rungs)->(level,cap_reason) with YunoHost gap-caps semantics: level = highest rung 1..L all clean PASS; first non-PASS (FAIL or N/A) caps, recorded in cap_reason. N/A caps like fail but distinctly (L5 'no integration surface' example). Helpers backup_restore_status + tier_to_rung. 16 unit tests incl U0 gate cases (L4-pass, L2-cap). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 05:46:23 +00:00
autonomic-bot	4bf9e1d43d	feat(mumble F2-14c): drop cc-ci compose.host-ports.yml fork; deploy 0.2.0 base minimally, add native host-ports on upgrade-to-latest via new UPGRADE_EXTRA_ENV harness hook + COMPOSE_FILE-aware READY_PROBE/install skip	2026-05-31 05:07:55 +00:00
autonomic-bot	2f6a6842b0	fix(2): echo abra backup output (backupbot pre-hook) into run log for diagnosis	2026-05-31 00:04:05 +00:00
autonomic-bot	4a29ca6a55	fix(2): echo abra restore output (backupbot post-hook) into run log for diagnosis	2026-05-30 23:37:55 +00:00
autonomic-bot	68a7c79668	fix(2): ghost F2-14b — harness BACKUP_VERIFY hook + retry; close the backup-capture race Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook, but if the DB container cycles mid-dump (intermittent on the loaded CI node — full5/6/7 RED, full8 green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path — abra app backup 'succeeds' yet a later restore silently loses the data (ghost ci_marker). Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain) -> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion — it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 21:30:25 +00:00
autonomic-bot	aebe93c299	fix(2): _load_meta whitelist UPGRADE_BASE_VERSION (override was silently dropped → base fell back to [-2]) The override added in `a750937` had no effect: _load_meta only copies a fixed key whitelist into the meta dict, and UPGRADE_BASE_VERSION wasn't in it, so meta.get(...) returned None and the upgrade base fell back to previous_version() = recipe_versions[-2] (0.6.3+3.1.2). Add it to the whitelist so discourse's honest 0.7.0 base is selected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 14:30:39 +01:00
autonomic-bot	a750937fb0	feat(2): discourse Q4.6 honest upgrade crossover — UPGRADE_BASE_VERSION override (base-on-[-1]) + uniform bitnamilegacy image overlay Implements the real 0.7.0+3.3.1 -> 0.8.0+3.3.1 upgrade crossover instead of a §7.1 skip-with-sign-off (Adversary leans DENY on the deferral; agreed): - recipe_meta UPGRADE_BASE_VERSION=0.7.0+3.3.1 + generic support in run_recipe_ci (prev = meta override or previous_version). Harness default [-2]=0.6.3+3.1.2 is a hollow base (img 3.1.2 != head 3.3.1); [-1]=0.7.0+3.3.1 is the PR's true predecessor and shares head's servable 3.3.1 image. - compose.ccci-health.yml re-pins services.{app,sidekiq}.image to bitnamilegacy/discourse:3.3.1 so the 0.7.0 base (compose pins 404 bitnami:3.3.1) is servable; idempotent on the head (PR already bitnamilegacy). Consumes Adversary BUILDER-INBOX (deleted), leaves ADVERSARY-INBOX ack; STATUS-2 discourse section updated. Full lifecycle run launching next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 14:20:06 +01:00
autonomic-bot	a7e2af444a	fix(2): assert_upgraded tolerate abra's '+U' working-tree marker on chaos-version A cc-ci deploy overlay sitting in the recipe checkout as an untracked file (ghost's compose.ccci-health.yml via install_steps) makes abra stamp chaos-version='<commit>+U' (U=untracked). The commit still equals head_ref (HC1 satisfied) but the '+U' broke the exact-prefix match → spurious upgrade-tier FAIL. Strip the working-tree-state marker before the commit match; HC1 preserved (commit must still equal head_ref — a stale checkout's commit would not match even after stripping). General: benefits every future cc-ci overlay recipe.	2026-05-30 05:49:27 +01:00
autonomic-bot	dd45e9555e	revert(2): drop adversary scratch probe scripts accidentally staged by git add -A (runner/adv_*.py are local-only adversary scratch, not Builder code)	2026-05-29 23:37:48 +01:00
autonomic-bot	af94708de4	review(2): resume checkpoint — no gate pending; drone block genuine (/etc/timezone still absent on host); leftover drone smoke stack flagged (housekeeping); immich P4-restore still OPEN, unsigned	2026-05-29 23:37:17 +01:00
autonomic-bot	ec76072489	fix(2): Q4.2 mumble — TCP voice-server READY_PROBE gates backup past upgrade host-port churn Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy passed while the app churned → backup-bot execed a not-running container. Fix: extend lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects}); mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up after install AND upgrade before backup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 20:19:07 +01:00
autonomic-bot	1890cb58f3	fix(2): recipe_checkout force (-f) — fixes mumble upgrade-tier checkout collision with cc-ci overlay git checkout <head_ref> aborted on the untracked install_steps-provided compose.host-ports.yml (which head_ref tracks). Force-checkout yields the exact ref tree. Also fixes the mumble restore tier: backup labels exist only in 1.0.0+, so backup/restore are meaningful only after the (now-working) upgrade moves the app to head_ref. DECISIONS.md updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 20:03:41 +01:00
autonomic-bot	999dd0d564	fix(2): Q4.2 mumble — CHAOS_BASE_DEPLOY meta flag for chaos base deploy (clean-tree gate) mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True + lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md records the full mumble enrollment design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 19:32:48 +01:00
autonomic-bot	2bf40d69d6	feat(2): HQ1 image pre-pull (plan-prepull-images.md) — warm local store before deploy lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present (zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale). Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→ pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 16:02:21 +01:00
autonomic-bot	72719fe0d7	fix(2): R014 — chaos base deploy for recipes with lightweight tags (replaces fragile origin-repoint) The origin-repoint approach hit go-git 'reference not found' (mirror HEAD→master vs main). Simpler + robust: detect lightweight version tags (has_lightweight_version_tags, read-only) and, for the pinned base deploy of such a recipe, use chaos — which SKIPS abra lint (so no R014 FATA) and deploys the EXPLICITLY-checked-out pinned version (recipe_checkout already ran; chaos uses the current checkout, so it's the prev version, NOT LATEST — F1d-2's hazard was the missing checkout). No-op / stays pinned for all-annotated recipes. The upgrade tier's prev→PR-head crossover + HC1 (chaos-version==head_ref) still hold (verified by the run's upgrade-tier log). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 14:15:07 +01:00
autonomic-bot	ad06a5dd3f	fix(2): R014 normalize — use git clone --mirror (not --bare) so abra's later fetches find refs/heads/main --bare lacked refs/heads/main, so abra's post-normalize git ops (app secret insert / deploy) failed 'unable to fetch tags: reference not found' when fetching from the repointed local origin. --mirror copies all refs (heads+tags) → abra fetch OK + R014 passes (both verified). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 14:05:26 +01:00
autonomic-bot	da44e2ca8a	fix(2): R014 normalize — repoint recipe origin to local bare with annotated tag (abra force-fetches tags before lint, reverting in-place re-annotation) Diagnosed: abra runs git fetch --tags --force from origin before its pinned-deploy lint, so re-annotating the lightweight tag in place is reverted before R014 runs. Fix: after re-annotating, clone the recipe to a local bare repo (carrying the annotated tag) and repoint origin at it, so abra's force-fetch pulls the annotated tag. Verified: abra recipe lint R014 then PASSES and the annotation sticks. Deployed commit unchanged. No-op for all-annotated recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 13:59:03 +01:00
autonomic-bot	8c19b1fadc	fix(2): normalize lightweight recipe tags to annotated before pinned deploy (R014) lasuite-meet upgrade tier failed at the prev-version base deploy: abra's pinned-deploy lint FATA'd on R014 'only annotated tags used for recipe version' because upstream coop-cloud lasuite-meet ships a stray LIGHTWEIGHT tag (0.3.0+v1.16.0). chaos deploys skip lint (so install,custom passed) but the upgrade tier's pinned prev-version deploy lints. New abra.normalize_recipe_tags() re-creates each lightweight version tag as annotated at the SAME commit (no deployed content changes); called in lifecycle.deploy_app after recipe_checkout when version is pinned. Idempotent; no-op for all-annotated recipes (lasuite-drive etc.). Helps any recipe with a stray upstream lightweight tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 13:48:55 +01:00
autonomic-bot	e1147b5fe3	fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init), even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold). Fix (cc-ci-side, stronger verification — not weaker): - abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's impatient monitor no longer FATAs (the stack spec is applied regardless). - perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade. - recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after the upgrade redeploy. No-op for recipes without a READY_PROBE. NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 11:55:53 +01:00
autonomic-bot	4b38b66fa5	fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom + OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C readiness-gating, no test weakened): - tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly. - runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op — plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default, while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing. Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md + BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section (241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated IDEAS.md/plan-sso-dep-testing.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 10:37:55 +01:00
autonomic-bot	a151489996	feat(2): lasuite-drive Q3.2a Part A — wire OIDC at INSTALL, eliminate flaky redeploy Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge). - runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write $CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to _load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for all other dep recipes (gated on `not oidc_at_install`). - tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on empty deps file (recipe still boots, OIDC test skips → F2-11 RED). - tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out). - tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True. - JOURNAL-2: Step-0 root-cause failure logs captured before the fix. NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 10:10:05 +01:00
autonomic-bot	fc6e35d617	feat(2): mattermost-lts create-message round-trip (§4.3 P3) — first-user→login→team→channel→post→read-back; harness http.post_with_headers (returns response headers, for mattermost login Token)	2026-05-29 08:31:37 +01:00
autonomic-bot	40b03a9bf1	claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred; warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md documents the full warm/quick model; --quick rollback proof already proven live (W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS, all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:43:34 +01:00
autonomic-bot	465e1059b0	claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik → serial full-cold over enrolled on latest → green promotes; skip if test active; operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix (timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live service run (repo-relative enrolled scan; util-linux for backup PTY). Live SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:33:08 +01:00
autonomic-bot	125453df20	claim(2w): WC5 promote-on-green-cold proven — green cold run advances canonical (1.10.0→1.11.0); --quick never promotes; only cold advances should_promote_canonical (enrolled+green+cold+latest) + promote_canonical (re-seed canonical at green-verified latest, snapshot+registry, old known-good replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced 1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle; per-run app torn down. WC6 nightly sweep next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 04:08:14 +01:00
autonomic-bot	e678d2e006	claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik] (stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/ file-provider config, health on routed dashboard host). keycloak path unchanged. proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good → clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed). 65 unit pass. Per operator out: code+converge delivered; destructive rollback (brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 03:50:32 +01:00
autonomic-bot	191ebde466	fix(2w): W2 --quick live-proof fixes (time import + stale-TYPE reset) 3 bugs found by the live PASS+FAIL proof on the custom-html canonical: - import time (run_quick._wait_undeployed used it → the FAIL rollback crashed with NameError before restore ran). - canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'. - run_quick FAIL rollback resets TYPE to known-good after restore (idle .env agrees with the registry). LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1, known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0). 61 unit pass; cold path untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 03:05:39 +01:00
autonomic-bot	f68e9d463f	feat(2w): W2 --quick mode in run_recipe_ci.py (WC4+WC7) run_quick(): opt-in fast lane (CCCI_QUICK=1 / MODE=quick) — reattach the data-warm canonical (canonical.deploy_canonical, known-good volume) → deps wiring (warm keycloak + per-run realm) → UPGRADE to PR head (chaos, run_lifecycle_tier 'upgrade': reconverge+moved+serving + overlay) → custom tier. PASS → undeploy_keep_volume, known-good UNCHANGED (NEVER promote); FAIL → warmsnap.restore last-known-good + undeploy (roll back, data safe). Always deletes per-run warm realm. mode=quick labelled lower-confidence (WC7); skips install/backup/restore; no deploy-count guard (no deploy_app). main() dispatches to run_quick when a canonical exists, else clean no-canonical fallback to COLD. Cold path byte-identical (deps wiring intentionally mirrored, not refactored). 61 unit pass; cold untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 02:45:44 +01:00
autonomic-bot	b6ef83ab0b	feat(2w): W1 canonical registry module (WC2) + alerts archived runner/harness/canonical.py: data-warm canonical registry + lifecycle — is_enrolled (recipe_meta.WARM_CANONICAL), canonical_domain (warm.stable_domain warm-<recipe>), registry read/write (/var/lib/ci-warm/<recipe>/canonical.json), has_canonical (record + retained volume), deploy_canonical (reattach volume at known-good version), undeploy_keep_volume (idle data-warm), seed_canonical (record + warmsnap snapshot). warm.stable_domain helper added (keycloak path unchanged). +4 unit tests (61 unit pass). Also archived the Adversary's verification alert sentinels to alerts/seen/ (simulated rollback + 2 holds — evidentiary, gate PASSED; dir clean for real alerts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 02:15:11 +01:00

1 2

87 Commits