cc-ci

Author	SHA1	Message	Date
autonomic-bot	e9c26c72af	harden(dstamp): assert_upgrade_converged waits for the NEW swarm update (StartedAt advanced) before accepting a terminal state — closes the Adversary-flagged race where a stale 'completed' from the base deploy could mask a later rollback; no-op redeploy grace preserved	2026-06-11 17:18:50 +00:00
autonomic-bot	0cc31a507e	fix(dstamp): discourse upgrade stop-first overlay (stop 2x-memory start-first OOM→spurious swarm rollback) + harness assert_upgrade_converged (detect rollback/pause → honest upgrade failure, HC1 unweakened). Root cause: failure_action:rollback reverted chaos-version label, masked by start-first+wait_healthy	2026-06-11 17:07:38 +00:00
autonomic-bot	be2026aafb	fix(harness): services_converged — a replica deficit explained entirely by Complete tasks is converged (triggered one-shot, rcust M2 lasuite-drive root cause)	2026-06-11 00:26:53 +00:00
autonomic-bot	fd02d9f4b8	feat(harness): P3 — uniform ctx hook convention (rcust) harness.meta.HookCtx (frozen): .domain, .base_url, .meta (RecipeMeta), .deps (provisioned dep creds from $CCCI_DEPS_FILE or None), .op (current lifecycle op or None); built via meta.hook_ctx() at each hook call site. All recipe callables now take ctx: EXTRA_ENV(ctx), UPGRADE_EXTRA_ENV(ctx), READY_PROBE(ctx), BACKUP_VERIFY(ctx), SCREENSHOT(page, ctx), ops.py pre_<op>(ctx). Dict-valued EXTRA_ENV/UPGRADE_EXTRA_ENV unchanged (only the callable signature moved). Call sites converted: deploy_app env shaping, perform_upgrade, wait_ready_probes (gains op=), _perform_op BACKUP_VERIFY, screenshot.capture, _run_pre_hook. Legacy signatures fail FAST with a clear migration message: the registry carries hook_params per hook key, enforced at meta.load() (MetaError names the old vs new signature); ops.py pre-op hooks get the same check at the orchestrator call site (meta.check_hook_signature) — no silent TypeError mid-run. Migrated every in-repo user mechanically (17 ops.py files; cryptpad/lasuite-*/mailu EXTRA_ENV; mumble+lasuite-drive READY_PROBE; ghost/discourse BACKUP_VERIFY) — seeded values, probes and assertions byte-identical (domain -> ctx.domain; keycloak pre_restore's meta arg -> ctx.meta). Unit tests: hook_ctx field contract, ctx.deps from the run deps file, legacy-signature MetaError (READY_PROBE/EXTRA_ENV/SCREENSHOT + pre-op checker), ctx signatures accepted. Docs table regenerated (signature docs in key docs). Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; scripts/lint.sh -> PASS.	2026-06-10 17:10:26 +00:00
autonomic-bot	8cd72fd78d	feat(harness): P2 — delete legacy customization keys & paths (rcust) a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle.provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence (kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry. b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC wiring (same env names/values as the old hook — only the timing moved); lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed from drive/meet metas + the registry. c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays as the documented dev-only escape hatch; when active in a drone CI run the orchestrator prints a loud !! warning (manifest embedding lands in P5). d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted (zero users), app_domain + _short + _wait_healthy dropped (only users were the deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture (entries expose .domain etc. as attributes; dict access intact); the 6 lasuite test files renamed deps_creds->deps (fixture name only — assertions and flows byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged. Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES updated (no assert logic or expected value touched). Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.	2026-06-10 17:01:33 +00:00
autonomic-bot	472a68b32c	feat(harness): P1 — single registry-backed meta loader (rcust) One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass, attribute access), backed by the declarative KEYS registry (14 final keys + 3 P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError (hard error at load); underscore-prefixed names recipe-private; callables only on hook-typed keys. Migrated all six legacy loaders (spec §4 L1–L6): - run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down - tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3) - lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta - deps.py::declared_deps deleted; callers read meta.DEPS - canonical.py::is_enrolled reads through meta.load() - screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2 fix; proven by unit test through the real load path) Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) + importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate, MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs §4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift fails CI. Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.	2026-06-10 16:46:58 +00:00
autonomic-bot	17ebdf39ac	feat(harness): P3 per-run ABRA_DIR — structural recipe-tree isolation, recipe flock deleted - run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared canonical path, so janitor discovery and env-based teardown are unchanged; per-domain filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ — each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation. - fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock. CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging workflow, run reads staged state). - abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions, generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but follows the per-run tree when imported by promote_canonical inside a run). - deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the must-lock-before-fetch ordering rule. - tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix required by per-run trees; no assertion/gate touched — see DECISIONS.md). - .drone.yml comments updated (HOME=/root rationale now via the servers symlink).	2026-06-10 04:18:33 +00:00
autonomic-bot	b302f3ab63	feat(harness): P2 flock-probe janitor — the kernel flock IS the liveness oracle - acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line when another run of the same domain is in flight (double-!testme serialisation). The file object is retained in module-level _held_app_locks so GC can never close the fd and silently release the lock. mtime is touched at acquisition (lock age for the long-held flag). - janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the probe lock (a new same-domain run blocks until the reap finishes), then unlink before release. Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log. - unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on the live path, and a probe that won one skips (unlinking now would hit a newer run's file). - deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard. teardown_app no longer unregisters (release is process exit). janitor() takes no args now. - post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate reap (improvement over the old 2h age fallback).	2026-06-10 04:11:31 +00:00
autonomic-bot	b492f995bd	feat(harness): P1 lock-lifetime hardening — PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard deadline - new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged and ignored (begin_teardown guard) so a second signal can't abort the running cleanup. - run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown finally: blocks (cold + quick) mark begin_teardown(). - .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of leaking it (docs/concurrency.md §8.1). - PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.	2026-06-10 04:04:28 +00:00
autonomic-bot	e6d55b53c7	fix(harness): a paused swarm update is settled — only active states block convergence `68ef0f8` made services_converged() require UpdateStatus settled, treating 'paused' as in flight. But swarm's default update-failure-action pauses the update on a single task flicker and the flag persists FOREVER (until the next update): immich CI 241 had the app service 'paused' from a restart during restore while the service was back at 1/1 and healthy — every subsequent wait hung to its deadline and the run had to be killed. Only 'updating' and 'rollback_started' now block convergence: those are the states swarm is actively driving (the 238 stop-first race lives in 'updating'). 'paused'/'rollback_paused' make no progress without intervention, so waiting on them is pointless — N/N replicas is already required, and the HTTP-health and tier assertions still gate whether the app actually works. lint: PASS, unit tests: 138 passed.	2026-06-09 23:07:36 +00:00
autonomic-bot	68ef0f84fb	fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409) services_converged() accepted N/N replicas as converged — but a chaos redeploy that changes a non-app service image (immich PR #2 moves the db to the vectorchord pin) registers a stop-first rolling update that swarm may not have STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies seconds later. Build 238: backupbot resolved the db hook container, the task was killed in the gap, and the pre-hook exec crashed the whole backup with a 409 -> no dump in the snapshot -> restore had nothing -> RED. - services_converged() now also requires every service's swarm UpdateStatus to be settled ('', completed, rollback_completed) — updating/paused/rollback in flight is NOT converged. Strictly stricter: no gate is weakened. - backup_app() gains a bounded (300s) settle-wait before 'abra app backup create' as defence in depth; on timeout the backup still runs and the tier's assertion delivers the verdict. lint: PASS, unit tests: 138 passed.	2026-06-09 22:10:55 +00:00
autonomic-bot	c0df77d0d9	fix(harness): make concurrent recipe runs safe (per-recipe flock + active-run registry) capacity=2 went live with three stale capacity=1-era assumptions that corrupted concurrent runs (immich 229/230 '/pg_backup.sh: No such file'): - ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken in main() BEFORE fetch_recipe and held for the whole run; the kernel releases it on any process death, so there is no stale-lock failure mode. Different recipes still run in parallel. - CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every run now registers its app domain + pid in /run/cc-ci-active/<domain> before app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone from .drone.yml. - .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the 'safe because capacity=1' comments now describe the flock+registry model. lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:25 +00:00
autonomic-bot	4bf9e1d43d	feat(mumble F2-14c): drop cc-ci compose.host-ports.yml fork; deploy 0.2.0 base minimally, add native host-ports on upgrade-to-latest via new UPGRADE_EXTRA_ENV harness hook + COMPOSE_FILE-aware READY_PROBE/install skip	2026-05-31 05:07:55 +00:00
autonomic-bot	ec76072489	fix(2): Q4.2 mumble — TCP voice-server READY_PROBE gates backup past upgrade host-port churn Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy passed while the app churned → backup-bot execed a not-running container. Fix: extend lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects}); mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up after install AND upgrade before backup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 20:19:07 +01:00
autonomic-bot	999dd0d564	fix(2): Q4.2 mumble — CHAOS_BASE_DEPLOY meta flag for chaos base deploy (clean-tree gate) mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True + lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md records the full mumble enrollment design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 19:32:48 +01:00
autonomic-bot	2bf40d69d6	feat(2): HQ1 image pre-pull (plan-prepull-images.md) — warm local store before deploy lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present (zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale). Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 16:02:21 +01:00
autonomic-bot	72719fe0d7	fix(2): R014 — chaos base deploy for recipes with lightweight tags (replaces fragile origin-repoint) The origin-repoint approach hit go-git 'reference not found' (mirror HEAD→master vs main). Simpler + robust: detect lightweight version tags (has_lightweight_version_tags, read-only) and, for the pinned base deploy of such a recipe, use chaos — which SKIPS abra lint (so no R014 FATA) and deploys the EXPLICITLY-checked-out pinned version (recipe_checkout already ran; chaos uses the current checkout, so it's the prev version, NOT LATEST — F1d-2's hazard was the missing checkout). No-op / stays pinned for all-annotated recipes. The upgrade tier's prev→PR-head crossover + HC1 (chaos-version==head_ref) still hold (verified by the run's upgrade-tier log). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 14:15:07 +01:00
autonomic-bot	8c19b1fadc	fix(2): normalize lightweight recipe tags to annotated before pinned deploy (R014) lasuite-meet upgrade tier failed at the prev-version base deploy: abra's pinned-deploy lint FATA'd on R014 'only annotated tags used for recipe version' because upstream coop-cloud lasuite-meet ships a stray LIGHTWEIGHT tag (0.3.0+v1.16.0). chaos deploys skip lint (so install,custom passed) but the upgrade tier's pinned prev-version deploy lints. New abra.normalize_recipe_tags() re-creates each lightweight version tag as annotated at the SAME commit (no deployed content changes); called in lifecycle.deploy_app after recipe_checkout when version is pinned. Idempotent; no-op for all-annotated recipes (lasuite-drive etc.). Helps any recipe with a stray upstream lightweight tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 13:48:55 +01:00
autonomic-bot	e1147b5fe3	fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init), even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold). Fix (cc-ci-side, stronger verification — not weaker): - abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's impatient monitor no longer FATAs (the stack spec is applied regardless). - perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade. - recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after the upgrade redeploy. No-op for recipes without a READY_PROBE. NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 11:55:53 +01:00
autonomic-bot	4b38b66fa5	fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom + OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C readiness-gating, no test weakened): - tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly. - runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op — plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default, while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing. Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md + BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section (241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated IDEAS.md/plan-sso-dep-testing.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 10:37:55 +01:00
autonomic-bot	f59d8e6996	feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes - harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md. - recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md. - lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md. - DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration (probe-before-assert: prove the ~10-service base deploy converges first). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 19:54:31 +01:00
autonomic-bot	1bd7c7a1d3	feat(2): Q4.4 ghost + DEPLOY_TIMEOUT plumb-through for heavy recipes Harness change (small, surgical): - runner/harness/lifecycle.deploy_app gains a deploy_timeout param (default 900s); passes through to abra.deploy(timeout=...). For heavy recipes (ghost, matrix-synapse, lasuite-meet), the orchestrator + dep resolver now read recipe_meta.DEPLOY_TIMEOUT and pass it so the Python subprocess wrapping abra deploy doesn't SIGKILL it before the recipe's INTERNAL TIMEOUT (via EXTRA_ENV) finishes swarm convergence. - runner/run_recipe_ci.py + runner/harness/deps.py: thread recipe_meta.DEPLOY_TIMEOUT into the per-recipe deploy_app call. Q4.4 ghost enrollment: - recipe_meta.py: HEALTH_PATH=/, DEPLOY_TIMEOUT=1200 (subprocess), EXTRA_ENV={TIMEOUT: 1200} (recipe internal). Ghost cold-start with theme + DB migration runs ~12-15min on cc-ci. - functional/test_health_check.py: GET / returns 200 (themed site). - functional/test_content_api.py: GET /ghost/api/content/settings/ returns 200 (settings JSON) or 401/403 (Ghost error envelope) — distinguishes ghost-server up + JSON API working from static fallback. - functional/test_admin_redirect.py: GET /ghost/ returns 200 or 302 + Ghost branding; proves admin route is wired through nginx proxy. - PARITY.md: recipe-maintainer corpus has no ghost tests/, Phase-2 health_check is the parity baseline; create-a-post deeper test deferred (DEFERRED.md, --extra-tests linked). Cold-verifiable (log /root/ccci-q44-ghost-r3.log): RECIPE=ghost STAGES=install,custom cc-ci-run runner/run_recipe_ci.py install + 3 functional tests PASS, deploy-count=1. 28/28 unit tests still PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 17:23:40 +01:00
autonomic-bot	6eabfdc0fb	fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data. No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS. HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race; production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:41:42 +01:00
autonomic-bot	b7e6cbd7be	feat(1e): HC3 additive generic + op/assertion split (orchestrator owns the op) - orchestrator: per mutating tier, run optional pre-op seed hook (ops.py pre_<op>) → perform the op ONCE (harness-owned) → run generic assertion (unless opted out) AND overlay assertion, both against the shared post-op deployment. Op results passed op→assertion via run-scoped CCCI_OP_STATE_FILE. - opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC (declarative). - generic.py: split do_* into op primitives (perform_upgrade/backup/restore) + assertions (assert_upgraded/backup_artifact/restore_healthy) reading op_state(); deployed_identity now returns {version,image,chaos} (chaos label ready for HC1). - generic test_<op>.py + all 6 recipe overlays migrated to assertion-only; pre-op seeding moved to per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op). - deploy-count stays 1 (op primitives never call deploy_app). lint PASS; 8 unit tests PASS on cc-ci. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 03:12:04 +01:00
autonomic-bot	feb6f80d50	fix(1d): bounded retry in _app_container (backup briefly cycles the app container) abra app backup create (backup-bot-two) stops/cycles the app container, so a mutate exec_in_app right after backup hit an empty docker ps and raised. _app_container now polls (no bare sleep) for the container to reappear within a timeout. Recipe-agnostic harness robustness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 00:06:28 +01:00
autonomic-bot	81e26a1bdc	fix(1d): F1d-2 — pinned base deploys the pinned version; upgrade is non-vacuous - deploy_app: checkout the pinned tag + deploy NON-chaos when a version is pinned (chaos only for version=None / PR-head). Was always -C, which ignored the pin and deployed LATEST -> upgrade no-op. - do_upgrade: assert the deployment actually MOVED (coop-cloud version label and/or image changed) via lifecycle.deployed_identity -> a vacuous no-op upgrade can no longer pass (DG2). - G2: migrate custom-html overlays to the assertion-only contract (override + extend-by-composition + data-continuity; split backup/restore). tests/unit/test_discovery.py proves precedence (5/5). Probe (Adversary's F1d-2 test): hedgedoc deploy-prev=1.10.7 -> upgrade=1.10.8, CHANGED=True. hedgedoc full generic lifecycle green (install/upgrade/backup/restore, deploy-count=1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 00:02:59 +01:00
autonomic-bot	6c5d8f28ea	fix(1d): G1 backup/restore + F1d-1 cert-check reframe - backup artifact: read snapshot_id from 'abra app backup create' output (snapshots needs a TTY); generic.parse_snapshot_id + do_backup assert it - restore serving race: lifecycle.http_fetch (one request -> status+body, never raises) + assert_serving is now a bounded poll (settles a post-op reconverge, no bare sleep); drop wait_serving - F1d-1 (Adversary, low): reframe served_cert/assert_serving honestly as an INFRA TLS sanity check (catches a lapsed/mis-rotated wildcard cert), NOT app-vs-fallback (Traefik serves the wildcard zone-wide); the genuine serving proof is services_converged + non-404 status. Awaiting re-test. DG1 Adversary PASS @ef44d46. G1 full-lifecycle re-verification in flight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:39:45 +01:00
autonomic-bot	ef44d4658b	feat(1d): G0 — generic install + deploy-once orchestrator (DG1 green on hedgedoc) - harness/generic.py: recipe-agnostic assert_serving (converged + real HTTP, 404-excluded + not Traefik 404 body + CA-verified trusted wildcard cert), op helpers, backup_capable detect - harness/discovery.py: per-op overlay resolution (repo-local > cc-ci > generic), custom + hook - tests/_generic/: assertion-only tiers (install/upgrade/backup/restore) on the shared deployment - run_recipe_ci.py: deploy-ONCE orchestrator, per-op summary, deploy-count guard (DG4.1) - conftest live_app fixture; lifecycle deploy-count + install-steps hook + pin DOMAIN to run domain DG1 cold-verified green on hedgedoc (pure generic, deploy-count=1, clean teardown). G0 CLAIMED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 23:27:55 +01:00
autonomic-bot	2cede01ed7	style(1b): auto-format + lint-clean the whole codebase (RL1) Mechanical, semantics-preserving cleanup so the codebase passes the new lint stage: - ruff format: all 32 Python files (wraps long signatures, normalizes quotes/blank lines). - nixpkgs-fmt: modules/drone-runner.nix. - shfmt (-i 2 -ci): scripts/*.sh. Lint fixes (reviewed, behavior-preserving — no test weakened): - ruff SIM105: try/except-pass -> contextlib.suppress (abra.py app_config rm; lifecycle.py janitor). - ruff SIM115: open().read() -> with open() (run_recipe_ci.py redaction-values + gitea-token). - statix: merge repeated sops `secrets.*` keys into one `secrets = { ... }` (comments kept); empty fn pattern `{ ... }:` -> `_:` (packages.nix). - deadnix: drop unused lambda args (flake `self`; configuration.nix `lib`; overlay `final` -> `_`). Verified on cc-ci: `scripts/lint.sh` -> lint: PASS; nixosConfigurations.cc-ci evaluates; all Python byte-compiles. The deployed bridge/dashboard/runner source changes hash (reformat), so cc-ci will be rebuilt to the new closure in W2 before the cold D1-D10 re-verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 20:52:05 +01:00
autonomic-bot	ebb4c0cbca	M6.5: enroll cryptpad (recipe #3, stateful/no-DB) + generic per-recipe EXTRA_ENV Adds a shared-harness EXTRA_ENV mechanism (recipe_meta.py dict or domain-callable), applied in deploy_app at every deploy path — no per-recipe harness surgery (D5). cryptpad uses it for its required distinct SANDBOX_DOMAIN. Tests assert data survival via a marker file in the backed-up cryptpad_data volume (exec_in_app, since cryptpad data isn't HTTP-served). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 04:41:44 +01:00
autonomic-bot	7fc26fae68	M6 (part 1): per-recipe meta + D4 recipe-local discovery + shared naming helper Recipe-agnostic harness (no surgery to enroll a recipe): recipe_meta.py for health path/codes/timeouts; run_recipe_local discovers + runs recipe-shipped tests/ against the live app. install non-regressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 01:16:29 +01:00
autonomic-bot	b7a2d70380	harness: fix A2 (janitor real-name + docker reap + age gate) and A3 (verified teardown) teardown_app now docker-stack-rm fallback, removes .env only after stack gone, retries volume rm, and verifies no residual (raises TeardownError). janitor matches the real <recipe[:4]>-<6hex> scheme + reaps env-less orphans via docker. Verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 01:05:18 +01:00
autonomic-bot	7eb0dd3c77	M5: upgrade + backup/restore stages green (custom-html); backup-bot-two oneshot 3-stage run green (install/upgrade/backup), clean teardown. backupbot deployed via reconcile oneshot; PTY (script) for abra backup/restore; -m for secret generate (no value leak). M5 CLAIMED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 00:53:16 +01:00
autonomic-bot	38a145fd9c	M4: harness + green install stage (custom-html + Playwright); guaranteed teardown; M4 CLAIMED run_recipe_ci.py + conftest + abra/lifecycle wrappers + Nix python/playwright env. deploy_app forces LETS_ENCRYPT_ENV='' (addresses A1). Short per-run domain scheme for the 64-char swarm name limit. 2 passed; teardown leaves zero orphans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 00:23:55 +01:00
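
The convergence rule that the services_converged commits above arrive at (e6d55b53c7, be2026aafb, f59d8e6996) can be sketched as a pure predicate. This is an illustrative sketch only, not the harness code: ServiceSnapshot and its field names are hypothetical, and the real function reads swarm state rather than a dataclass. It assumes the rule as the log states it: only 'updating' and 'rollback_started' block convergence, replicas:0 counts as converged when current equals desired, and a replica deficit fully explained by Complete tasks is converged.

```python
from dataclasses import dataclass

# Update states swarm is actively driving; per the log, only these block convergence.
ACTIVE_UPDATE_STATES = frozenset({"updating", "rollback_started"})


@dataclass(frozen=True)
class ServiceSnapshot:
    """Hypothetical per-service view; all field names are assumptions."""

    name: str
    running: int              # tasks currently running
    desired: int              # replicas requested by the service spec
    update_state: str = ""    # swarm UpdateStatus state, "" when no update ran
    completed_tasks: int = 0  # one-shot tasks that exited in state Complete


def service_converged(svc: ServiceSnapshot) -> bool:
    # An update swarm is still driving can kill the old task at any moment.
    if svc.update_state in ACTIVE_UPDATE_STATES:
        return False
    deficit = svc.desired - svc.running
    if deficit == 0:
        # Covers N/N and the replicas:0 one-shot case (0 == 0).
        return True
    # A shortfall is acceptable only if Complete tasks account for all of it.
    return 0 < deficit <= svc.completed_tasks


def services_converged(services) -> bool:
    return all(service_converged(s) for s in services)
```

Under these assumptions a service left in 'paused' at 1/1 counts as converged, while a 1/1 service in 'updating' does not, which matches the immich 241 and 238 cases described in the log.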

34 Commits
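
The flock-probe idea from b302f3ab63 (a non-blocking flock attempt as the liveness oracle, plus the fstat-vs-stat inode guard against the unlink/recreate race) can be sketched as below. This is a minimal sketch under stated assumptions, not the harness implementation: try_probe_lock and lock_matches_path are hypothetical names, and the real acquire_app_lock blocks and retries where this probe simply gives up. It relies only on the standard fcntl and os modules on Linux.

```python
import fcntl
import os


def lock_matches_path(fobj, path: str) -> bool:
    """True if the locked fd is still the inode that `path` names."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return False
    held = os.fstat(fobj.fileno())
    return (held.st_dev, held.st_ino) == (on_disk.st_dev, on_disk.st_ino)


def try_probe_lock(path: str):
    """Try to take an exclusive flock without blocking.

    Returns the open file object (the caller must keep it referenced, since
    closing it releases the lock), or None if another holder is live or the
    lock was won on a just-unlinked inode.
    """
    fobj = open(path, "a+")
    try:
        fcntl.flock(fobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fobj.close()  # held elsewhere: a live run owns this domain
        return None
    if not lock_matches_path(fobj, path):
        fobj.close()  # stale inode: the path now names a newer file
        return None
    return fobj
```

A janitor-style caller would treat a None result as "leave it alone" and a returned file object as "orphan, safe to reap while holding the lock", unlinking the path before closing the file.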