68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.
Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.
lint: PASS, unit tests: 138 passed.
services_converged() accepted N/N replicas as converged — but a chaos redeploy
that changes a non-app service image (immich PR #2 moves the db to the
vectorchord pin) registers a stop-first rolling update that swarm may not have
STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies
seconds later. Build 238: backupbot resolved the db hook container, the task
was killed in the gap, and the pre-hook exec crashed the whole backup with a
409 -> no dump in the snapshot -> restore had nothing -> RED.
- services_converged() now also requires every service's swarm UpdateStatus to
be settled ('', completed, rollback_completed) — updating/paused/rollback in
flight is NOT converged. Strictly stricter: no gate is weakened.
- backup_app() gains a bounded (300s) settle-wait before 'abra app backup
create' as defence in depth; on timeout the backup still runs and the tier's
assertion delivers the verdict.
lint: PASS, unit tests: 138 passed.
capacity=2 went live with three stale capacity=1-era assumptions that corrupted
concurrent runs (immich 229/230 '/pg_backup.sh: No such file'):
- ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/
reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now
serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken
in main() BEFORE fetch_recipe and held for the whole run; the kernel releases
it on any process death, so there is no stale-lock failure mode. Different
recipes still run in parallel.
- CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every
run now registers its app domain + pid in /run/cc-ci-active/<domain> before
app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci
process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or
post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone
from .drone.yml.
- .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the
'safe because capacity=1' comments now describe the flock+registry model.
lint: PASS, unit tests: 138 passed.
Push builds have been RED on the lint step since ~build 209 from accumulated
formatting drift. This is the mechanical cleanup: ruff format + ruff --fix
(UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115
tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged
attrsets, dropped unused lib args), yamllint, and shell quoting fixes in
tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended;
lint: PASS, unit tests: 138 passed.
Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder
- recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any
essential rung skipped and not listed is unintentional. Skips still cap the level
(never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung.
- Level ladder = the four essential rungs (install, upgrade, backup/restore,
functional; top = L4). integration & recipe-local are optional, not leveled
(SSO still enforced for the run verdict, unchanged).
- Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL
SKIP (amber); level badge gains an expected/gap? third segment.
- custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares
backup_restore intentionally skipped (stateless static server).
Independently verified by the adversary: 138 unit tests pass cold; live full-stage
run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge);
clean teardown.
Headless chromium has no colour-emoji font, so 🌻/🏆/⚑ rendered as tofu boxes in the PNG card.
Replace with a self-contained inline-SVG sunflower + plain-text 'capped:'/'full clean climb' markers.
The U3 PR comment keeps the real 🌻 emoji (Gitea markdown renders it). Pure render change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
harness/screenshot.py: best-effort Playwright capture of the live app (reuses harness browser).
Default = landing page (credential-free, secret-safe R7); recipes needing post-login opt into a
recipe-meta SCREENSHOT hook responsible for avoiding secret pages. Every failure swallowed -> None
(cosmetics never block, R7). Pure helpers unit-tested. Orchestrator wiring + live demo come after U0
PASSes (avoid deploy contention with the Adversary's cold U0 re-runs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A cc-ci deploy overlay sitting in the recipe checkout as an untracked file (ghost's
compose.ccci-health.yml via install_steps) makes abra stamp chaos-version='<commit>+U' (U=untracked).
The commit still equals head_ref (HC1 satisfied) but the '+U' broke the exact-prefix match → spurious
upgrade-tier FAIL. Strip the working-tree-state marker before the commit match; HC1 preserved (commit
must still equal head_ref — a stale checkout's commit would not match even after stripping). General:
benefits every future cc-ci overlay recipe.
Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green
on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not
running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by
the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy
passed while the app churned → backup-bot execed a not-running container. Fix: extend
lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects});
mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up
after install AND upgrade before backup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
git checkout <head_ref> aborted on the untracked install_steps-provided compose.host-ports.yml (which
head_ref tracks). Force-checkout yields the exact ref tree. Also fixes the mumble restore tier: backup
labels exist only in 1.0.0+, so backup/restore are meaningful only after the (now-working) upgrade moves
the app to head_ref. DECISIONS.md updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because
install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True +
lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the
checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md
records the full mumble enrollment design.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE
from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present
(zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND
in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a
clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale).
Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→
pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The origin-repoint approach hit go-git 'reference not found' (mirror HEAD→master vs main). Simpler +
robust: detect lightweight version tags (has_lightweight_version_tags, read-only) and, for the pinned
base deploy of such a recipe, use chaos — which SKIPS abra lint (so no R014 FATA) and deploys the
EXPLICITLY-checked-out pinned version (recipe_checkout already ran; chaos uses the current checkout,
so it's the prev version, NOT LATEST — F1d-2's hazard was the missing checkout). No-op / stays pinned
for all-annotated recipes. The upgrade tier's prev→PR-head crossover + HC1 (chaos-version==head_ref)
still hold (verified by the run's upgrade-tier log).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--bare lacked refs/heads/main, so abra's post-normalize git ops (app secret insert / deploy) failed
'unable to fetch tags: reference not found' when fetching from the repointed local origin. --mirror
copies all refs (heads+tags) → abra fetch OK + R014 passes (both verified).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnosed: abra runs git fetch --tags --force from origin before its pinned-deploy lint, so
re-annotating the lightweight tag in place is reverted before R014 runs. Fix: after re-annotating,
clone the recipe to a local bare repo (carrying the annotated tag) and repoint origin at it, so
abra's force-fetch pulls the annotated tag. Verified: abra recipe lint R014 then PASSES and the
annotation sticks. Deployed commit unchanged. No-op for all-annotated recipes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lasuite-meet upgrade tier failed at the prev-version base deploy: abra's pinned-deploy lint FATA'd on
R014 'only annotated tags used for recipe version' because upstream coop-cloud lasuite-meet ships a
stray LIGHTWEIGHT tag (0.3.0+v1.16.0). chaos deploys skip lint (so install,custom passed) but the
upgrade tier's pinned prev-version deploy lints. New abra.normalize_recipe_tags() re-creates each
lightweight version tag as annotated at the SAME commit (no deployed content changes); called in
lifecycle.deploy_app after recipe_checkout when version is pinned. Idempotent; no-op for all-annotated
recipes (lasuite-drive etc.). Helps any recipe with a stray upstream lightweight tag.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor
FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init),
even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora
being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold).
Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
the upgrade redeploy. No-op for recipes without a READY_PROBE.
NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):
- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.
Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 bugs found by the live PASS+FAIL proof on the custom-html canonical:
- import time (run_quick._wait_undeployed used it → the FAIL rollback crashed
with NameError before restore ran).
- canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before
redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a
since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'.
- run_quick FAIL rollback resets TYPE to known-good after restore (idle .env
agrees with the registry).
LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy
keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken
image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1,
known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0).
61 unit pass; cold path untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
<recipe>/ dir — so the reconciler's sibling last_good file survives a
snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
stdout) in the error; add wait_undeployed() after every undeploy so
snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
rollback. (57 unit pass.)
LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
last_good NOT advanced, rollback alert written (attempted=10.7.10,
last_good=10.7.9, recovered=True). keycloak recovered to canonical
10.7.1+26.6.2 healthy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master),
realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms.
The per-run realm is the isolation unit on a shared live-warm keycloak;
orphans (crashed runs) reaped by hex not mapping to a live app stack.
+8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harness change (small, surgical):
- runner/harness/lifecycle.deploy_app gains a deploy_timeout param (default 900s); passes
through to abra.deploy(timeout=...). For heavy recipes (ghost, matrix-synapse, lasuite-meet),
the orchestrator + dep resolver now read recipe_meta.DEPLOY_TIMEOUT and pass it so the Python
subprocess wrapping abra deploy doesn't SIGKILL it before the recipe's INTERNAL TIMEOUT
(via EXTRA_ENV) finishes swarm convergence.
- runner/run_recipe_ci.py + runner/harness/deps.py: thread recipe_meta.DEPLOY_TIMEOUT into
the per-recipe deploy_app call.
Q4.4 ghost enrollment:
- recipe_meta.py: HEALTH_PATH=/, DEPLOY_TIMEOUT=1200 (subprocess), EXTRA_ENV={TIMEOUT: 1200}
(recipe internal). Ghost cold-start with theme + DB migration runs ~12-15min on cc-ci.
- functional/test_health_check.py: GET / returns 200 (themed site).
- functional/test_content_api.py: GET /ghost/api/content/settings/ returns 200 (settings JSON)
or 401/403 (Ghost error envelope) — distinguishes ghost-server up + JSON API working from
static fallback.
- functional/test_admin_redirect.py: GET /ghost/ returns 200 or 302 + Ghost branding;
proves admin route is wired through nginx proxy.
- PARITY.md: recipe-maintainer corpus has no ghost tests/, Phase-2 health_check is the
parity baseline; create-a-post deeper test deferred (DEFERRED.md, --extra-tests linked).
Cold-verifiable (log /root/ccci-q44-ghost-r3.log):
RECIPE=ghost STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
install + 3 functional tests PASS, deploy-count=1. 28/28 unit tests still PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per REVIEW-2 ## Q2 FAIL: runner/harness/deps.py::teardown_deps suppressed ALL exceptions via
contextlib.suppress(Exception), silently swallowing teardown failures. The 'DEPS teardown' print
fired even when undeploy actually raised — leaving leftover swarm services/volumes/secrets that
broke the NEXT run targeting the same deterministic dep domain (this is what caused the Q3.1 dep
flake I saw immediately after the Q2.4 acceptance run).
Fix:
- runner/harness/deps.py: teardown_deps now uses lifecycle.teardown_app(..., verify=True) so
residuals raise TeardownError. Errors are LOGGED LOUDLY per-dep but we continue to other deps
so one failure doesn't strand the rest. After all attempts: raise a combined TeardownError if
any dep failed.
- runner/run_recipe_ci.py: orchestrator catches the dep TeardownError in finally, prints it,
captures into dep_teardown_error; the run summary surfaces it and the exit code is non-zero.
The run STILL prints the diagnosable summary so a leak doesn't hide other failures.
Per §9 teardown sacred / DG7: a green run that leaks state is not 'green'. F2-5 now correctly
fails the run instead of silently passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 lesson from F2-3 (n8n install Playwright flake on net::ERR_NETWORK_CHANGED): every
install overlay that does page.goto needs the same try/except PlaywrightError + status retry.
Centralize in runner/harness/browser.py::goto_with_retry; apply to ALL install overlays.
- runner/harness/browser.py: shared helper. Polls page.goto until status in accept_statuses;
catches PlaywrightError (net::ERR_*) as a retryable signal, not a failure. Raises AssertionError
with last_status + last_err diagnostic only on deadline expiry.
- tests/custom-html/test_install.py: now uses goto_with_retry (200 only, wait_until=load).
- tests/custom-html/playwright/test_browser_smoke.py: same.
- tests/n8n/test_install.py: replaced inline retry loop with goto_with_retry (200, 304).
- tests/keycloak/test_install.py: goto_with_retry for admin console (200, 302, 303; 45s goto).
- tests/cryptpad/test_install.py: goto_with_retry (200, 304; 60s goto, wait_until=load).
- tests/lasuite-docs/test_install.py: goto_with_retry (200, 301, 302; 60s goto).
Cold-verifiable: ssh cc-ci 'RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py'
all 5 stages PASS (including the install overlay that flaked in the deps_smoke run),
deploy-count=1, head_ref=8a026066==chaos-version=8a026066 (HC1 non-vacuous).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- runner/harness/http.py: canonical Phase-2 recipe-test HTTP API (vendored from
recipe-maintainer/utils/tests/helpers.py): http_get/http_post, retry variants,
wait_for_http, assert_converges. JSON-parsing, header support, form/JSON POST
bodies, transport-failure -> status=0. Self-contained (cc-ci does not import
recipe-maintainer at runtime per DECISIONS Phase 2).
- harness.discovery.custom_tests now also recurses into
tests/<recipe>/{functional,playwright}/test_*.py (Phase 2 §4.1 layout) while
excluding lifecycle test_<op>.py names and honoring the HC2 repo-local gate.
- Unit tests:
tests/unit/test_http.py — in-process http.server fixture; deterministic
proofs of parsing/retry/convergence semantics, no network egress.
tests/unit/test_discovery_phase2.py — functional/+playwright/ recursion
+ HC2 gate still applies to subdirs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy
recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve
container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data.
No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS.
HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race;
production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed
chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a
stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- orchestrator: per mutating tier, run optional pre-op seed hook (ops.py pre_<op>) → perform the op
ONCE (harness-owned) → run generic assertion (unless opted out) AND overlay assertion, both against
the shared post-op deployment. Op results passed op→assertion via run-scoped CCCI_OP_STATE_FILE.
- opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC (declarative).
- generic.py: split do_* into op primitives (perform_upgrade/backup/restore) + assertions
(assert_upgraded/backup_artifact/restore_healthy) reading op_state(); deployed_identity now returns
{version,image,chaos} (chaos label ready for HC1).
- generic test_<op>.py + all 6 recipe overlays migrated to assertion-only; pre-op seeding moved to
per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op).
- deploy-count stays 1 (op primitives never call deploy_app). lint PASS; 8 unit tests PASS on cc-ci.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
abra app backup create (backup-bot-two) stops/cycles the app container, so a mutate exec_in_app
right after backup hit an empty docker ps and raised. _app_container now polls (no bare sleep) for
the container to reappear within a timeout. Recipe-agnostic harness robustness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- deploy_app: checkout the pinned tag + deploy NON-chaos when a version is pinned (chaos only for
version=None / PR-head). Was always -C, which ignored the pin and deployed LATEST -> upgrade no-op.
- do_upgrade: assert the deployment actually MOVED (coop-cloud version label and/or image changed)
via lifecycle.deployed_identity -> a vacuous no-op upgrade can no longer pass (DG2).
- G2: migrate custom-html overlays to the assertion-only contract (override + extend-by-composition
+ data-continuity; split backup/restore). tests/unit/test_discovery.py proves precedence (5/5).
Probe (Adversary's F1d-2 test): hedgedoc deploy-prev=1.10.7 -> upgrade=1.10.8, CHANGED=True.
hedgedoc full generic lifecycle green (install/upgrade/backup/restore, deploy-count=1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- backup artifact: read snapshot_id from 'abra app backup create' output (snapshots needs a TTY);
generic.parse_snapshot_id + do_backup assert it
- restore serving race: lifecycle.http_fetch (one request -> status+body, never raises) +
assert_serving is now a bounded poll (settles a post-op reconverge, no bare sleep); drop wait_serving
- F1d-1 (Adversary, low): reframe served_cert/assert_serving honestly as an INFRA TLS sanity check
(catches a lapsed/mis-rotated wildcard cert), NOT app-vs-fallback (Traefik serves the wildcard
zone-wide); the genuine serving proof is services_converged + non-404 status. Awaiting re-test.
DG1 Adversary PASS @ef44d46. G1 full-lifecycle re-verification in flight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnosed via instrumented diag: lasuite-docs upgrade reported 'FATA deploy failed' while all 9
services converged 1/1 — abra's convergence poll gives up too early on the slow stop-first roll
(pulling new images). Disable abra's check; the harness wait_healthy + data-survival assertion is
the real, more-patient gate (a genuine failure still fails the test: app never gets healthy).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>