fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening

F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy
recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve
container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data.
No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS.

HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race;
production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed
chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a
stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-28 03:41:42 +01:00
parent 4334e19a7b
commit 6eabfdc0fb
5 changed files with 149 additions and 35 deletions

View File

@ -76,3 +76,28 @@ Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).
leftover custom-html stack). Log: /root/ccci-1e-optout.log. leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out = - HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out =
generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary. generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked
`'' == 'original'`. Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout` WITHOUT
checking returncode — when backup-bot cycles the app container, `docker exec` fails and the empty
stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing
buffer that opt-out removes. **Fix (no assertion weakened):** `exec_in_app` now polls — re-resolves
the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race
now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes
the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so
opt-out is behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
- `head_ref` capture was racy (returned None under a concurrent run wiping the shared recipe dir),
and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes:
`head_ref = ref or recipe_head_commit(recipe)` (prefer the explicit PR head sha $REF — robust, no
git race; production `!testme` always sets REF); store head_ref in op_state.
- `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to
MATCH head_ref — direct proof the PR-head code under test was deployed, and non-vacuous (a stale
prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the
version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1
cold-verify — both share `/root/.abra/recipes/custom-html` + (at PR=0) the same run domain, so they
collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc
runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a
gate is under Adversary verification; verify when `pgrep run_recipe_ci` is clear.

View File

@ -152,6 +152,18 @@ def restore(domain: str, timeout: int = 900) -> None:
_run_pty(["app", "restore", domain, "-n", "-C", "-o"], timeout=timeout) _run_pty(["app", "restore", domain, "-n", "-C", "-o"], timeout=timeout)
def recipe_head_commit(recipe: str) -> str | None:
"""The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or
the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the
prev-tag base deploy reset the working tree (HC1)."""
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True)
out = proc.stdout.strip()
return out or None
def recipe_versions(recipe: str) -> list[str]: def recipe_versions(recipe: str) -> list[str]:
"""Published versions of a recipe, oldest→newest (from the recipe git tags).""" """Published versions of a recipe, oldest→newest (from the recipe git tags)."""
import os import os

View File

@ -142,24 +142,43 @@ def op_state() -> dict:
def assert_upgraded(domain: str, meta: dict) -> None: def assert_upgraded(domain: str, meta: dict) -> None:
"""Generic UPGRADE assertion (post-op): the orchestrator already performed the upgrade once. """Generic UPGRADE assertion (post-op): the orchestrator already performed the upgrade once via
Assert it reconverged + still serves AND that the deployment actually MOVED — guarding against a `abra app deploy --chaos` of the PR-head checkout. Assert it reconverged + still serves AND that
vacuous no-op upgrade silently passing (F1d-2). HC1: prev→PR-head may NOT bump the version label, the deployment is genuinely the PR-head code under test (HC1) — non-vacuously (guarding F1d-2).
so a MOVE is ANY of: version-label change, image change, or a chaos label now present (a chaos
deploy stamps the PR-head commit — THE proof the code under test was deployed).""" The chaos deploy stamps `coop-cloud.<stack>.chaos-version` = the deployed recipe commit. When the
before = op_state().get("upgrade", {}).get("before") or {} intended PR-head commit is known (head_ref), require the deployed chaos commit to MATCH it — THE
proof the code under test was deployed, and non-vacuous: a stale prev-checkout chaos redeploy would
stamp prev's commit, not head_ref, and fail here. When head_ref is unknown, fall back to requiring
a move vs the pre-upgrade state (version/image/chaos changed)."""
st = op_state().get("upgrade", {})
before = st.get("before") or {}
head_ref = st.get("head_ref")
assert_serving(domain, meta) assert_serving(domain, meta)
after = lifecycle.deployed_identity(domain) after = lifecycle.deployed_identity(domain)
chaos = after.get("chaos")
if head_ref:
assert chaos, (
f"{domain}: upgrade left no chaos label — `abra app deploy --chaos` did not deploy the "
"PR-head checkout (the code under test was not exercised by the upgrade)"
)
# chaos-version is an abbreviated commit (e.g. '8a026066'); head_ref may be full or short.
assert head_ref.startswith(chaos) or chaos.startswith(head_ref), (
f"{domain}: upgrade deployed chaos commit {chaos!r}, not the intended PR-head "
f"{head_ref[:12]!r} — the re-checkout to the code under test failed, so the upgrade is "
"not exercising the PR's changes (HC1)"
)
return
moved = ( moved = (
(before.get("version") and after.get("version") and before["version"] != after["version"]) (before.get("version") and after.get("version") and before["version"] != after["version"])
or (before.get("image") and after.get("image") and before["image"] != after["image"]) or (before.get("image") and after.get("image") and before["image"] != after["image"])
or (after.get("chaos") and after.get("chaos") != before.get("chaos")) or (chaos and chaos != before.get("chaos"))
) )
assert moved, ( assert moved, (
f"{domain}: upgrade did not move the deployment " f"{domain}: upgrade did not move the deployment "
f"(version {before.get('version')}->{after.get('version')}, " f"(version {before.get('version')}->{after.get('version')}, "
f"image {before.get('image')}->{after.get('image')}, " f"image {before.get('image')}->{after.get('image')}, "
f"chaos {before.get('chaos')}->{after.get('chaos')}) — " f"chaos {before.get('chaos')}->{chaos}) — "
"not a real upgrade to the code under test (HC1/DG2 must be non-vacuous)" "not a real upgrade to the code under test (HC1/DG2 must be non-vacuous)"
) )
@ -195,12 +214,25 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
# ---- Op primitives (orchestrator-only; perform the op once, never assert) -------------------- # ---- Op primitives (orchestrator-only; perform the op once, never assert) --------------------
def perform_upgrade(domain: str, target: str | None) -> dict[str, str | None]: def perform_upgrade(domain: str, recipe: str, head_ref: str | None) -> dict[str, str | None]:
"""Perform the UPGRADE op once (in place). E1 baseline: `abra app upgrade` -> target. (HC1/E2 """Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
redefines this as a chaos redeploy of the PR-head checkout.) Returns the pre-upgrade identity so PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
the orchestrator can record it for `assert_upgraded`'s move check.""" to redeploy the running app at that checkout. This is the real upgrade the PR's changes are
exercised by (vs the old 'upgrade to newest published tag', which never deployed PR-head code).
Returns the pre-upgrade identity so the orchestrator records it for `assert_upgraded`'s move check
— after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it."""
before = lifecycle.deployed_identity(domain) before = lifecycle.deployed_identity(domain)
lifecycle.upgrade_app(domain, version=target) if head_ref:
lifecycle.recipe_checkout_ref(recipe, head_ref)
lifecycle.chaos_redeploy(domain)
after = lifecycle.deployed_identity(domain)
# Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
# PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.
print(
f" upgrade→PR-head: head_ref={(head_ref or '')[:8] or None} "
f"chaos-version={after.get('chaos')} version={before.get('version')}{after.get('version')}",
flush=True,
)
return before return before

View File

@ -252,11 +252,11 @@ def deployed_identity(domain: str, service: str = "app") -> dict[str, str | None
- `version` = the `coop-cloud.<stack>.version` label (bumped per published recipe version). - `version` = the `coop-cloud.<stack>.version` label (bumped per published recipe version).
- `image` = the running container image (usually bumps with a published version). - `image` = the running container image (usually bumps with a published version).
- `chaos` = the chaos label value (a chaos deploy stamps the recipe git commit/dirty state here) - `chaos` = the chaos deploy's recipe git commit. abra stamps `coop-cloud.<stack>.chaos-version`
— present after `abra app deploy --chaos`, absent on a clean pinned-tag deploy. For prev→PR-head = the deployed recipe commit (e.g. "91b27ceb") + `coop-cloud.<stack>.chaos`="true" on a
this is THE proof PR-head was deployed even when the version label is unbumped (HC1). The exact `--chaos` deploy; both are absent on a clean pinned-tag deploy. We prefer the `.chaos-version`
chaos label key varies by abra version, so we capture any `coop-cloud.<stack>.*` label whose key commit — for prev→PR-head it IS the proof the PR-head code under test was deployed even when the
contains "chaos".""" version label is unbumped (HC1); fall back to the `.chaos` flag if no commit is present."""
name = f"{_stack_name(domain)}_{service}" name = f"{_stack_name(domain)}_{service}"
proc = subprocess.run( proc = subprocess.run(
[ [
@ -274,22 +274,43 @@ def deployed_identity(domain: str, service: str = "app") -> dict[str, str | None
if "|" not in out: if "|" not in out:
return {"version": None, "image": None, "chaos": None} return {"version": None, "image": None, "chaos": None}
labels_json, _, image = out.partition("|") labels_json, _, image = out.partition("|")
ver = chaos = None ver = chaos = chaos_flag = None
with contextlib.suppress(ValueError, json.JSONDecodeError): with contextlib.suppress(ValueError, json.JSONDecodeError):
for k, v in json.loads(labels_json).items(): for k, v in json.loads(labels_json).items():
if not k.startswith("coop-cloud."): if not k.startswith("coop-cloud."):
continue continue
if k.endswith(".version"): if k.endswith(".version"):
ver = v ver = v
elif "chaos" in k: elif k.endswith(".chaos-version"):
chaos = v chaos = v # the deployed recipe commit — the strongest signal
return {"version": ver, "image": image.strip() or None, "chaos": chaos} elif k.endswith(".chaos"):
chaos_flag = v
return {"version": ver, "image": image.strip() or None, "chaos": chaos or chaos_flag}
def upgrade_app(domain: str, version: str | None = None) -> None: def upgrade_app(domain: str, version: str | None = None) -> None:
abra.upgrade(domain, version=version) abra.upgrade(domain, version=version)
def recipe_head_commit(recipe: str) -> str | None:
"""The recipe checkout's current HEAD commit (captured right after fetch, before any version-tag
checkout) so the upgrade tier can re-checkout the PR head for the chaos redeploy (HC1)."""
return abra.recipe_head_commit(recipe)
def recipe_checkout_ref(recipe: str, ref: str) -> None:
"""git-checkout the recipe to an arbitrary ref/commit (HC1: restore the PR-head checkout before
the chaos upgrade — the prev-tag base deploy reset it to the published tag)."""
abra.recipe_checkout(recipe, ref)
def chaos_redeploy(domain: str) -> None:
"""In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
(HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
through deploy_app, so the deploy-count guard (DG4.1) is not incremented."""
abra.deploy(domain, chaos=True)
def backup_app(domain: str) -> str: def backup_app(domain: str) -> str:
"""Create a backup; return the abra/restic output (carries the produced snapshot_id).""" """Create a backup; return the abra/restic output (carries the produced snapshot_id)."""
return abra.backup_create(domain) return abra.backup_create(domain)
@ -326,10 +347,26 @@ def _app_container(domain: str, service: str = "app", timeout: int = 60) -> str:
time.sleep(3) time.sleep(3)
def exec_in_app(domain: str, cmd: list[str], service: str = "app") -> str: def exec_in_app(domain: str, cmd: list[str], service: str = "app", timeout: int = 90) -> str:
cid = _app_container(domain, service) """Run `docker exec` in the app's container and return stdout. Hardened (Adversary F1e-1): a
proc = subprocess.run(["docker", "exec", cid, *cmd], capture_output=True, text=True) lifecycle op (backup/restore) cycles the container, so a freshly-resolved container can be
return proc.stdout mid-transition and `docker exec` FAILS — poll (re-resolving the container each try) until the exec
succeeds (returncode 0) or timeout, then RAISE. Never silently return '' on a failed exec: that
masked a container-cycle race as empty data, flipping a healthy recipe RED under opt-out (no
accidental generic-pytest timing buffer) — and could mask a real failure as a pass elsewhere."""
deadline = time.time() + timeout
last = ""
while True:
cid = _app_container(domain, service)
proc = subprocess.run(["docker", "exec", cid, *cmd], capture_output=True, text=True)
if proc.returncode == 0:
return proc.stdout
last = (proc.stderr or proc.stdout).strip()
if time.time() >= deadline:
raise RuntimeError(
f"docker exec in {domain}/{service} failed (rc={proc.returncode}) after {timeout}s: {last}"
)
time.sleep(3)
def http_body(domain: str, path: str = "/", timeout: int = 15) -> str: def http_body(domain: str, path: str = "/", timeout: int = 15) -> str:

View File

@ -211,13 +211,14 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
sys.path.remove(d) sys.path.remove(d)
def _perform_op(op: str, domain: str, target: str | None, op_state: dict) -> None: def _perform_op(op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict) -> None:
"""Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records """Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
call deploy_app, so the deploy-count guard (DG4.1) stays 1 — the in-place upgrade is not a new call deploy_app, so the deploy-count guard (DG4.1) stays 1 — the in-place chaos upgrade is not a
install (HC1 reconciliation).""" new install (HC1 reconciliation)."""
if op == "upgrade": if op == "upgrade":
op_state["upgrade"] = {"before": generic.perform_upgrade(domain, target)} before = generic.perform_upgrade(domain, recipe, head_ref)
op_state["upgrade"] = {"before": before, "head_ref": head_ref}
elif op == "backup": elif op == "backup":
op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)} op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)}
elif op == "restore": elif op == "restore":
@ -231,12 +232,13 @@ def run_lifecycle_tier(
repo_local: str | None, repo_local: str | None,
domain: str, domain: str,
meta: dict, meta: dict,
target: str | None, head_ref: str | None,
op_state: dict, op_state: dict,
) -> str: ) -> str:
"""Additive lifecycle tier (HC3): seed (pre-op hook) → perform the op ONCE → run the generic """Additive lifecycle tier (HC3): seed (pre-op hook) → perform the op ONCE → run the generic
assertion file (unless opted out) AND the overlay assertion file, both against the shared post-op assertion file (unless opted out) AND the overlay assertion file, both against the shared post-op
deployment. Returns 'pass' | 'fail' | 'skip'.""" deployment. The upgrade op redeploys the PR head (head_ref) via chaos (HC1). Returns
'pass' | 'fail' | 'skip'."""
overlay = discovery.resolve_overlay_op(recipe, op, repo_local) overlay = discovery.resolve_overlay_op(recipe, op, repo_local)
skip_gen = _skip_generic(op, meta) skip_gen = _skip_generic(op, meta)
files: list[tuple[str, str]] = [] files: list[tuple[str, str]] = []
@ -257,7 +259,7 @@ def run_lifecycle_tier(
# 1) pre-op seed hook + 2) the op ONCE (harness-owned). A failure here is an op failure → tier fail. # 1) pre-op seed hook + 2) the op ONCE (harness-owned). A failure here is an op failure → tier fail.
try: try:
_run_pre_hook(recipe, op, repo_local, domain, meta) _run_pre_hook(recipe, op, repo_local, domain, meta)
_perform_op(op, domain, target, op_state) _perform_op(op, domain, recipe, head_ref, op_state)
with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f: with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
json.dump(op_state, f) json.dump(op_state, f)
except Exception as e: # noqa: BLE001 — a failed op is a reported tier failure, not a crash except Exception as e: # noqa: BLE001 — a failed op is a reported tier failure, not a crash
@ -312,6 +314,10 @@ def main() -> int:
f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}" f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
) )
fetch_recipe(recipe, ref, src) fetch_recipe(recipe, ref, src)
# The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test
# (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout
# HEAD (the catalogue current) for a non-PR `!testme`. Captured before any version-tag checkout.
head_ref = ref or lifecycle.recipe_head_commit(recipe)
repo_local = snapshot_recipe_tests(recipe) repo_local = snapshot_recipe_tests(recipe)
meta = _load_meta(recipe) meta = _load_meta(recipe)
domain = naming.app_domain(recipe, os.environ.get("PR", "0"), ref) domain = naming.app_domain(recipe, os.environ.get("PR", "0"), ref)
@ -361,7 +367,7 @@ def main() -> int:
# ---- INSTALL tier (always; additive generic + overlay, no op) ---- # ---- INSTALL tier (always; additive generic + overlay, no op) ----
if "install" in stages: if "install" in stages:
results["install"] = ( results["install"] = (
run_lifecycle_tier(recipe, "install", repo_local, domain, meta, target, op_state) run_lifecycle_tier(recipe, "install", repo_local, domain, meta, head_ref, op_state)
if deploy_ok if deploy_ok
else "fail" else "fail"
) )
@ -379,7 +385,9 @@ def main() -> int:
# ---- BACKUP + RESTORE tiers (backup-capable only; else clean N/A) ---- # ---- BACKUP + RESTORE tiers (backup-capable only; else clean N/A) ----
if "backup" in stages: if "backup" in stages:
results["backup"] = ( results["backup"] = (
run_lifecycle_tier(recipe, "backup", repo_local, domain, meta, target, op_state) run_lifecycle_tier(
recipe, "backup", repo_local, domain, meta, head_ref, op_state
)
if backup_cap if backup_cap
else "skip" else "skip"
) )