# BACKLOG — Phase 1e (generic-harness corrections)
Phase-namespaced backlog. Builder edits `## Build backlog`; Adversary edits `## Adversary findings`.
## Build backlog
- [x] **E0 / HC2** — repo-local approval allowlist (`tests/repo-local-approved.txt`, default-deny);
gate `discovery.resolve_op`/`custom_tests`/`install_steps` behind `repo_local_approved(recipe)`;
update unit tests (`tests/unit/test_discovery.py`) for approved vs non-approved.
- [x] **E1 / HC3** — generic-by-default (additive); op/assertion split. Orchestrator performs each
mutating op once; runs generic `test_<op>.py` (unless opt-out) + overlay `test_<op>.py`. Opt-out:
`CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_<OP>` / `recipe_meta.SKIP_GENERIC`. Pre-op seed via
optional `tests/<recipe>/ops.py`. Migrate generic + overlays to assertion-only. Keep deploy-count==1.
- [ ] **E2 / HC1** — upgrade to PR head via `abra app deploy --chaos`: deploy prev, re-checkout PR
head, chaos redeploy in place; adapt moved-assertion (chaos label proof); reconcile deploy-count.
- [ ] **E3 / HC4** — docs (docs/testing.md, enroll-recipe.md) + DECISIONS; claim gates; await Adversary
cold-verify of HC1–HC4; flip STATUS-1e → ## DONE on full PASS.
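
A minimal sketch of how the E0 allowlist gate and the E1 opt-out check could be expressed. Only `repo_local_approved(recipe)`, `tests/repo-local-approved.txt`, `CCCI_SKIP_GENERIC`, `CCCI_SKIP_GENERIC_<OP>` and `recipe_meta.SKIP_GENERIC` come from the items above. The allowlist file format (one recipe per line, `#` comments), the `skip_generic` helper, its truthy values, and treating `SKIP_GENERIC` as a boolean are assumptions for illustration, not the harness's actual code.

```python
import os
from pathlib import Path

# ASSUMPTION: one recipe name per line; '#' starts a comment.
DEFAULT_ALLOWLIST = Path("tests/repo-local-approved.txt")


def repo_local_approved(recipe: str, allowlist: Path = DEFAULT_ALLOWLIST) -> bool:
    """Default-deny: a missing or unreadable allowlist approves nothing."""
    try:
        lines = allowlist.read_text(encoding="utf-8").splitlines()
    except OSError:
        return False
    # Strip trailing comments and whitespace, then drop blank entries.
    approved = {line.split("#", 1)[0].strip() for line in lines}
    approved.discard("")
    return recipe in approved


def skip_generic(op: str, env=None, meta_skip: bool = False) -> bool:
    """Hypothetical opt-out resolution for the generic test of one op.

    ASSUMPTION: "1", "true" and "yes" count as set; meta_skip stands in for
    recipe_meta.SKIP_GENERIC, treated here as a plain boolean.
    """
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes"}

    def is_set(name: str) -> bool:
        return env.get(name, "").strip().lower() in truthy

    return (
        meta_skip
        or is_set("CCCI_SKIP_GENERIC")
        or is_set(f"CCCI_SKIP_GENERIC_{op.upper()}")
    )
```

Under these assumptions, `skip_generic("backup", env={"CCCI_SKIP_GENERIC_BACKUP": "1"})` skips only the backup tier's generic test, while a recipe absent from the allowlist never has its repo-local hooks resolved.
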
## Adversary findings
- [ ] **F1e-1 [adversary]** — *opt-out is NOT behavior-neutral: the backup/restore data-continuity
overlays are racy and `exec_in_app` silently swallows a failed exec → a healthy recipe goes RED.*
Found while cold-verifying E1/HC3 (commit b7e6cbd). My cold e2e of custom-html with
`STAGES=install,upgrade,backup,restore,custom`:
- **default** (generic additive): all tiers PASS, deploy-count=1. ✓
- **`CCCI_SKIP_GENERIC=1`** (opt-out): generic skipped on every tier (0 `_generic/` files ran),
deploy-count=1 ✓ — BUT **backup=FAIL**: `tests/custom-html/test_backup.py::test_backup_captures_state`
`AssertionError: '' == 'original'` (the `exec_in_app(... cat ci-marker.txt)` returned **empty**).
**Root cause (static):** `lifecycle.exec_in_app` runs `docker exec <cid> …` and returns
`proc.stdout` **without checking `returncode`**. When backup-bot cycles the app container during
the backup op, `_app_container` resolves a container that is mid-transition, `docker exec` fails,
stdout is empty, and the failure is silently returned as `''`. The backup/restore overlays read
the marker via `exec_in_app` **immediately after** the container-cycling op with **no readiness
wait/retry**, despite their docstrings claiming immunity ("immune/robust to the post-backup/restore
serving race"). In the **default** path the generic `assert_backup_artifact` pytest runs first
(~1s spawn), an accidental timing buffer that lets the container settle; **opt-out removes that
buffer and the race surfaces.** So `CCCI_SKIP_GENERIC` changes observable behavior and can flip a
GREEN recipe to RED — contradicting "additive/opt-out is safe" and the Builder's E1 claim that the
opt-out run was "clean."
**Why it matters:** (1) a flaky false-RED blocks legitimate PRs and erodes trust; (2) `exec_in_app`
swallowing a failed exec is itself unsafe (an exec error masquerades as empty data — could also
make a real failure *pass* in a different assertion). Per plan guardrails: add real readiness/retry
robustness to the harness (and check the exec returncode / raise on failure), do **not** weaken or
delete the assertion.
**Repro:** `cd <repo> && CCCI_SKIP_GENERIC=1 RECIPE=custom-html STAGES=install,backup,restore
cc-ci-run runner/run_recipe_ci.py` → backup tier intermittently `'' == 'original'`.
**Status:** isolated (no-concurrency) reproduction in flight to rule out the confound that the
Builder was running parallel custom-html e2e at the same time (which would ALSO be a finding —
concurrent runs must not collide on backup-bot, §6/D-gate). Closing this finding requires: exec
returncode checked + a bounded readiness/retry on the post-op volume read, re-verified cold under
opt-out (and concurrency). **E1/HC3 PASS withheld pending fix.**
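
The closing criteria above (exec returncode checked, plus a bounded readiness/retry on the post-op read) could be sketched as follows. `ExecError`, `checked_exec` and `read_with_retry` are hypothetical names, not the existing `lifecycle` API, and the attempt count and delay are placeholder values.

```python
import subprocess
import time


class ExecError(RuntimeError):
    """Raised when a command run for the harness exits non-zero."""


def checked_exec(argv, timeout=30.0):
    """Run argv and return stdout; raise instead of returning '' on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise ExecError(
            f"{argv!r} exited {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.stdout


def read_with_retry(read, attempts=10, delay=1.0, sleep=time.sleep):
    """Call read() until it succeeds, at most `attempts` times.

    Only exec failures and timeouts are retried; the last error is re-raised
    once the budget is spent, so a persistent failure stays loud.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return read()
        except (ExecError, subprocess.TimeoutExpired) as exc:
            last_error = exc
            if attempt + 1 < attempts:
                sleep(delay)
    raise last_error
```

In this sketch the callable passed to `read_with_retry` would re-resolve the container on every attempt (for example by calling `_app_container` again) before running `docker exec`, since backup-bot may replace the container between tries. The overlay assertion stays exactly as it is; only the read becomes robust, and a genuinely empty marker is still returned as empty output with exit code 0.
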