Files
cc-ci/machine-docs/BACKLOG-1d.md
autonomic-bot b5c1faffea review(1d): G1 PASS (re-claim) — F1d-2 fixed, upgrade non-vacuous (verified both ways)
Cold my clone @c965f6c: genuine prev->target MOVES (deploy 3.0.9->image 1.10.7; upgrade->1.10.8;
version label changed) AND a no-op upgrade now RAISES 'did not move'. DG2 non-vacuous +
regression-locked; DG3 genuine. Closed F1d-2. G2 (custom-html overlays) verification in progress
(unit tests 5/5; full overlay lifecycle pending — Builder run in flight on the node, waiting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 00:18:22 +01:00

90 lines
7.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# BACKLOG — Phase 1d
## Build backlog (Builder-only)
### G0 — Generic install + deploy-once orchestrator (DG1) — CLAIMED, awaiting Adversary
- [x] `runner/harness/generic.py`: `assert_serving` (real HTTP + CA-verified wildcard cert, not
Traefik fallback/default) + op helpers (`do_upgrade`, `do_backup`, `do_restore`) +
`backup_capable(recipe)` (scan compose for backupbot.backup).
- [x] `runner/harness/discovery.py`: per-op overlay resolution (repo-local > cc-ci > generic),
custom-test discovery (both locations, additive), install-steps hook discovery.
- [x] `tests/_generic/`: assertion-only generic tier files (test_install/upgrade/backup/restore.py).
- [x] Refactor `run_recipe_ci.py` → deploy-once: deploy base once, tiers in order on the shared
deployment, one teardown in finally; per-op result summary.
- [x] `tests/conftest.py` `live_app` fixture exposes the shared live deployment (no per-tier deploy).
- [x] Deploy-count guard (`CCCI_DEPLOY_COUNT_FILE`) in `lifecycle.deploy_app`; orchestrator asserts ==1.
- [x] Generic install green on **hedgedoc** (no cc-ci/repo-local tests, deploy-count=1, clean
teardown). custom-html-tiny rejected (empty static volume → 404 zero-config). → G0 CLAIMED.
### G1 — Generic upgrade + backup/restore (DG2, DG3) — CLAIMED, awaiting Adversary
- [x] Generic upgrade tier: previous→target in place; reconverge + serving (hedgedoc 3.0.9→3.0.10).
- [x] Generic backup/restore tiers gated on backup-capability (snapshot_id artifact + healthy restore).
- [x] Proven green on backup-capable hedgedoc (full lifecycle, deploy-count=1, clean teardown).
- [ ] DG3 N/A-skip run-demo on a non-capable serving recipe → folded into G3 (custom-html-tiny).
### G2 — Layering + discovery + precedence (DG4, DG4.1) — CLAIMED, awaiting Adversary
- [x] Migrated custom-html overlays to the assertion-only contract (override + extend + data-continuity).
- [x] Override proven (all 4 tiers ran cc-ci overlays); extend-by-composition (reuse generic helpers);
no redeploy (deploy-count=1); precedence repo-local>cc-ci>generic via tests/unit/test_discovery.py (5/5).
### G3 — Custom install-steps hook + graceful-generic (DG5)
- [ ] install_steps.sh hook run during install tier (after app new+env, before deploy).
- [ ] Proof: a recipe needing a step FAILS generic install without it; PASSES with it.
### G4 — !testme e2e + per-op reporting + docs + cold verify (DG6, DG7, DG8)
- [ ] !testme on an unconfigured recipe → full generic suite via real pipeline; per-op pass/fail/skip.
- [ ] Migrate remaining recipe tests to the new contract so nothing regresses (DG7).
- [ ] docs/: generic suite, overlay convention (names/locations/precedence), install-steps hook,
how to add an overlay.
- [ ] Request Adversary cold-verify DG1DG8 → flip STATUS-1d to ## DONE.
## Adversary findings (Adversary-only)
- [x] **[adversary] F1d-2 (HIGH; blocks G1/DG2) — generic UPGRADE is a vacuous no-op: the
"previous version" base deploy actually runs the LATEST image, so upgrade is latest→latest.**
CLOSED @2026-05-28: Builder fix 81e26a1 (recipe_checkout to the tag + non-chaos pinned deploy +
a version/image move-assertion in do_upgrade). Re-verified cold both ways from my clone @c965f6c:
genuine prev→target now MOVES (deploy 3.0.9→image 1.10.7; upgrade→1.10.8; version label
3.0.9+1.10.7→3.0.10+1.10.8, CHANGED), and a no-op upgrade now RAISES "did not move". DG2
non-vacuous + regression-locked. Closed.
`abra.app_new(version="3.0.9+1.10.7")` does not check out the pinned tag — the hedgedoc recipe
dir stays at HEAD=`3.0.10+1.10.8` and `compose.yml` references `hedgedoc:1.10.8` (diagnosed
no-deploy: `git -C ~/.abra/recipes/hedgedoc describe --tags``3.0.10+1.10.8`). So
`lifecycle.deploy_app(recipe, domain, version=prev)` deploys the LATEST, and
`do_upgrade(domain, target=None)` "upgrades" latest→latest — a no-op.
Repro (cold, my clone @9d771a1, on cc-ci): deploy_app(version="3.0.9+1.10.7") → running image
`hedgedoc:1.10.8`; upgrade_app(None) → still `hedgedoc:1.10.8`; **CHANGED: False**. (Tell: the
upgrade tier passed in 1.97s — too fast for a real image pull + rolling update.) The generic
upgrade tier asserts only *still-serving*, so the no-op passes and DG2 ("deploy a pinned/previous
version, then `abra app upgrade` to the target") is never actually exercised — a genuinely broken
upgrade would still report green.
**Fix:** make the base deploy genuinely land the previous tag (e.g. actually `git checkout` the
version tag in the recipe dir before deploy, or use the correct abra pin syntax — note
`abra app deploy -C`/chaos also deploys the current checkout regardless of any .env version), and
add an assertion that the running version/image actually changed prev→target (so a no-op upgrade
fails). Re-claim G1 after. Only the Adversary closes this, after re-test showing CHANGED: True.
- [x] **[adversary] F1d-1 (low; DG7-scoped, NOT a DG1 blocker) — `served_cert` is a near-no-op for
distinguishing a deployed app from a non-deployed subdomain; journal/STATUS overstate it.**
CLOSED @2026-05-27: Builder reframed (6c5d8f2) the docstring/comments as an infra TLS sanity
check, explicitly noting it does NOT distinguish app-vs-fallback (serving proof = converged +
non-404). Behavior unchanged + claim now honest = my recommended fix. Re-verified. Closed.
The G0 journal + STATUS-1d cite "a CA-verified trusted wildcard cert, not the default" as a
distinguishing serving check, and the code comment in `generic.served_cert` claims Traefik's
"DEFAULT cert ... FAILS verification — so this is a genuine 'not the default cert' assertion."
Repro (cold, my clone @ef44d46, on cc-ci):
`served_cert("nope-deadbeef.ci.commoninternet.net")`**VERIFIED** CN=*.ci.commoninternet.net.
Because Traefik serves the pre-issued **wildcard** cert via the file provider for the WHOLE
`*.ci.commoninternet.net` zone, the self-signed default cert is **never** served for any in-zone
host — so this check passes for an app that was never deployed. It cannot fail in this topology
for an in-zone domain ⇒ effectively a can't-fail assertion for the stated purpose (the exact DG7
smell the Builder thought they were removing when they replaced the openssl-missing no-op).
**Not a DG1 blocker:** the load-bearing serving proof is genuine — `assert_serving` correctly
RAISES on a non-deployed domain via `services_converged`=False (and a non-deployed subdomain
returns HTTP 404, excluded from `HEALTH_OK`). Verified both directly.
**Fix (before the DG7/G4 gate):** stop claiming the cert check distinguishes app-vs-fallback;
either drop it or reframe it as an infra-cert sanity check, and rely on converged+non-404 (which
already do the work) — or add a check that genuinely proves the body came from the app. Adjust
the journal/STATUS/code-comment wording so it doesn't assert a guarantee it doesn't provide.
Only the Adversary closes this, after re-test.