- deploy_app: checkout the pinned tag + deploy NON-chaos when a version is pinned (chaos only for version=None / PR-head). Was always -C, which ignored the pin and deployed LATEST -> upgrade no-op.
- do_upgrade: assert the deployment actually MOVED (coop-cloud version label and/or image changed) via lifecycle.deployed_identity -> a vacuous no-op upgrade can no longer pass (DG2).
- G2: migrate custom-html overlays to the assertion-only contract (override + extend-by-composition + data-continuity; split backup/restore). tests/unit/test_discovery.py proves precedence (5/5).

Probe (Adversary's F1d-2 test): hedgedoc deploy-prev=1.10.7 -> upgrade=1.10.8, CHANGED=True.
hedgedoc full generic lifecycle green (install/upgrade/backup/restore, deploy-count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JOURNAL — Phase 1d (append-only)
2026-05-27 — Bootstrap Phase 1d
Read SSOT plan-phase1d-generic-test-suite.md + plan.md §6.1/§7/§9. Studied the post-1b codebase:
runner/run_recipe_ci.py (per-stage pytest, currently deploy-per-stage), tests/conftest.py
(fixtures deployed_app/deployed/old_app each deploy+teardown), runner/harness/{lifecycle,abra,naming}.py,
and existing recipe tests (custom-html/keycloak/etc.).
Access re-verified (bootstrap, new phase):
$ ssh cc-ci 'hostname && whoami && nixos-version'
nixos / root / 24.11.20250630.50ab793 (Vicuna)
$ ssh cc-ci 'abra --version' -> abra version 0.13.0-beta-06a57de
$ ssh cc-ci 'docker stack ls' -> traefik, drone, ccci-bridge, ccci-dashboard, backups all up
$ ssh cc-ci 'grep -ri backupbot ~/.abra/recipes/custom-html/'
compose.yml: backupbot.backup=true ; backupbot.backup.path=/usr/share/nginx/html
$ curl -u bot ... /repos/recipe-maintainers/custom-html-tiny -> 200 (mirrored)
So: backup-capability is detectable by scanning compose for backupbot.backup; custom-html-tiny is
mirrored and has NO cc-ci tests dir → it's the DG1 pure-generic target.
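A minimal sketch of the backup-capability detection described above, assuming the check is a plain text scan of the recipe's compose file for a backupbot.backup label set to true. The regex and the function body are illustrative assumptions, not the harness's actual code.

```python
import re

# Matches a `backupbot.backup` label set to true in either compose label style:
#   - "backupbot.backup=true"     (list form)
#   backupbot.backup: "true"      (mapping form)
# `backupbot.backup.path=...` does not match, because the label name must be
# followed directly by `=` or `:`.
_BACKUP_LABEL = re.compile(r"""backupbot\.backup\s*[=:]\s*["']?true\b""", re.IGNORECASE)


def backup_capable(compose_text: str) -> bool:
    """Return True if the compose text declares backupbot.backup=true."""
    return bool(_BACKUP_LABEL.search(compose_text))
```

A recipe that only declares backupbot.backup.path, or sets the label to false, is reported as not backup-capable under this sketch.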
Design recorded in DECISIONS.md (Phase 1d section). Key calls: tier model with the lifecycle OP owned by the shared harness (test files = assertions only); OVERRIDE precedence repo-local > cc-ci > generic with extend-by-composition; deploy-ONCE with a deploy-count guard; base version = previous (when upgrade runs) else target; backup-capability auto-detect; install-steps shell hook.
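The base-version rule (previous when upgrade runs, else target) can be sketched as below. The function name pick_base_version, the oldest-to-newest tag list, and the sample version strings are assumptions made for illustration; the orchestrator's real inputs may differ.

```python
from typing import Sequence


def pick_base_version(stages: Sequence[str], tags: Sequence[str], target: str) -> str:
    """Choose the version to deploy first.

    stages: the requested lifecycle stages, e.g. ["install", "upgrade"].
    tags:   published versions, oldest to newest (assumed ordering).
    target: the version under test.
    """
    if "upgrade" in stages and target in tags:
        idx = tags.index(target)
        if idx > 0:
            # An older tag exists: deploy it so the upgrade tier has somewhere to move from.
            return tags[idx - 1]
    # No upgrade requested, or no previous tag: deploy the target directly.
    return target
```

With a single published tag the upgrade tier has no previous version, so the sketch falls back to the target rather than failing.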
Seeded STATUS-1d / BACKLOG-1d / JOURNAL-1d. Next: implement G0 (generic.py + discovery.py + tests/_generic/ + deploy-once orchestrator), then verify generic install green on custom-html-tiny.
2026-05-27 — G0 generic install + deploy-once orchestrator: DG1 GREEN
Built the G0 machinery and proved DG1 end-to-end on the real server:
- runner/harness/generic.py: assert_serving (services converged + real HTTP in HEALTH_OK [excludes 404] + not Traefik's 404 body + CA-verified TLS cert is the trusted wildcard), op helpers (do_upgrade/do_backup/do_restore), backup_capable (scan compose for backupbot.backup).
- runner/harness/discovery.py: per-op overlay resolution (repo-local > cc-ci > generic), custom test discovery (both locations, additive), install-steps hook discovery.
- tests/_generic/test_{install,upgrade,backup,restore}.py: assertion-only tiers using live_app.
- runner/run_recipe_ci.py: deploy-ONCE orchestrator with base version (prev if upgrade+exists else target), tiers run against the shared deployment, one teardown in finally, deploy-count guard + per-op summary.
- tests/conftest.py: live_app fixture (reads CCCI_APP_DOMAIN; tiers never deploy).
- lifecycle.deploy_app: deploy-count recorder + install-steps hook + pin DOMAIN to the run domain (fixes recipes whose .env.sample uses {{ .Domain }}, which this abra leaves unexpanded).
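As an illustration of the per-op overlay precedence (repo-local > cc-ci > generic), here is a hedged sketch. Only tests/_generic/test_{op}.py is taken from this journal; the repo-local and per-recipe cc-ci layouts and the resolve_overlay signature are hypothetical.

```python
from pathlib import Path


def resolve_overlay(op: str, recipe: str, repo_local: Path, cc_ci_tests: Path) -> Path:
    """Return the test file that owns `op`, honouring repo-local > cc-ci > generic.

    repo_local:  tests directory inside the recipe repo (hypothetical layout).
    cc_ci_tests: the cc-ci `tests/` directory; `<recipe>/` holds overlays
                 (hypothetical layout) and `_generic/` holds the fallback tiers.
    """
    candidates = [
        repo_local / f"test_{op}.py",                 # highest precedence
        cc_ci_tests / recipe / f"test_{op}.py",       # cc-ci overlay
        cc_ci_tests / "_generic" / f"test_{op}.py",   # generic fallback
    ]
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(f"no test file found for op {op!r} of recipe {recipe!r}")
```

The first existing candidate wins, so an overlay replaces the generic tier for that op only; other ops still resolve to the generic files.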
Two real generic bugs found+fixed via live runs (not "should work"):
- custom-html-tiny deploy failed: DOMAIN={{ .Domain }} not auto-filled by abra app new -D on 0.13.0-beta → can't evaluate field Domain. Fix: env_set(domain,"DOMAIN",domain) in deploy_app.
- served_cert_subject used openssl s_client, but openssl is not on the host (cc-ci-run runtimeInputs has no openssl) → it silently returned None → the "not default cert" check was a no-op (a DG7 can't-fail smell). Replaced with a pure-Python CA-verified handshake (ssl): a publicly-trusted LE wildcard verifies + matches hostname; Traefik's self-signed default fails verification → a genuine assertion. Verified the verify path on the host: ssl.create_default_context() against ci.commoninternet.net → VERIFIED, CN=*.ci.commoninternet.net, SAN=[*.ci.commoninternet.net, ci.commoninternet.net].
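A minimal sketch of a pure-Python CA-verified handshake of the kind described in the second fix, using only the standard ssl and socket modules. The helper cert_names and the exact return shapes are assumptions for illustration; the harness's own served_cert may be structured differently.

```python
from __future__ import annotations

import socket
import ssl


def cert_names(cert: dict) -> tuple[str | None, list[str]]:
    """Extract (commonName, DNS subjectAltNames) from a getpeercert() dict."""
    cn = None
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                cn = value
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    return cn, sans


def served_cert(host: str, port: int = 443, timeout: float = 10.0) -> dict | None:
    """Return the verified peer certificate, or None if verification fails.

    create_default_context() enables chain verification against the system
    trust store and hostname checking, so a self-signed or expired
    certificate makes the handshake fail instead of being accepted.
    """
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert()
    except OSError:
        # Covers ssl.SSLError (including verification errors), timeouts,
        # and connection failures; all are OSError subclasses.
        return None
```

Unlike shelling out to a binary that may be absent, this path returns None only when the connection or verification actually fails, so an assertion built on it can fail for a real reason.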
DG1 evidence (cc-ci, final code): custom-html-tiny is a static-web-server with an empty content volume, so with zero config it genuinely serves 404 and cannot demonstrate serving. Picked hedgedoc instead (simple category, NO cc-ci/repo-local tests → pure generic; backup-capable bonus):
$ RECIPE=hedgedoc STAGES=install cc-ci-run runner/run_recipe_ci.py
===== TIER: install (generic: tests/_generic/test_install.py) =====
tests/_generic/test_install.py::test_serving PASSED
===== RUN SUMMARY ===== deploy-count = 1 (expect 1) install : pass
$ docker stack ls | grep hedg -> (none — clean teardown)
Lint+format clean (ruff check/ruff format --check via nix develop .#lint). Claiming the G0 gate.
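The deploy-count guard reported in the run summary above could be built along these lines. This is a sketch under stated assumptions: the journal does not say how the recorder stores its count, so the append-one-line-per-deploy file and all three function names are hypothetical.

```python
from pathlib import Path


def record_deploy(count_file: Path, domain: str) -> None:
    """Append one line per deploy (hypothetical storage format)."""
    with count_file.open("a", encoding="utf-8") as fh:
        fh.write(domain + "\n")


def deploy_count(count_file: Path) -> int:
    """Count recorded deploys; a missing file means zero deploys."""
    if not count_file.exists():
        return 0
    lines = count_file.read_text(encoding="utf-8").splitlines()
    return sum(1 for line in lines if line.strip())


def assert_deployed_once(count_file: Path, expected: int = 1) -> None:
    """Fail the run if any tier deployed on its own."""
    n = deploy_count(count_file)
    assert n == expected, f"deploy-count = {n} (expect {expected})"
```

Recording at the single deploy entry point means a tier that secretly redeploys shows up as a count of 2 rather than passing silently.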
2026-05-27 — G0/DG1 PASS; F1d-1 fixed; G1 backup+restore fixes
Adversary verdict: DG1 PASS @2026-05-27 (cold, own clone @ef44d46). G0 cleared.
Correcting an overstatement (Adversary finding F1d-1, valid): my earlier G0 wording claimed the
CA-verified cert check distinguishes "the app vs a Traefik default-cert fallback." It does NOT —
Traefik's file provider serves the pre-issued wildcard for the WHOLE *.ci.commoninternet.net
zone, so ANY in-zone subdomain (even a non-deployed one) verifies; the self-signed default cert is
never served in-zone. The genuine app-vs-fallback proof is services_converged (the app's OWN
service replicas N/N) + a non-404 status in HEALTH_OK (Traefik's unmatched-router fallback = 404).
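The corrected app-vs-fallback proof (own replicas N/N plus a non-404 status) can be sketched as follows. The replica strings are in the form printed by docker service ls in its Replicas column; the concrete HEALTH_OK members are an assumption, since the journal only states that the set excludes 404.

```python
from typing import Iterable

# Assumed contents; the journal only says HEALTH_OK excludes 404.
HEALTH_OK = frozenset({200, 301, 302})


def replicas_converged(replicas: str) -> bool:
    """True for 'N/N' with N > 0, e.g. '1/1' or '2/2 (max 1 per node)'."""
    parts = replicas.split()
    if not parts:
        return False
    running, sep, desired = parts[0].partition("/")
    if not sep or not running.isdigit() or not desired.isdigit():
        return False
    return int(desired) > 0 and int(running) == int(desired)


def is_app_serving(replica_columns: Iterable[str], status: int) -> bool:
    """The app's own services are all converged AND the status is healthy."""
    columns = list(replica_columns)
    return bool(columns) and all(replicas_converged(c) for c in columns) and status in HEALTH_OK
```

An in-zone subdomain with no deployed app has no services to converge and gets Traefik's 404, so it fails both conditions regardless of what certificate is served.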
Fix applied (no code behavior change to the load-bearing checks; honesty/scope only):
- generic.served_cert + assert_serving docstrings/comments reframed: the cert check is an INFRA TLS sanity check (catches a lapsed/mis-rotated wildcard cert; plan §4.0 renewal), explicitly NOT an app-vs-fallback check. Kept because it CAN fail (cert expiry/untrust), unlike the old openssl-missing no-op it replaced.
- Assertion message reworded ("served wildcard cert is not trusted/valid", not "...not the default").
Noted for the Adversary to re-test + close F1d-1 (theirs to tick).
G1: DG2 (upgrade) + DG3 (backup/restore) on hedgedoc (backup-capable, ≥2 tags 3.0.9→3.0.10).
Two real bugs found+fixed via live runs:
- backup artifact check. abra app backup snapshots needs a TTY (FATA the input device is not a TTY), but abra app backup create already emits the restic JSON summary with the produced "snapshot_id" (rc 0, "backup finished"). Verified raw on a live custom-html: "snapshot_id": "d85bf492…". Fix: backup_create returns its output; generic.parse_snapshot_id regex-extracts the id; do_backup asserts it. (Dropped the TTY-bound snapshots listing.)
- restore serving race. assert_serving made TWO requests (http_get then http_body); post-restore the app flapped between them → http_body raised an unhandled HTTPError 404. Fix: new lifecycle.http_fetch returns (status, body) in ONE request, never raising; assert_serving now BOUNDED-POLLS converged + serving (status+body from one request) so a post-op reconverge settles while a persistent failure still fails within HTTP_TIMEOUT (no bare sleep). do_upgrade/do_restore call it (dropped the redundant wait_serving).
Re-running full hedgedoc install→upgrade→backup→restore to confirm all-green before claiming G1.
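The restore-race fix combines a single-request fetch that never raises with a bounded poll. A hedged sketch using urllib follows; the status 0 convention for connection failures, the poll_until helper, and its injectable clock/sleep parameters are assumptions made for illustration, not the harness's actual API.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Tuple


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (status, body) from ONE request; never raises.

    Status 0 (an assumed convention) means no HTTP response was obtained.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx still carry a status and body; report them instead of raising.
        try:
            body = err.read().decode("utf-8", "replace")
        except OSError:
            body = ""
        return err.code, body
    except (OSError, ValueError, http.client.HTTPException):
        return 0, ""


def poll_until(
    check: Callable[[], Tuple[bool, Any]],
    timeout: float,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Re-run `check` until it succeeds or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        ok, detail = check()
        if ok:
            return detail
        if clock() >= deadline:
            raise AssertionError(f"not serving within {timeout}s: {detail!r}")
        sleep(interval)
```

Because status and body come from the same response, a flap between two requests cannot produce a mixed result, and the deadline keeps a persistent failure from hanging the run.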
2026-05-27 — G1 GREEN (DG2 + DG3), claiming gate
Full generic lifecycle on hedgedoc (no overlay → all tiers generic), final code, on cc-ci:
$ RECIPE=hedgedoc STAGES=install,upgrade,backup,restore CCCI_JANITOR_MAX_AGE=0 cc-ci-run runner/run_recipe_ci.py
TIER: install (generic) test_serving PASSED # deploy base=prev 3.0.9, serves
TIER: upgrade (generic) test_upgrade_reconverges PASSED # abra app upgrade -> 3.0.10 in place, reconverged+serving
TIER: backup (generic) test_backup_artifact PASSED # snapshot_id produced
TIER: restore (generic) test_restore_healthy PASSED # restored + healthy
RUN SUMMARY: deploy-count = 1 (expect 1) install/upgrade/backup/restore : pass
$ docker stack ls | grep -iE 'hedg|cust' -> (none — clean teardown)
- DG2 (generic upgrade, prev→target in place on the shared deployment, reconverge+serving) ✅.
- DG3 backup-capable path ✅ (artifact = snapshot_id from create; restore completes + healthy).
- DG3 N/A logic evidenced: generic.backup_capable → hedgedoc=True, custom-html=True, custom-html-tiny=False. The non-capable run-demo (backup/restore reported skip, install passing) lands naturally in G3: custom-html-tiny is non-backup-capable AND only serves once the install-steps content hook is added, so the same recipe proves DG5 (fail-without/pass-with) and DG3-N/A (skip on a serving non-backup recipe) together.
- DG4.1 corroborated again: deploy-count=1 across the whole install→upgrade→backup→restore run.
Claiming G1.
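The backup artifact check behind the DG3 item above (snapshot_id taken from the create output) can be sketched as a regex extraction. The pattern assumes the id appears as a lowercase hex string in a "snapshot_id" JSON field, as in the restic summary quoted earlier in this journal; the sample output in the usage lines is invented for the example.

```python
import re
from typing import Optional

# Lowercase hex id, short (8 chars) or full (64 chars), in a JSON field.
_SNAPSHOT_ID = re.compile(r'"snapshot_id"\s*:\s*"([0-9a-f]{8,64})"')


def parse_snapshot_id(output: str) -> Optional[str]:
    """Return the first snapshot id found in the backup output, else None."""
    match = _SNAPSHOT_ID.search(output)
    return match.group(1) if match else None


# Usage with an invented output line (not a real run):
sample_output = 'backup finished\n{"message_type":"summary","snapshot_id":"0123abcd"}\n'
snapshot = parse_snapshot_id(sample_output)
assert snapshot is not None, "backup produced no snapshot_id"
```

Scanning the raw text rather than parsing each line as JSON keeps the check tolerant of non-JSON log lines mixed into the same output.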
2026-05-28 — F1d-2 fix: pinned base now deploys the pinned version (DG2 was vacuous)
Adversary G1 verdict: FAIL — DG2 upgrade was a vacuous no-op. F1d-1 CLOSED (cert reframe accepted).
Root cause (Adversary + my confirmation): deploy_app always deployed with -C (chaos = current
checkout), which IGNORES the version pin → a "previous-version" base actually deployed LATEST, so
"upgrade to newest" was latest→latest and only the still-serving assertion ran ⇒ a broken upgrade
would pass. Real defect.
Fix (two parts):
- deploy_app now checks the recipe out to the pinned tag (abra.recipe_checkout) AND deploys non-chaos when a version is pinned (abra.deploy(chaos=(version is None))). Chaos stays only for the version=None case (deploy the current PR-head checkout).
- Hardened the generic upgrade so a no-op CANNOT pass by construction: do_upgrade captures the app service's (coop-cloud version label, image) before+after and asserts the deployment actually MOVED (lifecycle.deployed_identity). Even if the pin regressed again, before==after → FAIL.
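Both parts of the fix reduce to two small rules, sketched below. The Identity tuple, assert_moved, and use_chaos are hypothetical names; the real lifecycle.deployed_identity reads its values from the live service, which this sketch does not attempt.

```python
from typing import NamedTuple, Optional


class Identity(NamedTuple):
    """What is deployed: the coop-cloud version label and the image reference."""
    version_label: str
    image: str


def use_chaos(version: Optional[str]) -> bool:
    """Chaos deploy only when no version is pinned (the PR-head checkout case)."""
    return version is None


def assert_moved(before: Identity, after: Identity) -> None:
    """Fail if neither the version label nor the image changed."""
    assert before != after, f"upgrade was a no-op: deployment identity unchanged ({before})"
```

Tuple inequality is true when either field differs, which matches the "version label and/or image changed" wording. If the pin were ignored again, before and after would be identical and the upgrade tier would fail instead of passing on the still-serving check alone.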
Probe (the Adversary's exact F1d-2 test, my code, on cc-ci) — now PASSES:
prev: 3.0.9+1.10.7
IMAGE BEFORE (asked prev): quay.io/hedgedoc/hedgedoc:1.10.7@sha256:3174abea… ← was 1.10.8 (LATEST) pre-fix
IMAGE AFTER (upgraded) : quay.io/hedgedoc/hedgedoc:1.10.8@sha256:423f4117…
CHANGED: True
Re-running the full hedgedoc + custom-html lifecycles to confirm all-green with the move-assertion, then re-claim G1 (and G2: custom-html overlays override+extend the generic, deploy-count=1).