cc-ci/machine-docs/JOURNAL-1d.md
autonomic-bot b10daddbef
status(1d): DG6 GREEN (build #153 hedgedoc e2e); G4 CLAIMED — requesting Adversary cold-verify DG1-DG8
build #153: !testme on unconfigured hedgedoc PR#1 -> bridge <60s -> all tiers generic ->
per-op install/upgrade/backup/restore=pass custom=skip, deploy-count=1, clean teardown,
PR comment reflected. DG7 (afd75a4) + DG8 (b756e72) done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:15:25 +01:00


JOURNAL — Phase 1d (append-only)

2026-05-27 — Bootstrap Phase 1d

Read SSOT plan-phase1d-generic-test-suite.md + plan.md §6.1/§7/§9. Studied the post-1b codebase: runner/run_recipe_ci.py (per-stage pytest, currently deploy-per-stage), tests/conftest.py (fixtures deployed_app/deployed/old_app each deploy+teardown), runner/harness/{lifecycle,abra,naming}.py, and existing recipe tests (custom-html/keycloak/etc.).

Access re-verified (bootstrap, new phase):

$ ssh cc-ci 'hostname && whoami && nixos-version'
nixos / root / 24.11.20250630.50ab793 (Vicuna)
$ ssh cc-ci 'abra --version'        -> abra version 0.13.0-beta-06a57de
$ ssh cc-ci 'docker stack ls'       -> traefik, drone, ccci-bridge, ccci-dashboard, backups all up
$ ssh cc-ci 'grep -ri backupbot ~/.abra/recipes/custom-html/'
  compose.yml: backupbot.backup=true ; backupbot.backup.path=/usr/share/nginx/html
$ curl -u bot ... /repos/recipe-maintainers/custom-html-tiny  -> 200 (mirrored)

So: backup-capability is detectable by scanning compose for backupbot.backup; custom-html-tiny is mirrored and has NO cc-ci tests dir → it's the DG1 pure-generic target.
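The detection rule above can be sketched as a plain text scan of the compose file. This is a minimal sketch, not the harness's actual `generic.backup_capable`: the function name `compose_declares_backup` and the regex are assumptions for illustration, and the pattern deliberately ignores sibling labels such as `backupbot.backup.path`.

```python
import re

# Matches `backupbot.backup=true` or `backupbot.backup: "true"`.
# The pattern is an assumption for illustration, not the harness's own regex.
_BACKUP_LABEL = re.compile(
    r"""backupbot\.backup["']?\s*[:=]\s*["']?true\b""", re.IGNORECASE
)


def compose_declares_backup(compose_text: str) -> bool:
    """Return True if any non-comment line carries the backupbot.backup=true label."""
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # a commented-out label does not count
        if _BACKUP_LABEL.search(stripped):
            return True
    return False
```

A recipe that only sets `backupbot.backup.path` without the enabling label is reported as not backup-capable under this sketch.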

Design recorded in DECISIONS.md (Phase 1d section). Key calls: tier model with the lifecycle OP owned by the shared harness (test files = assertions only); OVERRIDE precedence repo-local > cc-ci > generic with extend-by-composition; deploy-ONCE with a deploy-count guard; base version = previous (when upgrade runs) else target; backup-capability auto-detect; install-steps shell hook.
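Two of these calls, the base-version rule and deploy-ONCE with a deploy-count guard, can be sketched as follows. This is a simplified illustration under assumptions, not the code in runner/run_recipe_ci.py: `DeployCounter`, `pick_base_version` and `run_once` are hypothetical names, and the deploy, tier and teardown steps are passed in as plain callables.

```python
from typing import Callable, Dict, List, Optional


class DeployCounter:
    """Records how many times a deploy happened (hypothetical helper)."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> None:
        self.count += 1


def pick_base_version(stages: List[str], previous: Optional[str], target: str) -> str:
    """Deploy the previous version only when an upgrade tier runs and one exists."""
    if "upgrade" in stages and previous is not None:
        return previous
    return target


def run_once(
    stages: List[str],
    deploy: Callable[[], None],
    run_tier: Callable[[str], str],
    teardown: Callable[[], None],
    counter: DeployCounter,
) -> Dict[str, str]:
    """Deploy once, run every tier against that deployment, always tear down."""
    results: Dict[str, str] = {}
    try:
        deploy()
        for stage in stages:
            results[stage] = run_tier(stage)
    finally:
        teardown()  # runs even if the deploy or a tier raised
    # Guard: a tier that redeployed behind the orchestrator's back is an error.
    if counter.count != 1:
        raise AssertionError(f"deploy-count = {counter.count} (expect 1)")
    return results
```

Keeping the counter outside the orchestrator means a tier that sneaks in its own deploy is caught by the guard rather than trusted not to.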

Seeded STATUS-1d / BACKLOG-1d / JOURNAL-1d. Next: implement G0 (generic.py + discovery.py + tests/_generic/ + deploy-once orchestrator), then verify generic install green on custom-html-tiny.

2026-05-27 — G0 generic install + deploy-once orchestrator: DG1 GREEN

Built the G0 machinery and proved DG1 end-to-end on the real server:

  • runner/harness/generic.py: assert_serving (services converged + real HTTP in HEALTH_OK [excludes 404] + not Traefik's 404 body + CA-verified TLS cert is the trusted wildcard), op helpers (do_upgrade/do_backup/do_restore), backup_capable (scan compose for backupbot.backup).
  • runner/harness/discovery.py — per-op overlay resolution (repo-local > cc-ci > generic), custom test discovery (both locations, additive), install-steps hook discovery.
  • tests/_generic/test_{install,upgrade,backup,restore}.py — assertion-only tiers using live_app.
  • runner/run_recipe_ci.py — deploy-ONCE orchestrator: base version (prev if upgrade+exists else target), tiers run against the shared deployment, one teardown in finally, deploy-count guard + per-op summary.
  • tests/conftest.py: live_app fixture (reads CCCI_APP_DOMAIN; tiers never deploy).
  • lifecycle.deploy_app — deploy-count recorder + install-steps hook + pin DOMAIN to the run domain (fixes recipes whose .env.sample uses {{ .Domain }}, which this abra leaves unexpanded).
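The per-op overlay resolution listed above (repo-local > cc-ci > generic) could look roughly like this. It is a sketch under assumptions rather than the contents of discovery.py: `resolve_tier` and its parameters are hypothetical, and the directory layout follows the paths named in this journal (repo-local tests/, cc-ci tests/<recipe>/, and tests/_generic/).

```python
from pathlib import Path
from typing import Tuple


def resolve_tier(op: str, recipe: str, cc_ci_root: Path, recipe_repo: Path) -> Tuple[str, Path]:
    """Return (source, test_file) for one op; the first existing candidate wins."""
    name = f"test_{op}.py"
    candidates = [
        ("repo-local", recipe_repo / "tests" / name),
        ("cc-ci", cc_ci_root / "tests" / recipe / name),
        ("generic", cc_ci_root / "tests" / "_generic" / name),
    ]
    for source, path in candidates:
        if path.is_file():
            return source, path
    raise FileNotFoundError(f"no test file found for op {op!r}")
```

Because resolution is per op, a recipe can override only `test_install.py` and still fall through to the generic upgrade, backup and restore tiers.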

Two real generic bugs found+fixed via live runs (not "should work"):

  1. custom-html-tiny deploy failed: DOMAIN={{ .Domain }} not auto-filled by abra app new -D on 0.13.0-beta → can't evaluate field Domain. Fix: env_set(domain,"DOMAIN",domain) in deploy_app.
  2. served_cert_subject used openssl s_client, but openssl is not on the host (cc-ci-run runtimeInputs has no openssl) → it silently returned None → the "not default cert" check was a no-op (a DG7 can't-fail smell). Replaced with a pure-Python CA-verified handshake (ssl): a publicly-trusted LE wildcard verifies + matches hostname; Traefik's self-signed default fails verification → a genuine assertion. Verified the verify path on the host: ssl.create_default_context() against ci.commoninternet.net → VERIFIED, CN=*.ci.commoninternet.net, SAN=[*.ci.commoninternet.net, ci.commoninternet.net].
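For illustration, a CA-verified handshake with the standard `ssl` module can be sketched as below. This is not the harness's `served_cert`; the function names here are hypothetical. The design point it shows is that a failed verification raises instead of returning None, so the check cannot silently become a no-op.

```python
import socket
import ssl
from typing import List, Optional, Tuple


def verifying_context() -> ssl.SSLContext:
    """Context that verifies the chain against the system CAs and checks the hostname."""
    ctx = ssl.create_default_context()
    # These are already the defaults of create_default_context(); set explicitly for clarity.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def fetch_verified_cert(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Return the peer certificate dict; raises ssl.SSLCertVerificationError if untrusted."""
    ctx = verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def cert_names(cert: dict) -> Tuple[Optional[str], List[str]]:
    """Extract (common_name, DNS SANs) from an ssl getpeercert() dict."""
    common_name = None
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                common_name = value
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    return common_name, sans
```

A self-signed default certificate would fail inside `wrap_socket`, which is what turns the check into a genuine assertion.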

DG1 evidence (cc-ci, final code): custom-html-tiny is a static-web-server with an empty content volume → genuinely serves 404 zero-config (not a serving demo), so I picked hedgedoc instead (simple category, NO cc-ci/repo-local tests → pure generic; backup-capable bonus):

$ RECIPE=hedgedoc STAGES=install cc-ci-run runner/run_recipe_ci.py
===== TIER: install (generic: tests/_generic/test_install.py) =====
tests/_generic/test_install.py::test_serving PASSED
===== RUN SUMMARY =====   deploy-count = 1 (expect 1)   install : pass
$ docker stack ls | grep hedg   -> (none — clean teardown)

Lint+format clean (ruff check/ruff format --check via nix develop .#lint). Claiming the G0 gate.

2026-05-27 — G0/DG1 PASS; F1d-1 fixed; G1 backup+restore fixes

Adversary verdict: DG1 PASS @2026-05-27 (cold, own clone @ef44d46). G0 cleared.

Correcting an overstatement (Adversary finding F1d-1, valid): my earlier G0 wording claimed the CA-verified cert check distinguishes "the app vs a Traefik default-cert fallback." It does NOT — Traefik's file provider serves the pre-issued wildcard for the WHOLE *.ci.commoninternet.net zone, so ANY in-zone subdomain (even a non-deployed one) verifies; the self-signed default cert is never served in-zone. The genuine app-vs-fallback proof is services_converged (the app's OWN service replicas N/N) + a non-404 status in HEALTH_OK (Traefik's unmatched-router fallback = 404). Fix applied (no code behavior change to the load-bearing checks; honesty/scope only):

  • generic.served_cert + assert_serving docstrings/comments reframed: the cert check is an INFRA TLS sanity check (catches a lapsed/mis-rotated wildcard cert — plan §4.0 renewal), explicitly NOT an app-vs-fallback check. Kept because it CAN fail (cert expiry/untrust), unlike the old openssl-missing no-op it replaced.
  • Assertion message reworded ("served wildcard cert is not trusted/valid", not "...not the default"). Noted for the Adversary to re-test + close F1d-1 (theirs to tick).

G1 — DG2 (upgrade) + DG3 (backup/restore) on hedgedoc (backup-capable, ≥2 tags 3.0.9→3.0.10): Two real bugs found+fixed via live runs:

  1. backup artifact check. abra app backup snapshots needs a TTY (FATA the input device is not a TTY), but abra app backup create already emits the restic JSON summary with the produced "snapshot_id" (rc 0, "backup finished"). Verified raw on a live custom-html: "snapshot_id": "d85bf492…". Fix: backup_create returns its output; generic.parse_snapshot_id regex-extracts the id; do_backup asserts it. (Dropped the TTY-bound snapshots listing.)
  2. restore serving race. assert_serving made TWO requests (http_get then http_body); post-restore the app flapped between them → http_body raised an unhandled HTTPError 404. Fix: new lifecycle.http_fetch returns (status, body) in ONE request, never raising; assert_serving now BOUNDED-POLLS converged + serving (status+body from one request) so a post-op reconverge settles while a persistent failure still fails within HTTP_TIMEOUT (no bare sleep). do_upgrade/do_restore call it (dropped the redundant wait_serving). Re-running full hedgedoc install→upgrade→backup→restore to confirm all-green before claiming G1.
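The restore-race fix combines two ideas: fetch status and body in ONE request, and poll with a hard deadline instead of a bare sleep. A minimal sketch of both follows. The names `http_fetch_once` and `poll_until` are hypothetical, the `0` status for "no HTTP response at all" is an assumed sentinel, and the real lifecycle.http_fetch may differ.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_fetch_once(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """One request returning (status, body); HTTP error statuses and connection errors do not raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:  # must precede URLError (it is a subclass)
        return err.code, err.read().decode("utf-8", "replace")
    except (urllib.error.URLError, OSError):
        return 0, ""  # assumed sentinel: no HTTP response at all


def poll_until(
    check: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry check() until it passes or the deadline elapses; a persistent failure returns False."""
    end = clock() + deadline_s
    while True:
        if check():
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

A caller would wrap "converged and serving" into one `check` that uses the status and body from the same `http_fetch_once` call, so a flap between two requests cannot produce a mixed result.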

2026-05-27 — G1 GREEN (DG2 + DG3), claiming gate

Full generic lifecycle on hedgedoc (no overlay → all tiers generic), final code, on cc-ci:

$ RECIPE=hedgedoc STAGES=install,upgrade,backup,restore CCCI_JANITOR_MAX_AGE=0 cc-ci-run runner/run_recipe_ci.py
TIER: install  (generic) test_serving            PASSED   # deploy base=prev 3.0.9, serves
TIER: upgrade  (generic) test_upgrade_reconverges PASSED  # abra app upgrade -> 3.0.10 in place, reconverged+serving
TIER: backup   (generic) test_backup_artifact     PASSED  # snapshot_id produced
TIER: restore  (generic) test_restore_healthy     PASSED  # restored + healthy
RUN SUMMARY: deploy-count = 1 (expect 1)   install/upgrade/backup/restore : pass
$ docker stack ls | grep -iE 'hedg|cust'  -> (none — clean teardown)
  • DG2 (generic upgrade, prev→target in place on the shared deployment, reconverge+serving).
  • DG3 backup-capable path (artifact = snapshot_id from create; restore completes + healthy).
  • DG3 N/A logic evidenced: generic.backup_capable → hedgedoc=True, custom-html=True, custom-html-tiny=False. The non-capable run-demo (backup/restore reported skip, install passing) lands naturally in G3: custom-html-tiny is non-backup-capable AND only serves once the install-steps content hook is added — so the same recipe proves DG5 (fail-without/pass-with) and DG3-N/A (skip on a serving non-backup recipe) together.
  • DG4.1 corroborated again: deploy-count=1 across the whole install→upgrade→backup→restore run. Claiming G1.
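The DG3 artifact check (snapshot_id taken from the create output) can be sketched as a small regex extraction. This is an illustration only: the pattern and the choice to take the last match are assumptions, and the sample output in the usage note is made up, not a captured log.

```python
import re
from typing import Optional

# restic snapshot ids are hexadecimal; the exact pattern is an assumption.
_SNAPSHOT_ID = re.compile(r'"snapshot_id"\s*:\s*"([0-9a-f]+)"')


def parse_snapshot_id(output: str) -> Optional[str]:
    """Return the last snapshot_id found in the backup output, or None if absent."""
    matches = _SNAPSHOT_ID.findall(output)
    return matches[-1] if matches else None
```

With a hypothetical output line such as `{"message_type":"summary","snapshot_id":"0123abcd"}`, the function returns `0123abcd`; a backup that produced no summary yields None, which the caller should assert against.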

2026-05-28 — F1d-2 fix: pinned base now deploys the pinned version (DG2 was vacuous)

Adversary G1 verdict: FAIL — DG2 upgrade was a vacuous no-op. F1d-1 CLOSED (cert reframe accepted). Root cause (Adversary + my confirmation): deploy_app always deployed with -C (chaos = current checkout), which IGNORES the version pin → a "previous-version" base actually deployed LATEST, so "upgrade to newest" was latest→latest and only the still-serving assertion ran ⇒ a broken upgrade would pass. Real defect.

Fix (two parts):

  1. deploy_app now checks the recipe out to the pinned tag (abra.recipe_checkout) AND deploys non-chaos when a version is pinned (abra.deploy(chaos=(version is None))). Chaos stays only for the version=None case (deploy the current PR-head checkout).
  2. Hardened the generic upgrade so a no-op CANNOT pass by construction: do_upgrade captures the app service's (coop-cloud version label, image) before+after and asserts the deployment actually MOVED (lifecycle.deployed_identity). Even if the pin regressed again, before==after → FAIL.
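Both parts of the fix reduce to small, checkable rules. The sketch below is an assumption-laden illustration, not the code in lifecycle.py or generic.py: `Identity`, `use_chaos` and `assert_moved` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Identity(NamedTuple):
    """What is actually deployed: (version label, image reference)."""

    version_label: Optional[str]
    image: Optional[str]


def use_chaos(version: Optional[str]) -> bool:
    """Chaos deploy only when no version is pinned (deploy the current checkout)."""
    return version is None


def assert_moved(before: Identity, after: Identity) -> None:
    """Fail if the upgrade left the deployment identical, so a no-op cannot pass."""
    if before == after:
        raise AssertionError(f"upgrade was a no-op: still {before}")
```

The move check is independent of the pinning logic on purpose: if pinning regresses, `before == after` still fails the tier.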

Probe (the Adversary's exact F1d-2 test, my code, on cc-ci) — now PASSES:

prev: 3.0.9+1.10.7
IMAGE BEFORE (asked prev): quay.io/hedgedoc/hedgedoc:1.10.7@sha256:3174abea…   ← was 1.10.8 (LATEST) pre-fix
IMAGE AFTER  (upgraded) : quay.io/hedgedoc/hedgedoc:1.10.8@sha256:423f4117…
CHANGED: True

Re-running the full hedgedoc + custom-html lifecycles to confirm all-green with the move-assertion, then re-claim G1 (and G2: custom-html overlays override+extend the generic, deploy-count=1).

2026-05-28 — G1 re-confirmed + G2 GREEN; re-claiming both gates

After the F1d-2 fix + the container-retry + the exec-read overlay fix, both full lifecycles are green on cc-ci (final code), deploy-count=1, clean teardown:

G1 (generic, hedgedoc): install/upgrade/backup/restore all pass; upgrade genuinely 1.10.7→1.10.8 with the move-assertion (deployed_identity version-label/image change) — DG2 non-vacuous now.

G2 (overlays, custom-html):

TIER install  (cc-ci: tests/custom-html/test_install.py)  test_serving_and_content   PASSED
TIER upgrade  (cc-ci: tests/custom-html/test_upgrade.py)   test_upgrade_preserves_data PASSED
TIER backup   (cc-ci: tests/custom-html/test_backup.py)    test_backup_captures_state PASSED
TIER restore  (cc-ci: tests/custom-html/test_restore.py)   test_restore_returns_state PASSED
deploy-count = 1   install/upgrade/backup/restore : pass     (residual: none — clean teardown)

This proves DG4 + DG4.1 end-to-end:

  • Override: every tier resolved to (cc-ci: tests/custom-html/...) — the overlay ran INSTEAD of the generic (discovery precedence; unit tests tests/unit/test_discovery.py 5/5).
  • Extend-by-composition: test_install reuses generic.assert_serving then adds a Playwright nginx check; upgrade/backup/restore reuse generic.do_upgrade/do_backup/do_restore.
  • Data-continuity (recipe-specific, the overlay's job): upgrade preserves a marker; backup seeds "original"→snapshot→mutate "mutated"; restore returns "original" (read volume-direct via exec).
  • DG4.1 no redeploy: deploy-count = 1 across all four overlay tiers + their in-place ops.

Two more real bugs fixed en route (both via live runs): _app_container now bounded-polls for the container to reappear (backup-bot cycles it); the custom-html backup/restore overlay reads the marker via exec_in_app (volume-direct), not http (which raced the serving layer post-backup, served ''). Re-claiming G1 (DG2+DG3) and claiming G2 (DG4+DG4.1).

2026-05-28 — G3 GREEN (DG5 hook + graceful-generic) + DG3 N/A-skip run-demo

Custom install-steps hook = tests/<recipe>/install_steps.sh (or repo-local tests/install_steps.sh), run by deploy_app AFTER abra app new+env, BEFORE abra app deploy, env CCCI_APP_DOMAIN/CCCI_RECIPE/CCCI_APP_ENV. Proof on custom-html-tiny (static-web-server serving an empty content volume → 404 zero-config; non-backup-capable), final code on cc-ci:

RUN A: hook ABSENT  -> deploy/readiness failed: ... not healthy over HTTPS / (last status 404)
                       deploy-count=1   install : fail          # graceful-generic: needs a step, fails, reported
RUN B: hook PRESENT -> install-steps hook (cc-ci): .../tests/custom-html-tiny/install_steps.sh
                       install : pass   upgrade : pass           # hook seeded index.html -> serves 200
                       backup  : skip   restore : skip           # non-backup-capable -> N/A (DG3 N/A run-demo)
                       deploy-count = 1

So DG5 is proven BOTH ways on the SAME recipe (fail-without / pass-with), and the SAME run demonstrates DG3's N/A-skip half (backup/restore cleanly skipped, not failed, on a serving non-backup recipe). The hook writes index.html straight to the swarm volume's mountpoint (no container/image pull → no Docker Hub rate-limit risk); deploy-count stays 1 (the pre-created volume is not a deploy). recipe_meta for custom-html-tiny shortens timeouts (fast static app). lint PASS (shellcheck+shfmt+ruff+yamllint). Claiming G3.
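The hook mechanism can be sketched as "find the script, then run it with the three environment variables". Several details here are assumptions: the names `find_hook` and `run_hook` are hypothetical, the repo-local hook is assumed to win (mirroring the test-overlay precedence), the script is assumed to run under `sh`, and `CCCI_APP_ENV` is treated as an opaque string.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def find_hook(recipe: str, cc_ci_root: Path, recipe_repo: Path) -> Optional[Path]:
    """Return the install-steps hook to run, or None if the recipe has none."""
    candidates = (
        recipe_repo / "tests" / "install_steps.sh",         # repo-local (assumed to win)
        cc_ci_root / "tests" / recipe / "install_steps.sh",  # cc-ci overlay
    )
    for path in candidates:
        if path.is_file():
            return path
    return None


def run_hook(hook: Path, domain: str, recipe: str, app_env: str) -> None:
    """Run the hook with the CCCI_* variables set; a non-zero exit raises CalledProcessError."""
    env = dict(
        os.environ,
        CCCI_APP_DOMAIN=domain,
        CCCI_RECIPE=recipe,
        CCCI_APP_ENV=app_env,
    )
    subprocess.run(["sh", str(hook)], env=env, check=True)
```

Using `check=True` means a failing hook aborts before the deploy step rather than producing a confusing readiness failure later.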

2026-05-28 — G4: DG7 migration + DG8 docs (committed); DG6 !testme e2e in flight

G3 Adversary PASS @2026-05-28 (9b5bcff). DG1-DG5 all verified; F1d-1/F1d-2 closed. Working G4.

DG7 (no-regression / DRY) — afd75a4. Migrated the remaining recipe overlays (keycloak/cryptpad/matrix-synapse/n8n/lasuite-docs) to the assertion-only deploy-once contract so the generic lifecycle OP is owned solely by the shared harness (no per-recipe deploy/teardown copy-paste).

DG8 (docs) — b756e72. docs/testing.md (127 lines): the generic suite, the overlay convention (fixed file names test_install/upgrade/backup/restore.py + locations tests/<recipe>/ in cc-ci and repo-local tests/ + precedence repo-local>cc-ci>generic + extend-by-composition), the install-steps hook, backup-capability detection, and how to add an overlay. Updated enroll-recipe.md to the deploy-once contract; README pointer.

DG6 (!testme e2e on an unconfigured recipe) — IN FLIGHT. hedgedoc has NO cc-ci/repo-local overlays ⇒ it is the unconfigured target; enrolled in bridge POLL_REPOS (8262912).

Deploy of the enroll change to cc-ci (the only nix change in 1d): synced working tree via tar | ssh to /root/cc-ci; nixos-rebuild build EXIT 0; detached nixos-rebuild switch (unit ccci-1d-switch) Result=success. Gotcha: the activation's restart of deploy-bridge.service was canceled by the concurrent tailscale-network restart (why we run switch detached), so the new generation was active but the reconcile oneshot still held the OLD ExecStart; a systemctl daemon-reload && systemctl restart deploy-bridge reconciled the swarm service. A clean re-switch on a stable network would do this itself (it is declarative). Live bridge POLL_REPOS now includes recipe-maintainers/hedgedoc; poller log: watching [... 'recipe-maintainers/hedgedoc'] every 30s.

Posted !testme (comment 13750, autonomic-bot — org member ⇒ authorized) on hedgedoc PR #1 at 01:10:16Z. Bridge poller log: [poll] triggered build 153 for hedgedoc@441c411c (PR #1, comment 13750) by autonomic-bot — trigger latency <60s (DG1 path re-exercised). Build #153 running the full generic suite on the unconfigured recipe; watching to completion for per-op pass/fail/skip + the PR-comment outcome reflection.
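The trigger decision described here (a `!testme` comment from an org member starts a build) can be sketched as a small filter. This is a guess at the shape, not the bridge's code: `TriggerFilter` is a hypothetical name, exact-match on the comment body is an assumption, and so is de-duplicating by comment id so that a 30s poll loop does not re-trigger the same comment.

```python
from dataclasses import dataclass, field
from typing import Set

TRIGGER = "!testme"


@dataclass
class TriggerFilter:
    """Decides whether a PR comment should start a build (hypothetical helper)."""

    org_members: Set[str]
    seen_comment_ids: Set[int] = field(default_factory=set)

    def should_trigger(self, comment_id: int, author: str, body: str) -> bool:
        if comment_id in self.seen_comment_ids:
            return False  # already handled on an earlier poll
        if body.strip() != TRIGGER:
            return False  # not a trigger comment
        if author not in self.org_members:
            return False  # only org members are authorized
        self.seen_comment_ids.add(comment_id)
        return True
```

Recording the comment id only after all checks pass keeps unauthorized or unrelated comments out of the seen set.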

DG6 GREEN — build #153 success (full e2e on the unconfigured recipe). Evidence:

  • Pipeline params (Drone API): RECIPE=hedgedoc REF=441c411c88… PR=1 SRC=recipe-maintainers/hedgedoc — REF is the PR head, so the run tested the code at the PR's head commit (D1/DG6 path).
  • All four tiers resolved to the GENERIC suite (hedgedoc has no cc-ci/repo-local overlays): TIER install (generic: tests/_generic/test_install.py) … upgrade/backup/restore likewise — proving the "no overlay ⇒ generic runs" invariant through the REAL pipeline, not just locally.
  • Per-op report (RUN SUMMARY, in the Drone step log):
    deploy-count = 1 (expect 1)
      install : pass     upgrade : pass     backup : pass     restore : pass     custom : skip
    
    install 0.59s / upgrade 1.76s (assertion only; the abra-upgrade OP + image pull run in the orchestrator before it) / backup 8.12s / restore 50.59s — real work, not vacuous.
  • Deploy-once: deploy-count = 1 across install→upgrade→backup→restore (DG4.1 re-confirmed e2e).
  • Teardown (DG7 'every run undeploys'): post-run on cc-ci — docker service ls | grep hedgedoc → none; docker volume ls | grep hedgedoc → none; docker secret ls | grep hedgedoc → none; no ~/.abra hedgedoc app dir. Clean, nothing leaked.
  • Outcome reflected to the PR (bridge): comment on hedgedoc PR #1 — cc-ci: run for hedgedoc @ 441c411c ✅ passed → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/153.

So DG6 holds: !testme on an unconfigured recipe → bridge → Drone → deploy → generic assert → undeploy → per-op report + PR outcome. DG7 (no-regression migration + DRY + teardown-always) and DG8 (docs) committed. Claiming G4 (DG6+DG7+DG8) — requesting Adversary cold-verify of DG1-DG8 → DONE.