cc-ci/machine-docs/JOURNAL-1d.md
autonomic-bot b10daddbef
status(1d): DG6 GREEN (build #153 hedgedoc e2e); G4 CLAIMED — requesting Adversary cold-verify DG1-DG8
build #153: !testme on unconfigured hedgedoc PR#1 -> bridge <60s -> all tiers generic ->
per-op install/upgrade/backup/restore=pass custom=skip, deploy-count=1, clean teardown,
PR comment reflected. DG7 (afd75a4) + DG8 (b756e72) done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:15:25 +01:00

# JOURNAL — Phase 1d (append-only)
## 2026-05-27 — Bootstrap Phase 1d
Read SSOT `plan-phase1d-generic-test-suite.md` + plan.md §6.1/§7/§9. Studied the post-1b codebase:
`runner/run_recipe_ci.py` (per-stage pytest, currently deploy-per-stage), `tests/conftest.py`
(fixtures `deployed_app`/`deployed`/`old_app` each deploy+teardown), `runner/harness/{lifecycle,abra,naming}.py`,
and existing recipe tests (custom-html/keycloak/etc.).
Access re-verified (bootstrap, new phase):
```
$ ssh cc-ci 'hostname && whoami && nixos-version'
nixos / root / 24.11.20250630.50ab793 (Vicuna)
$ ssh cc-ci 'abra --version' -> abra version 0.13.0-beta-06a57de
$ ssh cc-ci 'docker stack ls' -> traefik, drone, ccci-bridge, ccci-dashboard, backups all up
$ ssh cc-ci 'grep -ri backupbot ~/.abra/recipes/custom-html/'
compose.yml: backupbot.backup=true ; backupbot.backup.path=/usr/share/nginx/html
$ curl -u bot ... /repos/recipe-maintainers/custom-html-tiny -> 200 (mirrored)
```
So: backup-capability is detectable by scanning compose for `backupbot.backup`; custom-html-tiny is
mirrored and has NO cc-ci tests dir → it's the DG1 pure-generic target.
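A minimal sketch of the detection idea above, assuming a plain text scan of the compose file is enough. The function name mirrors the `backup_capable` helper mentioned later in this journal, but the body and regex here are illustrative guesses, not the harness code.

```python
import re

# Assumption: a recipe is treated as backup-capable when the bare label key
# `backupbot.backup` is assigned (label "key=value" or YAML "key: value").
# Sub-keys such as `backupbot.backup.path` must not count on their own.
_BACKUP_KEY = re.compile(r"backupbot\.backup\s*[=:]")


def backup_capable(compose_text: str) -> bool:
    """Return True if the compose text assigns the backupbot.backup label."""
    return bool(_BACKUP_KEY.search(compose_text))
```

Note that this sketch does not inspect the label's value, so a recipe that sets the label to `false` would still be reported as capable; a stricter check would need to parse the value.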
**Design recorded in DECISIONS.md (Phase 1d section).** Key calls: tier model with the lifecycle OP
owned by the shared harness (test files = assertions only); OVERRIDE precedence repo-local > cc-ci >
generic with extend-by-composition; deploy-ONCE with a deploy-count guard; base version = previous
(when upgrade runs) else target; backup-capability auto-detect; install-steps shell hook.
Seeded STATUS-1d / BACKLOG-1d / JOURNAL-1d. Next: implement G0 (generic.py + discovery.py +
tests/_generic/ + deploy-once orchestrator), then verify generic install green on custom-html-tiny.
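The override precedence (repo-local > cc-ci > generic) can be sketched roughly as below. `resolve_tier` and its parameters are hypothetical names for illustration; the directory layout follows the overlay convention this journal describes (`tests/test_<op>.py` repo-local, `tests/<recipe>/test_<op>.py` in cc-ci, `tests/_generic/test_<op>.py` as the fallback).

```python
from pathlib import Path
from typing import Tuple


def resolve_tier(op: str, recipe: str, repo_root: Path, ccci_root: Path) -> Tuple[str, Path]:
    """Pick the test file for one lifecycle op: repo-local > cc-ci > generic."""
    candidates = [
        ("repo-local", repo_root / "tests" / f"test_{op}.py"),
        ("cc-ci", ccci_root / "tests" / recipe / f"test_{op}.py"),
    ]
    for source, path in candidates:
        if path.is_file():
            return source, path
    # No overlay anywhere: fall back to the shared generic tier.
    return "generic", ccci_root / "tests" / "_generic" / f"test_{op}.py"
```

Returning the source label alongside the path is one way to produce tier banners such as `(generic: ...)` or `(cc-ci: ...)`, though the real orchestrator may build them differently.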
## 2026-05-27 — G0 generic install + deploy-once orchestrator: DG1 GREEN
Built the G0 machinery and proved DG1 end-to-end on the real server:
- `runner/harness/generic.py`: `assert_serving` (services converged + real HTTP in HEALTH_OK [excludes
404] + not Traefik's 404 body + **CA-verified TLS cert is the trusted wildcard**), op helpers
(`do_upgrade`/`do_backup`/`do_restore`), `backup_capable` (scan compose for backupbot.backup).
- `runner/harness/discovery.py` — per-op overlay resolution (repo-local > cc-ci > generic), custom
test discovery (both locations, additive), install-steps hook discovery.
- `tests/_generic/test_{install,upgrade,backup,restore}.py` — assertion-only tiers using `live_app`.
- `runner/run_recipe_ci.py` — deploy-ONCE orchestrator: base version (prev if upgrade+exists else
target), tiers run against the shared deployment, one teardown in finally, deploy-count guard +
per-op summary.
- `tests/conftest.py`: `live_app` fixture (reads CCCI_APP_DOMAIN; tiers never deploy).
- `lifecycle.deploy_app` — deploy-count recorder + install-steps hook + **pin DOMAIN to the run
domain** (fixes recipes whose .env.sample uses `{{ .Domain }}`, which this abra leaves unexpanded).
**Two real generic bugs found+fixed via live runs (not "should work"):**
1. custom-html-tiny deploy failed: `DOMAIN={{ .Domain }}` not auto-filled by `abra app new -D` on
0.13.0-beta → `can't evaluate field Domain`. Fix: `env_set(domain,"DOMAIN",domain)` in deploy_app.
2. `served_cert_subject` used `openssl s_client`, but **openssl is not on the host** (`cc-ci-run`
runtimeInputs has no openssl) → it silently returned None → the "not default cert" check was a
no-op (a DG7 can't-fail smell). Replaced with a pure-Python **CA-verified handshake** (`ssl`):
a publicly-trusted LE wildcard verifies + matches hostname; Traefik's self-signed default fails
verification → a genuine assertion. Verified the verify path on the host:
`ssl.create_default_context()` against ci.commoninternet.net → VERIFIED, CN=*.ci.commoninternet.net,
SAN=[*.ci.commoninternet.net, ci.commoninternet.net].
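The pure-Python CA-verified handshake from fix 2 might look roughly like this. It is a sketch using only the standard `ssl` and `socket` modules; `served_cert` and `cert_names` are illustrative names and signatures, not necessarily those in `generic.py`.

```python
import socket
import ssl
from typing import List, Optional, Tuple


def cert_names(cert: dict) -> Tuple[Optional[str], List[str]]:
    """Extract (CN, DNS SANs) from the dict returned by SSLSocket.getpeercert()."""
    cn = None
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                cn = value
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    return cn, sans


def served_cert(host: str, port: int = 443, timeout: float = 10.0) -> Optional[dict]:
    """Return the peer cert if it verifies against the system CA store, else None."""
    ctx = ssl.create_default_context()  # verifies both the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert()
    except (ssl.SSLError, OSError):
        # Untrusted, expired, or mismatched certs (and network errors) land here;
        # the caller is expected to assert on the None instead of ignoring it.
        return None
```

Unlike the missing-`openssl` path, a None here comes from a handshake that actually ran, so an assertion on the result can fail for real.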
**DG1 evidence (cc-ci, final code):** custom-html-tiny is a static-web-server with an empty content
volume → genuinely serves 404 zero-config (not a serving demo), so picked **hedgedoc** (simple
category, NO cc-ci/repo-local tests → pure generic; backup-capable bonus):
```
$ RECIPE=hedgedoc STAGES=install cc-ci-run runner/run_recipe_ci.py
===== TIER: install (generic: tests/_generic/test_install.py) =====
tests/_generic/test_install.py::test_serving PASSED
===== RUN SUMMARY ===== deploy-count = 1 (expect 1) install : pass
$ docker stack ls | grep hedg -> (none — clean teardown)
```
Lint+format clean (`ruff check`/`ruff format --check` via `nix develop .#lint`). Claiming the G0 gate.
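The deploy-once shape described in this entry (one deploy, tiers run against it, one teardown in `finally`, deploy-count guard) can be sketched as follows. `DeployCounter` and `run_tiers` are hypothetical stand-ins; the real orchestrator shells out to pytest per tier and records deploys inside `lifecycle.deploy_app`.

```python
from typing import Callable, Dict


class DeployCounter:
    """Counts deploys so the run can assert the app was deployed exactly once."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> None:
        self.count += 1


def run_tiers(
    tiers: Dict[str, Callable[[], bool]],
    deploy: Callable[[], None],
    teardown: Callable[[], None],
    counter: DeployCounter,
) -> Dict[str, str]:
    """Deploy once, run every tier against that deployment, always tear down."""
    results: Dict[str, str] = {}
    try:
        deploy()
        counter.record()
        for name, tier in tiers.items():
            results[name] = "pass" if tier() else "fail"
    finally:
        teardown()  # single teardown, even if deploy or a tier raised
    if counter.count != 1:
        raise RuntimeError(f"deploy-count = {counter.count} (expect 1)")
    return results
```

The guard is the part that makes a per-tier redeploy detectable: any code path that deploys again pushes the count past 1 and fails the run.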
## 2026-05-27 — G0/DG1 PASS; F1d-1 fixed; G1 backup+restore fixes
**Adversary verdict: DG1 PASS @2026-05-27** (cold, own clone @ef44d46). G0 cleared.
**Correcting an overstatement (Adversary finding F1d-1, valid):** my earlier G0 wording claimed the
CA-verified cert check distinguishes "the app vs a Traefik default-cert fallback." It does NOT —
Traefik's file provider serves the pre-issued **wildcard** for the WHOLE `*.ci.commoninternet.net`
zone, so ANY in-zone subdomain (even a non-deployed one) verifies; the self-signed default cert is
never served in-zone. The genuine app-vs-fallback proof is `services_converged` (the app's OWN
service replicas N/N) + a non-404 status in HEALTH_OK (Traefik's unmatched-router fallback = 404).
Fix applied (no code behavior change to the load-bearing checks; honesty/scope only):
- `generic.served_cert` + `assert_serving` docstrings/comments reframed: the cert check is an INFRA
TLS sanity check (catches a lapsed/mis-rotated wildcard cert — plan §4.0 renewal), explicitly NOT
an app-vs-fallback check. Kept because it CAN fail (cert expiry/untrust), unlike the old
openssl-missing no-op it replaced.
- Assertion message reworded ("served wildcard cert is not trusted/valid", not "...not the default").
Noted for the Adversary to re-test + close F1d-1 (theirs to tick).
**G1 — DG2 (upgrade) + DG3 (backup/restore) on hedgedoc (backup-capable, ≥2 tags 3.0.9→3.0.10):**
Two real bugs found+fixed via live runs:
1. *backup artifact check.* `abra app backup snapshots` needs a TTY (`FATA the input device is not a
TTY`), but `abra app backup create` already emits the restic JSON summary with the produced
`"snapshot_id"` (rc 0, "backup finished"). Verified raw on a live custom-html:
`"snapshot_id": "d85bf492…"`. Fix: `backup_create` returns its output; `generic.parse_snapshot_id`
regex-extracts the id; `do_backup` asserts it. (Dropped the TTY-bound `snapshots` listing.)
2. *restore serving race.* `assert_serving` made TWO requests (http_get then http_body); post-restore
the app flapped between them → `http_body` raised an unhandled `HTTPError 404`. Fix: new
`lifecycle.http_fetch` returns (status, body) in ONE request, never raising; `assert_serving` now
BOUNDED-POLLS converged + serving (status+body from one request) so a post-op reconverge settles
while a persistent failure still fails within HTTP_TIMEOUT (no bare sleep). `do_upgrade`/`do_restore`
call it (dropped the redundant `wait_serving`).
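A sketch of the restore-race fix: one request yields both status and body, and a bounded poll wraps it. `http_fetch` follows the name used above, while `poll_until_serving`, the HEALTH_OK contents, and the timeouts are assumptions made for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Assumption: illustrative set only; the harness defines its own HEALTH_OK (404 excluded).
HEALTH_OK = frozenset({200, 401, 403})


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (status, body) from ONE request; status 0 means no HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:  # 4xx/5xx still carry a status and a body
        return err.code, err.read().decode("utf-8", "replace")
    except (OSError, http.client.HTTPException):  # URLError, timeouts, resets
        return 0, ""


def poll_until_serving(
    fetch: Callable[[], Tuple[int, str]], timeout: float, interval: float = 2.0
) -> Tuple[int, str]:
    """Poll until the status is in HEALTH_OK; fail once the deadline has passed."""
    deadline = time.monotonic() + timeout
    while True:
        status, body = fetch()
        if status in HEALTH_OK:
            return status, body
        if time.monotonic() >= deadline:
            raise AssertionError(f"not serving within {timeout}s (last status {status})")
        time.sleep(interval)
```

Because status and body come from the same response, a flap between two requests can no longer produce a healthy status paired with a 404 body, and a persistent failure still surfaces when the deadline expires.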
Re-running full hedgedoc install→upgrade→backup→restore to confirm all-green before claiming G1.
## 2026-05-27 — G1 GREEN (DG2 + DG3), claiming gate
Full generic lifecycle on **hedgedoc** (no overlay → all tiers generic), final code, on cc-ci:
```
$ RECIPE=hedgedoc STAGES=install,upgrade,backup,restore CCCI_JANITOR_MAX_AGE=0 cc-ci-run runner/run_recipe_ci.py
TIER: install (generic) test_serving PASSED # deploy base=prev 3.0.9, serves
TIER: upgrade (generic) test_upgrade_reconverges PASSED # abra app upgrade -> 3.0.10 in place, reconverged+serving
TIER: backup (generic) test_backup_artifact PASSED # snapshot_id produced
TIER: restore (generic) test_restore_healthy PASSED # restored + healthy
RUN SUMMARY: deploy-count = 1 (expect 1) install/upgrade/backup/restore : pass
$ docker stack ls | grep -iE 'hedg|cust' -> (none — clean teardown)
```
- **DG2** (generic upgrade, prev→target in place on the shared deployment, reconverge+serving) ✅.
- **DG3** backup-capable path ✅ (artifact = snapshot_id from create; restore completes + healthy).
- **DG3 N/A logic** evidenced: `generic.backup_capable` → hedgedoc=True, custom-html=True,
custom-html-tiny=False. The non-capable **run-demo** (backup/restore reported `skip`, install
passing) lands naturally in **G3**: custom-html-tiny is non-backup-capable AND only serves once the
install-steps content hook is added — so the same recipe proves DG5 (fail-without/pass-with) and
DG3-N/A (skip on a serving non-backup recipe) together.
- **DG4.1** corroborated again: deploy-count=1 across the whole install→upgrade→backup→restore run.
Claiming G1.
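For reference, the backup artifact check (snapshot id taken from the `backup create` output rather than the TTY-bound listing) can be sketched like this. The regex is an assumption about how `parse_snapshot_id` works; it relies only on the `"snapshot_id"` field seen in the restic JSON summary and on restic ids being lowercase hex.

```python
import re
from typing import Optional

_SNAPSHOT_RE = re.compile(r'"snapshot_id"\s*:\s*"([0-9a-f]+)"')


def parse_snapshot_id(output: str) -> Optional[str]:
    """Return the snapshot id from backup output, or None if none was produced."""
    match = _SNAPSHOT_RE.search(output)
    return match.group(1) if match else None
```

A tier would then assert that the result is not None, so a backup run that exits 0 without producing a snapshot still fails.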
## 2026-05-28 — F1d-2 fix: pinned base now deploys the pinned version (DG2 was vacuous)
**Adversary G1 verdict: FAIL** — DG2 upgrade was a vacuous no-op. F1d-1 CLOSED (cert reframe accepted).
Root cause (Adversary + my confirmation): `deploy_app` always deployed with `-C` (chaos = current
checkout), which IGNORES the version pin → a "previous-version" base actually deployed LATEST, so
"upgrade to newest" was latest→latest and only the still-serving assertion ran ⇒ a broken upgrade
would pass. Real defect.
**Fix (two parts):**
1. `deploy_app` now checks the recipe out to the pinned tag (`abra.recipe_checkout`) AND deploys
**non-chaos** when a version is pinned (`abra.deploy(chaos=(version is None))`). Chaos stays only
for the version=None case (deploy the current PR-head checkout).
2. Hardened the generic upgrade so a no-op CANNOT pass by construction: `do_upgrade` captures the app
service's (coop-cloud version label, image) before+after and asserts the deployment actually
MOVED (`lifecycle.deployed_identity`). Even if the pin regressed again, before==after → FAIL.
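Both parts of the fix reduce to small predicates, sketched here under assumptions: `Identity`, `assert_moved`, and `use_chaos` are hypothetical names, and the real `deployed_identity` reads the version label and image from the live service rather than taking them as arguments.

```python
from typing import NamedTuple, Optional


class Identity(NamedTuple):
    version: str  # coop-cloud version label on the app service
    image: str    # image reference the service is running


def use_chaos(version: Optional[str]) -> bool:
    """Chaos deploy only when no version is pinned (deploy the current checkout)."""
    return version is None


def assert_moved(before: Identity, after: Identity) -> None:
    """Fail if the upgrade left the deployment on the same version and image."""
    if before == after:
        raise AssertionError(
            f"upgrade was a no-op: still {before.version} / {before.image}"
        )
```

With this in place a latest→latest "upgrade" fails by construction, independent of whether the serving assertion still passes.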
**Probe (the Adversary's exact F1d-2 test, my code, on cc-ci) — now PASSES:**
```
prev: 3.0.9+1.10.7
IMAGE BEFORE (asked prev): quay.io/hedgedoc/hedgedoc:1.10.7@sha256:3174abea… ← was 1.10.8 (LATEST) pre-fix
IMAGE AFTER (upgraded) : quay.io/hedgedoc/hedgedoc:1.10.8@sha256:423f4117…
CHANGED: True
```
Re-running the full hedgedoc + custom-html lifecycles to confirm all-green with the move-assertion,
then re-claim G1 (and G2: custom-html overlays override+extend the generic, deploy-count=1).
## 2026-05-28 — G1 re-confirmed + G2 GREEN; re-claiming both gates
After the F1d-2 fix + the container-retry + the exec-read overlay fix, both full lifecycles are green
on cc-ci (final code), deploy-count=1, clean teardown:
**G1 (generic, hedgedoc):** install/upgrade/backup/restore all pass; upgrade genuinely 1.10.7→1.10.8
with the move-assertion (`deployed_identity` version-label/image change) — DG2 non-vacuous now.
**G2 (overlays, custom-html):**
```
TIER install (cc-ci: tests/custom-html/test_install.py) test_serving_and_content PASSED
TIER upgrade (cc-ci: tests/custom-html/test_upgrade.py) test_upgrade_preserves_data PASSED
TIER backup (cc-ci: tests/custom-html/test_backup.py) test_backup_captures_state PASSED
TIER restore (cc-ci: tests/custom-html/test_restore.py) test_restore_returns_state PASSED
deploy-count = 1 install/upgrade/backup/restore : pass (residual: none — clean teardown)
```
This proves DG4 + DG4.1 end-to-end:
- **Override:** every tier resolved to `(cc-ci: tests/custom-html/...)` — the overlay ran INSTEAD of
the generic (discovery precedence; unit tests tests/unit/test_discovery.py 5/5).
- **Extend-by-composition:** test_install reuses `generic.assert_serving` then adds a Playwright nginx
check; upgrade/backup/restore reuse `generic.do_upgrade/do_backup/do_restore`.
- **Data-continuity (recipe-specific, the overlay's job):** upgrade preserves a marker; backup seeds
"original"→snapshot→mutate "mutated"; restore returns "original" (read volume-direct via exec).
- **DG4.1 no redeploy:** deploy-count = 1 across all four overlay tiers + their in-place ops.
Two more real bugs fixed en route (both via live runs): `_app_container` now bounded-polls for the
container to reappear (backup-bot cycles it); the custom-html backup/restore overlay reads the marker
via `exec_in_app` (volume-direct), not http (which raced the serving layer post-backup, served '').
Re-claiming G1 (DG2+DG3) and claiming G2 (DG4+DG4.1).
## 2026-05-28 — G3 GREEN (DG5 hook + graceful-generic) + DG3 N/A-skip run-demo
Custom install-steps hook = `tests/<recipe>/install_steps.sh` (or repo-local `tests/install_steps.sh`),
run by deploy_app AFTER `abra app new`+env, BEFORE `abra app deploy`, env CCCI_APP_DOMAIN/CCCI_RECIPE/
CCCI_APP_ENV. Proof on **custom-html-tiny** (static-web-server serving an empty `content` volume → 404
zero-config; non-backup-capable), final code on cc-ci:
```
RUN A: hook ABSENT -> deploy/readiness failed: ... not healthy over HTTPS / (last status 404)
deploy-count=1 install : fail # graceful-generic: needs a step, fails, reported
RUN B: hook PRESENT -> install-steps hook (cc-ci): .../tests/custom-html-tiny/install_steps.sh
install : pass upgrade : pass # hook seeded index.html -> serves 200
backup : skip restore : skip # non-backup-capable -> N/A (DG3 N/A run-demo)
deploy-count = 1
```
So DG5 is proven BOTH ways on the SAME recipe (fail-without / pass-with), and the SAME run demonstrates
DG3's N/A-skip half (backup/restore cleanly skipped, not failed, on a serving non-backup recipe). The
hook writes index.html straight to the swarm volume's mountpoint (no container/image pull → no Docker
Hub rate-limit risk); deploy-count stays 1 (the pre-created volume is not a deploy). recipe_meta for
custom-html-tiny shortens timeouts (fast static app). lint PASS (shellcheck+shfmt+ruff+yamllint).
Claiming G3.
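A rough sketch of the hook mechanics described in this entry: locate `install_steps.sh`, then run it with the three CCCI_* variables between app creation and deploy. Several details are assumptions, namely that repo-local wins over cc-ci (mirroring the overlay precedence), that the hook is run via `sh`, and that CCCI_APP_ENV is a path to the app's env file.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def find_install_hook(recipe: str, repo_root: Path, ccci_root: Path) -> Optional[Path]:
    """Return the install-steps hook path, or None if the recipe has no hook."""
    candidates = (
        repo_root / "tests" / "install_steps.sh",           # repo-local
        ccci_root / "tests" / recipe / "install_steps.sh",  # cc-ci
    )
    for path in candidates:
        if path.is_file():
            return path
    return None


def run_install_hook(hook: Path, domain: str, recipe: str, app_env: Path) -> None:
    """Run the hook with the CCCI_* variables set; a non-zero exit raises."""
    env = dict(
        os.environ,
        CCCI_APP_DOMAIN=domain,
        CCCI_RECIPE=recipe,
        CCCI_APP_ENV=str(app_env),
    )
    subprocess.run(["sh", str(hook)], env=env, check=True)
```

Using `check=True` means a failing hook aborts the deploy step instead of silently leaving the app unseeded.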
## 2026-05-28 — G4: DG7 migration + DG8 docs (committed); DG6 !testme e2e in flight
G3 Adversary PASS @2026-05-28 (9b5bcff). DG1-DG5 all verified; F1d-1/F1d-2 closed. Working G4.
**DG7 (no-regression / DRY) — afd75a4.** Migrated the remaining recipe overlays
(keycloak/cryptpad/matrix-synapse/n8n/lasuite-docs) to the assertion-only deploy-once contract so the
generic lifecycle OP is owned solely by the shared harness (no per-recipe deploy/teardown copy-paste).
**DG8 (docs) — b756e72.** `docs/testing.md` (127 lines): the generic suite, the overlay convention
(fixed file names test_install/upgrade/backup/restore.py + locations tests/<recipe>/ in cc-ci and
repo-local tests/ + precedence repo-local>cc-ci>generic + extend-by-composition), the install-steps
hook, backup-capability detection, and how to add an overlay. Updated enroll-recipe.md to the
deploy-once contract; README pointer.
**DG6 (!testme e2e on an unconfigured recipe) — IN FLIGHT.** hedgedoc has NO cc-ci/repo-local
overlays ⇒ it is the unconfigured target; enrolled in bridge POLL_REPOS (8262912).
Deploy of the enroll change to cc-ci (the only nix change in 1d): synced working tree via `tar | ssh`
→ `/root/cc-ci`; `nixos-rebuild build` EXIT 0; detached `nixos-rebuild switch` (unit ccci-1d-switch)
Result=success. **Gotcha:** the activation's restart of `deploy-bridge.service` was canceled by the
concurrent tailscale-network restart (why we run switch detached), so the new generation was active
but the reconcile oneshot still held the OLD ExecStart; a `systemctl daemon-reload && systemctl
restart deploy-bridge` reconciled the swarm service. A clean re-switch on a stable network would do
this itself (it is declarative). Live bridge POLL_REPOS now includes recipe-maintainers/hedgedoc;
poller log: `watching [... 'recipe-maintainers/hedgedoc'] every 30s`.
Posted `!testme` (comment 13750, autonomic-bot — org member ⇒ authorized) on hedgedoc PR #1 at
01:10:16Z. Bridge poller log: `[poll] triggered build 153 for hedgedoc@441c411c (PR #1, comment
13750) by autonomic-bot` — trigger latency <60s (DG1 path re-exercised). Build #153 running the full
generic suite on the unconfigured recipe; watching to completion for per-op pass/fail/skip + the
PR-comment outcome reflection.
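The trigger rule exercised here (a `!testme` comment from an org member starts a build) could be expressed as a predicate like the one below. This is purely a hypothetical sketch: `should_trigger`, the exact-match rule, and the member-set parameter are assumptions, and the bridge's real matching and authorization logic is not shown in this journal.

```python
from typing import Set


def should_trigger(comment_body: str, author: str, org_members: Set[str]) -> bool:
    """True when the comment is exactly `!testme` and the author is an org member."""
    return comment_body.strip() == "!testme" and author in org_members
```

For example, `should_trigger("!testme", "autonomic-bot", {"autonomic-bot"})` returns True, while the same comment from a non-member returns False.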
**DG6 GREEN — build #153 success (full e2e on the unconfigured recipe).** Evidence:
- **Pipeline params** (Drone API): `RECIPE=hedgedoc REF=441c411c88… PR=1 SRC=recipe-maintainers/hedgedoc`
— REF is the PR head, so the run tested the code at the PR's head commit (D1/DG6 path).
- **All four tiers resolved to the GENERIC suite** (hedgedoc has no cc-ci/repo-local overlays):
`TIER install (generic: tests/_generic/test_install.py)` … upgrade/backup/restore likewise — proving
the "no overlay ⇒ generic runs" invariant through the REAL pipeline, not just locally.
- **Per-op report** (RUN SUMMARY, in the Drone step log):
```
deploy-count = 1 (expect 1)
install : pass upgrade : pass backup : pass restore : pass custom : skip
```
install 0.59s / upgrade 1.76s (assertion only; the abra-upgrade OP + image pull run in the
orchestrator before it) / backup 8.12s / restore 50.59s — real work, not vacuous.
- **Deploy-once:** deploy-count = 1 across install→upgrade→backup→restore (DG4.1 re-confirmed e2e).
- **Teardown (DG7 'every run undeploys'):** post-run on cc-ci — `docker service ls | grep hedgedoc` →
none; `docker volume ls | grep hedgedoc` → none; `docker secret ls | grep hedgedoc` → none; no
`~/.abra` hedgedoc app dir. Clean, nothing leaked.
- **Outcome reflected to the PR** (bridge): comment on hedgedoc PR #1 —
`cc-ci: run for hedgedoc @ 441c411c ✅ passed → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/153`.
So DG6 holds: `!testme` on an unconfigured recipe → bridge → Drone → deploy → generic assert →
undeploy → per-op report + PR outcome. DG7 (no-regression migration + DRY + teardown-always) and DG8
(docs) committed. **Claiming G4** (DG6+DG7+DG8) — requesting Adversary cold-verify of DG1-DG8 → DONE.