# JOURNAL — Phase 1d (append-only)

## 2026-05-27 — Bootstrap Phase 1d

Read SSOT `plan-phase1d-generic-test-suite.md` + plan.md §6.1/§7/§9. Studied the post-1b codebase:
`runner/run_recipe_ci.py` (per-stage pytest, currently deploy-per-stage), `tests/conftest.py`
(fixtures `deployed_app`/`deployed`/`old_app` each deploy+teardown), `runner/harness/{lifecycle,abra,naming}.py`,
and existing recipe tests (custom-html/keycloak/etc.).

Access re-verified (bootstrap, new phase):
```
$ ssh cc-ci 'hostname && whoami && nixos-version'
nixos / root / 24.11.20250630.50ab793 (Vicuna)
$ ssh cc-ci 'abra --version' -> abra version 0.13.0-beta-06a57de
$ ssh cc-ci 'docker stack ls' -> traefik, drone, ccci-bridge, ccci-dashboard, backups all up
$ ssh cc-ci 'grep -ri backupbot ~/.abra/recipes/custom-html/'
compose.yml: backupbot.backup=true ; backupbot.backup.path=/usr/share/nginx/html
$ curl -u bot ... /repos/recipe-maintainers/custom-html-tiny -> 200 (mirrored)
```
So: backup-capability is detectable by scanning compose for `backupbot.backup`; custom-html-tiny is
mirrored and has NO cc-ci tests dir → it's the DG1 pure-generic target.
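The compose scan described above can be sketched as a plain text match. This is a minimal sketch, not the harness's actual `backup_capable`: the function name `compose_is_backup_capable`, passing the compose file as a string, and requiring the label value to be `true` are all assumptions made for illustration.

```python
import re

# Assumption: a recipe counts as backup-capable when its compose file sets the
# `backupbot.backup` label to true, in list form ("backupbot.backup=true") or
# mapping form (backupbot.backup: "true"). Sub-keys such as
# `backupbot.backup.path` must not match on their own.
_BACKUP_LABEL = re.compile(
    r"""backupbot\.backup\s*[=:]\s*["']?true\b""",
    re.IGNORECASE,
)


def compose_is_backup_capable(compose_text: str) -> bool:
    """Return True if the compose text enables the backupbot label."""
    return bool(_BACKUP_LABEL.search(compose_text))
```

A text scan like this also matches commented-out labels; parsing the YAML would be stricter, at the cost of a dependency.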

**Design recorded in DECISIONS.md (Phase 1d section).** Key calls: tier model with the lifecycle OP
owned by the shared harness (test files = assertions only); OVERRIDE precedence repo-local > cc-ci >
generic with extend-by-composition; deploy-ONCE with a deploy-count guard; base version = previous
(when upgrade runs) else target; backup-capability auto-detect; install-steps shell hook.
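The deploy-count guard mentioned above can be pictured as a small counter that the deploy path increments and the orchestrator checks at the end of the run. This is a hedged sketch only: the class name `DeployCounter` is hypothetical, and it keeps the count in memory, whereas a harness that runs each tier in a separate pytest process would need to persist it (for example in a file).

```python
class DeployCounter:
    """Counts deploys in one run so deploy-once can be asserted."""

    def __init__(self, expected: int = 1) -> None:
        self.expected = expected
        self.count = 0

    def record(self) -> None:
        # Called once per actual deploy; tiers never call this themselves.
        self.count += 1

    def check(self) -> None:
        # Raise explicitly (not `assert`) so the guard survives `python -O`.
        if self.count != self.expected:
            raise AssertionError(
                f"deploy-count = {self.count} (expect {self.expected})"
            )
```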

Seeded STATUS-1d / BACKLOG-1d / JOURNAL-1d. Next: implement G0 (generic.py + discovery.py +
tests/_generic/ + deploy-once orchestrator), then verify generic install green on custom-html-tiny.

## 2026-05-27 — G0 generic install + deploy-once orchestrator: DG1 GREEN

Built the G0 machinery and proved DG1 end-to-end on the real server:
- `runner/harness/generic.py` — `assert_serving` (services converged + real HTTP in HEALTH_OK [excludes
404] + not Traefik's 404 body + **CA-verified TLS cert is the trusted wildcard**), op helpers
(`do_upgrade`/`do_backup`/`do_restore`), `backup_capable` (scan compose for backupbot.backup).
- `runner/harness/discovery.py` — per-op overlay resolution (repo-local > cc-ci > generic), custom
test discovery (both locations, additive), install-steps hook discovery.
- `tests/_generic/test_{install,upgrade,backup,restore}.py` — assertion-only tiers using `live_app`.
- `runner/run_recipe_ci.py` — deploy-ONCE orchestrator: base version (prev if upgrade+exists else
target), tiers run against the shared deployment, one teardown in finally, deploy-count guard +
per-op summary.
- `tests/conftest.py` — `live_app` fixture (reads CCCI_APP_DOMAIN; tiers never deploy).
- `lifecycle.deploy_app` — deploy-count recorder + install-steps hook + **pin DOMAIN to the run
domain** (fixes recipes whose .env.sample uses `{{ .Domain }}`, which this abra leaves unexpanded).
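The per-op overlay resolution listed above (repo-local > cc-ci > generic) can be sketched as a first-match lookup over candidate paths. The directory layout follows the locations named in this journal; the function name `resolve_tier` and its signature are hypothetical, not the actual `discovery.py` API.

```python
from pathlib import Path
from typing import Tuple


def resolve_tier(
    op: str, recipe: str, repo_local_root: Path, cc_ci_root: Path
) -> Tuple[str, Path]:
    """Return (source, test file) for one op, highest precedence first."""
    candidates = [
        # Overlay shipped in the recipe repository itself.
        ("repo-local", repo_local_root / "tests" / f"test_{op}.py"),
        # Overlay kept in cc-ci under tests/<recipe>/.
        ("cc-ci", cc_ci_root / "tests" / recipe / f"test_{op}.py"),
    ]
    for source, path in candidates:
        if path.is_file():
            return source, path
    # No overlay for this op: fall back to the generic tier.
    return "generic", cc_ci_root / "tests" / "_generic" / f"test_{op}.py"
```

Resolving per op rather than per recipe means a recipe can override only `test_backup.py` and still get the generic install, upgrade, and restore tiers.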

**Two real generic bugs found+fixed via live runs (not "should work"):**
1. custom-html-tiny deploy failed: `DOMAIN={{ .Domain }}` not auto-filled by `abra app new -D` on
0.13.0-beta → `can't evaluate field Domain`. Fix: `env_set(domain,"DOMAIN",domain)` in deploy_app.
2. `served_cert_subject` used `openssl s_client`, but **openssl is not on the host** (`cc-ci-run`
runtimeInputs has no openssl) → it silently returned None → the "not default cert" check was a
no-op (a DG7 can't-fail smell). Replaced with a pure-Python **CA-verified handshake** (`ssl`):
a publicly-trusted LE wildcard verifies + matches hostname; Traefik's self-signed default fails
verification → a genuine assertion. Verified the verify path on the host:
`ssl.create_default_context()` against ci.commoninternet.net → VERIFIED, CN=*.ci.commoninternet.net,
SAN=[*.ci.commoninternet.net, ci.commoninternet.net].
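A CA-verified handshake of the kind described in item 2 can be written with the standard library alone. The sketch below is an illustration under assumptions, not the harness code: it reuses the name `served_cert` for readability, while `assert_cert_trusted` and the exact error handling are hypothetical.

```python
import socket
import ssl
from typing import Optional


def served_cert(host: str, port: int = 443, timeout: float = 10.0) -> Optional[dict]:
    """Return the verified peer certificate for `host`, or None if the
    connection or the verification fails."""
    # create_default_context() loads the system CA bundle and enables both
    # chain verification (CERT_REQUIRED) and hostname checking.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert()
    except OSError:
        # ssl.SSLError (including certificate verification errors) is a
        # subclass of OSError, as are refused or timed-out connections.
        return None


def assert_cert_trusted(host: str, port: int = 443) -> None:
    """Fail loudly on None so a missing result can never pass silently."""
    if served_cert(host, port) is None:
        raise AssertionError(f"served cert for {host} is not trusted/valid")
```

The point of the second function is the lesson from the openssl case: a helper that returns None on failure is only a real check if the caller treats None as a failure.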

**DG1 evidence (cc-ci, final code):** custom-html-tiny is a static-web-server with an empty content
volume → genuinely serves 404 zero-config (not a serving demo), so picked **hedgedoc** (simple
category, NO cc-ci/repo-local tests → pure generic; backup-capable bonus):
```
$ RECIPE=hedgedoc STAGES=install cc-ci-run runner/run_recipe_ci.py
===== TIER: install (generic: tests/_generic/test_install.py) =====
tests/_generic/test_install.py::test_serving PASSED
===== RUN SUMMARY ===== deploy-count = 1 (expect 1) install : pass
$ docker stack ls | grep hedg -> (none — clean teardown)
```
Lint+format clean (`ruff check`/`ruff format --check` via `nix develop .#lint`). Claiming the G0 gate.

## 2026-05-27 — G0/DG1 PASS; F1d-1 fixed; G1 backup+restore fixes

**Adversary verdict: DG1 PASS @2026-05-27** (cold, own clone @ef44d46). G0 cleared.

**Correcting an overstatement (Adversary finding F1d-1, valid):** my earlier G0 wording claimed the
CA-verified cert check distinguishes "the app vs a Traefik default-cert fallback." It does NOT —
Traefik's file provider serves the pre-issued **wildcard** for the WHOLE `*.ci.commoninternet.net`
zone, so ANY in-zone subdomain (even a non-deployed one) verifies; the self-signed default cert is
never served in-zone. The genuine app-vs-fallback proof is `services_converged` (the app's OWN
service replicas N/N) + a non-404 status in HEALTH_OK (Traefik's unmatched-router fallback = 404).
Fix applied (no code behavior change to the load-bearing checks; honesty/scope only):
- `generic.served_cert` + `assert_serving` docstrings/comments reframed: the cert check is an INFRA
TLS sanity check (catches a lapsed/mis-rotated wildcard cert — plan §4.0 renewal), explicitly NOT
an app-vs-fallback check. Kept because it CAN fail (cert expiry/untrust), unlike the old
openssl-missing no-op it replaced.
- Assertion message reworded ("served wildcard cert is not trusted/valid", not "...not the default").
Noted for the Adversary to re-test + close F1d-1 (theirs to tick).

**G1 — DG2 (upgrade) + DG3 (backup/restore) on hedgedoc (backup-capable, ≥2 tags 3.0.9→3.0.10):**
Two real bugs found+fixed via live runs:
1. *backup artifact check.* `abra app backup snapshots` needs a TTY (`FATA the input device is not a
TTY`), but `abra app backup create` already emits the restic JSON summary with the produced
`"snapshot_id"` (rc 0, "backup finished"). Verified raw on a live custom-html:
`"snapshot_id": "d85bf492…"`. Fix: `backup_create` returns its output; `generic.parse_snapshot_id`
regex-extracts the id; `do_backup` asserts it. (Dropped the TTY-bound `snapshots` listing.)
2. *restore serving race.* `assert_serving` made TWO requests (http_get then http_body); post-restore
the app flapped between them → `http_body` raised an unhandled `HTTPError 404`. Fix: new
`lifecycle.http_fetch` returns (status, body) in ONE request, never raising; `assert_serving` now
BOUNDED-POLLS converged + serving (status+body from one request) so a post-op reconverge settles
while a persistent failure still fails within HTTP_TIMEOUT (no bare sleep). `do_upgrade`/`do_restore`
call it (dropped the redundant `wait_serving`).
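The two fixes above can be sketched together: extracting the snapshot id from the `backup create` output, and polling with a single request per attempt under a hard deadline. This is a minimal sketch under assumptions. `parse_snapshot_id` borrows the name used above, but the regex (lowercase hex ids) is a guess at the format. `poll_serving` and its parameters are hypothetical, and the set of acceptable statuses is passed in rather than assumed.

```python
import re
import time
from typing import Callable, Collection, Optional, Tuple

# Assumption: the id appears as "snapshot_id": "<hex>" in the JSON summary.
_SNAPSHOT_ID = re.compile(r'"snapshot_id"\s*:\s*"([0-9a-f]+)"')


def parse_snapshot_id(output: str) -> Optional[str]:
    """Return the snapshot id found in the command output, or None."""
    match = _SNAPSHOT_ID.search(output)
    return match.group(1) if match else None


def poll_serving(
    fetch: Callable[[], Tuple[int, str]],
    ok_statuses: Collection[int],
    timeout: float,
    interval: float = 2.0,
) -> Tuple[int, str]:
    """Poll `fetch` until it reports an acceptable status or time runs out.

    `fetch` performs ONE request and returns (status, body) without raising,
    so status and body always describe the same response.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, body = fetch()
        if status in ok_statuses:
            return status, body
        if time.monotonic() >= deadline:
            # A persistent failure still fails, bounded by `timeout`.
            raise AssertionError(
                f"not serving within {timeout}s (last status {status})"
            )
        time.sleep(interval)
```

Injecting `fetch` keeps the polling logic testable without a network; in the harness that role would be played by the single-request fetch described in item 2.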

Re-running full hedgedoc install→upgrade→backup→restore to confirm all-green before claiming G1.

## 2026-05-27 — G1 GREEN (DG2 + DG3), claiming gate

Full generic lifecycle on **hedgedoc** (no overlay → all tiers generic), final code, on cc-ci:
```
$ RECIPE=hedgedoc STAGES=install,upgrade,backup,restore CCCI_JANITOR_MAX_AGE=0 cc-ci-run runner/run_recipe_ci.py
TIER: install (generic) test_serving PASSED # deploy base=prev 3.0.9, serves
TIER: upgrade (generic) test_upgrade_reconverges PASSED # abra app upgrade -> 3.0.10 in place, reconverged+serving
TIER: backup (generic) test_backup_artifact PASSED # snapshot_id produced
TIER: restore (generic) test_restore_healthy PASSED # restored + healthy
RUN SUMMARY: deploy-count = 1 (expect 1) install/upgrade/backup/restore : pass
$ docker stack ls | grep -iE 'hedg|cust' -> (none — clean teardown)
```
- **DG2** (generic upgrade, prev→target in place on the shared deployment, reconverge+serving) ✅.
- **DG3** backup-capable path ✅ (artifact = snapshot_id from create; restore completes + healthy).
- **DG3 N/A logic** evidenced: `generic.backup_capable` → hedgedoc=True, custom-html=True,
custom-html-tiny=False. The non-capable **run-demo** (backup/restore reported `skip`, install
passing) lands naturally in **G3**: custom-html-tiny is non-backup-capable AND only serves once the
install-steps content hook is added — so the same recipe proves DG5 (fail-without/pass-with) and
DG3-N/A (skip on a serving non-backup recipe) together.
- **DG4.1** corroborated again: deploy-count=1 across the whole install→upgrade→backup→restore run.
Claiming G1.

## 2026-05-28 — F1d-2 fix: pinned base now deploys the pinned version (DG2 was vacuous)

**Adversary G1 verdict: FAIL** — DG2 upgrade was a vacuous no-op. F1d-1 CLOSED (cert reframe accepted).
Root cause (Adversary + my confirmation): `deploy_app` always deployed with `-C` (chaos = current
checkout), which IGNORES the version pin → a "previous-version" base actually deployed LATEST, so
"upgrade to newest" was latest→latest and only the still-serving assertion ran ⇒ a broken upgrade
would pass. Real defect.

**Fix (two parts):**
1. `deploy_app` now checks the recipe out to the pinned tag (`abra.recipe_checkout`) AND deploys
**non-chaos** when a version is pinned (`abra.deploy(chaos=(version is None))`). Chaos stays only
for the version=None case (deploy the current PR-head checkout).
2. Hardened the generic upgrade so a no-op CANNOT pass by construction: `do_upgrade` captures the app
service's (coop-cloud version label, image) before+after and asserts the deployment actually
MOVED (`lifecycle.deployed_identity`). Even if the pin regressed again, before==after → FAIL.
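The move-assertion in part 2 amounts to comparing an identity snapshot taken before and after the upgrade. The sketch below shows the shape of that check. The `Identity` type and `upgrade_and_assert_moved` are hypothetical names, and both the identity reader and the upgrade step are injected callables, not the harness's real `deployed_identity` or `abra` calls.

```python
from typing import Callable, NamedTuple


class Identity(NamedTuple):
    version: str  # coop-cloud version label on the app service
    image: str    # image reference the service is running


def upgrade_and_assert_moved(
    read_identity: Callable[[], Identity],
    run_upgrade: Callable[[], None],
) -> Identity:
    """Run the upgrade and fail if the deployed identity did not change."""
    before = read_identity()
    run_upgrade()
    after = read_identity()
    if after == before:
        # latest -> latest (or any other no-op) cannot pass by construction.
        raise AssertionError(f"upgrade was a no-op: still {before}")
    return after
```

Because the comparison is on what is actually deployed, the check is independent of how the base version was pinned, which is what makes it robust to a regression in the pinning logic.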

**Probe (the Adversary's exact F1d-2 test, my code, on cc-ci) — now PASSES:**
```
prev: 3.0.9+1.10.7
IMAGE BEFORE (asked prev): quay.io/hedgedoc/hedgedoc:1.10.7@sha256:3174abea… ← was 1.10.8 (LATEST) pre-fix
IMAGE AFTER (upgraded) : quay.io/hedgedoc/hedgedoc:1.10.8@sha256:423f4117…
CHANGED: True
```
Re-running the full hedgedoc + custom-html lifecycles to confirm all-green with the move-assertion,
then re-claim G1 (and G2: custom-html overlays override+extend the generic, deploy-count=1).

## 2026-05-28 — G1 re-confirmed + G2 GREEN; re-claiming both gates

After the F1d-2 fix + the container-retry + the exec-read overlay fix, both full lifecycles are green
on cc-ci (final code), deploy-count=1, clean teardown:

**G1 (generic, hedgedoc):** install/upgrade/backup/restore all pass; upgrade genuinely 1.10.7→1.10.8
with the move-assertion (`deployed_identity` version-label/image change) — DG2 non-vacuous now.

**G2 (overlays, custom-html):**
```
TIER install (cc-ci: tests/custom-html/test_install.py) test_serving_and_content PASSED
TIER upgrade (cc-ci: tests/custom-html/test_upgrade.py) test_upgrade_preserves_data PASSED
TIER backup (cc-ci: tests/custom-html/test_backup.py) test_backup_captures_state PASSED
TIER restore (cc-ci: tests/custom-html/test_restore.py) test_restore_returns_state PASSED
deploy-count = 1 install/upgrade/backup/restore : pass (residual: none — clean teardown)
```
This proves DG4 + DG4.1 end-to-end:
- **Override:** every tier resolved to `(cc-ci: tests/custom-html/...)` — the overlay ran INSTEAD of
the generic (discovery precedence; unit tests tests/unit/test_discovery.py 5/5).
- **Extend-by-composition:** test_install reuses `generic.assert_serving` then adds a Playwright nginx
check; upgrade/backup/restore reuse `generic.do_upgrade/do_backup/do_restore`.
- **Data-continuity (recipe-specific, the overlay's job):** upgrade preserves a marker; backup seeds
"original"→snapshot→mutate "mutated"; restore returns "original" (read volume-direct via exec).
- **DG4.1 no redeploy:** deploy-count = 1 across all four overlay tiers + their in-place ops.

Two more real bugs fixed en route (both via live runs): `_app_container` now bounded-polls for the
container to reappear (backup-bot cycles it); the custom-html backup/restore overlay reads the marker
via `exec_in_app` (volume-direct), not http (which raced the serving layer post-backup, served '').
Re-claiming G1 (DG2+DG3) and claiming G2 (DG4+DG4.1).

## 2026-05-28 — G3 GREEN (DG5 hook + graceful-generic) + DG3 N/A-skip run-demo

Custom install-steps hook = `tests/<recipe>/install_steps.sh` (or repo-local `tests/install_steps.sh`),
run by deploy_app AFTER `abra app new`+env, BEFORE `abra app deploy`, env CCCI_APP_DOMAIN/CCCI_RECIPE/
CCCI_APP_ENV. Proof on **custom-html-tiny** (static-web-server serving an empty `content` volume → 404
zero-config; non-backup-capable), final code on cc-ci:
```
RUN A: hook ABSENT -> deploy/readiness failed: ... not healthy over HTTPS / (last status 404)
deploy-count=1 install : fail # graceful-generic: needs a step, fails, reported
RUN B: hook PRESENT -> install-steps hook (cc-ci): .../tests/custom-html-tiny/install_steps.sh
install : pass upgrade : pass # hook seeded index.html -> serves 200
backup : skip restore : skip # non-backup-capable -> N/A (DG3 N/A run-demo)
deploy-count = 1
```
So DG5 is proven BOTH ways on the SAME recipe (fail-without / pass-with), and the SAME run demonstrates
DG3's N/A-skip half (backup/restore cleanly skipped, not failed, on a serving non-backup recipe). The
hook writes index.html straight to the swarm volume's mountpoint (no container/image pull → no Docker
Hub rate-limit risk); deploy-count stays 1 (the pre-created volume is not a deploy). recipe_meta for
custom-html-tiny shortens timeouts (fast static app). lint PASS (shellcheck+shfmt+ruff+yamllint).
Claiming G3.
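For reference, the install-steps hook contract described in this entry (discover the script, export the three `CCCI_*` variables, run it between `abra app new` and `abra app deploy`) might look roughly like the following. Everything here is a hedged sketch: the function names are hypothetical, the repo-local-first precedence is assumed by analogy with the overlay precedence, and treating `CCCI_APP_ENV` as the path to the app's env file is a guess.

```python
import os
import subprocess
from pathlib import Path
from typing import Dict, Optional

HOOK_NAME = "install_steps.sh"


def find_install_steps(
    recipe: str, cc_ci_root: Path, repo_local_root: Path
) -> Optional[Path]:
    """Return the hook script to run, or None if the recipe has none."""
    candidates = (
        repo_local_root / "tests" / HOOK_NAME,      # assumed to win
        cc_ci_root / "tests" / recipe / HOOK_NAME,  # cc-ci overlay location
    )
    for path in candidates:
        if path.is_file():
            return path
    return None


def hook_env(domain: str, recipe: str, app_env: str) -> Dict[str, str]:
    """Build the environment passed to the hook on top of the current one."""
    env = dict(os.environ)
    env.update(
        CCCI_APP_DOMAIN=domain,
        CCCI_RECIPE=recipe,
        CCCI_APP_ENV=app_env,  # assumption: path to the app's env file
    )
    return env


def run_install_steps(hook: Path, env: Dict[str, str]) -> None:
    """Run the hook with bash; a non-zero exit raises CalledProcessError."""
    subprocess.run(["bash", str(hook)], env=env, check=True)
```

Failing the run on a non-zero hook exit keeps the behaviour in line with RUN A above: a recipe that needs a step and does not get it is reported as a failure, not silently passed.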

## 2026-05-28 — G4: DG7 migration + DG8 docs (committed); DG6 !testme e2e in flight

G3 Adversary PASS @2026-05-28 (9b5bcff). DG1–DG5 all verified; F1d-1/F1d-2 closed. Working G4.

**DG7 (no-regression / DRY) — afd75a4.** Migrated the remaining recipe overlays
(keycloak/cryptpad/matrix-synapse/n8n/lasuite-docs) to the assertion-only deploy-once contract so the
generic lifecycle OP is owned solely by the shared harness (no per-recipe deploy/teardown copy-paste).

**DG8 (docs) — b756e72.** `docs/testing.md` (127 lines): the generic suite, the overlay convention
(fixed file names test_install/upgrade/backup/restore.py + locations tests/<recipe>/ in cc-ci and
repo-local tests/ + precedence repo-local>cc-ci>generic + extend-by-composition), the install-steps
hook, backup-capability detection, and how to add an overlay. Updated enroll-recipe.md to the
deploy-once contract; README pointer.

**DG6 (!testme e2e on an unconfigured recipe) — IN FLIGHT.** hedgedoc has NO cc-ci/repo-local
overlays ⇒ it is the unconfigured target; enrolled in bridge POLL_REPOS (8262912).

Deploy of the enroll change to cc-ci (the only nix change in 1d): synced working tree via `tar | ssh`
→ `/root/cc-ci`; `nixos-rebuild build` EXIT 0; detached `nixos-rebuild switch` (unit ccci-1d-switch)
Result=success. **Gotcha:** the activation's restart of `deploy-bridge.service` was canceled by the
concurrent tailscale-network restart (why we run switch detached), so the new generation was active
but the reconcile oneshot still held the OLD ExecStart; a `systemctl daemon-reload && systemctl
restart deploy-bridge` reconciled the swarm service. A clean re-switch on a stable network would do
this itself (it is declarative). Live bridge POLL_REPOS now includes recipe-maintainers/hedgedoc;
poller log: `watching [... 'recipe-maintainers/hedgedoc'] every 30s`.

Posted `!testme` (comment 13750, autonomic-bot — org member ⇒ authorized) on hedgedoc PR #1 at
01:10:16Z. Bridge poller log: `[poll] triggered build 153 for hedgedoc@441c411c (PR #1, comment
13750) by autonomic-bot` — trigger latency <60s (DG1 path re-exercised). Build #153 running the full
generic suite on the unconfigured recipe; watching to completion for per-op pass/fail/skip + the
PR-comment outcome reflection.

**DG6 GREEN — build #153 success (full e2e on the unconfigured recipe).** Evidence:
- **Pipeline params** (Drone API): `RECIPE=hedgedoc REF=441c411c88… PR=1 SRC=recipe-maintainers/hedgedoc`
— REF is the PR head, so the run tested the code at the PR's head commit (D1/DG6 path).
- **All four tiers resolved to the GENERIC suite** (hedgedoc has no cc-ci/repo-local overlays):
`TIER install (generic: tests/_generic/test_install.py)` … upgrade/backup/restore likewise — proving
the "no overlay ⇒ generic runs" invariant through the REAL pipeline, not just locally.
- **Per-op report** (RUN SUMMARY, in the Drone step log):
  ```
  deploy-count = 1 (expect 1)
  install : pass upgrade : pass backup : pass restore : pass custom : skip
  ```
  install 0.59s / upgrade 1.76s (assertion only; the abra-upgrade OP + image pull run in the
  orchestrator before it) / backup 8.12s / restore 50.59s — real work, not vacuous.
- **Deploy-once:** deploy-count = 1 across install→upgrade→backup→restore (DG4.1 re-confirmed e2e).
- **Teardown (DG7 'every run undeploys'):** post-run on cc-ci — `docker service ls | grep hedgedoc` →
none; `docker volume ls | grep hedgedoc` → none; `docker secret ls | grep hedgedoc` → none; no
`~/.abra` hedgedoc app dir. Clean, nothing leaked.
- **Outcome reflected to the PR** (bridge): comment on hedgedoc PR #1 —
`cc-ci: run for hedgedoc @ 441c411c ✅ passed → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/153`.

So DG6 holds: `!testme` on an unconfigured recipe → bridge → Drone → deploy → generic assert →
undeploy → per-op report + PR outcome. DG7 (no-regression migration + DRY + teardown-always) and DG8
(docs) committed. **Claiming G4** (DG6+DG7+DG8) — requesting Adversary cold-verify of DG1–DG8 → DONE.