cc-ci/machine-docs/JOURNAL-cfold.md
autonomic-bot abe5e33dde
claim(cfold): claim M2 full sweep green
2026-06-13 04:04:14 +00:00


JOURNAL — phase cfold

2026-06-11 — Phase cfold start

Investigation findings

Pre-existing test layout:

  • 60 files in functional/ subdirs across 20 recipes
  • 4 files in playwright/ subdirs (cryptpad, custom-html, uptime-kuma)
  • Helper modules to move: _discourse.py, _ghost.py, _mailu.py, _mm.py, _mumble_proto.py, drone/functional/__init__.py
  • mailu/test_backup.py, test_restore.py, ops.py explicitly add functional/ to sys.path — need updating to custom/

Decision: deprecated aliases

Per plan §2 option (RECOMMENDED): keep recognizing functional/ and playwright/ as deprecated aliases, and emit a loud one-line warning whenever a test is found in a deprecated folder. The warning could be raised with warnings.warn() when discovery is imported, or written directly with print(). Will use print() to stderr so it shows up in CI logs without needing to configure warning filters.

Implementation: subdirs = ("custom", "functional", "playwright") — canonical first — and after finding a test in functional/ or playwright/, emit: print(f"WARNING [cfold]: test found in deprecated folder '{sub}/' — move to custom/: {path}", flush=True, file=sys.stderr)

This way:

  • custom/ is canonical and gets discovered first
  • Old folders still work (zero breakage for repo-local tests) but emit a loud warning
  • No silent coverage loss possible
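
A rough sketch of that alias policy, for illustration only. It is not the actual contents of runner/harness/discovery.py: the function name discover_custom_tests, its arguments, and the test_*.py glob are assumptions. Only the subdirs order and the warning text come from the notes above.

```python
import sys
from pathlib import Path

# Canonical folder first, then the deprecated aliases (order taken from the journal).
SUBDIRS = ("custom", "functional", "playwright")
DEPRECATED = {"functional", "playwright"}


def discover_custom_tests(tests_root: Path, recipe: str) -> list[Path]:
    """Hypothetical helper: collect one recipe's test files, warning on deprecated folders."""
    found: list[Path] = []
    for sub in SUBDIRS:
        folder = tests_root / recipe / sub
        if not folder.is_dir():
            continue
        for path in sorted(folder.glob("test_*.py")):
            if sub in DEPRECATED:
                # "\u2014" is the dash character used in the journal's warning text.
                print(
                    f"WARNING [cfold]: test found in deprecated folder '{sub}/' "
                    f"\u2014 move to custom/: {path}",
                    flush=True,
                    file=sys.stderr,
                )
            found.append(path)
    return found
```

Because deprecated folders are still walked, a test left behind in functional/ is returned (and warned about) rather than silently dropped.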

2026-06-12 — M1 checkpoint: canonical custom/ layout landed locally

Code/work completed:

  • runner/harness/discovery.py: canonical custom/ discovery, deprecated alias warnings, and custom_subdir_label() normalization helper.
  • runner/harness/manifest.py: custom-test counts now normalize to canonical custom.
  • all cc-ci custom tests/helper modules moved from tests/<recipe>/{functional,playwright}/ into tests/<recipe>/custom/.
  • helper-import fallout fixed where needed (tests/mailu/{ops.py,test_backup.py,test_restore.py}).
  • docs updated to describe custom/ as the canonical layout and explain the alias-compatibility window.

Mechanical move summary:

  • 64 custom test files relocated into custom/
  • helper modules relocated too: _discourse.py, _ghost.py, _mailu.py, _mm.py, _mumble_proto.py, tests/drone/custom/__init__.py

Verification:

nix shell nixpkgs#python312Packages.pytest --command pytest \
  tests/unit/test_discovery.py tests/unit/test_discovery_phase2.py tests/unit/test_manifest.py -q
# ..................
# 18 passed in 0.09s

Post-move grep state:

  • remaining functional/ / playwright/ matches in live code are intentional: alias-policy docs, deprecated-folder assertions in the unit tests, and discovery comments describing the alias behavior.
  • the pre-migration inventory in BACKLOG-cfold.md is intentionally unchanged because it is the M1 baseline record the Adversary will compare against.

2026-06-12 — M1 coverage proof assembled

Verification commands + observed outputs:

$ git ls-files "tests/*/custom/test_*.py" | wc -l
64

$ git ls-files "tests/*/functional/*" "tests/*/playwright/*"
# no output

$ for recipe in bluesky-pds cryptpad custom-html custom-html-tiny discourse drone ghost hedgedoc immich keycloak lasuite-docs lasuite-drive lasuite-meet mailu matrix-synapse mattermost-lts mumble n8n plausible uptime-kuma; do count=$(git ls-files "tests/$recipe/custom/test_*.py" | wc -l); printf "%s %s\n" "$recipe" "$count"; done
bluesky-pds 4
cryptpad 4
custom-html 4
custom-html-tiny 1
discourse 3
drone 1
ghost 4
hedgedoc 2
immich 3
keycloak 3
lasuite-docs 5
lasuite-drive 3
lasuite-meet 3
mailu 3
matrix-synapse 3
mattermost-lts 3
mumble 5
n8n 4
plausible 2
uptime-kuma 4

$ nix shell nixpkgs#python311Packages.pytest -c pytest tests/unit/test_discovery.py tests/unit/test_discovery_phase2.py tests/unit/test_manifest.py -q
..................
18 passed in 0.14s

Conclusion: the migrated tree still contains the exact same 64 custom test files with the same per-recipe cardinality as the pre-cfold baseline in BACKLOG-cfold.md; only the folder paths changed.
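
As a quick arithmetic cross-check of the per-recipe listing, the counts can be summed in a few lines. The numbers are copied from the loop output above; the script itself is only an illustration and was not part of the recorded verification.

```python
# Per-recipe custom test counts, copied from the `git ls-files` loop output above.
counts = {
    "bluesky-pds": 4, "cryptpad": 4, "custom-html": 4, "custom-html-tiny": 1,
    "discourse": 3, "drone": 1, "ghost": 4, "hedgedoc": 2, "immich": 3,
    "keycloak": 3, "lasuite-docs": 5, "lasuite-drive": 3, "lasuite-meet": 3,
    "mailu": 3, "matrix-synapse": 3, "mattermost-lts": 3, "mumble": 5,
    "n8n": 4, "plausible": 2, "uptime-kuma": 4,
}

# Number of recipes and total number of custom test files.
print(len(counts), sum(counts.values()))  # 20 64
```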

2026-06-12 — Adversary M1 PASS received

Pulled review(cfold): M1 PASS cold verification (4b4d665). Confirmed in REVIEW-cfold.md:

  • total canonical custom tests = 64
  • old tracked functional/ / playwright/ trees = none
  • per-recipe counts match the baseline exactly
  • focused unit suite = 18 passed
  • deprecated-alias warning probe works
  • normalized (recipe, filename) before/after set = exact match (missing [], extra [])

No fix-forward required. Phase advances to M2 baseline assembly.
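
The normalized (recipe, filename) comparison from the review can be sketched as below. This is an assumption about how such a check might be written, not the Adversary's actual script; normalize and coverage_diff are hypothetical names, and the sample paths are made up.

```python
from pathlib import PurePosixPath

# Folder names that count as "the custom tier", before or after the move.
TIER_SUBDIRS = {"custom", "functional", "playwright"}


def normalize(path: str) -> tuple[str, str]:
    """Map tests/<recipe>/<subdir>/<file> to (recipe, file), ignoring the subdir."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "tests" or parts[2] not in TIER_SUBDIRS:
        raise ValueError(f"unexpected test path: {path}")
    return parts[1], parts[3]


def coverage_diff(before: list[str], after: list[str]) -> tuple[list, list]:
    """Return (missing, extra) as sorted lists of (recipe, filename) pairs."""
    old = {normalize(p) for p in before}
    new = {normalize(p) for p in after}
    return sorted(old - new), sorted(new - old)


# Hypothetical file names, only to show the shape of the check.
before = ["tests/ghost/functional/test_one.py", "tests/cryptpad/playwright/test_two.py"]
after = ["tests/ghost/custom/test_one.py", "tests/cryptpad/custom/test_two.py"]
print(coverage_diff(before, after))  # ([], [])
```

An empty missing list and an empty extra list together mean the move changed folder paths only.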

2026-06-12 — M2 sweep snapshot: 19 fresh greens, Ghost upgrade regression remains

Bootstrap/access re-checks before the live sweep:

$ ssh cc-ci "hostname && whoami && nixos-version"
nixos
root
24.11.20250630.50ab793 (Vicuna)

$ set -a; . /srv/cc-ci/.testenv; set +a; curl -fsS "https://$GITEA_URL/api/v1/version"
{"version":"1.24.2"}

$ getent hosts "probe-$RANDOM.ci.commoninternet.net"
91.98.47.73     probe-4360.ci.commoninternet.net

Open-PR inventory before triggering uncovered recipes showed 16 enrolled repos already had live PRs; custom-html, keycloak, cryptpad, and mumble did not. I reopened reusable closed PRs for the first three (custom-html#2, keycloak#3, cryptpad#5) and created a minimal sweep-only mumble#1 probe PR via the Gitea API.

Fresh post-cfold success set gathered from the live server (/var/lib/cc-ci-runs/<build>/results.json):

506  drone            L5
510  custom-html-tiny L5
521  discourse        L5
522  immich           L5
523  lasuite-docs     L5
524  lasuite-drive    L5
525  lasuite-meet     L5
526  mailu            L5
527  matrix-synapse   L5
528  n8n              L5
529  mattermost-lts   L5
530  plausible        L5
531  uptime-kuma      L5
541  custom-html      L5
553  keycloak         L5
554  cryptpad         L5
555  hedgedoc         L5
556  bluesky-pds      L5
558  mumble           L5

Ghost is the lone non-green outlier:

557  ghost PR#4 @ d88f5801  -> L1 (install pass, upgrade fail, backup/restore/custom pass)
559  ghost PR#5 @ d42d0f7c  -> L1 (same failure shape on last known-green Ghost head)
185  ghost PR#4 @ d42d0f7c  -> L4 / pre-lint-era green baseline on 2026-06-05

The critical Ghost comparison is the same ref d42d0f7c:

  • historical build 185 (2026-06-05): upgrade passed at d42d0f7c
  • fresh probe build 559 (2026-06-12): same d42d0f7c now fails upgrade with swarm UpdateStatus='paused'

That isolates the regression away from cfold itself. In both fresh Ghost failures (557, 559), the custom tier still discovered and passed all four tests/ghost/custom/test_*.py files, while the upgrade op failed before upgrade assertions could run:

!! upgrade op failed: <ghost-domain>: upgrade redeploy did NOT converge to the head spec — swarm UpdateStatus='paused'.
The recipe's app service uses update_config failure_action=rollback/pause; the NEW (head) task failed swarm's update monitor,
so the service reverted/paused and the RUNNING spec is the previous version, not the code under test.

Adversary update pulled during this pass:

  • review(cfold) commit 93f56ae added only an idle audit entry to REVIEW-cfold.md
  • no finding filed
  • no M2 PASS yet because no claim(cfold): M2 ... commit exists

2026-06-12 — Follow-up Ghost artifact audit (same-ref historical pass vs fresh fail)

Focused cold checks after the M2 sweep snapshot:

$ ssh cc-ci "jq '{level,recipe,ref,results,rungs,stages:(.stages|map({name,status}))}' /var/lib/cc-ci-runs/185/results.json"
{
  "level": 4,
  "recipe": "ghost",
  "ref": "d42d0f7c7cf9",
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "pass"
  },
  "rungs": {
    "backup_restore": "pass",
    "functional": "pass",
    "install": "pass",
    "integration": "na",
    "recipe_local": "na",
    "upgrade": "pass"
  },
  "stages": [
    {"name": "install", "status": "pass"},
    {"name": "upgrade", "status": "pass"},
    {"name": "backup", "status": "pass"},
    {"name": "restore", "status": "pass"},
    {"name": "custom", "status": "pass"}
  ]
}

$ ssh cc-ci "jq '{level,recipe,stages:(.stages|map({name,status,summary}))}' /var/lib/cc-ci-runs/559/results.json"
{
  "level": 1,
  "recipe": "ghost",
  "stages": [
    {"name": "install", "status": "pass", "summary": null},
    {"name": "backup", "status": "pass", "summary": null},
    {"name": "restore", "status": "pass", "summary": null},
    {"name": "custom", "status": "pass", "summary": null},
    {"name": "lint", "status": "pass", "summary": null}
  ]
}

$ ssh cc-ci "grep -R -n \"start_period\" /var/lib/cc-ci-runs/559/abra/recipes/ghost"
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.yml:60:      start_period: 15m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.yml:84:      start_period: 1m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.ccci.yml:35:      start_period: 15m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.ccci.yml:38:      start_period: 15m

Conclusion:

  • Historical build 185 passed the full Ghost lifecycle on the SAME ref now used in probe build 559 (d42d0f7c7cf9), so the current M2 blocker is not tied to the custom/ folder migration.
  • Fresh failing runs still execute the canonical 4-file tests/ghost/custom/ suite and pass every non-upgrade stage; the missing upgrade junit output remains the key symptom.
  • The current repo does not show an obvious cfold-local fix to apply: the Ghost-specific overlay is unchanged, the recipe artifact still carries the expected compose.ccci.yml file, and the failure remains in the live upgrade path rather than discovery/custom-test coverage.
  • Net: cfold remains blocked on a cfold-neutral Ghost upgrade regression / flake. No repo-local code change was justified by that audit alone.

2026-06-13 — Ghost PR #3 fresh probe after reopen: same upgrade-only failure, plus duplicate trigger signal

I looked for the smallest allowed M2 step that did not touch recipe code: reuse an existing Ghost PR head that had historically gone green and rerun it through the live !testme path.

Actions taken:

$ set -a && . /srv/cc-ci/.testenv && set +a
$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" -X PATCH \
    -H 'Content-Type: application/json' \
    -d '{"state":"open"}' \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/pulls/3"
# PR #3 reopened; head remains 720faa0bebc46a34857b2933df1924ccabbd4087

$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" -X POST \
    -H 'Content-Type: application/json' \
    -d '{"body":"!testme"}' \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/issues/3/comments"
# comment 14497 created at 2026-06-13T00:07:50Z

Fresh live outcomes:

$ ssh cc-ci 'jq "{run_id, pr, recipe, ref, level, results, stages: (.stages | map({name,status,summary}))}" /var/lib/cc-ci-runs/568/results.json'
{
  "run_id": "568",
  "pr": "3",
  "recipe": "ghost",
  "ref": "720faa0bebc4",
  "level": 1,
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "fail"
  },
  "stages": [
    {"name": "install", "status": "pass", "summary": null},
    {"name": "backup", "status": "pass", "summary": null},
    {"name": "restore", "status": "pass", "summary": null},
    {"name": "custom", "status": "pass", "summary": null},
    {"name": "lint", "status": "pass", "summary": null}
  ]
}

$ ssh cc-ci 'jq "{run_id, pr, recipe, ref, level, finished, results, stages: (.stages | map({name,status}))}" /var/lib/cc-ci-runs/569/results.json'
{
  "run_id": "569",
  "pr": "3",
  "recipe": "ghost",
  "ref": "720faa0bebc4",
  "level": 1,
  "finished": 1781309502.5494862,
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "fail"
  },
  "stages": [
    {"name": "install", "status": "pass"},
    {"name": "backup", "status": "pass"},
    {"name": "restore", "status": "pass"},
    {"name": "custom", "status": "pass"},
    {"name": "lint", "status": "pass"}
  ]
}

Comment-stream evidence for duplicate triggers from one !testme:

$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/issues/3/comments?limit=20"
# ...
# 14497: !testme (2026-06-13T00:07:50Z)
# 14498: cc-ci failure comment for run 568 (2026-06-13T00:08:05Z)
# 14499: cc-ci in-progress comment for run 569 (2026-06-13T00:08:05Z)
# 14500: cc-ci in-progress comment for run 570 (2026-06-13T00:08:05Z)

Takeaways:

  • Ghost is now freshly red post-cfold on three distinct PR heads (720faa0b, d88f5801, d42d0f7c), all with the same upgrade-only failure shape while custom discovery stays green.
  • That further weakens any cfold-local explanation; the blocker remains in Ghost's live upgrade path.
  • There is also likely a separate trigger dedupe problem: one !testme comment spawned runs 568, 569, and 570. I did not broaden into a D1 investigation in this loop step because cfold M2 is already hard-blocked by Ghost's repeated upgrade failures, but the evidence is now recorded.
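
The "upgrade-only failure shape" can be read off a results.json mechanically. The sketch below uses the results and stages values recorded for run 568 above; the helper names failed_tiers and tiers_without_stage are hypothetical and not part of cc-ci.

```python
def failed_tiers(results: dict[str, str]) -> list[str]:
    """Tiers whose recorded result is anything other than 'pass'."""
    return sorted(tier for tier, status in results.items() if status != "pass")


def tiers_without_stage(results: dict[str, str], stages: list[dict]) -> list[str]:
    """Tiers that have a result but no matching entry in the stages list."""
    staged = {stage["name"] for stage in stages}
    return sorted(tier for tier in results if tier not in staged)


# Values copied from /var/lib/cc-ci-runs/568/results.json as shown above.
run_568 = {
    "results": {
        "backup": "pass", "custom": "pass", "install": "pass",
        "restore": "pass", "upgrade": "fail",
    },
    "stages": [
        {"name": "install", "status": "pass"},
        {"name": "backup", "status": "pass"},
        {"name": "restore", "status": "pass"},
        {"name": "custom", "status": "pass"},
        {"name": "lint", "status": "pass"},
    ],
}

print(failed_tiers(run_568["results"]))                           # ['upgrade']
print(tiers_without_stage(run_568["results"], run_568["stages"]))  # ['upgrade']
```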

2026-06-13 — Root-caused Ghost triple-trigger replay; bridge fix authored with unit coverage

Pulled the Adversary's latest cfold audit (review(cfold) ddefc96). It was not an M2 verdict or a finding; it confirmed the sweep is still unclaimable while teardown remains clean (live_pr_apps=0).

I then closed out the duplicate-run side observation from the Ghost PR #3 retrigger.

Evidence:

$ ssh cc-ci 'docker logs --since "2026-06-13T00:07:30" --until "2026-06-13T00:08:30" c54c433972ac 2>&1'
[poll] triggered build 568 for ghost@720faa0b (PR #3, comment 14029) by autonomic-bot
[poll] triggered build 569 for ghost@720faa0b (PR #3, comment 14032) by autonomic-bot
[poll] triggered build 570 for ghost@720faa0b (PR #3, comment 14497) by autonomic-bot

$ ssh cc-ci 'docker service ps ccci-bridge_app --no-trunc'
# single running replica only; no restart near the incident

$ ssh cc-ci 'docker ps --format "{{.ID}} {{.Names}} {{.Status}}" | grep ccci-bridge || true'
c54c433972ac ccci-bridge_app.1.u5msezm603izeyf7kizqxq97j Up 22 hours

Conclusion: this was NOT one comment id deduped incorrectly inside a single process. It was the poller correctly treating THREE distinct comment ids as unseen after PR #3 was reopened:

  • 14029 and 14032 were historical !testme comments from when PR #3 had been open earlier.
  • PR #3 was closed when the current bridge process started, so those comments were not covered by the startup pass that marks pre-existing comments seen.
  • When PR #3 was reopened, the poller saw those old comments for the first time and replayed them, then also processed the fresh comment 14497.

Repo fix authored:

  • bridge/bridge.py: added _PROCESS_STARTED_AT and _is_preexisting_comment() so the poller now marks any trigger comment older than the current bridge process as already-seen, even if the PR was closed at startup and only becomes visible later via reopen.
  • tests/unit/test_bridge_trigger.py: added focused tests for pre-start vs post-start comment handling.

Verification:

$ nix shell nixpkgs#python311Packages.pytest -c pytest tests/unit/test_bridge_trigger.py -q
..........                                                               [100%]
10 passed in 0.04s

$ ssh cc-ci 'nixos-rebuild switch --flake "git+file:///root/cfold-deploy?submodules=1#cc-ci"'
# rebuild succeeded; deploy-bridge.service restarted and rolled the bridge task

$ ssh cc-ci 'docker service inspect ccci-bridge_app --format "{{.Spec.TaskTemplate.ContainerSpec.Image}}"'
cc-ci-bridge:eb32876581d9

$ ssh cc-ci 'curl -fsS https://ci.commoninternet.net/hook/healthz'
ok

$ ssh cc-ci 'docker logs --since 5m 2088e44a0534 2>&1 | sed -n "1,80p"'
poller (primary) watching ['recipe-maintainers/cc-ci', ..., 'recipe-maintainers/drone'] every 30s
comment-bridge listening on 0.0.0.0:8080 (poll primary + optional webhook)

This fix addresses the replay hole exposed during cfold's Ghost retrigger. It does not change the cfold bottom line: Ghost's upgrade tier remains the lone M2 blocker, while custom discovery continues to pass.

2026-06-13 — Ghost upgrade blocker fixed in cc-ci; same-ref real CI rerun now green

I stayed on the Ghost blocker until I had a same-ref real-!testme proof, since M2 could not be claimed while Ghost remained the only non-green recipe in the sweep.

Focused investigation sequence:

  • Repros against the preserved current code reproduced the old failure mode faithfully: during the base->head crossover, the new Ghost app task could start before the replacement mysql service was usable, exiting on ENOTFOUND / ECONNREFUSED against ${STACK_NAME}_db, which made swarm pause the update before the head spec settled.
  • My first attempt (restart_policy.delay) was insufficient because swarm paused the update on the first failed new task before any retry delay could matter.
  • My second attempt (wrapping Ghost in command: sh -ec ...) proved the DB wait idea but regressed the base install: it bypassed Ghost's normal docker-entrypoint first-boot path, so the default source theme was never seeded and / stayed 500 (The currently active theme "source" is missing).
  • Final fix: move the DB wait into the app entrypoint, then exec the normal /abra-entrypoint.sh node current/index.js path. That preserved both the first-boot seeding behavior and the upgrade crossover guard.

The finished overlay in tests/ghost/compose.ccci.yml now does three things and nothing more:

  1. keep the existing 15m app healthcheck grace,
  2. keep the existing 15m db healthcheck grace,
  3. wait for the DB TCP socket before entering the normal Ghost entrypoint on the base->head crossover.
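
The wait in step 3 amounts to polling a TCP port until it accepts a connection. The Python sketch below only illustrates that logic. It is not the overlay's entrypoint code, and wait_for_tcp together with its timeout and interval parameters are hypothetical.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Name-resolution failures and refused connections both raise OSError subclasses.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Retrying on any OSError covers both symptoms named above: the DB hostname not resolving yet and the port refusing connections. Only once the wait succeeds would control pass to the normal entrypoint, which is why the first-boot seeding path stays intact.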

Verification:

$ ssh cc-ci 'jq -r ".results, .stages" /var/lib/cc-ci-runs/ghost-repro-cfold-3/results.json'
{
  "install": "pass",
  "upgrade": "pass"
}
[
  {"name":"install","status":"pass",...},
  {"name":"upgrade","status":"pass",...},
  {"name":"lint","status":"pass",...}
]

$ ssh cc-ci 'tok=$(cat /run/secrets/bridge_drone_token); curl -fsS -H "Authorization: Bearer $tok" https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/585 | jq -r "[.number,.status,.after,.params.RECIPE,.params.PR,.params.REF] | @tsv"'
585	success	d44f799de945d0775933aad58726d46509154a64	ghost	5	d42d0f7c7cf9946077a583ffa3f7c96abfe94a77

$ ssh cc-ci 'jq -r "{level,recipe,ref,results,stages:(.stages|map({name,status}))}" /var/lib/cc-ci-runs/585/results.json'
{
  "level": 5,
  "recipe": "ghost",
  "ref": "d42d0f7c7cf9",
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "pass"
  },
  "stages": [
    {"name":"install","status":"pass"},
    {"name":"upgrade","status":"pass"},
    {"name":"backup","status":"pass"},
    {"name":"restore","status":"pass"},
    {"name":"custom","status":"pass"},
    {"name":"lint","status":"pass"}
  ]
}

$ ssh cc-ci 'printf "ghost custom junit="; ls /var/lib/cc-ci-runs/585/junit/custom__cc-ci__*.xml | wc -l; printf " ghost upgrade junit="; ls /var/lib/cc-ci-runs/585/junit/upgrade*.xml | wc -l'
ghost custom junit=4
 ghost upgrade junit=2

$ ssh cc-ci 'printf "live_pr_apps="; docker stack ls --format "{{.Name}}" | grep -c -- "-pr" || true'
live_pr_apps=0

Outcome:

  • Ghost is no longer the M2 blocker.
  • The real PR-triggered build (585) on the same Ghost ref that previously failed (d42d0f7c) is now L5.
  • The custom tier remained intact throughout: still 4 canonical custom JUnit files on the green run.
  • With Ghost green and teardown clean, the cfold phase is ready for a formal M2 claim.