autonomic-bot 07cce4ed17

status(cfold): record live bridge rollout

2026-06-13 00:31:19 +00:00

JOURNAL — phase cfold

2026-06-11 — Phase cfold start

Investigation findings

Pre-existing test layout:

60 files in functional/ subdirs across 20 recipes
4 files in playwright/ subdirs (cryptpad, custom-html, uptime-kuma)
Helper modules to move: _discourse.py, _ghost.py, _mailu.py, _mm.py, _mumble_proto.py, drone/functional/__init__.py
mailu/test_backup.py, mailu/test_restore.py, and mailu/ops.py explicitly add functional/ to sys.path; all three need updating to custom/

Decision: deprecated aliases

Per plan §2 option (RECOMMENDED): keep recognizing functional/ and playwright/ as deprecated aliases, AND emit a loud one-line warning whenever a test is found in a deprecated folder. The two candidate mechanisms were warnings.warn() at discovery import time and a direct print(). Will use print() to stderr so the warning shows up in CI logs without needing to configure warning filters.

Implementation: subdirs = ("custom", "functional", "playwright") — canonical first — and after finding a test in functional/ or playwright/, emit: print(f"WARNING [cfold]: test found in deprecated folder '{sub}/' — move to custom/: {path}", flush=True, file=sys.stderr)

This way:

custom/ is canonical and gets discovered first
Old folders still work (zero breakage for repo-local tests) but emit a loud warning
No silent coverage loss possible
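
The alias policy above can be sketched roughly as follows. This is a minimal illustration, not the actual runner/harness/discovery.py: the function name discover_custom_tests is hypothetical, and the warning text is abbreviated from the wording given above. Only the folder order and the print-to-stderr choice come from this journal.

```python
import sys
from pathlib import Path

# Canonical folder first, then the deprecated aliases (order taken from the journal).
SUBDIRS = ("custom", "functional", "playwright")
DEPRECATED = {"functional", "playwright"}


def discover_custom_tests(recipe_dir: Path) -> list[Path]:
    """Return test_*.py files found under a recipe's custom-test folders.

    Hypothetical helper: the real discovery code may be shaped differently.
    """
    found: list[Path] = []
    for sub in SUBDIRS:
        folder = recipe_dir / sub
        if not folder.is_dir():
            continue
        for path in sorted(folder.glob("test_*.py")):
            if sub in DEPRECATED:
                # One loud line per file on stderr, so it lands in CI logs
                # without any warning-filter configuration.
                print(
                    f"WARNING [cfold]: test found in deprecated folder '{sub}/', "
                    f"move to custom/: {path}",
                    flush=True,
                    file=sys.stderr,
                )
            # Deprecated folders still contribute tests, so coverage is never lost.
            found.append(path)
    return found
```

Because deprecated folders are still collected rather than skipped, a repo-local test left in functional/ keeps running and only gains a warning line.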

2026-06-12 — M1 checkpoint: canonical custom/ layout landed locally

Code/work completed:

runner/harness/discovery.py: canonical custom/ discovery, deprecated alias warnings, and custom_subdir_label() normalization helper.
runner/harness/manifest.py: custom-test counts now normalize to canonical custom.
all cc-ci custom tests/helper modules moved from tests/<recipe>/{functional,playwright}/ into tests/<recipe>/custom/.
helper-import fallout fixed where needed (tests/mailu/{ops.py,test_backup.py,test_restore.py}).
docs updated to describe custom/ as the canonical layout and explain the alias-compatibility window.
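
For orientation, a normalization helper of the kind described for custom_subdir_label() might look like the sketch below. The function name and the three folder names come from this journal; the body, the constants, and the ValueError behavior are assumptions made for illustration.

```python
# Assumption: the folder names are from the journal, the logic is a guess.
CANONICAL = "custom"
DEPRECATED_ALIASES = ("functional", "playwright")


def custom_subdir_label(subdir: str) -> str:
    """Map a discovered folder name to the canonical label used in counts."""
    name = subdir.strip("/")
    if name == CANONICAL or name in DEPRECATED_ALIASES:
        # Deprecated aliases are reported under the canonical label,
        # so manifest counts do not split across old and new folders.
        return CANONICAL
    raise ValueError(f"not a custom-test folder: {subdir!r}")


print(custom_subdir_label("playwright/"))  # custom
```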

Mechanical move summary:

64 custom test files relocated into custom/
helper modules relocated too: _discourse.py, _ghost.py, _mailu.py, _mm.py, _mumble_proto.py, tests/drone/custom/__init__.py

Verification:

nix shell nixpkgs#python312Packages.pytest --command pytest \
  tests/unit/test_discovery.py tests/unit/test_discovery_phase2.py tests/unit/test_manifest.py -q
# ..................
# 18 passed in 0.09s

Post-move grep state:

remaining functional/ / playwright/ matches in live code are intentional: alias-policy docs, deprecated-folder assertions in the unit tests, and discovery comments describing the alias behavior.
the pre-migration inventory in BACKLOG-cfold.md is intentionally unchanged because it is the M1 baseline record the Adversary will compare against.

2026-06-12 — M1 coverage proof assembled

Verification commands + observed outputs:

$ git ls-files "tests/*/custom/test_*.py" | wc -l
64

$ git ls-files "tests/*/functional/*" "tests/*/playwright/*"
# no output

$ for recipe in bluesky-pds cryptpad custom-html custom-html-tiny discourse drone ghost hedgedoc immich keycloak lasuite-docs lasuite-drive lasuite-meet mailu matrix-synapse mattermost-lts mumble n8n plausible uptime-kuma; do count=$(git ls-files "tests/$recipe/custom/test_*.py" | wc -l); printf "%s %s\n" "$recipe" "$count"; done
bluesky-pds 4
cryptpad 4
custom-html 4
custom-html-tiny 1
discourse 3
drone 1
ghost 4
hedgedoc 2
immich 3
keycloak 3
lasuite-docs 5
lasuite-drive 3
lasuite-meet 3
mailu 3
matrix-synapse 3
mattermost-lts 3
mumble 5
n8n 4
plausible 2
uptime-kuma 4

$ nix shell nixpkgs#python311Packages.pytest -c pytest tests/unit/test_discovery.py tests/unit/test_discovery_phase2.py tests/unit/test_manifest.py -q
..................
18 passed in 0.14s

Conclusion: the migrated tree still contains the exact same 64 custom test files with the same per-recipe cardinality as the pre-cfold baseline in BACKLOG-cfold.md; only the folder paths changed.

2026-06-12 — Adversary M1 PASS received

Pulled review(cfold): M1 PASS cold verification (4b4d665). Confirmed in REVIEW-cfold.md:

total canonical custom tests = 64
old tracked functional/ / playwright/ trees = none
per-recipe counts match the baseline exactly
focused unit suite = 18 passed
deprecated-alias warning probe works
normalized (recipe, filename) before/after set = exact match (missing [], extra [])

No fix-forward required. Phase advances to M2 baseline assembly.
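
The normalized (recipe, filename) before/after comparison can be approximated as below. This is a sketch under assumptions: the helper names normalize and compare are hypothetical, the sample paths are illustrative rather than the real inventory, and the Adversary's actual probe is not shown in this journal.

```python
from pathlib import PurePosixPath

TEST_FOLDERS = {"custom", "functional", "playwright"}


def normalize(paths):
    """Reduce tests/<recipe>/<folder>/<file> paths to a set of (recipe, filename)."""
    pairs = set()
    for raw in paths:
        parts = PurePosixPath(raw).parts
        # Expect exactly: ("tests", recipe, folder, filename)
        if len(parts) == 4 and parts[0] == "tests" and parts[2] in TEST_FOLDERS:
            pairs.add((parts[1], parts[3]))
    return pairs


def compare(before, after):
    """Return (missing, extra) as sorted lists; both empty means an exact match."""
    old, new = normalize(before), normalize(after)
    return sorted(old - new), sorted(new - old)


# Illustrative file names only, not the real inventory.
before = ["tests/ghost/functional/test_a.py", "tests/cryptpad/playwright/test_b.py"]
after = ["tests/ghost/custom/test_a.py", "tests/cryptpad/custom/test_b.py"]
missing, extra = compare(before, after)
print("missing", missing, "extra", extra)  # missing [] extra []
```

One limitation of this normalization: if a recipe had the same filename in both functional/ and playwright/, the two would collapse into one pair, so a per-recipe count check is still needed alongside it.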

2026-06-12 — M2 sweep snapshot: 19 fresh greens, Ghost upgrade regression remains

Bootstrap/access re-checks before the live sweep:

$ ssh cc-ci "hostname && whoami && nixos-version"
nixos
root
24.11.20250630.50ab793 (Vicuna)

$ set -a; . /srv/cc-ci/.testenv; set +a; curl -fsS "https://$GITEA_URL/api/v1/version"
{"version":"1.24.2"}

$ getent hosts "probe-$RANDOM.ci.commoninternet.net"
91.98.47.73     probe-4360.ci.commoninternet.net

Open-PR inventory before triggering uncovered recipes showed 16 enrolled repos already had live PRs; custom-html, keycloak, cryptpad, and mumble did not. I reopened reusable closed PRs for the first three (custom-html#2, keycloak#3, cryptpad#5) and created a minimal sweep-only mumble#1 probe PR via the Gitea API.

Fresh post-cfold success set gathered from the live server (/var/lib/cc-ci-runs/<build>/results.json):

506  drone            L5
510  custom-html-tiny L5
521  discourse        L5
522  immich           L5
523  lasuite-docs     L5
524  lasuite-drive    L5
525  lasuite-meet     L5
526  mailu            L5
527  matrix-synapse   L5
528  n8n              L5
529  mattermost-lts   L5
530  plausible        L5
531  uptime-kuma      L5
541  custom-html      L5
553  keycloak         L5
554  cryptpad         L5
555  hedgedoc         L5
556  bluesky-pds      L5
558  mumble           L5

Ghost is the lone non-green outlier:

557  ghost PR#4 @ d88f5801  -> L1 (install pass, upgrade fail, backup/restore/custom pass)
559  ghost PR#5 @ d42d0f7c  -> L1 (same failure shape on last known-green Ghost head)
185  ghost PR#4 @ d42d0f7c  -> L4 / pre-lint-era green baseline on 2026-06-05

The critical Ghost comparison is the same ref d42d0f7c:

historical build 185 (2026-06-05): upgrade passed at d42d0f7c
fresh probe build 559 (2026-06-12): same d42d0f7c now fails upgrade with swarm UpdateStatus='paused'

That isolates the regression away from cfold itself. In both fresh Ghost failures (557, 559), the custom tier still discovered and passed all four tests/ghost/custom/test_*.py files, while the upgrade op failed before upgrade assertions could run:

!! upgrade op failed: <ghost-domain>: upgrade redeploy did NOT converge to the head spec — swarm UpdateStatus='paused'.
The recipe's app service uses update_config failure_action=rollback/pause; the NEW (head) task failed swarm's update monitor,
so the service reverted/paused and the RUNNING spec is the previous version, not the code under test.

Adversary update pulled during this pass:

review(cfold) commit 93f56ae added only an idle audit entry to REVIEW-cfold.md
no finding filed
no M2 PASS yet because no claim(cfold): M2 ... commit exists

2026-06-12 — Follow-up Ghost artifact audit (same-ref historical pass vs fresh fail)

Focused cold checks after the M2 sweep snapshot:

$ ssh cc-ci "jq '{level,recipe,ref,results,rungs,stages:(.stages|map({name,status}))}' /var/lib/cc-ci-runs/185/results.json"
{
  "level": 4,
  "recipe": "ghost",
  "ref": "d42d0f7c7cf9",
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "pass"
  },
  "rungs": {
    "backup_restore": "pass",
    "functional": "pass",
    "install": "pass",
    "integration": "na",
    "recipe_local": "na",
    "upgrade": "pass"
  },
  "stages": [
    {"name": "install", "status": "pass"},
    {"name": "upgrade", "status": "pass"},
    {"name": "backup", "status": "pass"},
    {"name": "restore", "status": "pass"},
    {"name": "custom", "status": "pass"}
  ]
}

$ ssh cc-ci "jq '{level,recipe,stages:(.stages|map({name,status,summary}))}' /var/lib/cc-ci-runs/559/results.json"
{
  "level": 1,
  "recipe": "ghost",
  "stages": [
    {"name": "install", "status": "pass", "summary": null},
    {"name": "backup", "status": "pass", "summary": null},
    {"name": "restore", "status": "pass", "summary": null},
    {"name": "custom", "status": "pass", "summary": null},
    {"name": "lint", "status": "pass", "summary": null}
  ]
}

$ ssh cc-ci "grep -R -n \"start_period\" /var/lib/cc-ci-runs/559/abra/recipes/ghost"
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.yml:60:      start_period: 15m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.yml:84:      start_period: 1m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.ccci.yml:35:      start_period: 15m
/var/lib/cc-ci-runs/559/abra/recipes/ghost/compose.ccci.yml:38:      start_period: 15m

Conclusion:

Historical build 185 passed the full Ghost lifecycle on the SAME ref now used in probe build 559 (d42d0f7c7cf9), so the current M2 blocker is not tied to the custom/ folder migration.
Fresh failing runs still execute the canonical 4-file tests/ghost/custom/ suite and pass every non-upgrade stage; the key symptom remains that the upgrade stage, and with it the upgrade junit output, is missing from the stages list of build 559.
The current repo does not show an obvious cfold-local fix to apply: the Ghost-specific overlay is unchanged, the recipe artifact still carries the expected compose.ccci.yml file, and the failure remains in the live upgrade path rather than discovery/custom-test coverage.
Net: cfold remains blocked on a cfold-neutral Ghost upgrade regression / flake. No repo-local code change was justified by that audit alone.
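
The stage comparison between builds 185 and 559 can be mechanized with a few lines. The stage names below are copied from the two jq outputs above; the helper names are hypothetical, and only the "stages" key of results.json is assumed.

```python
def stage_names(results: dict) -> list[str]:
    """Names of the stages recorded in a results.json payload, in order."""
    return [stage["name"] for stage in results.get("stages", [])]


def missing_stages(baseline: dict, probe: dict) -> list[str]:
    """Stages present in the baseline run but absent from the probe run."""
    probe_names = set(stage_names(probe))
    return [name for name in stage_names(baseline) if name not in probe_names]


def make_results(names):
    # Build a minimal results.json-like dict; status is not used by the diff.
    return {"stages": [{"name": n, "status": "pass"} for n in names]}


build_185 = make_results(["install", "upgrade", "backup", "restore", "custom"])
build_559 = make_results(["install", "backup", "restore", "custom", "lint"])

print(missing_stages(build_185, build_559))  # ['upgrade']
print(missing_stages(build_559, build_185))  # ['lint']
```

The reverse direction reports lint, which matches build 185 being a pre-lint-era baseline.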

2026-06-13 — Ghost PR #3 fresh probe after reopen: same upgrade-only failure, plus duplicate trigger signal

I looked for the smallest allowed M2 step that did not touch recipe code: reuse an existing Ghost PR head that had historically gone green and rerun it through the live !testme path.

Actions taken:

$ set -a && . /srv/cc-ci/.testenv && set +a
$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" -X PATCH \
    -H 'Content-Type: application/json' \
    -d '{"state":"open"}' \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/pulls/3"
# PR #3 reopened; head remains 720faa0bebc46a34857b2933df1924ccabbd4087

$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" -X POST \
    -H 'Content-Type: application/json' \
    -d '{"body":"!testme"}' \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/issues/3/comments"
# comment 14497 created at 2026-06-13T00:07:50Z

Fresh live outcomes:

$ ssh cc-ci 'jq "{run_id, pr, recipe, ref, level, results, stages: (.stages | map({name,status,summary}))}" /var/lib/cc-ci-runs/568/results.json'
{
  "run_id": "568",
  "pr": "3",
  "recipe": "ghost",
  "ref": "720faa0bebc4",
  "level": 1,
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "fail"
  },
  "stages": [
    {"name": "install", "status": "pass", "summary": null},
    {"name": "backup", "status": "pass", "summary": null},
    {"name": "restore", "status": "pass", "summary": null},
    {"name": "custom", "status": "pass", "summary": null},
    {"name": "lint", "status": "pass", "summary": null}
  ]
}

$ ssh cc-ci 'jq "{run_id, pr, recipe, ref, level, finished, results, stages: (.stages | map({name,status}))}" /var/lib/cc-ci-runs/569/results.json'
{
  "run_id": "569",
  "pr": "3",
  "recipe": "ghost",
  "ref": "720faa0bebc4",
  "level": 1,
  "finished": 1781309502.5494862,
  "results": {
    "backup": "pass",
    "custom": "pass",
    "install": "pass",
    "restore": "pass",
    "upgrade": "fail"
  },
  "stages": [
    {"name": "install", "status": "pass"},
    {"name": "backup", "status": "pass"},
    {"name": "restore", "status": "pass"},
    {"name": "custom", "status": "pass"},
    {"name": "lint", "status": "pass"}
  ]
}

Comment-stream evidence for duplicate triggers from one !testme:

$ curl -fsS -u "$GITEA_USERNAME:$GITEA_PASSWORD" \
    "https://$GITEA_URL/api/v1/repos/recipe-maintainers/ghost/issues/3/comments?limit=20"
# ...
# 14497: !testme (2026-06-13T00:07:50Z)
# 14498: cc-ci failure comment for run 568 (2026-06-13T00:08:05Z)
# 14499: cc-ci in-progress comment for run 569 (2026-06-13T00:08:05Z)
# 14500: cc-ci in-progress comment for run 570 (2026-06-13T00:08:05Z)

Takeaways:

Ghost is now freshly red post-cfold on three distinct PR heads (720faa0b, d88f5801, d42d0f7c), all with the same upgrade-only failure shape while custom discovery stays green.
That further weakens any cfold-local explanation; the blocker remains in Ghost's live upgrade path.
There is also likely a separate trigger dedupe problem: one !testme comment spawned runs 568, 569, and 570. I did not broaden into a D1 investigation in this loop step because cfold M2 is already hard-blocked by Ghost's repeated upgrade failures, but the evidence is now recorded.

2026-06-13 — Root-caused Ghost triple-trigger replay; bridge fix authored with unit coverage

Pulled the Adversary's latest cfold audit (review(cfold) ddefc96). It was not an M2 verdict or a finding; it confirmed the sweep is still unclaimable while teardown remains clean (live_pr_apps=0).

I then closed out the duplicate-run side observation from the Ghost PR #3 retrigger.

Evidence:

$ ssh cc-ci 'docker logs --since "2026-06-13T00:07:30" --until "2026-06-13T00:08:30" c54c433972ac 2>&1'
[poll] triggered build 568 for ghost@720faa0b (PR #3, comment 14029) by autonomic-bot
[poll] triggered build 569 for ghost@720faa0b (PR #3, comment 14032) by autonomic-bot
[poll] triggered build 570 for ghost@720faa0b (PR #3, comment 14497) by autonomic-bot

$ ssh cc-ci 'docker service ps ccci-bridge_app --no-trunc'
# single running replica only; no restart near the incident

$ ssh cc-ci 'docker ps --format "{{.ID}} {{.Names}} {{.Status}}" | grep ccci-bridge || true'
c54c433972ac ccci-bridge_app.1.u5msezm603izeyf7kizqxq97j Up 22 hours

Conclusion: this was NOT one comment id deduped incorrectly inside a single process. It was the poller correctly treating THREE distinct comment ids as unseen after PR #3 was reopened:

14029 and 14032 were historical !testme comments from when PR #3 had been open earlier.
PR #3 was closed when the current bridge process started, so those comments were not covered by the startup pass that marks pre-existing comments seen.
When PR #3 was reopened, the poller saw those old comments for the first time and replayed them, then also processed the fresh comment 14497.

Repo fix authored:

bridge/bridge.py: added _PROCESS_STARTED_AT and _is_preexisting_comment() so the poller now marks any trigger comment older than the current bridge process as already-seen, even if the PR was closed at startup and only becomes visible later via reopen.
tests/unit/test_bridge_trigger.py: added focused tests for pre-start vs post-start comment handling.