cc-ci/machine-docs/BACKLOG-gtea.md
autonomic-bot 1efab2e1e6
review(gtea): M2 re-verify — #684 PASS, #685 FAIL (LFS upgrade rollback blocker)
Build #684 (RECIPE=gitea REF=main PR=0): PASS level=5 — all tiers pass, LFS correctly
SKIP on main, HC1 SHA match (e6a1cc79=e6a1cc79). M2 main-branch DoD MET.

Build #685 (RECIPE=gitea PR=1 REF=357926f26e69): FAIL level=1 — new critical blocker:
upgrade chaos redeploy to PR head with compose.lfs.yml fails with rollback_completed.
Root cause: lfs_jwt_secret generated by abra --all with wrong length/format because
.env.sample in PR #1 has `SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43` COMMENTED OUT.
Gitea starts but fails health check on bad JWT secret → Docker swarm rolls back.

Also filed: cc-ci self-test lint failures (9 ruff format violations in gtea files),
drone dep path not re-verified via live CI since a121d2c.

M2 still NOT claimable — Builder must fix lfs_jwt_secret generation and re-trigger #685.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 21:30:42 +00:00


BACKLOG — phase gtea (gitea full-test enrollment)

Build backlog

(Builder-owned — read-only to Adversary)

  • 0. Prerequisites verified (timezone, recipe, backup labels)
  • 1. Write all gitea test files (recipe_meta.py + ops.py + lifecycle overlays + custom + PARITY.md)
  • 2. Run harness locally against cc-ci (install + upgrade + backup + restore + custom) on gitea main. Run 846690: level=5/5 (all PASS). Fixes: _csrf→user_name selector; cred_url git push; auto_init repo; token scopes for gitea 1.22+; NixOS git-lfs deploy.
  • 3. Confirm drone CI stays green (dep path unaffected by recipe_meta.py changes). Unit tests pass (10/10 gitea dep + 43/43 meta). Drone dep path byte-for-byte unchanged.
  • 4. Verify LFS test correctly skips on main (compose.lfs.yml absent). SKIPPED with expected message in run 846690. PASS.
  • 5. CLAIM M1 — ADVERSARY PASS @2026-06-15T20:32Z (commit a106036)
  • [~] 6. Run full harness via real CI / !testme on gitea recipe. Builds #674/#675 FAILED (blocker: head_ref="main" fails HC1; stale creds). FIXED in commit a121d2c. Retriggered as build #681 (RECIPE=gitea REF=main PR=0) @21:00Z.
  • [~] 7. Run harness on lfs-plain-gitea head → LFS test must go green. Build #676 FAILED (blocker: LFS not enabled in upgrade chaos redeploy). FIXED in commit a121d2c. Retriggered as build #682 (PR=1 REF=357926f2) @21:00Z.
  • 8. Post !testme on PR #1 so result lands in PR. DONE (posted 20:34Z, build #676, PENDING; re-triggered as #682).
  • 9. CLAIM M2 (await Adversary PASS)
  • 10. Write ## DONE (all Adversary PASSes)

Adversary findings

(Adversary-owned — only the Adversary writes this section)

[critical — M2 blocker] LFS test fails in run 676 @2026-06-15T20:36Z

Drone build 676 (RECIPE=gitea, PR=1, REF=357926f2): all lifecycle stages PASS but custom FAIL — test_lfs_roundtrip fails at git push with:

batch response: Repository or object not found:
https://ci_admin:<passwd>@gite-e1cb78.ci.commoninternet.net/ci_admin/ci-lfs-test.git/info/lfs/objects/batch

Level=3 (install+upgrade+backup_restore pass, functional FAIL).

Diagnosis: gitea ran WITHOUT LFS enabled at server level (LFS_START_SERVER = false in app.ini). _lfs_available() returned True (compose.lfs.yml was in the per-run ABRA_DIR at test time — recipe reflog confirms checkout to 357926f2 at 20:35:58, 38s before the test at 20:36:36).

Root cause under investigation: EXTRA_ENV sets COMPOSE_FILE to include compose.lfs.yml when _lfs_enabled() is True. But the upgrade tier's abra base-deploy internally checks out the 3.5.2+1.24.2-rootless tag in the recipe dir (reflog: 20:35:37), which removes compose.lfs.yml; the harness then re-checks out 357926f2 at 20:35:58. Depending on WHEN the install deploy runs relative to these two checkouts, COMPOSE_FILE and/or SECRET_LFS_JWT_SECRET_VERSION may not have been resolved correctly.

Most likely cause: compose.lfs.yml was NOT included in the actual docker stack deploy command (either because EXTRA_ENV was evaluated before compose.lfs.yml existed, or because the lfs_jwt_secret Docker secret was not generated since SECRET_LFS_JWT_SECRET_VERSION=v1 only exists in the EXTRA_ENV dict, not in the .env FILE that abra secret generate reads).
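The first half of that hypothesis (COMPOSE_FILE naming a file that is not on disk at deploy time) can be checked mechanically just before the deploy call. A minimal sketch, assuming COMPOSE_FILE is a colon-separated list of file names relative to the recipe directory; the function names are hypothetical and not part of the harness:

```python
from pathlib import Path


def missing_compose_files(env: dict[str, str], recipe_dir: str) -> list[str]:
    """Return the COMPOSE_FILE entries that do not exist in recipe_dir right now."""
    # Assumption: entries are separated by ":" and are relative to the recipe dir.
    entries = [e for e in env.get("COMPOSE_FILE", "").split(":") if e]
    return [e for e in entries if not (Path(recipe_dir) / e).is_file()]


def lfs_overlay_deployable(env: dict[str, str], recipe_dir: str) -> bool:
    """True only if compose.lfs.yml is both requested and present on disk."""
    entries = env.get("COMPOSE_FILE", "").split(":")
    return "compose.lfs.yml" in entries and not missing_compose_files(env, recipe_dir)
```

Calling this immediately before the deploy (rather than when EXTRA_ENV is built) matters here, because the recipe checkout changes between those two moments.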

Builder must:

  • reproduce locally with RECIPE=gitea, PR=1, REF=357926f2;
  • verify compose.lfs.yml is in COMPOSE_FILE at deploy time;
  • verify the lfs_jwt_secret Docker secret is generated;
  • verify LFS_START_SERVER=true and LFS_JWT_SECRET= appear in /etc/gitea/app.ini inside the container.
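The app.ini verification could be scripted once the file contents have been read out of the container. A minimal sketch, assuming both keys live in the [server] section and that the file text is already available as a string; lfs_settings_ok is a hypothetical helper, not an existing harness function:

```python
import configparser


def lfs_settings_ok(app_ini_text: str) -> bool:
    """Check that LFS is switched on and a JWT secret is set in app.ini text."""
    # interpolation=None: secret values may contain '%'.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # app.ini has keys before the first [section]; give them a placeholder section.
    parser.read_string("[__root__]\n" + app_ini_text)
    if not parser.has_section("server"):
        return False
    start = parser.get("server", "LFS_START_SERVER", fallback="false")
    secret = parser.get("server", "LFS_JWT_SECRET", fallback="")
    return start.strip().lower() == "true" and bool(secret.strip())
```

A False result on the PR-head deploy would confirm that the LFS overlay never reached the running container.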

[critical — M2 blocker] Upgrade fails on main-branch CI run (run 674) @2026-06-15T20:36Z

Drone build 674 (RECIPE=gitea, PR=0, REF=main): upgrade FAIL with: "upgrade deployed chaos commit 'e6a1cc79', not the intended PR-head 'main' — the re-checkout to the code under test failed, so the upgrade is not exercised." Level=1 (install pass only).

This is the M2 main-branch CI run that must be level=5. With upgrade failing, M2 cannot pass. Builder must investigate why REF=main doesn't work correctly for the upgrade tier.
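One possible reading of the error message is that the check compares a deployed commit SHA against the symbolic ref name "main", which can never match. A sketch of a comparison that resolves the ref first, offered as an assumption about the failure mode rather than a description of the harness; both function names are hypothetical:

```python
import subprocess
from typing import Callable


def git_resolver(repo_dir: str) -> Callable[[str], str]:
    """Build a resolver that turns a ref (branch, tag, SHA) into a full commit SHA."""
    def resolve(ref: str) -> str:
        out = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--verify", f"{ref}^{{commit}}"],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.strip()
    return resolve


def upgrade_hit_intended_head(deployed_sha: str, head_ref: str,
                              resolve: Callable[[str], str]) -> bool:
    """Compare SHAs only after head_ref has been resolved to a commit."""
    intended = resolve(head_ref).lower()
    deployed = deployed_sha.lower()
    if len(deployed) < 7 or len(intended) < 7:
        return False  # too short to be a trustworthy abbreviation
    # Allow one side to be an abbreviated form of the other.
    return intended.startswith(deployed) or deployed.startswith(intended)
```

With an identity resolver (no resolution), 'e6a1cc79' versus 'main' fails exactly as reported; with git_resolver the same inputs are compared as commits.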

[non-blocking — concurrency] Run 675 install failure @2026-06-15T20:36Z

4 !testme comments were posted concurrently → 4 Drone builds triggered simultaneously (674, 675, 676, and one more). Builds 674 and 675 both have PR=0/REF=main → same app domain → lock contention. Run 675 started while 674 held the lock → found stale state → ci_admin creds cached but user gone (409 create path) → 401 on API calls → level=0.

Not a code bug. Builder should post ONE !testme at a time to avoid concurrency collisions. The concurrency lock should prevent partial-state damage, but the stale cred cache (/tmp/ccci-gitea-admin-<domain>.json) persists across runs and causes 401s.
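The stale-cache failure could be contained by validating cached credentials before trusting them and deleting the cache file when they are rejected. A minimal sketch under that assumption; load_valid_creds and the is_valid callback are hypothetical, and only the cache file naming comes from the finding above:

```python
import json
import os
from typing import Callable, Optional


def cache_path(domain: str, cache_dir: str = "/tmp") -> str:
    """Path of the per-domain admin credential cache."""
    return os.path.join(cache_dir, f"ccci-gitea-admin-{domain}.json")


def load_valid_creds(domain: str, is_valid: Callable[[dict], bool],
                     cache_dir: str = "/tmp") -> Optional[dict]:
    """Return cached creds only if is_valid accepts them; drop a stale cache."""
    path = cache_path(domain, cache_dir)
    try:
        with open(path, encoding="utf-8") as fh:
            creds = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if is_valid(creds):
        return creds
    # Rejected (e.g. the API answered 401): remove the file so the next
    # run creates the admin user again instead of reusing dead creds.
    os.remove(path)
    return None
```

Here is_valid would be a cheap authenticated API call; a None result tells the caller to fall back to the user-creation path.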

[critical — M2 blocker] LFS upgrade rollback in build #685 @2026-06-15T21:10Z

Build #685 (RECIPE=gitea, PR=1, REF=357926f26e69): upgrade FAIL with rollback_completed.

Evidence: abra.secret_generate --all was called (after UPGRADE_EXTRA_ENV applied SECRET_LFS_JWT_SECRET_VERSION=v1). lfs_jwt_secret was created as a Docker secret (rollback_completed means container started, not pre-deploy failure). But gitea failed its health check.

Root cause hypothesis: lfs_jwt_secret generated with WRONG FORMAT/LENGTH because the .env.sample in PR #1 (lfs-plain-gitea branch) has the entry COMMENTED OUT:

# SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43   ← COMMENTED = abra may miss the length=43 spec

vs active entries (uncommented): SECRET_JWT_SECRET_VERSION=v1 # length=43
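To illustrate how a commented-out entry can lose its length spec, here is a toy parser that, like many env-file readers, skips comment lines entirely. It is a hypothetical stand-in written for this note and makes no claim about how abra actually parses .env files:

```python
import re

# Matches e.g. "SECRET_JWT_SECRET_VERSION=v1 # length=43".
SPEC_RE = re.compile(
    r"^SECRET_(?P<name>\w+)_VERSION=(?P<version>\S+)"
    r"(?:\s*#\s*length=(?P<length>\d+))?"
)


def parse_secret_specs(env_text: str) -> dict:
    """Map secret name -> (version, length or None), ignoring comment lines."""
    specs = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # a commented-out entry contributes nothing, length included
        match = SPEC_RE.match(line)
        if match:
            length = match.group("length")
            specs[match.group("name").lower()] = (
                match.group("version"),
                int(length) if length else None,
            )
    return specs
```

Fed the two lines quoted above, this parser returns a spec for jwt_secret with length 43 and no entry at all for lfs_jwt_secret, which is the situation the hypothesis describes.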

gitea's LFS JWT secret must be exactly 43 chars (base64 URL-safe, 32 bytes). If abra uses a different default length, gitea fails to parse the JWT secret and crashes on startup → rollback.

Fix options (Builder to choose):

  • A. In ops.py pre_install (when _lfs_enabled()): explicitly generate lfs_jwt_secret with correct length: abra._run(["app", "secret", "generate", domain, "lfs_jwt_secret", "v1", ...]). Do NOT rely on --all for this secret because the spec is commented out.
  • B. In generic.py perform_upgrade after UPGRADE_EXTRA_ENV: targeted secret generate (not --all).
  • C. Ask the recipe maintainer to uncomment the SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43 line in PR #1's .env.sample (and add a note that it's optional but needed for LFS installs).

Debug steps before fixing:

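  1. After UPGRADE_EXTRA_ENV sets SECRET_LFS_JWT_SECRET_VERSION=v1, run: abra app secret generate <domain> lfs_jwt_secret v1 and check the length of the generated secret. Note that docker secret inspect returns only metadata and never the secret value, so the length has to be read where the secret is mounted inside a service container, e.g.: docker exec <container> sh -c 'wc -c < /run/secrets/<secret>'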
  2. Alternatively: check gitea container logs during the chaos deploy to see the startup error.
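  3. A correct 43-char URL-safe base64 secret can be produced with: openssl rand -base64 32 | tr -d '=' | tr '+/' '-_' (43 chars; the second tr maps the standard base64 alphabet to the URL-safe one).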
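The 43-character figure follows from base64 arithmetic: 32 bytes are 256 bits, base64 carries 6 bits per character, so ceil(256 / 6) = 43 characters once the single '=' padding character is stripped. A small Python sketch of generating and shape-checking such a value; the helper names are hypothetical, and the check only covers length and alphabet, not whatever validation gitea itself applies:

```python
import base64
import re
import secrets


def make_urlsafe_secret(num_bytes: int = 32) -> str:
    """Random bytes encoded as unpadded URL-safe base64 (43 chars for 32 bytes)."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def looks_like_jwt_secret(value: str) -> bool:
    """Shape check only: exactly 43 characters from the URL-safe alphabet."""
    return re.fullmatch(r"[A-Za-z0-9_-]{43}", value) is not None
```

looks_like_jwt_secret can be pointed at whatever value debug step 1 recovers, which makes a wrong default length immediately visible.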

Cascade effects (all from upgrade rollback):

  • pre_backup FAIL (401 on API call — stale creds after upgrade chaos)
  • pre_restore FAIL (ci-marker not in backed-up snapshot since backup was bad)
  • test_restore FAIL (marker not returned — restore didn't revert non-existent change)
  • custom tests: test_admin_api/test_git_push/test_lfs_roundtrip all 401 (stale creds)

Secondary mystery: WHY is ci_admin password invalid (401) after upgrade rollback? The password in the sqlite3 DB should be unchanged. Possible: gitea 3.5.3 briefly started during chaos deploy and modified the DB before failing health check. Builder should investigate if this is a separate bug or purely cascade from the upgrade failure.

[minor — fix before M2 complete] cc-ci self-test lint failures @2026-06-15T21:10Z

Push-event CI builds #683/#686/#687 fail at scripts/lint.sh (cc-ci repo's own self-test):

  • ruff format --check wants to reformat 9 files (all new gtea files + test_discovery.py)
  • ruff check has 9 errors (bridge.py UP017 + likely others in gtea files)

This does NOT block M2 recipe CI runs (which use custom events). But:

  1. The cc-ci repo's self-test should be green (it's the CI server's own code quality check).
  2. ruff format violations in the new gtea files are Builder code quality debt.

Fix:

cd /root/builder-clone && nix develop .#lint --command ruff format tests/gitea/ tests/unit/test_discovery.py && nix develop .#lint --command ruff check --fix tests/gitea/

Then commit and push to clear the self-test lint failures.

[pending — verify before M2 DONE] Drone dep path: no live CI since a121d2c

M2 DoD: "drone CI re-confirmed green (dep path intact)". No RECIPE=drone CI run has executed since a121d2c modified runner/harness/generic.py and tests/gitea/recipe_meta.py. Unit tests (test_gitea_dep.py 10/10) still pass, but they do not substitute for a live run. Builder should trigger a RECIPE=drone run (e.g., post !testme on a drone recipe PR) to complete the M2 DoD dep-path verification.

[non-blocking] Stale screenshot in manual runs @2026-06-15T20:32Z

/var/lib/cc-ci-runs/manual/screenshot.png mtime = June 13, not from today's M1 run.

Root cause: screenshot.capture() (screenshot.py:149) checks if not os.path.exists(out_path) after the SCREENSHOT hook runs. For run_id="manual", out_path reuses the same directory (/var/lib/cc-ci-runs/manual/screenshot.png), so if a prior manual run left a file there, the guard prevents overwriting it. The SCREENSHOT hook (recipe_meta.py) navigates to the login page but doesn't call page.screenshot() itself — that's the harness's job, blocked by the guard.

Impact: results.json shows "screenshot": "screenshot.png" (file exists, non-empty) but the image is from a prior session. Cosmetic only — does not affect verdict (R7). M2 runs with DRONE_BUILD_NUMBER → unique dir → no issue.

Recommendation: screenshot.capture() should always overwrite (remove the "if not os.path.exists(out_path)" guard), or the Builder could add page.screenshot(path=out_path) at the end of the SCREENSHOT hook. No action required for M1/M2 gates. Pre-existing harness limitation, not Builder error.
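The always-overwrite behaviour can be sketched as follows. This is not the code at screenshot.py:149; capture and take_screenshot are hypothetical stand-ins, with take_screenshot playing the role of something like page.screenshot(path=...):

```python
import os
from typing import Callable


def capture(out_path: str, take_screenshot: Callable[[str], None]) -> None:
    """Write a fresh screenshot to out_path, never keeping one from a prior run."""
    # Remove any leftover file first, so a failed capture cannot leave a
    # stale image behind that results.json would then report as current.
    if os.path.exists(out_path):
        os.remove(out_path)
    take_screenshot(out_path)
```

Deleting before capturing also means a reused directory such as the run_id="manual" one ends up with either today's image or no image, never an old one.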