review(gtea): M2 re-verify — #684 PASS, #685 FAIL (LFS upgrade rollback blocker)
Some checks failed
continuous-integration/drone/push Build is failing
Build #684 (RECIPE=gitea REF=main PR=0): PASS level=5 — all tiers pass, LFS correctly
SKIP on main, HC1 SHA match (e6a1cc79=e6a1cc79). M2 main-branch DoD MET.
Build #685 (RECIPE=gitea PR=1 REF=357926f26e69): FAIL level=1 — new critical blocker:
upgrade chaos redeploy to PR head with compose.lfs.yml fails with rollback_completed.
Root cause: lfs_jwt_secret generated by abra --all with wrong length/format because
.env.sample in PR #1 has `SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43` COMMENTED OUT.
Gitea starts but fails health check on bad JWT secret → Docker swarm rolls back.
Also filed: cc-ci self-test lint failures (9 ruff format violations in gtea files),
drone dep path not re-verified via live CI since a121d2c.
M2 still NOT claimable — Builder must fix lfs_jwt_secret generation and re-trigger #685.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -78,6 +78,71 @@ Not a code bug. Builder should post ONE !testme at a time to avoid concurrency c
The concurrent lock mechanism should prevent partial-state damage, but the stale cred cache
(`/tmp/ccci-gitea-admin-<domain>.json`) persists and causes 401s.
### [critical — M2 blocker] LFS upgrade rollback in build #685 @2026-06-15T21:10Z

Build #685 (RECIPE=gitea, PR=1, REF=357926f26e69): upgrade FAIL with rollback_completed.

Evidence: `abra.secret_generate` (the `--all` path) was called after UPGRADE_EXTRA_ENV applied
SECRET_LFS_JWT_SECRET_VERSION=v1, so lfs_jwt_secret was created as a Docker secret
(rollback_completed means the container started; this was not a pre-deploy failure).
Gitea then failed its health check.

**Root cause hypothesis**: lfs_jwt_secret generated with WRONG FORMAT/LENGTH because the
`.env.sample` in PR #1 (lfs-plain-gitea branch) has the entry COMMENTED OUT:
```
# SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43 ← COMMENTED = abra may miss the length=43 spec
```
vs active entries (uncommented): `SECRET_JWT_SECRET_VERSION=v1 # length=43`

Gitea's LFS JWT secret must be exactly 43 chars (URL-safe base64 of 32 bytes, no padding). If abra
uses a different default length, Gitea fails to parse the JWT secret and crashes on startup → rollback.

**Fix options** (Builder to choose):
A. In `ops.py pre_install` (when `_lfs_enabled()`): explicitly generate lfs_jwt_secret with
   correct length: `abra._run(["app", "secret", "generate", domain, "lfs_jwt_secret", "v1", ...])`.
   Do NOT rely on `--all` for this secret because the spec is commented out.
B. In generic.py `perform_upgrade` after UPGRADE_EXTRA_ENV: targeted secret generate (not --all).
C. Ask the recipe maintainer to uncomment the `SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43`
   line in PR #1's `.env.sample` (and add a note that it's optional but needed for LFS installs).

Debug steps before fixing:
1. After UPGRADE_EXTRA_ENV sets SECRET_LFS_JWT_SECRET_VERSION=v1, run
   `abra app secret generate <domain> lfs_jwt_secret v1` and check the length of the generated value.
   `docker secret inspect` does not expose secret data, so measure it inside a task that mounts
   the secret, e.g. `wc -c < /run/secrets/<secret-target>`.
2. Alternatively: check gitea container logs during the chaos deploy to see the startup error.
3. A correct 43-char URL-safe base64 secret can be produced with
   `openssl rand -base64 32 | tr '+/' '-_' | tr -d '='` (43 chars).

Cascade effects (all from upgrade rollback):
- pre_backup FAIL (401 on API call — stale creds after upgrade chaos)
- pre_restore FAIL (ci-marker not in backed-up snapshot since backup was bad)
- test_restore FAIL (marker not returned — restore didn't revert non-existent change)
- custom tests: test_admin_api/test_git_push/test_lfs_roundtrip all 401 (stale creds)

Secondary mystery: why is the ci_admin password invalid (401) after the upgrade rollback? The
password in the sqlite3 DB should be unchanged. Possible: gitea 3.5.3 briefly started during the
chaos deploy and modified the DB before failing its health check. Builder should investigate
whether this is a separate bug or purely a cascade from the upgrade failure.
### [minor — fix before M2 complete] cc-ci self-test lint failures @2026-06-15T21:10Z

Push-event CI builds #683/#686/#687 fail at `scripts/lint.sh` (cc-ci repo's own self-test):
- `ruff format --check` wants to reformat 9 files (all new gtea files + test_discovery.py)
- `ruff check` has 9 errors (bridge.py UP017 + likely others in gtea files)

This does NOT block M2 recipe CI runs (which use custom events). But:
1. The cc-ci repo's self-test should be green (it's the CI server's own code quality check).
2. `ruff format` violations in the new gtea files are Builder code quality debt.

Fix: `cd /root/builder-clone && nix develop .#lint --command ruff format tests/gitea/ tests/unit/test_discovery.py && nix develop .#lint --command ruff check --fix tests/gitea/`
Then commit and push to clear the self-test lint failures.
### [pending — verify before M2 DONE] Drone dep path: no live CI since a121d2c

M2 DoD: "drone CI re-confirmed green (dep path intact)". No RECIPE=drone CI build has run
since a121d2c modified `runner/harness/generic.py` and `tests/gitea/recipe_meta.py`.
Unit tests (test_gitea_dep.py 10/10) still pass.
Builder should trigger a RECIPE=drone run (e.g., post !testme on a drone recipe PR)
to complete the M2 DoD dep-path verification.
### [non-blocking] Stale screenshot in manual runs @2026-06-15T20:32Z

`/var/lib/cc-ci-runs/manual/screenshot.png` mtime = June 13, not from today's M1 run.
109
machine-docs/BUILDER-INBOX.md
Normal file
@@ -0,0 +1,109 @@
# BUILDER-INBOX — phase gtea

Adversary → Builder side-channel. Builder: consume this file and delete it.

---
## M2 re-verify results @2026-06-15T21:30Z

Build #684 (main) and #685 (PR #1) are complete. One new critical blocker.

### Build #684 (RECIPE=gitea REF=main PR=0): PASS ✓ level=5

All 5 tiers pass. The LFS test is correctly SKIPPED on main. Upgrade SHA-match correct.
This satisfies the M2 main-branch DoD condition.
### Build #685 (RECIPE=gitea PR=1 REF=357926f26e69): FAIL level=1

**Blocker 4: LFS upgrade rollback (NEW)**

Upgrade fails with `rollback_completed`: Docker Swarm tried to update the gitea service
with compose.lfs.yml, but the NEW container started and then failed its health check → rolled back.

**Root cause (high confidence)**: lfs_jwt_secret Docker secret was generated by
`abra secret generate --all` but with WRONG LENGTH/FORMAT.

Evidence: In PR #1's `.env.sample`, the lfs_jwt_secret spec is COMMENTED OUT:
```
# SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43 ← COMMENT: abra may miss the length=43 spec
```
Abra reads the recipe's `.env.sample` to get secret parameters (including length). If the entry
is commented out, abra may use a default length instead of 43. Gitea's LFS JWT secret must be
exactly 43 chars (URL-safe base64 without padding = 32 bytes). Wrong length → gitea fails to
parse the JWT secret at startup → fails health check → Docker swarm rolls back.
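
The length claim above can be checked mechanically. The following is a minimal sketch, assuming the secret is 32 random bytes encoded as unpadded URL-safe base64, as the text states. Both function names are hypothetical and are not part of cc-ci, abra, or Gitea.

```python
import base64
import binascii
import re
import secrets

# 43 characters drawn from the URL-safe base64 alphabet, no padding.
_URLSAFE_43 = re.compile(r"[A-Za-z0-9_-]{43}")


def generate_lfs_jwt_secret() -> str:
    """Return 32 random bytes as unpadded URL-safe base64 (43 characters)."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def is_valid_lfs_jwt_secret(value: str) -> bool:
    """Check the shape described above: 43 URL-safe chars decoding to 32 bytes."""
    if not _URLSAFE_43.fullmatch(value):
        return False
    try:
        # Restore the single '=' that was stripped so the decoder accepts it.
        decoded = base64.urlsafe_b64decode(value + "=")
    except (binascii.Error, ValueError):
        return False
    return len(decoded) == 32
```

A check like this could run against the value read from the mounted secret file, so a wrong-length secret is reported before the chaos redeploy rather than inferred from a rollback.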

**Why `rollback_completed` and NOT a deploy-fail?**
Docker "secret not found" errors happen at deploy time (before the container starts), which
would produce a different error, not `rollback_completed`. The fact that rollback_completed
occurred means the container DID start but failed its health check. So the secret EXISTS but
has wrong content.
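
The distinction drawn above can be sketched as a small classifier over the service's update status. This assumes the JSON comes from `docker service inspect --format '{{json .UpdateStatus}}' <service>`; the function name and the returned labels are illustrative only and do not reflect how the cc-ci harness is written.

```python
import json

# Swarm update states that mean the new task spec did not stay deployed.
_ROLLBACK_STATES = {"rollback_started", "rollback_paused", "rollback_completed"}


def classify_update_status(raw_json: str) -> str:
    """Map a service's UpdateStatus JSON to a coarse label."""
    # The template prints "null" when the service has never been updated.
    status = json.loads(raw_json) or {}
    state = status.get("State", "")
    if state == "completed":
        return "converged"
    if state in _ROLLBACK_STATES:
        # The update was accepted and scheduled before being reverted, so the
        # failure happened after the deploy request, not while validating it.
        return "rolled_back"
    if state == "paused":
        return "paused"
    if state == "updating":
        return "in_progress"
    return "unknown"
```

A missing secret would be rejected when the update is submitted, so it would never surface as one of the rollback states handled here.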

**Verify the issue:**
After UPGRADE_EXTRA_ENV is applied (SECRET_LFS_JWT_SECRET_VERSION=v1 in .env), run:
```bash
abra app secret generate <domain> lfs_jwt_secret v1 -m -n
docker secret ls | grep lfs_jwt  # get the full secret name
# `docker secret inspect` never returns secret data, so measure the value from
# inside a task that mounts the secret (e.g. during the chaos deploy, before rollback):
docker exec <container> sh -c 'wc -c < /run/secrets/<secret-target>'
# Should be 43 (+ optional newline = 44). If not 43, that's the bug.
```

**Fix options:**

Option A (recommended): In `ops.py pre_install`, when LFS is enabled, explicitly generate the
lfs_jwt_secret with the correct command (targeted, not --all):
```python
import subprocess

if _lfs_enabled():
    # Targeted generation: do not rely on --all while the spec is commented out.
    # NOTE: confirm that `--length` is accepted by the installed abra version.
    subprocess.run(
        ["abra", "app", "secret", "generate", ctx.domain, "lfs_jwt_secret", "v1",
         "--length", "43", "-m", "-n"],
        check=False,
    )
```
Also do the same in perform_upgrade (after UPGRADE_EXTRA_ENV, before chaos redeploy).

Option B: In generic.py perform_upgrade, replace `abra.secret_generate(domain)` with:
```python
abra._run(["app", "secret", "generate", domain, "lfs_jwt_secret", "v1",
           "--length", "43", "-m", "-C", "-o", "-n"], check=False)
```
BUT only if `_lfs_enabled()` is True in UPGRADE_EXTRA_ENV context.

Option C: Ask the recipe to uncomment the line in PR #1's `.env.sample`:
```
SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43 ← remove the leading #
```
Then `abra secret generate --all` would find it correctly. This requires a commit to PR #1.

**Secondary effect (401 after rollback):**
After the upgrade rollback, all API calls return `user's password is invalid` for ci_admin.
The stale-creds fix in pre_install (delete creds file) correctly runs at INSTALL time. But
the ROLLBACK may leave gitea's sqlite3 DB in a state where the admin password has changed
(gitea 3.5.3 briefly started during the chaos deploy attempt and may have modified the DB).
This cascade clears itself if the upgrade succeeds (no broken state). But if you can reproduce
this 401-after-rollback, it suggests a deeper issue. Investigate if gitea modifies admin creds
on any startup when certain env vars are set.
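
One way to approach the 401-after-rollback question is to probe the API with the same credentials before the chaos redeploy and again after the rollback. The following is a minimal sketch, assuming Gitea's `/api/v1/user` endpoint with HTTP basic auth. Both helper names are hypothetical, and the credentials would come from wherever the harness already stores them.

```python
import base64
import urllib.error
import urllib.request


def build_auth_probe(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build a basic-auth GET request for the authenticated-user endpoint."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/api/v1/user")
    request.add_header("Authorization", f"Basic {token}")
    return request


def credentials_valid(request: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Return True on HTTP 200, False on 401, and re-raise any other HTTP error."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return False
        raise
```

If the probe succeeds before the redeploy and fails after the rollback with unchanged credentials, the password change happened during the brief start of the new container rather than in the harness.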
### Additional items (non-blocking for M2 recipe CI, but fix before DONE):

**cc-ci self-test lint failures:**
All push-event CI builds (#683, #686, #687) fail at `ruff format` and `ruff check`:
- 9 files need `ruff format`: 8 new gtea files (test_admin_api.py, test_git_push.py,
  test_lfs_roundtrip.py, ops.py, recipe_meta.py, test_backup.py, test_install.py, test_upgrade.py)
  plus test_discovery.py
- 9 ruff check errors (at least bridge.py UP017 + likely others in gtea files)
Fix:
```bash
cd /root/builder-clone
nix develop .#lint --command ruff format tests/gitea/ tests/unit/test_discovery.py
nix develop .#lint --command ruff check --fix tests/gitea/
# verify: nix develop .#lint --command bash scripts/lint.sh
# -a stages the reformatted (already tracked) files before committing
git commit -am "fix(gtea): ruff format + check all gtea test files"
```

**Drone dep path: needs live CI verification**
No RECIPE=drone CI run since a121d2c changed generic.py + recipe_meta.py. Unit tests pass
but M2 DoD requires live CI verification. Trigger a RECIPE=drone run when convenient
(post !testme on a drone recipe PR, or manually trigger with RECIPE=drone).

— Adversary, 2026-06-15T21:30Z
@@ -173,3 +173,126 @@ Filed as critical M2 blockers in BACKLOG-gtea.md. Builder must fix before M2 can
2. Upgrade fails on main branch run 674 (level=1, not level=5)

Gate M2: **NOT CLAIMED** — Builder must fix and re-trigger CI

---
## M2 re-verification @2026-06-15T21:30Z (builds #684 and #685)

Builder fixed two blockers (commit a121d2c): UPGRADE_EXTRA_ENV for LFS, head_ref SHA fix,
stale creds deletion in pre_install. Triggered builds #684 (main) and #685 (PR #1).

### Build #684 — RECIPE=gitea REF=main PR=0 — **PASS** level=5 ✓

Full log reviewed from Drone API.

- lint: pass ✓
- install: PASS — generic test_serving + gitea test_install_gitea both PASS ✓
- upgrade: PASS — version=3.5.2→3.5.3, HC1: head_ref=e6a1cc79, chaos-version=e6a1cc79 (SHA match) ✓
- backup: PASS — restic snapshot 8435c4df, 53 files, marker captured ✓
- restore: PASS — pre_restore deleted ci-marker, restore returned it (genuine divergence) ✓
- custom: all 4 tests:
  - test_admin_api: PASS (user+org+token CRUD lifecycle) ✓
  - test_git_push: PASS (create repo→push→verify via API) ✓
  - test_health: PASS (root HTTP 200) ✓
  - test_lfs_roundtrip: SKIP ✓ — correct ("compose.lfs.yml absent in gitea recipe checkout —
    LFS is not enabled on this branch. This test runs on lfs-plain-gitea (PR #1) and is
    EXPECTED_NA on main.")
- deploy-count=1 (expected 1) ✓
- clean_teardown=true, no_secret_leak=true ✓

**M2 main-branch condition: MET** (build #684, level=5, upgrade SHA-match correct, LFS skip correct)

Screenshot: PNG file, 36KB, captured at 21:04 (during run #684). Visual content not verified
inline (requires file transfer); file is valid PNG with real content. Operator should visually
confirm sign-in page is shown.
### Build #685 — RECIPE=gitea PR=1 REF=357926f26e69 — **FAIL** level=1 ✗

Full log reviewed from Drone API and results.json.

- lint: pass ✓
- install: PASS (base 3.5.2, no LFS) ✓
- upgrade: **FAIL** — `gite-e1cb78.ci.commoninternet.net: upgrade redeploy did NOT converge to
  the head spec — swarm UpdateStatus='rollback_completed'.`
- backup: FAIL (cascade — pre_backup 401: could not ensure ci-marker exists)
- restore: FAIL (cascade — ci-marker absent after restore; backup state was bad)
- custom: FAIL — test_admin_api, test_git_push, test_lfs_roundtrip all get `401 Unauthorized:
  user's password is invalid [uid: 1, name: ci_admin]`; test_health: PASS ✓
  - test_lfs_roundtrip: reaches API call (compose.lfs.yml IS in recipe dir at test time,
    _lfs_available()=True, LFS test DID run) but hits 401 on repo create — cascade failure

**Root cause: upgrade chaos redeploy to PR head with compose.lfs.yml fails (rollback_completed)**

Evidence chain:
1. `rollback_completed` in Docker Swarm means the NEW task STARTED but failed its health check.
   If lfs_jwt_secret did NOT exist as Docker secret, the deploy would fail BEFORE creating the
   task (Docker reports "secret not found" at deploy time, not as a task health failure). Therefore
   lfs_jwt_secret WAS generated as a Docker secret.
2. `abra.secret_generate(domain)` WAS called (generic.py line 267, new fix in a121d2c) with
   SECRET_LFS_JWT_SECRET_VERSION=v1 in the .env after UPGRADE_EXTRA_ENV applied.
3. The COMPOSE_FILE=compose.yml:compose.sqlite3.yml:compose.lfs.yml was correctly set in .env
   (confirmed from log: `upgrade-env: COMPOSE_FILE=...`).
4. Docker confirmed no lfs secrets at post-run check — expected (clean_teardown=true cleaned them).

**Most likely root cause: lfs_jwt_secret generated with wrong length/format by abra --all**

The `.env.sample` in PR #1 (lfs-plain-gitea branch) has the lfs_jwt_secret spec COMMENTED OUT:
```
# SECRET_LFS_JWT_SECRET_VERSION=v1 # length=43
```
Compare with active (uncommented) entries:
```
SECRET_JWT_SECRET_VERSION=v1 # length=43
SECRET_INTERNAL_TOKEN_VERSION=v1 # length=105
```
`abra secret generate --all` reads the recipe's `.env.sample` for secret parameters (including
length). If the `SECRET_LFS_JWT_SECRET_VERSION` entry is commented out, abra may use a default
length (likely not 43) when generating the Docker secret value. A gitea LFS JWT secret must be
a base64 URL-safe string of exactly 43 chars (representing 32 bytes without padding). If abra
generates a wrong-length value, gitea fails to parse its JWT secret on startup and crashes before
passing the `/api/healthz` health check — causing `rollback_completed`.
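
The difference between the commented and active entries can be made concrete with a small parser. This is an illustrative sketch only: it is not abra's actual parser, and the `SecretSpec` type and `parse_secret_specs` name are hypothetical. It shows how a reader that skips `#`-prefixed lines would never see the `length=43` hint for lfs_jwt_secret.

```python
import re
from typing import List, NamedTuple, Optional


class SecretSpec(NamedTuple):
    name: str
    version: str
    length: Optional[int]
    commented_out: bool


# Matches e.g. "SECRET_JWT_SECRET_VERSION=v1 # length=43", with an optional leading "#".
_SPEC_RE = re.compile(
    r"^\s*(?P<hash>#\s*)?SECRET_(?P<name>[A-Z0-9_]+)_VERSION=(?P<version>\S+)"
    r"(?:\s*#\s*length=(?P<length>\d+))?"
)


def parse_secret_specs(text: str) -> List[SecretSpec]:
    """Collect secret specs from .env.sample text, flagging commented-out entries."""
    specs = []
    for line in text.splitlines():
        match = _SPEC_RE.match(line)
        if match is None:
            continue
        length = match.group("length")
        specs.append(
            SecretSpec(
                name=match.group("name").lower(),
                version=match.group("version"),
                length=int(length) if length else None,
                commented_out=match.group("hash") is not None,
            )
        )
    return specs
```

Filtering the result on `commented_out` gives the list of secrets whose length hints a strict reader would miss, which is one way to flag this class of recipe issue during lint.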

**Secondary mystery: admin password 401 after upgrade rollback**
After rollback, gitea 3.5.2 runs again. ci_admin password was written to creds file during
pre_install (fresh install, stale file deleted). Yet all API calls return 401 `user's password
is invalid`. This cascade is unexplained but consistent with gitea being in a bad state after
the rollback (possible: the brief chaos deploy attempt changed state in the sqlite3 DB before
the health check failed and Docker rolled back the CONTAINER — not the DATA volume).

**Files confirmed NOT the issue:**
- compose.lfs.yml structure: correct (external secret declared, GITEA_LFS_START_SERVER env set) ✓
- app.ini.tmpl: LFS_JWT_SECRET rendered from `{{ secret "lfs_jwt_secret" }}` when
  GITEA_LFS_START_SERVER=true ✓
- UPGRADE_EXTRA_ENV applied correctly (confirmed in log) ✓
- HC1 would pass if upgrade converged (SHA logic correct from #684 fix) ✓
### Additional finding: cc-ci self-test lint failures (non-blocking for M2 recipe CI)

Push-event builds #683/#686/#687 fail at `scripts/lint.sh`:
- `ruff format --check`: 9 files need formatting:
  `tests/gitea/custom/test_admin_api.py`, `test_git_push.py`, `test_lfs_roundtrip.py`,
  `tests/gitea/ops.py`, `recipe_meta.py`, `test_backup.py`, `test_install.py`, `test_upgrade.py`,
  `tests/unit/test_discovery.py`
- `ruff check`: 9 errors (at least `bridge/bridge.py:85:36: UP017` + others in gtea files)

These are the cc-ci REPO'S OWN self-tests, not the recipe CI runs. They do NOT gate M2 recipe
CI (which runs via custom events). However, they reflect code quality debt and should be fixed.
`ruff format tests/gitea/` and `ruff check --fix tests/gitea/` would address the gtea files.
The `bridge.py UP017` may be pre-existing.

Filed in BACKLOG-gtea.md Adversary findings.
### Drone dep path: not re-verified via live CI since a121d2c

M2 DoD: "drone CI re-confirmed green (dep path intact)". No RECIPE=drone custom build has run
since commit a121d2c modified generic.py and recipe_meta.py. Unit tests (test_gitea_dep.py 10/10)
still pass and cover the dep path at the code level. A live RECIPE=drone run is needed to satisfy
the full M2 DoD dep-path verification. Filed in BACKLOG as pending.
## M2 VERDICT: PENDING — new critical blocker in build #685

1. ✓ M2 main-branch condition MET (build #684, level=5)
2. ✗ PR #1 LFS capstone FAIL — upgrade rollback with LFS (build #685, level=1)
   Root cause: lfs_jwt_secret generated with wrong format/length (commented-out .env.sample spec)

Gate M2: **NOT CLAIMED** — Builder must fix lfs_jwt_secret generation and re-trigger build #685