cc-ci/machine-docs/BACKLOG-drone.md
autonomic-bot 7723cfef3d
claim(drone): M1 — all fixes applied; run 5 L5; ADV-drone-02+03 both fixed
ADV-drone-02 fixed in 0aa46db (teardown fallback from $CCCI_DEPS_FILE in finally);
ADV-drone-03 fixed in 5384f5c (removed _count_deploy=False; dep deploys count per formula).

Harness run 5 evidence: deploy-count=2/2 (DG4.1 PASS), level=5,
install/upgrade/custom all PASS. 19/19 unit tests pass.

BUILDER-INBOX-drone.md consumed (both ADV-drone-02 + ADV-drone-03 already addressed).
ADVERSARY-INBOX-drone.md written requesting M1 PASS verdict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-11 22:05:38 +00:00


BACKLOG — phase drone (drone enrollment with gitea SCM dep)

Phase plan: /srv/cc-ci/cc-ci-plan/plan-phase-drone-enroll.md


Build backlog

(Builder's section — Adversary read-only)

M1 tasks

  • Read plan + Adversary pre-probes
  • Create phase state files (STATUS/JOURNAL/BACKLOG/REVIEW init)
  • Implement setup_gitea_oauth() in runner/harness/sso.py
  • Extend _enrich_deps_with_sso in runner/run_recipe_ci.py for gitea
  • Create tests/gitea/recipe_meta.py
  • Create tests/drone/recipe_meta.py
  • Create tests/drone/install_steps.sh
  • Create tests/drone/functional/test_scm_configured.py (ADV-drone-01 fixed in 7e7e84d)
  • Create tests/drone/PARITY.md
  • Write unit tests for new harness surface (10/10 pass)
  • Harness run 5 GREEN — deploy-count 2/2 (DG4.1 PASS), level=5, install+upgrade+custom PASS
  • Claim M1 (Adversary to verify ADV-drone-02 fix + M1 evidence)

M2 tasks (after M1 PASS)

  • Mirror drone + gitea on git.autonomic.zone (for !testme CI path)
  • Open !testme PR for drone recipe
  • CI run via !testme on drone PR — full lifecycle green
  • Screenshot real + visually verified
  • Level recorded
  • DEFERRED updated (build-creation gap narrowed + signed off)
  • Operator summary written
  • Claim M2

Adversary findings

ADV-drone-01 [adversary] test_scm_configured follows all redirects — assertion always fails

Filed: 2026-06-11T21:37Z
Severity: CRITICAL — SCM-configured test is always failing, even for a correctly wired drone

Defect: tests/drone/functional/test_scm_configured.py::test_login_redirects_to_gitea_dep uses urllib.request.urlopen(req, context=ctx) which follows ALL redirect hops. The redirect chain for a correctly-wired drone is:

  1. GET /login → 303 → https://<gitea-dep>/login/oauth/authorize?client_id=...&...
  2. Gitea (unauthenticated user) → 302 → https://<gitea-dep>/user/login?redirect_to=...
  3. Final: https://<gitea-dep>/user/login (200 OK)

The test asserts parsed.path == "/login/oauth/authorize" but final_url is /user/login. The assertion ALWAYS fails even when drone is correctly wired.
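To make the mismatch concrete, here is a minimal sketch of the URL check applied to the first hop and to the final hop of the chain above. The hostname `gitea.example.test`, the query strings, and the helper name `is_gitea_authorize` are illustrative assumptions, not values from the real test.

```python
from urllib.parse import urlparse

EXPECTED_PATH = "/login/oauth/authorize"


def is_gitea_authorize(url: str, gitea_domain: str) -> bool:
    """True when url points at the OAuth authorize endpoint on gitea_domain."""
    parsed = urlparse(url)
    return parsed.netloc == gitea_domain and parsed.path == EXPECTED_PATH


# Hypothetical URLs standing in for hop 1 and hop 3 of the redirect chain.
first_hop = "https://gitea.example.test/login/oauth/authorize?client_id=abc"
final_hop = "https://gitea.example.test/user/login?redirect_to=%2Flogin%2Foauth%2Fauthorize"

first_ok = is_gitea_authorize(first_hop, "gitea.example.test")  # drone's own redirect
final_ok = is_gitea_authorize(final_hop, "gitea.example.test")  # after gitea's redirect
```

Only the first hop satisfies the check, which is why a follow-all request can never pass it.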

Verified: reproduced against the live drone.ci.commoninternet.net:

python3 -c "
import ssl, urllib.request, urllib.parse
ctx = ssl.create_default_context(); ctx.check_hostname = False; ctx.verify_mode = ssl.CERT_NONE
req = urllib.request.Request('https://drone.ci.commoninternet.net/login', method='GET')
with urllib.request.urlopen(req, timeout=30, context=ctx) as resp:
    print(resp.geturl())
# → https://git.autonomic.zone/user/login (NOT /login/oauth/authorize)
"

Root cause: The test was designed around the first-redirect check (per REVIEW-drone.md pre-probe) but implemented as a follow-all check. The pre-probe used curl --max-redirs 0 to capture the Location header; the test must replicate this rather than rely on urlopen, which follows redirects by default.

Required fix: Capture ONLY drone's first redirect (the 303 → gitea OAuth authorize), stop before gitea's own redirects. One correct pattern:

import urllib.error
import urllib.parse
import urllib.request

import pytest

class _CaptureOneRedirect(urllib.request.HTTPRedirectHandler):
    """Raise on the first 302/303 instead of following it."""
    def http_error_302(self, req, fp, code, msg, headers):
        raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)
    http_error_303 = http_error_302

# ctx and live_app come from the existing test (SSL context and drone hostname).
opener = urllib.request.build_opener(
    _CaptureOneRedirect(),
    urllib.request.HTTPSHandler(context=ctx),
)
try:
    opener.open(f"https://{live_app}/login", timeout=30)
    pytest.fail("Expected redirect from /login but got 200")
except urllib.error.HTTPError as e:
    if e.code not in (302, 303):
        raise AssertionError(f"Expected 302/303 from /login, got {e.code}")
    # Header lookup is case-insensitive, so a single get() is enough.
    redirect_url = e.headers.get("Location", "")

parsed = urllib.parse.urlparse(redirect_url)
# now check parsed.netloc == gitea_domain and parsed.path == "/login/oauth/authorize"

Also note: The unit test test_scm_redirect_assertions tests the URL assertion logic correctly (with pre-supplied URLs), but does NOT test the redirect-capture mechanism. A unit test for _CaptureOneRedirect behavior against a mock HTTP server would be ideal, but at minimum the integration test must use this pattern.
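A unit test of the capture mechanism could look roughly like the sketch below. It starts a throwaway local HTTP server that imitates the three-hop chain, then compares a follow-all request with a no-follow request. The handler `_ChainHandler`, the helper `probe`, and the paths `/authorize` and `/user/login` on the mock server are assumptions made for illustration; proxies are disabled explicitly so the loopback request is not routed elsewhere.

```python
import http.server
import threading
import urllib.error
import urllib.request


class _ChainHandler(http.server.BaseHTTPRequestHandler):
    """Mock chain: /login -> 303 -> /authorize -> 302 -> /user/login (200)."""

    def do_GET(self):
        hops = {"/login": (303, "/authorize"), "/authorize": (302, "/user/login")}
        if self.path in hops:
            code, target = hops[self.path]
            self.send_response(code)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


class _CaptureOneRedirect(urllib.request.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)
    http_error_303 = http_error_302


def probe(base_url):
    """Return (final URL when following, (code, Location) when not following)."""
    no_proxy = urllib.request.ProxyHandler({})
    follower = urllib.request.build_opener(no_proxy)
    with follower.open(f"{base_url}/login", timeout=5) as resp:
        followed = resp.geturl()
    capturer = urllib.request.build_opener(no_proxy, _CaptureOneRedirect())
    first = None
    try:
        capturer.open(f"{base_url}/login", timeout=5)
    except urllib.error.HTTPError as e:
        first = (e.code, e.headers.get("Location"))
        e.close()
    return followed, first


server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), _ChainHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"
followed_url, first_redirect = probe(base)
server.shutdown()
server.server_close()
```

The follow-all request ends at the mock's `/user/login`, while the capturing opener stops at the 303 and exposes its Location header, which is the value the integration test needs to assert on.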

Repro steps:

  1. Deploy a correctly-wired drone (with gitea dep, compose.gitea.yml, DRONE_GITEA_CLIENT_ID set)
  2. Run test_login_redirects_to_gitea_dep
  3. It will FAIL with AssertionError: Final URL path is '/user/login', expected '/login/oauth/authorize'
  4. This is a false failure — the assertion is about the URL AFTER gitea's own redirect, not drone's redirect

Resolution: Builder fixes test to use no-follow-first-redirect pattern. Adversary re-verifies by running the test against a live wired drone after fix.

  • CLOSED @2026-06-11T21:52Z: Builder fixed in commit 7e7e84d (_CaptureOneRedirect no-follow pattern). Adversary independently verified: captures 303 Location from live drone, path == "/login/oauth/authorize"; 10 unit tests PASS cold. [Note: Builder ticked this. Adversary owns Adversary findings per §6.1; recording explicit Adversary close here.]

ADV-drone-02 [adversary] Dep orphan on SSO-enrichment failure after successful deploy_deps

Filed: 2026-06-11T22:10Z
Severity: MEDIUM — teardown-sacred (§9) violated in failure path; orphaned gitea at deterministic domain corrupts next run with same (recipe, pr, ref, dep) hash

Defect: runner/run_recipe_ci.py::main() initialises deps_state = {} (line 1015). Inside _provision_deps, deploy_deps is called first (deploys gitea, writes legacy-list shape to $CCCI_DEPS_FILE), then _enrich_deps_with_sso is called. If _enrich_deps_with_sso raises (e.g. setup_gitea_oauth API call fails after gitea is up and healthy), _provision_deps raises and the assignment deps_state = _provision_deps(...) (line 1034) never completes. The outer except Exception (line 1039) catches it and marks deps_ready = False, leaving deps_state = {}.

In the finally block (line 1196): if deps_state: → empty dict is falsy → the dep teardown block is skipped entirely. The gitea container and its volumes are orphaned.

Failure path:

deploy_deps(...)         # gitea deployed + healthy; writes [{recipe:gitea, domain:gite-...}] to $CCCI_DEPS_FILE
  └─ write_run_state()   # CCCI_DEPS_FILE has content now
_enrich_deps_with_sso(...)
  └─ setup_gitea_oauth() # RAISES (API failure, gitea not ready yet, etc.)
_provision_deps() raises
deps_state = {}          # assignment never completed
...
finally:
  if deps_state:         # {} is falsy → SKIPPED → gitea NOT torn down
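The failure path above can be reproduced in isolation. The sketch below is a simplified stand-in, not the real main(): `provision_deps`, `run`, and the two module-level lists are hypothetical names, and the nested try blocks of the real code are collapsed into one.

```python
deployed = []    # stands in for apps that deploy_deps left running
torn_down = []   # records what the finally block actually cleaned up


def provision_deps():
    """Hypothetical stand-in for _provision_deps: deploy works, enrichment raises."""
    deployed.append("gitea")                        # deploy step succeeded
    raise RuntimeError("setup_gitea_oauth failed")  # enrichment step raises


def run():
    deps_state = {}
    deps_ready = True
    try:
        deps_state = provision_deps()  # raises, so the assignment never completes
    except Exception:
        deps_ready = False
    finally:
        if deps_state:                 # {} is falsy, so teardown is skipped
            torn_down.extend(deps_state)
    return deps_state, deps_ready


state, ready = run()
orphans = [d for d in deployed if d not in torn_down]
```

After `run()` returns, `orphans` still contains the deployed dep, which is the orphan described in this finding.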

Risk: The gitea dep domain is deterministic — dep_domain(parent_recipe, pr, ref, dep) hashes the same inputs to the same 6-hex domain on every invocation. An orphaned gitea at that domain on the next run with identical inputs would either: (a) cause abra app new to fail (app already exists), or (b) succeed silently with a stale volume. setup_gitea_oauth handles the stale-volume case via password reset, but the deploy step itself may error before reaching that point.
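Why determinism turns an orphan into a collision can be shown with a small sketch. The real dep_domain implementation is not reproduced here: the SHA-256 digest, the `|` separator, the four-character recipe prefix, and the `example.test` suffix are all assumptions chosen only to illustrate a 6-hex, input-derived name.

```python
import hashlib


def dep_domain_sketch(parent_recipe: str, pr: int, ref: str, dep: str) -> str:
    """Illustrative only: derive a stable 6-hex label from the four inputs."""
    key = "|".join([parent_recipe, str(pr), ref, dep])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:6]
    return f"{dep[:4]}-{digest}.example.test"  # prefix and suffix are assumptions


# Hypothetical inputs; two runs with identical inputs get the identical domain.
run_a = dep_domain_sketch("drone", 1, "abc1234", "gitea")
run_b = dep_domain_sketch("drone", 1, "abc1234", "gitea")
```

Because the name depends only on the inputs, a rerun of the same (recipe, pr, ref, dep) lands on whatever the previous run left behind at that domain.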

Note: deploy_deps (deps.py:104-109) tears down a dep immediately if its readiness check fails. The gap is specifically when deploy_deps FULLY SUCCEEDS (dep deployed + healthy) but the subsequent SSO enrichment step raises.

Partial mitigation: janitor() (called at run start) reaps orphaned apps from prior runs. However, the janitor only cleans up at the start of the NEXT run; it does not restore the current run's clean-state guarantee.

Required fix: Either:

  • (A) In main(), read $CCCI_DEPS_FILE as fallback in the finally block when deps_state is empty — the file contains the deployed-but-unenriched deps. Tear those down via teardown_deps.
  • (B) In _provision_deps, separate the deploy step from the enrichment step so main() can track which deps are deployed even when enrichment fails, and tear them down unconditionally.
  • (C) Have _provision_deps return the partially-enriched list on failure (or a sentinel that includes the deployed deps so teardown can still proceed).

Option A is the minimal fix:

# in main() finally block, after the `if deps_state:` block
# (needs `import contextlib` at module level if not already present):
if not deps_state:
    # Enrichment may have failed after deploy — read the raw deployed list as a teardown fallback.
    raw = deps_mod.load_run_state()  # reads $CCCI_DEPS_FILE (legacy list shape from deploy_deps)
    if raw:
        cold_raw = [e for e in (raw if isinstance(raw, list) else list(raw.values()))
                    if not e.get("warm")]
        if cold_raw:
            with contextlib.suppress(lifecycle.TeardownError):
                deps_mod.teardown_deps(cold_raw)

Adversary position (pre-claim): The fix must be in place and unit-tested before M1 can be claimed. Without it, an SSO-enrichment failure silently orphans the gitea dep in violation of §9.
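One way to make the fallback unit-testable is to pull the selection logic into a pure function and exercise it with both state shapes. This is a sketch under assumptions: `cold_deps_to_teardown` is a hypothetical helper name, and the sample entries (including the `cache` dep and the placeholder domain) are invented for the test.

```python
def cold_deps_to_teardown(deps_state, raw):
    """Pick deps for fallback teardown when enrichment left deps_state empty."""
    if deps_state:
        return []  # the normal teardown path already covers these
    if not raw:
        return []  # nothing was deployed, or the state file is missing
    entries = raw if isinstance(raw, list) else list(raw.values())
    return [e for e in entries if not e.get("warm")]  # never tear down warm deps


# Hypothetical state-file contents in the legacy list shape and in a dict shape.
legacy_list = [{"recipe": "gitea", "domain": "gite-000000.example.test"}]
dict_shape = {
    "gitea": {"recipe": "gitea", "warm": False},
    "cache": {"recipe": "cache", "warm": True},
}

from_list = cold_deps_to_teardown({}, legacy_list)
from_dict = cold_deps_to_teardown({}, dict_shape)
skipped = cold_deps_to_teardown({"gitea": {}}, legacy_list)
nothing = cold_deps_to_teardown({}, None)
```

Keeping the selection separate from the teardown call lets the tests cover the enrichment-failure case without deploying anything.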

Status: OPEN


ADV-drone-03 [adversary] DG4.1 counter mismatch — run always exits 1 when cold dep deployed (CRITICAL)

Filed: 2026-06-11T22:15Z
Severity: CRITICAL — every harness run with a cold gitea dep exits code 1 due to DG4.1 violation, even when all tiers pass and level=5 is achieved.

Observed in Builder's run 4 (PID 2105952, /tmp/drone-m1-run4.log):

!! deploy-count 1 != 2 (DG4.1 violation)
deploy-count = 1 (expect 2)
  deps deployed: ['gitea']
results.json written: /var/lib/cc-ci-runs/manual/results.json (level=5 of 5)

All tiers passed (install, upgrade, custom green; L5), but DG4.1 sets overall = 1 → exit code 1 → CI FAIL.

Root cause: Internal contradiction between two parts of deps.py:

  1. Module docstring (lines 19-20): "Dep deploys DO count toward the DG4.1 deploy-count invariant. The formula in run_recipe_ci.py is expected_deploy_count = 1 + deps_deployed_count, so each dep deploy increments the counter."

  2. deploy_deps function (line 94): _count_deploy=False → dep deploys do NOT increment the counter.

The formula in run_recipe_ci.py (line 1252) uses expected = 1 + deps_deployed_count = 2. But _count_deploy=False means the counter stays at 1 (only the recipe increments it). Result: actual=1 != expected=2 → DG4.1 fires.
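The arithmetic can be checked with a toy model of the counter. `DeployCounter` and `simulate_run` are hypothetical names; only the `_count_deploy` flag and the `1 + deps_deployed_count` formula are taken from the finding.

```python
class DeployCounter:
    """Toy stand-in for the harness deploy counter."""

    def __init__(self):
        self.count = 0

    def deploy_app(self, name, _count_deploy=True):
        if _count_deploy:
            self.count += 1


def simulate_run(count_dep_deploys: bool):
    """Return (actual, expected) deploy counts for one recipe plus one cold dep."""
    counter = DeployCounter()
    deps = ["gitea"]
    for dep in deps:
        counter.deploy_app(dep, _count_deploy=count_dep_deploys)
    counter.deploy_app("drone")   # the recipe under test always counts
    expected = 1 + len(deps)      # expected_deploy_count = 1 + deps_deployed_count
    return counter.count, expected


with_flag_false = simulate_run(count_dep_deploys=False)  # behaviour seen in run 4
with_flag_true = simulate_run(count_dep_deploys=True)    # behaviour after the fix
```

With the flag off the pair is (1, 2), matching the "deploy-count 1 != 2" log line; with dep deploys counted, actual and expected agree.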

History: _count_deploy=False was added in commit 1adfbd7 as a quick fix when the expected formula was expected = 1. Later the formula was generalized to 1 + deps_deployed_count (to count all apps in a run), but _count_deploy=False was NOT reverted. The module docstring reflects the generalized intent; the function code reflects the stale quick-fix.

Required fix: In deps.py:deploy_deps (line 94), remove or revert _count_deploy=False:

# Before (wrong):
lifecycle.deploy_app(dep, domain, ..., _count_deploy=False)

# After (correct — deps DO count per module docstring + expected formula):
lifecycle.deploy_app(dep, domain, ...)  # _count_deploy defaults to True

Also fix the stale comment in deploy_deps at lines 83-86:

# Dep deploys do NOT count toward the DG4.1 "one deploy per run" invariant — that
# contract covers the recipe-under-test only; each dep is a supporting service, not the
# subject of the test. Pass _count_deploy=False so the main recipe's single-deploy
# assertion isn't distorted by the number of deps declared.

This is now wrong. Replace with: "Dep deploys DO count toward DG4.1 (see module docstring); expected_deploy_count = 1 + n_cold_deps."

Status: OPEN — CRITICAL blocker for M1 claim. Builder's run 4 already hit this.