cc-ci/BACKLOG-dstamp.md


BACKLOG — phase dstamp

Build backlog (Builder-owned)

  • Read phase plan + plan.md §6.1/§7/§9 + Adversary prep notes + stamp-relevant harness code.
  • Establish abra's chaos-version mechanism from abra source @06a57de (= pinned binary).
  • Rule out abra-version drift (constant store path since nixos system-4, 2026-06-01).
  • Minimal reproductions of the git/abra chaos-version path (cp-a; go-git base; mirror-faithful) — all stamp the CORRECT head 7ae7b0f7, NO drift in current host state.
  • Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref.
  • Identify shared-stack collision vector (app_domain = hash(recipe|pr|ref); upgrade chaos_redeploy bypasses app-domain flock).
  • Isolated real runs (repro14) + direct UpdateStatus/PreviousSpec capture → root cause attributed.
  • Concurrency REFUTED (solo repro1/4 reproduce). Mechanism = swarm failure_action:rollback reverts the chaos-version label (direct evidence repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de9+U).
  • 06-05→06-10 change = rcust-phase heavier resident host load → start-first new task reliably OOMs → rollback every run (solo 06-05 run 184 didn't; my repro2 didn't either).
  • Blast-radius: only discourse affected (keycloak/n8n have the policy but upgrade PASS L4 across runs; drone/traefik infra). General harness guard covers all.
  • Restore discourse to its true level in real CI via the drone !testme path (M2): build #450 = LEVEL 5, all tiers PASS (install/upgrade/backup/restore/custom), clean teardown, no leak; PR#2 passed. fix1+fix2+450 = 3 consecutive green with the fix.
  • [~] HC1 teeth: code unchanged (generic.py:174-175) + assert_upgrade_converged RED on rollback (repro1/4). Live negative test = Adversary's M2 verification.
  • Closed the DEFERRED.md dstamp re-entry with pointers (RESOLVED).

Adversary findings

Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):

Docker Swarm failure_action: rollback + order: start-first in discourse's compose.yml app service (BOTH eb96de94 base AND 7ae7b0f PR-head). On the upgrade chaos redeploy, start-first runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s update monitor under host memory pressure → rollback fires → app service spec reverts to PreviousSpec (chaos-version=eb96de94+U). Because start-first kept the OLD task serving, wait_healthy passed; deployed_identity read the rolled-back spec; HC1 misreported it as "stamp mismatch" (the real failure was "new task failed the update monitor").

services_converged blind spot: "rollback_completed" not in blocking states → returned True.

Evidence: docker service inspect disc-ae10f0_..._app confirmed UpdateConfig: {On failure: rollback, Order: start-first, Monitoring Period: 5s}. repro1 (isolated, no concurrency) ALSO showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.

abra exonerated: abra reads git HEAD = 7ae7b0f and stamps 7ae7b0f7+U CORRECTLY. Three bail-at-secrets repros + repro2 debug line confirm. The +U comes from compose.ccci.yml being an untracked file in the per-run recipe dir (an rcust-era overlay, absent from run 184's pre-rcust path).
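The stamp comparison described above can be sketched as follows. This is a minimal illustration, not the harness's HC1 code: `parse_chaos_stamp` and `stamp_matches_head` are hypothetical names, and the label format (abbreviated commit plus an optional `+U` suffix) is assumed from the values quoted in these notes.

```python
from typing import NamedTuple


class ChaosStamp(NamedTuple):
    commit: str    # abbreviated commit hash carried by the label
    unclean: bool  # True when the label ends in "+U"


def parse_chaos_stamp(label: str) -> ChaosStamp:
    """Split a chaos-version label such as '7ae7b0f7+U' into its parts."""
    unclean = label.endswith("+U")
    commit = label[:-2] if unclean else label
    return ChaosStamp(commit=commit.lower(), unclean=unclean)


def stamp_matches_head(label: str, head: str) -> bool:
    """Compare label and head on their common prefix, since either may be abbreviated."""
    stamp = parse_chaos_stamp(label)
    head = head.lower()
    n = min(len(stamp.commit), len(head))
    return n > 0 and stamp.commit[:n] == head[:n]
```

With the values from this investigation, `stamp_matches_head("7ae7b0f7+U", "7ae7b0f")` holds, while the rolled-back label `eb96de94+U` does not match the PR head. The `+U` suffix is ignored for the comparison because it reflects the untracked overlay file, not the commit.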

Fix 0cc31a5 assessed CORRECT: overlay sets order: stop-first (eliminates OOM 2×-memory trigger); lifecycle.assert_upgrade_converged closes the wait_healthy blind spot by catching "rollback_completed"|"rollback_paused"|"paused" and failing HONESTLY. HC1 unchanged. Minor race window in assert_upgrade_converged (first poll could see "none" before Docker starts the roll) is covered: with stop-first, a post-race rollback also fails wait_healthy. No blocker. Formal verdict awaits Builder's claim(dstamp) commit.
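A minimal sketch of the kind of check this describes, assuming the harness reads the JSON array printed by `docker service inspect`. The function name `classify_update` and the three-way result are hypothetical, not the actual `assert_upgrade_converged` code. The state strings are the ones quoted above, plus Docker's `updating` and `rollback_started` in-progress states.

```python
import json

# States the notes say must fail the upgrade honestly.
FAILED_STATES = {"rollback_completed", "rollback_paused", "paused"}
# States that mean the roll (or a rollback) is still running; keep polling.
IN_PROGRESS_STATES = {"updating", "rollback_started"}


def classify_update(inspect_json: str) -> str:
    """Classify the first service in `docker service inspect` output.

    Returns "failed", "in_progress", or "converged".
    """
    services = json.loads(inspect_json)
    status = services[0].get("UpdateStatus") or {}
    # A service with no recorded update has no UpdateStatus; treated as "none".
    state = status.get("State", "none")
    if state in FAILED_STATES:
        return "failed"
    if state in IN_PROGRESS_STATES:
        return "in_progress"
    return "converged"  # "completed", or "none" when no update was recorded
```

Treating `none` as converged is the race window noted above: a first poll taken before Docker starts the roll would pass this check, so a real caller would still need a health check such as `wait_healthy` behind it.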

Blast-radius sweep @2026-06-11T17:4x:

All 24 enrolled recipes swept for failure_action: rollback + order: start-first in compose.yml:

| Recipe | failure_action | order | ccci overlay | upgrade tests | recent upgrade | risk |
|---|---|---|---|---|---|---|
| discourse | rollback | start-first | YES (fixed) | yes | FIXED | fixed |
| drone | rollback | start-first | no | NO tests | n/a | latent, no CI exposure |
| keycloak | rollback | start-first | no | yes | PASS L4 | latent, low (JVM, lighter than Rails) |
| n8n | rollback | start-first | no | yes | PASS L4 | latent, low (Node.js) |
| traefik | rollback | STOP-first | no | no | n/a | SAFE |
| all others | none or absent | | | | | not at risk |
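A sweep like the one tabulated above could be approximated with a small helper over an already-parsed compose document. This is a sketch only: `risky_services` is a hypothetical name, and the input is assumed to be the dict produced by loading a recipe's compose.yml with any YAML parser.

```python
def risky_services(compose: dict) -> list[str]:
    """Return names of services combining failure_action: rollback with order: start-first."""
    risky = []
    for name, svc in (compose.get("services") or {}).items():
        # deploy and update_config are optional; treat missing sections as empty.
        update = ((svc or {}).get("deploy") or {}).get("update_config") or {}
        if (
            update.get("failure_action") == "rollback"
            and update.get("order") == "start-first"
        ):
            risky.append(name)
    return sorted(risky)


# Illustrative inputs mirroring two rows of the table (not the real recipe files).
discourse_like = {
    "services": {
        "app": {
            "deploy": {
                "update_config": {"failure_action": "rollback", "order": "start-first"}
            }
        },
        "db": {},
    }
}
traefik_like = {
    "services": {
        "app": {
            "deploy": {
                "update_config": {"failure_action": "rollback", "order": "stop-first"}
            }
        }
    }
}
```

Here `risky_services(discourse_like)` flags `app`, while the traefik-like document yields an empty list because its order is stop-first.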

assert_upgrade_converged (added in 0cc31a5) provides a general harness backstop: if any recipe's rolling update rolls back or pauses, the upgrade is failed HONESTLY for all recipes — not just discourse. So keycloak/n8n are already covered by the harness fix even without overlay changes.

Recommended overlay addition for keycloak if/when OOM symptoms appear: deploy.update_config.order: stop-first (same pattern as discourse). Not urgent — current host load shows no rollback symptom for keycloak/n8n and they're lighter apps than discourse. drone has no upgrade tier in cc-ci; no action needed there.