Files
cc-ci/machine-docs/JOURNAL-prevb.md
2026-06-17 01:51:04 +00:00

159 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# JOURNAL — phase `prevb` (Builder reasoning; append-only)
## 2026-06-17 — Bootstrap + recon
Read SSOT (plan-phase-prevb), plan.md §6.1/§7/§9, Adversary's REVIEW-prevb (live, idle awaiting M1 claim).
**Mapped the harness upgrade flow** (`runner/run_recipe_ci.py`, `harness/lifecycle.py`,
`harness/generic.py`, `harness/meta.py`, `harness/canonical.py`):
- Base decision: `upgrade_base(stages, meta, recipe)``None` if upgrade∉stages or EXPECTED_NA[upgrade],
else `meta.UPGRADE_BASE_VERSION or lifecycle.previous_version(recipe)` (= `recipe_versions[-2]`).
`base = prev or target`; `prev` also gates whether the upgrade tier runs.
- Deploy: `deploy_app(version=base)` → pinned `recipe_checkout(version)` + (auto-chaos if overlay/lightweight tag);
`version=None` → chaos deploy of the current (head) checkout.
- Overlay `compose.ccci.yml`: copied into the checkout (`provide_ccci_overlay`), referenced by
`EXTRA_ENV.COMPOSE_FILE`, persists untracked across the head re-checkout → applies to ALL deploys.
- Upgrade op (`generic.perform_upgrade`): `recipe_checkout_ref(head_ref)` then chaos redeploy; the
ccci overlay persists → leaks version-specific pins onto the head. **That is the bug.**
- Last-green source: `canonical.read_registry(recipe)``{version, commit, status}` (promoted only on
GREEN LATEST cold runs for `WARM_CANONICAL` recipes). No separate "last-green" file.
**Ground-truth discourse facts** (gitea API, verified — see STATUS for the table). Key correction vs
plan §3 prose: main is `bitnamilegacy/discourse:3.5.0` (not 3.3.1 — main advanced). Thesis holds: base
(last-green/main = bitnamilegacy 3.5.0, deployable) → head (PR #4 = official discourse/discourse:3.5.3,
sidekiq dropped). So discourse needs NO `previous/`; the env overlay shrinks to `order: stop-first`.
**Design decisions (WHY):**
- *Resolution order* last-green → main-tip → skip. main-tip = the recipe's `main` branch HEAD = the true
predecessor the PR merges onto (more faithful than the old `vers[-2]`, which could span 2 version jumps).
This intentionally changes EVERY recipe's default base from `vers[-2]` to main-tip — plan-mandated, not a
regression; M2 spot-check validates representative recipes still go green.
- *Keep `UPGRADE_BASE_VERSION` as an optional explicit override* (still wins when set), but remove it from
discourse and make the DEFAULT dynamic. Rationale: fully deleting the meta field would break `plausible`
(its meta sets it) and the documented "PR adds a version above newest tag" escape hatch, without a deploy
test — risk vs guardrail "don't regress other recipes". The plan's "UPGRADE_BASE_VERSION removed" is in the
discourse-migration context; the normal/discourse path is now hardcode-free. Recorded in DECISIONS.
- *`previous/` scoped to last-green (published-version) base only* — version-guarded by a declared target;
on a main-tip base or version mismatch it is skipped + flagged stale. Discourse ships none (base deploys clean).
## 2026-06-17T00:30Z — M1 code done (unit+lint green); discourse e2e launched
Implemented B1B4 (commit bb2e3c6): resolve_upgrade_base/BasePlan, deploy_app base_ref+apply_previous,
previous/ surface in lifecycle, generic.perform_upgrade strip, discourse migration, unit tests.
Unit: 88 relevant pass (full suite 283 pass; 1 PRE-EXISTING unrelated fail
`test_warm_reconcile::test_traefik_spec_is_stateless_with_setup` KeyError 'health_domain' — fails on
clean HEAD, not mine; flagged for Adversary). Lint PASS.
B5 e2e launched on cc-ci (/root/prevb-deploy @ bb2e3c6), STAGES=install,upgrade, discourse PR#4
(REF=ae5a8180, SRC=recipe-maintainers/discourse). First log lines confirm the core mechanism:
`== upgrade base: kind=ref ref=f87c612d71b4 (target-branch (main) tip)` → base = main-tip chaos deploy
(bitnamilegacy:3.5.0), env overlay provided. Base now in slow Rails cold boot (15-25min). Polling ~5min.
(lint rung fail R011 = recipe-level, a rung not a gate; prepull skipped on the known sidekiq-depends-on
config rc=15 — non-fatal.)
## 2026-06-17T00:40Z — M1 GREEN locally; claiming
discourse install,upgrade e2e GREEN (2nd run, after the prune fix). Evidence in run-prevb-disc2.log on
cc-ci /root/prevb-deploy. The dynamic main-tip base worked first try (kind=ref f87c612d) — crucial,
because main (0.8.1+3.5.0) is AHEAD of the newest published tag (0.7.0+3.3.1), so the OLD vers[-2]
default (=0.6.3) would have been the wrong predecessor entirely. The upgrade moved
0.8.1+3.5.0 (bitnamilegacy, main-tip) → 1.0.0+3.5.3 (official, PR head), chaos-version=ae5a8180+U.
**The one real bug found+fixed (WHY):** first run, `test_head_runs_official_image` PASSED (head app =
official 3.5.3 — the leak is gone) but `test_sidekiq_service_dropped` FAILED: `docker stack deploy`
(what `abra app deploy` runs) only adds/updates services, it does NOT prune ones the new compose dropped,
so the base's sidekiq orphaned on the old image. This is a swarm mechanic, not a head-deploy failure, but
it means the deployed stack didn't faithfully reflect the head. Fix = `prune_orphan_services` in
perform_upgrade: reconcile the live stack to the head compose's `config --services` set (remove orphans).
Faithful (deployed stack == head), no-op when service sets match / compose unresolvable, weakens nothing.
Decided to CLAIM with the e2e green + image/sidekiq proof and leave the deliberately-broken-head teeth
probe to the Adversary's cold acceptance (its explicit M1 check; I can't push a broken commit to the
recipe mirror per guardrails). STATUS spells out where the teeth hold so they know where to probe.
## 2026-06-17T00:45Z — M2-prep spot-checks (3 green) while M1 under Adversary review
Ran 2 more recipes through the new dynamic base (de-risks the global resolver change; toward B8):
- **cryptpad #5** (install,upgrade): kind=ref main-tip 36ee3451; install+upgrade PASS incl
`test_upgrade_preserves_data` (data survived); deploy-count=1; clean teardown.
- **keycloak #3** (install,upgrade): base branch is **master** → kind=ref main-tip 12ac6db8 via the
origin/main→origin/master fallback in `recipe_branch_commit` (VALIDATES that path); install+upgrade
PASS incl `test_upgrade_preserves_realm`; SSO/DEPS path exercised; deploy-count=1; clean teardown.
Note: `prune-orphans` SAFE-SKIPPED ("head compose services unresolved — removes nothing") — keycloak's
`config --services` returned non-zero in that context; the defensive guard correctly removed nothing
(service set unchanged base→head anyway). Confirms prune never false-fails when compose is unresolvable.
So 3/3 current recipes resolve to main-tip (kind=ref) and pass — no warm canonicals exist on the host
(`find /var/lib/ci-warm -name canonical.json` empty), so last-green (kind=version) isn't exercised in e2e
yet (it IS unit-tested). For M2 I may seed/use a warm canonical to e2e the last-green path. Pre-existing
orphan `warm-keycloak_...` stack on the host (no registry record) — NOT from prevb; left untouched.
Stopping new e2e launches now — the Adversary is running its own discourse cold-acceptance on the shared
7GB node; piling on risks a memory-pressure false-failure in its run. Parking at M1 gate.
## 2026-06-17T01:05Z — M1 PASS; starting M2
Adversary M1 PASS (dbc7a3b), all 8 DoD cold-verified incl. teeth: break-it probe with head image
`discourse/discourse:99.99.99-adversary-broken``manifest unknown` at prepull → upgrade:fail (level 1/5),
base still resolved to main-tip — proves base/prune/previous can't paper over a broken head. No VETO.
Note for record: the Adversary attributed the lingering `warm-keycloak_...` stack to "Builder's concurrent
spot-check". It's actually a PRE-EXISTING orphan (a warm-<recipe> domain, created only by the canonical/warm
system, not by a normal cold PR run) — my keycloak spot-check used a per-run `keycloak-pr3-*` domain and tore
down clean (verified "no leftover keycloak run-stacks"). Not a prevb leak; pre-existing cruft.
M2 plan: B7 = discourse PR#4 !testme GREEN in real CI (Drone). Infra confirmed healthy: ccci-bridge_app 1/1
(polls POLL_REPOS incl. discourse every 30s), drone_...app 1/1, Drone healthz 200; Drone builds cc-ci@main
(= my prevb code). Before posting !testme publicly on PR#4, running the FULL pipeline locally first
(STAGES=install,upgrade,backup,restore,custom) to de-risk backup/restore/custom under the new model (my
local runs so far were install,upgrade only). If a non-prevb tier fails I fix/triage first, then !testme.
## 2026-06-17T01:30Z — All 5 discourse tiers green locally; posting !testme (B7)
Full local run (run-prevb-disc-full) found ONE failure: custom `test_create_topic_roundtrip``mint_admin`
hardcoded the bitnamilegacy path `/opt/bitnami/discourse` (404 on the official head). This is a DIRECT
consequence of prevb working (the head is now genuinely official, not overlay-reverted to bitnamilegacy).
Fixed `_discourse.py::mint_admin` image-agnostic (b66abc4): detect /var/www/discourse (official) vs
/opt/bitnami/discourse (legacy); on official re-export DISCOURSE_DB_PASSWORD from /run/secrets/db_password
(entrypoint exports it only for boot) and run bin/rails as root (official image USER is empty → exec=root;
verified it works). Re-run (install,upgrade,custom) → custom PASS (all 3 custom tests green).
Tier status (across run-prevb-disc-full + run-prevb-disc-custom): install✓ upgrade✓ backup✓ restore✓ custom✓.
So the real-CI !testme full pipeline should be green. Posting !testme on discourse PR#4 as autonomic-bot
(authorized org member) → bridge (polls every 30s) triggers a Drone build of cc-ci@main (= prevb code).
## 2026-06-17T01:33Z — B7 DONE: discourse PR#4 !testme GREEN in real CI (Drone 717)
Posted !testme as autonomic-bot (comment 14597); bridge replied in ~16s (build 717), bridge final
comment "✅ passed" @01:32:55Z. Run 717 junit (cold-readable at /var/lib/cc-ci-runs/717/junit/): ALL
10 suites failures=0 errors=0 — install / upgrade(generic+cc-ci) / backup(generic+cc-ci) /
restore(generic+cc-ci) / custom(create_topic+health_check+site_basic). upgrade__cc-ci proves
test_head_runs_official_image_not_bitnamilegacy + test_sidekiq_service_dropped_by_head PASS. Clean
teardown (no discourse stacks). This is the M2 headline: the migration is REALLY tested in real CI.
Launching hedgedoc #1 as the 3rd spot-check (cryptpad #5 + keycloak #3 already green). Then reconcile + claim M2.
## 2026-06-17T01:40Z — hedgedoc spot-check green; CLAIMING M2
hedgedoc #1 (3rd spot-check): kind=ref main-tip 09bf4d54; install:pass upgrade:pass; clean teardown.
3 spot-checks now green under dynamic base (cryptpad/keycloak/hedgedoc), all main-tip — no regression.
discourse 717 results.json level=4/5. The 5th level is capped by the discourse *recipe* lint rung (R011)
— a rung not a gate, and a recipe-level nit on the PR head (not cc-ci/prevb). The run is GREEN (bridge
"✅ passed", all 5 functional tiers junit 0-fail). Not fixing the recipe's R011 here (recipe defect → not
our test to weaken; out of prevb scope).
Records reconciliation: 717's artifacts (results.json/junit/badge/summary/screenshot) are durable at
/var/lib/cc-ci-runs/717/ (host-shared, Adversary-readable); the bridge mirrored the outcome to PR#4.
No warm canonicals to reconcile (none exist). Pre-existing warm-keycloak orphan left untouched (not prevb).
Claiming M2. Adversary cold-verifies (re-read 717 junit / re-trigger !testme / re-run a spot-check); then
I write ## DONE once REVIEW-prevb shows fresh M1+M2 PASS with no VETO.
## 2026-06-17T01:58Z — M2 PASS → ## DONE
Adversary M2 PASS (1c3ba71): all 6 M2 DoD items cold-verified incl. its own independent cryptpad#5 re-run;
discourse 717 real-CI GREEN with live-swarm-image teeth (official 3.5.3, sidekiq gone); lint R011
code-verified non-gating; public surface secret-clean; nothing merged. Both M1(01:03Z)+M2(01:58Z) fresh
PASS, no VETO. DONE handshake satisfied → wrote ## DONE to STATUS-prevb. Phase prevb complete. Stopping loop.