chore(redfix): bootstrap phase state files (STATUS/BACKLOG/JOURNAL); M1 investigation tracker seeded
Some checks failed
continuous-integration/drone/push Build is failing
29
machine-docs/BACKLOG-redfix.md
Normal file
@@ -0,0 +1,29 @@
# BACKLOG — phase `redfix`

## Build backlog

### M1 — investigate + isolate + classify (all six)
- [ ] discourse — reproduce cold-deploy timeout/wedge in isolation; root-cause (headroom vs
  convergence bug vs upstream compose defect `sidekiq.depends_on: discourse`); classify.
- [ ] mattermost-lts — `test_restore.py::test_restore_returns_state` in isolation: green→load flake,
  red→diagnose restore (recipe vs test).
- [ ] mumble — `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` in
  isolation (canonical already present from today → likely flake; confirm).
- [ ] bluesky-pds — warm-canonical promote routing: why `warm-bluesky-pds…` → 000 over HTTPS while
  container healthy internally + cold-test domain routes. Find cc-ci warm-machinery defect.
- [ ] gitea — `3.5.3→3.6.0` warm advance crash (`app.ini` read-only, JWT save). Recipe vs harness.
- [ ] keycloak — de-enrolled (live-warm OIDC collision). Design collision-free warm domain/namespace.

### M2 — FIX + verify all six (recipe PR or harness improvement)
- [ ] discourse — fix (timeout/serialization tuning and/or recipe convergence/upstream compose PR);
  converges in time.
- [ ] mattermost-lts — fix (stabilize if flake; diagnose+fix restore if real); green.
- [ ] mumble — fix (stabilize if flake; fix recipe/test if real); green.
- [ ] bluesky-pds — fix warm-routing (cc-ci branch PR); promotes.
- [ ] gitea — fix so it promotes 3.6.0 (recipe PR making app.ini writable OR harness clean-redeploy
  fallback); promotes 3.6.0.
- [ ] keycloak — harness improvement: collision-free warm domain/namespace; enroll + promote.

## Adversary findings

(Adversary-owned — do not edit.)
25
machine-docs/JOURNAL-redfix.md
Normal file
@@ -0,0 +1,25 @@
# JOURNAL — phase `redfix`

## 2026-06-17T23:20Z — Bootstrap

Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six
canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next
fires 2026-06-21 (3-day window), disk 38G free.

Isolation mechanism understood: `runner/nightly_sweep.run_on_tag` = `abra.recipe_checkout(r, tag)` +
`run_recipe_ci.py RECIPE=<r> CCCI_SKIP_FETCH=1` cold/full. I reproduce each failure by running ONE
recipe at a time with no concurrent load.

Starting canonical state notable: **mumble canonical IS present** (`1.0.0+v1.6.870-0`, written
20260617T180501Z — during today's nixenv sweep). The canon DECISIONS recorded mumble RED
(`test_handshake_completes_with_channel_presence`). A canonical only gets written on a GREEN cold run
on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble
passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the
canon-sweep red was under concurrent load.

Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on
cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini,
bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an
UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any
deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.
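
The discourse defect recorded above (a `depends_on` entry that names a service the compose file does not define) can be checked statically, without any deploy. The following is a minimal sketch, assuming the compose file has already been parsed into a plain dict. The helper name `dangling_depends_on` and the sample data are hypothetical and are not part of cc-ci, abra, or the discourse recipe.

```python
def dangling_depends_on(compose: dict) -> list[tuple[str, str]]:
    """Return (service, missing_dependency) pairs for a parsed compose mapping."""
    services = compose.get("services") or {}
    problems = []
    for name, spec in services.items():
        deps = (spec or {}).get("depends_on") or []
        # depends_on may be a list of names or a mapping keyed by name;
        # iterating either form yields the dependency names.
        for dep in deps:
            if dep not in services:
                problems.append((name, dep))
    return problems


# Hypothetical reduction of the reported defect: the web service is named
# `app`, but `sidekiq` still depends on `discourse`.
sample = {
    "services": {
        "app": {"image": "example/app"},
        "sidekiq": {"image": "example/app", "depends_on": ["discourse"]},
    }
}
print(dangling_depends_on(sample))  # [('sidekiq', 'discourse')]
```

An empty result at the latest tag would suggest the recorded cause no longer applies; a non-empty one would confirm the failure is deterministic rather than load-related.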
49
machine-docs/STATUS-redfix.md
Normal file
@@ -0,0 +1,49 @@
# STATUS — phase `redfix`

Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.

## Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).

### Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with `RECIPE=<r>
CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.

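
The environment contract described above (set `RECIPE` and `CCCI_SKIP_FETCH=1`, leave REF/QUICK/MODE/VERSION unset) can be sketched as a small helper. This is an illustration under stated assumptions, not the harness's own code: the function name `isolation_env` and the constant `UNSET_FOR_COLD_RUN` are hypothetical, and only the variable names come from the notes.

```python
import os

# Variables the notes say must be unset for a cold, full run at head == tag.
UNSET_FOR_COLD_RUN = ("REF", "QUICK", "MODE", "VERSION")


def isolation_env(recipe: str, base_env=None) -> dict:
    """Build the environment for a single-recipe isolation run."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_COLD_RUN:
        # Drop any value inherited from the calling shell.
        env.pop(name, None)
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env


env = isolation_env("mumble", base_env={"PATH": "/usr/bin", "QUICK": "1"})
print(sorted(env))  # ['CCCI_SKIP_FETCH', 'PATH', 'RECIPE']
```

The resulting dict could be passed as `env=` to `subprocess.run` with `cwd="/etc/cc-ci"`. The exact command line for `run_recipe_ci.py` is not shown in these notes, so it is left out here.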

### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |

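
A snapshot like the table above could be gathered with a short script. The sketch below assumes only the path layout `<root>/<recipe>/canonical.json` given in the heading. The function name `read_canonical` is hypothetical, and the `version` key in the demo file is a made-up field, since the real `canonical.json` schema is not shown here.

```python
import json
import tempfile
from pathlib import Path


def read_canonical(root: Path, recipe: str):
    """Return the parsed canonical.json for a recipe, or None if it is absent."""
    path = root / recipe / "canonical.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Demo against a temporary directory standing in for /var/lib/ci-warm.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mumble").mkdir()
    # "version" is an invented key for the demo, not the real schema.
    (root / "mumble" / "canonical.json").write_text('{"version": "1.0.0+v1.6.870-0"}')
    state = {r: read_canonical(root, r) for r in ("discourse", "mumble")}

print(state)
```

A `None` entry corresponds to the "(none)" rows: either no canonical directory exists or no file has been written yet.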

### M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | pending | — | — | — |
| mattermost-lts | pending | — | — | — |
| mumble | pending | — | — | — |
| bluesky-pds | pending | — | — | — |
| gitea | pending | — | — | — |
| keycloak | pending | — | — | — |

Gate: M1 not yet claimed.

## Blocked

(none)