cc-ci/machine-docs/STATUS-redfix.md
2026-06-18 00:08:40 +00:00


STATUS — phase redfix

Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing exceptions. Nothing merged.

Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21 (3-day clear window). Disk: / has 38G free (75% used).

Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is runner/nightly_sweep.run_on_tag(recipe, latest): abra.recipe_checkout(recipe, <latest-tag>) then run_recipe_ci.py with RECIPE=<r> CCCI_SKIP_FETCH=1 and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known flake source per phase plan §2.1). Runs execute on cc-ci from /etc/cc-ci.
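A minimal sketch of the environment described above, assuming the recipe checkout has already been done and that run_recipe_ci.py is started with python3 from /etc/cc-ci. The helper names isolation_env and run_isolated are hypothetical and not part of the harness; only the variable names (RECIPE, CCCI_SKIP_FETCH, REF, QUICK, MODE, VERSION) come from this note.

```python
import os
import subprocess

# Variables that must be unset for a cold, full, head==tag run.
UNSET_FOR_COLD_RUN = ("REF", "QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return a copy of the environment prepared for one isolated recipe run."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_COLD_RUN:
        env.pop(name, None)  # drop the variable if present, ignore if absent
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env


def run_isolated(recipe, script="run_recipe_ci.py", cwd="/etc/cc-ci"):
    """Run a single recipe and return its exit code (hypothetical wrapper)."""
    # Assumption: the script is launched via python3 from cwd; the real
    # entry point and invocation may differ.
    result = subprocess.run(
        ["python3", script], env=isolation_env(recipe), cwd=cwd
    )
    return result.returncode
```

Calling run_isolated once per recipe, sequentially, keeps a single run on the node at a time, which is the isolation property this section relies on.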

Starting canonical state (cc-ci /var/lib/ci-warm/<r>/canonical.json, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
| --- | --- | --- |
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | 1.0.0+v1.6.870-0 @ 20260617T180501Z | canonical PRESENT, written TODAY: flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | 3.5.3+1.24.2-rootless @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |
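The snapshot above could be collected with something like the following sketch. It assumes only the path layout named in the heading (/var/lib/ci-warm/<r>/canonical.json); the schema of canonical.json is not shown in this note, so the parsed JSON is returned as-is. The function names are hypothetical.

```python
import json
from pathlib import Path

# Path layout taken from the heading above.
WARM_ROOT = Path("/var/lib/ci-warm")


def read_canonical(recipe, root=WARM_ROOT):
    """Return the parsed canonical.json for a recipe, or None if it is absent."""
    path = Path(root) / recipe / "canonical.json"
    if not path.is_file():
        return None  # corresponds to "(none)" in the table
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def canonical_snapshot(recipes, root=WARM_ROOT):
    """Map each recipe name to its canonical record (or None)."""
    return {recipe: read_canonical(recipe, root) for recipe in recipes}
```

Returning None rather than raising keeps a missing canonical dir distinguishable from a malformed file, which would still surface as a JSON decode error.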

M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
| --- | --- | --- | --- | --- |
| discourse | DONE @23:40Z (/tmp/redfix-discourse.log on cc-ci) | install/backup/restore/custom PASS; upgrade overlay FAIL. Deploys+serves fine; NOT a timeout/FATA. | cc-ci overlay tests/discourse/test_upgrade.py asserts head runs official discourse/discourse:3.5.3 + drops sidekiq; latest tag 0.8.1+3.5.0 AND main both still bitnamilegacy/discourse:3.5.0+sidekiq (migration exists in no release/main). The depends_on discourse string is a non-fatal prepull-only warning, not the deploy. | stale/PR-specific cc-ci OVERLAY test mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (/tmp/redfix-mattermost-lts.log) | install/upgrade/backup/custom PASS; restore FAIL: ci_marker does not exist. Deterministic in isolation (not a load race). | recipe postgres svc backup labels: backs up hot live PGDATA + dump but has NO backupbot.restore.post-hook to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. | genuine RECIPE defect at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE @00:18Z (/tmp/redfix-mumble.log) | ALL tiers PASS incl. handshake; no orphans. Canon red under load; canonical written green TODAY. | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | load/timing FLAKE → harness stabilization (readiness gate / retry). (1 isolation green; will repeat 1-2× before M1 claim) |
| bluesky-pds | DONE @00:45Z (/tmp/redfix-bluesky-pds.log + live diag) | cold lifecycle GREEN; WC5 promote 000 reproduces (warm /xrpc/_health last status 0). NOT a flake. | caddy on-demand TLS (ask http://app:3000/tls-check) can't reach app: caddy resolves bare app to OTHER stacks' app endpoints on shared proxy net (getent app → only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci app aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | genuine routing/RECIPE defect (cross-stack app-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | running (isolation; warm 3.6.0 advance) | | | |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same warm-<r> scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | HARNESS defect (warm-domain namespace collision) → fix: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak |
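The keycloak fix proposed in the tracker (a warm-canon-<r> namespace for live-warm providers) might look roughly like the sketch below. This is not the code in canonical.py or warm.py, which is not reproduced in this note: BASE_DOMAIN, the single-entry WARM_DOMAINS map, and find_collisions are illustrative stand-ins, and only the domain strings and the canonical_domain / WARM_DOMAINS names come from the tracker.

```python
# Assumption: all warm domains share this suffix (taken from the keycloak row).
BASE_DOMAIN = "ci.commoninternet.net"

# Live-warm providers; only keycloak is named in the tracker.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def canonical_domain(recipe):
    """Return the data-warm canonical domain for a recipe.

    Recipes that are also live-warm providers get the warm-canon- prefix so
    a promote can never land on the domain the live provider is serving.
    """
    prefix = "warm-canon" if recipe in WARM_DOMAINS else "warm"
    return f"{prefix}-{recipe}.{BASE_DOMAIN}"


def find_collisions(recipes):
    """List recipes whose canonical domain equals any live-warm domain."""
    live = set(WARM_DOMAINS.values())
    return [r for r in recipes if canonical_domain(r) in live]
```

A check such as find_collisions over the enrolled recipes could run before enrollment, so a regression back to the shared warm-<r> scheme is caught before any promote touches the live SSO domain.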

Gate: M1 not yet claimed.

Blocked

(none)