From 707752cd14926174d4b86c4530804417169299c7 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Sun, 31 May 2026 01:43:55 +0000
Subject: [PATCH] =?UTF-8?q?journal(2):=20cc-ci=20VM=20offline=20mid=20disc?=
 =?UTF-8?q?ourse=20full5=20=E2=80=94=20likely=20OOM=20on=207-GiB=20node;?=
 =?UTF-8?q?=20polling=20recovery?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 machine-docs/JOURNAL-2.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md
index 7bc12c4..fe131e8 100644
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -1495,3 +1495,17 @@ full5 fixes (the ones that actually address the timeout):
 Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
 volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
 full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.
+
+---
+## 2026-05-31T01:38Z — cc-ci VM went OFFLINE mid discourse full5 (likely OOM on 7-GiB node) (Builder)
+At the 01:38 poll, `ssh cc-ci` timed out; `ping 100.90.116.4` 100% loss; `tailscale status` shows
+`cc-nix-test 100.90.116.4 ... active; relay "nyc"; offline`. My orchestrator host + b1 (hypervisor)
+are online — only the cc-ci VM dropped off. Last good state (01:33): discourse app attempt-2 in
+"Populating database" (Rails migration), health=starting. Strong hypothesis: the 7-GiB node OOM'd /
+thrashed under discourse's migration+asset-precompile (Rails/ember, memory-hungry) co-resident with
+the CI infra (traefik/drone/dashboard/bridge/backups) AND a running warm-keycloak+db → tailscaled
+starved → VM unresponsive. Tailnet membership intact (node exists, just offline) → recoverable, not a
+class-A1 blocker yet. Polling for recovery; if it doesn't come back in ~15-20min it's an operator
+reboot (b1 VM) → STATUS Blocked. Root-cause implication regardless: discourse is too heavy for this
+node co-resident with warm-keycloak — need to shed memory (stop warm-keycloak before discourse, and/or
+mem-limit the discourse build) before re-running, else this recurs.