journal(2): cc-ci VM offline mid discourse full5 — likely OOM on 7-GiB node; polling recovery
@@ -1495,3 +1495,17 @@ full5 fixes (the ones that actually address the timeout):
Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.

---
## 2026-05-31T01:38Z — cc-ci VM went OFFLINE mid discourse full5 (likely OOM on 7-GiB node) (Builder)
At the 01:38 poll, `ssh cc-ci` timed out; `ping 100.90.116.4` 100% loss; `tailscale status` shows
`cc-nix-test 100.90.116.4 ... active; relay "nyc"; offline`. My orchestrator host + b1 (hypervisor)
are online; only the cc-ci VM dropped off. Last good state (01:33): discourse app attempt-2 in
"Populating database" (Rails migration), health=starting. Strong hypothesis: the 7-GiB node OOM'd /
thrashed under discourse's migration+asset-precompile (Rails/ember, memory-hungry) co-resident with
the CI infra (traefik/drone/dashboard/bridge/backups) AND a running warm-keycloak+db → tailscaled
starved → VM unresponsive. Tailnet membership intact (node exists, just offline) → recoverable, not a
class-A1 blocker yet. Polling for recovery; if it doesn't come back in ~15-20min it's an operator
reboot (b1 VM) → STATUS Blocked. Root-cause implication regardless: discourse is too heavy for this
node co-resident with warm-keycloak — need to shed memory (stop warm-keycloak before discourse, and/or
mem-limit the discourse build) before re-running, else this recurs.
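
The poll-then-escalate rule above (keep probing, declare Blocked after ~15-20 min) could be sketched roughly as follows. This is an illustrative sketch, not the loop the Builder actually ran: `host_reachable`, `poll_recovery`, the 60-second interval, and the `"recovered"`/`"blocked"` return values are hypothetical names and defaults. Only the `cc-ci` ssh alias and the ~20-minute ceiling come from the entry.

```python
import subprocess
import time
from typing import Callable


def host_reachable(host: str = "cc-ci", timeout_s: int = 10) -> bool:
    """Return True if a trivial ssh command succeeds within timeout_s.

    Assumes `host` is an ssh alias that works non-interactively.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={timeout_s}", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,  # hard stop if ssh itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def poll_recovery(
    probe: Callable[[], bool],
    deadline_s: float = 20 * 60,   # upper end of the ~15-20 min window
    interval_s: float = 60,        # hypothetical poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Probe until success or until deadline_s has elapsed.

    Returns "recovered" on the first successful probe, otherwise
    "blocked" once the deadline is reached (operator reboot needed).
    """
    start = clock()
    while True:
        if probe():
            return "recovered"
        if clock() - start >= deadline_s:
            return "blocked"
        sleep(interval_s)
```

Calling `poll_recovery(host_reachable)` would run the real probe. `clock` and `sleep` are injectable only so the deadline logic can be exercised without waiting.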