change(cleanup): reap dev deploys at start+end of /upgrade-all instead of a timer
Per operator: drop the hourly cc-ci-reap-dev-deploys systemd timer; instead run the dev-* reaper at the START (Step 0, alongside the orphan sweep) and END (new step 4b) of each /upgrade-all run, with THRESHOLD=0 (the run is quiescent then, so clear all dev-* unconditionally). The reaper keeps its safe default (4h) for ad-hoc use. Step-2b mandatory teardown is unchanged (primary mechanism); this is the backstop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -157,9 +157,9 @@ ssh cc-ci 'export PATH=/run/current-system/sw/bin:$PATH; set -a; . /srv/cc-ci/.t
 Then **verify nothing leaked**: `ssh cc-ci 'docker stack ls --format "{{.Name}}" | grep -c "^dev-<recipe with . as _>"'`
 should print `0` (and no `dev-<recipe>_*` volumes remain). If the recipe failed, tear the dev deploy
 down anyway, THEN report the failure — never leave it running.
-Backstops (defence-in-depth, NOT a substitute for the explicit teardown above): the `/upgrade-all`
-orphan-sweep (Step 0) and the **hourly `cc-ci-reap-dev-deploys` timer** (reaps idle `dev-*` stacks),
-so a crashed/abandoned loop's deploy is bounded — but you must still clean up yourself.
+Backstop (defence-in-depth, NOT a substitute for the explicit teardown above): `/upgrade-all` runs
+the `dev-*` reaper (`reap-dev-deploys.sh`) + the orphan-sweep at the START and END of every run, so a
+crashed/abandoned loop's `dev-` deploy is cleared by the next run — but you must still clean up yourself.
 - Caveats: shared swarm — keep to **ONE** `dev-<recipe>` instance at a time and tear it down before the
 next recipe; the `dev-<recipe>` domain is distinct from the harness's per-run domains and from the
 `warm-*` canonicals, so the sweep removes a leaked one without touching live services.
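The "verify nothing leaked" check in the hunk above can be sketched as a small filter over a list of stack names. This is only an illustration under stated assumptions: `dev_stack_prefix` and `count_leaked` are hypothetical helper names, and the sample stack names are made up; only the `dev-<recipe with . as _>` convention and the `grep -c` idea come from the text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a recipe name into its dev stack prefix,
# following the "dev-<recipe with . as _>" convention quoted in the hunk.
dev_stack_prefix() {
  printf 'dev-%s' "${1//./_}"
}

# Hypothetical helper: count stack names on stdin that start with that prefix.
count_leaked() {
  local prefix
  prefix=$(dev_stack_prefix "$1")
  # grep -c prints the count but exits 1 when it is 0, so mask the status.
  grep -c "^${prefix}" || true
}

# Made-up listings standing in for `docker stack ls --format "{{.Name}}"`.
printf '%s\n' warm-example dev-my_recipe_app traefik | count_leaked my.recipe  # prints 1
printf '%s\n' warm-example traefik | count_leaked my.recipe                    # prints 0
```

On the real host the input would presumably come from `ssh cc-ci 'docker stack ls --format "{{.Name}}"'`, and a clean teardown corresponds to a count of `0`.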
@@ -39,6 +39,14 @@ It is idempotent (a no-op when the host is already clean) and prints what it rem
 surviving Swarm services (which should be infra + `warm-*` only — eyeball that before continuing). If
 anything legitimate looks at risk, stop and investigate rather than proceeding.
 
+Then **reap any leftover step-2b dev deploys** from a prior run (the `dev-*` stacks `/recipe-upgrade`
+step 2b creates to debug an upgrade with live logs). The full sweep above already removes them, but run
+the dedicated reaper too so the start/end cleanup is explicit and symmetric (`THRESHOLD=0` = clear ALL
+`dev-*`, since the run is quiescent now):
+```
+ssh cc-ci 'THRESHOLD=0 bash -s' < /srv/cc-ci/.claude/skills/upgrade-all/reap-dev-deploys.sh
+```
+
 ## 1. Build the candidate list
 Enrolled recipes = the cc-ci `tests/<recipe>/` dirs (same set `ci-test-review` sweeps):
 ```
@@ -132,6 +140,15 @@ with `/recipe-upgrade <recipe> --with-tests` to also get a verified test-update
 Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode
 won't emit `SUCCESS+TESTPR`). A subagent that emitted no `RESULT:` line → `FAILED — no result emitted`.
 
+## 4b. Reap dev deploys (END of run)
+Every `/recipe-upgrade` subagent is required to tear down its own step-2b `dev-<recipe>` deploy, but
+once ALL recipes are done, reap any that leaked (a crashed/killed subagent) — the symmetric end-of-run
+cleanup to Step 0. The run is quiescent here, so clear ALL `dev-*` unconditionally (`THRESHOLD=0`):
+```
+ssh cc-ci 'THRESHOLD=0 bash -s' < /srv/cc-ci/.claude/skills/upgrade-all/reap-dev-deploys.sh
+```
+(Scoped to `dev-*` only — never touches the recipe PRs, CI per-run stacks, `warm-*`, or infra.)
+
 ## 5. Write + print the summary
 Write `/srv/cc-ci/.cc-ci-logs/upgrades/upgrade-all-<YYYY-MM-DD>.md` and print it, **leading with the
 PR list** (the actionable output):
@@ -2,22 +2,23 @@
 # Reap LEAKED step-2b dev deploys on the cc-ci server.
 #
 # /recipe-upgrade step 2b deploys a recipe under a `dev-<recipe>` domain to debug an upgrade with live
-# logs, and REQUIRES the agent to tear it down when done. This is the automated backstop for when that
-# teardown is missed (agent crashed / killed / abandoned mid-loop): it removes `dev-*` Swarm stacks
-# (+ their now-dangling volumes) whose newest service has not been updated in THRESHOLD seconds.
+# logs, and REQUIRES the agent to tear it down when done. This is the backstop for a missed teardown
+# (agent crashed / killed / abandoned mid-loop): it removes `dev-*` Swarm stacks (+ their dangling
+# volumes). **Invoked at the START and END of an `/upgrade-all` run** (with `THRESHOLD=0` — by then the
+# run is quiescent, so any `dev-*` is leftover and removed unconditionally).
 #
-# SAFE to run anytime — even while CI is mid-run — because it is scoped + age-gated:
-# - it touches ONLY the `dev-` naming convention used by step 2b. CI per-run stacks
-# (`<recipe[:4]>-<hash>`), `warm-*` canonicals, and infra are never `dev-*`, so never matched.
-# - an ACTIVE dev loop redeploys (refreshing the service UpdatedAt), so it stays "fresh" and is NOT
-# reaped mid-use; only an idle/abandoned `dev-*` ages past THRESHOLD and is removed.
-# - volume cleanup uses `dangling=true`, so an active deploy's attached volumes are never removed.
+# SAFE — scoped to the `dev-` naming convention only: CI per-run stacks (`<recipe[:4]>-<hash>`),
+# `warm-*` canonicals, and infra are never `dev-*`, so never matched. Volume cleanup uses
+# `dangling=true`, so a still-attached volume is never removed.
 #
-# Run ON the cc-ci host: ssh cc-ci 'THRESHOLD=14400 bash -s' < reap-dev-deploys.sh
+# `THRESHOLD` (seconds, default 14400=4h) only removes a `dev-*` stack whose newest service has been
+# idle longer than it — so an ad-hoc run while a dev loop is ACTIVE won't kill it (an active loop keeps
+# redeploying, refreshing UpdatedAt). `/upgrade-all` passes `THRESHOLD=0` at run start/end to clear ALL
+# leftover dev deploys. Run ON the cc-ci host: ssh cc-ci 'THRESHOLD=0 bash -s' < reap-dev-deploys.sh
 set -uo pipefail
 export PATH=/run/current-system/sw/bin:$PATH
 
-THRESHOLD="${THRESHOLD:-14400}" # 4h — generous, so a long but ACTIVE dev loop is never reaped
+THRESHOLD="${THRESHOLD:-14400}" # default 4h (safe for ad-hoc use); /upgrade-all passes 0 at start/end
 now=$(date +%s)
 reaped=0
 
@@ -31,8 +32,8 @@ for s in "${STACKS[@]}"; do
 [ "$e" -gt "$newest" ] && newest="$e"
 done
 age=$(( now - newest ))
-if [ "$newest" -gt 0 ] && [ "$age" -gt "$THRESHOLD" ]; then
-echo "reap: dev stack '$s' idle ${age}s (> ${THRESHOLD}s) — removing"
+if [ "$newest" -gt 0 ] && [ "$age" -ge "$THRESHOLD" ]; then
+echo "reap: dev stack '$s' idle ${age}s (>= ${THRESHOLD}s) — removing"
 docker stack rm "$s" >/dev/null 2>&1 || true
 reaped=$((reaped + 1))
 else
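The switch from `-gt` to `-ge` in the hunk above matters mostly for `THRESHOLD=0`: a stack whose newest service was updated within the current second has an age of 0, and `0 -gt 0` is false, so the old test would have skipped it. The sketch below isolates that age gate so both operators can be compared. `should_reap` and its arguments are hypothetical names for illustration, not part of `reap-dev-deploys.sh`, and the timestamps are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the real script): the age gate on its own.
#   $1 = newest service update time (epoch seconds, 0 = unknown)
#   $2 = current time (epoch seconds)
#   $3 = THRESHOLD in seconds
#   $4 = comparison operator: -gt (old behaviour) or -ge (new behaviour)
should_reap() {
  local newest="$1" now="$2" threshold="$3" op="$4"
  local age=$(( now - newest ))
  if [ "$newest" -gt 0 ] && [ "$age" "$op" "$threshold" ]; then
    echo reap
  else
    echo keep
  fi
}

now=1700000000                                   # arbitrary fixed timestamp
should_reap "$now" "$now" 0 -gt                  # prints keep (age 0 is not > 0)
should_reap "$now" "$now" 0 -ge                  # prints reap (age 0 is >= 0)
should_reap "$(( now - 3600 ))" "$now" 14400 -ge # prints keep (1h idle, below the 4h default)
```

With the default 4h threshold the two operators differ only at exactly 14400 seconds of idleness, so ad-hoc runs behave essentially as before.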
@@ -220,37 +220,4 @@ SSHCFG
 Persistent = true; # if the box was down at the scheduled time, run once on next boot
 };
 };
-
-# Hourly reaper for LEAKED /recipe-upgrade step-2b dev deploys (`dev-*` stacks on the cc-ci server).
-# The upgrader must tear down its own dev deploy; this is the automated backstop for a missed
-# teardown (crashed/abandoned loop). reap-dev-deploys.sh is scoped + age-gated so it is safe to run
-# even mid-CI: it only touches `dev-*`, and only when idle > THRESHOLD (an active dev loop keeps
-# redeploying and is never reaped). cc-ci-plan/IDEAS.md tracks the eventual separate-infra fix; this
-# just bounds the leak window in the meantime.
-systemd.services.cc-ci-reap-dev-deploys = {
-description = "Reap leaked step-2b dev deploys (dev-* stacks) on the cc-ci server";
-after = [ "network-online.target" "tailscaled.service" ];
-wants = [ "network-online.target" ];
-serviceConfig = {
-Type = "oneshot";
-User = "loops"; Group = "users";
-WorkingDirectory = "/srv/cc-ci";
-};
-environment = { HOME = "/home/loops"; };
-path = [ pkgs.bash pkgs.openssh pkgs.coreutils ];
-script = ''
-ssh cc-ci 'THRESHOLD=14400 bash -s' \
-< /srv/cc-ci/.claude/skills/upgrade-all/reap-dev-deploys.sh \
->> /srv/cc-ci/.cc-ci-logs/reap-dev-deploys.log 2>&1
-'';
-};
-
-systemd.timers.cc-ci-reap-dev-deploys = {
-description = "Hourly reaper for leaked step-2b dev deploys on cc-ci";
-wantedBy = [ "timers.target" ];
-timerConfig = {
-OnCalendar = "hourly";
-Persistent = true;
-};
-};
 }