Commit Graph

97 Commits

Author SHA1 Message Date
cd5e645427 fix(opencode): use inference.tinfoil.sh + attach TUI + NO_COLOR
Three fixes discovered during first live run:
- inference host is inference.tinfoil.sh not api.tinfoil.sh (control plane
  only serves /v1/models, not /v1/chat/completions)
- opencode run exits after one turn; switch to opencode attach for the
  persistent TUI, then ping_session sends the kickoff prompt
- NO_COLOR=1 suppresses the first-run interactive theme picker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:56:06 +00:00
bca51071bd refactor: rewrite launchers as Python; add orchestrator JOURNAL.md
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).

launch.py — loops + watchdog:
  Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
  handoff signalling, stall detection, heal_session, heal_orchestrator.
  Cleaner structure: config block → helpers → phase/kickoff/agent/healing/
  handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout.

launch-orchestrator.py — orchestrator session:
  claude path: --resume <id> preserved (conversation survives reboots).
  opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
  the new session; reads JOURNAL.md for context).
  STARTUP_PROMPT updated to reference JOURNAL.md on startup.

launch-upgrader.py — one-shot upgrade job:
  LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
  Both claude and opencode paths supported.
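
A minimal sketch of the precedence rule above, not the actual launch-upgrader.py code. The function name is hypothetical, and the fallback defaults ("claude" backend, "sonnet" model) are assumptions; only the sonnet model default appears elsewhere in this log.

```python
import os


def resolve_upgrader_config(env=None):
    """Return (backend, model) for the upgrader job.

    LOOP_BACKEND / LOOP_MODEL win over UPGRADER_BACKEND / UPGRADER_MODEL
    whenever they are set to a non-empty value.
    """
    env = os.environ if env is None else env
    # Final fallbacks are assumptions made for this illustration.
    backend = env.get("LOOP_BACKEND") or env.get("UPGRADER_BACKEND") or "claude"
    model = env.get("LOOP_MODEL") or env.get("UPGRADER_MODEL") or "sonnet"
    return backend, model
```

Passing a plain dict instead of os.environ keeps the rule easy to check in isolation.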

cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
  Persistent across conversation resets. Documents the handoff format and
  carries the current session's summary: migration complete, phase 5 in
  progress (V3/V7 PASS), phase 4 deferred, open items for next session.

AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:50:09 +00:00
e0e5bf6e64 feat: opencode web at oc.commoninternet.net (one server, named sessions)
configuration.nix:
- systemd.services.opencode-web: one shared opencode server on 127.0.0.1:4096,
  EnvironmentFile=/srv/cc-ci/.testenv (TINFOIL_API_KEY), ExecStartPre clears
  stale /tmp/opencode so restarts never fail on the EEXIST race.
- services.nginx: reverse-proxy oc.commoninternet.net → localhost:4096,
  bound to tailscale IP 100.84.190.30 (tailnet-only, plain HTTP).
  DNS: A record oc.commoninternet.net → 100.84.190.30 (operator step).

launch.sh + launch-upgrader.sh:
- Drop per-session ports / OPENCODE_HOST; add OPENCODE_SERVER=http://127.0.0.1:4096.
- opencode backend: agents use `opencode run --attach $OPENCODE_SERVER --title $session`
  so each shows up as a named session in the web UI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:37:03 +00:00
a87d42f491 feat: opencode/tinfoil backend support in all launchers
Adds LOOP_BACKEND=opencode|claude (+ LOOP_MODEL) to launch.sh and
launch-upgrader.sh, enabling the loops/upgrader to run via opencode CLI
against the tinfoil.sh API (deepseek-v4-pro etc.) instead of Claude.

launch.sh:
- LOOP_BACKEND (claude|opencode), LOOP_MODEL env vars
- OPENCODE_BIN, OPENCODE_HOST (tailscale IP), OPENCODE_PORT (per-session)
- start_agent: backend switch — claude path unchanged; opencode starts
  `opencode --hostname <ts-ip> --port <N> run <kickoff>` so the web UI
  is bound to the tailscale interface (tailnet-only observability)
- preflight: validates the right binary per backend
- heal_session / heal_orchestrator: extend active-work detection to
  opencode spinner chars + "Running tool"
- help: shows both backend configs

launch-upgrader.sh:
- UPGRADER_BACKEND / UPGRADER_MODEL (LOOP_BACKEND/LOOP_MODEL override)
- start: same backend switch as launch.sh
- OPENCODE_PORT=4098 (separate from loops 4096/4097)

configuration.nix: note opencode binary location + re-install command.

Tinfoil config: ~/.config/opencode/opencode.jsonc — provider "tinfoil"
with baseURL=https://api.tinfoil.sh/v1, apiKey=env:TINFOIL_API_KEY
(key + TINFOIL_MODEL + TINFOIL_BASE_URL stored in .testenv).
opencode v1.15.13 installed at /home/loops/.local/bin/opencode.

Usage:
  LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro \
    RESUME_PHASE=1 cc-ci-plan/launch.sh start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:21:13 +00:00
6910b197d0 fix(testme-on-pr): read cc-ci/testme context URL not first-status URL
When multiple commit statuses exist (e.g. an Adversary probe + the real run),
the first status in the array may not be the cc-ci run. Filter by context
'cc-ci/testme' to get the correct Drone build URL.
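
The filter can be sketched as follows. This is an illustration in Python, not the shell helper itself; the function name is hypothetical, and the `context` / `target_url` field names are assumed to match the commit-status objects the helper receives.

```python
def find_testme_url(statuses, context="cc-ci/testme"):
    """Return the build URL of the status whose context matches, else None.

    Taking statuses[0] is unsafe when several statuses exist, because an
    unrelated one (e.g. an Adversary probe) may come first.
    """
    for status in statuses:
        if status.get("context") == context:
            return status.get("target_url")
    return None
```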
2026-05-31 14:00:02 +00:00
0df57c6d0c fix(open-recipe-pr): replace python3 with jq (cc-ci has jq, not python3)
2026-05-31 13:35:07 +00:00
25fd7407fd launch-upgrader: default model to sonnet (UPGRADER_MODEL)
Adds UPGRADER_MODEL env var (default: sonnet) passed as --model to the
claude invocation. The cron runs the upgrader on Sonnet so it doesn't
consume Opus weekly credits. Override with UPGRADER_MODEL=opus if needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 13:24:53 +00:00
21e7a79f50 orchestrator-hetzner: enable reboot-resilience + record migration
Now the workspace is staged on the Hetzner cpx22 (server 134487234, public
91.98.47.73, tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the
  loops + watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots
  auto-log to REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged
  reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service =
enabled; ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 03:54:17 +00:00
e89f384c24 nix: remove --ssh flag from tailscale (use normal key auth, not tailscale ACL)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 03:02:04 +00:00
73b65af6d6 nix: add all 3 root SSH keys from current orchestrator VM
Includes the operator key (mfowler), the claude-vm key, and the cc-ci-sandbox key.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:40:51 +00:00
497bea8462 nix: add root SSH authorized key to cc-ci-orchestrator-hetzner config
nixos-rebuild removed the infect-provisioned authorized_keys — declare it
explicitly so rebuilds don't lock out root access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:35:37 +00:00
c44b967019 nix: add real cpx22 hardware config from nixos-infect (server 134487234)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:30:29 +00:00
17951b899e terraform: fix server_type to cpx22 (cpx11/cpx21 retired in nbg1); add lock file
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:17:26 +00:00
0103f369ad terraform+nix: Hetzner orchestrator server (cpx11, nixos-infect, cc-ci-orchestrator-hetzner flake host)
Adds terraform/ to provision a Hetzner cpx11 (2 vCPU / 2 GB dedicated AMD / 40 GB NVMe)
for the loops runtime, and a flake + NixOS host config to converge it — replacing the slow
b1 Incus VM. Mirrors the cc-ci server terraform (same nixos-infect pin, same pattern).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:11:30 +00:00
4c418765c8 plan: full migrate-cc-ci-to-hetzner (provision cpx32 → benchmark 2 recipes → cutover loops+pipeline+DNS → retire Incus VM); age key is on the VM so no secret-blocker; harden .gitignore for the age key
2026-05-31 02:04:02 +00:00
b25330d3e8 gitignore: ignore .sops/ + age-key files (lost in the repo consolidation; needed before staging the master age key)
2026-05-31 01:22:29 +00:00
102427ab5b plan: full migrate-to-Hetzner (provision → cut over loops → stop old b1 VM); server type cpx31→cpx32
- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green
  !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change),
  (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed.
- plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 01:15:29 +00:00
b98e527656 plan: switch cc-ci cloud terraform from DigitalOcean to Hetzner (cx32 8GB, hcloud provider, nixos-infect + D8 flake flow)
2026-05-31 00:25:05 +00:00
67226efe72 plan: cc-ci on DigitalOcean — terraform/ + nixos-infect + nix provisioning (8GB droplet, reproducible from the cc-ci flake)
2026-05-31 00:18:27 +00:00
01874821f2 decommission Pi: update all docs for VM-only setup
The orchestrator Pi is retired (2026-05-31). All agents now run on the
cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a
direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled,
no ProxyCommand. Updated across all affected files:

AGENTS.md
  - Remove Pi from reboot description; migration complete (not "parked")
  - cc-ci access: direct ssh, not via proxy

kickoff.md
  - Prerequisites: direct tailnet peer, not proxy
  - Host deps: NixOS (not apt)
  - Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
  - §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
  - §1.5 intro: "VM" not "sandbox host"; no proxy
  - Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
  - Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
  - Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
  - Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
  - cc-ci access language only: direct ssh, no proxy restart instructions
  - adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
  - Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
  - User/Group/HOME/PATH: notplants → loops
  - Remove cc-ci-tailscaled.service dependency (no proxy on VM)
  - Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
  - tailscale status: no --socket flag
  - ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 00:16:37 +00:00
db375bcc07 rename to cc-ci-orchestrator: update all repo name references
Gitea repos renamed:
  cc-ci-autonomous-orchestrator → cc-ci-orchestrator
  cc-ci-orchestrator            → archived-cc-ci-orchestrator

Updated in this workspace:
- README.md, AGENTS.md: repo title
- cc-ci-plan/plan-orchestrator-migration.md: cc-ci-autonomous-orchestrator refs
- cc-ci-plan/plan-repo-consolidation.md: marked complete + Pi remote-update notice
- cc-ci-plan/launch-orchestrator.sh, launch.sh: session naming comment cleanup

NOTE: Pi clone still has the old origin URL. On the Pi, run:
  git remote set-url origin https://git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator.git

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 00:03:11 +00:00
8ef141f538 absorb cc-ci-orchestrator NixOS config into nix/
Merging recipe-maintainers/cc-ci-orchestrator (the VM NixOS config repo)
into this repo as nix/ — the next step toward consolidating the two
orchestrator repos into a single cc-ci-orchestrator.

The source repo will be renamed to archived-cc-ci-orchestrator on Gitea.
This repo will be renamed cc-ci-orchestrator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 00:01:14 +00:00
cd8ac99ed9 replace references/recipe-maintainer symlink with git submodule
The symlink pointed to /srv/recipe-maintainer which does not exist on
the VM, leaving the parity corpus empty and blocking Phase-2 porting
of not-yet-migrated recipes.

Submodule URL: https://git.autonomic.zone/notplants/recipe-maintainer
(notplants/recipe-maintainer on git.autonomic.zone)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 23:56:30 +00:00
2ef90a4237 launch-assistant.sh: run the assistant on sonnet (ASSISTANT_MODEL, default sonnet)
2026-05-30 23:54:25 +00:00
2233c6182a add launch-assistant.sh: cc-ci-assistant — remote-control, non-loop helper
A general-purpose Claude session sharing the orchestrator's workspace + access,
under remote-control (cc-ci-assistant), NOT on a loop. Sits idle until the
orchestrator/operator hands it a plan/task, does it, reports, waits. Modelled on
launch-orchestrator.sh: persistent pinned session-id (resume across relaunch),
root-aware --dangerously-skip-permissions handling, start/fresh/status/attach/stop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 23:52:08 +00:00
b550d6c432 plan stub: repo consolidation (merge 2 orchestrator repos) + references/recipe-maintainer as a submodule — deferred until credits (operator 2026-05-30)
2026-05-30 23:47:07 +00:00
fffd83fe4b launch.sh: use CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS env var when running as root (VM uses root; --dangerously-skip-permissions flag blocked by claude for root)
2026-05-30 19:36:35 +01:00
fd08a977d0 overlay policy: standardize the ccci overlay filename to compose.ccci.yml
Operator: use a single uniform filename `compose.ccci.yml` per recipe (one file
holding all cc-ci-side deploy tweaks) rather than per-purpose suffixes like
compose.ccci-health.yml. Updated §9 + plan-ccci-compose-overlay-policy.md; added
a DoD item to rename tests/{ghost,discourse}/compose.ccci-health.yml ->
compose.ccci.yml and update their install_steps.sh cp target + recipe_meta
COMPOSE_FILE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:25:48 +01:00
5f34c0ad01 overlay policy (content): §9 guardrail rewrite + plan-ccci-compose-overlay-policy.md
The prior commit only captured the file deletion (git add aborted on the
already-removed pathspec). This adds the actual content: the reworked §9
guardrail (justified ccci overlays OK; abra can't set start_period from an env var; always test
upgrade-to-latest; from-version custom tests skippable) and the new policy doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:19:18 +01:00
6cb5580390 overlay policy: ccci compose overlays OK when justified (abra can't env start_period); keep upgrade-to-latest
Operator correction (builder was right): abra does NOT support an env value for
healthcheck start_period, so the earlier "parameterize via APP_START_PERIOD env
PR" approach is impossible — a ccci compose overlay is the right tool there.

- plan.md §9: replace the "don't fork compose / use env PR" guardrail with
  "avoid where possible + justify each + prefer upstream PR, BUT a uniform
  optional compose.ccci-*.yml overlay is an acceptable fallback" (esp. for
  abra-unparameterizable values like start_period). Add the upgrade-tier rule:
  ALWAYS test the upgrade to latest; a from-version's custom tests may be
  skipped if it can't fully run, but never drop upgrade-to-latest.
- replace plan-prefer-env-over-compose-overlay.md with
  plan-ccci-compose-overlay-policy.md: ghost/discourse start_period overlays
  STAY (justified); discourse image re-pin STAYS (keeps the upgrade-to-latest
  testable; 0.7.0 custom tests may be skipped); mumble old-base host-ports copy
  DROPPED (skip 0.2.0 voice tests, still upgrade to latest + test there). Each
  surviving overlay must be minimal + header-justified + Adversary-confirmed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:18:53 +01:00
e7f05ceffe orchestrator-migration: Phase A COMPLETE (cc-ci-orchestrator VM up + ssh) + reboot #3 log
Phase A done before the Pi's reboot #3 (commit was interrupted): the loops VM
cc-ci-orchestrator is on the tailnet (100.116.55.106) and ssh-able; TS-key
finding recorded (VM-creator .test.env key revoked; cc-ci .testenv key valid +
persisted). REBOOTS.md carries the auto-logged 2026-05-30 17:03 reboot
(cc-ci-loops.service auto-recovered the loops at phase 2; swapfile persisted).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:07:02 +01:00
742a08b677 orchestrator-migration: Phase A started — cc-ci-orchestrator VM created (2GB/2vCPU)
Operator go-ahead (Pi is OOM-thrashing/slow). Created the dedicated loops VM
cc-ci-orchestrator (2GB RAM / 2 vCPU / 30GB, incus-base-vm NixOS) on b1 via the
Incus API, mirroring the known-good cc-nix-test spec; started it — cloud-init is
running nixos-rebuild boot + reboot + tailnet join. Status flipped DRAFT->IN
PROGRESS with the remaining Phase-A items noted (add cc-ci-root key via incus
exec, confirm tailnet+ssh, write the reproducible TF project).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 15:53:06 +01:00
71a4a1fea4 Reliable loop messaging: msg-loop.sh + hardened ping_session (retry submit)
tmux `send-keys -l <long msg>` often leaves the text UNSENT in the input box (the
immediate Enter is swallowed while the TUI ingests the paste). Both now type the
message then retry Enter/C-m until the leading text is no longer in the input box
(= submitted) or a bounded loop gives up.
- msg-loop.sh: standalone reliable messenger for orchestrator use.
- launch.sh ping_session: same retry-submit (loads on next watchdog restart).
Live-tested: delivered first try.
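
The retry-submit idea can be sketched like this. It is an illustration only, not the msg-loop.sh code: the three callables stand in for whatever types the text, presses Enter, and reads the input box (for example via tmux), and all names and defaults here are hypothetical.

```python
import time


def submit_with_retry(type_text, press_enter, read_input_box, message,
                      max_tries=5, delay=0.5, probe_len=20):
    """Type `message`, then retry Enter until it has left the input box.

    Returns the number of Enter presses that were needed, or 0 if the
    bounded loop gave up with the text still unsent.
    """
    probe = message[:probe_len]  # leading text used to detect "still unsent"
    type_text(message)
    for attempt in range(1, max_tries + 1):
        press_enter()
        time.sleep(delay)  # give the TUI time to ingest the paste
        if probe not in read_input_box():
            return attempt  # leading text is gone, so the message was submitted
    return 0
```

Injecting the callables keeps the loop independent of tmux, so it can be exercised against a fake input box that swallows the first Enter.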

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 15:31:28 +01:00
7a1f7f75aa Policy: prefer upstream env-parameterization over cc-ci compose overlays
Operator (2026-05-30): a cc-ci-authored compose overlay risks silent drift from
the recipe users actually run — avoid it wherever possible.

- plan.md §9 guardrail: when a recipe needs a cc-ci-env-tuned value (e.g. a longer
  healthcheck start_period for the slow single node), the preferred fix is an
  UPSTREAM recipe PR exposing it as an env var (e.g. APP_START_PERIOD) with the
  current value as the default in env.sample — CI sets the env, no new compose.
  For making the upgrade tier work from an older base version, prefer DECLARING
  that version not-testable under this CI env over crafting a custom compose.
  Overlay = last resort, Adversary-confirmed non-drifting + paired with the env PR.
- plan-prefer-env-over-compose-overlay.md: migrates the existing debt —
  ghost/discourse compose.ccci-health.yml start_period -> APP_START_PERIOD recipe
  PRs (default=current) then drop the overlays; discourse image re-pin + mumble
  old-base host-ports copy -> declare those old versions untestable instead of
  forking compose. No test weakened; untestable-version is an honest outcome.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 15:17:42 +01:00
a89b082240 plan §7: recommend Monitor-on-convergence pattern for long deploys (builder's idea)
For a long deploy/convergence, arm a Monitor that polls the node every ~30s and
wakes on convergence OR failure, with a longer fallback heartbeat (ScheduleWakeup)
as a backstop. Proceeds the instant it converges (no over-waiting), surfaces
failures promptly, and the heartbeat bounds the wait. Size the timeout sanely
(longer if justified, never absurd like the ~40-min ghost case). Credit: builder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:17:18 +01:00
e85e16318c Phase 2b narrowed to "confirm minimal deploys"; perf ideas moved to IDEAS
Operator (2026-05-30): the real deploy-speed bottleneck was hardware (cc-ci VM
was 2 vCPU on a 4-core host + disk-I/O-bound; RAM fine), now fixed directly
(bumped to 4 vCPU, made cc-nix-test the only running VM on b1). The 2b software
micro-optimizations are judged unlikely to help, so:

- IDEAS.md: parked the whole empirical-perf program (instrumentation, baseline,
  attribution) + the optimization menu (image cache/prepull, readiness tuning,
  warm-SSO start/stop, runner caching, concurrency sizing, resources, secret
  overhead) under "Phase-2b empirical performance work", revisit only if
  measurement later proves a specific software bottleneck.
- plan-phase2b: reduced to ONE goal — confirm (and fix if needed) that the
  per-recipe test sequence already uses the minimum deploys (1 base shared by
  install+functional+backup/restore, +1 for the upgrade tier, +1 per dep),
  enforced by the existing DG4.1 deploy-count check, WITHOUT weakening any test.
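
As a worked version of the minimum-deploy rule above (a sketch; the function name and the `has_upgrade_tier` flag are hypothetical, and this is not the DG4.1 check itself):

```python
def minimum_deploys(num_deps, has_upgrade_tier=True):
    """Minimum deploys for one recipe's test sequence.

    1 base deploy shared by install + functional + backup/restore,
    +1 for the upgrade tier, +1 per dependency.
    """
    return 1 + (1 if has_upgrade_tier else 0) + num_deps
```

A recipe with no dependencies therefore needs two deploys, and one with two dependencies needs four.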

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:07:49 +01:00
1c2be64124 Phase 5 §4: install weekly upgrade cron at completion+1h and verify first kickoff
Operator: when the final phase completes, install the weekly cron anchored to
actual completion — first run ~1h after the build finishes, weekly from then on
(supersedes the fixed "Sat 03:00 UTC" placeholder).

- plan-phase5 §4: orchestrator computes T0=now+1h, installs a weekly job at T0's
  DOW+HH:MM running launch-upgrader.sh start; cron env needs claude on PATH +
  tmux + claude.ai login (mirror cc-ci-loops.service). VERIFY the first kickoff:
  cheap --dry-run pre-check, then confirm the real T0 fire launched the
  cc-ci-upgrader agent (status RUNNING, ran /upgrade-all, summary produced);
  record schedule + verified kickoff in DECISIONS.md.
- upgrade-all skill Cron section + cron memory updated to the completion-anchored
  schedule + first-kickoff verification.
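
The T0 computation can be sketched as below. This is an illustration under two assumptions: the cron daemon evaluates schedules in UTC, and the completion time is passed as a timezone-aware datetime. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def weekly_cron_schedule(completed_at, delay=timedelta(hours=1)):
    """Return the 'MIN HOUR * * DOW' fields for a weekly job anchored at T0.

    T0 = completion time + delay, expressed in UTC. `completed_at` must be
    timezone-aware.
    """
    t0 = (completed_at + delay).astimezone(timezone.utc)
    # Python's weekday() uses Monday=0; cron uses Sunday=0.
    cron_dow = (t0.weekday() + 1) % 7
    return f"{t0.minute} {t0.hour} * * {cron_dow}"


# A build finishing Saturday 02:00 UTC reproduces the old fixed slot.
print(weekly_cron_schedule(datetime(2026, 5, 30, 2, 0, tzinfo=timezone.utc)))  # 0 3 * * 6
```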

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 21:21:20 +01:00
bf71420106 Add cc-ci-upgrader agent: observable one-shot weekly upgrade-run agent
The weekly upgrade run now executes inside a dedicated, remote-control agent
(cc-ci-upgrader) — viewable/steerable at claude.ai/code like the Builder — rather
than buried in headless cron output.

- launch-upgrader.sh: spins up the cc-ci-upgrader tmux session under
  --remote-control with a kickoff that runs /upgrade-all (DEFAULT mode) to
  completion. On finish the agent STOPS and stays idle (does NOT self-terminate)
  so the run + summary stay reviewable in the web UI. `start` = use-or-create:
  leaves an in-flight (busy) run alone, else clears a finished/idle/wedged
  session and runs fresh; `fresh` always restarts. UPGRADER_ARGS passes flags
  (e.g. --dry-run); never --with-tests.
- launch.sh: orchestrator_alive() now also skips the cc-ci-upgrader
  remote-control name, so the upgrader job isn't mistaken for the orchestrator.
- upgrade-all skill: documents it runs as the cc-ci-upgrader agent; the weekly
  cron invokes `launch-upgrader.sh start` (not /upgrade-all inline).
- Phase 5: V8a verifies the agent lifecycle (launch → run to completion → stay
  idle/viewable → next start clears it); V9 stops the verification session.
- cron memory: weekly task = launch-upgrader.sh start at 0 3 * * 6 UTC.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 21:12:47 +01:00
4f74676c72 Phase 5 (final): verify the /recipe-upgrade + testme-on-pr.sh end-to-end flow
Appended as the LAST phase in the launcher sequence (… 3 4 5). It can only run
once cc-ci is fully built — the !testme-on-recipe-PR flow depends on Phase 3
(results UX) surfacing the run result back on the PR for testme-on-pr.sh to read.

DoD (Adversary cold-verifies): !testme on a recipe PR is the real gate + results
land in the PR (V1); testme-on-pr.sh reads GREEN/RED/PENDING + BUILD url, POST=0
polls without re-triggering (V2); /recipe-upgrade default end-to-end green on a
sandbox recipe, nothing merged (V3); the ≤3 !testme regression loop (V4); stale
test DEFAULT = comment-only, no test edit (V5); --with-tests opens+verifies a
cc-ci test PR, paired (V6); mirror reconcile closes merged/superseded PRs and
main==upstream (V7); /upgrade-all default dry-run + small live run never edits
tests (V8); all verification PRs closed + deploys torn down (V9). Use a sandbox
recipe; never merge; never weaken tests. Watchdog reloaded (seq …3 4 5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:38:39 +01:00
4a1da1dd60 recipe-upgrade: !testme-on-PR verification + make test PRs opt-in (--with-tests)
Per operator:
- Verify via `!testme` posted ON the recipe PR (the real CI path) so results are
  viewable in the PR; iterate up to 3 !testme runs (fix a real regression + re-test).
  New helper testme-on-pr.sh posts !testme and polls the PR head commit status
  for the verdict (POST=0 to keep polling without re-triggering).
- Test updates are now OPT-IN via `--with-tests`. DEFAULT: recipe PR only using
  existing tests; if a test fails and is genuinely stale, leave an explanatory
  COMMENT on the PR (upgrade looks correct; re-run --with-tests to update tests)
  and do NOT touch any test. --with-tests keeps the verified cc-ci test-update PR
  path (verified via the branch-checkout harness run, since !testme uses prod tests).
- upgrade-all (weekly cron) calls the DEFAULT — never auto-edits tests unattended;
  surfaces "tests look stale" PRs in the summary for the operator to opt in per-recipe.
- New RESULT: SUCCESS-PENDING-TESTS for the recipe-green-but-test-stale default case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:18:59 +01:00
c7da03fa6c watchdog: STALL_GRACE so stall_check never races a loop's own ScheduleWakeup
Root cause of the adversary "overrun": stall_check rebooted the instant
now >= WAITING-UNTIL (zero grace), but the loop's own ScheduleWakeup fires AT
that stated time — and the runtime scheduled it ~40s later than the marker
(date-vs-scheduler skew). So the watchdog pre-empted a HEALTHY self-wake by
~37s; the loop wasn't wedged, it was killed just before it woke. That was the
single false reboot at 18:55Z.

Fix: split the two cases cleanly.
- Marker present: reboot only when now > WAITING-UNTIL + STALL_GRACE (180s) —
  covers wake+start latency + marker/scheduler skew, so the watchdog only fires
  if the self-wake GENUINELY failed.
- No marker: unchanged — reboot when idle >= STALL_IDLE (300s).

Verified post-fix: adversary self-woke on time and re-paced (WAITING-UNTIL
19:19:30Z); no new stall reboots.
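
The two cases can be sketched as a single decision function. This is a Python illustration, not the launch.sh code; the function and argument names are hypothetical, while the 180s and 300s values are the ones stated above.

```python
from datetime import datetime, timedelta, timezone

STALL_GRACE = timedelta(seconds=180)  # slack after the marker's stated time
STALL_IDLE = timedelta(seconds=300)   # idle limit when no marker is present


def should_reboot(now, idle_for, waiting_until=None):
    """Decide whether the watchdog should reboot a loop.

    Marker present: reboot only once now is past WAITING-UNTIL + STALL_GRACE.
    No marker: reboot once the loop has been idle for STALL_IDLE or longer.
    """
    if waiting_until is not None:
        return now > waiting_until + STALL_GRACE
    return idle_for >= STALL_IDLE


# Illustrative marker time: a self-wake ~40s late is now inside the grace window.
marker = datetime(2026, 5, 29, 18, 55, tzinfo=timezone.utc)
print(should_reboot(marker + timedelta(seconds=40), timedelta(seconds=400), marker))  # False
```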

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:12:46 +01:00
e8c4330ce3 watchdog: reboot idle-wedged loops via self-reported WAITING-UNTIL markers
The builder wedged at the context limit (garbled output) — alive but matching
none of heal_session's signals (dead/FATAL/limit), so the watchdog left it
stuck. Fix: loops now declare every wait, and the watchdog reboots a wait that
never resumes.

- plan.md §7 + both prompts: cap every wait at 10 min (chunk longer waits);
  before going idle, the loop's FINAL line must be `WAITING-UNTIL: <ISO8601 UTC>`
  (the resume time, matching its ScheduleWakeup); run /compact proactively at
  ~80% context to avoid wedging near the limit.
- launch.sh: new stall_check (runs every 30s signal tick) — reboots a loop idle
  >= STALL_IDLE (300s) when it has NO current WAITING-UNTIL marker as its last
  message OR is past the time the marker named; a healthy paced wait (marker
  present, before its time) is left alone. Complements heal_session's
  dead/FATAL/limit cases. Reboot is safe — loops re-orient from git + STATUS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 19:05:29 +01:00
62b7af7a97 recipe-upgrade: reconcile mirror to upstream main + close merged/superseded PRs
Per operator: an open mirror PR must mean "genuinely still open against true
current upstream main". On every run the recipe-upgrade flow now:
- force-syncs the recipe-maintainers/<recipe> mirror `main` to be IDENTICAL to
  upstream main (origin/main of the abra checkout = coopcloud);
- closes any open mirror PR whose changes are already in upstream main (merged
  upstream, no-op merge detected via `git merge-tree` vs main's tree) — even
  when the recipe is up to date (new `--reconcile-only` mode, run in step 1);
- when opening a new upgrade PR, closes any other still-open PR for that recipe
  (superseded) and opens the new one IN ITS PLACE; same-version re-runs just
  update the existing same-branch PR.
open-recipe-pr.sh gains the --reconcile-only mode + the close logic (with an
auto-close comment naming the reason). upgrade-all reconciles every candidate's
mirror during the survey so merged PRs are closed fleet-wide. Still never merges.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:32:34 +01:00
a8b4b4c39e upgrade-all: pin weekly slot (Sat 03:00 UTC) + defer activation until cc-ci is built
Operator: don't run the weekly upgrade-all while the build loops are still
constructing cc-ci (shared-host contention). Activate the Sat 03:00 UTC
(0 3 * * 6) cron only once the build is complete; on-demand until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:24:40 +01:00
db31c08d6a Add /recipe-upgrade + /upgrade-all skills (cc-ci-gated upgrades, never merge)
Per-recipe and fleet-wide upgrade skills modelled on recipe-maintainer's
recipe-upgrade-full / recipe-upgrade-cron-all, but gated by the cc-ci CI server
and inheriting ci-test-review's create+verify+never-merge discipline.

- recipe-upgrade/: plan (release notes, breaking changes) -> implement (abra
  recipe upgrade + version bump + config, lint) -> open the recipe PR -> VERIFY
  green on cc-ci (full suite cold against the PR head via verify-pr.sh). If the
  upgrade is correct but a cc-ci TEST went stale, also update the test, verify
  it, and open a second PR to recipe-maintainers/cc-ci. Never merges; never
  weakens a test; prefers a recipe-only PR. Emits a parseable RESULT line.
  + open-recipe-pr.sh: adapted recipe-create-pr; runs on cc-ci (has the recipe
    checkout + bot token), creds passed from the orchestrator .testenv;
    force-syncs the mirror main so the PR diff is exactly the upgrade.
- upgrade-all/: weekly fan-out — enumerate enrolled recipes, survey upgrades,
  run /recipe-upgrade per upgradeable recipe via subagent (sequential default,
  --parallel / --dry-run), collect into one PR-list summary. Coordination +
  single-writer + shared-Swarm-teardown guardrails; built for a weekly cron.
- ci-test-review/verify-pr.sh: pass SRC (recipe-maintainers/<recipe>) alongside
  REF so the harness clones the mirror PR head correctly (its real contract).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:19:20 +01:00
27480b3513 Commit the 3r removal + skills-tracking .gitignore (missed in prior 2 commits)
The earlier `git add` included an already-`git rm`'d pathspec, so it errored and
staged nothing — launch.sh (3r removal) and .gitignore (track .claude/skills/)
were left uncommitted while the skill files went in via a separate -f add.
Runtime was already correct (watchdog reads the working-tree launch.sh); this
just syncs git HEAD to the working tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:05:43 +01:00
cbe1406bce ci-test-review: close the loop — author + open + cc-ci-verify fix PRs (never merge)
Per operator: the skill should not just propose, it should CREATE the fix PR
(recipe repo or cc-ci repo) and VERIFY it green on its own CI server — but not
merge. It drives cc-ci like the loops do.

- SKILL.md: diagnose+classify (recipe vs CI-server) -> author the fix + open a
  PR (recipe-create-pr for recipe PRs; Gitea API for cc-ci PRs, dedicated branch
  in a separate clone, single-writer safe) -> VERIFY on cc-ci (full suite cold
  against the PR head = the !testme dogfood path) -> report a verified,
  ready-to-merge PR. Never merges; never weakens a test; flake != bug. General
  bar = one cold green; repeated-green (REPEAT=3) only for a known-flaky recipe.
  Adds coordination/single-writer guardrails (shared Swarm is stateful; tear
  down deploys; never push main or touch the loops' clones).
- verify-pr.sh: deterministic recipe-PR gate — RECIPE + REF -> cold full suite
  on cc-ci, green iff every repeat exits 0. CI-server-PR verification stays
  bespoke (branch checkout + rebuild + regression sample) per SKILL.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:04:35 +01:00
2530845e50 orchestrator: add /ci-test-review skill (in THIS repo) + drop Phase 3r from loops queue
The on-demand AI review layer is now an orchestration-repo skill built directly
by the orchestrator, NOT a loops phase in the cc-ci product repo:

- .claude/skills/ci-test-review/{SKILL.md,run-all-recipes.sh}: runs the real
  cc-ci harness across all enrolled recipes (deterministic, AI-free execution),
  then AI diagnoses each failure and classifies it as needing a recipe PR / a
  CI-server PR / a stale-test update — or reports "ALL PASSED, recipes + tests
  up to date". Proposes PRs; never decides pass/fail; never auto-merges.
- .gitignore: track .claude/skills/ (shareable) while still ignoring local
  claude session state (locks, history) under .claude/.
- launch.sh: remove Phase 3r from PHASES_SPEC; loops sequence back to
  1c 1b 1d 1e 2w 2pc 2 2b 3 4. Deleted plan-phase3r (superseded by the skill).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:57:26 +01:00
5f84f8c028 plan: Phase 3r — /ci-test-review Claude skill (on-demand AI review + recipe-vs-CI PR diagnosis)
Deterministic CI stays the primary, AI-free path. Adds a separate on-demand skill (ships in the
cc-ci repo .claude/skills/ci-test-review/) that runs the full suite across all recipes and, per
failure, AI-diagnoses + classifies: recipe PR (+ proposed change) vs CI-server PR vs stale-test;
or 'all passed, recipes+tests up to date' (incl. a latest-version freshness check). Proposes, never
auto-merges (operator-merge rule). Slotted 3 -> 3r -> 4. AI only diagnoses; execution stays
deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:39:07 +01:00
61ab3ecb3a plan: per-test image pre-pull sub-plan (warm images before deploy + upgrade; cheap on warm cache)
Resolve a recipe's images (docker compose config --images) and docker pull them (skip-if-present for
pinned tags) at the start of the recipe sequence + before the upgrade-new-version deploy, then the
normal abra deploy. Separates pull from converge (clear pull failures vs murky convergence timeouts),
speeds convergence (fits abra-native window). No layer re-download on warm cache; nightly all-recipes
run warms everything. Complements (not replaces) the recipe healthcheck for slow-init convergence.
Near-term Phase-2 harness unit; real abra deploy unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 14:55:21 +01:00