cc-ci-orchestrator

Author	SHA1	Message	Date
autonomic-bot	bca51071bd	refactor: rewrite launchers as Python; add orchestrator JOURNAL.md Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@" All logic lives in the Python scripts (pure stdlib, no deps). launch.py — loops + watchdog: Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog, handoff signalling, stall detection, heal_session, heal_orchestrator. Cleaner structure: config block → helpers → phase/kickoff/agent/healing/ handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout. launch-orchestrator.py — orchestrator session: claude path: --resume <id> preserved (conversation survives reboots). opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients the new session; reads JOURNAL.md for context). STARTUP_PROMPT updated to reference JOURNAL.md on startup. launch-upgrader.py — one-shot upgrade job: LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL. Both claude and opencode paths supported. cc-ci-plan/JOURNAL.md — new orchestrator handoff file: Persistent across conversation resets. Documents the handoff format and carries the current session's summary: migration complete, phase 5 in progress (V3/V7 PASS), phase 4 deferred, open items for next session. AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 17:50:09 +00:00
autonomic-bot	e0e5bf6e64	feat: opencode web at oc.commoninternet.net (one server, named sessions) configuration.nix: - systemd.services.opencode-web: one shared opencode server on 127.0.0.1:4096, EnvironmentFile=/srv/cc-ci/.testenv (TINFOIL_API_KEY), ExecStartPre clears stale /tmp/opencode so restarts never fail on the EEXIST race. - services.nginx: reverse-proxy oc.commoninternet.net → localhost:4096, bound to tailscale IP 100.84.190.30 (tailnet-only, plain HTTP). DNS: A record oc.commoninternet.net → 100.84.190.30 (operator step). launch.sh + launch-upgrader.sh: - Drop per-session ports / OPENCODE_HOST; add OPENCODE_SERVER=http://127.0.0.1:4096. - opencode backend: agents use `opencode run --attach $OPENCODE_SERVER --title $session` so each shows up as a named session in the web UI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 17:37:03 +00:00
autonomic-bot	a87d42f491	feat: opencode/tinfoil backend support in all launchers Adds LOOP_BACKEND=opencode|claude (+ LOOP_MODEL) to launch.sh and launch-upgrader.sh, enabling the loops/upgrader to run via opencode CLI against the tinfoil.sh API (deepseek-v4-pro etc.) instead of Claude. launch.sh: - LOOP_BACKEND (claude|opencode), LOOP_MODEL env vars - OPENCODE_BIN, OPENCODE_HOST (tailscale IP), OPENCODE_PORT (per-session) - start_agent: backend switch — claude path unchanged; opencode starts `opencode --hostname <ts-ip> --port <N> run <kickoff>` so the web UI is bound to the tailscale interface (tailnet-only observability) - preflight: validates the right binary per backend - heal_session / heal_orchestrator: extend active-work detection to opencode spinner chars + "Running tool" - help: shows both backend configs launch-upgrader.sh: - UPGRADER_BACKEND / UPGRADER_MODEL (LOOP_BACKEND/LOOP_MODEL override) - start: same backend switch as launch.sh - OPENCODE_PORT=4098 (separate from loops 4096/4097) configuration.nix: note opencode binary location + re-install command. Tinfoil config: ~/.config/opencode/opencode.jsonc — provider "tinfoil" with baseURL=https://api.tinfoil.sh/v1, apiKey=env:TINFOIL_API_KEY (key + TINFOIL_MODEL + TINFOIL_BASE_URL stored in .testenv). opencode v1.15.13 installed at /home/loops/.local/bin/opencode. Usage: LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro \ RESUME_PHASE=1 cc-ci-plan/launch.sh start Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 17:21:13 +00:00
autonomic-bot	db375bcc07	rename to cc-ci-orchestrator: update all repo name references Gitea repos renamed: cc-ci-autonomous-orchestrator → cc-ci-orchestrator cc-ci-orchestrator → archived-cc-ci-orchestrator Updated in this workspace: - README.md, AGENTS.md: repo title - cc-ci-plan/plan-orchestrator-migration.md: cc-ci-autonomous-orchestrator refs - cc-ci-plan/plan-repo-consolidation.md: marked complete + Pi remote-update notice - cc-ci-plan/launch-orchestrator.sh, launch.sh: session naming comment cleanup NOTE: Pi clone still has the old origin URL. On the Pi, run: git remote set-url origin https://git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator.git Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 00:03:11 +00:00
autonomic-bot	fffd83fe4b	launch.sh: use CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS env var when running as root (VM uses root; --dangerously-skip-permissions flag blocked by claude for root)	2026-05-30 19:36:35 +01:00
autonomic-bot	71a4a1fea4	Reliable loop messaging: msg-loop.sh + hardened ping_session (retry submit) tmux `send-keys -l <long msg>` often leaves the text UNSENT in the input box (the immediate Enter is swallowed while the TUI ingests the paste). Both now type the message then retry Enter/C-m until the leading text is no longer in the input box (= submitted) or a bounded loop gives up. - msg-loop.sh: standalone reliable messenger for orchestrator use. - launch.sh ping_session: same retry-submit (loads on next watchdog restart). Live-tested: delivered first try. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 15:31:28 +01:00
autonomic-bot	bf71420106	Add cc-ci-upgrader agent: observable one-shot weekly upgrade-run agent The weekly upgrade run now executes inside a dedicated, remote-control agent (cc-ci-upgrader) — viewable/steerable at claude.ai/code like the Builder — rather than buried in headless cron output. - launch-upgrader.sh: spins up the cc-ci-upgrader tmux session under --remote-control with a kickoff that runs /upgrade-all (DEFAULT mode) to completion. On finish the agent STOPS and stays idle (does NOT self-terminate) so the run + summary stay reviewable in the web UI. `start` = use-or-create: leaves an in-flight (busy) run alone, else clears a finished/idle/wedged session and runs fresh; `fresh` always restarts. UPGRADER_ARGS passes flags (e.g. --dry-run); never --with-tests. - launch.sh: orchestrator_alive() now also skips the cc-ci-upgrader remote-control name, so the upgrader job isn't mistaken for the orchestrator. - upgrade-all skill: documents it runs as the cc-ci-upgrader agent; the weekly cron invokes `launch-upgrader.sh start` (not /upgrade-all inline). - Phase 5: V8a verifies the agent lifecycle (launch → run to completion → stay idle/viewable → next start clears it); V9 stops the verification session. - cron memory: weekly task = launch-upgrader.sh start at 0 3 * * 6 UTC. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 21:12:47 +01:00
autonomic-bot	4f74676c72	Phase 5 (final): verify the /recipe-upgrade + testme-on-pr.sh end-to-end flow Appended as the LAST phase in the launcher sequence (… 3 4 5). It can only run once cc-ci is fully built — the !testme-on-recipe-PR flow depends on Phase 3 (results UX) surfacing the run result back on the PR for testme-on-pr.sh to read. DoD (Adversary cold-verifies): !testme on a recipe PR is the real gate + results land in the PR (V1); testme-on-pr.sh reads GREEN/RED/PENDING + BUILD url, POST=0 polls without re-triggering (V2); /recipe-upgrade default end-to-end green on a sandbox recipe, nothing merged (V3); the ≤3 !testme regression loop (V4); stale test DEFAULT = comment-only, no test edit (V5); --with-tests opens+verifies a cc-ci test PR, paired (V6); mirror reconcile closes merged/superseded PRs and main==upstream (V7); /upgrade-all default dry-run + small live run never edits tests (V8); all verification PRs closed + deploys torn down (V9). Use a sandbox recipe; never merge; never weaken tests. Watchdog reloaded (seq …3 4 5). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 20:38:39 +01:00
autonomic-bot	c7da03fa6c	watchdog: STALL_GRACE so stall_check never races a loop's own ScheduleWakeup Root cause of the adversary "overrun": stall_check rebooted the instant now >= WAITING-UNTIL (zero grace), but the loop's own ScheduleWakeup fires AT that stated time — and the runtime scheduled it ~40s later than the marker (date-vs-scheduler skew). So the watchdog pre-empted a HEALTHY self-wake by ~37s; the loop wasn't wedged, it was killed just before it woke. That was the single false reboot at 18:55Z. Fix: split the two cases cleanly. - Marker present: reboot only when now > WAITING-UNTIL + STALL_GRACE (180s) — covers wake+start latency + marker/scheduler skew, so the watchdog only fires if the self-wake GENUINELY failed. - No marker: unchanged — reboot when idle >= STALL_IDLE (300s). Verified post-fix: adversary self-woke on time and re-paced (WAITING-UNTIL 19:19:30Z); no new stall reboots. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 20:12:46 +01:00
autonomic-bot	e8c4330ce3	watchdog: reboot idle-wedged loops via self-reported WAITING-UNTIL markers The builder wedged at the context limit (garbled output) — alive but matching none of heal_session's signals (dead/FATAL/limit), so the watchdog left it stuck. Fix: loops now declare every wait, and the watchdog reboots a wait that never resumes. - plan.md §7 + both prompts: cap every wait at 10 min (chunk longer waits); before going idle, the loop's FINAL line must be `WAITING-UNTIL: <ISO8601 UTC>` (the resume time, matching its ScheduleWakeup); run /compact proactively at ~80% context to avoid wedging near the limit. - launch.sh: new stall_check (runs every 30s signal tick) — reboots a loop idle >= STALL_IDLE (300s) when it has NO current WAITING-UNTIL marker as its last message OR is past the time the marker named; a healthy paced wait (marker present, before its time) is left alone. Complements heal_session's dead/FATAL/limit cases. Reboot is safe — loops re-orient from git + STATUS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 19:05:29 +01:00
autonomic-bot	27480b3513	Commit the 3r removal + skills-tracking .gitignore (missed in prior 2 commits) The earlier `git add` included an already-`git rm`'d pathspec, so it errored and staged nothing — launch.sh (3r removal) and .gitignore (track .claude/skills/) were left uncommitted while the skill files went in via a separate -f add. Runtime was already correct (watchdog reads the working-tree launch.sh); this just syncs git HEAD to the working tree. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:05:43 +01:00
autonomic-bot	5f84f8c028	plan: Phase 3r — /ci-test-review Claude skill (on-demand AI review + recipe-vs-CI PR diagnosis) Deterministic CI stays the primary, AI-free path. Adds a separate on-demand skill (ships in the cc-ci repo .claude/skills/ci-test-review/) that runs the full suite across all recipes and, per failure, AI-diagnoses + classifies: recipe PR (+ proposed change) vs CI-server PR vs stale-test; or 'all passed, recipes+tests up to date' (incl. a latest-version freshness check). Proposes, never auto-merges (operator-merge rule). Slotted 3 -> 3r -> 4. AI only diagnoses; execution stays deterministic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 16:39:07 +01:00
autonomic-bot	0352cb5607	plan: Phase 2pc — image pull-through cache + sane prune policy (front-loaded perf interjection) Operator-directed (2026-05-29): front-load the two EVIDENCE-BASED image wins before grinding the remaining deploy-heavy recipes — Phase 2 pauses, 2pc runs, Phase 2 resumes (seq: …2w 2pc 2 2b 3 4). PC1: conservative prune (no reflexive `prune -af`, never mid-run, keep base images) — kills the documented prune→re-pull→rate-limit churn. PC2: local registry:2 pull-through cache for docker.io, PAT-authenticated, Nix-reconciled, daemon registry-mirror → transparent to abra/swarm; subsequent pulls (across recipes/runs/post-prune) are local → faster deploys + rate-limit gone. Bounded scope: these two only; concurrency/readiness-tuning stay in measurement-driven 2b. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 09:20:30 +01:00
autonomic-bot	ae83a8120d	watchdog: signal handoffs off claim()/review() commit prefixes (robust) + codify the convention Replaces the brittle markdown prose-match ("Gate: … CLAIMED, awaiting Adversary") with detection of the loops' conventional commit prefixes on origin/main: a new `claim(...)` commit pings the Adversary; a new `review(...)` commit pings the Builder. Edge-triggered on the origin/main SHA (append-only — no force-push), no file parsing, can't mis-route. The loops already use these prefixes consistently; codified as a load-bearing contract in plan.md §6.1 + both prompts so it stays reliable. INBOX detection unchanged (pushed-state, file-routed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 03:10:12 +01:00
autonomic-bot	e0e60bc2bc	watchdog: fix handoff lag — detect on pushed origin/main + precise formal-claim match The handoff pings fired on the writer's LOCAL working-tree write (before push), so the receiver pulled a stale origin/main, saw "no formal gate", and a clarifying inbox round-trip ensued (several minutes + wasted turns per handoff). And the gate-id parser read "WC1" as "C1" and could fire on prose mentions. Fix (1): handoff_check now `git fetch`es and reads origin/main (what the receiver will pull), via _wd_fetch_origin + _wd_show_pushed, for STATUS / REVIEW / both INBOXes — a ping only fires once the claim/verdict is actually pushed, so the receiver's pull always sees it. Eliminates the stale-pull "premature" dance. Fix (2): gate-claim detection matches ONLY a formal line (Gate: <id> … CLAIMED, awaiting Adversary) and edge-triggers on a genuinely-new such line compared whole — no firing on historical "CLAIMED detail" lines or prose; gate-id is a best-effort label only. Loops' clones have a credential helper (reads .testenv) so the watchdog's fetch works non-interactively. Verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 02:38:47 +01:00
autonomic-bot	a2728eec2d	plan: Phase 2w — warm canonical deployments + --quick CI mode (interjected into Phase 2) Operator-directed: pause Phase 2, build the warm-data + --quick system, then resume Phase 2. - live-warm keycloak (SSO dep, realm-per-run), data-warm canonicals (undeploy keeps volume), cold = authoritative default. --quick reattaches the canonical, upgrades to PR head, asserts, and rolls back to the last-known-good snapshot on failure (never loses working data). - known-good = raw volume copy taken while undeployed (consistent), one per app, advanced ONLY by green cold runs; a nightly full-cold sweep refreshes canonicals + is a daily regression run. - launch.sh: insert 2w at the current index (Phase 2 -> resumes after 2w DONE); seq is now 1c 1b 1d 1e 2w 2 2b 3 4. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 23:04:33 +01:00
autonomic-bot	11a2ce652d	watchdog: self-heal FATAL session-state errors + supervise the orchestrator - heal_session: detect the unrecoverable "thinking/redacted_thinking blocks cannot be modified" 400 (recurs every turn, session stays alive so the dead-check misses it) and kill+restart the loop fresh (re-orients from repo). Consolidates the dead/fatal/limit handling for builder+adversary. - heal_orchestrator: keep the orchestrator alive too, conflict-safe. Restarts via launch-orchestrator.sh ONLY when no orchestrator is alive anywhere — liveness detects both a managed cc-ci-orchestrator tmux session AND a hand-launched terminal session (any non-loop claude), so it never double-resumes the conversation (the likely cause of the thinking-block crashes). Kill+restart if the managed session is wedged on the FATAL error. Toggle: WATCH_ORCHESTRATOR=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 21:09:21 +01:00
autonomic-bot	36a6c9872a	orchestrator: reboot-resilience + session auto-resume + full session plan/tooling Reboot survival for the Pi orchestrator host: - systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the orchestrator session. - reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count). - launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed orchestrator announces itself (PushNotification) + reports reboots. - AGENTS.md: on-startup notify routine documented. Plans/tooling accumulated this session: - plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review), sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance. - launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution, limit-stall re-nudge, INBOX side-channel detection. - plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED. - prompts: isolation discipline + INBOX + pacing. - .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, .tmp.). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 20:28:10 +01:00
autonomic-bot	5681438b0f	launch.sh fix: don't let an empty-match grep kill the watchdog (set -e + pipefail) handoff_check's now="$(grep CLAIMED.*awaiting ... )" returned non-zero when a phase's STATUS has no claimed-awaiting lines yet (normal early in a phase); under set -euo pipefail that assignment exited the whole watchdog. Append `|| true` to the now= and cur= command substitutions. Verified: watchdog survives the handoff tick on a freshly-created STATUS-1c.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:09:01 +01:00
autonomic-bot	994e52c101	launch.sh: phase-aware sequencer (run 1c -> auto-transition 1b -> stop for manual gate) Make the launcher drive an ordered phase sequence (default 1c then 1b). Each phase has its own plan + phase-namespaced loop-state files (STATUS-<id>.md/BACKLOG/REVIEW/JOURNAL); the watchdog auto-transitions when the current phase's STATUS-<id>.md shows ## DONE, and STOPS after the last phase (writes SEQUENCE-COMPLETE, exits) as a manual gate before Phase 2. start_agent injects a phase preamble (source-of-truth = phase plan; phase-namespaced state) ahead of the base role prompt. DONE detection reads the builder's local clone (reliable, no push-lag). Handoff signalling + resilience preserved and made phase-scoped (reset baseline on transition). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:00:51 +01:00
autonomic-bot	e68a520d4c	Fix watchdog false gate-ping: edge-trigger on NEW claimed-awaiting gate ids, baseline silently The Adversary got a spurious "gate CLAIMED" ping: STATUS.md keeps historical "Gate: Mn — CLAIMED, awaiting Adversary" lines after they PASS, and on watchdog restart the first observation pinged on those already-passed lines. Now track the SET of gate ids on CLAIMED-awaiting lines and ping only when an id NEWLY appears vs the prior observation, after a silent baseline. A gate passing (line kept) or evidence edits don't re-ping; restart re-baselines without pinging. Verified: watchdog restart no longer pings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:25:09 +01:00
autonomic-bot	649b90b586	launch.sh: resolve script to absolute path (SELF) so the watchdog re-invokes correctly Bug: start_watchdog used $0, which breaks when launch.sh is called by a relative path (the watchdog tmux session cd's into PLAN_DIR, so a relative $0 no longer resolves — "No such file or directory", watchdog dies instantly). Resolve BASH_SOURCE to an absolute SELF once and use it for the watchdog self-invocation. Verified: watchdog now starts and its handoff_check immediately pinged the Adversary about a standing CLAIMED gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:16:54 +01:00
autonomic-bot	239dfd8e26	Watchdog handoff signalling: ping the waiting loop on gate-claim / verdict (kill double-idle) launch.sh watchdog now runs a fast (~30s) handoff_check alongside the heavy (300s) restart/DONE check: when the Builder writes a CLAIMED gate it pings the Adversary to verify now; when the Adversary updates REVIEW.md it pings the Builder to proceed (edge-triggered, reads local clones). So a pending handoff resolves in <~30s instead of a whole idle interval. Pacing revised: the Adversary may idle freely when nothing's pending (no pointless re-verify/busy-poll) and is woken by the watchdog; Builder waits on the ping + a fallback ~2-4m self-poll. kickoff documents the new "handoff signalling" role. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:15:25 +01:00
autonomic-bot	bdc78da921	Initial commit: cc-ci autonomous orchestrator Planning + launch + setup material for the cc-ci Co-op Cloud recipe CI server: plan.md (single source of truth), kickoff/launch supervision, and the Builder/Adversary loop prompts. Secrets (.testenv) and runtime dirs are gitignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:46:28 +01:00
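
The stall rule described in commits e8c4330ce3 and c7da03fa6c can be sketched roughly as follows. This is an illustration only, not the code in launch.sh or launch.py: the function names `parse_waiting_until` and `should_reboot` are hypothetical, and the sketch assumes the marker is the last non-empty line of the captured output and that a marker without a timezone means UTC. Only the `WAITING-UNTIL: <ISO8601 UTC>` format and the 180s / 300s values come from the commit messages.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

STALL_GRACE = timedelta(seconds=180)  # value stated in c7da03fa6c
STALL_IDLE = timedelta(seconds=300)   # value stated in e8c4330ce3

# Hypothetical pattern for the loop's final line.
MARKER_RE = re.compile(r"^WAITING-UNTIL:\s*(\S+)\s*$")


def parse_waiting_until(pane_text: str) -> Optional[datetime]:
    """Return the resume time if the last non-empty line is a marker."""
    lines = [ln.strip() for ln in pane_text.splitlines() if ln.strip()]
    if not lines:
        return None
    match = MARKER_RE.match(lines[-1])
    if not match:
        return None
    # Normalise a trailing "Z" so older Pythons can parse the timestamp.
    raw = match.group(1).replace("Z", "+00:00")
    try:
        stamp = datetime.fromisoformat(raw)
    except ValueError:
        return None
    if stamp.tzinfo is None:
        # Assumption: a marker without an offset is meant as UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def should_reboot(now: datetime, idle_for: timedelta,
                  waiting_until: Optional[datetime]) -> bool:
    """Apply the two cases from c7da03fa6c."""
    if waiting_until is not None:
        # Marker present: fire only once the self-wake is clearly overdue.
        return now > waiting_until + STALL_GRACE
    # No marker: fall back to the plain idle threshold.
    return idle_for >= STALL_IDLE
```

Under this sketch, a watchdog tick 37 seconds after the stated resume time leaves the loop alone, which is the false-reboot case the grace period was added to prevent.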

24 Commits
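
Commit ae83a8120d routes handoff pings off the `claim(...)` and `review(...)` commit prefixes, edge-triggered on the origin/main SHA. A minimal sketch of that routing, under stated assumptions, might look like the following. The class `HandoffDetector` and function `route_for_subject` are hypothetical names, the caller is assumed to supply the fetched SHA and the subject of the newest commit, and the silent first observation is borrowed from the baseline behaviour described in e68a520d4c rather than confirmed for this version. The sketch inspects only the newest commit, so several commits pushed between two ticks would need extra handling.

```python
import re
from typing import Optional

# Conventional prefixes named in ae83a8120d and who gets pinged for each.
PREFIX_RE = re.compile(r"^(claim|review)\(")
ROUTES = {"claim": "adversary", "review": "builder"}


def route_for_subject(subject: str) -> Optional[str]:
    """Map a commit subject to the loop that should be pinged, if any."""
    match = PREFIX_RE.match(subject)
    return ROUTES[match.group(1)] if match else None


class HandoffDetector:
    """Edge-trigger on the pushed SHA: ping once per new commit."""

    def __init__(self) -> None:
        self.last_sha: Optional[str] = None

    def observe(self, sha: str, subject: str) -> Optional[str]:
        first_observation = self.last_sha is None
        changed = sha != self.last_sha
        self.last_sha = sha
        if first_observation or not changed:
            # Assumption: baseline silently, and never re-ping an old SHA.
            return None
        return route_for_subject(subject)
```

A watchdog tick would call `observe()` after each fetch and send a ping only when it returns a loop name.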