Commit Graph

129 Commits

Author SHA1 Message Date
37a422bc31 refactor(wake): thin wake prompt -> points at orchestrator-supervision.md
The hourly wake prompt was hardcoding phase 5 / STATUS-5.md and going stale
as the build advanced. Make it a one-line pointer to a maintained doc
(orchestrator-supervision.md) that looks the CURRENT phase up live via
launch.py status — so the wake prompt never needs editing as phases change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 01:37:32 +00:00
7bdeb74449 plan(regression): add per-tier RED canaries (install/upgrade/backup/restore)
One deliberately-broken custom-html-tiny fixture per lifecycle tier so the
suite proves the server reports RED at EVERY tier (not just one) — each
asserts RED at the intended tier with prior tiers PASS, so it's 'catches a
failure at this tier', not 'fails somewhere'. These canaries are fast (simplest
recipe): they form the fast subset of the suite, versus the slow good canaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 01:28:23 +00:00
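
The per-tier assertion described in 7bdeb74449 might look roughly like the sketch below. The tier order is taken from the commit subject; the `results` mapping and the function name are hypothetical, not the suite's actual API.

```python
# Tier order is an assumption read off the commit subject.
TIERS = ["install", "upgrade", "backup", "restore"]


def red_at_intended_tier(results, intended):
    """Return True only if every tier before `intended` is PASS
    and `intended` itself is RED.

    `results` is a hypothetical mapping of tier name -> "PASS" / "RED".
    """
    idx = TIERS.index(intended)
    prior_pass = all(results.get(tier) == "PASS" for tier in TIERS[:idx])
    return prior_pass and results.get(intended) == "RED"
```

A fixture broken at the upgrade tier would satisfy `red_at_intended_tier({"install": "PASS", "upgrade": "RED"}, "upgrade")`, while a run that already fails at install would not, which is the "catches a failure at this tier" distinction the commit draws.
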
2f9d7df78f ideas: package cc-ci itself as a Co-op Cloud recipe (parked, not implementing)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:43:44 +00:00
ad2ade842c plan(mirror): remove the operator deploy gate — loops deploy+verify autonomously
The gate existed because a wrong-target nixos-rebuild #cc-ci once dropped
the cc-ci server into emergency mode. That footgun is fixed (be4f451 maps
#cc-ci -> the Hetzner host config), and deploying cc-ci is the loops'
normal operation, so Phase 4 now runs autonomously with verify + rollback
as the safety net.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:38:59 +00:00
fd86baea2a plan: regression canaries are milestone-cadence (polish/review/release), not per-commit
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:30:09 +00:00
947e7f55b9 plan: server regression canaries (codified E2E good+bad self-tests)
E2E pytest canaries proving the server confirms a healthy app healthy
(semantic per-tier assertions, not just exit codes) AND catches a broken
one (false-green guard). Good canaries: custom-html-tiny + lasuite-docs;
known-bad fixture must report RED. Queued as the loops' next phase after
mirror-enroll.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:29:01 +00:00
2b617ba19f feat(launch): persist PHASES_SPEC to .phases-spec (status/watchdog/reboot agree)
Mirror the .loop-backend pattern: env wins, else the persisted file, else
the default build sequence. Without this, a custom single-phase run was
invisible to bare 'launch.py status' and would NOT survive a reboot (the
service has no PHASES_SPEC env). Now the current phase set is durable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:17:34 +00:00
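
A minimal sketch of the precedence described in 2b617ba19f (env wins, else the persisted file, else the default), assuming the spec is stored as a single line of text. The function names and the default placeholder are hypothetical; only PHASES_SPEC and .phases-spec come from the commit message.

```python
import os
from pathlib import Path

# Placeholder only: the real default build sequence is not shown in this log.
DEFAULT_PHASES_SPEC = "<default build sequence>"


def resolve_phases_spec(env=None, spec_file=Path(".phases-spec"),
                        default=DEFAULT_PHASES_SPEC):
    """Resolve the phase spec: environment first, then persisted file, then default."""
    env = os.environ if env is None else env
    value = env.get("PHASES_SPEC", "").strip()
    if value:
        return value
    try:
        persisted = spec_file.read_text().strip()
    except OSError:
        # Missing or unreadable file: fall through to the default.
        persisted = ""
    return persisted or default


def persist_phases_spec(value, spec_file=Path(".phases-spec")):
    """Write the spec so invocations without the env var (status, reboot) agree."""
    spec_file.write_text(value + "\n")
```

Persisting at start time is what lets a later bare status call, or a service started without PHASES_SPEC in its environment, see the same phase set.
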
d349656c3b feat(launch): forward PHASES_SPEC/backend to watchdog; mark plan Phase 4 as operator gate
The watchdog is spawned into the existing tmux server and didn't reliably
inherit a custom PHASES_SPEC — it would fall back to the default 11-phase
spec and mis-detect completion. Forward PHASES_SPEC/PHASE_IDX_FILE/
LOOP_BACKEND/LOOP_MODEL explicitly in the watchdog command so custom
single-phase runs (like the mirror-enroll plan) work end-to-end. Also make
the mirror-enroll plan's live-host-deploy step an explicit claim-and-wait
operator gate for the loops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:15:42 +00:00
8007053d94 plan: mirror + enroll ALL recipes before resuming per-recipe debugging
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:13:00 +00:00
e2551f3d79 chore(nix): infra polish — bake cc-ci IP, mark stale Incus config, park nginx vhost
- SSH config: replace REPLACE_WITH_CC_CI_HETZNER_TAILNET_IP placeholder with
  the real tailnet IP 100.95.31.88 (so a fresh re-provision is correct).
- nix/configuration.nix + nix/README.md: mark HISTORICAL/dead (old Incus VM,
  superseded by the Hetzner host) to prevent a wrong-host deploy.
- nginx oc.commoninternet.net vhost: note it's PARKED alongside opencode-web
  (kept for one-step re-enable, not deleted).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:07:05 +00:00
19fda8d2b8 fix(recipe-upgrade): stop auto-closing superseded/unrelated open PRs
Per operator: opening a new upgrade PR should stack ON TOP of any other
still-open PRs, not close them. Only PRs already merged into upstream
main are closed (merging them is a no-op). This prevents the phase-7
incident where an unrelated open ghost PR was auto-closed as 'superseded'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 00:07:05 +00:00
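
The closing rule from 19fda8d2b8 can be sketched as a pure filter. How the real helper decides that a PR is "already merged into upstream main" is not stated in the log; this sketch assumes a set of commit shas known to be in main, and the PR dictionary shape (`number`, `head_sha`) is hypothetical.

```python
def prs_to_close(open_prs, shas_in_main):
    """Return the numbers of open PRs whose head commit is already in upstream main.

    All other open PRs are left alone, so a new upgrade PR stacks on top of them
    instead of closing them as 'superseded'.
    """
    return [pr["number"] for pr in open_prs if pr["head_sha"] in shas_in_main]
```

With this rule an unrelated open PR is never selected for closing, because its head commit is not in main.
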
2304628375 chore(nix): park opencode-web (wantedBy=[]) — loops are on claude now
Keep the unit definition in the flake for easy re-enable; just stop it
auto-starting. Restore wantedBy = [ "multi-user.target" ] to bring it back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 23:32:41 +00:00
d219b0972c journal: BUILD COMPLETE + weekly-upgrade cron cutover to NixOS timer (Sun 02:00 UTC)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 23:26:59 +00:00
ee58027c3e feat(nix): weekly /upgrade-all as a reboot-safe systemd timer (Sun 02:00 UTC)
Replace the boot-fragile busybox-crond-in-tmux (phase 5 §4) with a
systemd service+timer. Service is timer-triggered only (not wantedBy
multi-user.target) so it never runs on boot/activation; mirrors the
cc-ci-loops env fix (CLAUDE_BIN + /home/loops/.local/bin on PATH).
Timer fires Sundays 02:00 UTC, Persistent=true so a missed run (box
down) fires once on next boot. Runs launch-upgrader.py start ->
cc-ci-upgrader agent -> /upgrade-all DEFAULT (opens recipe PRs, never
merges). Activate via nixos-rebuild + retire the old Monday crond after
the phase-5 T0-fire verification completes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:54:52 +00:00
d8f558e987 journal: backend reverted to claude, waker folded into watchdog, boot service fixed
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:48:09 +00:00
2235110e29 journal: phase-5 progress-monitor events (19:04, 19:08)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:46:29 +00:00
1f96eba577 fix(ci-test-review): resolve PR ref to commit sha in verify-pr.sh
Resolve the recipe branch/ref to its head commit sha via the Gitea API
before invoking the cold full-suite run, so the upgrade tier deploys the
exact PR head. From the phase-5 upgrade-flow verification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:46:29 +00:00
ed849096a6 fix(nix): put claude on the cc-ci-loops service PATH so loops start on boot
The service path lacked /home/loops/.local/bin, so launch.py preflight's
which(claude) failed on every boot and the loops never auto-started
(they were restarted by hand). Set CLAUDE_BIN to the standalone CLI's
absolute path and prepend the dir to PATH so that the tmux server, whose
environment every agent session inherits, resolves a bare claude.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:46:29 +00:00
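
The failure mode in ed849096a6 can be illustrated with a small sketch: a lookup of a bare binary name fails until its directory is on the search path. The helper names are hypothetical and this is not the actual launch.py preflight; `shutil.which` is the standard-library lookup.

```python
import os
import shutil


def prepend_path(path_value, directory):
    """Return a PATH string with `directory` first and no duplicate of it later."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory, *parts])


def preflight(binary, path_value):
    """Return the absolute path of `binary` found on `path_value`, or None."""
    return shutil.which(binary, path=path_value)
```

For example, `preflight("claude", prepend_path(os.environ.get("PATH", ""), "/home/loops/.local/bin"))` returns a path only if an executable named claude exists in one of those directories.
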
ca6e68c08d feat(orchestrator): fold hourly supervision wake into the watchdog
The standalone ai-progress-monitor.sh waker pinged a hardcoded
orchestrator session every 15m. Move that into the watchdog loop:
ORCH_WAKE_INTERVAL (default 3600s) types the supervision prompt into
the live orchestrator session, retrying each tick until it lands so a
busy or briefly-absent orchestrator is never interrupted and no hour is
skipped. Delete the now-redundant waker script; the prompt file is now
driven by the watchdog. Reboot-safe by inheritance (the watchdog is
started by cc-ci-loops.service).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:46:20 +00:00
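
The retry-until-it-lands behaviour described in ca6e68c08d could be sketched as follows. ORCH_WAKE_INTERVAL and its 3600s default come from the commit message; the class, the `try_send` callback, and the way the next wake is scheduled are assumptions made for illustration.

```python
class WakeScheduler:
    """Decide, once per watchdog tick, whether to deliver the supervision prompt."""

    def __init__(self, interval=3600.0, start=0.0):
        self.interval = interval          # seconds, cf. ORCH_WAKE_INTERVAL
        self.next_due = start + interval  # first wake is one interval after start

    def tick(self, now, try_send):
        """Call on every tick. `try_send()` must return True if the prompt landed.

        If the wake is due but delivery fails (orchestrator busy or absent),
        the wake stays due and is retried on the next tick.
        """
        if now < self.next_due:
            return False
        if not try_send():
            return False
        self.next_due = now + self.interval
        return True
```

Because a failed delivery leaves `next_due` unchanged, the wake is deferred rather than dropped; the next interval is counted from the moment the prompt actually lands.
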
8f7265e948 feat(orchestrator): wake the live monitor session 2026-06-01 18:51:05 +00:00
9fe9d49cac journal: record Hetzner rescue recovery for cc-ci 2026-06-01 13:55:15 +00:00
9574972f1d feat(skill): add Hetzner server recovery playbook 2026-06-01 13:48:23 +00:00
8093a95184 journal: session 2026-06-01 03:34 UTC handoff (opencode gpt-5.4 visible) 2026-06-01 13:03:51 +00:00
837fed17d2 fix(orchestrator): attach opencode session from orchestrator repo 2026-06-01 13:03:51 +00:00
a896ee9476 fix(testme-on-pr): wait for a fresh cc-ci status update 2026-06-01 13:03:41 +00:00
2486b7c368 fix(ci-test-review): resolve remote cc-ci worktree 2026-06-01 13:03:41 +00:00
dff090e5c8 docs(agents): require append-only push after commits 2026-06-01 12:59:12 +00:00
24bf379b5b feat(assistant): add opencode launcher and phase 6/7 plans 2026-06-01 12:59:03 +00:00
df6ca04611 feat(recipe-upgrade): add stale-test PR helpers 2026-06-01 03:48:05 +00:00
6a6c17f526 fix(launch-orchestrator): opencode uses plain TUI + ping, not run --attach
Same fix as the loops: opencode run --attach exits after one turn;
plain opencode TUI stays alive in tmux. Send startup prompt via
ping_session (Enter) after 8s init wait. Bootstrap points to
JOURNAL.md rather than sending the full prompt inline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 18:30:09 +00:00
2aa3fbda8d journal: session 2026-05-31 18:30 UTC handoff (opencode/deepseek running, phase 5) 2026-05-31 18:27:17 +00:00
3412100240 fix(opencode): all issues from first live run resolved
1. API key: opencode doesn't support env: substitution in apiKey — write
   actual key value to ~/.config/opencode/opencode.jsonc at setup time
   (file is not committed to git; key sourced from .testenv).
2. Permission system: add permission:"allow" to opencode config (equivalent
   to --dangerously-skip-permissions) to avoid interactive prompts.
3. Submit key: opencode TUI uses Enter (return) to submit; Ctrl+S not
   needed. ping_session already uses Enter — keep as is.
4. Startup timing: bump opencode TUI init wait from 4s to 8s so the TUI
   is fully connected to the server before bootstrap is sent.
5. Backend persistence: LOOP_BACKEND/LOOP_MODEL written to .loop-backend /
   .loop-model so the watchdog uses them when restarting dead sessions.

All tested: both builder and adversary sessions alive, deepseek-v4-pro
processing kickoffs via tinfoil inference.tinfoil.sh, no API/permission
errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 18:21:10 +00:00
cd5e645427 fix(opencode): use inference.tinfoil.sh + attach TUI + NO_COLOR
Three fixes discovered during first live run:
- inference host is inference.tinfoil.sh not api.tinfoil.sh (control plane
  only serves /v1/models, not /v1/chat/completions)
- opencode run exits after one turn; switch to opencode attach for the
  persistent TUI, then ping_session sends the kickoff prompt
- NO_COLOR=1 suppresses the first-run interactive theme picker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:56:06 +00:00
bca51071bd refactor: rewrite launchers as Python; add orchestrator JOURNAL.md
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).

launch.py — loops + watchdog:
  Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
  handoff signalling, stall detection, heal_session, heal_orchestrator.
  Cleaner structure: config block → helpers → phase/kickoff/agent/healing/
  handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout.

launch-orchestrator.py — orchestrator session:
  claude path: --resume <id> preserved (conversation survives reboots).
  opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
  the new session; reads JOURNAL.md for context).
  STARTUP_PROMPT updated to reference JOURNAL.md on startup.

launch-upgrader.py — one-shot upgrade job:
  LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
  Both claude and opencode paths supported.

cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
  Persistent across conversation resets. Documents the handoff format and
  carries the current session's summary: migration complete, phase 5 in
  progress (V3/V7 PASS), phase 4 deferred, open items for next session.

AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:50:09 +00:00
e0e5bf6e64 feat: opencode web at oc.commoninternet.net (one server, named sessions)
configuration.nix:
- systemd.services.opencode-web: one shared opencode server on 127.0.0.1:4096,
  EnvironmentFile=/srv/cc-ci/.testenv (TINFOIL_API_KEY), ExecStartPre clears
  stale /tmp/opencode so restarts never fail on the EEXIST race.
- services.nginx: reverse-proxy oc.commoninternet.net → localhost:4096,
  bound to tailscale IP 100.84.190.30 (tailnet-only, plain HTTP).
  DNS: A record oc.commoninternet.net → 100.84.190.30 (operator step).

launch.sh + launch-upgrader.sh:
- Drop per-session ports / OPENCODE_HOST; add OPENCODE_SERVER=http://127.0.0.1:4096.
- opencode backend: agents use `opencode run --attach $OPENCODE_SERVER --title $session`
  so each shows up as a named session in the web UI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:37:03 +00:00
a87d42f491 feat: opencode/tinfoil backend support in all launchers
Adds LOOP_BACKEND=opencode|claude (+ LOOP_MODEL) to launch.sh and
launch-upgrader.sh, enabling the loops/upgrader to run via opencode CLI
against the tinfoil.sh API (deepseek-v4-pro etc.) instead of Claude.

launch.sh:
- LOOP_BACKEND (claude|opencode), LOOP_MODEL env vars
- OPENCODE_BIN, OPENCODE_HOST (tailscale IP), OPENCODE_PORT (per-session)
- start_agent: backend switch — claude path unchanged; opencode starts
  `opencode --hostname <ts-ip> --port <N> run <kickoff>` so the web UI
  is bound to the tailscale interface (tailnet-only observability)
- preflight: validates the right binary per backend
- heal_session / heal_orchestrator: extend active-work detection to
  opencode spinner chars + "Running tool"
- help: shows both backend configs

launch-upgrader.sh:
- UPGRADER_BACKEND / UPGRADER_MODEL (LOOP_BACKEND/LOOP_MODEL override)
- start: same backend switch as launch.sh
- OPENCODE_PORT=4098 (separate from loops 4096/4097)

configuration.nix: note opencode binary location + re-install command.

Tinfoil config: ~/.config/opencode/opencode.jsonc — provider "tinfoil"
with baseURL=https://api.tinfoil.sh/v1, apiKey=env:TINFOIL_API_KEY
(key + TINFOIL_MODEL + TINFOIL_BASE_URL stored in .testenv).
opencode v1.15.13 installed at /home/loops/.local/bin/opencode.

Usage:
  LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro \
    RESUME_PHASE=1 cc-ci-plan/launch.sh start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:21:13 +00:00
6910b197d0 fix(testme-on-pr): read cc-ci/testme context URL not first-status URL
When multiple commit statuses exist (e.g. an Adversary probe + the real run),
the first status in the array may not be the cc-ci run. Filter by context
'cc-ci/testme' to get the correct Drone build URL.
2026-05-31 14:00:02 +00:00
0df57c6d0c fix(open-recipe-pr): replace python3 with jq (cc-ci has jq, not python3) 2026-05-31 13:35:07 +00:00
25fd7407fd launch-upgrader: default model to sonnet (UPGRADER_MODEL)
Adds UPGRADER_MODEL env var (default: sonnet) passed as --model to the
claude invocation. The cron runs the upgrader on Sonnet so it doesn't
consume Opus weekly credits. Override with UPGRADER_MODEL=opus if needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 13:24:53 +00:00
21e7a79f50 orchestrator-hetzner: enable reboot-resilience + record migration
Now the workspace is staged on the Hetzner cpx22 (server 134487234, public
91.98.47.73, tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the
  loops + watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots
  auto-log to REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged
  reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service =
enabled; ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 03:54:17 +00:00
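
The boot_id gating mentioned in 21e7a79f50 can be sketched as an append that is skipped when the current boot is already recorded. This is not the actual reboot-log.sh; the log line format and function name are hypothetical. On Linux the kernel exposes the per-boot identifier at /proc/sys/kernel/random/boot_id.

```python
from pathlib import Path

# Linux exposes a UUID here that changes on every boot.
BOOT_ID_FILE = Path("/proc/sys/kernel/random/boot_id")


def log_reboot_once(boot_id, log_path, timestamp):
    """Append one line for this boot unless `boot_id` is already in the log.

    Returns True if a line was written, False if this boot was already logged.
    """
    existing = log_path.read_text() if log_path.exists() else ""
    if boot_id in existing:
        return False
    with log_path.open("a") as fh:
        # Hypothetical line format; the real REBOOTS.md layout is not shown.
        fh.write(f"- {timestamp} boot_id={boot_id}\n")
    return True
```

Gating on the boot id makes the step safe to run from ExecStartPre on every service start: a restart of the service within the same boot adds nothing to the log.
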
e89f384c24 nix: remove --ssh flag from tailscale (use normal key auth, not tailscale ACL)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 03:02:04 +00:00
73b65af6d6 nix: add all 3 root SSH keys from current orchestrator VM
Includes the operator key (mfowler), the claude-vm key, and the cc-ci-sandbox key.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:40:51 +00:00
497bea8462 nix: add root SSH authorized key to cc-ci-orchestrator-hetzner config
nixos-rebuild removed the infect-provisioned authorized_keys — declare it
explicitly so rebuilds don't lock out root access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:35:37 +00:00
c44b967019 nix: add real cpx22 hardware config from nixos-infect (server 134487234)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:30:29 +00:00
17951b899e terraform: fix server_type to cpx22 (cpx11/cpx21 retired in nbg1); add lock file
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:17:26 +00:00
0103f369ad terraform+nix: Hetzner orchestrator server (cpx11, nixos-infect, cc-ci-orchestrator-hetzner flake host)
Adds terraform/ to provision a Hetzner cpx11 (2 vCPU / 2 GB dedicated AMD / 40 GB NVMe)
for the loops runtime, and a flake + NixOS host config to converge it — replacing the slow
b1 Incus VM. Mirrors the cc-ci server terraform (same nixos-infect pin, same pattern).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 02:11:30 +00:00
4c418765c8 plan: full migrate-cc-ci-to-hetzner (provision cpx32 → benchmark 2 recipes → cutover loops+pipeline+DNS → retire Incus VM); age key is on the VM so no secret-blocker; harden .gitignore for the age key 2026-05-31 02:04:02 +00:00
b25330d3e8 gitignore: ignore .sops/ + age-key files (lost in the repo consolidation; needed before staging the master age key) 2026-05-31 01:22:29 +00:00
102427ab5b plan: full migrate-to-Hetzner (provision → cut over loops → stop old b1 VM); server type cpx31→cpx32
- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green
  !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change),
  (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed.
- plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 01:15:29 +00:00
b98e527656 plan: switch cc-ci cloud terraform from DigitalOcean to Hetzner (cx32 8GB, hcloud provider, nixos-infect + D8 flake flow) 2026-05-31 00:25:05 +00:00