cc-ci-orchestrator

Author	SHA1	Message	Date
autonomic-bot	01874821f2	decommission Pi: update all docs for VM-only setup The orchestrator Pi is retired (2026-05-31). All agents now run on the cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled, no ProxyCommand. Updated across all affected files: AGENTS.md - Remove Pi from reboot description; migration complete (not "parked") - cc-ci access: direct ssh, not via proxy kickoff.md - Prerequisites: direct tailnet peer, not proxy - Host deps: NixOS (not apt) - Fallback/Incus: b1 reachable directly, no --proxy curl flag plan.md §1 + §1.5 - §1 bootstrap: direct SSH, check tailscale status (not restart proxy) - §1.5 intro: "VM" not "sandbox host"; no proxy - Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row - Replace "Tailscale connection (proxy)" subsection with direct-peer description plan-orchestrator-migration.md - Mark COMPLETE (2026-05-31); historical record only plan-phase1c-full-reproducibility.md - Incus access: direct, not via SOCKS proxy prompts/builder.md + prompts/adversary.md - cc-ci access language only: direct ssh, no proxy restart instructions - adversary: *.ci.commoninternet.net via plain curl, no proxy flag REBOOTS.md - Retitle for VM; note Pi retired; Pi entries marked historical systemd/cc-ci-loops.service - User/Group/HOME/PATH: notplants → loops - Remove cc-ci-tailscaled.service dependency (no proxy on VM) - Add note about nix/configuration.nix as the authoritative VM declaration test-e2e-testme-acceptance.md - tailscale status: no --socket flag - ssh to throwaway: no ProxyCommand Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-31 00:16:37 +00:00
autonomic-bot	fd08a977d0	overlay policy: standardize the ccci overlay filename to compose.ccci.yml Operator: use a single uniform filename `compose.ccci.yml` per recipe (one file holding all cc-ci-side deploy tweaks) rather than per-purpose suffixes like compose.ccci-health.yml. Updated §9 + plan-ccci-compose-overlay-policy.md; added a DoD item to rename tests/{ghost,discourse}/compose.ccci-health.yml -> compose.ccci.yml and update their install_steps.sh cp target + recipe_meta COMPOSE_FILE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:25:48 +01:00
autonomic-bot	5f34c0ad01	overlay policy (content): §9 guardrail rewrite + plan-ccci-compose-overlay-policy.md The prior commit only captured the file deletion (git add aborted on the already-removed pathspec). This adds the actual content: the reworked §9 guardrail (justified ccci overlays OK; abra can't env start_period; always test upgrade-to-latest, from-version custom tests skippable) and the new policy doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:19:18 +01:00
autonomic-bot	7a1f7f75aa	Policy: prefer upstream env-parameterization over cc-ci compose overlays Operator (2026-05-30): a cc-ci-authored compose overlay risks silent drift from the recipe users actually run — avoid it wherever possible. - plan.md §9 guardrail: when a recipe needs a cc-ci-env-tuned value (e.g. a longer healthcheck start_period for the slow single node), the preferred fix is an UPSTREAM recipe PR exposing it as an env var (e.g. APP_START_PERIOD) with the current value as the default in env.sample — CI sets the env, no new compose. For making the upgrade tier work from an older base version, prefer DECLARING that version not-testable under this CI env over crafting a custom compose. Overlay = last resort, Adversary-confirmed non-drifting + paired with the env PR. - plan-prefer-env-over-compose-overlay.md: migrates the existing debt — ghost/discourse compose.ccci-health.yml start_period -> APP_START_PERIOD recipe PRs (default=current) then drop the overlays; discourse image re-pin + mumble old-base host-ports copy -> declare those old versions untestable instead of forking compose. No test weakened; untestable-version is an honest outcome. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 15:17:42 +01:00
autonomic-bot	a89b082240	plan §7: recommend Monitor-on-convergence pattern for long deploys (builder's idea) For a long deploy/convergence, arm a Monitor that polls the node every ~30s and wakes on convergence OR failure, with a longer fallback heartbeat (ScheduleWakeup) as a backstop. Proceeds the instant it converges (no over-waiting), surfaces failures promptly, and the heartbeat bounds the wait. Size the timeout sanely (longer if justified, never absurd like the ~40-min ghost case). Credit: builder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:17:18 +01:00
autonomic-bot	e8c4330ce3	watchdog: reboot idle-wedged loops via self-reported WAITING-UNTIL markers The builder wedged at the context limit (garbled output) — alive but matching none of heal_session's signals (dead/FATAL/limit), so the watchdog left it stuck. Fix: loops now declare every wait, and the watchdog reboots a wait that never resumes. - plan.md §7 + both prompts: cap every wait at 10 min (chunk longer waits); before going idle, the loop's FINAL line must be `WAITING-UNTIL: <ISO8601 UTC>` (the resume time, matching its ScheduleWakeup); run /compact proactively at ~80% context to avoid wedging near the limit. - launch.sh: new stall_check (runs every 30s signal tick) — reboots a loop idle >= STALL_IDLE (300s) when it has NO current WAITING-UNTIL marker as its last message OR is past the time the marker named; a healthy paced wait (marker present, before its time) is left alone. Complements heal_session's dead/FATAL/limit cases. Reboot is safe — loops re-orient from git + STATUS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 19:05:29 +01:00
autonomic-bot	7f8e6cb13e	guardrail: abra convergence by default; custom READY_PROBE only when necessary + a real strict test (operator 2026-05-29, re F2-12) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 12:56:26 +01:00
autonomic-bot	b4451527c3	builder: clean-tree-before-claim discipline (git status must be clean — Adversary cold-verifies from git) Cheap guard against the deploy/git divergence: a fix built locally but uncommitted/un-pushed is a guaranteed Adversary cold-build mismatch. Added to the builder prompt claim discipline + plan.md §6.1. (Lighter than binding the deploy to a git rev — iteration speed + the Adversary's cold-from-git verify is the real safety net.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 09:49:09 +01:00
autonomic-bot	ae83a8120d	watchdog: signal handoffs off claim()/review() commit prefixes (robust) + codify the convention Replaces the brittle markdown prose-match ("Gate: … CLAIMED, awaiting Adversary") with detection of the loops' conventional commit prefixes on origin/main: a new `claim(...)` commit pings the Adversary; a new `review(...)` commit pings the Builder. Edge-triggered on the origin/main SHA (append-only — no force-push), no file parsing, can't mis-route. The loops already use these prefixes consistently; codified as a load-bearing contract in plan.md §6.1 + both prompts so it stays reliable. INBOX detection unchanged (pushed-state, file-routed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 03:10:12 +01:00
autonomic-bot	36a6c9872a	orchestrator: reboot-resilience + session auto-resume + full session plan/tooling Reboot survival for the Pi orchestrator host: - systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the orchestrator session. - reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count). - launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed orchestrator announces itself (PushNotification) + reports reboots. - AGENTS.md: on-startup notify routine documented. Plans/tooling accumulated this session: - plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review), sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance. - launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution, limit-stall re-nudge, INBOX side-channel detection. - plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED. - prompts: isolation discipline + INBOX + pacing. - .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, .tmp.). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 20:28:10 +01:00
autonomic-bot	239dfd8e26	Watchdog handoff signalling: ping the waiting loop on gate-claim / verdict (kill double-idle) launch.sh watchdog now runs a fast (~30s) handoff_check alongside the heavy (300s) restart/DONE check: when the Builder writes a CLAIMED gate it pings the Adversary to verify now; when the Adversary updates REVIEW.md it pings the Builder to proceed (edge-triggered, reads local clones). So a pending handoff resolves in <~30s instead of a whole idle interval. Pacing revised: the Adversary may idle freely when nothing's pending (no pointless re-verify/busy-poll) and is woken by the watchdog; Builder waits on the ping + a fallback ~2-4m self-poll. kickoff documents the new "handoff signalling" role. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:15:25 +01:00
autonomic-bot	deca47d9c7	Pacing §7: avoid both-loops-idle during a handoff (short-poll when blocked on the counterpart) Root cause of "both waiting": parked-at-gate was lumped into the long idle sleep, so a pending handoff sat while both loops slept on desynced timers. Fix: three cases — (1) in flight → ~4m; (2) BLOCKED ON THE OTHER LOOP (Builder at CLAIMED gate / Adversary awaiting a fix) → ~4m poll for the counterpart, never long-idle; (3) genuinely nothing pending → ~10-15m. Adversary: a CLAIMED gate is immediate top-priority; otherwise run background probes, rarely idle while Builder is active. Builder: keep an unblocked item in hand to rarely be fully gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:05:55 +01:00
autonomic-bot	8a4a010723	Reduce idle loop cadence 20–30m -> ~10–15m (pick up work sooner) §7 pacing + Builder/Adversary prompts: idle/parked sleep lowered to ~10–15 min so the next unit of work (or a gate claim) is picked up without long gaps. Unchanged: ~4m polling while a build/deploy is in flight; keep polling something clearly in-flight rather than treating it as idle; don't spin on a minutes-long build. Adversary aligned to §7 for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:17:29 +01:00
autonomic-bot	667c7cd5a0	plan §4.2/§4.3: MAX_TESTS via DRONE_RUNNER_CAPACITY + native queue/timeout; teardown after each run Don't overload the single node: cap concurrent test builds at a configurable MAX_TESTS (= DRONE_RUNNER_CAPACITY); Drone natively queues excess builds and times out hung ones, freeing slots — no custom queue. Each run deploys one app then undeploys; the run-start janitor is the backstop for timed-out/killed builds. At most MAX_TESTS apps live at once. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:45:26 +01:00
autonomic-bot	34cbb60f35	plan §4.1/§1.5: polling primary + read-only CI; webhook is optional manual-admin Finalize trigger model per operator: polling is the primary trigger (outbound, read-only, no admin); the server never self-registers webhooks (that needs admin) — webhook is an optional push optimization an admin registers manually, documented in enroll-recipe.md. Commenter auth via org-membership endpoint (read-level), not the admin-only permission endpoint. Bot's required privilege is read + comment + org-membership, never repo-admin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:37:17 +01:00
autonomic-bot	e157a943bb	plan §4.1: commenter auth via /permission endpoint (write+), not the collaborators list The repo's explicit collaborator list is empty — bot and maintainers (trav/notplants) all access via org ownership, so the collaborators check 404s for everyone. Authorize via GET /collaborators/{user}/permission requiring owner/admin/write (matches the builder's fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:20:59 +01:00
autonomic-bot	ef42e3d922	plan §4.1: trigger is webhook-OR-poll (mutually exclusive, flag-selected), + collaborator check Record the trigger design: webhook (default/primary, confirmed working) and polling (kept but disabled behind a flag) are mutually exclusive — only one runs at a time, so no cross-path dedupe. Poll is the fallback when webhook delivery fails. Also note the commenter-auth check must count recipe-maintainers org members/admins, not just repo collaborators (the bot is org admin and was being rejected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:15:32 +01:00
autonomic-bot	4ffcdda9da	plan §9: infra bring-up = declarative idempotent reconciliation, not manual/run-once Strengthen the idempotency guardrail: every infra piece (swarm, traefik recipe deploy, drone, bridge, dashboard) is a systemd oneshot that re-runs each activation/boot and converges to desired state (like swarm-init) — no manual post-steps, no run-once sentinels. Goal: from-scratch install = clone + nixos-rebuild switch + preconditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 22:49:49 +01:00
autonomic-bot	2264e0fa74	plan: use the real coop-cloud/traefik recipe via abra (e2e fidelity), not a custom Traefik Supersedes the original modules/traefik.nix hand-rolled proxy. cc-ci now deploys the coop-cloud/traefik recipe via abra in wildcard/file-provider mode, serving the operator's pre-issued wildcard cert as the recipe's ssl_cert/ssl_key swarm secrets — canonical web/web-secure + proxy/swarm conventions every recipe expects, no ACME, DNS token never on cc-ci. Updated §1, §1.5, §3, §4.0, §4.2, §5 (M1), §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 22:13:04 +01:00
autonomic-bot	8ea3276d20	plan: document recipe mirror+PR flow and bot org scope for enrollment Recipe repos under test live on the private mirror git.autonomic.zone/recipe-maintainers, mirrored from upstream git.coopcloud.tech. autonomic-bot is admin on that org (can create repos + add webhooks). A recipe missing from the mirror is not a blocker — fetch from upstream and open a PR via the recipe-create-pr procedure. Updated D10 (§2) and enrollment (§4.1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:53:27 +01:00
autonomic-bot	bdc78da921	Initial commit: cc-ci autonomous orchestrator Planning + launch + setup material for the cc-ci Co-op Cloud recipe CI server: plan.md (single source of truth), kickoff/launch supervision, and the Builder/Adversary loop prompts. Secrets (.testenv) and runtime dirs are gitignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:46:28 +01:00
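
The handoff signalling described in commit ae83a8120d (a new `claim(...)` commit on origin/main pings the Adversary, a new `review(...)` commit pings the Builder, edge-triggered on the origin/main SHA) could be sketched roughly as below. This is only an illustration in Python, not the actual `launch.sh` code: the `HandoffWatcher` class, its `check` method, and the idea of passing in the subjects of newly seen commits are all hypothetical names and assumptions.

```python
import re

# Conventional commit prefixes named in the commit message.
PREFIX = re.compile(r"^(claim|review)\(")

# Who gets pinged for each prefix (per the commit message).
ROUTES = {"claim": "adversary", "review": "builder"}


class HandoffWatcher:
    """Hypothetical edge-triggered watcher keyed on the origin/main SHA."""

    def __init__(self, initial_sha):
        # Assumption: the SHA seen at startup is a baseline and triggers nothing.
        self.last_sha = initial_sha

    def check(self, head_sha, new_subjects):
        """Return the loops to ping, given the current head SHA and the
        subject lines of commits added since the previous check."""
        if head_sha == self.last_sha:
            return []  # no edge: origin/main has not moved
        self.last_sha = head_sha
        pings = []
        for subject in new_subjects:
            match = PREFIX.match(subject)
            if match:
                target = ROUTES[match.group(1)]
                if target not in pings:
                    pings.append(target)
        return pings
```

Because the decision depends only on the head SHA and the commit subjects, nothing in the repository's markdown files needs to be parsed, which is the robustness property the commit message emphasises.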

21 Commits
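
The `stall_check` rule from commit e8c4330ce3 (reboot a loop that has been idle for at least `STALL_IDLE` seconds when its last message is not a current `WAITING-UNTIL` marker, or when the time the marker named has passed) can be read as a small decision function. The sketch below is a Python illustration under stated assumptions, not the shell code in `launch.sh`: the function names, the way the last line and idle time are supplied, and the treatment of an unparseable timestamp as "no marker" are all hypothetical.

```python
import re
from datetime import datetime, timezone

STALL_IDLE = 300  # seconds, the value given in the commit message

# The loop's final line before idling: WAITING-UNTIL: <ISO8601 UTC>
MARKER = re.compile(r"^WAITING-UNTIL:\s*(\S+)$")


def parse_marker(last_line):
    """Return the resume time named by a WAITING-UNTIL marker, or None."""
    match = MARKER.match(last_line.strip())
    if not match:
        return None
    text = match.group(1)
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    try:
        resume_at = datetime.fromisoformat(text)
    except ValueError:
        return None  # assumption: a garbled timestamp counts as no marker
    if resume_at.tzinfo is None:
        resume_at = resume_at.replace(tzinfo=timezone.utc)
    return resume_at


def should_reboot(last_line, idle_seconds, now):
    """Decide whether the watchdog should reboot an idle loop."""
    if idle_seconds < STALL_IDLE:
        return False  # not idle long enough to judge
    resume_at = parse_marker(last_line)
    if resume_at is None:
        return True  # idle with no declared wait: treat as wedged
    return now > resume_at  # declared wait that never resumed
```

A loop whose marker names a time still in the future is left alone, which matches the "healthy paced wait" case in the commit message; the garbled-output wedge that motivated the change falls into the "no marker" branch.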