cc-ci-orchestrator

Author	SHA1	Message	Date
autonomic-bot	11a2ce652d	watchdog: self-heal FATAL session-state errors + supervise the orchestrator - heal_session: detect the unrecoverable "thinking/redacted_thinking blocks cannot be modified" 400 (recurs every turn, session stays alive so the dead-check misses it) and kill+restart the loop fresh (re-orients from repo). Consolidates the dead/fatal/limit handling for builder+adversary. - heal_orchestrator: keep the orchestrator alive too, conflict-safe. Restarts via launch-orchestrator.sh ONLY when no orchestrator is alive anywhere — liveness detects both a managed cc-ci-orchestrator tmux session AND a hand-launched terminal session (any non-loop claude), so it never double-resumes the conversation (the likely cause of the thinking-block crashes). Kill+restart if the managed session is wedged on the FATAL error. Toggle: WATCH_ORCHESTRATOR=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 21:09:21 +01:00
autonomic-bot	36a6c9872a	orchestrator: reboot-resilience + session auto-resume + full session plan/tooling Reboot survival for the Pi orchestrator host: - systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the orchestrator session. - reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count). - launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed orchestrator announces itself (PushNotification) + reports reboots. - AGENTS.md: on-startup notify routine documented. Plans/tooling accumulated this session: - plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review), sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance. - launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution, limit-stall re-nudge, INBOX side-channel detection. - plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED. - prompts: isolation discipline + INBOX + pacing. - .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, .tmp.). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 20:28:10 +01:00
autonomic-bot	5681438b0f	launch.sh fix: don't let an empty-match grep kill the watchdog (set -e + pipefail) handoff_check's now="$(grep CLAIMED.*awaiting ... )" returned non-zero when a phase's STATUS has no claimed-awaiting lines yet (normal early in a phase); under set -euo pipefail that assignment exited the whole watchdog. Append `|| true` to the now= and cur= command substitutions. Verified: watchdog survives the handoff tick on a freshly-created STATUS-1c.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:09:01 +01:00
autonomic-bot	782a3c7360	Phase-1c: true verification = Adversary deletes the throwaway VM, creates a fresh one, full install Strengthen C4/W5: the genuine reproducibility proof is a clean-room repeat — the Adversary destroys any existing throwaway VM, creates a brand-new blank VM, and runs the entire install from scratch per docs/install.md so nothing from the Builder's setup attempt can mask a gap. Cold, with logged evidence (VM id, exact install commands, convergence + TLS-from-git-cert). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:05:54 +01:00
autonomic-bot	994e52c101	launch.sh: phase-aware sequencer (run 1c -> auto-transition 1b -> stop for manual gate) Make the launcher drive an ordered phase sequence (default 1c then 1b). Each phase has its own plan + phase-namespaced loop-state files (STATUS-<id>.md/BACKLOG/REVIEW/JOURNAL); the watchdog auto-transitions when the current phase's STATUS-<id>.md shows ## DONE, and STOPS after the last phase (writes SEQUENCE-COMPLETE, exits) as a manual gate before Phase 2. start_agent injects a phase preamble (source-of-truth = phase plan; phase-namespaced state) ahead of the base role prompt. DONE detection reads the builder's local clone (reliable, no push-lag). Handoff signalling + resilience preserved and made phase-scoped (reset baseline on transition). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:00:51 +01:00
autonomic-bot	9d13bb0b58	Reorder: Phase 1c before 1b (refactor first, then review/lint + full re-verify) 1c (full git reproducibility: cc-ci-secrets split, cert-in-sops, genuine D8 live rebuild) now runs before 1b. This way 1b's review/lint and its final cold re-verification of all D1-D10 cover the final refactored state (incl. the secrets split) and the genuine post-1c D8 — rather than reviewing pre-refactor code and re-verifying a flawed D8. Updated status lines in 1b/1c and the README ordering. Sequence: 1 -> 1c -> 1b -> 2 -> 2b -> 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:51:07 +01:00
autonomic-bot	c6d27b251a	Phase-1c: split only secrets into a separate cc-ci-secrets repo; base stays parameterized Per operator: the split boundary is secrecy, not modularity. Only the sops-encrypted secrets (incl. the wildcard cert) move to a separate private repo `cc-ci-secrets` (extra access-control layer), consumed by the base via a flake input. Instance non-secret vars (domain, gateway, recipients) stay in the well-parameterized base cc-ci repo — another admin repoints by editing params, no second config repo. Guardrail reworded: instance vars in base are fine; only plaintext SECRETS must never leak into base/store. Updated model/C1/C2/W2/§6/§7 + README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:47:57 +01:00
autonomic-bot	769dfd0c62	Phase-1c: resource plan -> 4GB/4GB under a 12GB guideline (not 2GB) Per operator: don't downsize cc-nix-test to 2GB. Instead raise the terraform-ci running-RAM guideline to ~12GB (it's doc-only — the project has no enforced limits.memory; b1 is 16GB), resize cc-nix-test 6->4GB, and create the throwaway VM at 4GB (4+4+lichen 4 = 12 <= 16). Updated W1/W3/C6/§4 and the incus memory note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:29:37 +01:00
autonomic-bot	d41a76f757	Add Phase-1c plan: full git reproducibility (secrets+cert in sops) + genuine D8 live rebuild D8's throwaway-VM live rebuild was wrongly marked "infeasible by design" — the master recovery age key defeats the sops-host-key reason, DNS/cert is a precondition not a rebuild blocker, and Incus was available. Phase 1c (loop-driven): (A) make the VM fully reproducible from git including ALL secrets — move the wildcard cert + every secret into sops-in-git, split generic base repo from a private instance repo composed via a flake input (the only out-of-band secret is the bootstrap age key); (B) actually perform + cold-verify a blank-VM nixos-rebuild and rewrite D8 honestly. Resize cc-nix-test to 2GB first to free b1 headroom for a sized throwaway VM; destroy it after; restore/promote sizing. Gandi token stays out of repo/agent (only the cert artifact is committed). Linked README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 15:24:48 +01:00
autonomic-bot	e68a520d4c	Fix watchdog false gate-ping: edge-trigger on NEW claimed-awaiting gate ids, baseline silently The Adversary got a spurious "gate CLAIMED" ping: STATUS.md keeps historical "Gate: Mn — CLAIMED, awaiting Adversary" lines after they PASS, and on watchdog restart the first observation pinged on those already-passed lines. Now track the SET of gate ids on CLAIMED-awaiting lines and ping only when an id NEWLY appears vs the prior observation, after a silent baseline. A gate passing (line kept) or evidence edits don't re-ping; restart re-baselines without pinging. Verified: watchdog restart no longer pings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:25:09 +01:00
autonomic-bot	649b90b586	launch.sh: resolve script to absolute path (SELF) so the watchdog re-invokes correctly Bug: start_watchdog used $0, which breaks when launch.sh is called by a relative path (the watchdog tmux session cd's into PLAN_DIR, so a relative $0 no longer resolves — "No such file or directory", watchdog dies instantly). Resolve BASH_SOURCE to an absolute SELF once and use it for the watchdog self-invocation. Verified: watchdog now starts and its handoff_check immediately pinged the Adversary about a standing CLAIMED gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:16:54 +01:00
autonomic-bot	239dfd8e26	Watchdog handoff signalling: ping the waiting loop on gate-claim / verdict (kill double-idle) launch.sh watchdog now runs a fast (~30s) handoff_check alongside the heavy (300s) restart/DONE check: when the Builder writes a CLAIMED gate it pings the Adversary to verify now; when the Adversary updates REVIEW.md it pings the Builder to proceed (edge-triggered, reads local clones). So a pending handoff resolves in <~30s instead of a whole idle interval. Pacing revised: the Adversary may idle freely when nothing's pending (no pointless re-verify/busy-poll) and is woken by the watchdog; Builder waits on the ping + a fallback ~2-4m self-poll. kickoff documents the new "handoff signalling" role. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:15:25 +01:00
autonomic-bot	deca47d9c7	Pacing §7: avoid both-loops-idle during a handoff (short-poll when blocked on the counterpart) Root cause of "both waiting": parked-at-gate was lumped into the long idle sleep, so a pending handoff sat while both loops slept on desynced timers. Fix: three cases — (1) in flight → ~4m; (2) BLOCKED ON THE OTHER LOOP (Builder at CLAIMED gate / Adversary awaiting a fix) → ~4m poll for the counterpart, never long-idle; (3) genuinely nothing pending → ~10-15m. Adversary: a CLAIMED gate is immediate top-priority; otherwise run background probes, rarely idle while Builder is active. Builder: keep an unblocked item in hand to rarely be fully gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 06:05:55 +01:00
autonomic-bot	8a4a010723	Reduce idle loop cadence 20–30m -> ~10–15m (pick up work sooner) §7 pacing + Builder/Adversary prompts: idle/parked sleep lowered to ~10–15 min so the next unit of work (or a gate claim) is picked up without long gaps. Unchanged: ~4m polling while a build/deploy is in flight; keep polling something clearly in-flight rather than treating it as idle; don't spin on a minutes-long build. Adversary aligned to §7 for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:17:29 +01:00
autonomic-bot	3d198c8c17	Phase-1b: require full cold-start re-verification of all Phase-1 D1–D10 as the final gate RL3 strengthened: after lint/review findings are responded to and fixed, the Adversary independently re-verifies EVERY Phase-1 Definition-of-Done item (D1–D10) from a cold start to the same bar as Phase 1's own DONE (fresh PASS + evidence in REVIEW.md), proving the cleanup regressed nothing. 1b cannot be DONE until all D1–D10 are re-confirmed green post-cleanup. Method/W2 updated to make the ordering explicit (tooling -> fixes -> re-verify). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:15:34 +01:00
autonomic-bot	5d90cbd576	Add Phase-1b plan: bounded review & lint pass at the end of Phase 1 Before scaling to many recipes: (1) deterministic style/hygiene via linters/formatters (alejandra/statix/deadnix, ruff, shellcheck/shfmt) wired as a .drone.yml stage so commits stay clean; (2) a white-box review checklist with teeth (real tests not health-only/skipped, DRY harness, Nix-declared idempotent bring-up, no footguns/secrets-in-code, architecture matches plan) — blocking fixed, advisory triaged. Bounded pass; never weaken a test for a nit. Phase 2 now follows 1b. Linked in README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:11:06 +01:00
autonomic-bot	2d3c17f4bd	Add Phase-2b plan: test performance (measure, attribute, improve empirically) Phase 2b (after Phase 2, before Phase 3): instrument per-phase timings, baseline a representative recipe set (cold vs warm), attribute where time goes (Pareto), then try improvements as controlled before/after experiments and keep measured winners — image pull cache/pre-pull, readiness-wait tuning, dedup deploy cycles, warm/shared infra (isolation-proven), runner caching, concurrency sizing, vCPU. Speed never weakens tests or isolation (Adversary re-measures + re-verifies). Phase 3 now follows 2b. Linked in README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 04:26:27 +01:00
autonomic-bot	7c77aec7ab	Add Phase-3 plan: beautiful YunoHost-style results (levels + image comment + dashboard) Phase 3 (after Phase-2 DONE, manual transition): compute a per-run quality LEVEL, post an image-forward Gitea PR comment in the YunoHost shape (marker + status/level badge + a rendered summary card containing a real app screenshot, linking to the run), and polish the overview dashboard to a ci-apps.yunohost.org look/feel with per-recipe level badges + screenshots. Reuses the Phase-1 dashboard/bridge/Playwright; presentation never changes the verdict; no secrets in any artifact; cosmetics never block the pipeline. Linked from README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 04:01:21 +01:00
autonomic-bot	781f9fd91f	Phase-2 plan: harden Adversary mandate — no skipped tests / corners cut Add §7.1 Adversary mandate: default assumption is everything meaningful is testable (OIDC/SSO, federation, media, WOPI, WebRTC connectivity, backup data survival) — the job is a good test, not declaring impossibility. Adversary reads test bodies, rejects skip/xfail/mock/health-only/empty-assertion tests and bogus parity renames, re-runs cold. "Untestable" is a rare exception needing a true environment blocker + maximal subset + Adversary sign-off; "needs browser/SSO/another app" is not valid. Tighten P7 and §8 to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 03:50:33 +01:00
autonomic-bot	07faa6007f	Add Phase-2 plan: comprehensive per-recipe test authoring (after Phase-1 DONE) Phase 2 fills the CI machine with good tests for every maintained Co-op Cloud app, using references/recipe-maintainer as the corpus: port a comparable cc-ci test for EACH existing recipe-maintainer test (parity, tracked in PARITY.md) + >=2 new recipe-specific functional tests per recipe, plus real backup data-integrity and SSO dependency handling. Reuses the Phase-1 harness/stages/trigger/resource-caps; adds test content + small shared-harness ports from helpers.py. Linked from the package README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 03:41:28 +01:00
autonomic-bot	667c7cd5a0	plan §4.2/§4.3: MAX_TESTS via DRONE_RUNNER_CAPACITY + native queue/timeout; teardown after each run Don't overload the single node: cap concurrent test builds at a configurable MAX_TESTS (= DRONE_RUNNER_CAPACITY); Drone natively queues excess builds and times out hung ones, freeing slots — no custom queue. Each run deploys one app then undeploys; the run-start janitor is the backstop for timed-out/killed builds. At most MAX_TESTS apps live at once. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:45:26 +01:00
autonomic-bot	8c4efe3c88	Add cc-ci-plan/IDEAS.md: deferred-ideas backlog; park optional webhook self-registration First item: later, for environments where the CI server has repo-admin, consider an opt-in (off-by-default) feature to auto-register + idempotently reconcile the issue_comment webhook — preserving the read-only/polling default. Parked, out of current scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:42:34 +01:00
autonomic-bot	34cbb60f35	plan §4.1/§1.5: polling primary + read-only CI; webhook is optional manual-admin Finalize trigger model per operator: polling is the primary trigger (outbound, read-only, no admin); the server never self-registers webhooks (that needs admin) — webhook is an optional push optimization an admin registers manually, documented in enroll-recipe.md. Commenter auth via org-membership endpoint (read-level), not the admin-only permission endpoint. Bot's required privilege is read + comment + org-membership, never repo-admin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:37:17 +01:00
autonomic-bot	e157a943bb	plan §4.1: commenter auth via /permission endpoint (write+), not the collaborators list The repo's explicit collaborator list is empty — bot and maintainers (trav/notplants) all access via org ownership, so the collaborators check 404s for everyone. Authorize via GET /collaborators/{user}/permission requiring owner/admin/write (matches the builder's fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:20:59 +01:00
autonomic-bot	ef42e3d922	plan §4.1: trigger is webhook-OR-poll (mutually exclusive, flag-selected), + collaborator check Record the trigger design: webhook (default/primary, confirmed working) and polling (kept but disabled behind a flag) are mutually exclusive — only one runs at a time, so no cross-path dedupe. Poll is the fallback when webhook delivery fails. Also note the commenter-auth check must count recipe-maintainers org members/admins, not just repo collaborators (the bot is org admin and was being rejected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 02:15:32 +01:00
autonomic-bot	4ffcdda9da	plan §9: infra bring-up = declarative idempotent reconciliation, not manual/run-once Strengthen the idempotency guardrail: every infra piece (swarm, traefik recipe deploy, drone, bridge, dashboard) is a systemd oneshot that re-runs each activation/boot and converges to desired state (like swarm-init) — no manual post-steps, no run-once sentinels. Goal: from-scratch install = clone + nixos-rebuild switch + preconditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 22:49:49 +01:00
autonomic-bot	2264e0fa74	plan: use the real coop-cloud/traefik recipe via abra (e2e fidelity), not a custom Traefik Supersedes the original modules/traefik.nix hand-rolled proxy. cc-ci now deploys the coop-cloud/traefik recipe via abra in wildcard/file-provider mode, serving the operator's pre-issued wildcard cert as the recipe's ssl_cert/ssl_key swarm secrets — canonical web/web-secure + proxy/swarm conventions every recipe expects, no ACME, DNS token never on cc-ci. Updated §1, §1.5, §3, §4.0, §4.2, §5 (M1), §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 22:13:04 +01:00
autonomic-bot	76dcff70e8	Add README: orchestrator tmux + resume/remote-control relaunch quickref Records the exact sequence to keep the orchestrator alive in tmux and resume it with remote control (survives disconnects/laptop close), reconnect commands, and pointers to launch/supervision docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:00:05 +01:00
autonomic-bot	c75ffccb99	AGENTS.md: document resume-by-name + /remote-control for the orchestrator session Clarify the two distinct names (--resume <conversation> vs --remote-control display label), the in-session /remote-control shortcut, and the persist-vs-reconnect model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:56:24 +01:00
autonomic-bot	8ea3276d20	plan: document recipe mirror+PR flow and bot org scope for enrollment Recipe repos under test live on the private mirror git.autonomic.zone/recipe-maintainers, mirrored from upstream git.coopcloud.tech. autonomic-bot is admin on that org (can create repos + add webhooks). A recipe missing from the mirror is not a blocker — fetch from upstream and open a PR via the recipe-create-pr procedure. Updated D10 (§2) and enrollment (§4.1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:53:27 +01:00
autonomic-bot	001ff29190	Add AGENTS.md: orchestrator role + keep-open-under-remote-control model Documents the three roles (orchestrator vs Builder/Adversary loops), how to keep this orchestrator session alive under --remote-control for check-ins/steering via claude.ai/code, launch/supervision pointers, access/cred locations, and the VM fallback. Secrets remain gitignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:50:15 +01:00
autonomic-bot	bdc78da921	Initial commit: cc-ci autonomous orchestrator Planning + launch + setup material for the cc-ci Co-op Cloud recipe CI server: plan.md (single source of truth), kickoff/launch supervision, and the Builder/Adversary loop prompts. Secrets (.testenv) and runtime dirs are gitignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:46:28 +01:00
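
The failure mode fixed in commit 5681438b0f can be reproduced in isolation. The following is a minimal sketch, not the actual launch.sh code: the temporary file and the `status_file` variable are hypothetical stand-ins for a freshly created STATUS-1c.md, and only the `now=` assignment and the `|| true` guard come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a phase STATUS file with no claimed-awaiting lines yet.
status_file="$(mktemp)"
printf '## Phase 1c\nnothing claimed yet\n' > "$status_file"

# grep exits 1 when nothing matches. In a plain assignment the command
# substitution's exit status becomes the assignment's status, so under
# `set -e` the script would stop here without the `|| true` guard.
now="$(grep 'CLAIMED.*awaiting' "$status_file" || true)"

echo "claimed-awaiting lines: '${now}'"
rm -f "$status_file"
```

Removing `|| true` from the assignment makes the script exit silently before the `echo`, which matches the described symptom of the watchdog dying on an otherwise normal early-phase tick.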

32 Commits
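
The edge-triggered gate ping described in commit e68a520d4c (track the set of gate ids, baseline silently, ping only on newly appearing ids) might be sketched as below. This is an illustration under assumptions, not the watchdog's real code: `claimed_ids`, the temporary files, and the sample gate lines are hypothetical, and the sample lines use a plain hyphen as the separator where the real STATUS.md format may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the unique gate ids found on CLAIMED-awaiting lines.
# sed -n exits 0 even when nothing matches, so it is safe under `set -e`.
claimed_ids() {
  sed -nE 's/^Gate: ([^ ]+) .*CLAIMED, awaiting.*/\1/p' "$1" | sort -u
}

workdir="$(mktemp -d)"
status="$workdir/STATUS.md"
prev="$workdir/prev.ids"
cur="$workdir/cur.ids"

# First observation after a (re)start: record the baseline silently, no ping.
printf 'Gate: M1 - CLAIMED, awaiting Adversary\n' > "$status"
claimed_ids "$status" > "$prev"

# Later observation: the M1 line is still present (kept after passing), M2 is new.
printf 'Gate: M1 - CLAIMED, awaiting Adversary\nGate: M2 - CLAIMED, awaiting Adversary\n' > "$status"
claimed_ids "$status" > "$cur"

# Ids present now but absent from the prior observation.
# comm requires sorted input, which sort -u already provides.
new="$(comm -13 "$prev" "$cur")"
if [ -n "$new" ]; then
  echo "ping Adversary for: $new"
fi

# The current set becomes the baseline for the next tick.
cp "$cur" "$prev"
rm -rf "$workdir"
```

Comparing sets of ids rather than raw line text means that a gate passing while its line is kept, or an edit to the evidence on an existing line, produces no new id and therefore no ping.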