memory+journal: cc-ci host rebuild procedure; pxgate M2 deployed + verified on live host
@ -654,3 +654,15 @@ session cc-ci-orchestrator-stale can be killed; recipe-mirrors org still private
M1 = fix + controlled repro (loops), M2 = from-scratch cold-boot proof (orchestrator owns the
live nixos-rebuild). Appended pxgate to agents.toml (idx 14); cleared SEQUENCE-COMPLETE +
`phase set 14` + started loops → resumes the build for this one phase.

## 2026-06-13 ~13:50 — pxgate M2: orchestrator nixos-rebuilt the cc-ci host (operator-authorized)
- Operator OK'd the live rebuild (no CI running). Deployed the pxgate fix (deploy-proxy health probe
  → traefik.ci.commoninternet.net/api/version) via `nixos-rebuild switch --flake .#cc-ci`.
- Debugged 3 issues to get the build green: (1) git "not owned by current user" → chown root; (2) sops
  `secrets/secrets.yaml does not exist` → copied operator-held secret from /etc/cc-ci/secrets/; (3) git
  flake excludes untracked files → dropped .git (plain path flake). Procedure saved as memory
  [[cc-ci-host-rebuild-procedure]] (the host had NO self-service rebuild path; last rebuild 05-31).
- Result: deploy-proxy active on the dashboard-independent probe (verified in the running nix-store
  warm_reconcile.py), all 9 infra services 1/1, no failed units, endpoints 200. Cycle broken by
  construction. Only proxy/keycloak/sweep units rebuilt (nixpkgs pinned). Pushed BUILDER-INBOX with M2
  evidence to unblock the loops. From-scratch *reboot* proof offered to operator, not yet done.
@ -14,3 +14,6 @@
- [cfold paused pending upgrade](cfold-paused-pending-upgrade.md) — cfold phase loops+watchdog STOPPED until /upgrade-all (cc-ci-upgrader) finishes; resume = restart watchdog (phase-idx 9)
- [proxy VIP exhaustion runbook](proxy-vip-exhaustion-runbook.md) — TODO after upgrade: enlarge proxy overlay to /16 (exhausts at /24=254 VIPs); root cause of discourse/ghost deploy wedges
- [ghost PR debug](ghost-pr-debug.md) — TODO after proxy fix: debug+fix the ghost upgrade PR (wedged on proxy VIP exhaustion; possible duplicate PR)
- [launch-system unification](launch-system-unification.md) — replacing 5 launchers + ~15 dotfiles with one agents.toml + agents.py; plan in cc-ci-plan/plan-unified-launch.md
- [orchestrator backend-switch gotcha](orchestrator-backend-switch-gotcha.md) — full claude↔opencode switch: kill watchdog first, fix all control files, env overrides files, pkill -f self-matches
- [cc-ci host rebuild procedure](cc-ci-host-rebuild-procedure.md) — how to nixos-rebuild the live cc-ci server (no self-service path; stage clone + copy sops secrets + drop .git + --flake .#cc-ci)
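The proxy VIP entry above quotes 254 VIPs for a /24 overlay. As a quick arithmetic check, the sketch below counts usable addresses for a /24 and a /16 with Python's standard `ipaddress` module. The `10.0.0.0` prefix is a hypothetical placeholder, and the count only subtracts the network and broadcast addresses; any further addresses the overlay driver reserves are not modelled.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Addresses in the subnet minus the network and broadcast addresses."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


# Hypothetical prefix; only the mask lengths (/24 and /16) come from the note.
print(usable_hosts("10.0.0.0/24"))  # 254
print(usable_hosts("10.0.0.0/16"))  # 65534
```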
31
memory/cc-ci-host-rebuild-procedure.md
Normal file
@ -0,0 +1,31 @@
---
name: cc-ci-host-rebuild-procedure
description: How to nixos-rebuild the live cc-ci SERVER host (no self-service path exists; worked out 2026-06-13)
metadata:
  node_type: memory
  type: reference
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

The cc-ci **server** host (ssh alias `cc-ci`, Hetzner) has **no repo/script on it** to rebuild from
(`/root/cc-ci` does not exist; last operator rebuild was 2026-05-31). Orchestrator procedure to deploy
a nix change (e.g. a `deploy-proxy.service` / systemd-unit change that can't be applied at the docker
layer), established during phases `pvfix`/`pxgate`:

1. Stage current `main` on the host: `rsync -a --delete /home/loops/work/cc-ci-fix/ root@cc-ci:/root/cc-ci-deploy/`
   (orchestrator clone must be on the target ref + clean).
2. `ssh cc-ci 'chown -R root:root /root/cc-ci-deploy'`; otherwise git fails with "repository not owned by current user".
3. **Copy the operator-held sops secrets** (NOT in git): `cp /etc/cc-ci/secrets/secrets.yaml
   /root/cc-ci-deploy/secrets/secrets.yaml`. Without it the build fails with `secrets/secrets.yaml does not
   exist` (sops module). The age key is at `/var/lib/sops-nix/key.txt`.
4. `rm -rf /root/cc-ci-deploy/.git`: a git flake only includes *tracked* files, so the untracked
   secrets.yaml is excluded; dropping `.git` makes it a plain path flake that uses ALL files. (flake.nix
   has no `self.rev` dependency, so this is safe.)
5. Build first: `cd /root/cc-ci-deploy && nixos-rebuild build --flake .#cc-ci` (target is `.#cc-ci` =
   `.#cc-ci-hetzner` = `nix/hosts/cc-ci-hetzner/`). nixpkgs is PINNED to the running rev, so only the
   changed cc-ci modules rebuild: small + fast, not a giant bump.
6. `nixos-rebuild switch --flake .#cc-ci`. Then verify: `systemctl is-active deploy-proxy`,
   `systemctl --failed`, `docker service ls` all N/N, routed endpoints 200.

Operator must authorize (and pick a no-CI window): a switch cycles reconcile oneshots
(deploy-proxy, warm-keycloak). A true from-scratch boot proof = reboot the host.
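The six steps above could be collected into a dry-run plan before anyone touches the host. The following is a minimal sketch, not an existing script: the function names `on_host` and `rebuild_plan` are hypothetical, the paths and commands are copied from the procedure, and nothing is executed. The plan is only printed so the operator can review it.

```python
import shlex

HOST = "cc-ci"                        # ssh alias from the procedure
SRC = "/home/loops/work/cc-ci-fix/"   # orchestrator clone (step 1)
DEST = "/root/cc-ci-deploy"           # staging directory on the host
FLAKE = ".#cc-ci"


def on_host(cmd: str) -> list[str]:
    """Wrap a shell command so it would run on the host over ssh."""
    return ["ssh", HOST, cmd]


def rebuild_plan(switch: bool = False) -> list[list[str]]:
    """Return the ordered argv lists for steps 1-5 (and 6 if switch=True).

    Hypothetical helper: it only builds the commands, it never runs them.
    """
    plan = [
        ["rsync", "-a", "--delete", SRC, f"root@{HOST}:{DEST}/"],
        on_host(f"chown -R root:root {DEST}"),
        on_host(f"cp /etc/cc-ci/secrets/secrets.yaml {DEST}/secrets/secrets.yaml"),
        on_host(f"rm -rf {DEST}/.git"),
        on_host(f"cd {DEST} && nixos-rebuild build --flake {FLAKE}"),
    ]
    if switch:
        # Step 6 needs operator authorization and a no-CI window.
        plan.append(on_host(f"cd {DEST} && nixos-rebuild switch --flake {FLAKE}"))
    return plan


if __name__ == "__main__":
    for argv in rebuild_plan(switch=False):
        print(shlex.join(argv))
```

Keeping `switch` off by default mirrors the "build first" ordering in step 5: the switch command only enters the plan when it is requested explicitly.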
43
memory/launch-system-unification.md
Normal file
@ -0,0 +1,43 @@
---
name: launch-system-unification
description: Plan to replace 5 bespoke launcher scripts + ~15 dotfiles with one agents.toml + one agents.py driver
metadata:
  node_type: memory
  type: project
  originSessionId: fc17c9c2-ab6e-4c11-856e-a6a6e160a0ec
---

The cc-ci agent launch system was 5 near-duplicate launchers (`launch.py` loops+watchdog,
`launch-orchestrator.py`, `launch-assistant.py`, `launch-upgrader.py`, `launch-report.py`)
each re-implementing claude/opencode backend plumbing, plus ~15 scattered dotfiles in
`/srv/cc-ci/.cc-ci-logs/` (`.loop-backend`, `.loop-model*`, `.orch-model`, `.phases-spec`,
`.phase-idx`, `.*-session-id`, `.limited-*`, …).

**STATUS: IMPLEMENTED + CUT OVER 2026-06-13 ~05:27 UTC.** `agents.toml` + `agents.py` are live;
orchestrator/builder/adversary + watchdog all respawned under the new system and confirmed
working on phase pvfix. `launch.py` + `launch-orchestrator.sh` are now shims → `agents.py` (originals
at `*.orig`); systemd boot chain (cc-ci-loops-start → launch.sh → launch.py start) drives the new
system. State moved to `.cc-ci-logs/state/` (`phase-idx`, `<name>.id` resume, `limited-*.json`). tmux
targeting uses exact match (`=name` / `=name:`) to avoid prefix collisions (cc-ci-assistant vs
cc-ci-assistant3). **VERIFIED STABLE over 3 hourly checks (06:30/07:34/08:36 UTC 2026-06-13);
hourly wake ENDED (orchestrator runs its own hourly self-wake via the watchdog).** During
verification the system autonomously ran the build to completion (pvfix→pvcheck→ghost→cf48, an
operator-appended opus review phase): clean auto-advances, handoff pings, per-phase model
overrides all worked. One port defect found+fixed at check 2: `phase_advance_check` is now
idempotent once SEQUENCE-COMPLETE exists (no 5-min log spam) and the watchdog keeps death-healing
the orchestrator without restarting the intentionally-stopped finished loops. End state: build
sequence COMPLETE, loops intentionally stopped, orchestrator supervising; 13 recipe PRs await
operator review/merge.

**Original plan (approved 2026-06-13, operator chose TOML + build-now + de-dupe mailu):** one
declarative `cc-ci-plan/agents.toml` (single source of truth: per-agent backend, model,
prompt, kind, watch policy; backends declared as data) + one `cc-ci-plan/agents.py` driver
(up/down/status/watchdog/logs/phase). Watchdog reads the config file, not env. Config vs
state split: declarative config in TOML, runtime state (phase-idx, resume ids, limit
windows) under a `state/` dir. Full design + behavior-mapping + migration in
`cc-ci-plan/plan-unified-launch.md`. Mimic current behavior first, cut over between phases
(old launchers become shims), then retire them. De-dupe note: live `.phases-spec` lists
`mailu` twice (idx 5 & 7); dropping the 2nd shifts current phase cf55 from idx 10→9.
Agent kinds: loop (builder/adversary, phase machine) | persistent (orchestrator resume+wake,
assistant) | task (upgrader/report one-shot slash command) | service (watchdog/cleanlogs).
See [[orchestrator-backend-switch-gotcha]].
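The de-dupe note in the plan (dropping the second `mailu` moves cf55 from idx 10 to idx 9) can be checked with a small order-preserving de-duplication. This is a sketch under stated assumptions: `dedupe_keep_first` is a hypothetical helper, and the `p*` entries are placeholder phase names. Only the positions of `mailu` (idx 5 and 7) and `cf55` (idx 10) are taken from the note.

```python
def dedupe_keep_first(phases: list[str]) -> list[str]:
    """Drop repeated phase names, keeping the first occurrence and the order."""
    seen: set[str] = set()
    out: list[str] = []
    for name in phases:
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out


# Placeholder spec: the p* names are hypothetical fillers, not real phases.
spec = ["p0", "p1", "p2", "p3", "p4", "mailu", "p6", "mailu", "p8", "p9", "cf55"]
deduped = dedupe_keep_first(spec)

print(spec.index("cf55"), "->", deduped.index("cf55"))  # 10 -> 9
```

Any stored phase index at or beyond the removed entry has to shift by one at the same time as the spec is de-duplicated, otherwise the loops would resume on the wrong phase.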
27
memory/orchestrator-backend-switch-gotcha.md
Normal file
@ -0,0 +1,27 @@
---
name: orchestrator-backend-switch-gotcha
description: How to fully switch the cc-ci agent system between claude/opencode backends without the watchdog reverting it
metadata:
  node_type: memory
  type: feedback
  originSessionId: fc17c9c2-ab6e-4c11-856e-a6a6e160a0ec
---

Switching the cc-ci agents between claude and opencode backends is NOT just editing the
`.loop-*`/`.orch-model` files: **env overrides files** in `launch.py`/`launch-orchestrator.py`,
and the **running watchdog** (`launch.py watchdog`) carries `LOOP_BACKEND`/`LOOP_MODEL`/`ADV_MODEL`
in its OWN environment. Its `heal_orchestrator()` kills any orchestrator whose backend ≠ the
watchdog's expected backend and relaunches it, so a half-switch gets auto-reverted within ~30s.

**Why:** the watchdog is a long-lived process; it read its env at launch and re-applies it every tick.

**How to apply (full switch):** (1) kill the watchdog session FIRST (`tmux kill-session -t
cc-ci-watchdog`) so it stops healing; (2) rewrite ALL control files coherently: `.loop-backend`,
`.orch-model`, `.loop-model`, `.loop-model-adv`, and any per-phase `.loop-model[-adv]-<phase>` that
held opencode `provider/model` values (claude crashes on `--model openai/gpt-5.4`); (3) stop opencode
(tmux sessions cc-ci-orchestrator/adv/builder + the `opencode serve` server on :4096 + orphan
`attach` clients; note that the `-oc` alt-session may be **root-owned** and needs `sudo`); (4) start the
claude orchestrator via `cc-ci-plan/launch-orchestrator.sh start`; it runs its startup routine and
relaunches loops+watchdog on claude. **Gotchas:** `pkill -f opencode`/`pkill -f "launch.py watchdog"`
match your OWN command line and kill your shell; kill by exact PID or `tmux kill-session` instead.
The [[launch-system-unification]] rework removes this footgun (config is the sole source of truth).
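The `pkill -f` gotcha comes from matching against full command lines: the shell that issues the kill carries the pattern in its own arguments. The sketch below illustrates the effect on a made-up process table and shows the "kill by exact PID" alternative as an explicit exclusion list. `matching_pids` and both table entries are hypothetical, and plain substring matching stands in for `pkill`'s regular-expression matching.

```python
def matching_pids(procs, pattern, exclude=()):
    """Return PIDs whose full command line contains `pattern`, skipping `exclude`."""
    skip = set(exclude)
    return [pid for pid, cmdline in procs if pattern in cmdline and pid not in skip]


# Hypothetical process table: (pid, full command line).
procs = [
    (1200, "python3 launch.py watchdog"),
    (4321, "bash -c pkill -f 'launch.py watchdog'"),  # the shell issuing the kill
]

# Matching on the command line alone also selects the invoking shell.
print(matching_pids(procs, "launch.py watchdog"))                  # [1200, 4321]

# Excluding the caller's own PID leaves only the intended target.
print(matching_pids(procs, "launch.py watchdog", exclude=[4321]))  # [1200]
```

In a real script the exclusion list would typically hold `os.getpid()` and `os.getppid()`; targeting the tmux session by name, as the note recommends, avoids the pattern match altogether.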