orchestrator: reboot-resilience + session auto-resume + full session plan/tooling

Reboot survival for the Pi orchestrator host:
- systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot
  records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the
  orchestrator session.
- reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count).
- launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed
  orchestrator announces itself (PushNotification) + reports reboots.
- AGENTS.md: on-startup notify routine documented.

Plans/tooling accumulated this session:
- plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review),
  sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance.
- launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution,
  limit-stall re-nudge, INBOX side-channel detection.
- plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED.
- prompts: isolation discipline + INBOX + pacing.
- .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, *.tmp.*).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:28:10 +01:00
parent 5681438b0f
commit 36a6c9872a
20 changed files with 1395 additions and 19 deletions

.gitignore vendored

@@ -10,3 +10,9 @@
/cc-ci-adv/
/.cc-ci-watch/
/.cc-ci-logs/
# More secrets / local state: NEVER commit
# master recovery age key
/.sops/
# separate sops-secrets repo, cloned in
/cc-ci-secrets/
# local claude session/project state
/.claude/
# editor temp files
*.tmp.*


@@ -16,6 +16,25 @@ project (NixOS config, test runner, recipe tests) lives in a **separate** repo t
The two loops coordinate **only** through the cc-ci git repo (see `plan.md` §6.1). The orchestrator
watches from outside.
## On startup: announce yourself + report reboots
**Every time you (the orchestrator) start or resume, send a `PushNotification`** that you are online —
the operator wants to know the supervising session is back (especially after a reboot, which kills
this session along with the Pi). Include the current phase and the reboot count. Steps on startup:
1. Read `cc-ci-plan/REBOOTS.md` (count the entries under `## Reboots`) and run `cc-ci-plan/launch.sh status`
(current phase + whether the loops/watchdog are running).
2. `PushNotification` (proactive), e.g.: *"cc-ci orchestrator online — phase 2, loops+watchdog
running; N reboots logged (last <date>)."*
3. If a reboot happened while you were away (a new line in REBOOTS.md since you last looked, or the
loops are down), check that `cc-ci-loops.service` brought the loops back; if not, relaunch with
`RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
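The reboot count in step 1 can be derived mechanically. The following is a minimal sketch, assuming each entry is a top-level `- ` bullet below the `## Reboots` heading (as in the `REBOOTS.md` added by this commit). The helper name `count_reboots` is hypothetical and is not part of `launch.sh`.

```shell
# Hypothetical helper: count reboot entries in a REBOOTS.md-style file.
# Assumes each entry is a top-level "- " bullet below the "## Reboots" heading;
# wrapped continuation lines that do not start with "- " are not counted.
count_reboots() {
  awk '
    /^## Reboots/ { in_section = 1; next }   # entries start after this heading
    /^## /        { in_section = 0 }          # any other H2 heading ends the section
    in_section && /^- / { n++ }
    END { print n + 0 }                       # prints 0 when there are no entries
  ' "$1"
}
```

Usage would look like `count_reboots cc-ci-plan/REBOOTS.md`, with the result interpolated into the notification text.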
Reboot resilience is handled by **`cc-ci-loops.service`** (system unit): on boot it logs the reboot
to `REBOOTS.md` (boot_id-gated) and runs `launch.sh start` with `RESUME_PHASE=1`, so the loops +
watchdog auto-resume the saved phase. The orchestrator session itself is NOT auto-started — the
operator reconnects to it (that's why the startup notification matters). The fuller "move the
orchestrator onto its own VM" plan is parked at `cc-ci-plan/plan-orchestrator-migration.md`.
## Keep the orchestrator open, under remote-control
Run this session as a long-lived **interactive** session with `--remote-control` so the operator can


@@ -4,6 +4,22 @@ Post-DONE or "revisit later" ideas that are intentionally **out of scope** for t
(§2 Definition of Done). Not active work — parked here so they aren't lost. The loops may pull an
item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
- [ ] **Optional `--extra-tests` flag for heavy / operational tests (opt-in heavy suite).**
Some recipe tests are "more than needed" for the default CI signal — state-management /
long-running-instance / load / helper-script operational tests that don't fit the ephemeral
per-run-deploy model cheaply but are useful occasionally. Today they're deferred to
`cc-ci/machine-docs/DEFERRED.md` (e.g. matrix-synapse `compress_state.sh`,
`test_complexity_limit.sh`, `test_purge.sh`) and don't run.
*Idea:* add an **opt-in `--extra-tests` flag** (e.g. `!testme --extra-tests` on a PR comment, or
a `STAGES=extra` / `EXTRA_TESTS=1` Drone build parameter) that the orchestrator passes through;
recipes declare an `extra/` test dir or mark tests with `@pytest.mark.extra`; on opt-in the
orchestrator runs them **alongside** the default tiers (still one deploy, still teardown). Default
off so default CI stays fast; the operator can ask for the heavy suite when reviewing a PR that
touches an extra-covered area (e.g. matrix-synapse's abra helpers). When implemented, each
matching DEFERRED entry can be CLOSED by porting its test into the recipe's `extra/` and noting
the commit in DEFERRED.md. *Why deferred for now:* default coverage is sufficient; this is a
later breadth/depth knob, not a critical-path feature. *Added:* 2026-05-28.
- [ ] **Optional webhook self-registration (admin-access environments).**
We deliberately made **polling the primary trigger** and require the CI server/bot to run on
**read-level** access only — so the server does **not** auto-register Gitea webhooks (that needs


@@ -17,7 +17,9 @@ autonomous Claude loops (a Builder and an adversarial Reviewer) running over day
| `plan.md` | The Phase-1 plan (build the CI server). Agents treat it as their single source of truth. |
| `plan-phase1c-full-reproducibility.md` | **Phase 1c** (runs first): make the VM fully reproducible from git (all secrets incl. the wildcard cert in sops, in a separate private `cc-ci-secrets` repo as a flake input; base stays well-parameterized) and do the **genuine throwaway-VM live rebuild** to close D8 honestly (the "infeasible by design" was overstated). |
| `plan-phase1b-review-lint.md` | **Phase 1b** (after 1c): deterministic linting/formatting in CI + a white-box review checklist (real tests, DRY harness, idempotent Nix, no footguns/secrets), ending in a full cold re-verification of all D1–D10, now covering 1c's refactor. |
| `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase 1b): author comprehensive per-recipe tests — port every recipe-maintainer test + ≥2 recipe-specific tests per app. |
| `plan-phase1d-generic-test-suite.md` | **Phase 1d** (after 1b, before 2): a **generic install/upgrade/backup/restore** suite that runs on *any* recipe with zero config, with a recipe's own `test_<op>.py` **overriding or extending** the generic (Builder's call) and **reusing the generic's deployment — no redeploy**, plus optional custom install-steps; recipes needing special setup fail the generic form gracefully. The test-architecture foundation Phase 2 builds on. |
| `plan-phase1e-harness-corrections.md` | **Phase 1e** (after 1d, before 2): three operator-review corrections to the shared generic harness — (HC1) upgrade goes previous-release → **PR head** via `deploy --chaos`; (HC2) **repo-local PR code runs only for approved recipes** (default = cc-ci overlays + generic only); (HC3) the **generic runs by default** alongside an overlay, skipped only via explicit opt-out. |
| `plan-phase2-recipe-tests.md` | **Phase 2** (after Phase 1e): build on the corrected generic suite — author the recipe overlays (port recipe-maintainer tests as `test_*.py`) + define custom install steps where a recipe fails generically. |
| `plan-phase2b-test-performance.md` | **Phase 2b** (after Phase 2, before Phase 3): empirically measure where test time goes and reduce it (image cache, readiness tuning, dedup deploys, warm infra, concurrency) — no weakened tests. |
| `plan-phase3-results-ux.md` | **Phase 3** (after Phase 2b): beautiful YunoHost-style results — per-run **level**, image-forward PR comment (badge + summary card + app screenshot), polished dashboard. |
| `IDEAS.md` | Deferred/future ideas, parked out of current scope. |

cc-ci-plan/REBOOTS.md Normal file
View File

@@ -0,0 +1,15 @@
# Reboot log — cc-ci orchestrator Pi
One line per genuine reboot of the orchestrator Pi (`raspberrypi`), appended automatically by
`reboot-log.sh` (ExecStartPre of `cc-ci-loops.service`, boot_id-gated so manual service restarts are
NOT counted). The Pi hosts the Builder + Adversary loops + watchdog; a reboot drops the tmux sessions
(and this orchestrator session), and `cc-ci-loops.service` restarts the loops on boot. Count the
lines below to see how often it's happening.
## Reboots
- 2026-05-28 (~19:?? BST) — reboot (backfilled from memory; mid-Phase-2). Orchestrator + loops were
down until manually relaunched. This pre-dates the systemd auto-restart service.
- 2026-05-28 (~20:02 BST) — reboot (backfilled from memory; uptime showed 5 min at 20:07). Loops
manually relaunched at phase 2; this is what prompted adding `cc-ci-loops.service` +
auto-logging. Auto-logging is live from the next reboot onward.
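
`reboot-log.sh` itself is not shown in this view. As an illustration of what "boot_id-gated" means, here is a minimal sketch that assumes the gate compares the kernel's boot id (`/proc/sys/kernel/random/boot_id` on Linux) with the last id it recorded. The function name, the state-file argument, and the entry format are all hypothetical.

```shell
# Hypothetical sketch of boot_id gating (NOT the actual reboot-log.sh).
# Appends one entry per distinct boot id; repeat calls within the same boot
# (for example a manual service restart) append nothing.
#   $1 = file holding the current boot id (default: the Linux kernel's boot_id)
#   $2 = state file remembering the last boot id that was logged
#   $3 = REBOOTS.md-style log file to append to
log_reboot_once() {
  local boot_id_src="${1:-/proc/sys/kernel/random/boot_id}"
  local state_file="$2" log_file="$3"
  local current last=""
  current="$(cat "$boot_id_src")" || return 1
  if [ -f "$state_file" ]; then last="$(cat "$state_file")"; fi
  if [ "$current" = "$last" ]; then
    return 0                      # same boot: a service restart, not a reboot
  fi
  printf '%s %s reboot (boot_id %s)\n' '-' "$(date '+%F %T %Z')" "$current" >> "$log_file"
  printf '%s\n' "$current" > "$state_file"
}
```

Because the boot id changes only when the kernel boots, running this from `ExecStartPre` would log genuine reboots while ignoring manual restarts of the unit.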

cc-ci-plan/launch-orchestrator.sh Executable file
View File

@@ -0,0 +1,117 @@
#!/usr/bin/env bash
#
# launch-orchestrator.sh — start/resume the cc-ci ORCHESTRATOR session in tmux under remote-control.
#
# The orchestrator (see /srv/cc-ci/AGENTS.md) is the long-lived SUPERVISORY session: it watches the
# Builder/Adversary loops, reads their logs/STATUS, edits the plan/prompts, restarts stuck loops, and
# owns the VM-level fallback. It is SEPARATE from the loops that launch.sh manages — this script only
# brings the orchestrator back (e.g. after a reboot, which kills the tmux server and every session in
# it). The conversation itself survives on disk across exits/reboots; remote-control only stays
# connected while the process is alive, so recovery = relaunch the process and re-attach by --resume.
#
# Naming: tmux session AND remote-control name are both "cc-ci-orchestrator", matching the loop
# sessions cc-ci-builder / cc-ci-adv / cc-ci-watchdog.
#
# Usage:
# ./launch-orchestrator.sh start # resume the persistent orchestrator session (DEFAULT)
# ./launch-orchestrator.sh fresh # start a NEW orchestrator session (no --resume)
# ./launch-orchestrator.sh status # show tmux + remote-control state
# ./launch-orchestrator.sh attach # tmux attach to the session (Ctrl-b d to detach)
# ./launch-orchestrator.sh stop # kill the tmux session (conversation persists on disk)
#
# The persistent session id is read from $ID_FILE (seeded on first run with DEFAULT_ID). A Claude
# session keeps the SAME id across --resume, so this stays valid across reboots. To point the script
# at a different session, edit that file or export ORCH_SESSION_ID.
set -euo pipefail
# ----- config -------------------------------------------------------------
SESSION="${ORCH_SESSION:-cc-ci-orchestrator}" # tmux session name == remote-control name
WORKDIR="${ORCH_DIR:-/srv/cc-ci}" # orchestrator cwd (its claude project dir)
CLAUDE_BIN="${CLAUDE_BIN:-claude}"
CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
# REMOTE_CONTROL=1 → --remote-control session, viewable/steerable at claude.ai/code. Needs the box
# logged into the claude.ai account. =0 for a plain local interactive session.
REMOTE_CONTROL="${REMOTE_CONTROL:-1}"
LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
ID_FILE="${ORCH_ID_FILE:-$LOG_DIR/.orchestrator-session-id}"
DEFAULT_ID="34a80a99-b37e-4809-b8da-ccc9fafe785e" # the orchestrator session as of 2026-05-28
# Startup nudge injected as the resumed session's first turn, so an AUTO-launched orchestrator (e.g.
# cc-ci-loops.service ExecStartPost after a reboot) actually RUNS its AGENTS.md startup routine —
# announce itself + report reboots — instead of resuming silently and waiting. Set empty to disable.
# Must contain NO single quotes (it is single-quoted into the tmux command).
STARTUP_PROMPT="${ORCH_STARTUP_PROMPT-STARTUP (auto-launch): you are the cc-ci orchestrator, just (re)launched, likely after a reboot. Do your AGENTS.md On-startup routine NOW: read cc-ci-plan/REBOOTS.md and run cc-ci-plan/launch.sh status, then send a proactive PushNotification that you are online with the current phase and reboot count, and confirm cc-ci-loops.service brought the loops + watchdog back (relaunch with RESUME_PHASE=1 cc-ci-plan/launch.sh start if not). Then resume supervising.}"
# --------------------------------------------------------------------------
log() { printf '[orchestrator %(%H:%M:%S)T] %s\n' -1 "$*"; }
die() { log "ERROR: $*"; exit 1; }
session_alive() { tmux has-session -t "$SESSION" 2>/dev/null; }
preflight() {
command -v tmux >/dev/null 2>&1 || die "missing dependency: tmux"
command -v "$CLAUDE_BIN" >/dev/null 2>&1 || die "claude CLI not found (set CLAUDE_BIN)"
[[ -d "$WORKDIR" ]] || die "workdir not found: $WORKDIR"
mkdir -p "$LOG_DIR"
[[ -f "$ID_FILE" ]] || echo "$DEFAULT_ID" > "$ID_FILE"
}
resume_id() { echo "${ORCH_SESSION_ID:-$(cat "$ID_FILE" 2>/dev/null || echo "$DEFAULT_ID")}"; }
# Launch claude in a detached tmux session. $1 = mode ("resume"|"fresh"; default "resume").
start() {
local mode="${1:-resume}"
preflight
if session_alive; then
log "$SESSION already running — leaving it (use '$0 stop' first to relaunch)"
return 0
fi
local rc="" resume="" id=""
[[ "$REMOTE_CONTROL" == "1" ]] && rc="--remote-control '$SESSION'"
if [[ "$mode" == "resume" ]]; then
id="$(resume_id)"
[[ -n "$id" ]] && resume="--resume '$id'"
log "starting $SESSION (resume=$id, cwd=$WORKDIR, rc=$REMOTE_CONTROL)"
else
log "starting $SESSION FRESH (no resume, cwd=$WORKDIR, rc=$REMOTE_CONTROL)"
fi
# Startup nudge as a POSITIONAL prompt (not stdin — stdin would force print mode and break
# remote-control). On --resume this appends as the session's next turn, triggering the AGENTS.md
# startup routine (announce + report reboots). Empty STARTUP_PROMPT => clean resume, no nudge.
local prompt_arg=""
[[ -n "$STARTUP_PROMPT" ]] && prompt_arg="'$STARTUP_PROMPT'"
tmux new-session -d -s "$SESSION" -c "$WORKDIR" \
"$CLAUDE_BIN $resume $rc $CLAUDE_FLAGS $prompt_arg"
tmux pipe-pane -o -t "$SESSION" "cat >> '$LOG_DIR/$SESSION.log'"
log "started. status: $0 status | attach: tmux attach -t $SESSION"
}
case "${1:-start}" in
start) start resume ;;
fresh) start fresh ;;
stop)
if session_alive; then log "killing $SESSION"; tmux kill-session -t "$SESSION" || true; else log "$SESSION not running"; fi
;;
status)
if session_alive; then
log "$SESSION: RUNNING"
ps -eo pid,etime,args | grep "[r]emote-control $SESSION" || true
else
log "$SESSION: stopped"
fi
log "resume id: $(cat "$ID_FILE" 2>/dev/null || echo "$DEFAULT_ID") (file: $ID_FILE)"
;;
attach) exec tmux attach -t "$SESSION" ;;
*)
cat <<EOF
cc-ci orchestrator launcher
$0 start resume the persistent orchestrator session in tmux + remote-control (default)
$0 fresh start a NEW orchestrator session (no --resume)
$0 status show tmux + remote-control state and the resume id
$0 attach tmux attach to the session
$0 stop kill the tmux session (conversation persists on disk)
Env: ORCH_SESSION=$SESSION ORCH_DIR=$WORKDIR REMOTE_CONTROL=$REMOTE_CONTROL CLAUDE_BIN=$CLAUDE_BIN
EOF
;;
esac
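
The launcher requires `STARTUP_PROMPT` to contain no single quotes because the prompt is single-quoted into the tmux command string. One possible way to lift that restriction is to escape the prompt before embedding it. This is a sketch only, assuming tmux hands the command string to a POSIX-style shell; the helper name `sq_quote` is hypothetical and not part of the script.

```shell
# Hypothetical helper: wrap a string in single quotes for a POSIX shell,
# rewriting each embedded ' as '\'' (close quote, escaped quote, reopen quote).
# Limitation: trailing newlines are dropped by the command substitution.
sq_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}
```

With such a helper, the `prompt_arg` assignment could become `prompt_arg="$(sq_quote "$STARTUP_PROMPT")"`, so an apostrophe in the prompt would no longer break the command line.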


@@ -7,10 +7,10 @@
# • Adversary (tmux session: cc-ci-adv) working clone /srv/cc-ci/cc-ci-adv
# coordinating only through the git repo on git.autonomic.zone.
#
# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c then 1b). Each phase
# has its own plan + phase-namespaced loop-state files (STATUS-<id>.md etc.). When a phase's
# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2 → 2b → 3 → 4).
# Each phase has its own plan + phase-namespaced loop-state files (STATUS-<id>.md etc.). When a phase's
# STATUS-<id>.md shows "## DONE", the watchdog AUTO-TRANSITIONS to the next phase; after the LAST
# phase it STOPS the loops and exits (a manual gate — e.g. check in before Phase 2).
# phase (4, final review/polish/cleanup) it STOPS the loops and exits (end of the whole build).
#
# Four jobs: ITERATION (each agent's /loop), RESILIENCE (restart a dead loop), HANDOFF SIGNALLING
# (ping the waiting loop the moment its counterpart hands off), PHASE SEQUENCING (this file).
@@ -49,7 +49,7 @@ WATCHDOG_SESSION="cc-ci-watchdog"
# Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
# auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
# last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md}"
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
# --------------------------------------------------------------------------
@@ -64,7 +64,10 @@ phase_id() { echo "${PHASES[$1]}" | cut -d'|' -f1; }
phase_plan() { echo "${PHASES[$1]}" | cut -d'|' -f2; }
phase_status() { echo "${PHASES[$1]}" | cut -d'|' -f3; }
phase_review() { echo "REVIEW-$(phase_id "$1").md"; }
phase_done() { grep -qE '^##[[:space:]]+DONE' "$BUILDER_DIR/$1" 2>/dev/null; } # $1 = status basename (read locally)
# Loop-state files may sit at the repo root OR under machine-docs/ (the 1b RL6 move). Prefer
# machine-docs/ if present, else root — so the watchdog survives the move whenever it happens.
resolve_state() { local dir="$1" base="$2"; if [[ -f "$dir/machine-docs/$base" ]]; then echo "$dir/machine-docs/$base"; else echo "$dir/$base"; fi; }
phase_done() { grep -qE '^##[[:space:]]+DONE' "$(resolve_state "$BUILDER_DIR" "$1")" 2>/dev/null; } # $1 = status basename (read locally)
all_ids() { local p; for p in "${PHASES[@]}"; do printf '%s ' "$(echo "$p" | cut -d'|' -f1)"; done; }
preflight() {
@@ -133,15 +136,32 @@ ping_session() {
tmux send-keys -t "$s" -l -- "$msg" 2>/dev/null && { sleep 0.3; tmux send-keys -t "$s" Enter 2>/dev/null; }
}
# A loop can stall ALIVE on a usage/spend-limit notice: the claude process stays up (so the
# dead-session restart never fires) but makes no progress, and the /loop self-pacing is dead because
# the limit interrupted the turn that would have scheduled the next tick. Detect that signature
# (limit text present + no active-turn marker) and re-nudge it each heavy tick — once the limit resets
# the next nudge lands and the loop resumes. Gated on the limit text so we NEVER nudge a loop that is
# just legitimately idle-waiting on a handoff.
LIMIT_RE='spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)'
nudge_if_limit_stalled() {
local s="$1" pane
pane="$(tmux capture-pane -pt "$s" 2>/dev/null | tail -25 || true)"
if printf '%s\n' "$pane" | grep -q 'esc to interrupt'; then return 0; fi # actively working
if ! printf '%s\n' "$pane" | grep -qiE "$LIMIT_RE"; then return 0; fi # not a limit stall
log "limit-stall detected on $s — re-nudging to resume"
ping_session "$s" "watchdog: the usage/spend limit appears lifted — RESUME your loop now. Pull latest, re-read your phase STATUS/REVIEW files, and continue from where you stopped; re-arm your loop pacing."
}
# Edge-triggered handoff signalling for the CURRENT phase. Reads the loops' local clones.
# Ping the Adversary only when a gate id NEWLY appears on a "CLAIMED … awaiting" line (never on
# the baseline / restart / a passed-but-kept line). Ping the Builder when the phase REVIEW changes.
_wd_awaiting=""; _wd_baselined=""; _wd_last_review=""
handoff_reset() { _wd_awaiting=""; _wd_baselined=""; _wd_last_review=""; } # call on phase transition
_wd_adv_inbox_seen=""; _wd_builder_inbox_seen=""
handoff_reset() { _wd_awaiting=""; _wd_baselined=""; _wd_last_review=""; _wd_adv_inbox_seen=""; _wd_builder_inbox_seen=""; } # call on phase transition
handoff_check() {
local idx sf rf cur now added
idx="$(cur_idx)"
sf="$BUILDER_DIR/$(phase_status "$idx")"; rf="$ADV_DIR/$(phase_review "$idx")"
sf="$(resolve_state "$BUILDER_DIR" "$(phase_status "$idx")")"; rf="$(resolve_state "$ADV_DIR" "$(phase_review "$idx")")"
if [[ -f "$sf" ]]; then
now="$(grep -iE 'CLAIMED.*awaiting' "$sf" 2>/dev/null | grep -oiE 'M[0-9]+(\.[0-9]+)?|[A-Z][0-9]+' | tr '[:lower:]' '[:upper:]' | sort -u || true)"
if [[ -n "$_wd_baselined" ]]; then
@@ -163,6 +183,34 @@ handoff_check() {
_wd_last_review="$cur"
fi
fi
# INBOX side-channel (§6.1). The sender writes the receiver's inbox in their OWN clone, so we
# detect from the sender side. Edge-trigger on content hash so a fresh message (sender re-wrote
# before receiver consumed) re-pings. Receiver deletes after processing => hash empty => next
# write re-triggers.
local adv_inbox builder_inbox h
adv_inbox="$(resolve_state "$BUILDER_DIR" "ADVERSARY-INBOX.md")"
if [[ -f "$adv_inbox" ]]; then
h="$(md5sum "$adv_inbox" 2>/dev/null | awk '{print $1}' || true)"
if [[ -n "$h" && "$h" != "$_wd_adv_inbox_seen" ]]; then
log "handoff: ADVERSARY-INBOX.md new/changed -> pinging Adversary"
ping_session "$ADV_SESSION" "watchdog ping: the Builder wrote machine-docs/ADVERSARY-INBOX.md — pull, read the message, act on it, then delete the file (commit + push) to mark it consumed."
_wd_adv_inbox_seen="$h"
fi
else
_wd_adv_inbox_seen="" # consumed; ready for the next write
fi
builder_inbox="$(resolve_state "$ADV_DIR" "BUILDER-INBOX.md")"
if [[ -f "$builder_inbox" ]]; then
h="$(md5sum "$builder_inbox" 2>/dev/null | awk '{print $1}' || true)"
if [[ -n "$h" && "$h" != "$_wd_builder_inbox_seen" ]]; then
log "handoff: BUILDER-INBOX.md new/changed -> pinging Builder"
ping_session "$BUILDER_SESSION" "watchdog ping: the Adversary wrote machine-docs/BUILDER-INBOX.md — pull, read the message, act on it, then delete the file (commit + push) to mark it consumed."
_wd_builder_inbox_seen="$h"
fi
else
_wd_builder_inbox_seen=""
fi
}
watchdog_loop() {
@@ -184,15 +232,15 @@ watchdog_loop() {
handoff_reset
start_loops
else
log "PHASE SEQUENCE COMPLETE (last phase $pid DONE). Stopping loops — MANUAL CHECK-IN required before Phase 2."
log "PHASE SEQUENCE COMPLETE (last phase $pid DONE). Stopping loops; entire build (1c→4) finished."
stop_loops
printf 'cc-ci phase sequence complete %(%F %T)T. Phases: %s. Loops stopped; manual check-in required before Phase 2.\n' -1 "$(all_ids)" > "$LOG_DIR/SEQUENCE-COMPLETE"
printf 'cc-ci phase sequence complete %(%F %T)T. Phases: %s. Loops stopped; entire build finished.\n' -1 "$(all_ids)" > "$LOG_DIR/SEQUENCE-COMPLETE"
log "watchdog exiting."
exit 0
fi
else
session_alive "$BUILDER_SESSION" || { log "builder gone — restarting (phase $pid)"; start_agent builder "$BUILDER_SESSION" "$BUILDER_DIR"; }
session_alive "$ADV_SESSION" || { log "adversary gone — restarting (phase $pid)"; start_agent adversary "$ADV_SESSION" "$ADV_DIR"; }
if session_alive "$BUILDER_SESSION"; then nudge_if_limit_stalled "$BUILDER_SESSION"; else log "builder gone — restarting (phase $pid)"; start_agent builder "$BUILDER_SESSION" "$BUILDER_DIR"; fi
if session_alive "$ADV_SESSION"; then nudge_if_limit_stalled "$ADV_SESSION"; else log "adversary gone — restarting (phase $pid)"; start_agent adversary "$ADV_SESSION" "$ADV_DIR"; fi
fi
fi
sleep "$SIGNAL_INTERVAL"
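
The limit-stall check in `nudge_if_limit_stalled` combines two signals: an active-turn marker and a limit notice. The following sketch restates that decision as a pure function over already-captured pane text, so it can be exercised without tmux. The function name `classify_pane` and its three output labels are hypothetical; the regular expression is copied from `LIMIT_RE` in the diff.

```shell
# Same pattern as LIMIT_RE in the diff above.
LIMIT_RE='spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)'

# Hypothetical pure-function variant of the check: classify captured pane text
# passed as $1 instead of calling tmux. Prints: working | limit-stalled | idle
classify_pane() {
  local pane="$1"
  if printf '%s\n' "$pane" | grep -q 'esc to interrupt'; then
    echo working          # an active turn is in progress: never nudge
  elif printf '%s\n' "$pane" | grep -qiE "$LIMIT_RE"; then
    echo limit-stalled    # limit notice and no active turn: safe to re-nudge
  else
    echo idle             # e.g. waiting on a handoff: leave it alone
  fi
}
```

Checking the active-turn marker first mirrors the order in the script: a loop that is working is never nudged, even if an old limit notice is still visible in its scrollback.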


@@ -0,0 +1,135 @@
# Plan — migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM
**Goal:** move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog,
the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated,
**reboot-resilient NixOS VM** on b1 — declared in a new git repo **`cc-ci-orchestrator`**. Finish by
relocating this orchestrator session there too.
**Why:** the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog
(they don't survive reboot, nothing auto-restarts them). A NixOS VM lets us declare the whole rig
(claude CLI, proxy, loop supervisor) as systemd services that come back on boot — turning a reboot
into a non-event. It also consolidates the orchestrator next to the infra it manages.
**Status:** DRAFT — awaiting operator go-ahead before any infra creation / cutover.
---
## 0. Current footprint (what has to move)
On the Pi (`raspberrypi`, aarch64), workspace `/srv/cc-ci` (itself the
`cc-ci-autonomous-orchestrator` git repo):
| Item | What | Move strategy |
|---|---|---|
| `cc-ci-plan/` | loop code: `launch.sh`, `plan*.md`, `prompts/`, `kickoff.md` | in git (this repo) → clone on VM |
| `cc-ci/`, `cc-ci-adv/` | Builder + Adversary working clones (~13M each) | **re-clone from git.autonomic.zone** on the VM (cleaner than copying) |
| `.cc-ci-logs/` | watchdog/loop logs + `.phase-idx` | copy `.phase-idx` (the resume point); logs start fresh |
| `cc-ci-secrets/` | sops-encrypted secrets repo | in git → clone |
| `references/` | recipe-maintainer corpus (read-only parity source) | clone/rsync from `/srv/recipe-maintainer` |
| **`.testenv`** | TS auth key, Gitea bot creds | **out-of-band copy** (gitignored, never in git) |
| **`~/.ssh/cc-ci-root-ed25519`** | root SSH key to cc-ci | **out-of-band copy** |
| **`.sops/master-age.txt`** | master recovery age key | **out-of-band copy** |
| **Incus mTLS certs** (`/srv/incus-terraform-nix-vm-creator/terraform-secrets/`) | `terraform.{crt,key}`, `vm_ssh_key` | **out-of-band copy** — so the VM can itself manage VMs |
| `cc-ci-tailscaled.service` | userspace SOCKS proxy :1055 | **re-declare as NixOS** (see §3) |
| **claude CLI + auth** | `~/.local/bin/claude` v2.1.154 + `~/.claude.json` | install on VM + **operator `claude auth login`** (§4) |
| this orchestrator session | the supervising claude conversation | **operator-assisted cutover** (§6) |
Two hard human-in-the-loop steps, called out explicitly: **claude auth on the new VM** (device-code
login, can't be scripted) and the **final session cutover** (the operator connects to the new
orchestrator session). Everything else I can do.
## 1. Target VM spec
- **Host/API:** b1 Incus, `https://100.117.251.31:8443`, project `terraform-ci`, mTLS certs (have).
- **Name:** `cc-ci-orchestrator` (tailnet hostname too).
- **Resources:** **2 GB RAM, 2 vCPU, 30 GB disk** (dir backend → resizing needs a reboot, so size the
disk at create time and avoid growing it later). b1 has ample headroom (only cc-nix-test @8GB running).
- **Image:** the existing imported NixOS base VM image (`incus-base-vm`) — already ships tailscale,
openssh, git/jq/curl, flakes, cloud-init.
- **Tailnet:** joins via a fresh `TS_AUTH_KEY` (operator provides, or reuse the keyed approach in
`terraform-secrets/.test.env`). MagicDNS name `cc-ci-orchestrator.taila4a0bf.ts.net`.
- **Bootstrap:** cloud-init writes the `cc-ci-orchestrator` flake config + `nixos-rebuild switch`.
## 2. The new `cc-ci-orchestrator` git repo (NixOS config)
A new **private** repo on `git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator` (bot is org
admin). It is the NixOS config for this VM — the orchestrator's equivalent of what `cc-ci` is for the
test server. Contents:
- `flake.nix` + `hosts/cc-ci-orchestrator/configuration.nix` — the VM's NixOS config.
- **Packages:** `claude-code` (CLI), `git`, `tmux`, `python3`, `jq`, `openssh`, `nodejs` (claude
runtime), `coreutils`, `netcat` (`nc` for the proxy ProxyCommand).
- **`services.cc-ci-tailscaled`** — the userspace tailscaled SOCKS proxy on :1055, as a NixOS
systemd service (port to NixOS from the Pi's `cc-ci-tailscaled.service`). This is the path to b1 +
cc-ci.
- **`services.cc-ci-orchestrator`** — a systemd service that runs `launch.sh start` with
`RESUME_PHASE=1` **on boot** (after the proxy + network are up), as the workspace user. **This is
the reboot-resilience fix** — the loops + watchdog come back automatically after any reboot.
- **Secrets via sops-nix** (like cc-ci): the out-of-band secrets (`.testenv`, ssh key, incus certs)
are sops-encrypted into the repo, decrypted at activation to their runtime paths. The **master age
key** is the one irreducible out-of-band bootstrap secret placed on the VM once.
- `~/.ssh/config` for `cc-ci` (root, ProxyCommand via :1055) declared.
- **Excluded from git:** claude's own auth (`~/.claude.json`) — that's per-user login state, set up
once interactively (§4), not committed.
## 3. Execution phases
### Phase A — provision the VM (reversible; safe to do while Pi loops keep running)
1. Create `cc-ci-orchestrator` VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth
key in cloud-init). Wait for tailnet join + ssh.
2. Verify: `ssh` in, `tailscale status`, `nixos-rebuild` available, can reach b1 API + cc-ci through
its own proxy once configured.
### Phase B — author + apply the `cc-ci-orchestrator` repo
3. Create the private git repo; author the flake/config (§2); commit/push.
4. Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
5. `nixos-rebuild switch` on the VM → proxy service up, packages present, services defined (loop
supervisor **not yet started** — or started in a dry mode).
### Phase C — stage the workspace (no cutover yet)
6. On the VM: clone `cc-ci-autonomous-orchestrator` (the loop code), clone the Builder/Adversary
working repos fresh from git.autonomic.zone, clone `cc-ci-secrets`, rsync `references/`.
7. Copy `.phase-idx` (resume point = phase 2) so the VM watchdog resumes the right phase.
8. **Operator step:** `claude auth login` on the VM (device code) so the loops can run
`--remote-control --dangerously-skip-permissions`. Verify with a throwaway interactive claude.
### Phase D — cutover (the only disruptive moment; pick a clean point)
9. **Quiesce the Pi:** stop the Pi loops + watchdog (`launch.sh stop`); confirm both loops are at a
safe point (no half-written commit; `git status` clean in both clones, last work pushed).
10. **Start on the VM:** enable + start the `cc-ci-orchestrator` systemd service → `launch.sh start`
(RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the
repo state. Verify all three sessions + a handoff + public health.
11. **Decommission the Pi loops:** disable the Pi's `cc-ci-tailscaled` + leave the workspace in place
(read-only fallback) but not running loops. (Keep the Pi as a cold standby for a few days before
deleting anything.)
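
The "safe point" condition in step 9 (clean `git status`, last work pushed) can be checked with a small script before stopping the Pi loops. This is a sketch under stated assumptions: the function name `clone_is_quiesced` is hypothetical, and a clone with no upstream configured is treated as having nothing to push.

```shell
# Hypothetical pre-cutover check: succeeds only if the clone at $1 is a git
# work tree with no uncommitted or untracked changes and, when an upstream
# branch is configured, no commits that are still unpushed.
clone_is_quiesced() {
  local dir="$1" ahead
  git -C "$dir" rev-parse --git-dir >/dev/null 2>&1 || return 2   # not a repo
  [ -z "$(git -C "$dir" status --porcelain)" ] || return 1        # dirty tree
  if git -C "$dir" rev-parse --abbrev-ref '@{u}' >/dev/null 2>&1; then
    ahead="$(git -C "$dir" rev-list --count '@{u}..HEAD')"
    [ "$ahead" -eq 0 ] || return 1                                # unpushed work
  fi
  return 0
}
```

A cutover script could then gate on both clones, for example `clone_is_quiesced /srv/cc-ci/cc-ci && clone_is_quiesced /srv/cc-ci/cc-ci-adv`, using the working-clone paths named in `launch.sh`.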
### Phase E — move the orchestrator session (operator-assisted)
12. On the VM, start the orchestrator session: `claude --remote-control 'autonomous-orchestrator'
--dangerously-skip-permissions` in a tmux session, seeded with AGENTS.md + this plan so it picks
up the supervising role. The **operator connects** to it (claude.ai/code) — this is the
"move myself" step; a session can't transplant itself across machines, so it's a fresh
orchestrator session on the VM with full context from the repo.
13. This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.
## 4. Risks & mitigations
- **claude auth (human step):** unavoidable device-code login on the VM. Mitigation: do it in Phase
C, well before cutover; verify before quiescing the Pi.
- **Loops mid-work at cutover:** pick a quiet point (between gate claims / after a push); the loops
re-orient from git on restart anyway, so worst case is a re-run of an in-flight iteration.
- **Secrets sprawl:** out-of-band secrets are copied once, then sops-managed in the new repo; never
committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
- **Self-move gap:** between Pi-session-ends and VM-session-connected, there's no live orchestrator.
The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
- **Rollback:** until the Pi workspace is deleted, reverting = stop VM service, `launch.sh start` on
the Pi again. Keep the Pi intact until the VM has run clean through at least one reboot + one gate
handshake.
- **Reboot-resilience proof:** before trusting the VM, reboot it once and confirm the loops +
watchdog + proxy all come back via systemd (the whole point of the move).
## 5. Operator-assisted steps (the only things I can't fully do)
1. Provide a fresh `TS_AUTH_KEY` for the VM (or confirm reuse of the one in `terraform-secrets`).
2. `claude auth login` on the VM (device code).
3. Connect to the new orchestrator session on the VM at cutover (Phase E).
Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the
loop cutover) I can drive.

@ -135,3 +135,40 @@ Blocking unless noted; these are *plan-relevant invariants visible only by readi
- Whether to add Python **type-checking** (mypy/pyright) now or defer to `IDEAS.md`.
- The precise **blocking vs advisory** split for the checklist.
- Whether the `.drone.yml` lint stage should **fail** the build or just warn initially.
---
## 7. Operator review items (added 2026-05-27) — repo layout (do in this 1b pass)
Two structural-review items from the operator. Both are **blocking** for 1b. Apply them as part of
this pass, then re-verify (RL3 covers the re-verification). **Mind the coordination caveats — these
touch the live flake build and the running multi-agent machinery.**
### RL5 — Consolidate all Nix-code folders under a root `nix/`
- Move the folders that contain `.nix` code — **`modules/` and `hosts/`** — to **`nix/modules/` and
`nix/hosts/`**. (Add future Nix dirs under `nix/` too.)
- **Keep `flake.nix` / `flake.lock` at the repo root** (entry point) so the build ref is unchanged
(`docs/install.md`'s `nixos-rebuild switch --flake 'git+file://…?submodules=1#cc-ci'` stays valid).
 Just update the flake's internal paths (`./modules` → `./nix/modules`, `./hosts` → `./nix/hosts`)
and any `imports`/`scripts`/`.drone.yml` references.
- **Re-verify after the move:** the byte-identical clean-room result is the bar. The toplevel store
hash *will* change (paths differ) — that's fine; what must hold is that a fresh recursive clone
still rebuilds **byte-identical to the running system** and the Adversary re-confirms it cold
(folds into RL3). Update `docs/architecture.md` to describe the `nix/` layout.
### RL6 — Move uppercase multi-agent-protocol files into `machine-docs/`
- Move the uppercase protocol files — **`STATUS*.md`, `REVIEW*.md`, `JOURNAL*.md`, `BACKLOG*.md`,
`DECISIONS.md`** — into a root **`machine-docs/`** folder. **`README.md` stays in the repo root**
(operator decision, 2026-05-27) — it is the human-facing repo readme, not a protocol file; do
**not** move it into `machine-docs/`.
- **Update every reference** to the new paths: the `cc-ci-plan/` plans (this file, `plan.md`,
`plan-phase1c-*`, `README.md`, `kickoff.md`, `test-e2e-testme-acceptance.md`), `AGENTS.md`,
`.drone.yml`, `scripts/`, and any in-repo doc that points at `STATUS.md`/`REVIEW.md`/etc.
- **⚠ COORDINATION CAVEAT (do not move these unilaterally mid-run):** the live **watchdog**
(`cc-ci-plan/launch.sh`, the orchestrator's file) reads `STATUS-<id>.md` and `REVIEW-<id>.md` at
the **repo root** to drive handoff pings + the 1c→1b auto-transition. Moving them breaks the
running watchdog until `launch.sh` is updated to the `machine-docs/` paths and the watchdog is
restarted. **So sequence it with the orchestrator:** the orchestrator updates `launch.sh`'s
`PHASES_SPEC`/path logic and restarts the watchdog **in lockstep** with the loops' `git mv`.
Safest to do this **near the end of 1b** (or as its final step), not while a phase transition is
pending. Flag the orchestrator when ready and it will handle `launch.sh` + the watchdog restart.

@ -0,0 +1,243 @@
# cc-ci Phase 1d — Generic test suite + layered recipe overlays (Autonomous Build Plan)
**Status:** QUEUED — runs **after Phase 1b** (`plan-phase1b-review-lint.md`) and **before Phase 2**
(`plan-phase2-recipe-tests.md`). It is the **test-architecture foundation** Phase 2 builds on, so it
must precede it.
**Transition:** **manual** (operator kicks it off at the post-1b check-in).
**Builds on:** the post-1b codebase (the runner/harness, `.drone.yml`, the comment-bridge, the proof
recipes, the `nix/` + `machine-docs/` layout from 1b RL5/RL6).
**Owner agents:** same Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1d-generic-test-suite.md`
**Phase order:** 1c → 1b → **1d** → 2 → 2b → 3.
---
## 0. Why this phase
Today a recipe only gets tested if someone has authored tests for it. That doesn't scale to ~18+
recipes and gives nothing for a brand-new recipe. The operator's model (2026-05-27): **every recipe
gets a generic lifecycle test suite for free**, and recipe-specific tests *layer on top* rather than
being the only thing that runs. This makes `!testme` meaningful on **any** recipe immediately, turns
Phase 2 into "add overlays where they add value" instead of "write everything from scratch," and
gives Phase 3 a natural basis for a YunoHost-style **level** (how many tiers pass).
Core principle: **the generic is the default for each lifecycle op; a recipe's own `test_<op>.py`
may override *or* extend that default.** The exact additive-vs-override mechanism is the **Builder's
design call** (operator, 2026-05-27: override — e.g. a present `test_install.py` *replaces* the
generic install assertions — is a perfectly good option; additive is fine too; could even be
per-recipe opt-in). What's **fixed** and non-negotiable: (1) when a recipe defines **no** test for an
op, **the generic runs** (so any recipe is testable with zero config); (2) the harness owns a
**single shared deployment** that generic + recipe assertions reuse — **no redeploy** (§2.2); (3)
custom (non-lifecycle) tests are opt-in.
> **SUPERSEDED (2026-05-28) by Phase 1e HC3** (`plan-phase1e-harness-corrections.md`): the override
> default is replaced — the **generic now runs by default *alongside* an overlay (additive)**,
> skipped only via an explicit opt-out. Also: repo-local PR code is gated to approved recipes (HC2),
> and the upgrade tier targets the PR head (HC1). Read 1e for the current behavior.
---
## 1. Definition of Done (Phase 1d exit condition)
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-1d.md`):
- [ ] **DG1 — Generic INSTALL test.** A recipe-agnostic install test exists that, given only a
recipe name and **zero** recipe-specific config, does `abra app new` (sane defaults / auto
secrets) → `deploy` → polls to converged (all services running/healthy, no bare `sleep`) →
asserts the app is **actually serving** (real HTTP(S) response on its
`<run>.ci.commoninternet.net` domain through Traefik — **not** a Traefik 404/default cert, not
health-only). Demonstrated **green** on ≥1 simple recipe (e.g. `custom-html`) that has **no**
cc-ci or repo-local tests.
- [ ] **DG2 — Generic UPGRADE test.** Deploy the **previous published version** (the last release
*before* the code under test), then **upgrade to the code under test (PR head) via
`abra app deploy --chaos`** (chaos = the current checkout) — i.e. previous-release → PR-head,
not previous → newest-published-tag. Assert services reconverge and the app still serves.
**OPERATOR CORRECTION (2026-05-28):** the current 1d impl upgrades to the newest *published tag*
and (because deploying the prev tag re-checks-out the recipe) never deploys the PR head — so a
recipe PR's actual changes aren't exercised by the upgrade path. Fix: after deploying prev,
**restore the PR-head checkout** (re-checkout the PR ref / re-snapshot) and `deploy --chaos` to
it as the upgrade target. The "deployment actually moved" assertion (`do_upgrade`) still
applies, but adapt it for prev→PR-head (a PR may not bump the version label — also accept an
image/config change, or assert the running config now matches the PR head). For a non-PR
`!testme`, "current checkout" = the catalogue current, so upgrade tests prev→current. (Data-
continuity assertions remain recipe overlays — see §2.1.)
- [ ] **DG3 — Generic BACKUP + RESTORE tests.** For backup-capable recipes (declare backupbot /
`abra app backup` support): run backup → assert a snapshot artifact is produced; then restore →
assert restore completes and the app is healthy after. For recipes that declare **no** backup
 config, backup/restore are cleanly **N/A (skipped)**, *not* a failure.
- [ ] **DG4 — Layering (override-or-extend; generic is the default).** The harness composes a
recipe's run from the generic default per op, **overridden or extended** by the recipe's
`test_install.py` / `test_upgrade.py` / `test_backup.py` / `test_restore.py` when present (in
cc-ci's tests dir **or** repo-local in the recipe's `tests/`) — the additive-vs-override
mechanism is the Builder's design choice, recorded in `machine-docs/DECISIONS.md`. **Invariant:
if a recipe defines no test for an op, the generic runs.** Arbitrary **custom** `test_*.py` run
**only if defined**. Discovery + cc-ci-vs-repo-local precedence is
implemented and settled in `machine-docs/DECISIONS.md`.
- [ ] **DG4.1 — Overlays reuse the deployment (no redeploy).** Overlay + custom assertions run
against the **same live deployment** the generic tier brought up — **one deploy per run, one
teardown** (§2.2), lifecycle ops mutating it in place. Verified: adding an overlay to a recipe
causes **no** extra `abra app new/deploy/undeploy` beyond the shared run (assert via
deploy-count / harness logs). Re-provisioning only where an op semantically demands it, and
then explicitly.
- [ ] **DG5 — Custom install-steps hook (with the "generic-anyway" rule).** The harness supports
**defined** per-recipe extra install steps (cc-ci or repo-local) that run before/around the
generic install. A recipe with **no** customization still **attempts the generic suite**.
Proven both ways: (a) a recipe needing a custom step **fails the generic install as expected**
when the step is absent; (b) the **same** recipe **passes** once the custom step is added —
demonstrating the hook + the graceful-generic-failure are both real.
- [ ] **DG6 — `!testme` end-to-end on an unconfigured recipe.** A `!testme` on a recipe with no
cc-ci/repo-local customization runs the full generic suite through the real pipeline
(bridge → Drone → deploy → assert → undeploy → report) and reports **per-operation**
pass/fail/skip (install/upgrade/backup/restore).
- [ ] **DG7 — Real, DRY, clean.** No softened/`skip`/`xfail`/can't-fail assertions; generic logic
lives in the **shared harness** (M6.5 — no per-recipe copy-paste); every run **undeploys**
(teardown in `finally`), respects `MAX_TESTS`.
- [ ] **DG8 — Documented + cold-verified.** `docs/` explains the generic suite, the overlay
convention (file names + locations + precedence), and the custom-install-steps hook + how to
add a recipe overlay. The Adversary re-runs the acceptance checks from a cold start within 24h.
When DG1–DG8 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-1d.md`.
---
## 2. The layered test model (the core design)
For a given recipe, the harness assembles and runs **tiers**, each = `generic [+ overlays]`:
(`gen` = generic default; `→` = "overridden/extended by, if the recipe defines it" — mechanism is
the Builder's call, §2.2):
```
INSTALL = gen_install → test_install.py (else gen) ← always runs
UPGRADE = gen_upgrade → test_upgrade.py (else gen) ← always runs
BACKUP = gen_backup → test_backup.py (else gen) ← if backup-capable, else N/A
RESTORE = gen_restore → test_restore.py (else gen) ← if backup-capable, else N/A
CUSTOM = recipe test_*.py (anything beyond the four) ← ONLY if defined
```
### 2.1 Generic baseline suite (recipe-agnostic)
- **install** — `abra app new <recipe> --domain <run>.ci.commoninternet.net` with non-interactive
defaults + auto-generated secrets; `abra app deploy`; poll to converged; assert a real HTTP(S)
response from the app over its domain (status + that it's the app, not Traefik's fallback).
- **upgrade** — deploy a prior/pinned version, `abra app upgrade` to target, assert reconverge +
still serving.
- **backup / restore** — only for recipes declaring backup config; verify the **mechanism** (backup
produces an artifact; restore completes + app healthy). **Honest limit:** generic backup/restore
can't assert app-specific *data integrity* without recipe knowledge — that's a recipe overlay
(`test_backup.py`/`test_restore.py` seed a marker + assert it survives). State this in docs.
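The "poll to converged, no bare `sleep`" requirement can be sketched as a small bounded polling helper. This is a minimal illustration, not the harness's actual code: `wait_until`, its parameters, and the default timeout/interval are hypothetical names and values chosen for the example. The convergence check itself (services running, real HTTP response) would be supplied by the caller.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Hypothetical helper: returns True on convergence and False on timeout,
    so the caller can turn a timeout into a real test failure.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        # Sleep only between checks, never as a substitute for checking.
        sleep(interval_s)
```

A generic install tier might call `assert wait_until(app_is_serving)`, where `app_is_serving` (also hypothetical) performs the HTTP(S) probe described above.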
### 2.2 Layering — override or extend (Builder's call), always **reuse the deployment**
A recipe-defined `test_install.py` (etc.) either **overrides** the generic assertions for that op or
**extends** them — the Builder picks the mechanism (a present `test_<op>.py` replacing the generic is
a fine, simple model; or additive; or per-recipe opt-in). The invariant either way: **if a recipe
defines nothing for an op, the generic runs.** This guarantees every recipe is testable with zero
config, while letting a recipe with a poor generic fit supply its own.
**Reuse the deployment — do NOT redeploy per test (operator requirement, 2026-05-27).** Overlay
assertions run against the **same live deployment** the generic tier already brought up — no extra
`abra app new`/`deploy`/`undeploy` per overlay. The target shape: **one deploy per run**, then the
lifecycle ops mutate that single deployment in sequence and *all* assertions (generic + overlays +
custom) run against it, then **one teardown** at the end:
```
deploy ONCE
→ INSTALL assertions: generic_install + test_install.py (same live app)
→ UPGRADE in place (abra app upgrade)
assertions: generic_upgrade + test_upgrade.py (same app, upgraded)
→ BACKUP (if capable) → generic_backup + test_backup.py
→ RESTORE (if capable) → generic_restore + test_restore.py
→ CUSTOM test_*.py (same live app)
teardown ONCE (in finally)
```
So a seed/marker written by `test_backup.py` is the same data `test_restore.py` checks; an overlay's
extra HTTP assertion hits the app the generic install already deployed. Tiers that intentionally
change state (UPGRADE, RESTORE) do so on the shared deployment in order. The only time a tier
re-provisions is when an op semantically requires it (e.g. a from-scratch restore-into-blank variant)
— and that must be explicit, not the default. This is also the main Phase-2b speed win.
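The one-deploy, one-teardown shape above can be sketched as follows. This is an assumption-laden outline rather than the real runner: `run_lifecycle` and its callable arguments are hypothetical, and it picks continue-and-report on a failed tier, which §6 still lists as an open decision.

```python
from typing import Callable, Dict, List, Tuple


def run_lifecycle(
    deploy: Callable[[], None],
    teardown: Callable[[], None],
    tiers: List[Tuple[str, Callable[[], None]]],
) -> Dict[str, str]:
    """Deploy once, run every tier against that deployment, tear down once.

    Hypothetical sketch: each tier callable performs its assertions and
    raises AssertionError on failure.
    """
    results: Dict[str, str] = {}
    try:
        # Deploy sits inside the try so a half-created app is still torn down.
        deploy()
        for name, assertions in tiers:
            try:
                assertions()
                results[name] = "pass"
            except AssertionError:
                # Assumption: keep going and report per-op (not fail-fast).
                results[name] = "fail"
    finally:
        # Teardown always runs, including on unexpected exceptions.
        teardown()
    return results
```

Because every tier receives the same live deployment, a marker seeded during the backup tier is the data the restore tier later checks.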
### 2.3 Custom tests
`test_*.py` that aren't one of the four lifecycle names run **only when present** for that recipe —
no generic equivalent, purely opt-in (e.g. `test_sso.py`, `test_federation.py`).
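Splitting a recipe's test files into lifecycle overlays and custom tests could look like the sketch below. The function name `classify_tests` and the returned dictionary shape are illustrative assumptions; only the four lifecycle file names come from this plan.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List

# The four lifecycle ops named in this plan.
LIFECYCLE_OPS = ("install", "upgrade", "backup", "restore")


def classify_tests(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Split test_*.py files into lifecycle overlays and opt-in custom tests.

    Hypothetical helper: anything not matching test_*.py (conftest.py,
    helpers, data files) is ignored.
    """
    lifecycle_names = {f"test_{op}.py" for op in LIFECYCLE_OPS}
    overlays: List[str] = []
    custom: List[str] = []
    for name in sorted(filenames):
        base = PurePath(name).name
        if not (base.startswith("test_") and base.endswith(".py")):
            continue
        (overlays if base in lifecycle_names else custom).append(base)
    return {"overlays": overlays, "custom": custom}
```

With this split, an empty `custom` list simply means the CUSTOM tier does not run for that recipe.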
### 2.4 Custom install steps (and the graceful-generic rule)
Some recipes need extra setup the generic flow won't do (pre-seed a DB, set an env/secret, run a
one-off command). The harness exposes a **defined** per-recipe install-steps hook (cc-ci or
repo-local). Rules:
- If a recipe **declares** custom install steps → run them as part of the install tier.
- If a recipe has **no** customization defined anywhere → **still attempt the generic suite.**
Recipes that genuinely need special steps will **fail the generic install — and that's acceptable
and expected**; the failure is reported (per-op), and the fix is to add the custom step (Phase 2
work), not to special-case the harness.
### 2.5 Discovery + precedence
- **Locations:** cc-ci's test dir (e.g. `tests/<recipe>/`) and the recipe repo's `tests/`
(repo-local). The harness discovers overlays + custom-install-steps from both.
- **Precedence (OPEN — settle in DECISIONS):** proposed default — **both layer**; repo-local is the
upstream-authoritative source and always runs, cc-ci's overlay runs in addition (and may pin/extend
for the CI env). Define the rule for same-named collisions explicitly.
---
## 3. Milestones (bounded)
- **G0 — Generic install.** Implement generic_install in the shared harness; green on `custom-html`
with no recipe config. *Accept:* DG1.
- **G1 — Generic upgrade + backup/restore.** Add generic_upgrade; add generic_backup/restore gated
on backup-capability (clean N/A otherwise). *Accept:* DG2, DG3.
- **G2 — Layering + discovery.** Implement the generic+overlay composition and cc-ci/repo-local
discovery + precedence; prove an overlay runs on top of generic. *Accept:* DG4.
- **G3 — Custom install-steps hook + graceful-generic.** Implement the hook; demonstrate the
fail-without / pass-with proof on one recipe needing a step. *Accept:* DG5.
- **G4 — `!testme` integration + per-op reporting + docs + cold verify.** *Accept:* DG6, DG7, DG8;
then flip `machine-docs/STATUS-1d.md` to `## DONE`.
---
## 4. Guardrails
- **Generic is the default; recipe tests override or extend it** (Builder's mechanism) — but the
generic **always** runs when a recipe defines no test for that op. Never let a recipe end up with
*no* assertion for an op it should be tested on.
- **Generic failure ≠ harness bug** — a recipe needing custom steps failing the generic install is a
correct, reported outcome; fix by adding the step (Phase 2), don't weaken/special-case the generic.
- **Deploy once, reuse it** — overlays run against the generic tier's live deployment; no
per-test/per-overlay redeploy. One deploy + one teardown per run; re-provision only when an op
truly needs it, explicitly. (Correctness *and* the main perf lever.)
- **DRY** — generic logic in the shared harness, not per-recipe (M6.5); overlays are thin.
- **No weakened tests** — real assertions on real app state; teardown always; honor `MAX_TESTS`.
- **Bounded** — build the architecture + prove it on a couple of recipes; the full per-recipe overlay
authoring is Phase 2, not here.
---
## 5. Impact on later phases (reshapes the plan set)
- **Phase 2 (`plan-phase2-recipe-tests.md`)** changes from "author every test from scratch" to:
every enrolled recipe gets the generic suite for free; Phase 2 = **author the additive overlays**
(port recipe-maintainer tests as `test_*.py` overlays) **+ define custom install steps** where a
recipe fails generically. Update Phase 2 to reference 1d as its foundation.
- **Phase 3 (`plan-phase3-results-ux.md`)** — the YunoHost-style **level** maps cleanly onto the
tiers: e.g. installs (generic) → +upgrade → +backup/restore → +custom assertions. The level is
derived from which tiers pass.
- **Phase 2b (perf)** — the generic suite is the common hot path, so it's the prime target for the
image-cache / readiness / dedup optimizations.
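To make the Phase 3 "level derived from which tiers pass" idea concrete, here is one possible reading. Everything in it is an assumption for illustration: the function name `level`, the tier grouping, and especially the treatment of `skip` as neutral (it neither raises the level nor stops the count). The real rule belongs to Phase 3 and the result vocabulary in §6.

```python
from typing import Dict, Tuple

# Assumed tier order, following the example in the Phase 3 bullet:
# install, then upgrade, then backup/restore, then custom assertions.
TIER_ORDER: Tuple[Tuple[str, ...], ...] = (
    ("install",),
    ("upgrade",),
    ("backup", "restore"),
    ("custom",),
)


def level(results: Dict[str, str]) -> int:
    """Count passing tiers in order, stopping at the first failing tier.

    Hypothetical rule: a tier counts only if all its ops are "pass";
    a skipped (N/A) tier is neutral; "fail" or "error" ends the count.
    """
    lvl = 0
    for ops in TIER_ORDER:
        outcomes = [results.get(op, "skip") for op in ops]
        if any(o in ("fail", "error") for o in outcomes):
            break
        if all(o == "pass" for o in outcomes):
            lvl += 1
    return lvl
```

Under this reading, a recipe without backup support that passes install and upgrade would sit at level 2 rather than being penalised for the N/A tiers.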
---
## 6. Open decisions (log in machine-docs/DECISIONS.md)
- **Override vs extend (Builder's call).** Does a present `test_<op>.py` **replace** the generic
assertions for that op, **add** to them, or is it **per-recipe opt-in**? Operator leans: override
is a good, simple model. Builder decides + documents — keeping the invariant "no recipe test for an
op ⇒ generic runs" and the single-shared-deployment rule.
- **cc-ci vs repo-local precedence** for overlays + install-steps (§2.5) and same-name collision rule.
- Exact **overlay file convention**: fixed names (`test_install.py`…) + discovery dir layout
(`tests/<recipe>/` in cc-ci; `tests/` in the recipe repo).
- How **custom install steps** are declared: a shell hook (`install_steps.sh`), a pytest fixture, or
a declarative field — pick the simplest that the harness can run uniformly.
- **Backup-capability detection**: how the harness decides a recipe is backup-capable (backupbot
labels present / `abra app backup` exit) to choose run-vs-N/A for DG3.
- Whether generic **upgrade** should always go previous→latest, or test the specific
version-bump under `!testme` (PR-driven).
- Per-op result vocabulary (`pass`/`fail`/`skip(N/A)`/`error`) feeding the Phase-3 level.
- **Deployment-sharing scope**: confirm one-deploy-per-run for the whole lifecycle (install→upgrade→
backup→restore→custom on a single deployment) vs per-tier deployments; and how a failed earlier
tier (e.g. install) affects later tiers sharing that deployment (fail-fast vs continue-and-report).

@ -0,0 +1,139 @@
# cc-ci Phase 1e — Generic-harness corrections (Autonomous Build Plan)
**Status:** QUEUED — runs **after Phase 1d** and **before Phase 2** (`plan-phase2-recipe-tests.md`).
It corrects the **shared generic-test harness** from 1d, so it must land before Phase 2 authors
overlays on top of it.
**Transition:** **manual** (operator kicks it off).
**Builds on:** the Phase-1d generic suite (`runner/run_recipe_ci.py`, `runner/harness/*`,
`tests/_generic/*`, `tests/conftest.py`) — see `plan-phase1d-generic-test-suite.md`.
**Owner agents:** same Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1e-harness-corrections.md`
**Phase order:** 1c → 1b → 1d → **1e** → 2 → 2b → 3.
---
## 0. Why this phase
An operator review of the 1d generic suite (2026-05-28) found three corrections to the **shared
harness** — the foundation every recipe overlay (Phase 2) builds on. Fixing them now, once, is far
cheaper than after overlays exist. All three are small in code but change behavior, so each needs a
fresh Adversary cold-verification and must not weaken any existing test.
---
## 1. Definition of Done (Phase 1e exit condition)
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-1e.md`):
- [ ] **HC1 — Upgrade tier upgrades to the code under test (PR head), not a published tag.** The
upgrade tier deploys the **previous published version** (last release before the PR) and then
**upgrades to the PR head via `abra app deploy --chaos`** (chaos = the current checkout). The
PR's actual changes are exercised by the upgrade path. (§2.1)
- [ ] **HC2 — Repo-local (PR-authored) code is not executed unless the recipe is approved.** By
default the harness runs **only cc-ci-authored** overlays/install-steps (`tests/<recipe>/…`) +
the generic; PR-authored repo-local `test_*.py` and `install_steps.sh` are **not run**.
Repo-local code is honored **only for recipes on an explicit cc-ci-maintained approval
allowlist** (default-deny). (§2.2)
- [ ] **HC3 — Generic runs by default (additive); skipping it is explicit.** When a recipe ships an
overlay for an op, the **generic still runs** alongside it by default; the generic is skipped
**only** when an explicit env/flag opts out. The baseline floor is never lost silently. (§2.3)
- [ ] **HC4 — No regression, cold-verified.** The Adversary re-runs the relevant D1–D10 / DG1–DG8
acceptance from a cold start: nothing weakened, deploy-once (DG4.1) still holds, teardown still
sacred, and the three new behaviors are demonstrated (HC1: a PR-head upgrade proven to deploy
PR-head; HC2: a repo-local test is *ignored* for a non-approved recipe and *run* for an approved
one; HC3: generic runs with an overlay present, and is skipped only with the opt-out set).
When HC1–HC4 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-1e.md`.
---
## 2. The three corrections
### 2.1 HC1 — Upgrade to the PR head (not a published tag)
Current 1d behavior: deploy previous published version, then `abra app upgrade` to the **newest
published tag** — and because deploying the prev tag re-checks-out the recipe, the **PR-head code is
never deployed**, so a recipe PR's changes aren't exercised by upgrade.
Corrected:
1. Deploy the **previous published version** (the last release before the code under test) as the
"before" state.
2. **Restore the PR-head checkout** (re-checkout the PR ref / re-use the post-fetch snapshot — the
prev-tag deploy will have reset `~/.abra/recipes/<recipe>`).
3. **Upgrade to it via `abra app deploy --chaos`** (chaos = current checkout = PR head) in place on
the shared deployment.
4. Assert reconverge + still serving (as today).
- **Adapt the "deployment moved" assertion** (`generic.do_upgrade`): prev→PR-head may *not* bump the
coop-cloud version label (a PR can change a recipe without a version bump), so also accept an
image/config change, or assert the running config now matches the PR head — keep it non-vacuous
without false-failing a legit unbumped PR.
- **Non-PR `!testme`** (no PR head): "current checkout" = the catalogue current, so upgrade tests
prev→current — still valid.
- Preserve **deploy-once** spirit: this is still one app deployment mutated in place (prev → chaos
redeploy of PR head is the upgrade op, not a fresh second app). Reconcile with the DG4.1
deploy-count guard — define whether a chaos redeploy counts as a "deploy" and adjust the guard so
the legitimate upgrade isn't flagged (e.g. count `abra app new` installs, not in-place redeploys).
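The adapted "deployment moved" check might be sketched like this. It is not the existing `generic.do_upgrade` logic: the `DeploySnapshot` type, its fields, and `deployment_moved` are hypothetical, and how the harness would actually collect a version label, image set, or config digest from the running app is left open.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DeploySnapshot:
    """Hypothetical summary of a running deployment at one point in time."""

    version_label: Optional[str]
    images: FrozenSet[str]
    config_digest: str


def deployment_moved(
    before: DeploySnapshot,
    after: DeploySnapshot,
    expected_config_digest: Optional[str] = None,
) -> bool:
    """Accept any of: version bump, image change, config change, or
    (if a PR-head digest is supplied) running config matching the PR head."""
    if before.version_label != after.version_label:
        return True
    if before.images != after.images:
        return True
    if before.config_digest != after.config_digest:
        return True
    # Unbumped PR with no visible diff: still require that what is running
    # matches the PR head, so the check is never vacuous.
    return (
        expected_config_digest is not None
        and after.config_digest == expected_config_digest
    )
```

The last branch is what keeps an unbumped PR from false-failing while still asserting something real about the running deployment.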
### 2.2 HC2 — Repo-local trust gate (default-deny; cc-ci overlays only)
`install_steps.sh` and repo-local `test_*.py` are PR-author-controlled code that runs on the CI host
with `/run/secrets/*` present — an untrusted-code risk. Operator decision (2026-05-28):
- **Default:** the harness runs **only cc-ci-authored** overlays + install-steps
(`tests/<recipe>/…`) and the generic. Repo-local (`<recipe-repo>/tests/`) `test_*.py` and
`install_steps.sh` are **discovered-but-not-executed**.
- **Approved recipes only:** repo-local code is honored **only** when the recipe is on an explicit,
**cc-ci-maintained approval allowlist** (default-empty ⇒ default-deny). Adding a recipe to the
allowlist is a deliberate cc-ci-maintainer act after reviewing that recipe's tests.
- Update `discovery.resolve_op` / `custom_tests` / `install_steps` so the **repo-local source is
only consulted for allowlisted recipes**; otherwise precedence is **cc-ci > generic** only.
- **Open (settle in DECISIONS):** the allowlist's form + location (a checked-in file like
`tests/repo-local-approved.txt`, or a field in a cc-ci config), and the approval workflow. Keep it
simple + auditable + in git.
- (Future hardening, → IDEAS, not this phase: sandbox/network-restrict even cc-ci overlays.)
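A default-deny allowlist gate could be as small as the sketch below. The file format (one recipe per line, `#` comments) and the names `parse_allowlist` and `sources_for` are assumptions for illustration; the plan leaves the allowlist's form and location open. The returned set says only which sources are consulted, not their precedence.

```python
from typing import Set


def parse_allowlist(text: str) -> Set[str]:
    """Parse an approval allowlist: one recipe name per line, '#' comments.

    Hypothetical format. An empty or comment-only file approves nothing.
    """
    approved: Set[str] = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            approved.add(entry)
    return approved


def sources_for(recipe: str, approved: Set[str]) -> Set[str]:
    """Return which test sources the harness may consult for a recipe."""
    sources = {"cc-ci", "generic"}
    # Default-deny: repo-local (PR-authored) code only for approved recipes.
    if recipe in approved:
        sources.add("repo-local")
    return sources
```

Keeping the parsed set as the single input to discovery means an unreadable or missing allowlist can be treated as empty, which fails closed.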
### 2.3 HC3 — Generic by default (additive), explicit opt-out
Supersedes 1d's pure-override default. New rule: when a recipe ships an overlay for an op, **both the
generic and the overlay run** for that op by default; the generic is skipped **only** when an
explicit opt-out is set.
- **Opt-out mechanism (propose; settle in DECISIONS):** an env flag `CCCI_SKIP_GENERIC` (all ops) and
per-op `CCCI_SKIP_GENERIC_<OP>` (e.g. `..._UPGRADE`), settable via the recipe's `recipe_meta.py`
(a `SKIP_GENERIC` list) so it's declarative per recipe, not a hidden global.
- **Op-vs-assertion split (required by additive + deploy-once):** a mutating op (upgrade/backup/
restore) must run **once**, then **both** the generic assertions and the overlay assertions
evaluate the post-op state — never upgrade/backup twice. So refactor the tiers: the **orchestrator
performs the op once** (the harness owns the op), then runs generic assertions (unless opted out) +
overlay assertions against the shared post-op deployment. For `install` (no op) both assertion sets
just run. This keeps deploy-once and one-op-per-tier intact.
- Net effect: the generic "is it actually serving / did the upgrade move / snapshot produced" floor
is **always** exercised unless a recipe explicitly declares it skips generics — overlays add, they
don't silently subtract.
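The op-versus-assertion split with an explicit opt-out can be outlined as below. The env names `CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_<OP>` and the `SKIP_GENERIC` list are the ones proposed above. The `"1"` value convention, the function names, and the returned trace list are assumptions made for this sketch.

```python
from typing import Callable, Iterable, List, Mapping, Optional


def generic_skipped(
    op: str, env: Mapping[str, str], skip_generic: Iterable[str] = ()
) -> bool:
    """True only when an explicit opt-out is set for this op.

    Assumption: a flag counts as set when its value is exactly "1".
    """
    if op in set(skip_generic):  # declarative SKIP_GENERIC from recipe_meta.py
        return True
    if env.get("CCCI_SKIP_GENERIC") == "1":  # all ops
        return True
    return env.get(f"CCCI_SKIP_GENERIC_{op.upper()}") == "1"  # per op


def run_tier(
    op: str,
    perform_op: Optional[Callable[[], None]],
    generic_assert: Callable[[], None],
    overlay_assert: Optional[Callable[[], None]],
    env: Mapping[str, str],
    skip_generic: Iterable[str] = (),
) -> List[str]:
    """Perform the mutating op once, then evaluate both assertion sets."""
    ran: List[str] = []
    if perform_op is not None:  # install has no separate op
        perform_op()
        ran.append("op")
    if not generic_skipped(op, env, skip_generic):
        generic_assert()
        ran.append("generic")
    if overlay_assert is not None:
        overlay_assert()
        ran.append("overlay")
    return ran
```

Passing `env` in explicitly (for example `os.environ` in production) keeps the opt-out visible at the call site and easy to exercise in tests.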
---
## 3. Method / milestones (bounded)
- **E0 — HC2 trust gate.** Gate repo-local behind the approval allowlist (default-deny); cc-ci+generic
only otherwise. *Accept:* repo-local ignored for a non-approved recipe, run for an approved one.
- **E1 — HC3 additive + op/assertion split.** Generic runs alongside overlays by default; op runs
once; opt-out env skips the generic assertions. *Accept:* overlay + generic both run on one
deployment; opt-out skips generic; deploy-count still 1.
- **E2 — HC1 upgrade-to-PR-head.** prev-release → PR-head via `deploy --chaos`; moved-assertion
adapted; deploy-count guard reconciled. *Accept:* upgrade demonstrably deploys PR-head.
- **E3 — HC4 cold re-verification + docs.** Adversary cold-verifies no regression + the three new
behaviors; update `docs/` + `machine-docs/DECISIONS.md`; flip `STATUS-1e.md` to `## DONE`.
---
## 4. Guardrails
- **Never weaken a test** — these are correctness/security fixes; the cardinal rule still wins.
- **Default-secure** — repo-local PR code is off unless the recipe is explicitly approved; the
allowlist lives in git and is auditable.
- **Floor-by-default** — the generic baseline always runs unless a recipe explicitly opts out.
- **Deploy-once preserved** — one app deployment, one teardown; ops run once; reconcile the DG4.1
guard with the chaos-upgrade redeploy.
- **Bounded** — three fixes + verification, then stop; bigger hardening (sandboxing) → IDEAS.
## 5. Open decisions (log in machine-docs/DECISIONS.md)
- HC2: approval-allowlist form/location + the approval workflow.
- HC3: opt-out flag name/granularity + declaring it via `recipe_meta.py`.
- HC1: how the DG4.1 deploy-count guard treats an in-place chaos upgrade (don't flag the legit op).

@ -1,8 +1,13 @@
# cc-ci Phase 2 — Comprehensive per-recipe test authoring (Autonomous Build Plan)
**Status:** QUEUED — starts after Phase 1 (`plan.md`) and the Phase-1b review/lint pass
(`plan-phase1b-review-lint.md`) reach `## DONE`.
**Builds on:** the Phase-1 cc-ci CI server (`plan.md`). This phase adds **test content**, not infra.
**Status:** QUEUED — starts after Phase 1b (`plan-phase1b-review-lint.md`) **and Phase 1d**
(`plan-phase1d-generic-test-suite.md`) reach `## DONE`.
**Builds on:** the Phase-1 cc-ci CI server (`plan.md`) **and Phase 1d's generic test suite** — every
recipe already gets generic install/upgrade/backup/restore for free, reusing one shared deployment.
So this phase is **not** "write every test from scratch": it's **authoring the recipe overlays**
(`test_<op>.py` that override/extend the generic, per 1d's model) + **defining custom install steps**
for recipes that fail the generic form, porting the recipe-maintainer corpus as overlays. This phase
adds **test content**, not infra.
**Reference corpus:** `references/recipe-maintainer/` → `/srv/recipe-maintainer/` (the existing,
human-maintained recipe tests — the canonical source to port from).
**Owner agents:** same Builder + Adversary loops + coordination protocol as Phase 1 (`plan.md` §6/§7).
@ -10,6 +15,44 @@ human-maintained recipe tests — the canonical source to port from).
---
## 0c. OIDC / SSO-dep recipes — follow the SSO-dep plan (NOT operator-pending)
Recipes that authenticate via OIDC (lasuite-docs, cryptpad, lasuite-drive, lasuite-meet, future)
do **not** need operator input to wire OIDC. The canonical pattern lives at
**`plan-sso-dep-testing.md`** — declared `DEPS = ["keycloak"]` in `recipe_meta.py`, orchestrator
co-deploys the dep, recipe `install_steps.sh` reads `$CCCI_DEPS_FILE` and writes the OIDC env vars
+ injects the client secret via abra. Authenticated tests use `harness.sso.oidc_password_grant` or
Playwright on the dep's login page. **`DEFERRED.md` entries that cite "operator input needed for
OIDC" are mis-categorised** — re-open and execute them per this plan.
## 0b. Auto-mirror missing recipes (NOT a blocker — autonomous loop work)
A recipe that is not on `git.autonomic.zone/recipe-maintainers/<recipe>` is **not** an operator-pending
blocker. The bot is admin on `recipe-maintainers` (see memory `cc-ci-gitea-recipes`) and can create
private mirror repos. **For any recipe you want to enroll/test that isn't mirrored yet, mirror it
yourself** before enrolling, based on the **`recipe-create-pr` skill** —
`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md` (which references
`/srv/recipe-maintainer/.claude/commands/recipe-create-pr.md` for the full procedure).
The flow (adapt the skill's command for the new-mirror case):
1. Create the **private** mirror repo on `git.autonomic.zone/recipe-maintainers/<recipe>` (Gitea API
POST `/orgs/recipe-maintainers/repos`, bot creds from `.testenv`/§1.5).
2. Mirror the upstream `git.coopcloud.tech/coop-cloud/<recipe>` (clone --mirror → push, including
tags) so the mirror's `main` is upstream-synced and tags carry over.
3. Then proceed with normal enrollment + the lifecycle suite (1d generic + your overlays from this
phase).
Treat this as standard loop work — don't sit idle waiting on the operator for missing recipes.
## 0a. Prerequisite: Phase 1e harness corrections (must be DONE first)
The 1d/1b operator review produced three shared-harness corrections, now their own phase
**`plan-phase1e-harness-corrections.md`** (runs **before** this phase). Do **not** author overlays
until 1e is `## DONE`: it changes the foundation every overlay sits on — (HC1) upgrade goes
prev-release → **PR head** via `deploy --chaos`; (HC2) **repo-local PR code runs only for approved
recipes** (default cc-ci-overlays + generic only); (HC3) the **generic runs by default alongside an
overlay**, skipped only via an explicit opt-out. See that plan for detail.
## 0. Relationship to Phase 1 (read first)
Phase 1 built the **machine**: the Drone pipeline, the `!testme` trigger (polling-primary), the


@ -0,0 +1,113 @@
# cc-ci Phase 4 — Final review, polish & cleanup (capstone) (Autonomous Build Plan)
**Status:** QUEUED — the **LAST** phase, runs after Phase 3 (`plan-phase3-results-ux.md`). A bounded
final review/lint/cleanup pass over the **entire** codebase as it stands after all phases, ending in a
**full cold re-verification that nothing regressed**.
**Transition:** auto (last in the launcher sequence); after it, the whole build is done.
**Builds on:** everything — Phase 1 + 1c + 1b + 1d + 1e + 2 + 2b + 3 (flake/`nix/` modules, the
runner/harness + generic suite + recipe overlays, the comment-bridge, Drone, the dashboard/results
UX, docs, `machine-docs/`).
**Owner agents:** same Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase4-final-review-polish-cleanup.md`
**Phase order:** 1c → 1b → 1d → 1e → 2 → 2b → 3 → **4 (final)**.
---
## 0. Why this phase (and why it's bounded)
This is the analogue of Phase 1b, but final and whole-codebase. By now the tree has grown a lot —
recipe overlays (Phase 2), performance changes (2b), and results/dashboard UX (3) — all layered on
the foundation. Before calling the build done, do one **bounded** pass to clean and harden it, and —
critically — **re-verify from a cold start that none of the growth/cleanup regressed any earlier
guarantee.** Same discipline as 1b: **good-enough + enforceable**, style→tooling, judgment→checklist,
don't reopen settled design, and **never weaken a test** to satisfy a nit.
---
## 1. Definition of Done (Phase 4 exit condition)
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-4.md`):
- [ ] **F1 — Lint/format green across the whole codebase.** Re-run the 1b toolchain (alejandra/
statix/deadnix, ruff, shellcheck/shfmt, yamllint) over everything added since 1b (Phase-2
overlays, 2b changes, 3 UX/dashboard); extend the lint config to any new languages/areas (e.g.
dashboard front-end) so it's covered going forward. The `.drone.yml` lint stage still passes
from a clean checkout; prove with a break-it probe.
- [ ] **F2 — White-box review checklist over all post-1b code.** Run the §3 checklist; fix every
**blocking** finding, triage advisories to `BACKLOG`/`IDEAS`. Findings + resolutions in
`machine-docs/REVIEW-4.md`.
- [ ] **F3 — Cleanup.** Remove dead code/scaffolding and stale TODOs; consistent naming/structure;
reconcile `machine-docs/` (BACKLOG/IDEAS/DECISIONS current, no contradictions); docs match the
final state. No behavior change beyond what F2 mandates.
- [ ] **F4 — FULL cold re-verification (the final gate).** *After* F1–F3 land, the Adversary
**independently re-verifies every prior Definition-of-Done from a cold start**, to the same bar
each phase used — fresh PASS + evidence + timestamps in `machine-docs/REVIEW-4.md` within 24h,
**nothing weakened/skipped/softened** by the cleanup:
- **Phase 1 D1–D10** (incl. the genuine **D8** byte-identical fresh-clone rebuild + a
category-spanning live `!testme` e2e through the public gateway).
- **Phase 1c C1–C7** (secrets-in-git, cert-in-sops, honest reproducibility).
- **Phase 1d DG1–DG8** (generic install/upgrade/backup/restore, deploy-once `DG4.1`, override
floor) **as amended by 1e**.
- **Phase 1e HC1–HC3** (upgrade→PR-head via `deploy --chaos`; repo-local gated to approved
recipes; generic-by-default + explicit opt-out).
- **Phase 2** recipe-coverage criteria (every enrolled recipe's overlays/ported tests real,
DRY, green).
- **Phase 2b** performance claims (the measured improvements still hold; no test weakened to
get them).
- **Phase 3** results/level/UX criteria (per-run level honest, PR comment + dashboard correct).
- [ ] **F5 — Documented + cold-verified.** Final `docs/` accurate (install reproduces from scratch;
enroll-recipe + overlay/approval flow correct); accepted deviations in `DECISIONS.md`; the
Adversary confirms F1–F4 with no standing VETO and no open `[adversary]` finding.
When F1–F5 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-4.md` — the build is
complete.
---
## 2. Method
1. **Lint/format first** (F1) — re-run + extend; auto-fix style, don't deliberate.
2. **Review checklist** (F2, §3) — classify blocking vs advisory; fix blocking, triage rest.
3. **Cleanup** (F3) — dead code, naming, docs, `machine-docs/` reconciliation.
4. **Full cold re-verification LAST** (F4) — once everything has landed, the Adversary re-runs the
entire cross-phase acceptance from cold. Order matters: tooling → review/fixes → cleanup → *then*
full re-verify. Cleanup must regress nothing.
5. **Bound it** — a pass, not a rewrite; record dead-ends/deviations and stop.
## 3. White-box review checklist (teeth, not taste) — whole codebase
Blocking unless noted (plan-relevant invariants visible only by reading code):
- **Tests are real** (blocking) — every generic/overlay/custom test asserts actual app state; no
`skip`/`xfail`/can't-fail; per-op `pass/fail/skip` honest; the 1d/1e anti-vacuous guards
(`assert_serving` routing proof, `do_upgrade` "moved", deploy-count==1) intact.
- **1e corrections intact** (blocking) — repo-local code still gated to approved recipes; generic
still runs by default (opt-out explicit); upgrade still targets the PR head.
- **Generic-first / custom-additive invariant** (blocking — `docs/testing.md`). Confirm no path
makes the generic tier depend on custom: deps deploy + `setup_custom_tests` run **after** all
generic tiers, never before; a forced `setup_custom_tests` failure still yields a clean
generic-tier `pass/pass/pass/pass` + `skip(deps-not-ready)` for `@requires_deps` custom tests
(re-exercise the forced-failure case). Future maintainers must be able to operate cc-ci with
the generic tier alone — verify that path stays viable.
- **Harness DRY** (blocking-ish) — recipe quirks are data (`recipe_meta.py`), not shared-harness
conditionals; overlays are thin; no per-recipe copy-paste of lifecycle logic.
- **Server state Nix-declared & idempotent** (blocking) — no imperative drift / run-once sentinels /
manual post-rebuild steps; the `nix/` layout clean.
- **No footguns** (blocking) — no bare `sleep` for readiness (poll); teardown in `finally`; secrets
reused per run not regenerated; no hardcoded versions/domains that break upstream.
- **No secrets in code/committed files** (blocking) — grep source/configs/`.drone.yml`/fixtures;
log/dashboard redaction real (incl. any new Phase-3 UX surface that echoes run data).
- **Phase-3 UX correctness** (advisory→blocking on real drift) — the displayed level/badge/screenshot
reflect the true per-op results; no misleading "pass".
- **Architecture matches the plans; deviations in `DECISIONS.md`** (advisory→blocking on real drift).
- **Readability & docs** (advisory) — clear names, dead code removed, docs reproduce from scratch.
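
To make the "no bare `sleep` for readiness (poll); teardown in `finally`" item concrete, here is a minimal sketch of what a reviewer would expect to see instead of a blind sleep. The helper names and default timings are illustrative assumptions, not the harness's actual API.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)  # short gap between polls, never one long blind wait


def run_with_teardown(deploy, check, test, teardown, ready_timeout_s: float = 300.0) -> bool:
    """Deploy, poll for readiness, run the test; teardown runs on every exit path."""
    try:
        deploy()
        if not wait_until_ready(check, timeout_s=ready_timeout_s):
            return False
        test()
        return True
    finally:
        teardown()
```

The clock and sleep functions are injectable so the polling logic can be exercised in a unit test without real waiting.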
## 4. Guardrails
- **Never weaken a test** to satisfy a lint/review/cleanup nit (cardinal rule wins).
- **Don't reopen settled design** — clean + harden + re-verify; bigger ideas → `IDEAS.md`.
- **Bounded** — one pass; cap iterations; record + stop.
- **Cleanup regresses nothing** — F4 is the proof; if a cleanup breaks a prior guarantee, revert the
cleanup, not the guarantee.
## 5. Open decisions (log in machine-docs/DECISIONS.md)
- Any new linters/formatters for Phase-3 front-end / new areas, and their strictness.
- The precise blocking-vs-advisory split for the §3 checklist on the new code.
- Whether to add Python type-checking now or defer to `IDEAS.md` (carried from 1b).


@ -0,0 +1,205 @@
# SSO-dep testing pattern (OIDC + co-deployed provider) — reference plan
**Status:** active reference for Phase 2 (does not need its own phase; it's a pattern Phase 2
overlays apply per recipe).
**Operator clarification (2026-05-28):** integrating a recipe with an OIDC/SSO dep is **loop work,
not operator work**. Anything that was deferred citing "operator input needed for OIDC" should be
re-opened and done autonomously per this plan.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-sso-dep-testing.md`
**Companion:** the running harness in `runner/harness/sso.py` (existing primitives:
`setup_keycloak_realm`, `oidc_password_grant`, `assert_discovery_endpoint`).
---
## 0. Why this plan
Several recipes test their authenticated functionality through an OIDC/SSO provider (keycloak,
authentik). The cc-ci pattern is to **co-deploy the provider with the recipe under test in the same
ephemeral run** — one shared deployment per dep, configured at install time, used by the
recipe-under-test's authenticated tests, torn down with it. This file is the canonical pattern for
how to wire that up so any recipe that declares `DEPS = ["keycloak"]` (or `["authentik"]`) Just Works
without per-recipe ad-hoc plumbing. Recipes that need OIDC are not blocked on the operator — they
follow this plan.
## 1. The DEPS model — deps deploy AFTER generic tiers (operator-2026-05-28)
**Critical ordering rule:** generic tiers (install / upgrade / backup / restore) run against the
**recipe alone, with no dep available**, so a failure in dep-deploy or OIDC setup **cannot break
generic-tier signal**. Deps + OIDC wiring move to a **`setup_custom_tests` step** that runs *after*
generic tiers and *before* the custom tier — its failure is isolated to the SSO-marked custom tests.
A recipe's `tests/<recipe>/recipe_meta.py` declares its SSO dep:
```python
DEPS = ["keycloak"] # or ["authentik"] when that backend lands
```
### Lifecycle order (single run, per recipe)
```
1. Deploy recipe-under-test ALONE (no deps, OIDC env unset or stubbed).
- app_new #1 for the recipe; generic install_steps.sh runs RECIPE-ONLY setup (no deps).
2. INSTALL tier: generic [+ overlay] assertions against the recipe alone.
3. UPGRADE tier: abra app upgrade in place, assertions against the recipe alone.
4. BACKUP tier: in place (if backup-capable), recipe-alone marker.
5. RESTORE tier: in place, recipe-alone marker.
6. setup_custom_tests step ← NEW (operator-2026-05-28)
a. For each dep in DEPS, deploy + provision realm/client via harness.sso.setup_<provider>_realm.
b. Write $CCCI_DEPS_FILE with each dep's {domain, realm, client_id, client_secret, admin_*}.
c. Run the per-recipe post-deps hook `tests/<recipe>/setup_custom_tests.sh` to wire the OIDC
env into the running recipe (abra app config set + abra app secret insert) and trigger an
in-place redeploy of the affected services so the env takes effect.
d. Mark deps-ready = True on success; on ANY failure mark deps-ready = False and CONTINUE
(log the error; do NOT abort the run).
7. CUSTOM tier:
- If deps-ready: run all custom tests, including those tagged @pytest.mark.requires_deps.
- If NOT deps-ready: still run custom tests, but tests tagged @pytest.mark.requires_deps are
reported as ERROR/SKIP (with the captured setup_custom_tests error attached). Non-deps
custom tests still run normally.
8. Teardown (in finally): recipe first; then each dep in reverse declaration order.
```
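
The ordering and failure isolation above can be sketched as follows. This is a simplified model under stated assumptions, not the code in `runner/run_recipe_ci.py`: the callables, `RunResult`, and the outcome strings are hypothetical, and step 1 (the initial deploy) is assumed to have already happened.

```python
from dataclasses import dataclass, field
from typing import Optional

GENERIC_TIERS = ("install", "upgrade", "backup", "restore")


@dataclass
class RunResult:
    tiers: dict = field(default_factory=dict)    # tier name -> "pass" / "fail"
    deps_ready: bool = False
    deps_error: Optional[str] = None
    custom: dict = field(default_factory=dict)   # test name -> outcome string


def run_lifecycle(run_tier, setup_custom_tests, custom_tests, teardown) -> RunResult:
    """custom_tests maps name -> (needs_deps, test_callable)."""
    result = RunResult()
    try:
        # Steps 2-5: generic tiers run against the recipe alone, before any dep exists.
        for tier in GENERIC_TIERS:
            result.tiers[tier] = "pass" if run_tier(tier) else "fail"
        # Step 6: deploy deps and wire OIDC; a failure is recorded, never fatal.
        try:
            setup_custom_tests()
            result.deps_ready = True
        except Exception as exc:
            result.deps_error = str(exc)
        # Step 7: dep-requiring tests are skipped with the captured error attached.
        for name, (needs_deps, test) in custom_tests.items():
            if needs_deps and not result.deps_ready:
                result.custom[name] = f"skip(deps-not-ready: {result.deps_error})"
            else:
                result.custom[name] = "pass" if test() else "fail"
    finally:
        teardown()  # step 8: always runs
    return result
```

With a `setup_custom_tests` that raises, the generic tiers still report their own results and only the dep-requiring custom tests are skipped.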
### DG4.1 deploy-count guard, generalised
The "one deploy per run" guard becomes **one `abra app new` per app in the run** (recipe + each
dep). In-place reconfigure-and-redeploy (the step 6c env update) is **NOT** a fresh `app_new` and
does NOT increment the per-recipe count. So a run with `DEPS = ["keycloak"]` has exactly 2
`app_new` calls (recipe + keycloak), no matter how many tiers ran. The per-run summary reports
deploy-count per app for verification.
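
A minimal sketch of the generalised guard, assuming the runner can report each fresh `abra app new` and each in-place redeploy as separate events. The class and method names are hypothetical.

```python
from collections import Counter


class DeployCountGuard:
    """Counts fresh app-new events per app; in-place redeploys are not counted."""

    def __init__(self, recipe: str, deps: list[str]):
        self.expected = [recipe, *deps]
        self.app_new = Counter()

    def record_app_new(self, app: str) -> None:
        self.app_new[app] += 1

    def record_redeploy(self, app: str) -> None:
        # Step 6c reconfigure-and-redeploy: deliberately leaves the counter alone.
        pass

    def summary(self) -> dict[str, int]:
        """Per-app deploy count for the per-run summary."""
        return {app: self.app_new[app] for app in self.expected}

    def violations(self) -> dict[str, int]:
        """Expected apps whose count is not exactly 1, plus any unexpected app."""
        apps = list(self.expected) + [a for a in self.app_new if a not in self.expected]
        return {
            a: self.app_new[a]
            for a in apps
            if self.app_new[a] != 1 or a not in self.expected
        }
```

For `DEPS = ["keycloak"]`, a clean run would yield a summary of one count for the recipe and one for keycloak, and an empty `violations()`.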
### Why this ordering
- **Generic-tier signal is preserved** when SSO/dep setup is broken — the recipe's own deploy/
upgrade/backup/restore behaviour is still tested honestly.
- **Failure isolation**: a recipe whose generic tier passes but whose SSO setup is broken yields
per-op `pass/pass/pass/pass/skip(deps-not-ready)` — far more useful than the previous
all-or-nothing.
- A recipe that genuinely can't boot without OIDC fails its generic install honestly (the recipe
should accept a stubbed/empty OIDC env at install time and only require the env when an
authenticated endpoint is hit). That's a real recipe finding, not a CI artifact.
## 2. Provider pluggability
- **Provider-agnostic primitives** (today, in `harness/sso.py`) — these stay pluggable:
- `oidc_password_grant(discovery_url, client_id, client_secret, username, password) -> token`:
pure OIDC; works against any compliant provider.
- `assert_discovery_endpoint(discovery_url, expected_issuer)` — pure OIDC.
- **Provider-specific setup** (admin API calls) — one function per provider:
- `setup_keycloak_realm(domain, admin_user, admin_password, realm, client_id, redirect_uris) ->
{client_secret, discovery_url}` — exists today.
- `setup_authentik_realm(...)` — same shape, authentik admin API; **deferred** to a future Q4
enrollment that actually wants authentik (see `machine-docs/DEFERRED.md`). Pluggable: a recipe
declaring `DEPS = ["authentik"]` would just call this instead. No change to the per-recipe
`setup_custom_tests.sh` shape beyond which provider it asks for from `$CCCI_DEPS_FILE`.
- **Don't write per-recipe SSO logic.** All recipes use the same DEPS+setup_custom_tests shape.
## 3. Per-recipe hooks — two distinct scripts (recipe-only vs post-deps)
A recipe with `DEPS = ["keycloak"]` ships **two** optional hook scripts (either may be absent if
not needed):
### 3.1 `tests/<recipe>/install_steps.sh` — RECIPE-ONLY setup, runs at install time
This is the Phase-1d custom-install-steps hook. It runs **before** the recipe deploys, **with no
dep available** (the dep hasn't been deployed yet at this point). Use it only for recipe-only
setup that the recipe needs to boot at all (e.g. seed a fixture, set a non-OIDC env). **Do NOT
read `$CCCI_DEPS_FILE` here** — it doesn't exist yet. If the recipe requires OIDC to *boot at
all*, set a safe stub here (e.g. disable auth) so the recipe can come up for generic tiers; the
real OIDC wiring happens in §3.2.
### 3.2 `tests/<recipe>/setup_custom_tests.sh` — POST-DEPS wiring, runs after generic tiers
This is the new (operator-2026-05-28) hook that wires the recipe to its already-deployed dep,
*after* the generic tiers have run. The orchestrator has already deployed each dep and written
`$CCCI_DEPS_FILE` by the time this runs. Roughly:
```sh
#!/usr/bin/env bash
set -euo pipefail
# Read the dep's connection info from $CCCI_DEPS_FILE (orchestrator-written).
KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
KC_REALM=$( jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
# Inject the OIDC client secret as an abra app secret (recipe-conventional name varies — match
# the recipe's .env.sample SECRET_*).
echo "$KC_SECRET" | abra app secret insert -n "$CCCI_APP_DOMAIN" oidc_rpcs v1 -
# Write the OIDC env vars to the parent .env (names per the recipe's .env.sample).
abra app config set "$CCCI_APP_DOMAIN" \
OIDC_REALM="$KC_REALM" \
OIDC_OP_DISCOVERY_ENDPOINT="https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration" \
OIDC_OP_AUTHORIZATION_URL="https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth" \
OIDC_OP_TOKEN_URL="https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token" \
OIDC_OP_USER_URL="https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo" \
OIDC_RP_CLIENT_ID="$KC_CLIENT" \
OIDC_RP_REDIRECT_URI="https://${CCCI_APP_DOMAIN}/auth/oidc/callback"
# Force an in-place redeploy of the affected services to pick up the new env. This is NOT a fresh
# app_new (deploy-count guard still 1 for this recipe).
abra app deploy --force --chaos --no-input "$CCCI_APP_DOMAIN"
```
The OIDC env-var **names are recipe-specific** (`OIDC_OP_*` for lasuite-docs, different prefixes
elsewhere). Read the recipe's `.env.sample` to see which keys the recipe expects; the *values* follow
this template. If a recipe needs more than this (extra group/claim mappings, etc.), extend its
`setup_custom_tests.sh` only — never the shared harness.
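
For tests that need the same values on the Python side, the shape of `$CCCI_DEPS_FILE` implied by the `jq` paths above (a JSON object keyed by provider) can be validated and turned into the keycloak endpoint URLs as sketched here. The function names and the required-key list are assumptions for illustration; the caller is expected to read the file at `$CCCI_DEPS_FILE` and pass its contents in.

```python
import json

REQUIRED_KEYS = ("domain", "realm", "client_id", "client_secret")


def parse_dep(raw_json: str, provider: str) -> dict:
    """Return the provider's entry from the deps-file contents, checking required keys."""
    deps = json.loads(raw_json)
    if provider not in deps:
        raise KeyError(f"no {provider!r} entry in deps file")
    entry = deps[provider]
    missing = [k for k in REQUIRED_KEYS if not entry.get(k)]
    if missing:
        raise ValueError(f"{provider} entry missing keys: {missing}")
    return entry


def keycloak_oidc_urls(domain: str, realm: str) -> dict:
    """Endpoint URLs following the same template as the shell hook above."""
    base = f"https://{domain}/realms/{realm}"
    oidc = f"{base}/protocol/openid-connect"
    return {
        "discovery": f"{base}/.well-known/openid-configuration",
        "authorization": f"{oidc}/auth",
        "token": f"{oidc}/token",
        "userinfo": f"{oidc}/userinfo",
    }
```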
## 4. Test pattern: authenticated endpoints (mark + isolate)
- **Mark dep-requiring tests:** every custom test that needs the dep up + OIDC wired must use
`@pytest.mark.requires_deps`. The orchestrator skips these with reason `"deps-not-ready: <err>"`
if `setup_custom_tests` failed. Non-deps custom tests are unaffected by SSO setup failures.
- **Headless API tests** — use `harness.sso.oidc_password_grant` to mint an access token, then call
the recipe's authenticated endpoint with `Authorization: Bearer <token>`. Asserts on the response.
- **Browser flows (Playwright)** — navigate to the recipe, follow the redirect to keycloak, fill the
pre-provisioned test user's credentials, return to the recipe, exercise the UI. (Use the
pre-provisioned `ci-user@example.com` / known password the realm setup creates.)
- **The realm/client is fresh per run** — no cross-run state, no shared accounts. The realm setup
creates one or more test users with known passwords (pass-through from a per-run secret) so the
tests can authenticate without prompts.
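
To illustrate the headless pattern, the sketch below builds the standard OAuth2 password-grant request and the bearer header a test would then send. It is not the implementation of `harness.sso.oidc_password_grant` (which takes a discovery URL); here the token endpoint is passed in directly, and the function names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_password_grant_request(
    token_url: str,
    client_id: str,
    client_secret: str,
    username: str,
    password: str,
    scope: str = "openid",
) -> urllib.request.Request:
    """Build (without sending) a form-encoded password-grant token request."""
    form = {
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
        "scope": scope,
    }
    return urllib.request.Request(
        token_url,
        data=urllib.parse.urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_headers(access_token: str) -> dict:
    """Headers for calling the recipe's authenticated endpoint."""
    return {"Authorization": f"Bearer {access_token}"}
```

A test would send the request, read `access_token` from the JSON response, and call the recipe endpoint with `bearer_headers(...)`; the provider must have the password grant enabled for the test client.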
## 5. Concrete recipes that use this pattern (Phase-2 scope)
These are **loop work** under this plan, not deferred:
- **lasuite-docs** — `DEPS = ["keycloak"]`; ports the upstream `oidc_login.py` +
`upload_conversion.py` parity tests + the §4.3-prescribed `create-a-doc + read-back via
authenticated /api/v1.0/documents/`. (Re-enters `DEFERRED.md` entry #5 — this plan IS the
re-entry, not operator input.)
- **cryptpad** — `DEPS = ["keycloak"]` (cryptpad upstream tests use authentik, but a keycloak-backed
cryptpad OIDC test is equally valid and uses the same primitives). The cryptpad create-a-pad
Playwright test (DEFERRED #6) is a separate concern — that one really does need a stable
CryptPad app-launch contract; it stays deferred.
- **lasuite-drive, lasuite-meet** — same pattern when mirrored (`recipe-create-pr` skill — loop work).
- Any future recipe that requires OIDC follows this plan; no operator handoff.
## 6. What stays deferred (genuinely operator-input)
- **authentik enrollment + `setup_authentik_realm` backend** (DEFERRED #9) — provider breadth, not
blocking any Phase-2 recipe under keycloak. Open question for the operator: do we want
cross-provider coverage as part of Phase-2 DONE? If yes, lift; if not, leave deferred.
- The `--extra-tests` flag IDEA is **not** a precondition for this plan; OIDC-dep tests are part
of the default suite for the recipes that need them.
## 7. Definition of done for this pattern
- [ ] `DEPS = [...]` honored by `runner/run_recipe_ci.py`, with the **deps-AFTER-generic** ordering
(§1): deps deploy + `setup_custom_tests` step runs between RESTORE and CUSTOM tiers;
`$CCCI_DEPS_FILE` written; deps torn down LAST in reverse order.
- [ ] **Failure isolation proven:** a forced `setup_custom_tests` failure (e.g. simulate keycloak
realm-setup error) yields a run where generic tiers report **pass** and CUSTOM
`requires_deps` tests report **skip(deps-not-ready)** — no false fail of the generic tier,
no aborted run.
- [ ] **lasuite-docs** ships `tests/lasuite-docs/setup_custom_tests.sh` per §3.2 + authenticated
tests per §4 marked `@pytest.mark.requires_deps` (closes DEFERRED #5 — keep the entry there
with the closing commit, do not re-defer).
- [ ] At least one other OIDC-dep recipe (cryptpad oidc_login or a lasuite-* once mirrored) lands
cold-green using the same pattern, demonstrating reuse.
- [ ] `docs/sso-dep-testing.md` (in the cc-ci repo) explains the pattern for future recipe
enrollments — link to this plan.
- [ ] Adversary cold-verifies the full run for one such recipe + the forced-failure isolation
case, posts PASS in `REVIEW-2.md`.
## 8. Mirror+enroll reminder (also loop work)
If a recipe in scope (e.g. `lasuite-drive`, `lasuite-meet`, `immich`) **isn't mirrored to
`git.autonomic.zone/recipe-maintainers/`**, mirror it autonomously via the `recipe-create-pr` skill
at `/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md` (see also
`plan-phase2-recipe-tests.md §0b`). Mirror+enroll is **not** operator-pending; the bot is admin on
the org.


@ -126,7 +126,13 @@ without the auth key.
- **Wildcard TLS cert — PROVIDED, not a token.** The operator has pre-issued the wildcard SAN cert
(`*.ci.commoninternet.net` + `ci.commoninternet.net`) and placed it on cc-ci at
`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0). The agent feeds these into the
`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0).
> **Phase-1c update (supersedes the cert references in §1.5/§4.0/§4.4 below):** the cert is no longer
> an out-of-band operator file-drop — it is now **sops-encrypted in the private `cc-ci-secrets` repo**
> (a git submodule) and **decrypted at activation to that same path** by sops-nix. Issuance stays
> operator-only (LE/Gandi, no token on the box); to rotate, the operator re-issues then re-encrypts
> the cert into `cc-ci-secrets` and rebuilds. The ONE out-of-band secret is now the bootstrap age key
> at `/var/lib/sops-nix/key.txt`. Authoritative model: `cc-ci/docs/secrets.md` + `docs/install.md`. The agent feeds these into the
`coop-cloud/traefik` recipe as its `ssl_cert`/`ssl_key` swarm secrets (wildcard/file-provider
mode) and runs **no ACME** for this domain. **Do not request or expect a `commoninternet.net` DNS
token** — issuance/renewal is handled out-of-band by the operator (LE 90-day cert; next renewal
@ -597,8 +603,37 @@ its own pacing. To make concurrent writes conflict-free:
merges the two cleanly. Closing an item = checking the box *in your own section*; the Builder
fixes an `[adversary]` finding and notes the fix in JOURNAL, but only the Adversary ticks it
closed after re-test.
- `DEFERRED.md` (in `machine-docs/`) is the **single canonical registry for things the loops
have deliberately decided not to do autonomously and that need operator input to move on.**
Append-only; either agent may file. Each entry should clearly say *what's needed from the
operator* to lift the deferral (an opt-in flag, a resource decision, an architectural call,
plain "go ahead"). The list is **open-ended** — items can sit indefinitely, **no obligation
to close every item**, closure is operator-driven. A re-entry trigger / IDEA cross-link is
**optional** (include when there's a natural mechanism, e.g. an opt-in flag in
`cc-ci-plan/IDEAS.md`). Don't park deferrals as a vague "Q4 follow-up" / buried JOURNAL note
— file them here so the operator can review the whole list. The Phase-4 cleanup pass should
**surface** DEFERRED.md to the operator at least once but does **not** force closure.
Future-aspirational ideas (out of current scope) still go to `cc-ci-plan/IDEAS.md`; DEFERRED
is for considered-and-parked work the loops won't tackle without operator input.
- **Append-only where possible.** `JOURNAL.md` and `REVIEW.md` are append-only logs → they never
conflict. Prefer appending over rewriting.
- **Artifact-layer isolation — facts in STATUS, reasoning in JOURNAL (anti-anchoring).** Rigorous
adversarial verification requires the Adversary NOT to consume the Builder's rationalisations
before forming its verdict. The split:
- `STATUS.md` MUST carry **everything the Adversary needs to verify the claim** — withholding
verification context defeats the verification: **WHAT** is claimed (gate id, DoD items), **HOW**
to verify (the exact command/check the Adversary can re-run from its own clone), the
**EXPECTED** outcome (build hashes, file contents, leaf fingerprints, status codes), and
**WHERE** the inputs live (commit shas, paths). If it's essential for the Adversary to verify,
it goes in STATUS.
- `STATUS.md` MUST NOT carry rationalisations / "why I think this passes" / design narrative /
dead-ends explored. Those go in `JOURNAL.md` (Builder-private to write).
- The Adversary reads STATUS for the claim + verification info, the plan as SSOT, and the code /
git history; it forms its verdict from those + its own **cold** acceptance run, and does **not**
read `JOURNAL.md` before the verdict. After an independent verdict, consulting JOURNAL is fine
(e.g. to contextualise a finding) — note in REVIEW that you did.
In short: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.**
- **Git discipline (both loops, every write):** `git pull --rebase` before editing, make the
smallest change, commit, `git push`. On a rebase conflict, it will be inside the *other* agent's
file/section only if a rule was broken — re-pull and keep to your own files. Never `--force`.
@ -613,6 +648,22 @@ its own pacing. To make concurrent writes conflict-free:
- **Liveness.** If the Adversary sees a gate `CLAIMED` for too long with no Builder progress, or
the Builder sees no Adversary verdict on a standing claim, note it in your own ledger and keep
doing independent work — neither loop blocks idle waiting on the other beyond its gate.
- **INBOX — explicit cross-loop messaging beyond CLAIMS.** Sometimes you have something to say to
the other loop that isn't a gate claim or a REVIEW verdict (a heads-up, a request for
early-look, a "I refactored X, please re-verify Y", an observation outside the normal flow). For
those, use the inbox files in `machine-docs/`:
- **Builder → Adversary:** the Builder writes/appends `machine-docs/ADVERSARY-INBOX.md` in its
own clone, commits, pushes.
- **Adversary → Builder:** the Adversary writes/appends `machine-docs/BUILDER-INBOX.md` in its
own clone, commits, pushes.
- The watchdog edge-triggers on **newly-present** inbox files in the relevant clone and pings
the receiver. The receiver, on receipt, reads + processes the message, then **deletes the
inbox file** (commits + pushes) — deletion is the "message consumed" signal. Single-writer
discipline: only the sender writes their counterpart's inbox; only the receiver deletes it.
- **Use for:** non-gate signals — "heads-up I'm about to refactor X," "please cold-verify this
while I keep going," "I observed Y outside our normal flow," "I'm taking a long e2e now."
**Do NOT use for:** formal gate claims (`STATUS.md` still owns those) or verdicts (`REVIEW.md`
still owns those). The inbox is a side-channel, not a replacement.
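
The edge-trigger behaviour can be sketched as below: the watcher fires only on the absent-to-present transition, so a file that stays in place does not re-ping, and a deleted-then-rewritten file pings again. The class name, the `is_present` hook, and the choice of clone path are illustrative assumptions, not the watchdog's actual code.

```python
from pathlib import Path

INBOX_FILES = {"builder": "BUILDER-INBOX.md", "adversary": "ADVERSARY-INBOX.md"}


class InboxWatcher:
    """Fires once each time the receiver's inbox file newly appears."""

    def __init__(self, clone, receiver: str, is_present=None):
        self.path = Path(clone) / "machine-docs" / INBOX_FILES[receiver]
        # Injectable for tests; defaults to a real filesystem check.
        self._is_present = is_present or (lambda p: p.is_file())
        self.was_present = False

    def poll(self) -> bool:
        """Return True only on the absent -> present edge."""
        present = bool(self._is_present(self.path))
        fire = present and not self.was_present
        self.was_present = present
        return fire
```

Because deletion resets the state, the receiver's "delete to consume" step is what re-arms the trigger for the next message.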
(If you are ever forced to run with a single process, the degraded fallback is to alternate
roles per iteration and keep `JOURNAL.md` and `REVIEW.md` strictly separate — but two loops is
@ -649,8 +700,12 @@ every wake, `git pull --rebase` first, then:
**Pacing.** Use `/loop` (self-paced) or `ScheduleWakeup`. Most waits here are for things the
harness can't notify you about — a Drone build, a `nixos-rebuild`, a deploy converging — so poll
the *specific* thing. Three cases:
1. **Something in flight** (build/deploy/`nixos-rebuild`) → re-check on a short cadence (≈4 min) to
stay cache-warm; keep polling *it*, don't treat it as idle, and don't spin on a minutes-long build.
1. **Something in flight** (build/deploy/`nixos-rebuild`/e2e/heavy test) → **poll every ~5 min** to
stay cache-warm and to **see failures as they happen**, not at the end of a 25-minute sleep. Do
**NOT** `ScheduleWakeup` for the expected total runtime of the task in a single big sleep — a 25
min e2e gets 5 short cache-warm polls, not one 25-min cache-cold blackout. The wakeup that wakes
you mid-task is *cheap* (one cache hit, one quick status check); the value of catching a deploy
that died at minute 4 of a 25-min budget is large. Keep polling *it*, don't treat it as idle.
2. **Blocked on the *other* loop** — Builder parked at a `CLAIMED` gate awaiting the Adversary, or
Adversary waiting for the Builder to fix an `[adversary]` finding. **You don't need to busy-poll
here: the watchdog signals across the handoff.** The moment the Builder writes a `CLAIMED` gate,


@ -8,6 +8,8 @@ You run as a SEPARATE process and coordinate ONLY through the git repo per §6.1
- Keep your OWN clone at /srv/cc-ci/cc-ci-adv. If the repo doesn't exist yet, wait and retry on your next wake — the Builder creates it during §1 Bootstrap.
- git pull --rebase before every edit; commit; push; never --force.
- Write ONLY your files: REVIEW.md and the "## Adversary findings" section of BACKLOG.md. Everything else (code, STATUS.md, JOURNAL.md, "## Build backlog") is read-only to you.
- INBOX side-channel (§6.1). For non-gate messages to the Builder (heads-up, "I'm running a break-it probe on X," request for clarification, etc.), write/append `machine-docs/BUILDER-INBOX.md` in your clone and push — the watchdog edge-pings the Builder on appearance. To receive a message from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then DELETE the file (commit + push) — deletion is the "consumed" signal. Do NOT use the inbox for formal verdicts — REVIEW.md still owns those.
- ISOLATION DISCIPLINE (anti-anchoring — critical). The Builder is REQUIRED to give you in STATUS.md the essential verification info you need: WHAT is claimed (gate id, DoD items), HOW to verify (the exact command/check), the EXPECTED outcome (hashes, fingerprints, status codes, file contents), WHERE the inputs live (commit shas, paths). **Read STATUS for that — you need all of it to verify.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL.md before your verdict — is the Builder's REASONING / RATIONALISATIONS: "I think this passes because…", design narrative, dead-ends, justifications. Reading those anchors you. Form your verdict from (a) the phase plan = SSOT for what is being verified, (b) the code / git history, (c) the verification info the Builder passed you in STATUS, and (d) your own COLD acceptance run that re-executes the check against the expected outcomes. Only AFTER you have written your verdict may you consult JOURNAL.md (e.g. to contextualise a finding) — note in REVIEW.md that you did. Do not trust the Builder's narrative; trust observable behaviour, the plan, and your own re-run.
Each wake:
1. Pull. Read STATUS.md for any "Gate: <Mn> CLAIMED, awaiting Adversary".


@ -2,12 +2,16 @@ You are the Builder agent for the cc-ci project — one of two independent loops
Single source of truth: /srv/cc-ci/cc-ci-plan/plan.md. Read it in full now, then begin at §1 Bootstrap. The original brief /srv/cc-ci/cc-ci-plan/brief.md is context only — do not edit it.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work (see §7). Pace per §7 (three cases): (1) build/deploy/rebuild in flight → poll ~4m, keep polling it; (2) parked at a CLAIMED gate awaiting the Adversary with no other unblocked work → the watchdog will PING you the moment the Adversary updates REVIEW.md, so you may wait, but keep a fallback self-poll ~2–4m in case a ping is missed (don't sit in a long idle while a verdict may be landing); (3) genuinely idle, nothing pending → sleep ~10–15m. Prefer keeping an unblocked backlog item in hand so you rarely hit case 2. Do NOT spin on a minutes-long build. Stop the loop only when STATUS.md says ## DONE.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work (see §7). Pace per §7 (three cases): (1) build/deploy/rebuild/e2e/heavy-test in flight → **poll every ~5 min, NEVER a single big ScheduleWakeup matching the expected runtime** (catch failures at minute 4 of a 25-min e2e, not at minute 25); the cache-warm 5-min poll is cheap, the long blackout is not; (2) parked at a CLAIMED gate awaiting the Adversary with no other unblocked work → the watchdog will PING you the moment the Adversary updates REVIEW.md OR writes a BUILDER-INBOX.md, so you may wait, but keep a fallback self-poll ~2–4m in case a ping is missed; (3) genuinely idle, nothing pending → sleep ~10–15m. Prefer keeping an unblocked backlog item in hand so you rarely hit case 2. Stop the loop only when STATUS.md says ## DONE.
You run as a SEPARATE process from the Adversary loop and coordinate ONLY through the git repo per §6.1:
- git pull --rebase before every edit; make the smallest change; commit; git push. Never --force.
- Write ONLY your files: source/config, STATUS.md, JOURNAL.md, DECISIONS.md, and the "## Build backlog" section of BACKLOG.md. Treat REVIEW.md and "## Adversary findings" as read-only — the Adversary owns them.
- ARTIFACT-LAYER ISOLATION (facts in STATUS, reasoning in JOURNAL). STATUS.md **MUST** give the Adversary everything it needs to verify your claim — withholding verification context defeats the verification: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check the Adversary can re-run from its own clone), the **EXPECTED** outcome (build hashes, file contents, status codes, leaf fingerprints, command exit), and **WHERE** the inputs live (commit shas, paths). If something is essential for the Adversary to verify, put it in STATUS. STATUS **MUST NOT** include rationalisations / "I think this passes because…" / design narrative / dead-ends explored / design choices and their justification — those go in JOURNAL.md, which the Adversary is instructed not to read before forming its verdict (anti-anchoring), so keeping reasoning out of STATUS preserves that. The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions (joint authority), not in-the-moment rationale.
- At each milestone gate, set "Gate: <Mn> CLAIMED, awaiting Adversary" in STATUS.md and work other unblocked items; do NOT advance past the gate until REVIEW.md shows its PASS.
- INBOX side-channel (§6.1). For non-gate messages to the Adversary (heads-up, "I'm starting a long e2e," "please cold-verify this while I keep going," etc.), write/append `machine-docs/ADVERSARY-INBOX.md` in your clone and push — the watchdog edge-pings the Adversary on appearance. To receive a message from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then DELETE the file (commit + push) — deletion is the "consumed" signal. Do NOT use the inbox for formal gate claims or verdicts — STATUS.md / REVIEW.md still own those.
- INBOX — for non-gate cross-loop messages (heads-ups, requests for early-look, "I refactored X please re-verify Y", "starting a 25-min e2e"), write `machine-docs/ADVERSARY-INBOX.md` in your clone and push. The watchdog edge-triggers and pings the Adversary. The Adversary deletes the file on consumption. If you receive `machine-docs/BUILDER-INBOX.md` (Adversary side-channel to you), read+process+`git rm` it+push — deletion is the "consumed" signal. Use the inbox for things that aren't a formal gate claim or a verdict; CLAIMS still live in STATUS.md and verdicts in REVIEW.md (the inbox is a side-channel, not a replacement).
- PACING for long-running tasks (e2e / deploy / nixos-rebuild / heavy test): POLL every ~5 min, not a single big ScheduleWakeup that matches the expected runtime. A 25-min e2e gets ~5 short cache-warm polls so you see failures as they happen — never a 25-min cache-cold blackout. (plan.md §7 case 1.)
- Write "## DONE" only when REVIEW.md shows a PASS dated <24h for every D1–D10 and there is no standing "## VETO".
Overriding rules:

cc-ci-plan/reboot-log.sh Executable file

@ -0,0 +1,22 @@
#!/usr/bin/env bash
# Runs as ExecStartPre of cc-ci-loops.service. Appends ONE line to REBOOTS.md per genuine reboot.
# Uses the kernel boot_id to distinguish a real reboot from a mere `systemctl restart` of the unit:
# only logs when the current boot_id differs from the last one we recorded.
set -u
REBOOTS="/srv/cc-ci/cc-ci-plan/REBOOTS.md"
LAST_BOOT_FILE="/srv/cc-ci/.cc-ci-logs/.last-boot-id"
PHASE_IDX_FILE="/srv/cc-ci/.cc-ci-logs/.phase-idx"
cur_boot="$(cat /proc/sys/kernel/random/boot_id 2>/dev/null || echo unknown)"
last_boot="$(cat "$LAST_BOOT_FILE" 2>/dev/null || echo '')"
# Same boot_id => this is a manual service restart, not a reboot => do nothing.
[ "$cur_boot" = "$last_boot" ] && exit 0
idx="$(cat "$PHASE_IDX_FILE" 2>/dev/null || echo '?')"
ts="$(date '+%Y-%m-%d %H:%M:%S %Z')"
mkdir -p "$(dirname "$LAST_BOOT_FILE")"
printf '%s\n' "- $ts — reboot detected; loops auto-started by systemd (resuming phase index $idx). boot_id=$cur_boot" >> "$REBOOTS" 2>/dev/null || true
echo "$cur_boot" > "$LAST_BOOT_FILE" 2>/dev/null || true
exit 0


@ -0,0 +1,32 @@
[Unit]
# Canonical, version-controlled copy of the unit installed at /etc/systemd/system/cc-ci-loops.service.
# Install: sudo install -m0644 cc-ci-plan/systemd/cc-ci-loops.service /etc/systemd/system/ \
# && sudo systemctl daemon-reload && sudo systemctl enable cc-ci-loops.service
# Brings the WHOLE rig back after a reboot of the orchestrator Pi: loops + watchdog (launch.sh) AND
# the orchestrator supervisory session (launch-orchestrator.sh), plus a reboot record (reboot-log.sh).
Description=cc-ci autonomous loops + watchdog + orchestrator (reboot-resilient)
Documentation=file:///srv/cc-ci/cc-ci-plan/plan.md
After=network-online.target cc-ci-tailscaled.service
Wants=network-online.target
Requires=cc-ci-tailscaled.service
[Service]
Type=oneshot
RemainAfterExit=yes
User=notplants
Group=notplants
Environment=HOME=/home/notplants
Environment=PATH=/home/notplants/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
# RESUME_PHASE=1 so a reboot resumes the SAVED phase (e.g. phase 2), never restarts from phase 0/1c.
Environment=RESUME_PHASE=1
# 1) record the reboot (boot_id-gated); 2) start loops + watchdog; 3) resume the orchestrator session.
ExecStartPre=/srv/cc-ci/cc-ci-plan/reboot-log.sh
ExecStart=/srv/cc-ci/cc-ci-plan/launch.sh start
ExecStartPost=/srv/cc-ci/cc-ci-plan/launch-orchestrator.sh start
# Stop only the loops + watchdog. The orchestrator session is intentionally LEFT running on a manual
# `systemctl stop` (stopping the loops shouldn't kill your steering session; it resumes from disk).
ExecStop=/srv/cc-ci/cc-ci-plan/launch.sh stop
TimeoutStartSec=180
[Install]
WantedBy=multi-user.target


@ -0,0 +1,123 @@
# Acceptance test — real end-to-end `!testme` on the clean-room-rebuilt VM
**Owner:** the Builder + Adversary loops (they execute *and* independently verify this).
**When:** after **C4/C5 PASS** (genuine throwaway-VM clean-room rebuild verified). The Builder then
performs the tailnet swap (§1) and runs the e2e; the Adversary independently verifies. It is the
**functional acceptance** of D8/clean-room: proof that the rebuilt-from-git VM doesn't just match
byte-for-byte, but actually *serves a real CI run end-to-end through the public domain*.
**This file:** `/srv/cc-ci/cc-ci-plan/test-e2e-testme-acceptance.md`
---
## 0. Why
The reproducibility gates (C1–C5) prove the rebuilt VM is structurally identical and boots clean.
This test proves it is **operationally** a working CI server: a maintainer comment triggers a build,
the app deploys and is reachable on its real public URL through the operator's gateway, the test
passes, and it tears down — the whole `!testme` pipeline, on the from-git VM, over the real domain.
---
## 1. Setup — the Builder performs the tailnet swap (then the e2e)
The rebuilt throwaway must become the live `cc-nix-test` so that the public gateway routes real
`ci.commoninternet.net` traffic to it (the gateway TLS-passthroughs via MagicDNS to
`cc-nix-test.taila4a0bf.ts.net` and re-resolves every ~10s, so it auto-follows the name). The swap is
**two reversible `tailscale set --hostname` commands** on VMs you already control — the Builder does
it. **Do this only after C4/C5 PASS** and after the rebuilt VM's full stack
(traefik + bridge + drone + dashboard) is up and serving locally.
**Order matters** (rename the original *aside first*, or the throwaway will get `cc-nix-test-1`):
1. **Rename the original prod VM aside** (it stays running — do NOT destroy it; needed for swap-back):
```
ssh cc-ci 'tailscale set --hostname=cc-nix-test-orig'
```
(`ssh cc-ci` is pinned to the original's IP `100.90.116.4`, so it keeps reaching the original
regardless of the name change.)
2. **Rename the rebuilt throwaway → `cc-nix-test`.** Re-derive its current tailscale IP (throwaways
get a fresh IP each rebuild): pick the ONLINE throwaway node from
`tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status | grep -i throwaway`, then:
```
ssh -i /srv/incus-terraform-nix-vm-creator/terraform-secrets/vm_ssh_key \
-o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' root@<throwaway-ip> \
'tailscale set --hostname=cc-nix-test'
```
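Picking the ONLINE throwaway in step 2 can be scripted rather than eyeballed. The following is a minimal sketch, not part of the committed tooling: `pick_online_throwaway` is a hypothetical helper name, and it assumes the usual `tailscale status` layout (IP in column 1, hostname in column 2, the word `offline` somewhere on the line for unreachable nodes).

```shell
# pick_online_throwaway (hypothetical helper): read `tailscale status`-style text on stdin
# and print the IP of the first node whose hostname contains "throwaway" and whose line
# does not mention "offline".
# Assumption: column 1 is the tailscale IP and column 2 is the hostname.
pick_online_throwaway() {
  awk 'tolower($2) ~ /throwaway/ && tolower($0) !~ /offline/ { print $1; exit }'
}
```

Possible usage: `tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status | pick_online_throwaway`. If it prints nothing, no online throwaway was found and the rename should not proceed.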
**Heads-up — tailnet-wide effect:** after the swap, `cc-nix-test.taila4a0bf.ts.net` resolves to the
rebuilt VM for *everyone* on the tailnet, so any of your own tooling that targets cc-nix-test **by
MagicDNS name** will now hit the rebuilt VM (tooling pinned to the raw IP `100.90.116.4` still hits
the original). Account for that when you point `!testme`/deploys.
**Verify the swap took (P1+P2) before starting the e2e** — must pass:
```
tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock status | grep cc-nix-test # → the throwaway's IP
curl -sS -o /dev/null -w '%{http_code} ssl_verify=%{ssl_verify_result}\n' https://ci.commoninternet.net/
# expect: 200 ssl_verify=0 (real public path now served by the rebuilt VM, valid cert)
```
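The expected-output comment above can be turned into a hard gate so the e2e never starts on a half-finished swap. A small sketch, assuming the same `curl -w` format string as above; the function names `check_swap_result` and `verify_public_path` are illustrative, not existing tooling.

```shell
# check_swap_result (hypothetical helper): succeed only if the line produced by the
# curl -w format above is exactly "200 ssl_verify=0".
check_swap_result() {
  [ "$1" = "200 ssl_verify=0" ]
}

# verify_public_path (hypothetical helper): run the documented curl against the public
# URL (default: https://ci.commoninternet.net/) and gate on its output.
verify_public_path() {
  vpp_out="$(curl -sS -o /dev/null \
    -w '%{http_code} ssl_verify=%{ssl_verify_result}\n' \
    "${1:-https://ci.commoninternet.net/}")" || return 1
  check_swap_result "$vpp_out"
}
```

Possible usage: `verify_public_path || { echo "swap not verified; do not start the e2e" >&2; exit 1; }`.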
**Swap-back when testing is done** (reversible): rename the throwaway back to its old name, then
`ssh cc-ci 'tailscale set --hostname=cc-nix-test'` to restore the original; the gateway re-follows.
---
## 2. Procedure
1. **Pick one fast, already-enrolled recipe.** Prefer the lightest enrolled app (e.g. `custom-html`)
so the run is quick and resource-cheap. Note the recipe + the repo/issue or PR where `!testme` is
recognised (the same place prior runs were triggered).
2. **Record the baseline.** Capture the recipe's *current* latest Drone run number and the dashboard
row (`https://ci.commoninternet.net/` and `https://drone.ci.commoninternet.net/...`) so you can
prove the run you trigger is **new**.
3. **Trigger via the real path.** Post `!testme` as the **bot** (the normal maintainer-comment
trigger) on that recipe — exactly as a real maintainer would. Do **not** invoke Drone directly or
shortcut the bridge; the comment→bridge→Drone path is part of what's under test.
4. **Confirm the bridge picked it up.** Within the bridge's poll interval, a **new** Drone build for
that recipe starts. Capture the new run number (must be > the baseline from step 2).
5. **Confirm the app deploys and is reachable on its PUBLIC URL.** While the build runs, the app is
deployed to its `*.ci.commoninternet.net` test domain. From **off the VM** (external — through the
gateway, not `localhost`/`127.0.0.1`), confirm a real request succeeds:
```
curl -sS -D- -o /dev/null https://<app-test-subdomain>.ci.commoninternet.net/
# expect: HTTP 200 (or the app's expected status), valid *.ci.commoninternet.net cert,
# served content from the deployed app — NOT a Traefik 404 / default-cert.
```
This is the crux: it proves routing public-DNS → gateway → MagicDNS → rebuilt VM → Traefik →
deployed app all works on the rebuilt server.
6. **Confirm the test logic passed.** The Drone build runs the recipe's real test assertions (app
state, not health-only) and finishes **success**.
7. **Confirm teardown.** After the run, the app is **undeployed** (no leftover stack/containers), per
the standard post-run cleanup — verify it's gone.
8. **Confirm the result was reported.** The outcome posts back to the trigger location and the
dashboard row updates to the new run with `success`.
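Steps 2 and 4 reduce to one comparison: the run number seen after posting `!testme` must be strictly greater than the recorded baseline. A hedged sketch of that check with a short polling wait follows. The getter is a caller-supplied command that prints the latest run number (how it queries Drone is deliberately left open), and the default interval and try count are assumptions, not values from the plan.

```shell
# is_new_run BASELINE CURRENT: succeed only if both are non-negative integers
# and CURRENT is strictly greater than BASELINE. Returns 2 on non-numeric input.
is_new_run() {
  for inr_n in "$1" "$2"; do
    case "$inr_n" in
      ''|*[!0-9]*) return 2 ;;
    esac
  done
  [ "$2" -gt "$1" ]
}

# wait_for_new_run BASELINE GETTER [INTERVAL_SECONDS] [TRIES]
# GETTER is a hypothetical, caller-supplied command or function that prints the
# recipe's latest run number. Prints the new run number and returns 0 once it
# exceeds BASELINE; returns 1 if no new run appears within TRIES polls.
wait_for_new_run() {
  wfr_baseline="$1"
  wfr_getter="$2"
  wfr_interval="${3:-30}"   # assumed default, tune to the bridge's poll interval
  wfr_tries="${4:-10}"      # assumed default
  wfr_i=0
  while [ "$wfr_i" -lt "$wfr_tries" ]; do
    wfr_cur="$("$wfr_getter" 2>/dev/null)" || wfr_cur=''
    if is_new_run "$wfr_baseline" "$wfr_cur"; then
      printf '%s\n' "$wfr_cur"
      return 0
    fi
    wfr_i=$((wfr_i + 1))
    sleep "$wfr_interval"
  done
  return 1
}
```

Keeping the comparison separate from the polling means the Adversary can re-run `is_new_run` on the recorded numbers alone, without access to the Builder's getter.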
---
## 3. Pass criteria (all must hold; Adversary verifies independently)
- [ ] **E1.** Self-check §1 passed (`ci.commoninternet.net` = 200, valid cert, on the rebuilt VM).
- [ ] **E2.** Posting `!testme` produced a **new** Drone build (run # > baseline) via the bridge —
not a manual Drone trigger.
- [ ] **E3.** The deployed app answered an **external** request on its real
`<app>.ci.commoninternet.net` URL (through the gateway) with the expected response + valid cert
— captured with headers/body evidence.
- [ ] **E4.** The Drone build's **real test assertions** ran and the build finished **success**
(no skipped/softened tests).
- [ ] **E5.** The app **undeployed** cleanly afterward (no residual stack).
- [ ] **E6.** Result reported back + dashboard updated to the new successful run.
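For E3, the status code can be pulled out of the captured `curl -D-` headers mechanically, which makes the logged evidence easy to re-check. A minimal sketch; `http_status` is a hypothetical helper name, and it only reads the status line, so cert validity and body content still need their own evidence.

```shell
# http_status (hypothetical helper): read HTTP response headers (as printed by
# `curl -sS -D- -o /dev/null URL`) on stdin and print the status code of the last
# status line seen. The "last" rule only matters when several header blocks are
# present, e.g. if redirects were followed. Exits non-zero if no status line is found.
http_status() {
  awk '
    { sub(/\r$/, "") }              # headers arrive with CRLF line endings
    /^HTTP\// { code = $2 }         # e.g. "HTTP/2 200" or "HTTP/1.1 404 Not Found"
    END { if (code != "") print code; else exit 1 }
  '
}
```

Possible usage: `curl -sS -D- -o /dev/null https://<app-test-subdomain>.ci.commoninternet.net/ | http_status`, then compare the printed code with the app's expected status.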
Evidence (run #, the external `curl` headers/body, dashboard before/after, undeploy proof) is logged
in `JOURNAL-1c.md`, and the verdict in `REVIEW-1c.md` / `STATUS-1c.md` as **E2E-TESTME — PASS**.
## 4. If it fails
Treat as a clean-room finding, not a config patch: a failure here means the from-git rebuild is
missing something the running server had out-of-band (a secret, a manual step, drift). Capture the
failing stage + logs in `JOURNAL-1c.md`, raise it as a blocker, and fix it in the **git source**
(base or `cc-ci-secrets`) so the next rebuild includes it — do **not** hand-fix the live VM. Re-run
this test after the fix.
## 5. Bound
One recipe, one green run. This is a functional smoke test of the rebuilt VM, not a full recipe-test
campaign (that's Phase 2). Don't expand scope here.