Files
cc-ci-orchestrator/cc-ci-plan/plan-phase1b-review-lint.md
autonomic-bot 36a6c9872a orchestrator: reboot-resilience + session auto-resume + full session plan/tooling
Reboot survival for the Pi orchestrator host:
- systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot
  records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the
  orchestrator session.
- reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count).
- launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed
  orchestrator announces itself (PushNotification) + reports reboots.
- AGENTS.md: on-startup notify routine documented.

Plans/tooling accumulated this session:
- plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review),
  sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance.
- launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution,
  limit-stall re-nudge, INBOX side-channel detection.
- plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED.
- prompts: isolation discipline + INBOX + pacing.
- .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, *.tmp.*).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:28:10 +01:00

175 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# cc-ci Phase 1b — Review & lint pass (Autonomous Build Plan)
**Status:** QUEUED — a **bounded** pass that runs after Phase 1 **and Phase 1c**
(`plan-phase1c-full-reproducibility.md`), and **before** Phase 2 (`plan-phase2-recipe-tests.md`). It
runs *after* 1c on purpose: the review/lint + full D1D10 re-verification then covers the final,
refactored state (the `cc-ci-secrets` split, cert-in-sops, the genuine D8 live rebuild).
**Transition:** **manual** (operator kicks it off).
**Builds on:** the complete post-1c codebase (flake/modules, `runner/` + harness, the comment-bridge,
dashboard, scripts, the first recipes' tests, the `cc-ci-secrets` split, docs).
**Owner agents:** same Builder + Adversary loops (`plan.md` §6/§7). Here the Adversary also acts as
**white-box reviewer** (reads the code, not just black-box behavior).
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1b-review-lint.md`
---
## 0. Why this phase (and why it's bounded)
Before scaling to ~18 recipes (Phase 2), clean and harden the foundation built in Phase 1 — it's far
cheaper to fix the harness/modules now than after every recipe copies their patterns. This is a
**one-time, bounded pass**, not an open-ended loop. Two distinct workstreams that go to two places,
to avoid bikeshedding:
1. **Style & hygiene → deterministic tooling** (formatters/linters in CI). No judgment calls, no
churn, enforceable going forward.
2. **Code health a linter can't judge → a white-box review checklist with teeth** (§3) — anchored to
plan-relevant invariants, *not* taste. Blocking items get fixed; the rest are advisory.
Discipline: **good-enough + enforceable**, not gold-plating. Don't reopen settled design; don't
weaken any test in the name of "cleanliness." When in doubt, lint it deterministically or leave it.
---
## 1. Definition of Done (Phase 1b exit condition)
Terminates when every item holds **and the Adversary/reviewer has independently confirmed** (logged
in `REVIEW.md`):
- [ ] **RL1 — Lint/format tooling in place + green.** Formatters + linters are added to the repo (a
`lint` entrypoint + a Nix devshell) and wired as a **stage in cc-ci's own `.drone.yml`** so
future commits stay clean. The whole Phase-1 codebase **passes** them. Tooling (defaults; settle
in `DECISIONS.md`):
- **Nix:** `alejandra` (or `nixpkgs-fmt`) format · `statix` (lints) · `deadnix` (dead code).
- **Python** (`runner/`, harness, `bridge/`, `tests/`): `ruff` (lint + format).
- **Shell:** `shellcheck` + `shfmt`.
- **(optional) YAML:** `yamllint` for `.drone.yml`/compose.
- [ ] **RL2 — Review checklist run, blocking items fixed.** The white-box checklist (§3) has been run
over the codebase; every **blocking** finding is fixed; every **advisory** finding is either
fixed or moved to `BACKLOG.md`/`IDEAS.md` as `[idea]` with a one-line rationale. Findings +
resolutions recorded in `REVIEW.md`.
- [ ] **RL3 — Full Phase-1 re-verification (the final gate).** This is the **last step before 1b is
marked DONE**, performed *after* all lint/review findings (RL1+RL2) have been responded to and
resolved. The Adversary **independently re-verifies every Phase-1 Definition-of-Done item
D1D10 from a cold start** — re-running each acceptance check itself, to the **same bar as
Phase 1's own DONE** (fresh PASS + evidence + timestamps logged in `REVIEW.md` within 24h). 1b
cannot be DONE until **all of D1D10 are re-confirmed green post-cleanup** and the Adversary
confirms no test/assertion was softened, skipped, or otherwise regressed by the cleanup.
- [ ] **RL4 — Documented.** `docs/` notes how to run lint/format locally and that CI enforces it; any
accepted deviations from the review checklist are in `DECISIONS.md`.
When RL1RL4 hold and are confirmed, write `## DONE` to Phase-1b `STATUS.md`.
---
## 2. Method
1. **Tooling first.** Add the formatters/linters + a `lint` entrypoint + devshell; run format, then
fix lint findings. Wire the `.drone.yml` lint stage so it's enforced from here on. (Auto-fixable
style should just be auto-fixed — don't deliberate over it.)
2. **Then the review checklist (§3).** Read the code against the checklist; classify each finding
**blocking** vs **advisory**; fix blocking, triage advisory.
3. **Re-verify everything last.** Once all fixes have landed, the Adversary re-runs the **entire**
Phase-1 D1D10 acceptance suite from a cold start (RL3) — this is the final gate; cleanup must
regress nothing. Order matters: tooling → review responses/fixes → *then* full re-verification.
4. **Bound it.** Cap iterations; this is a pass, not a rewrite. Record dead-ends/deviations and stop.
---
## 3. The white-box review checklist (teeth, not taste)
Blocking unless noted; these are *plan-relevant invariants visible only by reading code*:
- **Tests are real (blocking).** Every recipe/harness test asserts actual app state — not
health-only, not `skip`/`xfail`, not assertions that can't fail. (Pre-empts Phase-2 §7.1 at the
foundation.)
- **Harness is DRY (blocking-ish).** Recipe-specific quirks live in the **shared harness**, not
copy-pasted per recipe (the M6.5 "no harness surgery / no one-off hacks" rule) — so 18 recipes
don't multiply the mess.
- **Server state is Nix-declared & idempotent (blocking).** Bring-up is declarative idempotent
reconciliation (the `swarm-init`/`deploy-proxy` oneshot pattern, §9) — no imperative drift, no
run-once sentinels, no manual post-rebuild steps in `install.md`.
- **No footguns (blocking).** No bare `sleep` for readiness (use polling); correct `--chaos` usage;
no hardcoded versions/domains that break across upstream versions; teardown in `try/finally`;
per-run secrets reused across stages (not regenerated).
- **No secrets in code or committed files (blocking).** Grep + read for tokens/passwords/keys in
source, configs, `.drone.yml`, fixtures (complements the Adversary's behavioral leak test).
- **Architecture matches the plan (advisory→blocking on real drift).** Layout/》 modules follow
`plan.md` §3; the trigger is poll-primary + optional manual webhook (§4.1); Traefik is the
coop-cloud recipe (§4.2); deviations are recorded in `DECISIONS.md` (not silent).
- **Log redaction is real (blocking).** The log/dashboard pipeline scrubs secrets before publish.
- **Readability & error handling (advisory).** Clear names, sensible structure, errors surfaced not
swallowed, dead code removed — advisory; auto-fixable parts belong to the linters, not debate.
- **Docs accurate (advisory).** `install.md` reproduces from scratch; `enroll-recipe.md` matches the
real enroll flow.
---
## 4. Milestones (bounded)
- **W0 — Tooling + format.** Linters/formatters + `lint` entrypoint + devshell added; codebase
auto-formatted; `.drone.yml` lint stage wired and green. *Accept:* `lint` passes in CI from a clean
checkout.
- **W1 — Review + fixes.** Checklist (§3) run; blocking findings fixed; advisory triaged. *Accept:*
`REVIEW.md` lists findings + resolutions; no blocking item open.
- **W2 — Full re-verification + document.** *After* W0+W1 fixes land, the Adversary re-verifies **all
of D1D10** cold (RL3) and docs/deviations are recorded. *Accept:* every Phase-1 D1D10 item has a
fresh post-cleanup PASS + evidence in `REVIEW.md` (nothing weakened); then flip Phase-1b
`STATUS.md` to `## DONE`.
---
## 5. Guardrails
- **Never weaken a test to satisfy a lint/review nit** (the cardinal rule still wins).
- **Style → tooling, judgment → checklist.** Don't spend agent effort (or loop iterations) on what a
formatter/linter settles deterministically; don't let the review drift into subjective taste.
- **Bounded.** This is a pass, not a refactor project — fix invariants + enforce linting, then stop;
bigger improvements go to `IDEAS.md`, not this phase.
- **Don't reopen settled design.** Architecture decisions from Phase 1 stand unless there's real
drift from the plan; record, don't relitigate.
---
## 6. Open decisions (log in DECISIONS.md)
- Exact formatter choices (alejandra vs nixpkgs-fmt; ruff-format vs black) and lint strictness.
- Whether to add Python **type-checking** (mypy/pyright) now or defer to `IDEAS.md`.
- The precise **blocking vs advisory** split for the checklist.
- Whether the `.drone.yml` lint stage should **fail** the build or just warn initially.
---
## 7. Operator review items (added 2026-05-27) — repo layout (do in this 1b pass)
Two structural-review items from the operator. Both are **blocking** for 1b. Apply them as part of
this pass, then re-verify (RL3 covers the re-verification). **Mind the coordination caveats — these
touch the live flake build and the running multi-agent machinery.**
### RL5 — Consolidate all Nix-code folders under a root `nix/`
- Move the folders that contain `.nix` code — **`modules/` and `hosts/`** — to **`nix/modules/` and
`nix/hosts/`**. (Add future Nix dirs under `nix/` too.)
- **Keep `flake.nix` / `flake.lock` at the repo root** (entry point) so the build ref is unchanged
(`docs/install.md`'s `nixos-rebuild switch --flake 'git+file://…?submodules=1#cc-ci'` stays valid).
Just update the flake's internal paths (`./modules``./nix/modules`, `./hosts``./nix/hosts`)
and any `imports`/`scripts`/`.drone.yml` references.
- **Re-verify after the move:** the byte-identical clean-room result is the bar. The toplevel store
hash *will* change (paths differ) — that's fine; what must hold is that a fresh recursive clone
still rebuilds **byte-identical to the running system** and the Adversary re-confirms it cold
(folds into RL3). Update `docs/architecture.md` to describe the `nix/` layout.
### RL6 — Move uppercase multi-agent-protocol files into `machine-docs/`
- Move the uppercase protocol files — **`STATUS*.md`, `REVIEW*.md`, `JOURNAL*.md`, `BACKLOG*.md`,
`DECISIONS.md`** — into a root **`machine-docs/`** folder. **`README.md` stays in the repo root**
(operator decision, 2026-05-27) — it is the human-facing repo readme, not a protocol file; do
**not** move it into `machine-docs/`.
- **Update every reference** to the new paths: the `cc-ci-plan/` plans (this file, `plan.md`,
`plan-phase1c-*`, `README.md`, `kickoff.md`, `test-e2e-testme-acceptance.md`), `AGENTS.md`,
`.drone.yml`, `scripts/`, and any in-repo doc that points at `STATUS.md`/`REVIEW.md`/etc.
- **⚠ COORDINATION CAVEAT (do not move these unilaterally mid-run):** the live **watchdog**
(`cc-ci-plan/launch.sh`, the orchestrator's file) reads `STATUS-<id>.md` and `REVIEW-<id>.md` at
the **repo root** to drive handoff pings + the 1c→1b auto-transition. Moving them breaks the
running watchdog until `launch.sh` is updated to the `machine-docs/` paths and the watchdog is
restarted. **So sequence it with the orchestrator:** the orchestrator updates `launch.sh`'s
`PHASES_SPEC`/path logic and restarts the watchdog **in lockstep** with the loops' `git mv`.
Safest to do this **near the end of 1b** (or as its final step), not while a phase transition is
pending. Flag the orchestrator when ready and it will handle `launch.sh` + the watchdog restart.