# JOURNAL — Phase 1b (review & lint pass)
Append-only Builder log: what I did + verifying command/output + next. (Adversary logs to REVIEW-1b.)
---
## 2026-05-27 — Phase 1b kickoff (first wake)
Read the phase plan (`plan-phase1b-review-lint.md`) + plan.md §6.1/§7/§9. Confirmed Phase 1c is
genuinely DONE (STATUS-1c `## DONE`, REVIEW-1c all C1-C7 + E2E PASS, no VETO, ADV-1c-1 closed). Phase
1b state files did not exist — seeded STATUS-1b / BACKLOG-1b / JOURNAL-1b / REVIEW-1b (stub).
Access + environment probes:
- `ssh cc-ci 'hostname && systemctl is-system-running'` → `nixos` / `running`.
- Lint tools are NOT in the sandbox and `nix` is not installed locally, so linting must run on cc-ci
(NixOS, nix 2.24.14, flakes enabled). `nix build github:NixOS/nixpkgs/<our-pin>#ruff` resolves from
cache.nixos.org (ruff 0.7.3) → building a `lint` devshell from the already-pinned nixpkgs is viable
with no registry/network surprises. shellcheck-0.10.0 already realized in the host store.
Lint-target inventory: 14 `.nix`, 32 `.py`, 1 `.sh` (`scripts/bootstrap-drone-oauth.sh`), plus
`.drone.yml` / `.sops.yaml` YAML. No prior lint/format decisions in DECISIONS.md (clean slate).
Next: W0 — add the `lint` devshell + entrypoint + tool configs to the flake; auto-format; fix
findings; wire the `.drone.yml` lint stage.
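
A minimal sketch of how the lint-target inventory above could be reproduced, assuming a plain walk of the checkout by file suffix. The function name `lint_inventory`, the suffix set and the skipped directories are illustrative assumptions, not part of the repo.

```python
from collections import Counter
from pathlib import Path

# Assumption: these are the suffixes worth counting for the inventory.
LINT_SUFFIXES = {".nix", ".py", ".sh", ".yml", ".yaml"}
# Assumption: directories that should never be linted or counted.
SKIP_DIRS = {".git", "result"}


def lint_inventory(root):
    """Count lint-target files under `root`, grouped by suffix."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Skip anything that lives below an excluded directory.
        if SKIP_DIRS.intersection(path.relative_to(root).parts):
            continue
        if path.suffix in LINT_SUFFIXES:
            counts[path.suffix] += 1
    return counts
```

Calling `lint_inventory(".")` from the repo root returns a `Counter` keyed by suffix; dot-prefixed files such as `.drone.yml` are counted under `.yml`.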
## 2026-05-27 — W0 built: lint toolchain + format + drone stage
Added (commits 2cede01 format/fixes, 4af427c drone stage, + tooling commits):
- `flake.nix`: `lint` devshell (`nix develop .#lint`) = nixpkgs-fmt, statix, deadnix, ruff,
shellcheck, shfmt, yamllint, built from the already-pinned nixpkgs (no registry/network surprise —
`nix build <pin>#ruff` resolves from cache.nixos.org). Default devshell also gets them.
- `scripts/lint.sh` (check / `--fix`), `ruff.toml`, `.yamllint.yaml`.
- `.drone.yml`: a `lint` step in the `event: push` pipeline running
`nix develop .#lint --command bash scripts/lint.sh` (FAILs the build on any unclean file).
Format/lint cleanup (semantics-preserving): ruff format on all 32 .py; nixpkgs-fmt drone-runner.nix;
shfmt scripts; ruff SIM105/SIM115 (contextlib.suppress / `with open`); statix (merge sops
`secrets.*`, empty-pattern → `_`); deadnix (drop unused `self`/`lib`/overlay `final`).
Verification (on cc-ci, clean tar'd checkout /tmp/ccci-lint):
```
$ nix develop .#lint --command bash scripts/lint.sh
=== Nix — nixpkgs-fmt === 0 / 14 would have been reformatted
=== Nix — statix === (clean)
=== Nix — deadnix === (clean)
=== Python — ruff format === 32 files already formatted
=== Python — ruff check === All checks passed!
=== Shell — shfmt/shellcheck === (clean)
=== YAML — yamllint === (clean)
lint: PASS
```
nix eval `.#nixosConfigurations.cc-ci.config.system.build.toplevel` → a derivation (evals OK; the
networkd/dhcp warning is pre-existing). Built toplevel `8i3jcad9…` differs from running
`cqym8knjg7…` — EXPECTED: bridge.py/dashboard.py (and runner) are `cp`'d into the store, so the
reformat changes their hash. cc-ci will be rebuilt to the formatted closure in W2 before RL3.
All Python byte-compiles (store python 3.12.8).
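
Why a formatting-only change still moves the toplevel hash can be illustrated with a small sketch. It assumes nothing about how Nix actually hashes store paths (plain SHA-256 over the file bytes is used here purely as a stand-in), and the helper names and sample snippets are hypothetical.

```python
import ast
import hashlib


def content_hash(source: str) -> str:
    """Stand-in for a content hash: SHA-256 over the raw source bytes."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def same_ast(a: str, b: str) -> bool:
    """True when two sources parse to the same AST (positions are ignored)."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# Hypothetical before/after of a formatter run: only quoting and spacing differ.
BEFORE = "x = {'a':1,'b':2}\n"
AFTER = 'x = {"a": 1, "b": 2}\n'
```

Here `same_ast(BEFORE, AFTER)` holds while `content_hash(BEFORE) != content_hash(AFTER)`: the program is unchanged, but any hash over the copied bytes is not.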
Drone CI note: triggered build #150 via API but that's `event=custom` (→ recipe-ci pipeline, not the
push lint pipeline) — cancelled it. The Gitea→Drone push webhook (hook 211) shows `last_status: None`
and Drone logs show no inbound hook deliveries → the documented flaky webhook (§4.1). Public and
canonical (100.90.116.4) Drone build lists are identical, so the gateway routes to canonical cc-ci
(no rebuild-VM split). Recorded the flaky-webhook as a pre-existing infra item in DECISIONS.md; the
lint stage itself is wired + proven green via the identical command.
Claimed W0 gate (RL1) in STATUS-1b. Next: W1 white-box review checklist over the cleaned codebase.
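
For illustration only, the shape of a check-everything-then-fail gate like the one `scripts/lint.sh` provides could be sketched as below. This is not the script's actual implementation: `run_checks` is a hypothetical helper, and whether the real script keeps going after a failing tool is an assumption.

```python
import subprocess


def run_checks(checks):
    """Run every (label, argv) check and return the labels that failed.

    Assumption: all checks run even after a failure, so one pass reports
    every unclean tool instead of only the first.
    """
    failed = []
    for label, argv in checks:
        print(f"=== {label} ===")
        result = subprocess.run(argv)
        if result.returncode != 0:
            failed.append(label)
    print("lint: FAIL" if failed else "lint: PASS")
    return failed


# Hypothetical check list; the real script's exact invocations may differ.
EXAMPLE_CHECKS = [
    ("Python: ruff format", ["ruff", "format", "--check", "."]),
    ("Python: ruff check", ["ruff", "check", "."]),
]
```

A caller would exit non-zero when `run_checks(EXAMPLE_CHECKS)` returns a non-empty list, which is what makes the CI step fail the build.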
## 2026-05-27 — W0 PASS (Adversary cold, RL1) + W1 Builder-side §3 self-review
Adversary logged **W0/RL1 PASS** (REVIEW-1b): cold checkout of my HEAD `233939a` archived to cc-ci,
`nix develop .#lint --command bash scripts/lint.sh` → exit 0 `lint: PASS`, plus a break-it probe
(injected bad .py/.nix → exit 1 `lint: FAIL`) proving the gate has teeth. Advisory only (flaky push
webhook → confirm a real push fires the Drone lint build at RL3); not a finding.
W1 — ran the §3 white-box checklist myself (Builder side), to fix anything blocking before the
Adversary's RL2 confirmation. Findings over the post-W0 (cleaned) codebase:
- **Tests real (blocking)** — holds. (Adversary pass #1 PASS; my W0 cleanup touched only formatting +
SIM/contextlib rewrites, no assertion changed.)
- **Harness DRY (blocking-ish)** — holds. `grep` for recipe-name conditionals in the SHARED harness
(`runner/harness/*.py`, `run_recipe_ci.py`, `conftest.py`) → NONE. Per-recipe quirks are data:
optional `tests/<recipe>/recipe_meta.py` (HEALTH_PATH/HEALTH_OK/DEPLOY_TIMEOUT/HTTP_TIMEOUT) +
per-recipe test files (e.g. keycloak `kc_admin.py`). Enrolling needs no shared-harness edit (D5).
- **Nix idempotent (blocking)** — holds (no `.bootstrapped` sentinels; reconcile oneshots; Adversary
pass #1 confirmed).
- **No footguns (blocking)** — holds. Every `time.sleep()` (lifecycle.py 160/170/226/252,
bridge.py 304) sits inside a `while time.time() < deadline:` poll/retry loop (verified each), not a
bare readiness wait. `--chaos` appears ONLY in "never pass it" comments (abra.py). No `shell=True`.
- **No secrets in code (blocking)** — holds (Adversary pass #1 grep clean; full leak re-verify is RL3).
- **Log redaction real (blocking)** — holds. `run_recipe_ci.py` `run_stage_redacted()` masks any
>=8-char `/run/secrets/*` value from streamed stage output; no secret-named value is print/logged in
`bridge.py`/`dashboard.py` (grep clean).
- **Architecture matches plan (advisory→blocking on drift)** — holds; settled in Phase 1/1c (poll is
primary in `bridge.py`'s loop; `/hook` optional; traefik is the coop-cloud recipe via `proxy.nix`).
No drift; not reopening settled design (guardrail §5).
- **Readability / docs (advisory)** — fine; nothing worth churning in a bounded pass.
**No blocking finding; nothing to fix; no advisory item to file.** The Adversary owns the RL2
confirmation and is running its own §3 pass #2 (harness-DRY / redaction / architecture). Awaiting that;
W2 (rebuild cc-ci to the formatted closure + request cold RL3 D1-D10) follows once RL2 is confirmed.
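
The deadline-bounded polling pattern referred to in the footguns check can be sketched as follows. This is a generic illustration, not the code in `lifecycle.py` or `bridge.py`; `wait_until` and its parameters are hypothetical, and the injectable `clock`/`sleep` exist only to make the sketch testable.

```python
import time


def wait_until(probe, timeout, interval=1.0, clock=time.time, sleep=time.sleep):
    """Poll `probe()` until it returns truthy or `timeout` seconds elapse.

    The sleep sits inside the deadline loop, so it is a retry delay between
    probes rather than a bare "sleep and hope it is ready" wait.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

A readiness wait then becomes `wait_until(lambda: is_healthy(), timeout=300)`, where `is_healthy` stands for whatever probe the caller already has.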
## 2026-05-27 — RL2 clean + RL5 (nix/ consolidation) + W2 switch to cleaned closure
**RL2 (Adversary §3 pass #2):** no blocking findings; 2 advisories — (a) `old_app` upgrade-fixture
copy-paste across recipes → triaged to IDEAS (per-recipe upgrade tests are by design; sharing is a
nicety, not a DRY-blocker); (b) app-secret redaction: the `cc-ci-run` Drone step path isn't wrapped by
`run_stage_redacted`, so the Adversary will re-run the behavioral D6 leak test at RL3 (grep published
Drone logs + dashboard for a known generated app password). My Builder §3 self-review agreed (no
blockers). W1 is light/clean.
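
A minimal sketch of value-based log redaction in the spirit of `run_stage_redacted()`, assuming secrets are read from one file per value and masked by literal substring replacement. The helper names, the mask string and the directory handling are assumptions; the real function may work differently.

```python
from pathlib import Path

MIN_SECRET_LEN = 8  # shorter values are skipped to avoid masking common substrings


def load_secret_values(secrets_dir):
    """Read every file in `secrets_dir` and return the values worth masking."""
    directory = Path(secrets_dir)
    if not directory.is_dir():
        return []
    values = []
    for path in sorted(directory.iterdir()):
        if path.is_file():
            value = path.read_text(errors="replace").strip()
            if len(value) >= MIN_SECRET_LEN:
                values.append(value)
    # Longest first, so a secret that contains another one is masked whole.
    return sorted(values, key=len, reverse=True)


def redact_line(line, secrets, mask="***"):
    """Replace every occurrence of a known secret value in one output line."""
    for secret in secrets:
        line = line.replace(secret, mask)
    return line
```

Redaction of this kind only covers output that actually passes through `redact_line`, which is why a step that bypasses the wrapper still needs the behavioral leak test.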
**RL5 — consolidate Nix code under `nix/`** (operator item, plan §7). `git mv modules nix/modules`,
`git mv hosts nix/hosts`; flake.nix/flake.lock stay at root (`#cc-ci` unchanged); only flake's
internal configuration.nix path + the moved modules' root-relative refs changed (`../X` → `../../X`).
Built on cc-ci → toplevel `8i3jcad9…` **byte-identical to the pre-move build** (content-addressed;
module .nix not in the runtime closure). Living docs + `.drone.yml` comment updated to `nix/…`.
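
The relative-path adjustment needed after moving a directory one level deeper can be computed mechanically. The sketch below is illustrative only: `rebase_ref` is a hypothetical helper, and `/repo` is an arbitrary anchor used so the result does not depend on the current working directory.

```python
import posixpath


def rebase_ref(ref, old_dir, new_dir):
    """Rewrite a relative reference so it still points at the same target
    after the file containing it moves from `old_dir` to `new_dir`.

    Only references that point outside the moved tree need this; siblings
    that move together keep their relative paths.
    """
    root = "/repo"  # arbitrary absolute anchor, never touched on disk
    target = posixpath.normpath(posixpath.join(root, old_dir, ref))
    return posixpath.relpath(target, posixpath.join(root, new_dir))
```

For the move described above, `rebase_ref("../X", "modules", "nix/modules")` yields `"../../X"`.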
**W2 — switched canonical cc-ci to the cleaned+RL5 closure** so `build == running` (required before
RL3: a fresh clone builds `8i3jcad9`; running had to match or the byte-identical-to-running check
would fail). Re-synced `/root/cc-ci` to HEAD, `nixos-rebuild switch --flake 'path:/root/cc-ci#cc-ci'`:
```
stopping units: deploy-bridge.service, deploy-dashboard.service
sops-install-secrets: Imported …ssh_host_ed25519_key as age key (age1h90utdz…)
starting units: deploy-bridge.service, deploy-dashboard.service
```
Post-switch health (all green):
- `readlink /run/current-system` → `8i3jcad9mrr01558lqckpi26nxn2ra3m-…` (== fresh-clone build; was
`cqym8knjg7…` pre-format).
- `systemctl is-system-running` → `running`, **0 failed**. deploy-bridge/deploy-dashboard `active`.
- 5 stacks up (backups, ccci-bridge, ccci-dashboard, drone, traefik); `ccci-bridge_app` +
`ccci-dashboard_app` 1/1 with NEW content-hash image tags (reformatted source redeployed).
- Public via SOCKS proxy → gateway → cc-ci: `https://ci.commoninternet.net/` → **200**
(`<title>cc-ci — Co-op Cloud recipe CI</title>`); `/badge/custom-html.svg` → **200**.
Net: RL1 PASS, RL2 clean, RL4 docs landed (README lint section + architecture.md `nix/` layout),
RL5 done + healthy, running==build==`8i3jcad9`. Remaining for DONE: **RL3** (Adversary cold D1-D10
re-verify, now also covering the RL5 byte-identical rebuild) and **RL6** (coordinated machine-docs/
move — LAST, with orchestrator lockstep). Claiming the RL3 gate.
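
The `build == running` condition amounts to comparing the hash prefix of two store paths. A minimal sketch, assuming the usual `<hash>-<name>` store-path basename; the helper names are hypothetical and the default link path is the one used in the health check above.

```python
import os


def store_hash(store_path):
    """Return the hash prefix of a store-path basename such as `<hash>-<name>`."""
    base = os.path.basename(os.path.normpath(store_path))
    return base.split("-", 1)[0]


def build_matches_running(built_path, running_link="/run/current-system"):
    """True when the running system symlink resolves to the same store hash
    as the freshly built toplevel."""
    running = os.path.realpath(running_link)
    return store_hash(running) == store_hash(built_path)
```

On the host this would be called with the output path of the toplevel build; any mismatch means the running system predates the current checkout.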
## 2026-05-27 — push-webhook diagnostic (the RL1 "future commits stay clean" advisory)
Timeboxed root-cause on why pushes don't auto-create a Drone lint build. Fired Gitea's webhook test
for the Drone hook (211) while tailing the Drone server logs:
- `POST /repos/recipe-maintainers/cc-ci/hooks/211/tests` → Gitea returns **204** (accepted).
- `docker service logs --since 20s drone_…_app` → **NOTHING** — no inbound request logged at all.
So the delivery `git.autonomic.zone (Gitea) → drone.ci.commoninternet.net (public gateway) → cc-ci`
isn't reaching Drone. This is a **gateway/network reachability** condition, NOT a Drone-side config
I can fix — and per §9 the gateway is operator-managed (not ours to reconfigure). Leaving it as the
documented pre-existing advisory (hook `last_status: None`, §4.1). Impact is limited to cc-ci's OWN
self-test/lint pipeline auto-firing; **recipe-CI triggering is unaffected** — the comment-bridge
polls Gitea *outbound* (cc-ci → git.autonomic.zone, the reliable direction), which is the plan's
primary trigger (§4.1). The lint stage is wired + proven green via its exact command; manual/API
Drone builds work. Not expanding scope to re-engineer the inbound path (bounded pass).
## 2026-05-27 — RL3 FULL D1-D10 PASS (Adversary cold). Only RL6 (coordinated) left.
Adversary logged **RL3 PASS** (REVIEW-1b): all D1-D10 re-verified cold on the cleaned+RL5
byte-identical closure (`8i3jcad9`==running==fresh-clone build), fresh <24h evidence, nothing
weakened. Highlights: D1 trigger 20s/8s; D2 install/upgrade/backup green (upgrade actually ran, not
skipped) on custom-html + keycloak; D6 leak test 0 hits (8/8 infra + cert/key + generated keycloak
admin pw absent from logs/dashboard); D8 fresh-recursive-clone rebuild == running; D10 = 2 fresh
category runs (#151 custom-html, #152 keycloak) + carry-forward of the Phase-1 Adversary-verified
6/6 set (byte-identical harness/test/closure). Cardinal-rule PASS. **RL1-RL5 Adversary-PASS, no open
findings, NO VETO.**
Flagged the orchestrator (STATUS-1b) that I'm **ready for the RL6 coordinated cutover**: it updates
`launch.sh` to `machine-docs/` paths + restarts the watchdog; on its signal I `git mv`
STATUS*/BACKLOG*/JOURNAL*/DECISIONS.md into `machine-docs/` (README stays root), the Adversary moves
REVIEW*, I fix the only in-repo refs (README Loop-state + docs/install.md:15), Adversary re-verifies,
then I write `## DONE`. Holding all root protocol files in place until that signal (moving them early
breaks the live watchdog). Loop continues; not idling on a long sleep, just a short fallback while awaiting
the orchestrator go-ahead.
## 2026-05-27 — RL6 PASS → Phase 1b DONE
Adversary logged **RL6 PASS** + **FINAL SIGN-OFF: all RL1-RL6 Adversary-PASS, NO VETO** (it moved its
own REVIEW*.md → machine-docs/, re-verified refs + the watchdog `resolve_state` handoff survived the
lockstep cutover). No open `[adversary]` findings; advisories → IDEAS + the documented push-webhook one.
DONE-handshake conditions (plan §6.1) met: a <24h Adversary PASS for every RL1-RL6 + the full cold
D1-D10, no standing `## VETO`. Final Builder health: cc-ci `running`/0-failed, toplevel
`8i3jcad9mrr01558lqckpi26nxn2ra3m` == fresh-clone build (build==running, byte-identical), 5 stacks up,
public `https://ci.commoninternet.net/` 200. Wrote `## DONE` to machine-docs/STATUS-1b.md.
**Phase 1b is genuinely DONE.** The foundation is now: formatted + lint-clean (CI-enforced via the
`.drone.yml` lint stage), all Nix code under `nix/` (flake at root, `#cc-ci` unchanged), multi-agent
protocol files under `machine-docs/`, and every Phase-1 D1-D10 re-verified cold on the cleaned closure
with nothing weakened. Builder loop terminating.