M5: upgrade + backup/restore stages green (custom-html); backup-bot-two oneshot
All checks were successful
continuous-integration/drone/push Build is passing

3-stage run green (install/upgrade/backup), clean teardown. backupbot deployed
via reconcile oneshot; PTY (script) for abra backup/restore; -m for secret generate
(no value leak). M5 CLAIMED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-27 00:53:13 +01:00
parent 0fe3d7cda7
commit 7eb0dd3c77
12 changed files with 266 additions and 17 deletions

View File

@ -46,8 +46,10 @@ Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adver
→ 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.
### M5 — Upgrade + backup/restore stages
- [ ] Add upgrade + backup/restore stages for recipe #1
- [ ] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.
### M6 — Recipe-local tests + second recipe
- [ ] Discover/run recipe-repo tests/; enroll DB-backed recipe #2

View File

@ -340,3 +340,29 @@ services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, ja
**M3 still blocked** (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade +
backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).
## 2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)
**Upgrade stage** (tests/custom-html/test_upgrade.py): deploy previous published version
(git-tag sort, second-newest), write a data marker into the served volume (nginx serves
/usr/share/nginx/html, so the marker is HTTP-fetchable), `abra app upgrade` to current, assert
healthy + marker survived. Fix: `upgrade` has no `--chaos` flag (used `-f -D -n`).
**backup-bot-two** deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local
`backups` volume, restic_password abra-generated (only if missing). Fixes: `abra app secret generate`
needs `-m` (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never
hit the journal (D6). `abra app backup create`/`restore` need a real PTY ('input device is not a
TTY') → run via util-linux `script -qec` (harness `_run_pty`; util-linux added to cc-ci-run).
**Backup stage** (test_backup.py): write "original" → `abra app backup create` → mutate to
"mutated" → `abra app restore` → assert state back to "original".
**Full 3-stage run** (`STAGES=install,upgrade,backup`):
- install: 2 passed (http 200 + playwright)
- upgrade: 1 passed (data survives upgrade)
- backup: 1 passed (restore returns pre-mutation state)
- teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1.
M5 CLAIMED.
**M3 still blocked** (webhook; no operator response across several ticks). Plan: if still blocked,
pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.

View File

@ -1,9 +1,9 @@
# STATUS — cc-ci Builder
**Phase:** M4 complete & CLAIMED. M0/M1/M2 PASS. M3 gate BLOCKED (Gitea webhook delivery; operator).
M4 awaiting verdict. Next: M5 (upgrade + backup/restore for custom-html).
**In-flight:** M5add upgrade + backup/restore stages for recipe #1.
**Last updated:** 2026-05-27 (M4 claimed; install stage green)
**Phase:** M5 complete & CLAIMED. M0/M1/M2 PASS. M3 gate BLOCKED (Gitea webhook; operator). M4/M5 awaiting verdict.
Next: M6 (recipe-local tests + DB-backed recipe #2) — and pivot M3 to polling if webhook stays blocked.
**In-flight:** M6discover/run recipe-repo tests/ + enroll a second (DB-backed) recipe.
**Last updated:** 2026-05-27 (M5 claimed; 3-stage green for custom-html)
## Gates
- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo

View File

@ -13,6 +13,7 @@
../../modules/drone.nix
../../modules/drone-runner.nix
../../modules/bridge.nix
../../modules/backupbot.nix
../../modules/harness.nix
];

49
modules/backupbot.nix Normal file
View File

@ -0,0 +1,49 @@
# backup-bot-two (M5): the Co-op Cloud backup service. `abra app backup create <app>` / restore
# talk to it; it snapshots volumes labelled `backupbot.backup=true` into a local restic repo.
# Idempotent-reconcile oneshot (same pattern as proxy/drone). restic_password is abra-generated
# (class-B-style internal secret) and kept stable across reconciles (only generated if missing).
{ pkgs, ... }:
let
reconcile = pkgs.writeShellApplication {
name = "cc-ci-reconcile-backupbot";
runtimeInputs = with pkgs; [ abra docker gnused gnugrep coreutils git ];
text = ''
DOMAIN="backups.ci.commoninternet.net" # identity/stack name only; no web route
ENV_FILE="$HOME/.abra/servers/default/$DOMAIN.env"
abra server ls -m -n >/dev/null 2>&1 || abra server add --local -n || true
abra recipe fetch backup-bot-two -n >/dev/null
[ -f "$ENV_FILE" ] || abra app new backup-bot-two -s default -D "$DOMAIN" -n
set_env() {
sed -i -E "/^[[:space:]]*#?[[:space:]]*$1=/d" "$ENV_FILE"
printf '%s=%s\n' "$1" "$2" >> "$ENV_FILE"
}
set_env RESTIC_REPOSITORY /backups/restic
set_env SECRET_RESTIC_PASSWORD_VERSION v1
set_env CRONJOB_VERSION v1
have_secret() { docker secret ls --format '{{.Name}}' | grep -q "_$1_v1$"; }
# -m avoids the TTY/table (ioctl) path; redirect stdout so generated values never hit logs (D6).
have_secret restic_password || abra app secret generate "$DOMAIN" --all -m -n >/dev/null
abra app deploy "$DOMAIN" -n -C
'';
};
in
{
systemd.services.deploy-backupbot = {
description = "Reconcile backup-bot-two (volume backups via restic) via abra";
after = [ "swarm-init.service" "docker.service" "network-online.target" ];
requires = [ "swarm-init.service" "docker.service" ];
wants = [ "network-online.target" ];
wantedBy = [ "multi-user.target" ];
environment.HOME = "/root";
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = "${reconcile}/bin/cc-ci-reconcile-backupbot";
};
};
}

View File

@ -7,7 +7,7 @@ let
pyEnv = pkgs.python3.withPackages (ps: with ps; [ pytest playwright ]);
ccciRun = pkgs.writeShellApplication {
name = "cc-ci-run";
runtimeInputs = [ pyEnv pkgs.abra pkgs.docker pkgs.git pkgs.coreutils ];
runtimeInputs = [ pyEnv pkgs.abra pkgs.docker pkgs.git pkgs.coreutils pkgs.util-linux ];
text = ''
export PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers}
export PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1

View File

@ -19,6 +19,19 @@ class AbraError(RuntimeError):
pass
def _run_pty(args: list[str], timeout: int = 900, check: bool = True) -> subprocess.CompletedProcess:
"""Run abra under a pseudo-TTY (via util-linux `script`). Needed for commands that exec into
a container interactively (backup create / restore: 'the input device is not a TTY')."""
cmd = "abra " + " ".join(args)
proc = subprocess.run(
["script", "-qec", cmd, "/dev/null"],
capture_output=True, text=True, timeout=timeout,
)
if check and proc.returncode != 0:
raise AbraError(f"[pty] {cmd} failed ({proc.returncode}):\n{proc.stdout}\n{proc.stderr}")
return proc
def _run(args: list[str], timeout: int = 300, check: bool = True) -> subprocess.CompletedProcess:
proc = subprocess.run(
[ABRA, *args],
@ -64,7 +77,9 @@ def env_set(domain: str, key: str, value: str) -> None:
def secret_generate(domain: str, timeout: int = 300) -> None:
_run(["app", "secret", "generate", domain, "--all", "-n"], timeout=timeout, check=False)
# -m avoids the TTY/table (ioctl) path; output (which contains the generated values) is
# captured by _run and never logged. check=False: recipes with no secrets are a no-op.
_run(["app", "secret", "generate", domain, "--all", "-m", "-n"], timeout=timeout, check=False)
def deploy(domain: str, chaos: bool = True, timeout: int = 900) -> None:
@ -74,6 +89,34 @@ def deploy(domain: str, chaos: bool = True, timeout: int = 900) -> None:
_run(args, timeout=timeout)
def upgrade(domain: str, version: Optional[str] = None, timeout: int = 900) -> None:
args = ["app", "upgrade", domain]
if version:
args.append(version)
# -f no prompt, -D skip public-DNS checks (our per-run domains route via the gateway).
# (upgrade has no --chaos flag.)
args += ["-f", "-D", "-n"]
_run(args, timeout=timeout)
def backup_create(domain: str, timeout: int = 900) -> None:
_run_pty(["app", "backup", "create", domain, "-n"], timeout=timeout)
def restore(domain: str, timeout: int = 900) -> None:
_run_pty(["app", "restore", domain, "-n"], timeout=timeout)
def recipe_versions(recipe: str) -> list[str]:
"""Published versions of a recipe, oldest→newest (from the recipe git tags)."""
import os
import subprocess
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run(["git", "-C", path, "tag", "--sort=creatordate"],
capture_output=True, text=True)
return [t for t in proc.stdout.split("\n") if t.strip()]
def undeploy(domain: str, timeout: int = 600) -> None:
# NB: no --chaos here (unsupported).
_run(["app", "undeploy", domain, "-n"], timeout=timeout, check=False)

View File

@ -87,6 +87,52 @@ def wait_healthy(domain: str, ok_codes=(200, 301, 302), deploy_timeout: int = 60
raise TimeoutError(f"{domain}: not healthy over HTTPS (last status {last})")
def upgrade_app(domain: str, version: str | None = None) -> None:
abra.upgrade(domain, version=version)
def backup_app(domain: str) -> None:
abra.backup_create(domain)
def restore_app(domain: str) -> None:
abra.restore(domain)
def previous_version(recipe: str) -> str | None:
"""The second-newest published version (to deploy before upgrading to latest)."""
vers = abra.recipe_versions(recipe)
return vers[-2] if len(vers) >= 2 else None
def _app_container(domain: str, service: str = "app") -> str:
"""The running container id for <stack>_<service>."""
name = f"{_stack_name(domain)}_{service}"
proc = subprocess.run(
["docker", "ps", "--filter", f"name={name}", "--format", "{{.ID}}"],
capture_output=True, text=True,
)
cid = proc.stdout.strip().split("\n")[0]
if not cid:
raise RuntimeError(f"no running container for {name}")
return cid
def exec_in_app(domain: str, cmd: list[str], service: str = "app") -> str:
cid = _app_container(domain, service)
proc = subprocess.run(["docker", "exec", cid, *cmd], capture_output=True, text=True)
return proc.stdout
def http_body(domain: str, path: str = "/", timeout: int = 15) -> str:
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
req = urllib.request.Request(f"https://{domain}{path}", method="GET")
with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
return resp.read().decode(errors="replace")
def teardown_app(domain: str) -> None:
"""Idempotent, best-effort full teardown. Never raises (finalizer-safe)."""
abra.undeploy(domain)

View File

@ -56,24 +56,27 @@ def main() -> int:
fetch_recipe(recipe, ref, src)
test_dir = os.path.join(ROOT, "tests", recipe)
targets = []
overall = 0
ran = 0
for stage in stages:
fname = STAGE_FILES.get(stage)
if not fname:
print(f"unknown stage {stage}", file=sys.stderr)
return 2
path = os.path.join(test_dir, fname)
if os.path.exists(path):
targets.append(path)
else:
if not os.path.exists(path):
print(f" (skip {stage}: {path} not present)")
# also discover recipe-local tests later (D4); install stage first (M4)
if not targets:
continue
print(f"\n===== STAGE: {stage} =====", flush=True)
# each stage is its own pytest invocation => its own reported result (D2 separate stages)
rc = subprocess.call([sys.executable, "-m", "pytest", "-v", "-rA", path], cwd=ROOT)
ran += 1
if rc != 0:
overall = rc
if ran == 0:
print("no stage test files found", file=sys.stderr)
return 1
rc = subprocess.call([sys.executable, "-m", "pytest", "-v", "-rA", *targets], cwd=ROOT)
return rc
return overall
if __name__ == "__main__":

View File

@ -40,6 +40,18 @@ def app_domain(recipe) -> str:
return f"{tag}-{h}.ci.commoninternet.net"
@pytest.fixture
def deployed(recipe, app_domain, request):
"""Function-scoped: deploy the current/$REF version healthy, guaranteed teardown after.
Used by stages that start from current (install/backup)."""
version = os.environ.get("VERSION") or None
lifecycle.janitor()
request.addfinalizer(lambda: lifecycle.teardown_app(app_domain))
lifecycle.deploy_app(recipe, app_domain, version=version)
lifecycle.wait_healthy(app_domain)
return app_domain
@pytest.fixture(scope="session")
def deployed_app(recipe, app_domain):
"""Install stage: deploy the recipe and wait until healthy; tear down at session end."""

View File

@ -0,0 +1,28 @@
"""custom-html — backup/restore stage (D2): backup, mutate state, restore, assert the restored
state matches the pre-mutation (backed-up) state."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
def test_backup_mutate_restore(deployed):
domain = deployed
# 1) establish original state, then back it up
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo original > {MARKER_PATH}"])
assert lifecycle.http_body(domain, "/ci-marker.txt").strip() == "original"
lifecycle.backup_app(domain)
# 2) mutate state (diverge from the backup)
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])
assert lifecycle.http_body(domain, "/ci-marker.txt").strip() == "mutated"
# 3) restore -> state returns to the backed-up "original"
lifecycle.restore_app(domain)
lifecycle.wait_healthy(domain)
assert lifecycle.http_body(domain, "/ci-marker.txt").strip() == "original", \
"restore did not return the pre-mutation state"

View File

@ -0,0 +1,39 @@
"""custom-html — upgrade stage (D2): deploy the previous published version, write data, upgrade
to the current/$REF version, and assert the app stays healthy and data survives."""
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
@pytest.fixture
def old_app(recipe, app_domain, request):
prev = lifecycle.previous_version(recipe)
if not prev:
pytest.skip(f"{recipe}: no previous published version to upgrade from")
lifecycle.janitor()
request.addfinalizer(lambda: lifecycle.teardown_app(app_domain))
lifecycle.deploy_app(recipe, app_domain, version=prev)
lifecycle.wait_healthy(app_domain)
return app_domain, prev
def test_upgrade_preserves_data(old_app):
domain, prev = old_app
# write a data marker into the served volume (nginx serves /usr/share/nginx/html)
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo upgrade-survives > {MARKER_PATH}"])
assert lifecycle.http_body(domain, "/ci-marker.txt").strip() == "upgrade-survives"
# upgrade previous -> current/$REF
lifecycle.upgrade_app(domain, version=os.environ.get("VERSION") or None)
lifecycle.wait_healthy(domain)
# app healthy and the data written before the upgrade is still there
assert lifecycle.http_get(domain, "/") == 200
assert lifecycle.http_body(domain, "/ci-marker.txt").strip() == "upgrade-survives", \
"data did not survive the upgrade"