Phase 29: Deployment Hardening & Agents Admin (closes v4.0) #63

Open
SimplicityGuy wants to merge 62 commits into main from gsd/phase-29-deployment-hardening-agents-admin

@SimplicityGuy
Owner

Summary

Phase 29: Deployment Hardening & Agents Admin — final phase of milestone v4.0 Distributed Agents.

Goal: A real two-host deployment runs end-to-end with the application server holding no file mounts, HTTPS + Redis hardening in place, and an admin can see at a glance which agents are alive and healthy.

Verification: 6/6 must-haves verified (docs/code/test). One human-UAT item — real two-host hardware smoke — is operator-deferred as verified-docs-only (hardware not yet available); tracked in 29-HUMAN-UAT.md and explicitly accepted by the v4.0 milestone audit (status: passed).

This PR ships Phase 29 and closes milestone v4.0. It strips file mounts from the application server compose, adds a self-signed internal CA with HTTPS termination, hardens Redis (requirepass + LAN bind), and introduces docker-compose.agent.yml for file servers, per-file-server model download, a 30 s heartbeat, and the /admin/agents page. It then archives the v4.0 milestone artifacts and tags v4.0.

Changes

Phase 29 plans (8/8 complete)

| Plan | Wave | What |
|---|---|---|
| 29-01 | 1 | TLS termination + phaze.cert_bootstrap (idempotent CA + leaf x509 via cryptography), pre-uvicorn entrypoint shim (bootstrap → execvp uvicorn), PhazeAgentClient.verify= kwarg defaulting to AgentSettings.agent_ca_file, wrong-CA → ConnectError integration test |
| 29-02 | 1 | AgentSettings.agent_env field + production-mode redis_url password validator (fail-fast on passwordless URL when PHAZE_AGENT_ENV=production) |
| 29-03 | 1 | Root docker-compose.yml: strip SCAN_PATH/MODELS_PATH mounts, delete watcher/agent-worker/audfprint/panako services, harden Redis with ${REDIS_PASSWORD:?} + ${REDIS_BIND_IP:-127.0.0.1}, .env.example updates, filesystem-isolation YAML-parse tests |
| 29-04 | 2 | docker-compose.agent.yml (4 services: worker, watcher, audfprint, panako) + .env.example.agent + agent-compose YAML-parse tests + docker-publish.yml extended for agent-compose image tags |
| 29-05 | 2 | Models auto-download: phaze.scripts.download_models Python helper + phaze.tasks._shared.model_bootstrap (rejects partial-download .part state) + agent_worker/watcher startup wiring + bash shim rewrite |
| 29-06 | 3 | Heartbeat caller: phaze.tasks.heartbeat.heartbeat_tick + SAQ CronJob registration in agent_worker.settings (30 s cron) |
| 29-07 | 3 | Agents admin page: services.agent_liveness classifier + utils.humanize + routers.admin_agents + 3 Jinja templates + base.html nav link + HTMX 5 s auto-refresh |
| 29-08 | 4 | justfile recipes (up-agent, up-all) + docs/deployment.md + PROJECT.md Deployment subsection + scripts/update-project.sh touch + human-verify checkpoint |

Bug fixes from code review (Phase 29 CR pass)

  • fix(29-cr-01,cr-02): production https:// guard on PhazeAgentClient; bind PHAZE_REDIS_URL env var through to compose
  • fix(29-cr-03): model_bootstrap rejects partial-download (.part) state on startup

v4.0 milestone close

  • docs(milestone-v4.0): audit report (status: passed) + documentation drift closure (REQUIREMENTS.md traceability sync, ROADMAP.md checkbox sync)
  • chore: archive v4.0 milestone files (milestones/v4.0-ROADMAP.md, milestones/v4.0-REQUIREMENTS.md, milestones/v4.0-MILESTONE-AUDIT.md); MILESTONES.md prepended; PROJECT.md full evolution review; STATE.md → milestone_complete; ROADMAP.md collapses Phases 24-29 into <details>
  • chore: remove REQUIREMENTS.md for v4.0 milestone close (git rm, per the fresh-per-milestone convention)
  • docs: update retrospective for v4.0 milestone — v4.0 section (built, worked, inefficient, 12 patterns, 8 lessons) + Cross-Milestone Trends updated

Diff scope

  • 84 files changed, 13,998 insertions / 577 deletions
  • 35 Python files (3,008 / 52)
  • 60 commits

Requirements Addressed

Phase 29 closes the final 6 v4.0 requirements:

  • DIST-01: App-server runs API/UI/Postgres/Redis/fileless worker; no SCAN_PATH/MODELS_PATH mounts
  • AUTH-02: All agent → app-server traffic over HTTPS via self-signed internal CA
  • AUTH-03: Redis requirepass + LAN-only bind; agents connect with redis://default:<password>@<host>:6379
  • OPS-02: docker-compose.agent.yml brings up worker/watcher/audfprint/panako on file server
  • OPS-03: just download-models populates per-file-server local /models volume
  • OPS-04: 30 s heartbeat + /admin/agents page with liveness/queue depth/last-seen

With this, 26/26 v4.0 requirements satisfied (full traceability in milestones/v4.0-REQUIREMENTS.md).

Verification

  • Automated verification: 6/6 must-haves verified (29-VERIFICATION.md)
    • TLS cert generation + httpx verify boundary
    • Production Redis URL validator
    • Compose filesystem isolation (YAML parse)
    • Agent compose template parse + tag publish
    • Models bootstrap (incl. partial-download rejection)
    • Heartbeat cron + agents admin page contract
  • v4.0 milestone audit: status: passed (milestones/v4.0-MILESTONE-AUDIT.md) — 26/26 reqs, 6/6 phases, 22/22 cross-phase wires, 4/5 flows complete + 1 advisory edge
  • Human UAT (deferred): real two-host hardware smoke — operator-accepted as verified-docs-only; tracked in 29-HUMAN-UAT.md. Will be run when file-server hardware is available.

Key Decisions

  • Pre-uvicorn entrypoint shim — bootstrap-then-execvp uvicorn so signals and PID-1 propagate cleanly; no double-process tree under Docker
  • Self-signed internal CA generated in api container on first start — no DNS dependency, no public ACME, no rotation pain for single-user LAN. Operator distributes public cert via scp; CA private key (phaze-ca.key, mode 0600) never leaves the app-server
  • AgentSettings.agent_env fails fast at boot on a passwordless redis_url when PHAZE_AGENT_ENV=production, surfacing the misconfiguration at startup rather than as a silent unauthenticated connection
  • docker-compose.agent.yml enforces ${SCAN_PATH:?...} on all four services — compose parse fails before any container starts on a misconfigured file-server host
  • phaze.tasks._shared.model_bootstrap rejects .part files at startup (CR-03) — prevents corruption from interrupted download_models runs
  • Banner emission via both print() AND logger.warning() (CONTEXT D-02) — visible regardless of logging configuration

Tech Debt Carried Forward

Documented in 29-REVIEW.md and milestones/v4.0-MILESTONE-AUDIT.md for follow-up:

  • WR-01: Path.write_bytes() creates the key files with umask-default permissions and chmod(0o600) only runs afterwards; CA + leaf private keys are world-readable for a brief window on the bind mount (cert_bootstrap.py:215-234)
  • WR-02: docker-compose.yml binds Postgres on 0.0.0.0 (no equivalent guard to Redis's ${REDIS_BIND_IP:-127.0.0.1} hardening)
  • WR-03: .part temp file not cleaned up on download error; next run leaves a stale sibling (download_models.py:84-90)
  • WR-04: watcher service mounts MODELS_PATH:rw despite never writing (unnecessary write grant)
  • flow-exec-revoked-breakdown (Phase 28 advisory) — execution router passes skipped_revoked count but not revoked_agents list to progress.html
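
A possible follow-up for WR-01, sketched here under assumptions and not part of what this PR ships: create the key file with restrictive permissions from the first byte instead of tightening them after the write. `write_private_bytes` is a hypothetical helper name, and the sketch assumes a POSIX host.

```python
import os
from pathlib import Path


def write_private_bytes(path: Path, data: bytes) -> None:
    """Create `path` with mode 0o600 from the start (hypothetical helper)."""
    # O_EXCL refuses to reuse an existing file, so a pre-created file with
    # looser permissions can never be written into by accident.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    # The mode passed to os.open is masked by the umask, which can only
    # clear permission bits, never add group/other access.
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "wb") as fh:  # fdopen takes ownership of the descriptor
        fh.write(data)
```

Because the file never exists with wider permissions, there is no window to close with a later chmod. The O_EXCL flag fits an idempotent bootstrap that skips generation when the files are already present.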

Milestone v4.0 Distributed Agents — SHIPPED

This PR closes the milestone. Tag v4.0 is created locally on this branch and should be pushed from main after merge:

git checkout main && git pull && git push origin v4.0

🤖 Generated with Claude Code

SimplicityGuy and others added 30 commits May 16, 2026 10:17
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Checker verdict: PASS on 5 dimensions; FLAG on Dimension 5 (spacing) for
two pre-existing project invariants inherited from Phases 27/28
(py-0.5 pill padding, py-3 table cell padding). Both are documented
in the spec's Spacing Exceptions section. Non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified against installed binaries:
- SAQ 0.26.3 + croniter 6.2.2 — 6-field trailing-seconds cron
- httpx 0.28.1 supports verify=<path>
- cryptography NOT a transitive dep — must be added as new runtime dep
- agent_worker is single .py file (not a package) — no refactor needed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 plans across 4 waves covering DIST-01, AUTH-02, AUTH-03, OPS-02, OPS-03, OPS-04. Plan-checker passed after one revision round (3 BLOCKERs + 8 WARNINGs addressed). Decision coverage 23/23.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RED phase of Plan 29-01 Task 1. Adds:
- cryptography>=46.0.0,<49 runtime dep (NOT transitive, per RESEARCH
  Critical Discovery #1; verified via uv pip list)
- tests/test_cert_bootstrap.py with 7 LOCKED cases (D-22):
    1. 4-file generation + x509 parseability
    2. idempotency (mtimes unchanged on second call)
    3. banner-via-stdout (capsys) + Pitfall-4 no-secret-leak guard
    4. file modes (0o644 certs, 0o600 keys)
    5. SubjectAlternativeName entries match input
    6. _parse_san_entries DNS vs IP dispatch
    7. WARNING-8 banner-via-logger.warning (caplog) +
       Pitfall-4 no-secret-leak parity for logger channel
- tests/test_task_split.py::test_cert_bootstrap_stays_postgres_free
  (Phase 29 D-22 extension of D-25)

All 7 cert_bootstrap tests + the new task_split case currently fail
with ModuleNotFoundError -- this is the expected RED state.

Refs: D-01, D-02, D-22; AUTH-02
GREEN phase of Plan 29-01 Task 1. Implements:

src/phaze/cert_bootstrap.py
    - ensure_certs_present(certs_dir, cn, sans_csv): idempotent CA + leaf
      generation via cryptography.x509. ECDSA P-256, 10-year CA, 2-year
      leaf. Writes phaze-ca.{crt,key} + phaze-server.{crt,key} with
      0o644 / 0o600 modes.
    - _parse_san_entries(): DNSName for hostnames, IPAddress for IPs
      via ipaddress.ip_address dispatch.
    - _generate_ca / _generate_leaf: x509.CertificateBuilder with
      BasicConstraints + KeyUsage extensions per RESEARCH Pattern 1.
    - Loud banner on actual generation via BOTH print() AND
      logger.warning() (CONTEXT D-02 D-discretion "Both", WARNING-8).
      Banner is a LITERAL CONSTANT referencing only phaze-ca.crt --
      never the private key (Pitfall 4).
    - IMPORT-BOUNDARY INVARIANT: no phaze.database / sqlalchemy.ext.asyncio
      imports (extends Phase 26 D-25; verified by test_task_split).
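
The DNS-vs-IP dispatch described for _parse_san_entries can be sketched with the standard library alone. This is an illustration, not the module's code: the real helper presumably returns cryptography's x509.DNSName / x509.IPAddress objects, whereas this stand-in (`parse_san_entries`, a hypothetical name) returns tagged tuples so it runs without third-party packages.

```python
import ipaddress


def parse_san_entries(sans_csv: str) -> list[tuple[str, str]]:
    """Split a comma-separated SAN list into ("dns", ...) / ("ip", ...) pairs."""
    entries: list[tuple[str, str]] = []
    for raw in sans_csv.split(","):
        value = raw.strip()
        if not value:
            continue  # tolerate trailing commas and stray whitespace
        try:
            ip = ipaddress.ip_address(value)
        except ValueError:
            # Not parseable as IPv4/IPv6, so treat it as a hostname.
            entries.append(("dns", value))
        else:
            entries.append(("ip", str(ip)))
    return entries
```

With the default SAN list from this PR, `parse_san_entries("localhost,127.0.0.1,api")` yields two DNS entries and one IP entry.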

src/phaze/entrypoint.py
    - main(): reads PHAZE_CERTS_DIR / PHAZE_API_HOST / PHAZE_API_TLS_SANS,
      calls ensure_certs_present, then os.execvp uvicorn with
      --ssl-keyfile / --ssl-certfile flags pointing at the generated
      leaf cert. Process replacement so signals + PID-1 propagate
      cleanly (RESEARCH Pattern 2).
    - Invoked from compose as `uv run python -m phaze.entrypoint`.
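
A rough sketch of the bootstrap-then-exec shape described above, under stated assumptions: the app import string `phaze.main:app`, the port, and the `run` / `build_uvicorn_argv` names are hypothetical; only the `--ssl-keyfile` / `--ssl-certfile` flags and the phaze-server.{crt,key} file names come from this commit. The exec function is injectable so the flow can be exercised without replacing the current process.

```python
import os
from pathlib import Path
from typing import Callable, Sequence


def build_uvicorn_argv(certs_dir: Path, bind_host: str, port: int = 8000) -> list[str]:
    """Assemble the uvicorn command line pointing at the generated leaf cert."""
    return [
        "uvicorn",
        "phaze.main:app",  # hypothetical import string, not confirmed by the PR
        "--host", bind_host,
        "--port", str(port),  # port is an assumption
        "--ssl-keyfile", str(certs_dir / "phaze-server.key"),
        "--ssl-certfile", str(certs_dir / "phaze-server.crt"),
    ]


def run(
    bootstrap: Callable[[Path], None],
    certs_dir: Path,
    bind_host: str,
    execvp: Callable[[str, Sequence[str]], object] = os.execvp,
) -> None:
    """Generate certs, then replace this process with uvicorn."""
    bootstrap(certs_dir)  # must complete before the exec
    argv = build_uvicorn_argv(certs_dir, bind_host)
    # With the real os.execvp nothing after this line runs: uvicorn takes over
    # the PID, so container signals reach it directly.
    execvp(argv[0], argv)
```

Process replacement, rather than a child process, is what keeps a single process tree under Docker.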

All 7 cert_bootstrap tests + the new test_task_split case pass.
ruff + mypy + bandit clean on both new modules.

Refs: D-01, D-02, D-22; AUTH-02
RED phase of Plan 29-01 Task 2. Adds:

tests/test_services/test_agent_client_tls.py
    - Real-TLS smoke server fixture (uvicorn in background asyncio task,
      two independent CA bundles in tmp_path/{server,wrong}_certs).
    - test_wrong_ca_raises_connect_error: D-04 success criterion --
      httpx.AsyncClient(verify=wrong_ca) against a server presenting
      the correct cert raises httpx.ConnectError.
    - test_correct_ca_succeeds: same setup with verify=correct_ca
      returns 200 OK.
    - test_construct_agent_client_missing_ca_raises +
      test_construct_agent_client_empty_ca_raises: D-03 fail-fast --
      RuntimeError("CA file empty or unreadable: ...") when the CA path
      is non-existent or zero-byte. Currently RED -- AgentSettings does
      not yet expose agent_ca_file, construct_agent_client does not yet
      validate.

[Rule 1 - Bug] src/phaze/cert_bootstrap.py:
    - Add AuthorityKeyIdentifier + SubjectKeyIdentifier extensions on
      the leaf cert; SubjectKeyIdentifier on the CA cert. Python 3.13's
      ssl module rejects the validation chain with "Missing Authority
      Key Identifier" without these (discovered while running
      test_correct_ca_succeeds against the real cert chain).
    - Add ExtendedKeyUsage(SERVER_AUTH) on the leaf cert; required by
      Python 3.13's strict TLS validation path -- otherwise the leaf
      is rejected when presented to a TLS client expecting a server cert.
    All 7 cert_bootstrap unit tests still pass; the chain now validates
    end-to-end (test_correct_ca_succeeds passes).

Refs: D-03, D-04, D-22; AUTH-02
GREEN phase of Plan 29-01 Task 2. Implements:

src/phaze/config.py
    - BaseSettings.api_tls_sans (D-02): comma-separated SAN list for the
      auto-generated leaf cert. Default "localhost,127.0.0.1,api" covers
      single-host dev (loopback) + docker-compose service-name DNS.
      Env alias PHAZE_API_TLS_SANS.
    - AgentSettings.agent_ca_file (D-03): path to the operator-distributed
      CA cert. Default "/certs/phaze-ca.crt" matches the in-container
      bind-mount path on the agent side. Env alias PHAZE_AGENT_CA_FILE.

src/phaze/services/agent_client.py
    - PhazeAgentClient.__init__ accepts keyword-only `verify` parameter
      (type: ssl.SSLContext | str | bool, default True). Threaded through
      to httpx.AsyncClient(verify=...). Default True preserves backwards
      compat with all existing respx-based tests (Pitfall 10) -- respx
      mocks below the TLS layer.
    - `ssl` moved to TYPE_CHECKING block (annotation-only usage).

src/phaze/tasks/_shared/agent_bootstrap.py
    - construct_agent_client(cfg) now validates cfg.agent_ca_file at
      construction time: if the path does not exist OR is zero-byte,
      raises RuntimeError("CA file empty or unreadable: ...") so
      misconfiguration surfaces fast (D-03 fail-fast).
    - Passes verify=cfg.agent_ca_file through to PhazeAgentClient.
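
The fail-fast CA check can be illustrated with a small stdlib sketch. `validated_ca_path` is a hypothetical helper name; the error message text follows the commit, and the returned string is what would be handed to the client as `verify=`.

```python
from pathlib import Path


def validated_ca_path(ca_file: str) -> str:
    """Return `ca_file` if it exists and is non-empty, else raise RuntimeError."""
    path = Path(ca_file)
    # is_file() is False for a missing path; st_size == 0 catches a
    # zero-byte placeholder left behind by a failed copy.
    if not path.is_file() or path.stat().st_size == 0:
        raise RuntimeError(f"CA file empty or unreadable: {ca_file}")
    return str(path)
```

Checking at construction time means a missing CA shows up as one clear startup error instead of a TLS failure on the first request.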

All 4 TLS integration tests pass. 26 existing respx-based
test_agent_client*.py cases still pass (Pitfall 10 confirmed:
default verify=True preserves the transport-level mock behavior).
Postgres-free import boundary still holds (test_task_split: 5/5).
ruff + mypy clean on all modified modules.

Refs: D-02, D-03, D-04; AUTH-02
Adds 29-01-SUMMARY.md documenting the 4 plan commits:
- ffdbf5f (test RED): 7 cert_bootstrap cases + task_split extension
- 5840bfe (feat GREEN): cert_bootstrap + entrypoint shim
- 57d9843 (test RED): TLS integration tests + Rule 1 bug fix
  (CA chain extensions: AKI, SKI, EKU(SERVER_AUTH))
- 25c4ca4 (feat GREEN): verify= plumbing through PhazeAgentClient
  + AgentSettings.agent_ca_file + BaseSettings.api_tls_sans

12 net new tests passing. AUTH-02 partially closed (full closure in
Plan 03 once docker-compose api command switches to phaze.entrypoint).

Refs: D-01..D-04, D-22; AUTH-02
…ssword validator

- New tests/test_config/__init__.py marker so pytest discovers the sub-package
- 4 RED cases covering Phase 29 D-06:
  1. agent_env=production + passwordless redis_url -> ValidationError with
     "requires a password in redis_url" substring
  2. agent_env=production + redis://default:<pw>@host:6379/0 constructs OK
  3. agent_env=dev + passwordless redis_url constructs OK (Pitfall 7)
  4. Default agent_env is "dev" when omitted
- Tests pass kwargs directly (cleaner than env-var monkeypatching for model contract)

RED state confirmed: first test fails with "DID NOT RAISE ValidationError"
because the agent_env field + model_validator do not exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion mode

Phase 29 D-06 / AUTH-03 (agent-side half):

- Add `Literal` to typing imports
- Add AgentSettings.agent_env: Literal["dev", "production"] field
  (default "dev"; env alias PHAZE_AGENT_ENV)
- Add AgentSettings._enforce_redis_password_in_production model_validator:
  when agent_env=="production", `urlparse(self.redis_url).password` must be
  set; otherwise raise ValueError("agent_env=production requires a password
  in redis_url (Phase 29 D-06)").
- Field placed adjacent to other PHAZE_AGENT_* fields for grouping.
- model_validator placed after _enforce_required_agent_fields so the
  redis-url check runs after the required-field check.
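
The production check can be sketched without pydantic. This is a stdlib stand-in, not the project's model: `AgentSettingsSketch` is a hypothetical dataclass, and `__post_init__` plays the role the commit assigns to the model_validator. The error message text is taken from the commit.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class AgentSettingsSketch:
    """Hypothetical stand-in for AgentSettings (stdlib only)."""

    redis_url: str
    agent_env: str = "dev"  # "dev" or "production"

    def __post_init__(self) -> None:
        if self.agent_env not in ("dev", "production"):
            raise ValueError(f"unknown agent_env: {self.agent_env!r}")
        # urlparse exposes the userinfo password; it is None when the URL
        # carries no credentials at all.
        if self.agent_env == "production" and not urlparse(self.redis_url).password:
            raise ValueError(
                "agent_env=production requires a password in redis_url (Phase 29 D-06)"
            )
```

Dev mode accepts a passwordless URL, so a fresh clone still starts without any Redis password setup.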

The matching server-side hardening (Redis `requirepass` + LAN-bound port)
lands in Plan 03 alongside the docker-compose rewrite; together they fully
close AUTH-03. Dev mode preserves Pitfall 7: fresh clones do `docker compose
up` with no Redis password ceremony.

Verification:
- 4/4 tests in tests/test_config/test_agent_settings_redis_password.py pass
- 22 existing tests in tests/test_config_role_split.py + test_config_worker.py
  pass (no regression)
- uv run mypy src/phaze/config.py: clean
- uv run ruff check + ruff format --check: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 29 Plan 02 SUMMARY. Documents:
- Files created (tests/test_config/__init__.py, tests/test_config/test_agent_settings_redis_password.py)
- File modified (src/phaze/config.py: Literal import + agent_env field + _enforce_redis_password_in_production model_validator)
- 2 commits (4b95029 RED, a7741ff GREEN; no REFACTOR needed)
- 4 new tests; 0 regressions in 22 existing config tests
- D-06 fully implemented; AUTH-03 partial (server-side half lands in Plan 03)
- TDD gate compliance verified; self-check passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

- tests/test_deployment/__init__.py (pytest sub-package marker)
- tests/test_deployment/test_api_filesystem_isolation.py with 4 tests:
  * test_api_service_has_no_file_mounts (DIST-01)
  * test_controller_worker_has_no_file_mounts (DIST-01)
  * test_no_watcher_or_agent_worker_in_root_compose (D-15 / D-17;
    also asserts audfprint + panako absent)
  * test_redis_hardened (D-05 / AUTH-03; requirepass + LAN bind +
    --no-auth-warning healthcheck)
- All 4 tests FAIL against the current docker-compose.yml — RED step
  of TDD. Task 2 lands the compose rewrite that turns them GREEN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… redis

Rewrite root docker-compose.yml as the application-server-only compose
(DIST-01, D-05, D-17, D-19; AUTH-03 server-side). End state: services
block is exactly {api, worker, postgres, redis}.

- api: swap `command:` to `uv run python -m phaze.entrypoint` (Plan 01
  cert-bootstrap shim) and replace the SCAN_PATH mount with a single
  bind volume `${CA_PATH:-./certs}:/certs:rw`.
- worker (controller): drop `MODELS_PATH=/models` from environment and
  remove all three file mounts (SCAN_PATH, MODELS_PATH, OUTPUT_PATH).
  Controller is now fileless.
- DELETE the watcher, agent-worker, audfprint, panako service blocks —
  they live in docker-compose.agent.yml on the file server (Plan 04+).
- DELETE the unused audfprint_data and panako_data named volumes.
- redis: list-form command with `--requirepass ${REDIS_PASSWORD:?...}`
  (fail-fast at compose-parse time); ports bound via
  `${REDIS_BIND_IP:-127.0.0.1}:6379:6379` (loopback default, prod sets
  LAN IP); healthcheck uses
  `redis-cli --no-auth-warning -a ${REDIS_PASSWORD} ping`.

.env.example: add the three Phase-29 variables with comment blocks:
REDIS_PASSWORD=changeme (dev placeholder; Pitfall-7 mitigation),
REDIS_BIND_IP=127.0.0.1, PHAZE_API_TLS_SANS=localhost,127.0.0.1,api.

Dockerfile audit: no MODELS_PATH/SCAN_PATH/OUTPUT_PATH ENV defaults
were present — verify step only, no changes needed.

All 4 tests in tests/test_deployment/test_api_filesystem_isolation.py
now pass (GREEN step of TDD).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- 4 YAML-parse structural tests (D-19) in tests/test_deployment/
- docker-compose.yml rewritten: services = {api, worker, postgres, redis}
- api volumes stripped to /certs:rw only; controller worker is fileless
- watcher, agent-worker, audfprint, panako removed (move to agent.yml)
- redis hardened: --requirepass, LAN-bound port, authenticated healthcheck
- .env.example documents REDIS_PASSWORD, REDIS_BIND_IP, PHAZE_API_TLS_SANS
- Dockerfile audited — no MODELS_PATH/SCAN_PATH/OUTPUT_PATH ENV defaults

Closes DIST-01 (app server has no file mounts) and the server-side half
of AUTH-03 (Redis requirepass + LAN binding). Decision IDs D-05, D-17,
D-19 fully implemented; D-15 partial pending Plan 04's agent.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the essentia model URL list (33 classifier paths + 1 genre model)
from scripts/download-models.sh into a Python helper so both bash and the
agent bootstrap can drive the download from a single source of truth.

- src/phaze/scripts/__init__.py: new package marker
- src/phaze/scripts/download_models.py: download_to(target_dir) public
  entry; _download_one uses .part atomic-rename pattern (T-29-05-03);
  CLI entry via `python -m phaze.scripts.download_models <dir>`
- scripts/download-models.sh: rewritten as a 6-line bash shim that execs
  the Python module (signals + exit code pass through cleanly)
- tests/test_services/test_model_bootstrap.py: scaffold with three
  ensure_models_present cases (RED until Task 2 lands) plus three
  download_to/_download_one cases (GREEN now)
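
The .part atomic-rename pattern mentioned for _download_one looks roughly like this. It is a minimal sketch, assuming the payload arrives as an iterable of byte chunks; `download_one` and its signature are hypothetical, and the real helper fetches over the network.

```python
import os
from pathlib import Path
from typing import Iterable


def download_one(chunks: Iterable[bytes], dest: Path) -> Path:
    """Stream `chunks` into `<dest>.part`, then rename onto `dest`."""
    part = dest.with_name(dest.name + ".part")
    with part.open("wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
    # os.replace is atomic when source and target share a filesystem, so
    # readers see either no file or the complete file, never a prefix.
    os.replace(part, dest)
    return dest
```

If the stream fails midway, the final name is never created and only the .part sibling remains. This sketch does not clean that sibling up, which mirrors the behaviour the PR description records as WR-03.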

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 29 D-21 completes OPS-03's auto-download path. A new Postgres-free
shared module owns the .pb-glob + download orchestration; agent_worker
calls it AFTER /whoami succeeds so a bad token / unreachable app server
fails fast in ~60s instead of after a 5-minute 150MB download.

- src/phaze/tasks/_shared/model_bootstrap.py: ensure_models_present
  module (Postgres-free; stdlib + phaze.scripts.download_models only)
- src/phaze/tasks/agent_worker.py: drop the in-place RuntimeError checks;
  call ensure_models_present(Path(cfg.models_path)) as Step 3a after
  whoami_with_retry
- src/phaze/agent_watcher/__main__.py: add WARNING-7 documentation
  comment explaining why the watcher intentionally does NOT auto-download
- tests/test_task_split.py: add test_model_bootstrap_stays_postgres_free
  subprocess case (BLOCKER-1 resolution; parallel to the existing
  agent_bootstrap case)
- tests/test_phase04_gaps.py: replace the two old fail-fast model-dir
  RuntimeError tests with ordering + propagation tests that match the
  new auto-download semantics
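
One way to picture the startup check is the sketch below. Several details are assumptions: the injected `download` callable (the real ensure_models_present takes only the models path), the recursive glob, and the order of the two checks.

```python
from pathlib import Path
from typing import Callable


def ensure_models_present(models_dir: Path, download: Callable[[Path], None]) -> bool:
    """Return True if a download was triggered, False if models already exist."""
    models_dir.mkdir(parents=True, exist_ok=True)
    # A leftover .part file means an earlier download was interrupted; refuse
    # to start rather than trust a possibly incomplete model set.
    partial = sorted(models_dir.rglob("*.part"))
    if partial:
        raise RuntimeError(f"partial download detected: {partial[0]}")
    if any(models_dir.rglob("*.pb")):
        return False  # at least one model file is present
    download(models_dir)
    return True
```

Calling this only after the /whoami handshake succeeds keeps credential and connectivity errors cheap to hit, as the commit message describes.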

Deferred (out of scope, pre-existing from Plan 29-03 compose hardening):
test_docker_compose_has_agent_worker_consuming_agent_queue -- Plan 29-04
moves the agent-worker block into docker-compose.agent.yml, and the
test must be updated to scan both compose files there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPS-03 + D-21 fully implemented. SUMMARY records the 33+1 URL migration
from bash to Python, the WARNING-7 watcher-no-download choice, and the
BLOCKER-1 subprocess import-boundary test (test_model_bootstrap_stays_postgres_free).

deferred-items.md notes the pre-existing
test_docker_compose_has_agent_worker_consuming_agent_queue failure from
Plan 29-03 (compose hardening removed the agent-worker block); the test
will be updated by Plan 29-04 (parallel wave) which lands docker-compose.agent.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tests/test_services/test_agent_liveness.py: 5-state classify matrix
  (12 boundary cases) + sort_key ordering invariants per UI-SPEC.
- tests/test_utils/test_humanize.py: relative_time output table (UI-SPEC
  LOCKED) covering all bucket boundaries, the 89.7s → "89s ago" truncation
  case, and format invariants (no plural-s suffix, single-letter unit).

Both modules raise ImportError today; GREEN commit follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…REEN)

Wave 0 of plan 29-07 (OPS-04 UI half — pure-function tier):

- src/phaze/constants.py: add AGENT_LIVENESS_ALIVE_SECONDS=90 +
  AGENT_LIVENESS_STALE_SECONDS=300 (Phase 29 D-12 LOCKED thresholds).
- src/phaze/services/agent_liveness.py: pure-function classify(agent, now)
  → AgentStatus literal in {alive, stale, dead, revoked, never} with
  precedence revoked → never → alive/stale/dead. sort_key returns
  (revoked_int, status_rank, neg_last_seen) so revoked agents land last,
  non-revoked sort alive→stale→dead→never, ties break by last_seen DESC.
- src/phaze/utils/__init__.py + src/phaze/utils/humanize.py: relative_time
  helper producing "never" / "just now" / "Ns ago" / "Nm ago" / "Nh ago" /
  "Nd ago" with int-truncate semantics (UI-SPEC LOCKED bucket table).

Reconciled UI-SPEC documentation defect (line 248 prose example "89.7s →
89s ago" is inconsistent with its own bucket table lines 232-241; the
table is authoritative — see test docstring for Rule-1 fix rationale).

All 51 tests pass; mypy + ruff clean.
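
The classifier and sort key from this commit can be sketched as follows. The thresholds and precedence come from the commit message; the `Agent` dataclass, its field names, the `now` argument on `sort_key`, and the inclusive boundaries are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

AGENT_LIVENESS_ALIVE_SECONDS = 90
AGENT_LIVENESS_STALE_SECONDS = 300
_STATUS_RANK = {"alive": 0, "stale": 1, "dead": 2, "never": 3, "revoked": 4}


@dataclass
class Agent:
    """Hypothetical stand-in for the agent row."""

    name: str
    last_seen_at: Optional[datetime] = None
    revoked_at: Optional[datetime] = None


def classify(agent: Agent, now: datetime) -> str:
    """Precedence: revoked, then never, then age-based alive/stale/dead."""
    if agent.revoked_at is not None:
        return "revoked"
    if agent.last_seen_at is None:
        return "never"
    age = (now - agent.last_seen_at).total_seconds()
    if age <= AGENT_LIVENESS_ALIVE_SECONDS:  # inclusive boundary is an assumption
        return "alive"
    if age <= AGENT_LIVENESS_STALE_SECONDS:
        return "stale"
    return "dead"


def sort_key(agent: Agent, now: datetime) -> tuple[int, int, float]:
    """(revoked_int, status_rank, neg_last_seen): revoked last, newest first."""
    status = classify(agent, now)
    revoked_int = 1 if status == "revoked" else 0
    neg_last_seen = -agent.last_seen_at.timestamp() if agent.last_seen_at else 0.0
    return (revoked_int, _STATUS_RANK[status], neg_last_seen)
```

Sorting with `sorted(agents, key=lambda a: sort_key(a, now))` then gives the alive, stale, dead, never, revoked order the admin page expects.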

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 1 RED gate for plan 29-07. 10 tests covering:
- Full page render with base.html chrome
- HX-Request: true returns partial only
- Dedicated /_table partial route (always partial, never halts polling)
- 5-state status pill rendering with LOCKED Tailwind classes
- Empty state UI-SPEC §Empty State LOCKED copy
- Sort order alive → stale → dead → never → revoked
- 3 BLOCKER-2 tests: htmx event listener + role=alert failure footer +
  localStorage `phaze:agents:lastError` plumbing
- Production-wiring smoke (router registered in main.create_app)

Currently fails at import — phaze.routers.admin_agents does not exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 1 of plan 29-07 (OPS-04 UI half — Wave-1 deliverables):

- src/phaze/routers/admin_agents.py: APIRouter(prefix="/admin/agents"); two
  handlers — `page` (HX-Request-aware, full page OR partial) and
  `table_partial` (always partial, the canonical 5s polling target).
  `_load_agents` injects transient `agent._status` via classify(a, now) and
  sorts via sort_key (Phase 27 transient-attr pattern). No
  get_authenticated_agent dep (operator-facing on private LAN, consistent
  with pipeline.py / pipeline_scans.py precedent).

- src/phaze/templates/admin/agents.html: page shell extending base.html;
  current_page="admin_agents"; hosts the MANDATORY htmx:responseError +
  htmx:sendError + htmx:afterSwap listener writing/clearing
  `phaze:agents:lastError` localStorage (BLOCKER-2 UI-SPEC §Error /
  Failure-Tolerant Refresh LOCKED).

- src/phaze/templates/admin/partials/agents_table.html: HTMX
  self-replacing <section> (hx-get/hx-trigger/hx-swap=outerHTML, never
  halts — UI-SPEC §Polling LOCKED). Empty state + 6-column table + happy-
  path "Last refreshed Ns ago" Alpine footer + MANDATORY red role=alert
  "Refresh failed at HH:MM:SS" footer driven by localStorage (BLOCKER-2).

- src/phaze/templates/admin/partials/_status_pill.html: 5-state liveness
  pill with LOCKED Tailwind palette (alive=green-100/950,
  stale=amber-100/950, dead=red-100/950, revoked/never=gray-100/800) +
  redundant aria-label="Status: <state>" for screen readers.

- src/phaze/templates/base.html: new "Agents" nav link inserted between
  Audit Log and the theme toggle. Uses short-slug
  `current_page == 'admin_agents'` per WARNING-1 (matches live convention
  where Audit Log uses 'audit' not 'audit_log'). aria-current="page" is
  a forward-looking a11y upgrade applied only to this new link.

- src/phaze/main.py: register admin_agents.router alongside Phase 27/28
  routers.

All BLOCKER-2 grep gates pass:
  agents.html: htmx:responseError × 2, htmx:sendError × 2, htmx:afterSwap × 2,
              phaze:agents:lastError × 2, localStorage.setItem × 1,
              localStorage.removeItem × 1.
  agents_table.html: localStorage.getItem × 2, phaze:agents:lastError × 4,
                     "Refresh failed" × 2, role="alert" × 1.

Test status:
- test_router_registered_in_main_app passes (non-DB, structural).
- 9 DB-backed tests collect cleanly; cannot execute locally (no Postgres
  on the executor host). Logic verified via direct Jinja render smoke
  (positions, pill classes, BLOCKER-2 markup, relative_time output).
- mypy + ruff clean on all new files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SimplicityGuy and others added 28 commits May 16, 2026 16:15
…l (RED)

Four LOCKED structural assertions for the file-server compose:
  1. test_agent_compose_service_list — exactly {worker, watcher, audfprint, panako}
  2. test_agent_compose_has_no_postgres_env — DIST-04 invariant
  3. test_worker_service_has_phaze_role_agent — D-17
  4. test_all_scan_path_mounts_use_failfast_syntax — WARNING-3

All four currently fail because docker-compose.agent.yml does not yet
exist. The GREEN commit will create the compose file + env template.

Per WARNING-3: the fail-fast regex test rejects future drift to
${SCAN_PATH:-/data/music} loose-default form that would let
docker compose up succeed on a misconfigured file-server host.
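
The fail-fast check can be approximated with a regex over a single volume string. This is a sketch under assumptions, not the test's actual pattern: `uses_failfast_scan_path` is a hypothetical name, and the real test parses the compose YAML first.

```python
import re

# Matches any ${SCAN_PATH...} interpolation as well as a bare $SCAN_PATH.
_SCAN_PATH_REF = re.compile(r"\$\{SCAN_PATH[^}]*\}|\$SCAN_PATH\b")


def uses_failfast_scan_path(mount: str) -> bool:
    """True only if every SCAN_PATH reference uses the ${SCAN_PATH:?...} form."""
    refs = _SCAN_PATH_REF.findall(mount)
    # A mount with no SCAN_PATH reference at all is also rejected, since a
    # hard-coded host path would bypass the guard entirely.
    return bool(refs) and all(ref.startswith("${SCAN_PATH:?") for ref in refs)
```

The loose-default form `${SCAN_PATH:-/data/music}` fails the check, which is the drift the WARNING-3 test is meant to catch.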

Refs phase 29-04 plan, D-15..D-17, D-22 test surface.
- 4 happy-path tests in test_heartbeat_cron.py:
  * success: heartbeat_tick POSTs HeartbeatRequest with correct payload
  * ctx-missing: missing api_client/agent_identity -> WARNING + return
  * queue.info-fail: queue_depth defaults to 0; heartbeat still POSTs
  * importlib metadata: agent_version sourced from importlib.metadata
- 1 failure test in test_heartbeat_failure.py:
  * AgentApiServerError -> WARNING logged; no exception escapes (D-09)
- Tests use ctx["worker"].queue (NOT ctx["queue"]) per RESEARCH Pitfall 8
- AgentApiServerError constructed positional-only (no status_code= kwarg)
- RED step: tests FAIL with ModuleNotFoundError until heartbeat.py lands
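
The behaviours these tests pin down can be pictured with a duck-typed sketch. Everything beyond the ctx keys is an assumption: the `client.heartbeat(payload)` method, the payload field names, the "queued" key read from `queue.info()`, the boolean return value, and the broad exception handling (the commit names AgentApiServerError specifically).

```python
import logging
from importlib import metadata
from typing import Any

logger = logging.getLogger("heartbeat_sketch")


def agent_version(dist_name: str = "phaze") -> str:
    """Read the installed version, falling back when the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "unknown"


async def heartbeat_tick(ctx: dict[str, Any]) -> bool:
    """Send one heartbeat; never raise, so the cron keeps ticking."""
    client = ctx.get("api_client")
    identity = ctx.get("agent_identity")
    worker = ctx.get("worker")
    if client is None or identity is None or worker is None:
        logger.warning("heartbeat skipped: worker context not initialised")
        return False
    try:
        info = await worker.queue.info()  # queue lives on ctx["worker"]
        queue_depth = int(info.get("queued", 0))
    except Exception:
        queue_depth = 0  # depth is best-effort; still send the heartbeat
    payload = {
        "agent_id": identity,
        "queue_depth": queue_depth,
        "agent_version": agent_version(),
    }
    try:
        await client.heartbeat(payload)  # hypothetical client method
    except Exception as exc:
        logger.warning("heartbeat failed: %s", exc)
        return False
    return True
```

Swallowing failures is deliberate in this sketch: a missed heartbeat only makes the agent look stale on the admin page, whereas an escaping exception could disturb the worker's cron loop.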

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EEN)

The file-server-host compose file (D-15) declares exactly 4 services:
worker, watcher, audfprint, panako. Worker + watcher pull from
ghcr.io/simplicityguy/phaze (D-16 with `${PHAZE_IMAGE_TAG:-latest}`);
sidecars retain `build:` (not yet on GHCR per D-15).

All 4 services use `${SCAN_PATH:?SCAN_PATH required}` fail-fast
interpolation (WARNING-2 unified explicit-message form; WARNING-3
test enforces). MODELS_PATH bind mount is rw on worker + watcher for
D-21 auto-download; CA_PATH bind mount is ro everywhere.

Adds `.env.example.agent` with every variable a file-server host
needs (D-23 portion): PHAZE_IMAGE_TAG, PHAZE_AGENT_API_URL,
PHAZE_REDIS_URL, PHAZE_AGENT_{ID,TOKEN,QUEUE}, PHAZE_AGENT_CA_FILE,
PHAZE_AGENT_ENV, SCAN_PATH, MODELS_PATH, CA_PATH,
PHAZE_AGENT_SCAN_ROOTS. Production-pin guidance lives in inline
comments.

Also resolves the Plan 29-05 deferred test failure: updates
`tests/test_phase04_gaps.py::test_docker_compose_has_agent_worker_consuming_agent_queue`
to scan BOTH `docker-compose.yml` and `docker-compose.agent.yml`. The
agent-worker now lives in `docker-compose.agent.yml::worker`, so the
Phase 27 UAT gap-13 invariant is again codified across the split
compose surface. Marks the deferred-items.md entry resolved.

All 4 RED tests now pass.

Refs phase 29-04 plan, D-15, D-16, D-17, D-22 (agent-compose portion),
D-23 (.env.example.agent portion).
Adds test_docker_publish_workflow_tags_both_latest_and_version — a 5th
test in test_agent_compose.py. Replaces the original checkpoint:human-verify
task with an automated YAML-parse check that .github/workflows/docker-publish.yml
emits BOTH a `:latest` tag and a `:v<version>` tag (D-16).

Currently fails because docker-publish.yml's docker/metadata-action step
only declares `type=raw,value=latest`, `type=ref,event=branch`, `type=ref,event=pr`,
and `type=schedule,pattern=...` — no `type=semver,pattern={{version}}` and
no `type=ref,event=tag`. The GREEN commit will extend the workflow.

The plan stays autonomous (no human checkpoint); the regression-detection
is now in CI permanently.

Refs phase 29-04 plan WARNING-4 resolution.
… (GREEN)

Two coupled changes to satisfy the WARNING-4 automated test and align
the published api image URL with docker-compose.agent.yml:

1. Tag strategy (D-16 + WARNING-4): the docker/metadata-action step now
   emits `type=raw,value=latest`, `type=semver,pattern={{version}}`,
   `type=semver,pattern={{major}}.{{minor}}`, `type=ref,event=tag`,
   `type=ref,event=branch`, `type=ref,event=pr`, and the schedule tag.
   On a tagged release `v4.0.0`, this produces `:latest`, `:v4.0.0` (via
   ref,event=tag), `:4.0.0` and `:4.0` (via semver). Operators get the
   full set of stability rungs the .env.example.agent comments
   reference (PHAZE_IMAGE_TAG=v4.0.0 production pin).

2. Image URL realignment (D-15): the matrix entry for `api` now sets
   `image_suffix: ""`, pushing to the BARE-repo URL
   `ghcr.io/simplicityguy/phaze:<tag>` — the exact URL
   docker-compose.agent.yml's worker + watcher pull from. The sidecars
   keep their `/audfprint` and `/panako` sub-paths because agent.yml
   builds them locally (D-15) and does not pull them from GHCR.
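
As a rough illustration of the tag set in item 1, the sketch below maps a release tag to the image tags the message says are published. It only mirrors the outcome stated above; it does not call or reimplement docker/metadata-action, and `release_tags` / `image_refs` are hypothetical helper names.

```python
import re


def release_tags(git_tag: str) -> list[str]:
    """Image tags expected for a vMAJOR.MINOR.PATCH release (illustrative only)."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {git_tag!r}")
    major, minor, patch = match.groups()
    return [
        "latest",                    # type=raw,value=latest
        git_tag,                     # type=ref,event=tag
        f"{major}.{minor}.{patch}",  # type=semver,pattern={{version}}
        f"{major}.{minor}",          # type=semver,pattern={{major}}.{{minor}}
    ]


def image_refs(git_tag: str, repo: str = "ghcr.io/simplicityguy/phaze") -> list[str]:
    """Full references on the bare-repo URL from item 2."""
    return [f"{repo}:{tag}" for tag in release_tags(git_tag)]


refs = image_refs("v4.0.0")
```

Here `refs[1]` is the `:v4.0.0` reference that a production pin of PHAZE_IMAGE_TAG=v4.0.0 would resolve to.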

This makes `test_docker_publish_workflow_tags_both_latest_and_version`
pass, closing WARNING-4 with an autonomous test instead of a
checkpoint:human-verify task.

Verification result: `fixed` + `url-realigned` (both the tag pattern
and the image URL needed adjustment).

Refs phase 29-04 plan WARNING-4 resolution.
…n plan

OPS-02 fully closed. Lands the file-server-host compose surface
(docker-compose.agent.yml + .env.example.agent) with exactly 4
services and replaces the original GHCR-tag human-verify checkpoint
with an automated YAML-parse test (WARNING-4 resolution). Workflow
extended to emit :v<version> tags and api image realigned to the
bare-repo URL ghcr.io/simplicityguy/phaze.

Also resolves the Plan 29-05 deferred test (gap-13 invariant now
codified across the split compose surface).

Plan stays autonomous: true. 5 new tests, all green, no regressions
across deployment + adjacent suites (22 tests).
Land the agent-side half of OPS-04 (D-07..D-10):

- New src/phaze/tasks/heartbeat.py with heartbeat_tick(ctx) async cron
  handler. Reads ctx["api_client"], ctx["agent_identity"],
  ctx["worker"].queue (NOT ctx["queue"] per RESEARCH Pitfall 8); builds
  HeartbeatRequest(agent_version=importlib.metadata.version("phaze"),
  worker_pid=os.getpid(), queue_depth=Queue.info()["queued"]) and POSTs
  it via PhazeAgentClient.heartbeat.
- Defensive: ctx not initialized -> WARNING + return; queue.info()
  failure -> default queue_depth=0 + still POST; AgentApiError -> WARNING
  + swallow (D-09 fire-and-forget; SAQ retries on next tick).
- agent_worker.py: import CronJob + heartbeat_tick; add heartbeat_tick
  to settings.functions; add cron_jobs=[CronJob(heartbeat_tick,
  cron="* * * * * */30", unique=True, timeout=10)]. Trailing-seconds
  6-field form per RESEARCH Critical Discovery #2 (the CONTEXT.md D-08
  leading-seconds example would fire every second; verified empirically
  with croniter -- gaps are 30s vs 1s).
- agent_worker.py stays a single .py file (Pitfall 9 avoided).
- heartbeat.py is Postgres-free (banner documents the invariant).
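
The defensive branches listed above can be sketched with stdlib stand-ins. This is not the contents of src/phaze/tasks/heartbeat.py: `AgentApiError` and `HeartbeatRequest` are redefined locally, the version string is a placeholder, and `BrokenQueue`, `FakeWorker`, and `RecordingClient` are hypothetical test doubles.

```python
import asyncio
import logging
import os
from dataclasses import dataclass

log = logging.getLogger("heartbeat_sketch")


class AgentApiError(Exception):
    """Stand-in for the project's API error type."""


@dataclass
class HeartbeatRequest:
    """Stand-in carrying the three fields named in the commit message."""
    agent_version: str
    worker_pid: int
    queue_depth: int


async def heartbeat_tick(ctx: dict) -> None:
    client = ctx.get("api_client")
    worker = ctx.get("worker")
    if client is None or worker is None:
        log.warning("heartbeat skipped: ctx not initialized")
        return
    try:
        depth = (await worker.queue.info())["queued"]
    except Exception:
        depth = 0  # queue stats unavailable: still report liveness
    request = HeartbeatRequest(
        agent_version="0.0.0-sketch",  # placeholder; the real handler reads package metadata
        worker_pid=os.getpid(),
        queue_depth=depth,
    )
    try:
        await client.heartbeat(request)
    except AgentApiError as exc:
        # Fire-and-forget: log and let the next cron tick try again.
        log.warning("heartbeat POST failed: %s", exc)


class BrokenQueue:
    async def info(self) -> dict:
        raise RuntimeError("redis unavailable")


class FakeWorker:
    queue = BrokenQueue()


class RecordingClient:
    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.sent: list[HeartbeatRequest] = []

    async def heartbeat(self, request: HeartbeatRequest) -> None:
        if self.fail:
            raise AgentApiError("503 from application server")
        self.sent.append(request)


client = RecordingClient()
asyncio.run(heartbeat_tick({"api_client": client, "worker": FakeWorker()}))
```

With the queue lookup failing, the client still receives one request with queue_depth=0, which is the behaviour the second bullet describes.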

Tests: all 5 heartbeat tests pass (GREEN); test_task_split still passes;
tests/test_tasks/ suite passes (excluding the pre-existing Plan 29-03/04
deferred test_docker_compose_has_agent_worker_consuming_agent_queue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures Task 1 (RED tests) + Task 2 (GREEN implementation) outcomes,
threat-model mitigations, and the one notable RESEARCH Critical
Discovery #2 fact: the cron string is the trailing-seconds 6-field form
`* * * * * */30` (NOT the leading-seconds form from CONTEXT.md D-08).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ect.sh (Task 1)

- justfile: add up-agent + up-all recipes under [group('dev')]; existing `up` unchanged
- docs/deployment.md (new, 230 lines): 6-step two-host operator walkthrough,
  D-20 filesystem-isolation smoke, CA rotation guidance, production checklist;
  required strings present: phaze-ca.crt (9), just up-agent (5), REDIS_PASSWORD (3),
  /admin/agents (2), PHAZE_AGENT_TOKEN (2)
- .planning/PROJECT.md: new "### Deployment (v4.0 — Distributed Agents)" subsection
  under Constraints; documents two-compose-file invariant, HTTPS internal CA,
  Redis password-bound LAN, zero-new-pip-deps beyond cryptography
- scripts/update-project.sh: audited (pure dependency/version orchestrator with no
  Python module enumeration); left untouched per plan rule

Closes D-18 (justfile recipes), D-20 (filesystem-isolation smoke documented),
D-23 (operator workflow + doc sweep).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reviewed docs/deployment.md end-to-end against the live codebase
and confirmed all commands, env vars, routes (/admin/agents,
/api/internal/agent/heartbeat), and paths (/data/music, /certs/phaze-ca.crt)
match the compose mounts and router prefixes. Cert-bootstrap banner text
matches src/phaze/cert_bootstrap.py verbatim.

Resume signal: verified-docs-only (Option C from the checkpoint).

Follow-up: real-deployment smoke deferred until file-server hardware is
available — tracked as a v4.0 outstanding UAT item in the SUMMARY's
"Outstanding Items" section. Structural CI tests under tests/test_deployment/
cover compose-file invariants in the meantime.

Plan 29-08 complete — Phase 29 (deployment-hardening-agents-admin) closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… env-var

Two critical issues from gsd-code-reviewer's REVIEW.md, both in
src/phaze/config.py:

CR-01: .env.example.agent:22 references AgentSettings._enforce_https_in_production
as a guard that refuses http:// agent_api_url in production. The guard did
not exist — operators following the docs could ship plaintext bearer tokens.
Adds the validator (and three test cases covering https-ok / http-blocked /
dev-permits-http).
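
A minimal stdlib sketch of the CR-01 rule, covering the same three cases as the tests. The real guard is a validator on AgentSettings; the dataclass below, its name, and the "development" default are assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AgentSettingsSketch:
    """Hypothetical stand-in for AgentSettings with only the two relevant fields."""
    agent_api_url: str
    agent_env: str = "development"  # assumed default, not confirmed by the source

    def __post_init__(self) -> None:
        scheme = urlsplit(self.agent_api_url).scheme
        if self.agent_env == "production" and scheme != "https":
            raise ValueError(
                "agent_api_url must use https:// when agent_env is 'production'"
            )


def accepts(url: str, env: str) -> bool:
    """True when the settings object can be constructed for this url/env pair."""
    try:
        AgentSettingsSketch(agent_api_url=url, agent_env=env)
    except ValueError:
        return False
    return True
```

Failing at construction time keeps the check fail-fast, so a plaintext URL stops the agent before any bearer token is sent.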

CR-02: BaseSettings.redis_url and database_url had no validation_alias, so
pydantic-settings only accepted bare REDIS_URL / DATABASE_URL env-var names.
Operators using PHAZE_REDIS_URL (as documented in .env.example.agent) hit
the default passwordless URL and tripped _enforce_redis_password_in_production
with a misleading "requires a password" error — preventing the production
agent from starting. Adds AliasChoices on both fields and an env-var
integration test that exercises the binding via AgentSettings() with no
kwargs (the 29-02 tests pass kwargs directly and never hit the env path).
ensure_models_present previously short-circuited whenever *any* .pb file
existed in the models directory. After an interrupted first download
(e.g., 1/34 files written before SIGTERM), every subsequent agent start
skipped the re-download, so the agent broke silently at analysis time
when essentia tried to load the missing weights (33 of 34 in that
example).

Compares the observed .pb count against
len(CLASSIFIER_MODELS) + len(GENRE_MODELS). Partial state logs WARNING
with the observed/expected counts and re-invokes download_to, which is
idempotent at the per-file level (_download_one skips existing dests).
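
The completeness rule can be sketched as follows. The model tuples hold placeholder names, `download_to` is a stand-in that writes dummy bytes, and treating "observed >= expected" as complete is an assumption; only the count comparison and the WARNING on partial state come from the description above.

```python
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("model_bootstrap_sketch")

# Placeholder names; the real lists total 34 files.
CLASSIFIER_MODELS = ("mood_a.pb", "mood_b.pb")
GENRE_MODELS = ("genre_a.pb",)


def download_to(models_dir: Path) -> int:
    """Stand-in downloader: skips existing files, returns how many it wrote."""
    written = 0
    for name in (*CLASSIFIER_MODELS, *GENRE_MODELS):
        dest = models_dir / name
        if dest.exists():
            continue  # per-file idempotence
        dest.write_bytes(b"weights")
        written += 1
    return written


def ensure_models_present(models_dir: Path) -> int:
    """Download unless the full expected set of .pb files is already there."""
    expected = len(CLASSIFIER_MODELS) + len(GENRE_MODELS)
    observed = len(list(models_dir.glob("*.pb")))
    if observed >= expected:
        return 0
    if observed:
        log.warning("partial model set: %d/%d .pb files, re-downloading", observed, expected)
    return download_to(models_dir)


with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp)
    (models_dir / CLASSIFIER_MODELS[0]).write_bytes(b"weights")  # interrupted download
    fetched_on_partial = ensure_models_present(models_dir)
    fetched_on_complete = ensure_models_present(models_dir)
```

The first call fills in the two missing files; the second is a no-op because the count now matches.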

Two test updates:
- tests/test_services/test_model_bootstrap.py: populated-no-op test now
  writes all 34 expected files; new partial-triggers-redownload test
  pins the WARNING path.
- tests/test_tasks/test_agent_startup_banner.py: two pre-existing tests
  patched pathlib.Path.glob to return a single fake .pb and relied on
  the old loose check. Patch ensure_models_present directly so the
  banner / queue-mismatch logic under test is not coupled to the
  completeness rule.

Caught by the gsd-code-reviewer agent (Phase 29 REVIEW.md CR-03).
3-source requirements cross-reference: all 26 requirements satisfied
in code (verification + summary + wiring). 22/22 cross-phase exports
wired, 12/12 internal API routes consumed, all 5 E2E flows traced.

Documentation drift surfaced (does not affect runtime):
- 13 REQUIREMENTS.md traceability entries stale at Pending despite
  verified-passed phases (DIST-04/05, DATA-01..04, AUTH-01/04,
  TASK-04, EXEC-01..04)
- ROADMAP.md Phase 24 checkbox still [ ]
- Phase 24 VERIFICATION.md filename unprefixed

Tech debt carried into post-milestone backlog: P28-WR-03, P28-RACE-01,
P29-WR-01..04, P29-IN-01..03 (all advisory; none block archive).
1. REQUIREMENTS.md: 13 stale `[ ]` checkboxes → `[x]` and 13 `| Pending |`
   traceability rows → `| Complete |` (DIST-04, DIST-05, DATA-01..04,
   AUTH-01, AUTH-04, TASK-04, EXEC-01..04). Traceability table footer
   bumped to today with the v4.0 completion note. The integration
   checker confirmed all 26 requirements are wired in code; the rows
   were already satisfied — the doc just hadn't been touched since
   2026-05-11.

2. ROADMAP.md: flip Phase 24 checkbox `[ ]` → `[x] (completed 2026-05-11)`.
   Phase 24 VERIFICATION.md `status: passed, score 4/4` since 2026-05-11.

3. Rename `phases/24-schema-foundation-agent-registry/VERIFICATION.md`
   → `24-VERIFICATION.md` to match the v4.0 convention used by every
   other phase (`{phase_num}-VERIFICATION.md`). The unprefixed name
   broke `gsd-sdk query find-phase` discovery.
Snapshot v4.0 Distributed Agents milestone:
- .planning/milestones/v4.0-ROADMAP.md (full phase details)
- .planning/milestones/v4.0-REQUIREMENTS.md (26/26 satisfied)
- .planning/milestones/v4.0-MILESTONE-AUDIT.md (moved from .planning/)
- MILESTONES.md prepended with v4.0 entry + delivered summary
- PROJECT.md full evolution review (Current State, Validated, Key Decisions outcomes)
- STATE.md status -> milestone_complete; v4.0 velocity recorded
- ROADMAP.md collapses v4.0 Phases 24-29 into <details>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Archived to .planning/milestones/v4.0-REQUIREMENTS.md (all 26 reqs
satisfied). Next milestone will define fresh requirements via
/gsd:new-milestone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append v4.0 Distributed Agents section: what was built, what worked,
what was inefficient, patterns established (settings split factory,
subprocess import-boundary tests, 403-before-state-machine guard,
pre-uvicorn entrypoint shim, etc.), key lessons (8), cost
observations.

Update Cross-Milestone Trends: process evolution row for v3.0 + v4.0,
cumulative quality table, top lessons (7) verified across milestones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. **`admin_agents` router 422s on all 9 tests + production** —
   `AsyncSession` was imported under `if TYPE_CHECKING:` with
   `from __future__ import annotations`, so FastAPI's `get_type_hints`
   could not resolve the runtime annotation in
   `Annotated[AsyncSession, Depends(get_session)]` and treated
   `session` as a query parameter (`422 {"detail":[{"type":"missing","loc":["query","session"]}]}`).
   Move `from sqlalchemy.ext.asyncio import AsyncSession` to a
   runtime import (matches `agent_files.py` pattern, noqa TC002 since
   FastAPI requires runtime resolution). Fixes all 9
   `test_admin_agents.py` failures and the production route.

2. **`test_settings_redis_url_default` env leak** — Phase 29-02 added
   `PHAZE_REDIS_URL` as a pydantic `AliasChoices` alias for the
   `redis_url` field, and CI sets `PHAZE_REDIS_URL=redis://localhost:6379/0`
   for the test Redis service. The default-value test only deleted
   `REDIS_URL`, not the new alias. Delete all three spellings
   (`PHAZE_REDIS_URL`, `REDIS_URL`, `redis_url`) before asserting on
   the default.

3. **`validate-docker-compose` job parse-fail** — Phase 29-03 added
   `${REDIS_PASSWORD:?REDIS_PASSWORD required}` to the application-server
   `docker-compose.yml`, and Phase 29-04 added `${SCAN_PATH:?...}` on
   four services in the new `docker-compose.agent.yml`. The CI job
   only did `touch .env`, so compose-parse fail-fast tripped before
   anything could be validated. Supply placeholders for both compose
   files and validate the new agent compose alongside the existing
   app-server compose.
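
The failure mode in item 1 has a stdlib analogue: a string annotation can only be resolved if the name exists in the namespace used for evaluation. The sketch below uses `typing.get_type_hints` with explicit namespaces to stand in for "import under TYPE_CHECKING" versus "runtime import". FastAPI's own resolution differs in detail (the source reports a 422 rather than an exception), and `handler` is a hypothetical function.

```python
from decimal import Decimal
from typing import get_type_hints


def handler(amount: "Decimal") -> None:
    """String annotation, the form `from __future__ import annotations` produces."""


# Namespace of a module whose import sits under `if TYPE_CHECKING:`
# (the name is absent at runtime).
type_checking_only_ns: dict = {}

# Namespace of a module with a real runtime import (the shape of the fix).
runtime_import_ns = {"Decimal": Decimal}

try:
    get_type_hints(handler, globalns=type_checking_only_ns)
    resolved_without_import = True
except NameError:
    resolved_without_import = False

hints = get_type_hints(handler, globalns=runtime_import_ns)
```

Only the second call yields a usable type for `amount`, which is why the import has to exist at runtime for anything that introspects annotations.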

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Codecov flagged 26 lines missing across 3 Phase 29 modules
(patch coverage 90.33%). This commit takes all three files to 100%.

- tests/test_entrypoint.py (new, 14 lines covered)
  Monkeypatches ensure_certs_present + os.execvp; verifies
  env-var defaults, env-var overrides, and the ensure→execvp
  sequencing invariant (RESEARCH Pattern 2: cert files must
  exist before uvicorn boots against --ssl-keyfile/--ssl-certfile).

- tests/test_scripts/test_download_models.py (new, 10 lines covered)
  respx-mocked tests for _download_one (idempotent skip on existing
  dest, atomic .part-then-rename on success, 4xx leaves dest absent
  so model_bootstrap's *.pb glob retries) and download_to (walks
  CLASSIFIER_MODELS + GENRE_MODELS, no-ops on a populated dir).

- tests/test_cert_bootstrap.py: added
  test_unparseable_existing_certs_trigger_regeneration to cover lines
  202-203 (the WARNING + regeneration branch when all 4 files exist but
  parse as garbage).

- Tagged the two `if __name__ == "__main__":` CLI invocation guards
  in entrypoint.py and download_models.py with `# pragma: no cover`
  so coverage reflects what is reachable from `python -m`.
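
The three _download_one behaviours the new tests pin (skip on existing dest, .part-then-rename, dest absent after a failure) can be sketched without HTTP. `download_one` and its `fetch` callable are hypothetical, and raising on a failed fetch is an assumption; the real function may report a 4xx differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def download_one(fetch: Callable[[], bytes], dest: Path) -> bool:
    """Write dest atomically; return False if it already existed."""
    if dest.exists():
        return False  # idempotent skip
    part = dest.with_name(dest.name + ".part")
    try:
        part.write_bytes(fetch())
    except Exception:
        part.unlink(missing_ok=True)  # never leave a partial file behind
        raise
    os.replace(part, dest)  # atomic rename within one filesystem
    return True


def failing_fetch() -> bytes:
    raise RuntimeError("HTTP 404")


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "model.pb"
    try:
        download_one(failing_fetch, dest)
    except RuntimeError:
        pass
    absent_after_failure = not dest.exists()
    first = download_one(lambda: b"weights", dest)
    second = download_one(lambda: b"other", dest)
    content = dest.read_bytes()
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```

Because dest only appears after the rename, a `*.pb` glob never counts a half-written file as present.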

Coverage after this commit:
  cert_bootstrap.py:    96.77% -> 100.00%
  entrypoint.py:         0.00% -> 100.00%
  download_models.py:   67.74% -> 100.00%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>