Phase 29: Deployment Hardening & Agents Admin (closes v4.0)#63
Open
SimplicityGuy wants to merge 62 commits into
Open
Phase 29: Deployment Hardening & Agents Admin (closes v4.0)#63SimplicityGuy wants to merge 62 commits into
SimplicityGuy wants to merge 62 commits into
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Checker verdict: PASS on 5 dimensions; FLAG on Dimension 5 (spacing) for two pre-existing project invariants inherited from Phases 27/28 (py-0.5 pill padding, py-3 table cell padding). Both are documented in the spec's Spacing Exceptions section. Non-blocking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified against installed binaries: - SAQ 0.26.3 + croniter 6.2.2 — 6-field trailing-seconds cron - httpx 0.28.1 supports verify=<path> - cryptography NOT a transitive dep — must be added as new runtime dep - agent_worker is single .py file (not a package) — no refactor needed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 plans across 4 waves covering DIST-01, AUTH-02, AUTH-03, OPS-02, OPS-03, OPS-04. Plan-checker passed after one revision round (3 BLOCKERs + 8 WARNINGs addressed). Decision coverage 23/23. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RED phase of Plan 29-01 Task 1. Adds: - cryptography>=46.0.0,<49 runtime dep (NOT transitive, per RESEARCH Critical Discovery #1; verified via uv pip list) - tests/test_cert_bootstrap.py with 7 LOCKED cases (D-22): 1. 4-file generation + x509 parseability 2. idempotency (mtimes unchanged on second call) 3. banner-via-stdout (capsys) + Pitfall-4 no-secret-leak guard 4. file modes (0o644 certs, 0o600 keys) 5. SubjectAlternativeName entries match input 6. _parse_san_entries DNS vs IP dispatch 7. WARNING-8 banner-via-logger.warning (caplog) + Pitfall-4 no-secret-leak parity for logger channel - tests/test_task_split.py::test_cert_bootstrap_stays_postgres_free (Phase 29 D-22 extension of D-25) All 7 cert_bootstrap tests + the new task_split case currently fail with ModuleNotFoundError -- this is the expected RED state. Refs: D-01, D-02, D-22; AUTH-02
GREEN phase of Plan 29-01 Task 1. Implements:
src/phaze/cert_bootstrap.py
- ensure_certs_present(certs_dir, cn, sans_csv): idempotent CA + leaf
generation via cryptography.x509. ECDSA P-256, 10-year CA, 2-year
leaf. Writes phaze-ca.{crt,key} + phaze-server.{crt,key} with
0o644 / 0o600 modes.
- _parse_san_entries(): DNSName for hostnames, IPAddress for IPs
via ipaddress.ip_address dispatch.
- _generate_ca / _generate_leaf: x509.CertificateBuilder with
BasicConstraints + KeyUsage extensions per RESEARCH Pattern 1.
- Loud banner on actual generation via BOTH print() AND
logger.warning() (CONTEXT D-02 D-discretion "Both", WARNING-8).
Banner is a LITERAL CONSTANT referencing only phaze-ca.crt --
never the private key (Pitfall 4).
- IMPORT-BOUNDARY INVARIANT: no phaze.database / sqlalchemy.ext.asyncio
imports (extends Phase 26 D-25; verified by test_task_split).
src/phaze/entrypoint.py
- main(): reads PHAZE_CERTS_DIR / PHAZE_API_HOST / PHAZE_API_TLS_SANS,
calls ensure_certs_present, then os.execvp uvicorn with
--ssl-keyfile / --ssl-certfile flags pointing at the generated
leaf cert. Process replacement so signals + PID-1 propagate
cleanly (RESEARCH Pattern 2).
- Invoked from compose as `uv run python -m phaze.entrypoint`.
All 7 cert_bootstrap tests + the new test_task_split case pass.
ruff + mypy + bandit clean on both new modules.
Refs: D-01, D-02, D-22; AUTH-02
RED phase of Plan 29-01 Task 2. Adds:
tests/test_services/test_agent_client_tls.py
- Real-TLS smoke server fixture (uvicorn in background asyncio task,
two independent CA bundles in tmp_path/{server,wrong}_certs).
- test_wrong_ca_raises_connect_error: D-04 success criterion --
httpx.AsyncClient(verify=wrong_ca) against a server presenting
the correct cert raises httpx.ConnectError.
- test_correct_ca_succeeds: same setup with verify=correct_ca
returns 200 OK.
- test_construct_agent_client_missing_ca_raises +
test_construct_agent_client_empty_ca_raises: D-03 fail-fast --
RuntimeError("CA file empty or unreadable: ...") when the CA path
is non-existent or zero-byte. Currently RED -- AgentSettings does
not yet expose agent_ca_file, construct_agent_client does not yet
validate.
[Rule 1 - Bug] src/phaze/cert_bootstrap.py:
- Add AuthorityKeyIdentifier + SubjectKeyIdentifier extensions on
the leaf cert; SubjectKeyIdentifier on the CA cert. Python 3.13's
ssl module rejects the validation chain with "Missing Authority
Key Identifier" without these (discovered while running
test_correct_ca_succeeds against the real cert chain).
- Add ExtendedKeyUsage(SERVER_AUTH) on the leaf cert; required by
Python 3.13's strict TLS validation path -- otherwise the leaf
is rejected when presented to a TLS client expecting a server cert.
All 7 cert_bootstrap unit tests still pass; the chain now validates
end-to-end (test_correct_ca_succeeds passes).
Refs: D-03, D-04, D-22; AUTH-02
GREEN phase of Plan 29-01 Task 2. Implements:
src/phaze/config.py
- BaseSettings.api_tls_sans (D-02): comma-separated SAN list for the
auto-generated leaf cert. Default "localhost,127.0.0.1,api" covers
single-host dev (loopback) + docker-compose service-name DNS.
Env alias PHAZE_API_TLS_SANS.
- AgentSettings.agent_ca_file (D-03): path to the operator-distributed
CA cert. Default "/certs/phaze-ca.crt" matches the in-container
bind-mount path on the agent side. Env alias PHAZE_AGENT_CA_FILE.
src/phaze/services/agent_client.py
- PhazeAgentClient.__init__ accepts keyword-only `verify` parameter
(type: ssl.SSLContext | str | bool, default True). Threaded through
to httpx.AsyncClient(verify=...). Default True preserves backwards
compat with all existing respx-based tests (Pitfall 10) -- respx
mocks below the TLS layer.
- `ssl` moved to TYPE_CHECKING block (annotation-only usage).
src/phaze/tasks/_shared/agent_bootstrap.py
- construct_agent_client(cfg) now validates cfg.agent_ca_file at
construction time: if the path does not exist OR is zero-byte,
raises RuntimeError("CA file empty or unreadable: ...") so
misconfiguration surfaces fast (D-03 fail-fast).
- Passes verify=cfg.agent_ca_file through to PhazeAgentClient.
All 4 TLS integration tests pass. 26 existing respx-based
test_agent_client*.py cases still pass (Pitfall 10 confirmed:
default verify=True preserves the transport-level mock behavior).
Postgres-free import boundary still holds (test_task_split: 5/5).
ruff + mypy clean on all modified modules.
Refs: D-02, D-03, D-04; AUTH-02
Adds 29-01-SUMMARY.md documenting the 4 plan commits: - ffdbf5f (test RED): 7 cert_bootstrap cases + task_split extension - 5840bfe (feat GREEN): cert_bootstrap + entrypoint shim - 57d9843 (test RED): TLS integration tests + Rule 1 bug fix (CA chain extensions: AKI, SKI, EKU(SERVER_AUTH)) - 25c4ca4 (feat GREEN): verify= plumbing through PhazeAgentClient + AgentSettings.agent_ca_file + BaseSettings.api_tls_sans 12 net new tests passing. AUTH-02 partially closed (full closure in Plan 03 once docker-compose api command switches to phaze.entrypoint). Refs: D-01..D-04, D-22; AUTH-02
…ssword validator
- New tests/test_config/__init__.py marker so pytest discovers the sub-package
- 4 RED cases covering Phase 29 D-06:
1. agent_env=production + passwordless redis_url -> ValidationError with
"requires a password in redis_url" substring
2. agent_env=production + redis://default:<pw>@host:6379/0 constructs OK
3. agent_env=dev + passwordless redis_url constructs OK (Pitfall 7)
4. Default agent_env is "dev" when omitted
- Tests pass kwargs directly (cleaner than env-var monkeypatching for model contract)
RED state confirmed: first test fails with "DID NOT RAISE ValidationError"
because the agent_env field + model_validator do not exist yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion mode
Phase 29 D-06 / AUTH-03 (agent-side half):
- Add `Literal` to typing imports
- Add AgentSettings.agent_env: Literal["dev", "production"] field
(default "dev"; env alias PHAZE_AGENT_ENV)
- Add AgentSettings._enforce_redis_password_in_production model_validator:
when agent_env=="production", `urlparse(self.redis_url).password` must be
set; otherwise raise ValueError("agent_env=production requires a password
in redis_url (Phase 29 D-06)").
- Field placed adjacent to other PHAZE_AGENT_* fields for grouping.
- model_validator placed after _enforce_required_agent_fields so the
redis-url check runs after the required-field check.
The pairing server-side hardening (Redis `requirepass` + LAN-bound port)
lands in Plan 03 alongside the docker-compose rewrite; together they fully
close AUTH-03. Dev mode preserves Pitfall 7: fresh clones do `docker compose
up` with no Redis password ceremony.
Verification:
- 4/4 tests in tests/test_config/test_agent_settings_redis_password.py pass
- 22 existing tests in tests/test_config_role_split.py + test_config_worker.py
pass (no regression)
- uv run mypy src/phaze/config.py: clean
- uv run ruff check + ruff format --check: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 29 Plan 02 SUMMARY. Documents: - Files created (tests/test_config/__init__.py, tests/test_config/test_agent_settings_redis_password.py) - File modified (src/phaze/config.py: Literal import + agent_env field + _enforce_redis_password_in_production model_validator) - 2 commits (4b95029 RED, a7741ff GREEN; no REFACTOR needed) - 4 new tests; 0 regressions in 22 existing config tests - D-06 fully implemented; AUTH-03 partial (server-side half lands in Plan 03) - TDD gate compliance verified; self-check passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation
- tests/test_deployment/__init__.py (pytest sub-package marker)
- tests/test_deployment/test_api_filesystem_isolation.py with 4 tests:
* test_api_service_has_no_file_mounts (DIST-01)
* test_controller_worker_has_no_file_mounts (DIST-01)
* test_no_watcher_or_agent_worker_in_root_compose (D-15 / D-17;
also asserts audfprint + panako absent)
* test_redis_hardened (D-05 / AUTH-03; requirepass + LAN bind +
--no-auth-warning healthcheck)
- All 4 tests FAIL against the current docker-compose.yml — RED step
of TDD. Task 2 lands the compose rewrite that turns them GREEN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… redis
Rewrite root docker-compose.yml as the application-server-only compose
(DIST-01, D-05, D-17, D-19; AUTH-03 server-side). End state: services
block is exactly {api, worker, postgres, redis}.
- api: swap `command:` to `uv run python -m phaze.entrypoint` (Plan 01
cert-bootstrap shim) and replace the SCAN_PATH mount with a single
bind volume `${CA_PATH:-./certs}:/certs:rw`.
- worker (controller): drop `MODELS_PATH=/models` from environment and
remove all three file mounts (SCAN_PATH, MODELS_PATH, OUTPUT_PATH).
Controller is now fileless.
- DELETE the watcher, agent-worker, audfprint, panako service blocks —
they live in docker-compose.agent.yml on the file server (Plan 04+).
- DELETE the unused audfprint_data and panako_data named volumes.
- redis: list-form command with `--requirepass ${REDIS_PASSWORD:?...}`
(fail-fast at compose-parse time); ports bound via
`${REDIS_BIND_IP:-127.0.0.1}:6379:6379` (loopback default, prod sets
LAN IP); healthcheck uses
`redis-cli --no-auth-warning -a ${REDIS_PASSWORD} ping`.
.env.example: add the three Phase-29 variables with comment blocks:
REDIS_PASSWORD=changeme (dev placeholder; Pitfall-7 mitigation),
REDIS_BIND_IP=127.0.0.1, PHAZE_API_TLS_SANS=localhost,127.0.0.1,api.
Dockerfile audit: no MODELS_PATH/SCAN_PATH/OUTPUT_PATH ENV defaults
were present — verify step only, no changes needed.
All 4 tests in tests/test_deployment/test_api_filesystem_isolation.py
now pass (GREEN step of TDD).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- 4 YAML-parse structural tests (D-19) in tests/test_deployment/
- docker-compose.yml rewritten: services = {api, worker, postgres, redis}
- api volumes stripped to /certs:rw only; controller worker is fileless
- watcher, agent-worker, audfprint, panako removed (move to agent.yml)
- redis hardened: --requirepass, LAN-bound port, authenticated healthcheck
- .env.example documents REDIS_PASSWORD, REDIS_BIND_IP, PHAZE_API_TLS_SANS
- Dockerfile audited — no MODELS_PATH/SCAN_PATH/OUTPUT_PATH ENV defaults
Closes DIST-01 (app server has no file mounts) and the server-side half
of AUTH-03 (Redis requirepass + LAN binding). Decision IDs D-05, D-17,
D-19 fully implemented; D-15 partial pending Plan 04's agent.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the essentia model URL list (33 classifier paths + 1 genre model) from scripts/download-models.sh into a Python helper so both bash and the agent bootstrap can drive the download from a single source of truth. - src/phaze/scripts/__init__.py: new package marker - src/phaze/scripts/download_models.py: download_to(target_dir) public entry; _download_one uses .part atomic-rename pattern (T-29-05-03); CLI entry via `python -m phaze.scripts.download_models <dir>` - scripts/download-models.sh: rewritten as a 6-line bash shim that execs the Python module (signals + exit code pass through cleanly) - tests/test_services/test_model_bootstrap.py: scaffold with three ensure_models_present cases (RED until Task 2 lands) plus three download_to/_download_one cases (GREEN now) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 29 D-21 completes OPS-03's auto-download path. A new Postgres-free shared module owns the .pb-glob + download orchestration; agent_worker calls it AFTER /whoami succeeds so a bad token / unreachable app server fails fast in ~60s instead of after a 5-minute 150MB download. - src/phaze/tasks/_shared/model_bootstrap.py: ensure_models_present module (Postgres-free; stdlib + phaze.scripts.download_models only) - src/phaze/tasks/agent_worker.py: drop the in-place RuntimeError checks; call ensure_models_present(Path(cfg.models_path)) as Step 3a after whoami_with_retry - src/phaze/agent_watcher/__main__.py: add WARNING-7 documentation comment explaining why the watcher intentionally does NOT auto-download - tests/test_task_split.py: add test_model_bootstrap_stays_postgres_free subprocess case (BLOCKER-1 resolution; parallel to the existing agent_bootstrap case) - tests/test_phase04_gaps.py: replace the two old fail-fast model-dir RuntimeError tests with ordering + propagation tests that match the new auto-download semantics Deferred (out of scope, pre-existing from Plan 29-03 compose hardening): test_docker_compose_has_agent_worker_consuming_agent_queue -- Plan 29-04 moves the agent-worker block into docker-compose.agent.yml, and the test must be updated to scan both compose files there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPS-03 + D-21 fully implemented. SUMMARY records the 33+1 URL migration from bash to Python, the WARNING-7 watcher-no-download choice, and the BLOCKER-1 subprocess import-boundary test (test_model_bootstrap_stays_postgres_free). deferred-items.md notes the pre-existing test_docker_compose_has_agent_worker_consuming_agent_queue failure from Plan 29-03 (compose hardening removed the agent-worker block); the test will be updated by Plan 29-04 (parallel wave) which lands docker-compose.agent.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tests/test_services/test_agent_liveness.py: 5-state classify matrix (12 boundary cases) + sort_key ordering invariants per UI-SPEC. - tests/test_utils/test_humanize.py: relative_time output table (UI-SPEC LOCKED) covering all bucket boundaries, the 89.7s → "89s ago" truncation case, and format invariants (no plural-s suffix, single-letter unit). Both modules ImportError today; GREEN commit follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…REEN)
Wave 0 of plan 29-07 (OPS-04 UI half — pure-function tier):
- src/phaze/constants.py: add AGENT_LIVENESS_ALIVE_SECONDS=90 +
AGENT_LIVENESS_STALE_SECONDS=300 (Phase 29 D-12 LOCKED thresholds).
- src/phaze/services/agent_liveness.py: pure-function classify(agent, now)
→ AgentStatus literal in {alive, stale, dead, revoked, never} with
precedence revoked → never → alive/stale/dead. sort_key returns
(revoked_int, status_rank, neg_last_seen) so revoked agents land last,
non-revoked sort alive→stale→dead→never, ties break by last_seen DESC.
- src/phaze/utils/__init__.py + src/phaze/utils/humanize.py: relative_time
helper producing "never" / "just now" / "Ns ago" / "Nm ago" / "Nh ago" /
"Nd ago" with int-truncate semantics (UI-SPEC LOCKED bucket table).
Reconciled UI-SPEC documentation defect (line 248 prose example "89.7s →
89s ago" is inconsistent with its own bucket table lines 232-241; the
table is authoritative — see test docstring for Rule-1 fix rationale).
All 51 tests pass; mypy + ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 1 RED gate for plan 29-07. 10 tests covering: - Full page render with base.html chrome - HX-Request: true returns partial only - Dedicated /_table partial route (always partial, never halts polling) - 5-state status pill rendering with LOCKED Tailwind classes - Empty state UI-SPEC §Empty State LOCKED copy - Sort order alive → stale → dead → never → revoked - 3 BLOCKER-2 tests: htmx event listener + role=alert failure footer + localStorage `phaze:agents:lastError` plumbing - Production-wiring smoke (router registered in main.create_app) Currently fails at import — phaze.routers.admin_agents does not exist yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 1 of plan 29-07 (OPS-04 UI half — Wave-1 deliverables):
- src/phaze/routers/admin_agents.py: APIRouter(prefix="/admin/agents"); two
handlers — `page` (HX-Request-aware, full page OR partial) and
`table_partial` (always partial, the canonical 5s polling target).
`_load_agents` injects transient `agent._status` via classify(a, now) and
sorts via sort_key (Phase 27 transient-attr pattern). No
get_authenticated_agent dep (operator-facing on private LAN, consistent
with pipeline.py / pipeline_scans.py precedent).
- src/phaze/templates/admin/agents.html: page shell extending base.html;
current_page="admin_agents"; hosts the MANDATORY htmx:responseError +
htmx:sendError + htmx:afterSwap listener writing/clearing
`phaze:agents:lastError` localStorage (BLOCKER-2 UI-SPEC §Error /
Failure-Tolerant Refresh LOCKED).
- src/phaze/templates/admin/partials/agents_table.html: HTMX
self-replacing <section> (hx-get/hx-trigger/hx-swap=outerHTML, never
halts — UI-SPEC §Polling LOCKED). Empty state + 6-column table + happy-
path "Last refreshed Ns ago" Alpine footer + MANDATORY red role=alert
"Refresh failed at HH:MM:SS" footer driven by localStorage (BLOCKER-2).
- src/phaze/templates/admin/partials/_status_pill.html: 5-state liveness
pill with LOCKED Tailwind palette (alive=green-100/950,
stale=amber-100/950, dead=red-100/950, revoked/never=gray-100/800) +
redundant aria-label="Status: <state>" for screen readers.
- src/phaze/templates/base.html: new "Agents" nav link inserted between
Audit Log and the theme toggle. Uses short-slug
`current_page == 'admin_agents'` per WARNING-1 (matches live convention
where Audit Log uses 'audit' not 'audit_log'). aria-current="page" is
a forward-looking a11y upgrade applied only to this new link.
- src/phaze/main.py: register admin_agents.router alongside Phase 27/28
routers.
All BLOCKER-2 grep gates pass:
agents.html: htmx:responseError × 2, htmx:sendError × 2, htmx:afterSwap × 2,
phaze:agents:lastError × 2, localStorage.setItem × 1,
localStorage.removeItem × 1.
agents_table.html: localStorage.getItem × 2, phaze:agents:lastError × 4,
"Refresh failed" × 2, role="alert" × 1.
Test status:
- test_router_registered_in_main_app passes (non-DB, structural).
- 9 DB-backed tests collect cleanly; cannot execute locally (no Postgres
on the executor host). Logic verified via direct Jinja render smoke
(positions, pill classes, BLOCKER-2 markup, relative_time output).
- mypy + ruff clean on all new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l (RED)
Four LOCKED structural assertions for the file-server compose:
1. test_agent_compose_service_list — exactly {worker, watcher, audfprint, panako}
2. test_agent_compose_has_no_postgres_env — DIST-04 invariant
3. test_worker_service_has_phaze_role_agent — D-17
4. test_all_scan_path_mounts_use_failfast_syntax — WARNING-3
All four currently fail because docker-compose.agent.yml does not yet
exist. The GREEN commit will create the compose file + env template.
Per WARNING-3: the fail-fast regex test rejects future drift to
${SCAN_PATH:-/data/music} loose-default form that would let
docker compose up succeed on a misconfigured file-server host.
Refs phase 29-04 plan, D-15..D-17, D-22 test surface.
- 4 happy-path tests in test_heartbeat_cron.py: * success: heartbeat_tick POSTs HeartbeatRequest with correct payload * ctx-missing: missing api_client/agent_identity -> WARNING + return * queue.info-fail: queue_depth defaults to 0; heartbeat still POSTs * importlib metadata: agent_version sourced from importlib.metadata - 1 failure test in test_heartbeat_failure.py: * AgentApiServerError -> WARNING logged; no exception escapes (D-09) - Tests use ctx["worker"].queue (NOT ctx["queue"]) per RESEARCH Pitfall 8 - AgentApiServerError constructed positional-only (no status_code= kwarg) - RED step: tests FAIL with ModuleNotFoundError until heartbeat.py lands Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EEN)
The file-server-host compose file (D-15) declares exactly 4 services:
worker, watcher, audfprint, panako. Worker + watcher pull from
ghcr.io/simplicityguy/phaze (D-16 with `${PHAZE_IMAGE_TAG:-latest}`);
sidecars retain `build:` (not yet on GHCR per D-15).
All 4 services use `${SCAN_PATH:?SCAN_PATH required}` fail-fast
interpolation (WARNING-2 unified explicit-message form; WARNING-3
test enforces). MODELS_PATH bind mount is rw on worker + watcher for
D-21 auto-download; CA_PATH bind mount is ro everywhere.
Adds `.env.example.agent` with every variable a file-server host
needs (D-23 portion): PHAZE_IMAGE_TAG, PHAZE_AGENT_API_URL,
PHAZE_REDIS_URL, PHAZE_AGENT_{ID,TOKEN,QUEUE}, PHAZE_AGENT_CA_FILE,
PHAZE_AGENT_ENV, SCAN_PATH, MODELS_PATH, CA_PATH,
PHAZE_AGENT_SCAN_ROOTS. Production-pin guidance lives in inline
comments.
Also resolves the Plan 29-05 deferred test failure: updates
`tests/test_phase04_gaps.py::test_docker_compose_has_agent_worker_consuming_agent_queue`
to scan BOTH `docker-compose.yml` and `docker-compose.agent.yml`. The
agent-worker now lives in `docker-compose.agent.yml::worker`, so the
Phase 27 UAT gap-13 invariant is again codified across the split
compose surface. Marks the deferred-items.md entry resolved.
All 4 RED tests now pass.
Refs phase 29-04 plan, D-15, D-16, D-17, D-22 (agent-compose portion),
D-23 (.env.example.agent portion).
Adds test_docker_publish_workflow_tags_both_latest_and_version — a 5th
test in test_agent_compose.py. Replaces the original checkpoint:human-verify
task with an automated YAML-parse check that .github/workflows/docker-publish.yml
emits BOTH a `:latest` tag and a `:v<version>` tag (D-16).
Currently fails because docker-publish.yml's docker/metadata-action step
only declares `type=raw,value=latest`, `type=ref,event=branch`, `type=ref,event=pr`,
and `type=schedule,pattern=...` — no `type=semver,pattern={{version}}` and
no `type=ref,event=tag`. The GREEN commit will extend the workflow.
The plan stays autonomous (no human checkpoint); the regression-detection
is now in CI permanently.
Refs phase 29-04 plan WARNING-4 resolution.
… (GREEN)
Two coupled changes to satisfy the WARNING-4 automated test and align
the published api image URL with docker-compose.agent.yml:
1. Tag strategy (D-16 + WARNING-4): the docker/metadata-action step now
emits `type=raw,value=latest`, `type=semver,pattern={{version}}`,
`type=semver,pattern={{major}}.{{minor}}`, `type=ref,event=tag`,
`type=ref,event=branch`, `type=ref,event=pr`, and the schedule tag.
On a tagged release `v4.0.0`, this produces `:latest`, `:v4.0.0` (via
ref,event=tag), `:4.0.0` and `:4.0` (via semver). Operators get the
full set of stability rungs the .env.example.agent comments
reference (PHAZE_IMAGE_TAG=v4.0.0 production pin).
2. Image URL realignment (D-15): the matrix entry for `api` now sets
`image_suffix: ""`, pushing to the BARE-repo URL
`ghcr.io/simplicityguy/phaze:<tag>` — the exact URL
docker-compose.agent.yml's worker + watcher pull from. The sidecars
keep their `/audfprint` and `/panako` sub-paths because agent.yml
builds them locally (D-15) and does not pull them from GHCR.
This makes `test_docker_publish_workflow_tags_both_latest_and_version`
pass, closing WARNING-4 with an autonomous test instead of a
checkpoint:human-verify task.
Verification result: `fixed` + `url-realigned` (both the tag pattern
and the image URL needed adjustment).
Refs phase 29-04 plan WARNING-4 resolution.
…n plan OPS-02 fully closed. Lands the file-server-host compose surface (docker-compose.agent.yml + .env.example.agent) with exactly 4 services and replaces the original GHCR-tag human-verify checkpoint with an automated YAML-parse test (WARNING-4 resolution). Workflow extended to emit :v<version> tags and api image realigned to the bare-repo URL ghcr.io/simplicityguy/phaze. Also resolves the Plan 29-05 deferred test (gap-13 invariant now codified across the split compose surface). Plan stays autonomous: true. 5 new tests, all green, no regressions across deployment + adjacent suites (22 tests).
Land the agent-side half of OPS-04 (D-07..D-10):
- New src/phaze/tasks/heartbeat.py with heartbeat_tick(ctx) async cron
handler. Reads ctx["api_client"], ctx["agent_identity"],
ctx["worker"].queue (NOT ctx["queue"] per RESEARCH Pitfall 8); builds
HeartbeatRequest(agent_version=importlib.metadata.version("phaze"),
worker_pid=os.getpid(), queue_depth=Queue.info()["queued"]) and POSTs
it via PhazeAgentClient.heartbeat.
- Defensive: ctx not initialized -> WARNING + return; queue.info()
failure -> default queue_depth=0 + still POST; AgentApiError -> WARNING
+ swallow (D-09 fire-and-forget; SAQ retries on next tick).
- agent_worker.py: import CronJob + heartbeat_tick; add heartbeat_tick
to settings.functions; add cron_jobs=[CronJob(heartbeat_tick,
cron="* * * * * */30", unique=True, timeout=10)]. Trailing-seconds
6-field form per RESEARCH Critical Discovery #2 (the CONTEXT.md D-08
leading-seconds example would fire every second; verified empirically
with croniter -- gaps are 30s vs 1s).
- agent_worker.py stays a single .py file (Pitfall 9 avoided).
- heartbeat.py is Postgres-free (banner documents the invariant).
Tests: all 5 heartbeat tests pass (GREEN); test_task_split still passes;
tests/test_tasks/ suite passes (excluding the pre-existing Plan 29-03/04
deferred test_docker_compose_has_agent_worker_consuming_agent_queue).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures Task 1 (RED tests) + Task 2 (GREEN implementation) outcomes, threat-model mitigations, and the one notable RESEARCH Critical Discovery #2 fact: the cron string is the trailing-seconds 6-field form `* * * * * */30` (NOT the leading-seconds form from CONTEXT.md D-08). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ect.sh (Task 1)
- justfile: add up-agent + up-all recipes under [group('dev')]; existing `up` unchanged
- docs/deployment.md (new, 230 lines): 6-step two-host operator walkthrough,
D-20 filesystem-isolation smoke, CA rotation guidance, production checklist;
required strings present: phaze-ca.crt (9), just up-agent (5), REDIS_PASSWORD (3),
/admin/agents (2), PHAZE_AGENT_TOKEN (2)
- .planning/PROJECT.md: new "### Deployment (v4.0 — Distributed Agents)" subsection
under Constraints; documents two-compose-file invariant, HTTPS internal CA,
Redis password-bound LAN, zero-new-pip-deps beyond cryptography
- scripts/update-project.sh: audited (pure dependency/version orchestrator with no
Python module enumeration); left untouched per plan rule
Closes D-18 (justfile recipes), D-20 (filesystem-isolation smoke documented),
D-23 (operator workflow + doc sweep).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reviewed docs/deployment.md end-to-end against the live codebase and confirmed all commands, env vars, routes (/admin/agents, /api/internal/agent/heartbeat), and paths (/data/music, /certs/phaze-ca.crt) match the compose mounts and router prefixes. Cert-bootstrap banner text matches src/phaze/cert_bootstrap.py verbatim. Resume signal: verified-docs-only (Option C from the checkpoint). Follow-up: real-deployment smoke deferred until file-server hardware is available — tracked as a v4.0 outstanding UAT item in the SUMMARY's "Outstanding Items" section. Structural CI tests under tests/test_deployment/ cover compose-file invariants in the meantime. Plan 29-08 complete — Phase 29 (deployment-hardening-agents-admin) closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… env-var Two critical issues from gsd-code-reviewer's REVIEW.md, both in src/phaze/config.py: CR-01: .env.example.agent:22 references AgentSettings._enforce_https_in_production as a guard that refuses http:// agent_api_url in production. The guard did not exist — operators following the docs could ship plaintext bearer tokens. Adds the validator (and three test cases covering https-ok / http-blocked / dev-permits-http). CR-02: BaseSettings.redis_url and database_url had no validation_alias, so pydantic-settings only accepted bare REDIS_URL / DATABASE_URL env-var names. Operators using PHAZE_REDIS_URL (as documented in .env.example.agent) hit the default passwordless URL and tripped _enforce_redis_password_in_production with a misleading "requires a password" error — preventing the production agent from starting. Adds AliasChoices on both fields and an env-var integration test that exercises the binding via AgentSettings() with no kwargs (the 29-02 tests pass kwargs directly and never hit the env path).
ensure_models_present previously short-circuited whenever *any* .pb file existed in the models directory. An interrupted first download (e.g., 1/34 files written before SIGTERM) permanently left every subsequent agent start skipping re-download — the agent would silently break at analysis time when essentia tried to load the 2-33 missing weights. Compares the observed .pb count against len(CLASSIFIER_MODELS) + len(GENRE_MODELS). Partial state logs WARNING with the observed/expected counts and re-invokes download_to, which is idempotent at the per-file level (_download_one skips existing dests). Two test updates: - tests/test_services/test_model_bootstrap.py: populated-no-op test now writes all 34 expected files; new partial-triggers-redownload test pins the WARNING path. - tests/test_tasks/test_agent_startup_banner.py: two pre-existing tests patched pathlib.Path.glob to return a single fake .pb and relied on the old loose check. Patch ensure_models_present directly so the banner / queue-mismatch logic under test is not coupled to the completeness rule. Caught by the gsd-code-reviewer agent (Phase 29 REVIEW.md CR-03).
3-source requirements cross-reference: all 26 requirements satisfied in code (verification + summary + wiring). 22/22 cross-phase exports wired, 12/12 internal API routes consumed, all 5 E2E flows traced. Documentation drift surfaced (does not affect runtime): - 13 REQUIREMENTS.md traceability entries stale at Pending despite verified-passed phases (DIST-04/05, DATA-01..04, AUTH-01/04, TASK-04, EXEC-01..04) - ROADMAP.md Phase 24 checkbox still [ ] - Phase 24 VERIFICATION.md filename unprefixed Tech debt carried into post-milestone backlog: P28-WR-03, P28-RACE-01, P29-WR-01..04, P29-IN-01..03 (all advisory; none block archive).
1. REQUIREMENTS.md: 13 stale `[ ]` checkboxes → `[x]` and 13 `| Pending |`
traceability rows → `| Complete |` (DIST-04, DIST-05, DATA-01..04,
AUTH-01, AUTH-04, TASK-04, EXEC-01..04). Traceability table footer
bumped to today with the v4.0 completion note. The integration
checker confirmed all 26 requirements are wired in code; the rows
were already satisfied — the doc just hadn't been touched since
2026-05-11.
2. ROADMAP.md: flip Phase 24 checkbox `[ ]` → `[x] (completed 2026-05-11)`.
Phase 24 VERIFICATION.md `status: passed, score 4/4` since 2026-05-11.
3. Rename `phases/24-schema-foundation-agent-registry/VERIFICATION.md`
→ `24-VERIFICATION.md` to match the v4.0 convention used by every
other phase (`{phase_num}-VERIFICATION.md`). The unprefixed name
broke `gsd-sdk query find-phase` discovery.
Snapshot v4.0 Distributed Agents milestone: - .planning/milestones/v4.0-ROADMAP.md (full phase details) - .planning/milestones/v4.0-REQUIREMENTS.md (26/26 satisfied) - .planning/milestones/v4.0-MILESTONE-AUDIT.md (moved from .planning/) - MILESTONES.md prepended with v4.0 entry + delivered summary - PROJECT.md full evolution review (Current State, Validated, Key Decisions outcomes) - STATE.md status -> milestone_complete; v4.0 velocity recorded - ROADMAP.md collapses v4.0 Phases 24-29 into <details> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Archived to .planning/milestones/v4.0-REQUIREMENTS.md (all 26 reqs satisfied). Next milestone will define fresh requirements via /gsd:new-milestone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append v4.0 Distributed Agents section: what was built, what worked, what was inefficient, patterns established (settings split factory, subprocess import-boundary tests, 403-before-state-machine guard, pre-uvicorn entrypoint shim, etc.), key lessons (8), cost observations. Update Cross-Milestone Trends: process evolution row for v3.0 + v4.0, cumulative quality table, top lessons (7) verified across milestones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. **`admin_agents` router 422s on all 9 tests + production** —
`AsyncSession` was imported under `if TYPE_CHECKING:` with
`from __future__ import annotations`, so FastAPI's `get_type_hints`
could not resolve the runtime annotation in
`Annotated[AsyncSession, Depends(get_session)]` and treated
`session` as a query parameter (`422 {"detail":[{"type":"missing","loc":["query","session"]}]}`).
Move `from sqlalchemy.ext.asyncio import AsyncSession` to a
runtime import (matches `agent_files.py` pattern, noqa TC002 since
FastAPI requires runtime resolution). Fixes all 9
`test_admin_agents.py` failures and the production route.
2. **`test_settings_redis_url_default` env leak** — Phase 29-02 added
`PHAZE_REDIS_URL` as a pydantic `AliasChoices` alias for the
`redis_url` field, and CI sets `PHAZE_REDIS_URL=redis://localhost:6379/0`
for the test Redis service. The default-value test only deleted
`REDIS_URL`, not the new alias. Delete all three spellings
(`PHAZE_REDIS_URL`, `REDIS_URL`, `redis_url`) before asserting on
the default.
3. **`validate-docker-compose` job parse-fail** — Phase 29-03 added
`${REDIS_PASSWORD:?REDIS_PASSWORD required}` to the application-server
`docker-compose.yml`, and Phase 29-04 added `${SCAN_PATH:?...}` on
four services in the new `docker-compose.agent.yml`. The CI job
only did `touch .env`, so compose-parse fail-fast tripped before
anything could be validated. Supply placeholders for both compose
files and validate the new agent compose alongside the existing
app-server compose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Codecov flagged 26 lines missing across 3 Phase 29 modules (patch coverage 90.33%). This commit takes all three files to 100%. - tests/test_entrypoint.py (new, 14 lines covered) Monkeypatches ensure_certs_present + os.execvp; verifies env-var defaults, env-var overrides, and the ensure→execvp sequencing invariant (RESEARCH Pattern 2: cert files must exist before uvicorn boots against --ssl-keyfile/--ssl-certfile). - tests/test_scripts/test_download_models.py (new, 10 lines covered) respx-mocked tests for _download_one (idempotent skip on existing dest, atomic .part-then-rename on success, 4xx leaves dest absent so model_bootstrap's *.pb glob retries) and download_to (walks CLASSIFIER_MODELS + GENRE_MODELS, no-ops on a populated dir). - tests/test_cert_bootstrap.py — added test_unparseable_existing_ certs_trigger_regeneration to cover lines 202-203 (the WARNING + regeneration branch when all 4 files exist but parse as garbage). - Tagged the two `if __name__ == "__main__":` CLI invocation guards in entrypoint.py and download_models.py with `# pragma: no cover` so coverage reflects what is reachable from `python -m`. Coverage after this commit: cert_bootstrap.py: 96.77% -> 100.00% entrypoint.py: 0.00% -> 100.00% download_models.py: 67.74% -> 100.00% Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 29: Deployment Hardening & Agents Admin — final phase of milestone v4.0 Distributed Agents.
Goal: A real two-host deployment runs end-to-end with the application server holding no file mounts, HTTPS + Redis hardening in place, and an admin can see at a glance which agents are alive and healthy.
Verification: 6/6 must-haves verified (
docs/code/test). One human-UAT item — real two-host hardware smoke — is operator-deferred asverified-docs-only(hardware not yet available); tracked in29-HUMAN-UAT.mdand explicitly accepted by the v4.0 milestone audit (status: passed).This PR ships Phase 29 and closes milestone v4.0: strips file mounts from the application server compose, adds a self-signed internal CA + HTTPS termination, hardens Redis (
requirepass+ LAN bind), introducesdocker-compose.agent.ymlfor file servers, per-file-server model download, 30 s heartbeat, and the/admin/agentspage — then archives the v4.0 milestone artifacts and tagsv4.0.Changes
Phase 29 plans (8/8 complete)
29-01phaze.cert_bootstrap(idempotent CA + leaf x509 viacryptography), pre-uvicorn entrypoint shim (bootstrap →execvp uvicorn),PhazeAgentClient.verify=kwarg defaulting toAgentSettings.agent_ca_file, wrong-CA →ConnectErrorintegration test29-02AgentSettings.agent_envfield + production-moderedis_urlpassword validator (fail-fast on passwordless URL whenPHAZE_AGENT_ENV=production)29-03docker-compose.yml: stripSCAN_PATH/MODELS_PATHmounts, delete watcher/agent-worker/audfprint/panako services, harden Redis with${REDIS_PASSWORD:?}+${REDIS_BIND_IP:-127.0.0.1},.env.exampleupdates, filesystem-isolation YAML-parse tests29-04docker-compose.agent.yml(4 services: worker, watcher, audfprint, panako) +.env.example.agent+ agent-compose YAML-parse tests +docker-publish.ymlextended for agent-compose image tags29-05phaze.scripts.download_modelsPython helper +phaze.tasks._shared.model_bootstrap(rejects partial-download.partstate) + agent_worker/watcher startup wiring + bash shim rewrite29-06phaze.tasks.heartbeat.heartbeat_tick+ SAQCronJobregistration inagent_worker.settings(30 s cron)29-07services.agent_livenessclassifier +utils.humanize+routers.admin_agents+ 3 Jinja templates +base.htmlnav link + HTMX 5 s auto-refresh29-08justfilerecipes (up-agent,up-all) +docs/deployment.md+ PROJECT.md Deployment subsection +scripts/update-project.shtouch + human-verify checkpointBug fixes from code review (Phase 29 CR pass)
fix(29-cr-01,cr-02): productionhttps://guard onPhazeAgentClient; bindPHAZE_REDIS_URLenv var through to composefix(29-cr-03):model_bootstraprejects partial-download (.part) state on startupv4.0 milestone close
docs(milestone-v4.0): audit report (status: passed) + documentation drift closure (REQUIREMENTS.mdtraceability sync,ROADMAP.mdcheckbox sync)chore: archive v4.0 milestone files—milestones/v4.0-ROADMAP.md,milestones/v4.0-REQUIREMENTS.md,milestones/v4.0-MILESTONE-AUDIT.md;MILESTONES.mdprepended;PROJECT.mdfull evolution review;STATE.md→milestone_complete;ROADMAP.mdcollapses Phases 24-29 into<details>chore: remove REQUIREMENTS.md for v4.0 milestone close—git rmfor fresh-per-milestone conventiondocs: update retrospective for v4.0 milestone— v4.0 section (built, worked, inefficient, 12 patterns, 8 lessons) + Cross-Milestone Trends updatedDiff scope
Requirements Addressed
Phase 29 closes the final 6 v4.0 requirements:
SCAN_PATH/MODELS_PATHmountsrequirepass+ LAN-only bind; agents connect withredis://default:<password>@<host>:6379docker-compose.agent.ymlbrings up worker/watcher/audfprint/panako on file serverjust download-modelspopulates per-file-server local/modelsvolume/admin/agentspage with liveness/queue depth/last-seenWith this, 26/26 v4.0 requirements satisfied (full traceability in
milestones/v4.0-REQUIREMENTS.md).Verification
29-VERIFICATION.md)status: passed(milestones/v4.0-MILESTONE-AUDIT.md) — 26/26 reqs, 6/6 phases, 22/22 cross-phase wires, 4/5 flows complete + 1 advisory edgeverified-docs-only; tracked in29-HUMAN-UAT.md. Will be run when file-server hardware is available.Key Decisions
execvp uvicornso signals and PID-1 propagate cleanly; no double-process tree under Dockerapicontainer on first start — no DNS dependency, no public ACME, no rotation pain for single-user LAN. Operator distributes public cert via scp; CA private key (phaze-ca.key, mode 0600) never leaves the app-serverAgentSettings.agent_envboots-fail-fast on passwordlessredis_urlwhenPHAZE_AGENT_ENV=production— surfaces misconfig at startup rather than as a silent unauthenticated connectiondocker-compose.agent.ymlenforces${SCAN_PATH:?...}on all four services — compose parse fails before any container starts on a misconfigured file-server hostphaze.tasks._shared.model_bootstraprejects.partfiles at startup (CR-03) — prevents corruption from interrupteddownload_modelsrunsprint()ANDlogger.warning()(CONTEXT D-02) — visible regardless of logging configurationTech Debt Carried Forward
Documented in
29-REVIEW.mdandmilestones/v4.0-MILESTONE-AUDIT.mdfor follow-up:Path.write_bytes()applies umask beforechmod(0o600); CA + leaf private keys are world-readable for a brief window on the bind mount (cert_bootstrap.py:215-234)docker-compose.ymlbinds Postgres on0.0.0.0(no equivalent guard to Redis's${REDIS_BIND_IP:-127.0.0.1}hardening).parttemp file not cleaned up on download error; next run leaves a stale sibling (download_models.py:84-90)watcherservice mountsMODELS_PATH:rwdespite never writing (unnecessary write grant)skipped_revokedcount but notrevoked_agentslist toprogress.htmlMilestone v4.0 Distributed Agents — SHIPPED
This PR closes the milestone. Tag
v4.0is created locally on this branch and should be pushed frommainafter merge:🤖 Generated with Claude Code