docs: tighten OPS.md sections 2/3 against first M0-5/M1-6 deploy #75

Open
Augustas11 wants to merge 1 commit into main from docs/m2-8-ops-md-post-deploy-update

@Augustas11

Owner

Summary

  • Resolves the two **TBD after first M0-5/M1-6 deploy** callouts in OPS.md §2 and §3 against observations from the first end-to-end M0-5/M1-6 production deploy (2026-06-11, v1.3.0-24-g87b3a6b → v1.3.1-5-gba04cd4 on Pearl).
  • §2: documents the observed restart→/healthz timing (single GET after 2s sleep, ~5s end-to-end, no retry loop, immediate response).
  • §3: confirms the single-file .prev layout at /opt/macprovider/{coordinator,gateway}.prev (owned macprovider:macprovider, mode 0755), plus the coordinator-side timestamped coordinator.yaml.bak-<UTC> accumulating backups.
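The §2 timing described above (one GET after a fixed settle delay, no retry loop) can be sketched as a small shell helper. This is an illustrative sketch, not the deploy script's actual check, which this PR does not show: the function name `check_healthz_once`, the overridable `FETCH` variable, and the example URL are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a one-shot post-restart health check (no poll loop).
# FETCH is a hypothetical override hook; by default it is a single curl call
# that fails on HTTP errors (-f) and is bounded by a short timeout.
FETCH="${FETCH:-curl -fsS --max-time 3}"

check_healthz_once() {
  local url="$1"
  local settle="${2:-2}"   # settle delay in seconds; 2s matches the observed deploy

  sleep "$settle"
  # Exactly one request. A non-zero exit means the check failed; no retry.
  $FETCH "$url" >/dev/null
}

# Example call (unit name and port are assumptions, not taken from the repo):
#   systemctl restart coordinator && check_healthz_once http://127.0.0.1:8080/healthz
```

A single attempt keeps the restart-to-assert window short and predictable, but it also means a slow start shows up as a hard failure rather than a delay.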

Companion findings worth a follow-up (not in this PR)

During the deploy, two on-disk surprises surfaced: one caused the coordinator deploy script to fail at step 6b, and the other would have caused the gateway deploy script to fail at step 4 (mitigated by switching to a binary-only swap for the gateway). A third finding, a token-policy regression, showed up in the coordinator log. None of these is OPS.md content, but each should be tracked separately:

  1. The local repo's phase4-coordinator/dist/nginx-coordinator.streamvc.live.conf still declares limit_req_zone ws_provider_rate and limit_conn_zone ws_provider_conn, both of which are already declared by the api.streamvc.live site on Pearl. Pearl's live coordinator site had been dedup'd earlier on 2026-06-11 (the .bak-pre-dedup-20260611T135903Z artifact survives), but the local file was never updated. Step 6b overwrote the dedup'd live config with the un-dedup'd local copy; nginx -t failed with "limit_conn_zone is already bound." Fixed in place on Pearl; the local file still drifts.
  2. The gateway deploy script (phase5-gateway/dist/deploy-pearl-vps.sh) lacks the sed-uncomment step that the coordinator script has for ssl_certificate lines. The local nginx-api.streamvc.live.conf ships those lines commented; if the gateway script's step 4 had run end-to-end, it would have installed the commented config and nginx -t would have failed with: no "ssl_certificate" is defined for the "listen ... ssl" directive. Avoided here by skipping the script's nginx step. Either the gateway script needs the same sed step, or the local config needs to ship uncommented.
  3. FR-C9.4 TOFU policy regression: a provider (air5) that connected during the deploy gap under the old binary cannot reconnect under v1.3.1-5 with auth.require_provider_tokens=false. Coordinator log: tokenless connect refused; an active token already exists for this provider_id. Operator will revoke the stored token or run a TOFU bypass; flagged for the decision log.
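The drift in item 1 is the kind of problem a pre-deploy check could catch before step 6b overwrites the live config. The following is a minimal sketch under assumptions: the function name `find_duplicate_zones` is hypothetical, and it only inspects the files it is given, so it does not follow nginx `include` directives.

```shell
#!/usr/bin/env bash
# Print every limit_req_zone / limit_conn_zone name that is declared more than
# once across the given nginx config files. Commented-out lines are ignored
# because the pattern requires the directive to start the line.
find_duplicate_zones() {
  grep -hE '^[[:space:]]*limit_(req|conn)_zone[[:space:]]' "$@" \
    | sed -E 's/.*zone=([^:;[:space:]]+).*/\1/' \
    | sort \
    | uniq -d
}

# Example call (the local and live paths are placeholders):
#   find_duplicate_zones local/coordinator.conf /etc/nginx/sites-enabled/*.conf
```

An empty result means no zone name is declared twice in the files checked; any printed name is a candidate for the "already bound" failure from `nginx -t`.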

Test plan

  • Operator skims §2 + §3 wording and confirms it matches their recollection of the deploy
  • On the next deploy, verify the timing characterization still holds (single GET, ~5s window)
  • On the next deploy, verify a fresh coordinator.yaml.bak-<UTC> accumulates as documented
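The §3 checks in this test plan could be scripted rather than eyeballed. The helpers below are a sketch only: the names `check_prev_file` and `count_yaml_backups` are hypothetical, and `stat -c` assumes GNU coreutils on the host.

```shell
#!/usr/bin/env bash
# Succeeds only if the file exists with the expected owner:group and octal mode.
# Note that GNU stat prints the mode without a leading zero (755, not 0755).
check_prev_file() {
  local path="$1" want_owner="$2" want_mode="$3"
  [ "$(stat -c '%U:%G %a' "$path" 2>/dev/null)" = "$want_owner $want_mode" ]
}

# Count timestamped coordinator.yaml backups directly inside a directory.
count_yaml_backups() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name 'coordinator.yaml.bak-*' | wc -l
}

# Example calls using the paths documented in this PR:
#   check_prev_file /opt/macprovider/coordinator.prev macprovider:macprovider 755
#   check_prev_file /opt/macprovider/gateway.prev     macprovider:macprovider 755
#   count_yaml_backups /opt/macprovider
```

Recording the backup count before the deploy and comparing it afterwards shows whether a fresh backup was written; the count should grow by one if the behaviour matches what is documented.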

🤖 Generated with Claude Code

…observations

Resolves the two "TBD after first M0-5/M1-6 deploy" callouts in OPS.md
against what was actually observed during the v1.3.0-24-g87b3a6b -> v1.3.1-5-gba04cd4
deploy on 2026-06-11.

- Section 2 (coordinator restart): the post-restart /healthz check is a single
  GET after a 2s sleep, not a poll loop. Total window from restart command
  to provenance assert is ~5s; /healthz responded immediately.
- Section 3 (gateway restart): confirmed the single-file .prev layout for both
  services at /opt/macprovider/{coordinator,gateway}.prev (owned
  macprovider:macprovider, mode 0755). Also documented the coordinator deploy
  script's timestamped /opt/macprovider/coordinator.yaml.bak-<UTC> backups
  (gateway script does not touch gateway.yaml so no equivalent there).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11 added a commit that referenced this pull request Jun 12, 2026
…#76)

Two drift items the 2026-06-11 deploy hit on Pearl and patched live,
left unfixed in the repo (would re-bite the next deploy):

1. phase4-coordinator/dist/nginx-coordinator.streamvc.live.conf
   re-declared `ws_provider_rate` and `ws_provider_conn` zones that
   the api.streamvc.live vhost already declares. Two vhosts on the
   same nginx instance cannot redeclare the same http-context zone —
   `nginx -t` fails with "limit_conn_zone is already bound." Removed
   the dup declarations; left a comment explaining the cross-vhost
   sharing and the restore step if the coordinator vhost is ever
   deployed standalone.

2. phase5-gateway/dist/deploy-pearl-vps.sh was missing the
   ssl_certificate sed-uncomment block that the coordinator script
   has. nginx-api.streamvc.live.conf ships with those lines commented
   for first-deploy ACME ordering; without the sed, post-cert deploys
   fail `nginx -t` with `no "ssl_certificate" is defined for the
   "listen ... ssl" directive`. Added the same idempotent sed pair the
   coordinator script uses at its step 6b.
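
An idempotent sed pair of the kind item 2 describes might look like the sketch below. The coordinator script's actual expressions are not shown here, so the patterns and the function name `uncomment_ssl_lines` are assumptions; `sed -i` without a suffix argument assumes GNU sed.

```shell
#!/usr/bin/env bash
# Uncomment "# ssl_certificate ..." and "# ssl_certificate_key ..." lines in
# place, preserving indentation. Idempotent: once a line is uncommented it no
# longer matches either pattern, so a second run changes nothing.
uncomment_ssl_lines() {
  local conf="$1"
  sed -i -E \
    -e 's/^([[:space:]]*)#[[:space:]]*(ssl_certificate[[:space:]])/\1\2/' \
    -e 's/^([[:space:]]*)#[[:space:]]*(ssl_certificate_key[[:space:]])/\1\2/' \
    "$conf"
}

# Example call (the installed path is a placeholder):
#   uncomment_ssl_lines /etc/nginx/sites-available/api.streamvc.live.conf && nginx -t
```

Requiring whitespace after the directive name keeps the first expression from touching `ssl_certificate_key`, which is why two expressions are used rather than one.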

Both surfaced in PR #75's "companion findings" block. The deploy
session worked around #1 by editing nginx config in place on Pearl
and #2 by switching to a binary-only swap (skipping the script's
nginx step). This commit closes the drift in source.



Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
