feat(operator): park multi-deployment barrier copy phase at the cutover barrier #417

Merged
Kiran01bm merged 5 commits into main from kiran01bm/park-copy-at-barrier-oc2
Jun 18, 2026

Kiran01bm commented Jun 17, 2026

What & why

OC-2 of the ordered-cutover workstream. Under the barrier cutover policy, an operation of a multi-deployment apply now runs its copy phase and parks at waiting_for_cutover instead of inline-driving to completed, so the high-risk atomic swaps can later be driven in deployment order by the cutover claim (OC-3).

Today the operator drives each claimed operation inline (copy + cutover in one claim). OC-1 added the dormant cutover-claim predicate; OC-2 makes the copy drive actually stop at the barrier and hand off.

Behaviour

gate only                          OC-2 (park + release)
copy → waiting_for_cutover         copy → waiting_for_cutover
  → block for manual cutover         → release the claim
  → timeout → cancel  ✗              → op row persisted waiting_for_cutover
                                     → OC-3 cutover claim drives it later ✓

  1. Auto-defer gate — effectiveCopyDriveOptions turns on defer-cutover only for multiOperation && cutover_policy == barrier; the manual --defer-cutover option stays authoritative. Threaded as an execution-only value, never persisted onto the apply.
  2. Park & release — the atomic drive exits at waiting_for_cutover (no auto-cutover, no manual-cutover timeout/cancel) so the operator can persist the row and free the claim.
  3. Persist parked — persistOperationState records waiting_for_cutover (completed_at nil, resumable).
  4. Copy-claim exemption — FindNextApplyOperation no longer re-leases a stale multi-op barrier row parked at the barrier; it is reserved for the cutover claim.
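The auto-defer gate in step 1 can be illustrated with a minimal sketch. The option key, policy constants, and function signatures below are assumptions made for illustration; they are not taken from the PR's actual `effectiveCopyDriveOptions` implementation. The sketch only shows the stated rule: defer-cutover is switched on for `multiOperation && cutover_policy == barrier`, a manually set defer-cutover is left in place, and the persisted options are never mutated.

```go
package main

import "fmt"

// Hypothetical option key and policy names; the real identifiers may differ.
const (
	optDeferCutover = "defer_cutover"
	policyBarrier   = "barrier"
	policyRolling   = "rolling"
)

// shouldAutoDeferCutover reports whether the copy drive should defer cutover
// automatically: only for a multi-operation apply under the barrier policy.
func shouldAutoDeferCutover(multiOperation bool, cutoverPolicy string) bool {
	return multiOperation && cutoverPolicy == policyBarrier
}

// effectiveOptions returns a copy of the persisted options with defer-cutover
// switched on when the gate applies. The input map is never mutated, so the
// decision stays execution-only and is not written back onto the apply.
// A defer-cutover value already set by the user is copied through unchanged.
func effectiveOptions(persisted map[string]string, multiOperation bool, cutoverPolicy string) map[string]string {
	out := make(map[string]string, len(persisted)+1)
	for k, v := range persisted {
		out[k] = v
	}
	if shouldAutoDeferCutover(multiOperation, cutoverPolicy) {
		out[optDeferCutover] = "true"
	}
	return out
}

func main() {
	persisted := map[string]string{}
	eff := effectiveOptions(persisted, true, policyBarrier)
	// The effective map carries the flag; the persisted map stays empty.
	fmt.Println(eff[optDeferCutover], len(persisted)) // prints "true 0"
}
```

Returning a fresh map rather than editing the persisted one is what keeps the decision from leaking onto the apply record in this sketch.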

Safety / scope

  • Gated on multiOperation && barrier, so single-op, rolling, and manual --defer-cutover are byte-for-byte unchanged (covered by tests).
  • Dormant in production: an apply owns one operation today, so the barrier branch is never taken until multi-deployment fan-out lands.
  • Local operator drive only. The gRPC remote drive runs a fresh apply with no operation context, so it does not yet park/release at the barrier — this is a deliberate follow-up (needs a transport signal carrying the per-operation barrier decision to the remote drive).
  • Sequencing: OC-3 (the cutover claim + ordered cutover drive) must land before the fan-out enablement config flip (tracker PR 11), otherwise parked operations would have no claimer.

Testing / validation

  • Unit: defer-cutover truth table; grouped-apply gating honours effective options; park-and-release exits at the barrier while manual-defer keeps polling; op-row persisted at waiting_for_cutover.
  • Integration: copy claim skips a stale multi-op barrier parked row, but still re-claims a single-op barrier parked row and a multi-op rolling parked row.
  • Full pkg/tern, pkg/api, pkg/storage/mysqlstore unit + integration suites and golangci-lint --new-from-rev origin/main pass.
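The scoping exercised by the integration tests can be written down as a small predicate. This is a sketch under stated assumptions: the `opRow` struct and its field names are hypothetical, and the real exemption lives in the MySQL store's claim query rather than in Go code like this. It only mirrors the three cases listed above.

```go
package main

import "fmt"

// opRow is a hypothetical, simplified view of an apply_operation row.
type opRow struct {
	State          string // e.g. "waiting_for_cutover"
	CutoverPolicy  string // e.g. "barrier" or "rolling"
	MultiOperation bool   // the apply owns more than one operation
	LeaseStale     bool   // the previous claim's lease has expired
}

// copyClaimMayReclaim reports whether the copy claim may re-lease a row.
// Only a stale row is eligible, and a multi-op barrier row parked at
// waiting_for_cutover is exempt because it is reserved for the cutover claim.
func copyClaimMayReclaim(r opRow) bool {
	if !r.LeaseStale {
		return false
	}
	parkedAtBarrier := r.State == "waiting_for_cutover" &&
		r.CutoverPolicy == "barrier" &&
		r.MultiOperation
	return !parkedAtBarrier
}

func main() {
	rows := []opRow{
		{"waiting_for_cutover", "barrier", true, true},  // skipped
		{"waiting_for_cutover", "barrier", false, true}, // re-claimed
		{"waiting_for_cutover", "rolling", true, true},  // re-claimed
	}
	for _, r := range rows {
		fmt.Println(copyClaimMayReclaim(r))
	}
}
```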

References

…er barrier

OC-2 of the ordered-cutover workstream. An operation of a multi-deployment
apply under the barrier policy now auto-defers its cutover, releases the copy
drive at waiting_for_cutover, and is persisted parked for the deployment-ordered
cutover claim (OC-3) — instead of inline-driving to completed or blocking for a
manual cutover. The copy claim no longer re-leases a parked multi-op barrier row.

Scoped to multi-op barrier operations, so single-op, rolling, and manual
--defer-cutover behaviour is unchanged. Dormant until fan-out lands (one op per
apply today). Local operator drive only; the gRPC remote drive parks via a
later transport change. OC-3 must land before fan-out enablement (tracker PR 11).
Copilot AI review requested due to automatic review settings June 17, 2026 23:58

Copilot AI left a comment

Pull request overview

This PR updates SchemaBot’s local Tern/operator execution path so that, under the barrier cutover policy for multi-deployment applies, an operation’s copy phase can stop at waiting_for_cutover and release its claim—enabling a later, deployment-ordered cutover claim (OC-3) to drive the high-risk swap.

Changes:

  • Threaded “effective” execution options through the local apply/resume/polling stack so operation-scoped drives can auto-defer cutover without persisting that decision onto the apply.
  • Added “park + release” behavior at waiting_for_cutover for eligible multi-op barrier operations, including timeout/cancel exemptions and operator-side persistence of the parked operation state.
  • Updated MySQL store claiming to avoid re-leasing stale parked multi-op barrier rows (reserved for cutover-claim), with new unit/integration coverage.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/tern/local_control_resume.go | Operation-scoped resume now computes effective options and can release at the cutover barrier. |
| pkg/tern/local_client.go | Fresh Apply dispatch explicitly does not auto-park (no operation context). |
| pkg/tern/local_client_test.go | Updates call signatures and adds a unit test for “release at cutover barrier” behavior. |
| pkg/tern/local_client_integration_test.go | Updates atomic polling call signatures for integration coverage. |
| pkg/tern/local_apply.go | Grouped-apply mode selection now uses the effective options map. |
| pkg/tern/local_apply_grouped.go | Threads options + barrier-release through grouped execution and atomic polling; adds parking behavior. |
| pkg/tern/cutover_barrier.go | Introduces helpers to decide auto-defer and compute effective copy-drive options. |
| pkg/tern/cutover_barrier_test.go | Unit tests for auto-defer truth table and grouped-apply gating using effective options. |
| pkg/storage/mysqlstore/apply_operations.go | Adds stale-active exemption so copy-claim won’t re-lease parked multi-op barrier rows. |
| pkg/storage/mysqlstore/apply_operations_test.go | Integration tests covering the new stale-parked exemption scoping. |
| pkg/api/operator.go | Persists waiting_for_cutover on apply_operation rows (non-terminal, resumable). |
| pkg/api/operator_test.go | Unit test ensuring parked operations are persisted via UpdateState (not terminalized). |

Comment thread pkg/tern/local_control_resume.go
Comment thread pkg/tern/local_control_resume.go
Kiran01bm and others added 3 commits June 18, 2026 10:29
Derive the barrier claim-release decision separately from the auto-defer
decision: when an apply was started with manual --defer-cutover, hold the claim
and poll for a manual cutover (documented contract) instead of releasing at the
barrier. The cutover is still deferred either way.
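The split between the two decisions can be sketched as follows. The `barrierDecision` type and `decide` function are hypothetical names chosen for illustration, not the helpers in the commit; the sketch assumes only what the message states: cutover is deferred in both cases, but the claim is released at the barrier only when the deferral was automatic.

```go
package main

import "fmt"

// barrierDecision is a hypothetical pair of flags for one copy drive.
type barrierDecision struct {
	DeferCutover bool // the drive does not perform the cutover inline
	ReleaseClaim bool // the drive exits at waiting_for_cutover and frees the claim
}

// decide derives both flags independently. A manual --defer-cutover keeps the
// claim held (the drive polls for a manual cutover), while an automatic
// multi-op barrier deferral releases it.
func decide(manualDefer, multiOperation bool, cutoverPolicy string) barrierDecision {
	auto := multiOperation && cutoverPolicy == "barrier"
	return barrierDecision{
		DeferCutover: manualDefer || auto,
		ReleaseClaim: auto && !manualDefer,
	}
}

func main() {
	fmt.Printf("%+v\n", decide(false, true, "barrier")) // auto-defer: park and release
	fmt.Printf("%+v\n", decide(true, true, "barrier"))  // manual defer: hold and poll
	fmt.Printf("%+v\n", decide(false, false, "barrier")) // single-op: inline drive
}
```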

When tasks exist but the apply_operation row is missing, return a distinct
ErrApplyOperationRowMissing that wraps ErrNoTasksForApplyOperation, so the
fail-closed errors.Is handling is preserved while the message reads accurately.
Kiran01bm marked this pull request as ready for review June 18, 2026 01:49
Kiran01bm requested review from aparajon and morgo as code owners June 18, 2026 01:49
Kiran01bm merged commit 6137a30 into main Jun 18, 2026
29 checks passed
Kiran01bm deleted the kiran01bm/park-copy-at-barrier-oc2 branch June 18, 2026 02:04