
feat(tern): store remote apply id per operation for fan-out applies#399

Merged
Kiran01bm merged 3 commits into main from kiran01bm/fan-out-per-op-remote-resume
Jun 17, 2026
Conversation

@Kiran01bm

Collaborator

What

Track the remote engine apply id per operation for multi-deployment
(fan-out) applies instead of on the parent apply.

  • Multi-op applies persist each deployment's remote id on
    apply_operations.engine_resume_context; single-op and local applies keep
    using applies.external_id (unchanged behavior).
  • Every remote control/progress RPC (Apply/Start/Stop/Cutover/Progress)
    resolves its apply id from the operation's scope.
  • The remote id is persisted before the parent state update so a crash
    mid-dispatch resumes the exact remote operation rather than dispatching a
    duplicate.
  • Apply-scoped controls that carry no operation context fail closed when
    the apply spans more than one operation.
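
The write ordering in the third bullet can be sketched as below. This is a minimal illustration, not the code from this PR: `opRow`, `store`, `dispatchOp`, and the `remoteApply` callback are hypothetical names, the store is in-memory, and the remote id is assumed to be a plain string.

```go
package main

import (
	"errors"
	"fmt"
)

// opRow is a hypothetical stand-in for one apply_operations row.
type opRow struct {
	EngineResumeContext string // remote engine apply id; empty until dispatched
}

// store is a hypothetical in-memory stand-in for the persistence layer.
type store struct {
	ops         map[int64]*opRow
	parentState string
}

// dispatchOp records the remote id on the operation row before it updates the
// parent state. A retry after a crash between the two writes finds the stored
// id and resumes instead of dispatching a duplicate.
func dispatchOp(s *store, opID int64, remoteApply func() (string, error)) (string, error) {
	op, ok := s.ops[opID]
	if !ok {
		return "", errors.New("operation row missing")
	}
	if op.EngineResumeContext == "" {
		id, err := remoteApply()
		if err != nil {
			return "", err
		}
		op.EngineResumeContext = id // write 1: per-operation remote id
	}
	s.parentState = "running" // write 2: parent state, always after write 1
	return op.EngineResumeContext, nil
}

func main() {
	s := &store{ops: map[int64]*opRow{7: {}}}
	calls := 0
	remote := func() (string, error) {
		calls++
		return fmt.Sprintf("remote-%d", calls), nil
	}
	first, _ := dispatchOp(s, 7, remote)
	second, _ := dispatchOp(s, 7, remote) // simulates a retry after a crash
	fmt.Println(first, second, calls)     // prints: remote-1 remote-1 1
}
```

The second call returns the stored id without calling the remote engine again, which is the "resume the exact remote operation" behavior described above.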

Why

A fan-out apply dispatches each deployment to its own remote engine apply, so
there is no single authoritative remote id to hang on the parent
applies.external_id. Sharing one parent id is ambiguous: it lets one
deployment's resume/stop/progress call act on another deployment's remote
apply, or route a stale/empty id to the wrong target. Scoping the remote id to
the operation makes each deployment independently resumable and controllable,
and failing closed avoids guessing an id when the context is ambiguous.

Before / After

                 remote engine apply id
                 ──────────────────────
 BEFORE (parent-scoped)            AFTER (per-operation)
 ╭───────────────────╮            ╭───────────────────╮
 │ applies           │            │ applies           │
 │  external_id ◄──┐ │            │  external_id      │  single-op / local only
 ╰───────────────────╯            ╰───────────────────╯
        ▲ all deployments                 ▲
        │ share ONE id                    │ single-op / local
    AMBIGUOUS for fan-out          ╭──────────────────────────────╮
   (stop/resume/progress can       │ apply_operations             │
    hit the wrong deployment)      │  engine_resume_context ◄──┐  │
                                   ╰──────────────────────────────╯
                                              ▲ one id per deployment
                                              │ independently resumable

 Apply-scoped control, no op context:
   single-op  ─▶ use parent external_id
   multi-op   ─▶ FAIL CLOSED (no single authoritative id)
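
The apply-scoped rule in the diagram reduces to a small decision, sketched here under stated assumptions: `resolveApplyScoped` and `ErrAmbiguousRemoteID` are hypothetical names, and the operation list is passed in as a slice of ids rather than looked up from storage.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAmbiguousRemoteID is a hypothetical sentinel error for the fail-closed case.
var ErrAmbiguousRemoteID = errors.New("apply spans multiple operations; no single remote apply id")

// resolveApplyScoped returns the remote id for a control that carries no
// operation context. With more than one operation there is no single
// authoritative id, so it refuses instead of guessing.
func resolveApplyScoped(parentExternalID string, opIDs []int64) (string, error) {
	if len(opIDs) > 1 {
		return "", ErrAmbiguousRemoteID
	}
	return parentExternalID, nil
}

func main() {
	id, err := resolveApplyScoped("ext-1", []int64{10})
	fmt.Println(id, err)

	id, err = resolveApplyScoped("ext-1", []int64{10, 11})
	fmt.Println(id == "", err)
}
```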

Safety

Fails closed on: deployment mismatch, operation not in the apply's set, an
operation that resolves to no tasks, overwriting an existing different remote
id, and apply-scoped remote controls on multi-operation applies.
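
A rough sketch of the first four checks follows. The type, function, and error names are hypothetical, and treating a repeated write of the same id as a no-op is an assumption drawn from the phrase "existing different remote id".

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, one per fail-closed condition.
var (
	ErrDeploymentMismatch  = errors.New("deployment mismatch")
	ErrOperationNotInApply = errors.New("operation not in the apply's set")
	ErrNoTasks             = errors.New("operation resolves to no tasks")
	ErrRemoteIDConflict    = errors.New("operation already has a different remote id")
)

// operation is a hypothetical, simplified view of an apply_operations row.
type operation struct {
	ID         int64
	Deployment string
	RemoteID   string
	TaskCount  int
}

// checkScope rejects an operation that does not belong to the apply, targets
// another deployment, or has no tasks to drive.
func checkScope(op operation, wantDeployment string, applyOpIDs map[int64]bool) error {
	switch {
	case !applyOpIDs[op.ID]:
		return ErrOperationNotInApply
	case op.Deployment != wantDeployment:
		return ErrDeploymentMismatch
	case op.TaskCount == 0:
		return ErrNoTasks
	}
	return nil
}

// recordRemoteID stores id on the operation. Rewriting the same id is a no-op
// (assumption); replacing a different id is refused.
func recordRemoteID(op *operation, id string) error {
	if op.RemoteID != "" && op.RemoteID != id {
		return ErrRemoteIDConflict
	}
	op.RemoteID = id
	return nil
}

func main() {
	op := operation{ID: 1, Deployment: "east", TaskCount: 2}
	fmt.Println(checkScope(op, "west", map[int64]bool{1: true}))

	err := recordRemoteID(&op, "remote-1")
	fmt.Println(err)
	err = recordRemoteID(&op, "remote-2")
	fmt.Println(err)
}
```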

Copilot AI review requested due to automatic review settings June 17, 2026 04:07

Copilot AI left a comment

Pull request overview

This PR updates the Tern routing + remote-drive logic so multi-deployment (“fan-out”) applies persist and resolve the remote engine apply ID per operation (via apply_operations.engine_resume_context) instead of relying on the single parent applies.external_id, and makes apply-scoped remote controls fail closed when the apply spans multiple operations.

Changes:

  • Add operation-listing support to routing lookups and fail closed for apply-scoped remote controls when an apply spans >1 operation.
  • Introduce an operation-aware applyTaskScope in the gRPC client to read/write the remote apply ID from the correct storage location (parent vs operation).
  • Extend tests to cover fail-closed apply-scoped progress for multi-op remote applies and per-operation remote ID persistence.
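
One way to picture the parent-versus-operation storage choice is the sketch below. It is not the `applyTaskScope` from this PR, whose definition does not appear on this page: `remoteIDScope`, `parentApply`, and `operationRow` are hypothetical, and `engine_resume_context` is assumed to hold the remote id as a plain string.

```go
package main

import "fmt"

// parentApply and operationRow are hypothetical stand-ins for the applies and
// apply_operations rows.
type parentApply struct{ ExternalID string }
type operationRow struct{ EngineResumeContext string }

// remoteIDScope picks where the remote apply id lives. op is nil for
// single-op and local applies, which keep using the parent row.
type remoteIDScope struct {
	parent *parentApply
	op     *operationRow
}

// remoteID reads the id from the operation when one is in scope.
func (s remoteIDScope) remoteID() string {
	if s.op != nil {
		return s.op.EngineResumeContext
	}
	return s.parent.ExternalID
}

// setRemoteID writes the id to the same location remoteID reads from.
func (s remoteIDScope) setRemoteID(id string) {
	if s.op != nil {
		s.op.EngineResumeContext = id
		return
	}
	s.parent.ExternalID = id
}

func main() {
	parent := &parentApply{}
	opA, opB := &operationRow{}, &operationRow{}

	remoteIDScope{parent: parent, op: opA}.setRemoteID("remote-a")
	remoteIDScope{parent: parent, op: opB}.setRemoteID("remote-b")

	// Each deployment keeps its own id; the parent row stays untouched.
	fmt.Printf("%q %q %q\n", opA.EngineResumeContext, opB.EngineResumeContext, parent.ExternalID)
}
```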

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/tern/routing_client.go | Adds apply-scoped remote ID resolution that fails closed for multi-operation applies; requires operation listing for remote routing decisions. |
| pkg/tern/routing_client_test.go | Adds ListByApply test support and a regression test for fail-closed apply-scoped Progress on multi-op remote applies. |
| pkg/tern/grpc_client.go | Implements operation-scoped remote ID routing/persistence via applyTaskScope and updates dispatch/control paths to use scope-resolved remote IDs. |
| pkg/tern/grpc_client_test.go | Adds an in-memory ApplyOperation store + tests ensuring multi-op dispatch stores remote IDs on the operation row (not the parent). |

Comment thread pkg/tern/grpc_client.go Outdated
A multi-deployment apply has no single authoritative remote engine apply
id — each deployment is dispatched independently. Persist each operation's
remote id on apply_operations.engine_resume_context and route every remote
control/progress call through it, leaving applies.external_id for the
single-operation path. Apply-scoped controls without an operation context
fail closed when the apply spans more than one operation.
@Kiran01bm Kiran01bm force-pushed the kiran01bm/fan-out-per-op-remote-resume branch from 3ed8aa1 to 860353e Compare June 17, 2026 04:51
The operator claims an operation (pending→running) before the drive runs, so a
freshly claimed multi-operation deployment reached dispatch already running but
with no per-operation remote id. It was wrongly treated as ambiguous and failed
before its first remote dispatch. Recognise a running operation with an empty
operation-scoped remote id as a legitimate first dispatch, and make the
ambiguity messages and start-failure logs report the scope's remote apply id
instead of the parent external_id so multi-op triage is accurate.
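
The first-dispatch rule from this commit message can be sketched as a small classifier. The names are hypothetical, the state strings "pending" and "running" follow the commit message, and refusing every other combination is an assumption made to keep the sketch fail-closed.

```go
package main

import "fmt"

// action is a hypothetical label for what the drive should do next.
type action string

const (
	actionDispatch action = "dispatch" // first remote dispatch
	actionResume   action = "resume"   // reuse the stored remote id
	actionRefuse   action = "refuse"   // fail closed
)

// classify decides the next step from the operation's state and its
// operation-scoped remote id.
func classify(opState, opRemoteID string) action {
	switch {
	case opRemoteID == "" && (opState == "pending" || opState == "running"):
		// A claimed operation is already "running" before the drive runs,
		// so an empty id here means it simply has not been dispatched yet.
		return actionDispatch
	case opRemoteID != "" && opState == "running":
		return actionResume
	default:
		// Assumption: anything else is refused rather than guessed at.
		return actionRefuse
	}
}

func main() {
	fmt.Println(classify("running", ""))         // prints: dispatch
	fmt.Println(classify("running", "remote-1")) // prints: resume
}
```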
@Kiran01bm Kiran01bm marked this pull request as ready for review June 17, 2026 06:00
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 17, 2026 06:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d50d86028

Comment thread pkg/tern/grpc_client.go
A multi-deployment apply shares one durable stop request across deployments.
Stopping a claimed-but-undispatched operation must terminalize only that
operation and keep the apply-level request pending so dispatched siblings still
observe the stop; also fail closed when an operation row is missing instead of
panicking.
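
The stop behavior described in this commit message might look roughly like the sketch below. Everything here is illustrative: `opRec`, `applyRec`, `stopOperation`, and `ErrOperationMissing` are hypothetical names, and "stopped" is an assumed name for the terminal state.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOperationMissing is a hypothetical sentinel for a missing operation row.
var ErrOperationMissing = errors.New("operation row missing")

// opRec and applyRec are hypothetical in-memory stand-ins for the stored rows.
type opRec struct {
	State    string // "running", "stopped", ... ("stopped" is an assumed name)
	RemoteID string // empty while claimed but not yet dispatched
}

type applyRec struct {
	StopRequested bool // the one durable stop request shared by all deployments
	Ops           map[int64]*opRec
}

// stopOperation stops a single operation of a fan-out apply.
func stopOperation(a *applyRec, opID int64, remoteStop func(remoteID string) error) error {
	o, ok := a.Ops[opID]
	if !ok {
		return ErrOperationMissing // fail closed instead of dereferencing nil
	}
	if o.RemoteID == "" {
		// Claimed but undispatched: terminalize only this operation and leave
		// a.StopRequested set so dispatched siblings still observe the stop.
		o.State = "stopped"
		return nil
	}
	return remoteStop(o.RemoteID)
}

func main() {
	a := &applyRec{StopRequested: true, Ops: map[int64]*opRec{
		1: {State: "running"},
		2: {State: "running", RemoteID: "remote-2"},
	}}
	stop := func(id string) error { fmt.Println("remote stop:", id); return nil }

	err := stopOperation(a, 1, stop)
	fmt.Println(err, a.Ops[1].State, a.StopRequested)
	err = stopOperation(a, 2, stop)
	fmt.Println(err)
	err = stopOperation(a, 3, stop)
	fmt.Println(err)
}
```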
@Kiran01bm Kiran01bm merged commit 0ad1677 into main Jun 17, 2026
29 checks passed
@Kiran01bm Kiran01bm deleted the kiran01bm/fan-out-per-op-remote-resume branch June 17, 2026 21:09