feat(tern): store remote apply id per operation for fan-out applies#399
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the Tern routing + remote-drive logic so multi-deployment (“fan-out”) applies persist and resolve the remote engine apply ID per operation (via apply_operations.engine_resume_context) instead of relying on the single parent applies.external_id, and makes apply-scoped remote controls fail closed when the apply spans multiple operations.
Changes:
- Add operation-listing support to routing lookups and fail closed for apply-scoped remote controls when an apply spans >1 operation.
- Introduce an operation-aware
applyTaskScopein the gRPC client to read/write the remote apply ID from the correct storage location (parent vs operation). - Extend tests to cover fail-closed apply-scoped progress for multi-op remote applies and per-operation remote ID persistence.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/tern/routing_client.go | Adds apply-scoped remote ID resolution that fails closed for multi-operation applies; requires operation listing for remote routing decisions. |
| pkg/tern/routing_client_test.go | Adds ListByApply test support and a regression test for fail-closed apply-scoped Progress on multi-op remote applies. |
| pkg/tern/grpc_client.go | Implements operation-scoped remote ID routing/persistence via applyTaskScope and updates dispatch/control paths to use scope-resolved remote IDs. |
| pkg/tern/grpc_client_test.go | Adds an in-memory ApplyOperation store + tests ensuring multi-op dispatch stores remote IDs on the operation row (not the parent). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
A multi-deployment apply has no single authoritative remote engine apply id — each deployment is dispatched independently. Persist each operation's remote id on apply_operations.engine_resume_context and route every remote control/progress call through it, leaving applies.external_id for the single-operation path. Apply-scoped controls without an operation context fail closed when the apply spans more than one operation.
3ed8aa1 to
860353e
Compare
The operator claims an operation pending→running before the drive runs, so a freshly claimed multi-operation deployment reached dispatch already running with no per-operation remote id and was wrongly treated as ambiguous and failed before its first remote dispatch. Recognise a running operation with an empty operation-scoped remote id as a legitimate first dispatch, and make the ambiguity messages and start-failure logs report the scope's remote apply id instead of the parent external_id so multi-op triage is accurate.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d50d86028
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
A multi-deployment apply shares one durable stop request across deployments. Stopping a claimed-but-undispatched operation must terminalize only that operation and keep the apply-level request pending so dispatched siblings still observe the stop; also fail closed when an operation row is missing instead of panicking.
What
Track the remote engine apply id per operation for multi-deployment
(fan-out) applies instead of on the parent apply.
apply_operations.engine_resume_context; single-op and local applies keepusing
applies.external_id(unchanged behavior).resolves its apply id from the operation's scope.
mid-dispatch resumes the exact remote operation rather than dispatching a
duplicate.
the apply spans more than one operation.
Why
A fan-out apply dispatches each deployment to its own remote engine apply, so
there is no single authoritative remote id to hang on the parent
applies.external_id. Sharing one parent id is ambiguous: it lets onedeployment's resume/stop/progress call act on another deployment's remote
apply, or route a stale/empty id to the wrong target. Scoping the remote id to
the operation makes each deployment independently resumable and controllable,
and failing closed avoids guessing an id when the context is ambiguous.
Before / After