feat(orchestrator): direct peer-to-peer chunk transfer#2020
feat(orchestrator): direct peer-to-peer chunk transfer#2020
Conversation
| // getOrDialPeer returns a cached gRPC connection to the given address, dialing if needed. | ||
| func (c *Cache) getOrDialPeer(address string) (*grpc.ClientConn, error) { | ||
| if conn, ok := c.peerConns.Load(address); ok { | ||
| return conn.(*grpc.ClientConn), nil |
There was a problem hiding this comment.
Security: Peer address is not validated and the connection is plaintext.
grpc.NewClient is called with the address string directly from SandboxConfig.source_orchestrator_address, which originates from the API layer. If node.IPAddress is ever set to a loopback, link-local (e.g. 169.254.169.254), or other internal service address, the orchestrator will dial it. Consider validating that address resolves to an expected IP range (e.g. private RFC-1918 subnet of the cluster) before dialing.
Additionally, the connection uses insecure.NewCredentials() (plaintext), meaning snapshot data (VM memory images, rootfs) is transmitted unencrypted between nodes. This is consistent with the existing internal gRPC patterns, but worth documenting as a deliberate trade-off given the sensitivity of the data.
packages/orchestrator/internal/sandbox/template/peer_storage.go
Outdated
Show resolved
Hide resolved
packages/orchestrator/internal/sandbox/template/peer_storage.go
Outdated
Show resolved
Hide resolved
|
|
||
| return &orchestrator.GetBuildFileInfoResponse{TotalSize: size}, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
Race: template can be evicted between GetCachedTemplate and use of its files.
GetCachedTemplate returns the template value, but does not pin or refresh it in the TTL cache. Between this call and the subsequent t.Memfile() / t.Rootfs() / t.Snapfile() call, the TTL can expire and the eviction callback can call template.Close(ctx) concurrently. For Memfile and Rootfs, which return open block devices, this is a use-after-close race.
The same race exists in GetBuildFile (line 148). Consider having GetCachedTemplate call cache.Get(buildID) with ttlcache.WithDisableTouchOnHit(false) (already default) and checking that the item's TTL is not about to expire, or use a reference-counting mechanism to keep the template alive for the duration of the RPC.
| if h == nil { | ||
| return &orchestrator.GetBuildFileInfoResponse{NotAvailable: true}, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
Performance: header is fully serialized in GetBuildFileInfo only to compute its length, then serialized again in GetBuildFile.
header.Serialize(h.Metadata, h.Mapping) is called in GetBuildFileInfo just to measure len(data), and again in GetBuildFile to produce the actual payload. This doubles the allocation and CPU work for every header prefetch. The same issue exists for the rootfs.ext4.header case.
Options: cache the serialized bytes on the device/header object, add a SerializedSize() helper that computes size without allocation, or fold GetBuildFileInfo and GetBuildFile into a single round-trip for header files.
packages/orchestrator/internal/sandbox/template/peer_storage.go
Outdated
Show resolved
Hide resolved
packages/orchestrator/internal/sandbox/template/peer_storage.go
Outdated
Show resolved
Hide resolved
packages/orchestrator/internal/sandbox/template/peer_storage.go
Outdated
Show resolved
Hide resolved
packages/orchestrator/internal/sandbox/template/peer_storage_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
2264a9b to
35dcf12
Compare
Summary
Implement ChunkService gRPC endpoint to enable sandboxes resumed on different orchestrators to fetch template data directly from the source orchestrator before GCS upload completes. Includes peer-first storage wrappers with automatic GCS fallback.
Changes
Behavior
On resume to a different node: destination receives source address → wraps storage to try peer first → uses peer data while GCS upload completes asynchronously → falls back to GCS transparently if peer unavailable.
Note
Low Risk
The provided diff contains no actual file changes, so functional risk is minimal. Main risk is that the diff may be incomplete/misrendered, so reviewers should verify the PR actually includes the intended changes.
Overview
No effective code changes are visible in the provided diff (it only shows
+++ /dev/nullwith no associated file content or hunks).If this PR is expected to add peer-to-peer chunk transfer functionality, the diff appears incomplete/misrendered and should be rechecked before review/merge.
Written by Cursor Bugbot for commit 35dcf12. This will update automatically on new commits. Configure here.