
feat(orchestrator): direct peer-to-peer chunk transfer #2020

Closed

dobrac wants to merge 0 commits into main from dobrac/orchestrator-direct-chunk-transfer

Conversation


@dobrac dobrac commented Feb 27, 2026

Summary

Implement ChunkService gRPC endpoint to enable sandboxes resumed on different orchestrators to fetch template data directly from the source orchestrator before GCS upload completes. Includes peer-first storage wrappers with automatic GCS fallback.

Changes

  • Proto & gRPC: Added ChunkService with GetBuildFileInfo/GetBuildFile RPCs for serving all template file types (snapfile, metadata, memfile/rootfs headers, diffs)
  • ChunkService handler (chunks.go, chunks_test.go): Serve files from local template cache; handle diff files via DiffStore with range reads; handle blobs via file paths
  • Storage wrappers: PeerBlob and PeerSeekable implement Blob/Seekable interfaces; try peer gRPC first, silently fall back to GCS on error/unavailable
  • PeerStorageProvider: Transparently inject peer access for a specific build's paths; delegate other paths to base provider
  • Template integration: Extend Template interface with MetadataFile(); expose GetCachedTemplate() and BuildStore(); wire peerStorageProvider into GetTemplate()
  • API integration: Pass source orchestrator address from origin node to destination on resume (already passed from PlaceSandbox)
  • Testing: Mock ChunkServiceClient; comprehensive unit tests for handler logic and peer fallback scenarios using testify mocks

Behavior

On resume to a different node: destination receives source address → wraps storage to try peer first → uses peer data while GCS upload completes asynchronously → falls back to GCS transparently if peer unavailable.


Note

Low Risk
The provided diff contains no actual file changes, so functional risk is minimal. Main risk is that the diff may be incomplete/misrendered, so reviewers should verify the PR actually includes the intended changes.

Overview
No effective code changes are visible in the provided diff (it only shows +++ /dev/null with no associated file content or hunks).

If this PR is expected to add peer-to-peer chunk transfer functionality, the diff appears incomplete/misrendered and should be rechecked before review/merge.

Written by Cursor Bugbot for commit 35dcf12. This will update automatically on new commits.

// getOrDialPeer returns a cached gRPC connection to the given address, dialing if needed.
func (c *Cache) getOrDialPeer(address string) (*grpc.ClientConn, error) {
	if conn, ok := c.peerConns.Load(address); ok {
		return conn.(*grpc.ClientConn), nil
	}

Security: Peer address is not validated and the connection is plaintext.

grpc.NewClient is called with the address string directly from SandboxConfig.source_orchestrator_address, which originates from the API layer. If node.IPAddress is ever set to a loopback, link-local (e.g. 169.254.169.254), or other internal service address, the orchestrator will dial it. Consider validating that address resolves to an expected IP range (e.g. private RFC-1918 subnet of the cluster) before dialing.

Additionally, the connection uses insecure.NewCredentials() (plaintext), meaning snapshot data (VM memory images, rootfs) is transmitted unencrypted between nodes. This is consistent with the existing internal gRPC patterns, but worth documenting as a deliberate trade-off given the sensitivity of the data.
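
A minimal sketch of the suggested validation, assuming the cluster's peers live on private (RFC 1918) subnets; the function name `validatePeerAddress` and the exact policy are illustrative, not part of the PR:

```go
package main

import (
	"fmt"
	"net"
)

// validatePeerAddress checks that host:port resolves only to private
// addresses before the orchestrator dials it, rejecting loopback,
// link-local (e.g. 169.254.169.254), and public addresses.
func validatePeerAddress(address string) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return fmt.Errorf("invalid peer address %q: %w", address, err)
	}
	ips, err := net.LookupIP(host)
	if err != nil {
		return fmt.Errorf("resolving peer %q: %w", host, err)
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return fmt.Errorf("peer %q resolves to disallowed address %s", host, ip)
		}
		if !ip.IsPrivate() {
			return fmt.Errorf("peer %q resolves to non-private address %s", host, ip)
		}
	}
	return nil
}
```

`net.IP.IsPrivate` (Go 1.17+) covers the RFC 1918 ranges and IPv6 ULAs; the check would run once in getOrDialPeer before `grpc.NewClient`.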


return &orchestrator.GetBuildFileInfoResponse{TotalSize: size}, nil
}


Race: template can be evicted between GetCachedTemplate and use of its files.

GetCachedTemplate returns the template value, but does not pin or refresh it in the TTL cache. Between this call and the subsequent t.Memfile() / t.Rootfs() / t.Snapfile() call, the TTL can expire and the eviction callback can call template.Close(ctx) concurrently. For Memfile and Rootfs, which return open block devices, this is a use-after-close race.

The same race exists in GetBuildFile (line 148). Consider having GetCachedTemplate call cache.Get(buildID) with ttlcache.WithDisableTouchOnHit(false) (already default) and checking that the item's TTL is not about to expire, or use a reference-counting mechanism to keep the template alive for the duration of the RPC.
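
The reference-counting option can be sketched as follows. All names (`pinnedTemplate`, `Acquire`, `Release`, `Evict`) are hypothetical; the idea is that the RPC pins the template and the TTL cache's eviction callback defers `Close` until the last pin is released:

```go
package main

import (
	"errors"
	"sync"
)

// pinnedTemplate defers eviction while an RPC holds a reference,
// preventing the use-after-close race described above.
type pinnedTemplate struct {
	mu      sync.Mutex
	refs    int
	evicted bool   // eviction was requested while pinned
	closed  bool
	close   func() // stands in for template.Close
}

// Acquire pins the template for the duration of an RPC.
func (p *pinnedTemplate) Acquire() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return errors.New("template already closed")
	}
	p.refs++
	return nil
}

// Release undoes Acquire; the last release performs a deferred eviction.
func (p *pinnedTemplate) Release() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.refs--
	if p.refs == 0 && p.evicted && !p.closed {
		p.closed = true
		p.close()
	}
}

// Evict is called by the TTL cache's eviction callback; it only closes
// once no RPC holds a pin.
func (p *pinnedTemplate) Evict() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.refs > 0 {
		p.evicted = true
		return
	}
	if !p.closed {
		p.closed = true
		p.close()
	}
}
```

GetBuildFileInfo and GetBuildFile would then `Acquire` after the cache lookup and `defer Release()` before touching Memfile/Rootfs/Snapfile.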

if h == nil {
return &orchestrator.GetBuildFileInfoResponse{NotAvailable: true}, nil
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance: header is fully serialized in GetBuildFileInfo only to compute its length, then serialized again in GetBuildFile.

header.Serialize(h.Metadata, h.Mapping) is called in GetBuildFileInfo just to measure len(data), and again in GetBuildFile to produce the actual payload. This doubles the allocation and CPU work for every header prefetch. The same issue exists for the rootfs.ext4.header case.

Options: cache the serialized bytes on the device/header object, add a SerializedSize() helper that computes size without allocation, or fold GetBuildFileInfo and GetBuildFile into a single round-trip for header files.
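
The first option (caching the serialized bytes) can be sketched with `sync.Once`; `cachedHeader` and its `serialize` field are illustrative stand-ins for the device/header object and `header.Serialize`:

```go
package main

import "sync"

// cachedHeader serializes the header at most once and reuses the bytes
// for both GetBuildFileInfo (size only) and GetBuildFile (payload).
type cachedHeader struct {
	once      sync.Once
	data      []byte
	err       error
	serialize func() ([]byte, error) // stands in for header.Serialize
}

// bytes returns the serialized header, computing it on first use.
func (c *cachedHeader) bytes() ([]byte, error) {
	c.once.Do(func() {
		c.data, c.err = c.serialize()
	})
	return c.data, c.err
}

// Size is what GetBuildFileInfo needs; it shares the one serialization.
func (c *cachedHeader) Size() (int64, error) {
	data, err := c.bytes()
	if err != nil {
		return 0, err
	}
	return int64(len(data)), nil
}
```

A subsequent GetBuildFile call to `bytes()` returns the same slice without re-serializing, halving the allocation and CPU cost per header prefetch.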


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@dobrac dobrac closed this Mar 2, 2026
@dobrac dobrac force-pushed the dobrac/orchestrator-direct-chunk-transfer branch from 2264a9b to 35dcf12 on March 2, 2026 00:15