Add S3-style multi-part upload mechanism for large file parallel upload #1801
mishushakov wants to merge 62 commits into main
Conversation
Add endpoints for chunked file uploads to support large files:
- POST /files/upload/init - Initialize upload session
- PUT /files/upload/{uploadId} - Upload individual parts
- POST /files/upload/{uploadId}/complete - Assemble final file
- DELETE /files/upload/{uploadId} - Abort and cleanup
Parts are stored in temp directory and assembled sequentially
on completion using copy_file_range for efficiency.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove fsync in Complete (was ~40-100ms for 100MB)
- Move temp directory cleanup to background goroutine

For sandbox use cases, immediate durability is not critical. The kernel will flush data to disk eventually.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add POST/files/upload/ to allowed path prefixes for auth bypass
- Add max session limit (100) to prevent resource exhaustion
- Fix race condition in complete handler by copying parts under lock
- Add logging for background cleanup errors
- Add startup cleanup of stale temp directories
- Fix test bug with part number string conversion
- Add test for max sessions limit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…s issues

Security fixes:
- Validate uploadId is a valid UUID to prevent path traversal attacks
- Add max part size limit (100MB) to prevent DoS via disk exhaustion
- Validate part numbers are non-negative

Robustness fixes:
- Add completed flag to prevent race between part uploads and complete/abort
- Clean up destination file on assembly errors
- Validate parts are contiguous (0, 1, 2, ..., n-1) before assembly
- Warn on duplicate part number uploads (last write wins)
- Add session TTL (1 hour) with background cleanup goroutine
- Explicit destFile.Close() before marking success

Features:
- Add ETag header (MD5 hash) for uploaded parts
- Add CreatedAt timestamp to sessions for TTL tracking

Tests:
- Invalid upload ID format (path traversal)
- Negative part numbers
- Missing parts in sequence
- Upload part after complete started (conflict)
- Part size limit exceeded
- ETag header verification

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ETag was being computed and returned but never validated or used. Remove the unnecessary MD5 hash computation and header to simplify the code and improve performance. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace temp file based multipart upload with direct writes to destination:
- Add totalSize and partSize to init request (breaking API change)
- Create and preallocate destination file at init time
- Write parts directly to file at computed offsets using WriteAt
- Eliminate temp directory, assembly phase, and copy operations
- Keep open file handle across session for concurrent part writes

This removes multiple layers of I/O overhead:
- No temp file creation per part
- No reading temp files back during assembly
- No sequential copy loop
- Single file open/close cycle

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add blank lines before return statements (nlreturn)
- Check json.Encode error return values (errchkjson)
- Use integer range syntax for Go 1.22+ (intrange)
- Add t.Parallel() to TestMultipartUploadRouting (paralleltest)
- Use assert.Empty() instead of assert.Equal() (testifylint)
- Fix import ordering (gci)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test was using "non-existent" as the upload ID, but the handler validates UUID format before checking session existence, causing a 400 instead of the expected 404. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The prefix-based matching for allowed paths caused "POST/files" to incorrectly match "POST/filesystem.Filesystem/*" Connect RPC endpoints, allowing unauthenticated access to filesystem operations on secure sandboxes.

Split allowed paths into exact matches (for static paths like /files) and prefix matches (for paths with dynamic segments like /files/upload/).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
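A minimal sketch of the split matching logic, assuming illustrative table names (allowedExact, allowedPrefixes) rather than the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"strings"
)

// Static paths get exact matches; only paths with dynamic segments
// (like the uploadId in /files/upload/{uploadId}) use prefix matching.
var (
	allowedExact    = map[string]bool{"POST/files": true}
	allowedPrefixes = []string{"POST/files/upload/", "PUT/files/upload/"}
)

func isAllowed(key string) bool {
	if allowedExact[key] {
		return true
	}
	for _, p := range allowedPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isAllowed("POST/files"))                      // true
	fmt.Println(isAllowed("PUT/files/upload/abc"))            // true
	fmt.Println(isAllowed("POST/filesystem.Filesystem/Stat")) // false: no longer matched by a bare "POST/files" prefix
}
```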
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Explicitly clean paths returned by ExpandAndResolve to remove any .. or . components. This addresses the GitHub Advanced Security warning about path traversal, even though filepath.Abs already normalizes paths. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract path sanitization into a dedicated function that:
- Cleans the path using filepath.Clean
- Validates the result is an absolute path
- Rejects paths containing null bytes (path injection attack)

This breaks the taint chain for CodeQL security analysis by adding explicit validation before the path is used in file operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverts the sanitizePath changes to path.go as they are not needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move the mutex lock to only protect the PartsWritten map access instead of the entire WriteAt operation. WriteAt is safe for concurrent writes at different offsets, so parallel chunk uploads no longer serialize on disk I/O. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for gzip-compressed request bodies on POST /files and
PUT /files/upload/{uploadId} endpoints. Clients can now send compressed
uploads by setting Content-Encoding: gzip header.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep gzip Content-Encoding support only in multipart upload API. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Register a placeholder session (with completed=true) under the lock, then perform OpenFile/Truncate/Chown outside the lock to avoid blocking unrelated upload operations on slow filesystem calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
@claude review
…lete

Complete: keep session in map during Close+Rename to prevent a concurrent Init for the same path from starting before the rename finishes. Delete from map only after success (or after cleanup on error).

Delete: use RLock to look up the session, CAS under session.mu, then write-Lock only for the map deletion. This avoids holding the global write lock during the CAS, which blocked all concurrent RLock callers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
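The Delete-side locking order described above can be sketched roughly as follows, with illustrative type and field names; the compare-and-set on the session is modeled here as a mutex-guarded flag flip:

```go
package main

import (
	"fmt"
	"sync"
)

type mpSession struct {
	mu        sync.Mutex
	completed bool
}

type mpStore struct {
	mu       sync.RWMutex
	sessions map[string]*mpSession
}

func (s *mpStore) abort(id string) bool {
	// Read lock for the lookup: concurrent part uploads are not blocked.
	s.mu.RLock()
	sess, ok := s.sessions[id]
	s.mu.RUnlock()
	if !ok {
		return false
	}

	// Compare-and-set under the session's own mutex, not the global lock.
	sess.mu.Lock()
	already := sess.completed
	sess.completed = true
	sess.mu.Unlock()
	if already {
		return false // a Complete or another Delete won the race
	}

	// Write lock only for the map deletion itself.
	s.mu.Lock()
	delete(s.sessions, id)
	s.mu.Unlock()

	return true
}

func main() {
	store := &mpStore{sessions: map[string]*mpSession{"u1": {}}}
	fmt.Println(store.abort("u1")) // true
	fmt.Println(store.abort("u1")) // false: already removed
}
```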
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Allow io.ErrUnexpectedEOF to fall through to the size mismatch check, which correctly returns 400 Bad Request. Previously, a client sending fewer bytes than expectedSize was misclassified as a server error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 5 potential issues.
- Protect session initialization (DestFile/Parts) with session.mu for proper memory ordering instead of relying on atomic store alone
- Clean up session on response encoding failure to prevent permanent fd leak
- Check and log DestFile.Close() error in delete handler
- Check and log os.Remove() errors during init error paths to detect temp file leaks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor review

@claude review
…e paths

Fix two race conditions in the multipart upload handlers:
1. Init error path: set completed under session.mu and call wg.Wait() before destFile.Close() so in-flight part writes finish before the file descriptor is closed.
2. Complete missing-parts path: reset completed under session.mu to prevent a concurrent Complete from winning the CAS while the first Complete is still returning, which would cause two goroutines to race on Close/Rename.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    }

    // Ensure parent directories exist
    if err := permissions.EnsureDirs(filepath.Dir(filePath), uid, gid); err != nil {
permissions.EnsureDirs is called at line 130, before the authoritative session-limit check at line 157. If 101 concurrent init requests all pass the early optimistic check (RLock) and proceed to EnsureDirs, 101 sets of directories get created even though only 100 slots exist. The one that loses the write-lock race creates orphaned directories. This is a minor side effect but inconsistent with the comment "Check session limit early, before any file I/O"—EnsureDirs is file I/O that happens between the two checks.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
EnsureDirs was called before the write-lock session-limit check, meaning requests that would be rejected for exceeding the limit still created directories as a side effect. Move it after the session is registered so all file I/O happens only for requests that actually claim a slot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
…completion

When a part write completes successfully but a concurrent Complete has already set the completed flag, the handler was returning 409 even though the data was on disk and marked partComplete. Complete's parts scan would count the part as present and return 200, giving the client an inconsistent view (409 for part upload, 200 for complete).

Remove the post-write completed check so the client always gets 200 when its data was actually written, matching what Complete will report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    jsonError(w, http.StatusBadRequest, fmt.Errorf("upload would require %d parts, exceeding the maximum of %d (increase partSize)", numParts, maxNumParts))

        return
    }
numParts int overflow on 32-bit bypasses validation
Medium Severity
numParts is computed as int((body.TotalSize + body.PartSize - 1) / body.PartSize). For totalSize=10GB and partSize=1, the result is 10 billion, which overflows int on 32-bit systems. The overflowed value can be negative, so numParts > maxNumParts is false and the check is skipped. The upload then proceeds with a wrong NumParts, and for i := range session.NumParts iterates zero times, so the file is treated as complete without uploaded parts.
Compute numParts as int64 and validate against maxNumParts before casting to int. On 32-bit systems, the previous int() cast could overflow to a negative value for large totalSize/small partSize combinations, bypassing the cap check entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
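The overflow-safe computation might look like this; the maxNumParts value here is illustrative, not the PR's actual limit:

```go
package main

import "fmt"

const maxNumParts = 10000 // illustrative cap, not the PR's actual value

// numParts does the ceiling division in int64 and validates against the cap
// before narrowing to int, so the check cannot be bypassed by int overflow
// on 32-bit platforms.
func numParts(totalSize, partSize int64) (int, error) {
	n := (totalSize + partSize - 1) / partSize // still int64 here
	if n > maxNumParts {
		return 0, fmt.Errorf("upload would require %d parts, exceeding the maximum of %d", n, maxNumParts)
	}
	return int(n), nil // safe: n <= maxNumParts always fits in int
}

func main() {
	// 10 GB with 1-byte parts: ~10 billion parts, rejected before any cast.
	if _, err := numParts(10<<30, 1); err != nil {
		fmt.Println("rejected:", err)
	}

	// 10 GB in 8 MB parts is fine.
	n, _ := numParts(10<<30, 8<<20)
	fmt.Println(n) // 1280
}
```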
| "github.com/google/uuid" | ||
|
|
||
| "github.com/e2b-dev/infra/packages/envd/internal/execcontext" | ||
| "github.com/e2b-dev/infra/packages/envd/internal/logs" |
Sessions have no TTL or expiry. A client that calls POST /files/upload/init and then crashes or disconnects without completing or aborting will hold a preallocated temp file (up to maxTotalSize = 10 GB) and a session slot indefinitely — until the process restarts. With maxUploadSessions = 100, intentionally (or accidentally) abandoned sessions can exhaust all slots, causing every subsequent init to return 429.
A simple fix is to record a creation timestamp in multipartUploadSession and run a periodic background goroutine (e.g. every minute) that removes sessions older than some threshold (e.g. 30 minutes), closing and unlinking their temp files.
Without http.MaxBytesReader, the Go HTTP server buffers the entire request body before the handler can reject it. A client sending a significantly oversized body (e.g. 1 GB for a 100 MB part) consumes memory proportional to the excess. Wrap r.Body with MaxBytesReader(expectedSize+1) before io.CopyN so the server stops reading after the expected part size plus one byte (needed for the existing trailing-byte check). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Note
Medium Risk
Introduces new file-upload endpoints and in-memory session tracking with concurrent writes/renames, which can impact disk usage and correctness under race conditions if edge cases are missed.
Overview
Adds an S3-style multipart upload flow to envd so large files can be uploaded in parallel: a client can init an upload session (with limits on sessions/size/parts), PUT individual parts directly into a preallocated temp file, then either complete to atomically rename into place or abort to clean up. The OpenAPI spec and generated clients/servers are updated accordingly (including new 409/429/404 error responses), and a comprehensive test suite is added to cover ordering, missing parts, abort/complete behavior, and resource-limit enforcement.

Written by Cursor Bugbot for commit 1005e97. This will update automatically on new commits.