Parallelize file uploads in fs cp command. #4132
base: main
Conversation
Commit: e720670
18 interesting tests: 7 FAIL, 4 RECOVERED, 3 KNOWN, 3 flaky, 1 SKIP
Top 31 slowest tests (at least 2 minutes):
shreyas-goenka left a comment
Critical Issue: Context Shadowing Bug
Line 61 in cp.go shadows the context variable from line 57, causing the defer cancel() to cancel the wrong context. This breaks proper cleanup and cancellation propagation.
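To make the flagged pattern concrete, here is a minimal, hypothetical sketch of the shape being described. The package name, copyAll, and uploadOne are stand-ins and do not mirror the actual cp.go code:

```go
package fs

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// uploadOne is a hypothetical stand-in for the per-file upload call in cp.go.
func uploadOne(ctx context.Context, path string) error { return nil }

func copyAll(ctx context.Context, paths []string) error {
	// Cancellable wrapper (the review's "line 57"); the deferred cancel
	// releases it on return ("line 58").
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// errgroup.WithContext derives its own context. Reusing the name ctx here
	// means all later code sees the errgroup context, while the deferred
	// cancel above applies only to the earlier wrapper ("line 61").
	group, ctx := errgroup.WithContext(ctx)

	for _, p := range paths {
		group.Go(func() error {
			return uploadOne(ctx, p)
		})
	}
	return group.Wait()
}
```

Keeping a single derived context (see the suggestion below about dropping the double wrapping) removes any ambiguity about which context the deferred cancel controls.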
Issues
- Context shadowing at line 61: errgroup.WithContext returns a new context that shadows the cancellable context from line 57, so defer cancel() on line 58 will cancel the wrong context.
- Redundant context check at lines 90-92: The ctx.Err() check inside the goroutine is ineffective, since the goroutine may start before cancellation and cpFileToFile already handles context cancellation.
- Missing test coverage: TestCp_concurrencyValidation only tests invalid values. It should also verify that valid values (1, 16, 100) work correctly (see the sketch after this list).
- No integration test for --concurrency flag: The new flag should have an integration test exercising different concurrency values.
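A possible shape for the requested positive cases, as a hedged sketch; validateConcurrency is a hypothetical stand-in, since the actual validation helper and test layout in the CLI may differ:

```go
package fs

import (
	"fmt"
	"testing"
)

// validateConcurrency is a hypothetical stand-in for whatever check backs the
// --concurrency flag; the real helper in cp.go may look different.
func validateConcurrency(n int) error {
	if n < 1 {
		return fmt.Errorf("--concurrency must be at least 1, got %d", n)
	}
	return nil
}

// Complements the existing rejection tests by asserting that sensible values
// are accepted.
func TestCp_concurrencyValidation_validValues(t *testing.T) {
	for _, n := range []int{1, 16, 100} {
		if err := validateConcurrency(n); err != nil {
			t.Errorf("concurrency=%d should be accepted, got: %v", n, err)
		}
	}
}
```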
Suggestions
- Consider removing the double context wrapping, since errgroup.WithContext already provides cancellation on error (a simplified flow is sketched after this list).
- Document the no-ordering-guarantee behavior in the command help text.
- Consider adding debug logs for performance monitoring.
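For the first suggestion, a minimal sketch of what an errgroup-only flow could look like, assuming the same hypothetical uploadOne helper as in the earlier sketch; this is not the actual cp.go implementation:

```go
package fs

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// uploadOne is the same hypothetical per-file upload stand-in as before.
func uploadOne(ctx context.Context, path string) error { return nil }

// copyAllSimplified relies solely on the context derived by errgroup.WithContext,
// which is cancelled the first time any goroutine returns a non-nil error.
func copyAllSimplified(ctx context.Context, paths []string, concurrency int) error {
	group, gctx := errgroup.WithContext(ctx)
	group.SetLimit(concurrency) // bound the number of in-flight uploads

	for _, p := range paths {
		group.Go(func() error {
			return uploadOne(gctx, p)
		})
	}
	return group.Wait()
}
```

Wait blocks until every launched upload has returned, so the error path needs no extra cancellation plumbing beyond what errgroup already provides.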
Questions
- Why the double context wrapping (lines 57 and 61)? Is there a specific reason beyond what errgroup provides?
- Is concurrency=16 based on benchmarking? Different scenarios might benefit from different values.
- With concurrent output, messages will be interleaved. Is this acceptable UX?
Review generated by reviewbot
What changes are proposed in this pull request?
This PR improves the performance of the databricks fs cp command when copying directories by parallelizing file uploads. The command uses 16 concurrent workers by default, but the number can be controlled via --concurrency.
Implementation details:
Filer implementation as before.
Why --concurrency? No strong preference here; there does not seem to be an existing pattern in the CLI for controlling concurrency elsewhere. This is the flag name used in most Go tools, but I'm happy to use something else.
How is this tested?
Added acceptance tests to exercise most code paths, plus unit tests to validate that context cancellation and propagation work properly.
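For context on the --concurrency flag mentioned above, here is a hedged sketch of how such a flag with a default of 16 is typically declared on a cobra command; the command name, help text, and wiring are assumptions and may not match cp.go:

```go
package fs

import "github.com/spf13/cobra"

// newCpCommand sketches the declaration of a --concurrency flag defaulting to
// 16 concurrent uploads; the real command would pass this value down to the
// parallel upload loop (e.g. via errgroup's SetLimit).
func newCpCommand() *cobra.Command {
	var concurrency int

	cmd := &cobra.Command{
		Use:   "cp SOURCE_PATH TARGET_PATH",
		Short: "Copy files and directories.",
		RunE: func(cmd *cobra.Command, args []string) error {
			// concurrency holds either the default (16) or the user-supplied value.
			_ = concurrency
			return nil
		},
	}
	cmd.Flags().IntVar(&concurrency, "concurrency", 16, "Maximum number of concurrent file uploads")
	return cmd
}
```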