Contributor

@renaudhartert-db renaudhartert-db commented Dec 11, 2025

What changes are proposed in this pull request?

This PR improves the performance of the databricks fs cp command when copying directories by parallelizing file uploads. The command uses 16 concurrent workers by default; the number can be changed with --concurrency.

Implementation details:

  • No ordering guarantee: Files are now copied in parallel with no guaranteed order (previously sequential).
  • Fail-fast on errors: If any file copy fails, the context is cancelled and remaining operations are stopped (first error is returned).
  • Retry responsibility: The implementation does not retry failed operations; this remains the responsibility of the underlying Filer implementation as before.

Why --concurrency? I have no strong preference here. There does not seem to be an existing pattern in the CLI for controlling concurrency. This is the flag name used in most Go tools, but I'm happy to use something else.

How is this tested?

Added acceptance tests to exercise most code paths, plus unit tests to validate that context cancellation and propagation work properly.

Collaborator

eng-dev-ecosystem-bot commented Dec 11, 2025

Commit: e720670

Run: 20276256026

| Env | ❌ FAIL | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|---|
| 🟨 aws linux | | 3 | | 4 | 1 | 380 | 650 | 17:59 |
| 🟨 aws windows | | 3 | 1 | 4 | 1 | 381 | 648 | 15:58 |
| 🟨 aws-ucws linux | | 3 | | 4 | 1 | 523 | 535 | 23:43 |
| 🟨 aws-ucws windows | | 3 | | 4 | 1 | 525 | 533 | 19:59 |
| ❌ azure linux | 2 | | | 1 | 3 | 379 | 648 | 21:44 |
| ❌ azure windows | 7 | | 2 | 1 | 3 | 374 | 646 | 20:29 |
| 💚 azure-ucws linux | | | | 1 | 3 | 520 | 533 | 21:25 |
| 💚 azure-ucws windows | | | | 1 | 3 | 522 | 531 | 19:38 |
| ❌ gcp linux | 2 | | | 1 | 3 | 368 | 654 | 16:24 |
| ❌ gcp windows | 7 | | | 1 | 3 | 365 | 652 | 15:44 |

18 interesting tests: 7 FAIL, 4 RECOVERED, 3 KNOWN, 3 flaky, 1 SKIP

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨K | 🟨K | 🟨K | 🟨K | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨K | 🟨K | 🟨K | 🟨K | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 🟨K | 🟨K | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| ❌ TestExport | ✅p | ✅p | ✅p | ✅p | ❌F | ❌F | ✅p | ✅p | ❌F | ❌F |
| ❌ TestExportWithFileFlag | ✅p | ✅p | ✅p | ✅p | ❌F | ❌F | ✅p | ✅p | ❌F | ❌F |
| ❌ TestImportDir | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ✅p | ✅p | ❌F |
| ❌ TestImportDirDoesNotOverwrite | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ✅p | ✅p | ❌F |
| ❌ TestImportDirWithOverwriteFlag | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ✅p | ✅p | ❌F |
| ❌ TestImportFileFormatAuto | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ✅p | ✅p | ❌F |
| ❌ TestImportFileFormatSource | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ✅p | ✅p | ❌F |
| 🔄 TestFilerWorkspaceNotebook | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p | ✅p | ✅p |
| 🔄 TestFilerWorkspaceNotebook/sqlNb.sql | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p | ✅p | ✅p |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo | ✅p | 🔄f | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p |

Top 31 slowest tests (at least 2 minutes):

| Duration | Env | Test name |
|---|---|---|
| 6:28 | aws-ucws linux | TestAccept/bundle/resources/synced_database_tables/basic |
| 6:19 | aws-ucws windows | TestAccept/bundle/resources/synced_database_tables/basic |
| 5:41 | gcp windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 5:39 | aws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 5:37 | gcp windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 5:34 | aws-ucws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 5:34 | aws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 5:33 | aws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 5:32 | aws-ucws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 5:28 | gcp linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 5:23 | gcp linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 5:18 | aws-ucws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 5:17 | aws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:55 | azure-ucws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:51 | azure linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:39 | azure linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:32 | azure-ucws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:21 | azure-ucws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:17 | azure windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:58 | azure-ucws windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:52 | azure windows | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:02 | gcp linux | TestAccept/ssh/connection |
| 2:56 | azure-ucws windows | TestAccept/bundle/resources/synced_database_tables/basic |
| 2:48 | azure-ucws linux | TestAccept/bundle/resources/synced_database_tables/basic |
| 2:28 | aws-ucws linux | TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:24 | aws linux | TestAccept/ssh/connection |
| 2:19 | azure-ucws windows | TestSecretsPutSecretStringValue |
| 2:10 | aws linux | TestSecretsPutSecretStringValue |
| 2:06 | azure linux | TestAccept/ssh/connection |
| 2:03 | aws-ucws linux | TestAccept/ssh/connection |
| 2:02 | gcp linux | TestSecretsPutSecretStringValue |

Contributor

@shreyas-goenka shreyas-goenka left a comment

Critical Issue: Context Shadowing Bug

Line 61 in cp.go shadows the context variable from line 57, causing the defer cancel() to cancel the wrong context. This breaks proper cleanup and cancellation propagation.

Issues

  1. Context shadowing at line 61: The errgroup.WithContext returns a new context that shadows the cancellable context from line 57. This means defer cancel() on line 58 will cancel the wrong context.

  2. Redundant context check at lines 90-92: The ctx.Err() check inside the goroutine is ineffective since the goroutine may start before cancellation, and cpFileToFile already handles context cancellation.

  3. Missing test coverage: TestCp_concurrencyValidation only tests invalid values. It should also test that valid values (1, 16, 100) work correctly.

  4. No integration test for --concurrency flag: The new flag should have an integration test exercising different concurrency values.

Suggestions

  1. Consider removing the double context wrapping since errgroup.WithContext already provides cancellation on error.

  2. Document the no-ordering-guarantee behavior in command help text.

  3. Consider adding debug logs for performance monitoring.

Questions

  1. Why the double context wrapping (lines 57 and 61)? Is there a specific reason beyond what errgroup provides?

  2. Is concurrency=16 based on benchmarking? Different scenarios might benefit from different values.

  3. With concurrent output, messages will be interleaved. Is this acceptable UX?


Review generated by reviewbot

@shreyas-goenka shreyas-goenka self-requested a review December 12, 2025 10:56
4 participants