Skip to content

Use moveTo API to move files for standard buckets#800

Open
Yonghui-Lee wants to merge 7 commits intofsspec:mainfrom
ankitaluthra1:yonghui/standard-bucket-mv
Open

Use moveTo API to move files for standard buckets#800
Yonghui-Lee wants to merge 7 commits intofsspec:mainfrom
ankitaluthra1:yonghui/standard-bucket-mv

Conversation

@Yonghui-Lee
Copy link
Copy Markdown
Contributor

I update the implementation of move method for standard bucket to use the atomic moveTo API instead of the non-atomic copy-and-delete.

Key Changes

  • Replace copy-and-delete with atomic moveTo.
  • List and Glob Pattern Support: Expand source and destination paths using fsspec's robust other_paths helper. It supports:
  • Asynchronous Batched Execution: Moves are executed aggressively in parallel using _run_coros_in_chunks (governed by batch_size kwargs parameter).
  • Error handling for Implicit Directories: Because standard un-hierarchical GCS buckets use implicit directories, _expand_path() will identify dummy directory paths that don't physically exist as objects. The exception handling loop gracefully catches and suppresses FileNotFoundError if it corresponds to an implicit directory.
  • Add a specific microbenchmark for file renaming operations, segregating it from directory renaming benchmarks.

Performance

For previous copy-and-delete implementation

  • copy tasks are executed in coroutines.
  • delele tasks can make use of batch API and are also executed in coroutines.

For moveTo API, tasks are executed in coroutines.

The micro benchmark results show that moveTo is 10-20% faster than copy-and-delete using the same batch size.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.66%. Comparing base (64936ae) to head (0439a99).

Files with missing lines Patch % Lines
gcsfs/core.py 96.29% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   76.11%   76.66%   +0.55%     
==========================================
  Files          15       15              
  Lines        2679     2704      +25     
==========================================
+ Hits         2039     2073      +34     
+ Misses        640      631       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@martindurant
Copy link
Copy Markdown
Member

How long has this API been available? :)

@Mahalaxmibejugam
Copy link
Copy Markdown
Contributor

Can you attach the benchmark values of standard buckets for both file and folder renames in the description or in a comment?


# mv method is already available as sync method in the fsspec.py
# Async version of it is introduced here so that mv can be used in async methods.
# TODO: Add async mv method in the async.py and remove from GCSFileSystem.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this ToDo now, as this is GCS specific custom implementation, we can't copy this method to async.py. We will have to keep it in GCSFileSystem only.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the implementation looks generic - but no need to take any action about it now. In fact, I don't expect too much of a performance gain (but maybe some) - we just need to make sure it passes the abstract tests for mv.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh yes agree that it is generic, I missed that moveTo API is abstracted by mv_file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider moving the implementation from GCSFileSystem to fsspec? We might also see performance gains if other fsspec implementations also include atomic mv_file like we see in case of GCSFS.

There is also a significant code duplication in mv method from other fsspec methods like copy which can be extracted into a helper.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea. But the default implementation of mv_file of AsyncFileSystem isn't atomic. If we use mv_file for mv by default, there may be performance drop for some fsspec implementations.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing mv implementation in AbstractFileSystem is also not atomic, it relies on a copy+delete logic. Therefore IMO changing from copy+delete to mv_file (which is also not atomic in AsyncFileSystem) should not have any performance implications.

@martindurant Let us know your thoughts about this

@Yonghui-Lee
Copy link
Copy Markdown
Contributor Author

How long has this API been available? :)

It has been available for about one year. https://docs.cloud.google.com/storage/docs/release-notes#February_24_2025

@Yonghui-Lee
Copy link
Copy Markdown
Contributor Author

Can you attach the benchmark values of standard buckets for both file and folder renames in the description or in a comment?

The benchmark raw data for batch size 20 is as follows.
Copy-and-delete:
image

MoveTo:
image

@martindurant
Copy link
Copy Markdown
Member

(e2e test error is not showing up for me)

@Mahalaxmibejugam
Copy link
Copy Markdown
Contributor

(e2e test error is not showing up for me)

Tests didn't run yet, we need to add /gcbrun as a comment to trigger the pipeline run. You and Ankita have the permissions to trigger the tests.

@martindurant
Copy link
Copy Markdown
Member

/gcbrun

@martindurant
Copy link
Copy Markdown
Member

(the CI runs should show up as pending if they have not begun yet, but now I get it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants