Enhance _info method to check file and directory info in parallel #786

Merged
ankitaluthra1 merged 15 commits into fsspec:main from ankitaluthra1:optimize-info
Apr 8, 2026

@yuxin00j
Contributor

@yuxin00j yuxin00j commented Mar 25, 2026

Optimize the performance of the _info method by enabling concurrent checks for file paths and directory listings.

  • Early Return Strategy: If _get_object completes first and resolves to a valid file (not a directory marker), the execution cancels the directory scan tasks and returns the file metadata immediately.

  • Fallback Logic: If _get_object fails or yields a directory marker, it safely falls back to the directory tree scan result.
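The two strategies above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `get_object`, `scan_directory`, `is_directory_marker`, and the in-memory `FAKE_OBJECTS` / `FAKE_DIRECTORIES` data are hypothetical stand-ins, and the sketch simply awaits the file lookup first while the directory scan runs concurrently, rather than reacting to whichever task completes first.

```python
import asyncio

# Hypothetical in-memory stand-ins for bucket contents (not real gcsfs data).
FAKE_OBJECTS = {
    "bucket/data.csv": {"name": "bucket/data.csv", "size": 10, "type": "file"},
    "bucket/marked/": {"name": "bucket/marked/", "size": 0, "type": "file"},
}
FAKE_DIRECTORIES = {"bucket/folder", "bucket/marked/"}


async def get_object(path):
    """Stand-in for the object lookup; raises if the object is missing."""
    await asyncio.sleep(0.01)
    if path not in FAKE_OBJECTS:
        raise FileNotFoundError(path)
    return FAKE_OBJECTS[path]


async def scan_directory(path):
    """Stand-in for the slower directory listing."""
    await asyncio.sleep(0.05)
    if path not in FAKE_DIRECTORIES:
        raise FileNotFoundError(path)
    return {"name": path, "size": 0, "type": "directory"}


def is_directory_marker(obj):
    # Assumption: a zero-byte object whose name ends in "/" marks a directory.
    return obj["name"].endswith("/") and obj["size"] == 0


async def info(path):
    # Start both lookups immediately so they overlap.
    file_task = asyncio.create_task(get_object(path))
    dir_task = asyncio.create_task(scan_directory(path))
    try:
        try:
            obj = await file_task
        except FileNotFoundError:
            obj = None
        if obj is not None and not is_directory_marker(obj):
            return obj  # early return: the directory scan is cancelled below
        return await dir_task  # fallback: use the directory scan result
    finally:
        for task in (file_task, dir_task):
            if not task.done():
                task.cancel()
        # Collect results and exceptions so nothing is left unretrieved.
        await asyncio.gather(file_task, dir_task, return_exceptions=True)
```

With these stand-ins, `asyncio.run(info("bucket/data.csv"))` returns the file entry without waiting for the directory scan, while a path that is only a folder falls through to the scan result.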

Benchmark run result

Folder Info

Execution times consistently dropped by 30% to 60% across all single-threaded and multi-process configurations.

File Info

Results are generally neutral.

Bucket Info

This optimization does not affect info calls on buckets.

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.44%. Comparing base (e70bc65) to head (df864d8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #786      +/-   ##
==========================================
+ Coverage   75.98%   76.44%   +0.46%     
==========================================
  Files          14       15       +1     
  Lines        2665     2679      +14     
==========================================
+ Hits         2025     2048      +23     
+ Misses        640      631       -9     

@yuxin00j yuxin00j marked this pull request as ready for review March 26, 2026 02:22
@yuxin00j yuxin00j changed the title from "Enhance _info method to check file and directory info in parallel.Optimize info" to "Enhance _info method to check file and directory info in parallel" Mar 26, 2026
@yuxin00j
Contributor Author

Hi @ankitaluthra1, you may check the update on optimization in _info here and in #780

@ankitaluthra1
Collaborator

/gcbrun

@ankitaluthra1
Collaborator

@yuxin00j Can you please check the e2e failure?

@yuxin00j
Contributor Author

yuxin00j commented Apr 2, 2026

Hi @ankitaluthra1, I have fixed the test failure.

@Mahalaxmibejugam
Contributor

QQ: Was the 30% to 60% improvement also observed for HNS buckets where we are parallelizing get_object and get_folder calls?

@Mahalaxmibejugam
Contributor

> File Info: Results are mixed but generally neutral, showing minor speedups of up to 24.6% in high process count runs. One outlier showed a minor regression in deep regional tests.

Is the speedup for file paths related to the changes in this PR? I assume it is variance, since this change shouldn't affect latency for file paths. Let me know if I am missing something here.

@yuxin00j
Contributor Author

yuxin00j commented Apr 6, 2026

> QQ: Was the 30% to 60% improvement also observed for HNS buckets where we are parallelizing get_object and get_folder calls?

Yes. All 3 bucket types improve when the target is a folder.

@yuxin00j
Contributor Author

yuxin00j commented Apr 6, 2026

> File Info: Results are mixed but generally neutral, showing minor speedups of up to 24.6% in high process count runs. One outlier showed a minor regression in deep regional tests.

> Is the speedup for file paths related to the changes in this PR? I am assuming it is variance and not related to this PR as the latency for file paths shouldn't be impacted by this change, let me know if I am missing something here.

Yeah, I think you're right. It should be just variance.

yuxin00j added a commit to ankitaluthra1/gcsfs that referenced this pull request Apr 6, 2026
@ankitaluthra1
Collaborator

/gcbrun

@ankitaluthra1 ankitaluthra1 merged commit 9d25f7c into fsspec:main Apr 8, 2026
9 checks passed
@googlyrahman
Contributor

googlyrahman commented Apr 13, 2026

Recently I've started getting these prints on my zonal benchmarks:

Task exception was never retrieved
future: <Task finished name='Task-12' coro=<ExtendedGcsFileSystem._get_directory_info() done, defined at /home/margubur/scripts/repos/gcsfs/gcsfs/extended_gcsfs.py:814> exception=FileNotFoundError('do-not-delete-margubur-zonal-usc1/20gb-file')>
Traceback (most recent call last):
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 86, in __await__
    response = yield from self._call.__await__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/grpc/aio/_interceptor.py", line 474, in __await__
    response = yield from call.__await__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.NOT_FOUND
        details = "The folder does not exist."
        debug_error_string = "UNKNOWN:Error received from peer ipv4:192.178.209.207:443 {grpc_status:5, grpc_message:"The folder does not exist."}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/margubur/scripts/repos/gcsfs/gcsfs/extended_gcsfs.py", line 836, in _get_directory_info
    response = await client.get_folder(request=request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/cloud/storage_control_v2/services/storage_control/async_client.py", line 708, in get_folder
    response = await rpc(
               ^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/retry/retry_unary_async.py", line 231, in retry_wrapped_func
    return await retry_target(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/retry/retry_unary_async.py", line 163, in retry_target
    next_sleep = _retry_error_helper(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/retry/retry_base.py", line 216, in _retry_error_helper
    raise final_exc from source_exc
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/retry/retry_unary_async.py", line 158, in retry_target
    return await target()
           ^^^^^^^^^^^^^^
  File "/home/margubur/scripts/temp/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 89, in __await__
    raise exceptions.from_grpc_error(rpc_error) from rpc_error
google.api_core.exceptions.NotFound: 404 The folder does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/margubur/scripts/repos/gcsfs/gcsfs/extended_gcsfs.py", line 851, in _get_directory_info
    raise FileNotFoundError(path)
FileNotFoundError: do-not-delete-margubur-zonal-usc1/20gb-file

This happens when a spawned asyncio task completes with an exception that is never retrieved. From the traces and the code, I think this relates to this PR, which parallelizes the list and get calls in the _info method. It may happen when the list call finishes and raises an exception before the get call completes.

While this doesn't block the execution of further tasks, it isn't a clean exit: this text gets printed irrespective of verbosity. This may confuse our users, and I think we should fix it.

Here's the reproduction script: https://paste.googleplex.com/4773599538970624. @yuxin00j, can you check? As a fix, maybe you can add await asyncio.gather(*tasks, return_exceptions=True) in the finally block of your parallel_tasks_first_completed context manager.

Please let me know if that's not the case.
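For illustration, the suggested fix could look roughly like the sketch below. The `first_completed` context manager here is a hypothetical stand-in written for this example, not the actual `parallel_tasks_first_completed` from gcsfs, and `failing_listing` / `slow_get` are made-up coroutines that mimic a list call failing before the get call returns.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def first_completed(*coros):
    """Hypothetical stand-in for parallel_tasks_first_completed."""
    tasks = [asyncio.create_task(coro) for coro in coros]
    try:
        yield tasks
    finally:
        # Cancel whatever is still running once the caller is done.
        for task in tasks:
            if not task.done():
                task.cancel()
        # Await every task and swallow its outcome, so an exception from a
        # task the caller never looked at still counts as retrieved.
        await asyncio.gather(*tasks, return_exceptions=True)


async def failing_listing():
    # Mimics the directory lookup failing before the object lookup returns.
    raise FileNotFoundError("no such folder")


async def slow_get():
    await asyncio.sleep(0.01)
    return {"type": "file"}


async def demo():
    async with first_completed(failing_listing(), slow_get()) as (_, get_task):
        # Only the get task is awaited; the listing task's exception is
        # picked up by the gather call in the finally block above.
        return await get_task
```

Without the `gather` call in the `finally` block, the listing task's `FileNotFoundError` would never be inspected, which is the situation in which asyncio reports "Task exception was never retrieved" when the task is garbage collected.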
