Enhance _info method to check file and directory info in parallel #786
ankitaluthra1 merged 15 commits into fsspec:main
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #786 +/- ##
==========================================
+ Coverage 75.98% 76.44% +0.46%
==========================================
Files 14 15 +1
Lines 2665 2679 +14
==========================================
+ Hits 2025 2048 +23
+ Misses 640 631 -9
|
Hi @ankitaluthra1, you can review the update to the _info optimization here and in #780. |
|
/gcbrun |
|
@yuxin00j, can you please check the e2e failure? |
|
Hi @ankitaluthra1, I have fixed the test failure. |
|
QQ: Was the 30% to 60% improvement also observed for HNS buckets where we are parallelizing get_object and get_folder calls? |
Is the speedup for file paths related to the changes in this PR? I am assuming it is variance and not related to this PR, as the latency for file paths shouldn't be impacted by this change. Let me know if I am missing something here. |
Yes. There is an improvement for all 3 bucket types when the target type is a folder. |
Yeah, I think you're right. It should be just variance. |
…and format with black
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…wait and simplify parallel task evaluation in _info
|
/gcbrun |
|
Recently I've started getting these prints on my zonal benchmarks. This happens when a spawned asyncio task completes by raising an exception that is never retrieved. Looking at the traces and the code, I think this relates to this PR, which parallelizes the list and get calls in the _info method. It may happen when the list call finishes and raises an exception before the get call completes.
While this doesn't block the execution of further tasks, it isn't a clean exit, and the text gets printed irrespective of verbosity. This may confuse our users, and I think we should fix it. Here's the reproduction script: https://paste.googleplex.com/4773599538970624
@yuxin00j, can you check? As a fix, maybe you can add await asyncio.gather(*tasks, return_exceptions=True) in the finally block of the parallel_tasks_first_completed context manager. Please let me know if that's not the case. |
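To make the suggested fix concrete, here is a minimal sketch of draining tasks in a `finally` block. The implementation of parallel_tasks_first_completed is not shown in this thread, so `first_completed_sketch`, `failing_list`, and `slow_get` below are hypothetical stand-ins, not the PR's code. The assumption is that the prints are asyncio's "Task exception was never retrieved" warning, which is logged when a failed task is garbage-collected without anyone reading its exception.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def first_completed_sketch(*coros):
    """Hypothetical stand-in for parallel_tasks_first_completed."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        yield tasks
    finally:
        # Cancel whatever is still running.
        for task in tasks:
            if not task.done():
                task.cancel()
        # Await every task so each result, exception, or cancellation is
        # retrieved; return_exceptions=True keeps the cleanup from raising.
        await asyncio.gather(*tasks, return_exceptions=True)


async def failing_list():
    # Hypothetical list call that fails before the get call finishes.
    raise FileNotFoundError("no such prefix")


async def slow_get():
    # Hypothetical get call that succeeds a little later.
    await asyncio.sleep(0.05)
    return {"name": "bucket/key"}


async def main():
    async with first_completed_sketch(failing_list(), slow_get()) as tasks:
        list_task, get_task = tasks
        # Only the get result is used; list_task's failure is never inspected
        # by the caller, which is the situation described in the comment.
        return await get_task


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Without the `asyncio.gather(...)` line, the failed list task would be dropped with its exception unread. With it, the exception is consumed during cleanup and the caller still sees only the get result.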
Optimize the performance of the _info method by enabling concurrent checks for file paths and directory listings.
Early Return Strategy: If _get_object completes first and resolves to a valid file (not a directory marker), the execution cancels the directory scan tasks and returns the file metadata immediately.
Fallback Logic: If _get_object fails or yields a directory marker, it safely falls back to the directory tree scan result.
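The early-return and fallback behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `info_sketch` and its three callables are hypothetical names, a missing object is assumed to surface as `FileNotFoundError`, and the sketch always waits for the file lookup before consulting the directory scan, which simplifies the "completes first" condition.

```python
import asyncio


async def info_sketch(get_object, list_dir, is_dir_marker):
    """Run a file lookup and a directory scan concurrently (illustrative)."""
    file_task = asyncio.create_task(get_object())
    dir_task = asyncio.create_task(list_dir())
    try:
        try:
            obj = await file_task
        except FileNotFoundError:
            obj = None
        if obj is not None and not is_dir_marker(obj):
            # Early return: a real file was found, so the scan is abandoned.
            return obj
        # Fallback: no file, or only a directory marker; use the scan result.
        return await dir_task
    finally:
        for task in (file_task, dir_task):
            if not task.done():
                task.cancel()
        # Drain both tasks so no exception or cancellation is left unread.
        await asyncio.gather(file_task, dir_task, return_exceptions=True)


async def _demo():
    async def get_object():
        # Hypothetical object lookup that finds a regular file.
        return {"name": "bucket/file.txt", "type": "file"}

    async def list_dir():
        # Hypothetical slow directory scan; it gets cancelled here.
        await asyncio.sleep(5)
        return {"name": "bucket/file.txt", "type": "directory"}

    return await info_sketch(
        get_object, list_dir, lambda o: o["name"].endswith("/")
    )


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

Because both requests start at the same time, a folder lookup no longer pays for the failed file lookup before the scan begins, which is consistent with the gains reported for folder paths and the neutral results for file paths.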
Benchmark run result
Folder Info
Execution times consistently dropped by 30% to 60% across all single-threaded and multi-process configurations.
File Info
Results are generally neutral.
Bucket Info
This optimization does not affect info calls on buckets.