Updates zarr-parser to use obstore list_async instead of concurrent_map #892

norlandrhagen merged 26 commits into main

Conversation
virtualizarr/parsers/zarr.py (Outdated)

```python
lengths = await _concurrent_map(
    [(k,) for k in chunk_keys], zarr_array.store.getsize
)
lengths = [size_map[k] for k in chunk_keys]
```
I think we really want to work hard to avoid creating any python lists / dicts at all
instead we want obstore -> arrow -> numpy
via https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I think the hardest part of this is dealing with logic for missing keys - arrow might return these as nulls, but the to_numpy conversion doesn't support nulls?
Any operations we do should either be as pyarrow arrays or as numpy arrays, never as python collections
I think in this case you can assert that there are no nulls. I don't think this particular list function will ever create nulls in the arrow arrays.
virtualizarr/parsers/zarr.py (Outdated)

```python
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
async for batch in stream:
    size_map.update(
        zip(batch.column("path").to_pylist(), batch.column("size").to_pylist())
    )
```
is this zipping of pylists creating a python dict? we want to avoid that
You will also want to add a new (private for now) constructor to the

Hmm, now hitting a kerchunk error:
virtualizarr/manifests/manifest.py (Outdated)

```python
def _from_arrow(
    cls,
    *,
    chunk_keys: "pa.Array",
```
I don't know that you need to pass this - maybe instead we should pass arrow arrays with nulls for uninitialized chunks?
virtualizarr/parsers/zarr.py (Outdated)

```python
path_batches = []
size_batches = []
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
```
Just grabbing the underlying obstore store is an interesting idea...
Co-authored-by: Tom Nicholas <tom@earthmover.io>
This should be unit testable without using Kerchunk or Icechunk. We are simply creating the
…ape]. Moves all weird arrow reshaping into zarr:build_chunk_manifest
Totally agree! I think... the kerchunk errors are unrelated. I added
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   89.24%   89.15%   -0.09%
==========================================
  Files          34       33       -1
  Lines        1999     2038      +39
==========================================
+ Hits         1784     1817      +33
- Misses        215      221       +6
==========================================
```
virtualizarr/manifests/manifest.py (Outdated)

```python
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)

if shape is not None:
```
What happens if shape is None? Should that even be allowed?
virtualizarr/manifests/manifest.py (Outdated)

```python
paths_np = (
    pc.if_else(pc.is_null(paths), "", paths)
    .to_numpy(zero_copy_only=False)
    .astype(np.dtypes.StringDType())
)
offsets_np = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
).to_numpy(zero_copy_only=False)
lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
```
Let's split the arrow compute operations from the numpy conversions, if only because it makes it easier to read.
virtualizarr/parsers/zarr.py (Outdated)

```python
chunk_grid_shape = tuple(
    math.ceil(s / c) for s, c in zip(zarr_array.shape, zarr_array.chunks)
)
# scalar arrays go through the dict path instead of the pure arrow bit
```
It would be nice to not have to keep the whole old codepath around just for this special case...
virtualizarr/parsers/zarr.py (Outdated)

```python
return ChunkManifest(chunk_map)
normalized_keys, full_paths, all_lengths = result

# Incoming: lots of LLM arrow mumbo jumbo for sparse arrays
```
there's a lot going on here that I'm suspicious could be simplified
Totally agree. I took a shot at trying to simplify it a bit. The handling of sparse arrays makes it a bit verbose.
Does this seem G2G to you @TomNicholas?
TomNicholas left a comment
Yes, thank you so much @norlandrhagen !
```python
from virtualizarr.types import ChunkKey

if TYPE_CHECKING:
    import pyarrow as pa  # type: ignore[import-untyped,import-not-found]
```
I'd strongly suggest not tying to pyarrow:

- You should very easily be able to make your code generic and not tied to pyarrow.
- pyarrow doesn't have any internal type checking, so typing as `pa.StringArray` or `pa.UInt64Array` means absolutely nothing to the user (it might mean something to the developer).
```toml
    "imagecodecs-numcodecs==2024.6.1",
]

zarr = ["arro3-core", "pyarrow"]
```
Ah thanks for the feedback @kylebarron. Good to know about pyarrow.
```python
*,
paths: "pa.StringArray",
offsets: "pa.UInt64Array",
lengths: "pa.UInt64Array",
```
I'd suggest typing these as `ArrowArrayExportable`, and then using an arrow library of choice to import the data, such as passing the input to `pyarrow.array` or `arro3.core.Array.from_arrow()`.

Then this API will automatically support any arrow input, including polars, duckdb, arro3, etc. apache/arrow#39195 (comment)
```python
arrow_paths = pc.if_else(pc.is_null(paths), "", paths)
arrow_offsets = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
)
arrow_lengths = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
)
```
Requiring a pyarrow dependency just for these three lines is not worth it IMO. Much better to just document that the users must remove any null values before passing in arguments.
And then you can probably use arro3-core for all your needs and save the big pyarrow dependency.
(replies to comments continued in #922)

Closes #891: Speed up ZarrParser using obstore and Arrow?
- Tests passing
- Full type hint coverage
- Changes are documented in `docs/releases.rst`

Swaps out the `_concurrent_map` in `build_chunk_mapping` with obstore's `list_async`. Constructs the Python `ChunkManifest` object's numpy arrays directly from the Arrow arrays.*

*There is still a conversion to a dict, so not quite.

Bonus: removes the zarr vendor code.