Updates zarr-parser to use obstore list_async instead of concurrent_map #892

norlandrhagen merged 26 commits into main

Conversation
virtualizarr/parsers/zarr.py (Outdated)

```python
lengths = await _concurrent_map(
    [(k,) for k in chunk_keys], zarr_array.store.getsize
)
lengths = [size_map[k] for k in chunk_keys]
```
I think we really want to work hard to avoid creating any python lists / dicts at all
instead we want obstore -> arrow -> numpy
via https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I think the hardest part of this is dealing with logic for missing keys - arrow might return these as nulls, but the to_numpy conversion doesn't support nulls?
Any operations we do should either be as pyarrow arrays or as numpy arrays, never as python collections
I think in this case you can assert that there are no nulls. I don't think this particular list function will ever create nulls in the arrow arrays.
virtualizarr/parsers/zarr.py (Outdated)

```python
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
async for batch in stream:
    size_map.update(
        zip(batch.column("path").to_pylist(), batch.column("size").to_pylist())
    )
```
is this zipping of pylists creating a python dict? we want to avoid that
You will also want to add a new (private for now) constructor to the

Hmm, now hitting a kerchunk error:
virtualizarr/manifests/manifest.py (Outdated)

```python
def _from_arrow(
    cls,
    *,
    chunk_keys: "pa.Array",
```
I don't know that you need to pass this - maybe instead we should pass arrow arrays with nulls for uninitialized chunks?
virtualizarr/parsers/zarr.py (Outdated)

```python
path_batches = []
size_batches = []
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
```
Just grabbing the underlying obstore store is an interesting idea...
Co-authored-by: Tom Nicholas <tom@earthmover.io>
This should be unit testable without using Kerchunk or Icechunk. We are simply creating the
…ape]. Moves all weird arrow reshaping into zarr:build_chunk_manifest
Totally agree! I think... the kerchunk errors are unrelated. I added
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   89.24%   89.15%   -0.09%
==========================================
  Files          34       33       -1
  Lines        1999     2038      +39
==========================================
+ Hits         1784     1817      +33
- Misses        215      221       +6
==========================================
```
virtualizarr/manifests/manifest.py (Outdated)

```python
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)

if shape is not None:
```
What happens if shape is None? Should that even be allowed?
virtualizarr/manifests/manifest.py (Outdated)

```python
paths_np = (
    pc.if_else(pc.is_null(paths), "", paths)
    .to_numpy(zero_copy_only=False)
    .astype(np.dtypes.StringDType())
)
offsets_np = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
).to_numpy(zero_copy_only=False)
lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
```
Let's split the arrow compute operations from the numpy conversions, if only because it makes it easier to read.
virtualizarr/parsers/zarr.py (Outdated)

```python
chunk_grid_shape = tuple(
    math.ceil(s / c) for s, c in zip(zarr_array.shape, zarr_array.chunks)
)
# scalar arrays go through the dict path instead of the pure arrow bit
```
It would be nice to not have to keep the whole old codepath around just for this special case...
virtualizarr/parsers/zarr.py (Outdated)

```python
return ChunkManifest(chunk_map)
normalized_keys, full_paths, all_lengths = result

# Incoming: lots of LLM arrow mumbo jumbo for sparse arrays
```
there's a lot going on here that I'm suspicious could be simplified
Totally agree. I took a shot at trying to simplify it a bit. The handling of sparse arrays makes it a bit verbose.
Does this seem G2G to you @TomNicholas?
TomNicholas left a comment
Yes, thank you so much @norlandrhagen !
```python
from virtualizarr.types import ChunkKey

if TYPE_CHECKING:
    import pyarrow as pa  # type: ignore[import-untyped,import-not-found]
```
I'd strongly suggest not tying to pyarrow:

- You should very easily be able to make your code generic and not tied to pyarrow.
- pyarrow doesn't have any internal type checking, so typing as `pa.StringArray` or `pa.UInt64Array` means absolutely nothing to the user (it might mean something to the developer).
```toml
    "imagecodecs-numcodecs==2024.6.1",
]

zarr = ["arro3-core", "pyarrow"]
```
Ah thanks for the feedback @kylebarron. Good to know about pyarrow.
```python
*,
paths: "pa.StringArray",
offsets: "pa.UInt64Array",
lengths: "pa.UInt64Array",
```
I'd suggest typing these as `ArrowArrayExportable`, and then using an arrow library of choice to import the data, such as passing the input to `pyarrow.array` or `arro3.core.Array.from_arrow()`.

Then this API will automatically support any arrow input, including polars, duckdb, arro3, etc. apache/arrow#39195 (comment)
```python
arrow_paths = pc.if_else(pc.is_null(paths), "", paths)
arrow_offsets = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
)
arrow_lengths = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
)
```
Requiring a pyarrow dependency just for these three lines is not worth it IMO. Much better to just document that the users must remove any null values before passing in arguments.
And then you can probably use arro3-core for all your needs and save the big pyarrow dependency.
(replies to comments continued in #922)

Closes #891: Speed up ZarrParser using obstore and Arrow?
- Tests passing
- Full type hint coverage
- Changes are documented in `docs/releases.rst`

Swaps out the `_concurrent_map` in `build_chunk_mapping` with obstore's `list_async`. Constructs the Python `ChunkManifest` object's numpy arrays directly from the Arrow arrays.*

*There is still a conversion to a dict, so not quite.

Bonus: removes the zarr vendor code.