Problem
MultiZarrToZarr is extremely powerful but rather hard to use.
This is important - kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However, that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.
Part of the problem is that this one MultiZarrToZarr function can do many different things. Contrast with xarray - when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary):
Concatenation along a single existing dimension. Achieved by xr.concat where dim is a str
Concatenation along a single new dimension (optionally providing new coordinates to use along that new dimension). Achieved by xr.concat where dim is a set of values
Merging of multiple variables which already share dimensions, first aligned according to their coordinates. Achieved by xr.merge
"Combining" by order given, which means some ordered combination of concatenation along one or more dimensions and/or merging. Achieved by xr.combine_nested
"Combining" by coordinate order, which again means some ordered combination of concatenation along one or more dimensions and/or merging, but the order is specified by information in the datasets' coordinates. Achieved by xr.combine_by_coords
In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through MultiZarrToZarr. It also cannot currently easily handle certain types of multi-dimensional concatenation.
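For reference, here is a minimal sketch of the five xarray cases listed above. The datasets, variable names (temp, pres) and the "member" dimension are made up for illustration; only the xarray calls themselves come from the list.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy datasets (illustrative only): same "x" grid, different "time" values.
coords_a = {"time": [0, 1], "x": [10, 20, 30]}
coords_b = {"time": [2, 3], "x": [10, 20, 30]}
ds_a = xr.Dataset({"temp": (("time", "x"), np.zeros((2, 3)))}, coords=coords_a)
ds_b = xr.Dataset({"temp": (("time", "x"), np.ones((2, 3)))}, coords=coords_b)

# 1. Concatenate along an existing dimension: dim is a str.
along_existing = xr.concat([ds_a, ds_b], dim="time")

# 2. Concatenate along a new dimension, supplying the coordinate values for it.
ds_c = ds_a + 1  # same coordinates as ds_a, different data
along_new = xr.concat([ds_a, ds_c], dim=pd.Index(["run1", "run2"], name="member"))

# 3. Merge different variables that already share dimensions.
pres = xr.Dataset({"pres": (("time", "x"), np.full((2, 3), 1000.0))}, coords=coords_a)
merged = xr.merge([ds_a, pres])

# 4. Combine in the order given by the (possibly nested) input list.
nested = xr.combine_nested([ds_a, ds_b], concat_dim="time")

# 5. Combine in the order implied by the coordinates, whatever the input order.
by_coords = xr.combine_by_coords([ds_b, ds_a])
```

Each case has its own entry point and a small, predictable set of arguments, which is the contrast being drawn with a single MultiZarrToZarr call.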
Suggestion
Break up MultiZarrToZarr by defining a set of functions similar to xarray's merge/concat/combine/unify_chunks that consume and produce VirtualZarrStore objects (EDIT: see #375).
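To make the idea concrete, here is a rough sketch of what one such primitive could look like, assuming a simplified flat reference mapping (chunk key -> [url, offset, length], plus a JSON .zarray entry) in place of a real VirtualZarrStore. The name concat_refs, the helper make_refs and the file names are hypothetical; none of this is existing kerchunk API.

```python
import json


def concat_refs(first, second, var, axis=0):
    """Hypothetical helper: concatenate one array's references along an existing axis."""
    meta_key = f"{var}/.zarray"
    meta_a = json.loads(first[meta_key])
    meta_b = json.loads(second[meta_key])

    if meta_a["chunks"] != meta_b["chunks"]:
        raise ValueError("chunk shapes must match")
    chunk_len = meta_a["chunks"][axis]
    if meta_a["shape"][axis] % chunk_len:
        raise ValueError("first array must end on a chunk boundary")
    # Number of chunks the first array occupies along the concatenation axis.
    offset = meta_a["shape"][axis] // chunk_len

    out = dict(first)
    prefix = f"{var}/"
    for key, ref in second.items():
        if not key.startswith(prefix):
            continue  # belongs to another array
        name = key[len(prefix):]
        if name.startswith("."):
            continue  # metadata key such as .zarray, handled below
        index = [int(i) for i in name.split(".")]
        index[axis] += offset  # shift the chunk index past the first array
        out[prefix + ".".join(map(str, index))] = ref

    meta_out = dict(meta_a)
    meta_out["shape"] = list(meta_a["shape"])
    meta_out["shape"][axis] += meta_b["shape"][axis]
    out[meta_key] = json.dumps(meta_out)
    return out


def make_refs(url, n_time):
    """Build toy references: one 16-byte chunk per time step (illustrative only)."""
    meta = {"shape": [n_time, 4], "chunks": [1, 4], "dtype": "<f4"}
    refs = {"temp/.zarray": json.dumps(meta)}
    for t in range(n_time):
        refs[f"temp/{t}.0"] = [url, t * 16, 16]
    return refs


combined = concat_refs(make_refs("a.nc", 2), make_refs("b.nc", 3), "temp")
```

The point of the sketch is the shape of the interface: references in, references out, with a single job (concatenate along one existing axis) and no coordinate-mapping argument.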
Advantages
We can replace/deprecate the heavily overloaded and unintuitive coo_map kwarg (it has 10 possible input types!). Perhaps simply giving an ordered list of coordinate values would be sufficient, and we could just make it easier for the user to extract the values they want from the VirtualZarrStore objects they want to concatenate.
If users need to do something really unusual, they can more easily break their problem up into concatenating each array separately (e.g. for concatenating on staggered grids).
This can be thought of as a refactoring that moves some pangeo-forge functionality upstream, reducing redundancy. We shouldn't have 3 completely different designs for multidimensional concatenation in adjacent libraries in the stack.
These new functions would be more useful as basic primitives for parallelization frameworks to call (e.g. doing tree reduction via dask, beam, or cubed), rather than trying to wrap calls to those frameworks within kerchunk (like kerchunk.combine.auto_dask does).
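As a sketch of the last point: if the combining step is a plain binary function, a tree reduction needs nothing from kerchunk itself. The tree_reduce helper below is hypothetical and runs serially; list concatenation stands in for an associative combine function over reference sets.

```python
def tree_reduce(combine, items):
    """Reduce items pairwise, level by level, with a binary combine function."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item")
    while len(items) > 1:
        # Pairs within a level are independent, so a framework such as
        # dask, beam, or cubed could evaluate them in parallel.
        paired = [combine(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # the odd item carries over to the next level
        items = paired
    return items[0]


# Stand-in for combining five per-file reference sets in order.
result = tree_reduce(lambda left, right: left + right, [[0], [1], [2], [3], [4]])
```

Input order is preserved at every level, so this only requires the combine function to be associative, not commutative.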
Questions
How close can these functions be to xarray's versions of merge/concat/combine? And what can we learn from the design decisions in pangeo-forge-recipes FilePattern? (@cisaacstern @rabernat)
How close are kerchunk's existing combine.merge_vars and combine.concatenate_arrays functions to providing this functionality? If the answer is "pretty close", then how much of this issue could be solved via documentation?