Review of first notebook 01_io_and_attributes.ipynb #2

@guillaumeeb

Just a review of the first notebook content:

  • First, in general, why rely on pip rather than Conda for all the packages?

  • Notebooks should probably be committed without cell outputs.

  • show(ds_rasterio) leads to a memory error on Binder (with 2 GB of memory). This also happened several other times when too many cells were executed and the data had been loaded by different libraries.

  • After data = ds_rasterio.read(), 1.7 GB of memory is used.

  • Here, we see that our chunks have the size of 121 MB (11264 x 11264), which is too big and can lead to memory overload

    This is not necessarily too big: about 100 MB is fine for big collections. For a single EO product, this is probably too big, and not aligned with how the underlying arrays are laid out in their corresponding files.

  • This comes from the fact that the original rasters is not chunked on disk

    Not sure about that: even if the files were chunked, are you sure rioxarray would use this information for the default chunk size?

  • (the ideal recommanded size is generally between 10~50 MB)

    I would say somewhere between 10 and 200 MB, depending on the dataset size.

  • rechunking can take a bit of time

    More than that: rechunking is a heavy operation that must be avoided.

  • I've never used odc-geo, but I think there is some magic that should be explained: how do we get the odc accessor when the data is loaded through rioxarray?

  • GCP part: cannot be executed on Binder, because there is no "data" folder in the repository.
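
To make the chunk-size discussion above a bit more concrete, here is a small back-of-the-envelope sketch using only the standard library. It checks the 121 MB figure quoted from the notebook (which works out exactly if each pixel takes one byte and "MB" means MiB, so both are assumptions on my side) and picks a smaller square chunk that stays a multiple of an on-disk block size. The helper names `chunk_mib` and `aligned_chunk_edge` and the 512-pixel block size are hypothetical, chosen only for illustration; the real block size should be read from the file.

```python
import math


def chunk_mib(shape, itemsize):
    """Size of one chunk in MiB for a given shape and bytes per element."""
    return math.prod(shape) * itemsize / 2**20


def aligned_chunk_edge(target_mib, itemsize, block=512):
    """Largest square chunk edge that is a multiple of `block` and whose
    chunk stays at or below `target_mib` (never smaller than one block)."""
    max_elems = int(target_mib * 2**20) // itemsize
    edge = math.isqrt(max_elems)
    return max(block, (edge // block) * block)


# The chunk reported in the notebook, assuming 1 byte per pixel.
print(chunk_mib((11264, 11264), 1))      # 121.0

# A chunk of at most 32 MiB, aligned to the assumed 512-pixel blocks.
edge = aligned_chunk_edge(32, 1)
print(edge, chunk_mib((edge, edge), 1))  # 5632 30.25
```

If this matches the actual dtype, choosing the chunk shape once, when the file is opened, would avoid a later rechunk altogether.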
