The AODN Cloud Optimised library converts oceanographic datasets from IMOS (Integrated Marine Observing System) / AODN (Australian Ocean Data Network) into cloud-optimised formats such as Zarr (for gridded multidimensional data) and Parquet (for tabular data).
Visit the documentation on ReadTheDocs for detailed information.
- Convert CSV or NetCDF (single or multidimensional) to Zarr or Parquet.
- Dataset configuration: YAML-based configuration with inheritance, allowing similar datasets to share settings. Example: Radar ACORN, GHRSST.
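The inheritance idea can be sketched as a recursive dictionary merge: a child dataset configuration overrides only the keys that differ from a shared parent template. The key names below (`cloud_optimised_format`, `run_settings`, `dataset_name`) are illustrative placeholders, not the library's actual configuration schema.

```python
def deep_merge(parent: dict, child: dict) -> dict:
    """Recursively merge child settings over a parent template."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical parent template shared by similar gridded datasets
parent = {
    "cloud_optimised_format": "zarr",
    "run_settings": {"cluster": {"mode": "local"}},
}

# Child config for one dataset overrides only what differs
child = {
    "dataset_name": "radar_acorn_example",
    "run_settings": {"cluster": {"mode": "remote"}},
}

config = deep_merge(parent, child)
```

Keeping shared settings in one parent template means a fleet of similar datasets (e.g. the Radar ACORN sites) can be maintained by editing a single file.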
- Semi-automatic creation of dataset configuration: ReadTheDocs guide.
- Generic handlers for standard datasets: `GenericParquetHandler`, `GenericZarrHandler`
- Custom handlers can inherit from generic handlers: Argo handler, Mooring Timeseries Handler
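A minimal sketch of the inheritance pattern. The real `GenericParquetHandler` is provided by the library and its API may differ; the stub class, the `preprocess_data` hook, and the `JULD` filtering below are assumptions made purely for illustration.

```python
import pandas as pd


class GenericParquetHandler:
    """Stand-in for the library's generic handler (illustrative only)."""

    def __init__(self, dataset_config: dict):
        self.dataset_config = dataset_config

    def preprocess_data(self, df: pd.DataFrame) -> pd.DataFrame:
        return df  # generic pass-through


class ArgoHandler(GenericParquetHandler):
    """Dataset-specific handler overriding only the steps that differ."""

    def preprocess_data(self, df: pd.DataFrame) -> pd.DataFrame:
        df = super().preprocess_data(df)
        # Hypothetical dataset-specific rule: drop profiles without a date
        return df[df["JULD"].notna()]


handler = ArgoHandler(dataset_config={"dataset_name": "argo"})
raw = pd.DataFrame({"JULD": [1.0, None, 3.0], "PRES": [10.0, 20.0, 30.0]})
clean = handler.preprocess_data(raw)
```

Subclassing keeps dataset-specific logic (unit fixes, row filtering, renames) out of the generic pipeline while reusing all of its batching and output machinery.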
- Supports local and remote Dask clusters:
- Cluster behaviour is configuration-driven and can be easily overridden.
- Automatic restart of remote cluster upon Dask failure.
- Zarr: Gridded datasets are processed in batch and in parallel using `xarray.open_mfdataset`.
- Parquet: Tabular files are processed in batch and in parallel as independent tasks, implemented with `concurrent.futures.Future`.
- S3 / S3-Compatible Storage Support:
Support for AWS S3 and S3-compatible endpoints (e.g., MinIO, LocalStack) with configurable input/output buckets and authentication via `s3fs` and `boto3`.
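Pointing `s3fs` (and readers built on it) at an S3-compatible endpoint generally comes down to a storage-options mapping. The credentials, endpoint URL, and bucket names below are placeholders for a local MinIO instance, not values from this library's configuration.

```python
# Placeholder credentials and endpoint for a local S3-compatible store
# (e.g. MinIO on port 9000); substitute real values for your deployment.
storage_options = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

# With a running endpoint, the same options drive s3fs directly:
# import s3fs
# fs = s3fs.S3FileSystem(**storage_options)
# fs.ls("my-input-bucket")

# ...and flow through unchanged to higher-level readers:
# import pandas as pd
# df = pd.read_parquet("s3://my-output-bucket/dataset.parquet",
#                      storage_options=storage_options)
```

Using one options mapping for both the filesystem layer and the readers keeps local (LocalStack/MinIO) and production (AWS) runs identical apart from configuration.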
- Zarr: Reprocessing is achieved by writing to specific slices, including non-contiguous regions.
- Parquet: Reprocessing uses PyArrow internal overwriting; can also be forced when input files change significantly.
- Improves performance for querying and parallel processing.
- Parquet: Partitioned by polygon and timestamp slices. Issue reference
- Zarr: Chunking is defined in dataset configuration.
See doc
- Global Attributes -> variable
- variable attribute -> variable
- filename part -> variable
- ...
- Parquet: Metadata stored as a sidecar `_metadata.parquet` file for faster queries and schema discovery.
This library ships with an MCP (Model Context Protocol) server that exposes the AODN dataset catalogue to AI assistants such as GitHub Copilot CLI, Gemini CLI, and Claude Desktop.
It enables an AI to discover datasets, inspect schemas, verify real S3 data coverage, and generate validated Jupyter notebooks for oceanographic analysis.
See aodn_cloud_optimised/mcp/README.md for installation and usage instructions.
Requirements:
- Python >= 3.11
- AWS SSO configured for pushing files to S3
- Optional: Coiled account for remote clustering
To use the library for data processing pipelines (Zarr/Parquet conversion), no notebook or test dependencies are needed:
```shell
git clone https://github.com/aodn/aodn_cloud_optimised.git
cd aodn_cloud_optimised
make core  # installs core deps via Poetry venv
```

Alternatively, use the bootstrap script:

```shell
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
```

Otherwise, go to the release page.
Full contributor setup (notebooks + tests + docs + tooling):
```shell
git clone https://github.com/aodn/aodn_cloud_optimised.git
cd aodn_cloud_optimised
make dev                              # Poetry venv (recommended)
# or: ./setup_miniforge_venvs.sh dev  # named mamba env alternative
poetry run pre-commit install
```

See ReadTheDocs - Dev for full details.
A curated list of Jupyter notebooks, ready to load in Google Colab or Binder, lets users explore IMOS/AODN datasets converted to cloud-optimised formats. Click on the badge above.