
feat(esm-tools-plus/simcat): add esm_catalog — STAC-based experiment catalog #1473

Open
siligam wants to merge 2 commits into release from esm-tools-plus/simcat/pr-esm-catalog


siligam commented May 18, 2026

Context

Part of the ESM-Tools-plus/simcat initiative, which adds a STAC-based experiment catalog to ESM-Tools for indexing, querying and browsing climate model output.

This PR introduces esm_catalog — the catalog API, scanner, and CLI.
A companion PR adds esm_viz — the visualization service.

Reviewer tip: this is a new top-level package. All files are additions. Start with src/esm_catalog/ARCHITECTURE.md for an overview, then src/esm_catalog/cli.py for the user-facing entry points.


What's included

src/esm_catalog/ — the package (108 files, ~32k lines)

| Sub-package | Purpose |
|---|---|
| api/ | FastAPI STAC server (stac-fastapi-api backed by DuckDB) |
| scan/ | NetCDF / GRIB / namelist scanner; writes STAC items to DuckDB |
| stac/ | STAC Collection/Item builders + ESM-specific extensions |
| storage/ | DuckDB storage layer + personal collections |
| hpc/ | HPC system detection (Albedo, Levante, …) |
| integration/ | ESM-Tools runscript hooks for auto-scanning |
| mcp/ | Model Context Protocol server for LLM tool access |
| cli.py | esm-catalog CLI (scan, serve, mcp, reindex, …) |

STAC extensions defined:

  • hpc — HPC system name on assets (hpc:system)
  • namelist — F90 namelist parameters (nml:*)
  • paleo — deep-time metadata (paleo:years_bp)
  • datacube — variable/dimension axes
  • contacts — experiment owner info
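As a rough illustration of where these extension fields might sit on a single item, the sketch below uses a plain dictionary. Only the `hpc:system`, `nml:*` and `paleo:years_bp` field names come from the list above; the ids, values, asset path and the `namelist_params` helper are hypothetical, and the real builders in `stac/` may structure items differently.

```python
from typing import Any, Dict

# Hypothetical item: ids, values and the asset path are made up for illustration.
item: Dict[str, Any] = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "collection": "example-experiment",
    "properties": {
        "component": "echam",
        "nml:radctl.co2vmr": 284.7,   # namelist extension (value and units assumed)
        "paleo:years_bp": 21000,      # paleo extension
    },
    "assets": {
        "data": {
            "href": "/path/to/output.nc",
            "hpc:system": "levante",  # hpc extension lives on the asset
        }
    },
}


def namelist_params(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the nml:* properties, with the prefix stripped from each key."""
    prefix = "nml:"
    return {
        key[len(prefix):]: value
        for key, value in properties.items()
        if key.startswith(prefix)
    }


params = namelist_params(item["properties"])
```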

API highlights:

  • GET /collections — all experiments, filterable via CQL2
  • GET /experiments — experiment-level search with CQL2 (component=, variable=, nml:radctl.co2vmr > 284)
  • GET /search — STAC item search
  • GET /queryables — OGC queryables for browser filter UI
  • GET /paleo-presets — named paleo time periods (LGM, mid-Holocene, …)
  • GET /collections/{id}/nml-parameters — namelist params for an experiment
  • Personal collections CRUD (/personal/…)
  • Catalog registry API (/catalogs) + web admin UI at /ui
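A minimal sketch of how a client might build request URLs for the CQL2-enabled endpoints. It assumes the server accepts the `filter` and `filter-lang` query parameters of the STAC API filter extension on both `/search` and `/experiments`; the base URL and the `cql2_url` helper are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # hypothetical address of a local `esm-catalog serve`


def cql2_url(endpoint: str, cql2_text: str) -> str:
    """Build a GET URL that carries a CQL2-text filter as query parameters."""
    query = urlencode(
        {"filter-lang": "cql2-text", "filter": cql2_text},
        quote_via=quote,  # percent-encode spaces as %20 rather than '+'
    )
    return f"{BASE_URL}{endpoint}?{query}"


# Item search restricted to one variable.
search_url = cql2_url("/search", "variable='sst'")

# Experiment-level search on a namelist parameter range.
experiments_url = cql2_url("/experiments", "nml:radctl.co2vmr > 284")
```

Either URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(search_url)`.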

Option A catalog layout (current): one STAC Collection per experiment, components stored as item properties (properties.component), enabling cross-component queries without per-component collection proliferation.
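The effect of this layout can be sketched in memory: because every item carries its component as a property, a cross-component question becomes a filter over one experiment's items rather than a join across collections. The experiment ids, item ids and helper names below are hypothetical, and the single-valued `variable` property is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Set

# Hypothetical items: one collection per experiment, component as an item property.
items: List[Dict[str, Any]] = [
    {"collection": "exp-001", "id": "a", "properties": {"component": "fesom", "variable": "sst"}},
    {"collection": "exp-001", "id": "b", "properties": {"component": "echam", "variable": "temp2"}},
    {"collection": "exp-002", "id": "c", "properties": {"component": "fesom", "variable": "sst"}},
]


def components_by_experiment(items: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each experiment (collection id) to the set of components it contains."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for item in items:
        result[item["collection"]].add(item["properties"]["component"])
    return dict(result)


def experiments_with_all(items: Iterable[Dict[str, Any]], required: Iterable[str]) -> List[str]:
    """Return experiments that contain every one of the required components."""
    wanted = set(required)
    by_experiment = components_by_experiment(items)
    return sorted(exp for exp, comps in by_experiment.items() if wanted <= comps)


coupled = experiments_with_all(items, ["fesom", "echam"])
```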

tests/test_esm_catalog/ — test suite

~4000 lines covering API, CLI, storage, scanning, personal collections, integration, HPC detection, and STAC object construction.

Shared infrastructure (also in this PR)

| File | Purpose |
|---|---|
| pixi.toml / pixi.lock | Pixi environment for local dev |
| setup.py | Editable install entry point |
| .github/workflows/docker-esm-catalog.yml | Docker image CI |
| docs/esm_catalog_*.rst | User guide, architecture ref, VirtualiZarr workflow |
| examples/fesom_stac/ | Catalog builder examples for FESOM experiments |
| examples/echam_grib/ | ECHAM GRIB reading examples |
| examples/notebooks/ | Jupyter notebook walkthrough |
| utils/ | See esm_viz PR |

Key design decisions

  • DuckDB as storage: single-file, zero-infrastructure, fast JSON/array queries via SQL. No Postgres required on HPC.
  • STAC as the wire format: interoperable with existing STAC clients (stac-browser, pystac, intake-stac).
  • CQL2 filtering: full OGC CQL2-text/JSON support, including namelist parameter range queries.
  • Option A layout: one collection per experiment (not one per component) keeps the browser grid clean and allows multi-component queries.

Test plan

  • pixi run pytest tests/test_esm_catalog/ passes
  • esm-catalog scan --help and esm-catalog serve --help run without error
  • esm-catalog scan <experiment_path> produces a catalog.duckdb
  • esm-catalog serve --catalog catalog.duckdb serves GET /collections and GET /search
  • GET /queryables returns variables, components, experiment types
  • CQL2 filter ?filter=variable='sst' returns only SST items
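The last check could be automated along the following lines. This is only a sketch: it assumes the response is a GeoJSON FeatureCollection whose items expose a single-valued `properties.variable`, which the PR text implies but does not spell out; the `non_matching_ids` helper and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def non_matching_ids(feature_collection: Dict[str, Any], variable: str) -> List[Any]:
    """Return the ids of features whose properties.variable differs from `variable`."""
    return [
        feature.get("id")
        for feature in feature_collection.get("features", [])
        if feature.get("properties", {}).get("variable") != variable
    ]


# Hypothetical response body, as it might come back for ?filter=variable='sst'.
response = {
    "type": "FeatureCollection",
    "features": [
        {"id": "a", "properties": {"variable": "sst"}},
        {"id": "c", "properties": {"variable": "sst"}},
    ],
}

# An empty list means the filter returned only SST items.
leaks = non_matching_ids(response, "sst")
```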

🤖 Generated with Claude Code

siligam and others added 2 commits May 18, 2026 13:48
… for ESM-Tools

Introduces the esm_catalog package as part of the ESM-Tools-plus/simcat
initiative. Provides a DuckDB-backed STAC API, scanner, CLI, and MCP
server for cataloguing and querying climate model experiment output.

See src/esm_catalog/ARCHITECTURE.md for design overview.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move catalog deps (duckdb, pyarrow, pystac, shapely, cfgrib, etc.)
  to extras_require["catalog"] to fix pip install in CI
- Add from __future__ import annotations for Python 3.8 compatibility
- Guard duckdb import with try/except for optional install
- Exclude src/esm_catalog and src/esm_viz from pytest auto-discovery
- Fix esm_motd _get_real_dir_from_pth_file("") crash with try/except

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
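
The guarded import mentioned in the second commit typically follows the pattern below. This is a generic sketch of the technique, not the code from the PR: the `require_duckdb` and `open_catalog` helpers and the error message are hypothetical, and only `duckdb.connect` is a real DuckDB call.

```python
from typing import Any, Optional

try:
    import duckdb  # optional dependency, provided by the "catalog" extra
except ImportError:
    duckdb = None  # importing this module still succeeds without DuckDB


def require_duckdb() -> Any:
    """Return the duckdb module, or raise a clear error if it is not installed."""
    if duckdb is None:
        raise RuntimeError(
            "duckdb is not installed; install the package with the 'catalog' extra"
        )
    return duckdb


def open_catalog(path: Optional[str] = None) -> Any:
    """Open a DuckDB connection; falls back to an in-memory database."""
    db = require_duckdb()
    return db.connect(path or ":memory:")
```

With this pattern the failure is deferred from import time to the first call that actually needs DuckDB, so a base install without the extra can still import the module.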