diff --git a/docs/fy26-roadmap.md b/docs/fy26-roadmap.md
deleted file mode 100644
index 6c94442..0000000
--- a/docs/fy26-roadmap.md
+++ /dev/null
@@ -1,129 +0,0 @@
-# ODD Fiscal Year (FY) 2026 Roadmap
-
-If you are interested in a better understanding of the ODD service roadmap, and what datasets will be supported when, this document is for you.
-
-This document provides a roadmap for the VEDA Optimized Data Delivery Team (ODD), broken into 4 categories:
-1. Services for granules in CMR
-2. Services for datacubes
-3. Services non-datacube
-4. Foundational Work
-
-It is important to note that this roadmap is a reflection of the team's current plans, written as of November 2025. These are likely to evolve over time. We intend to update the roadmap quarterly.
-
-For a higher-level vision, see also: [Optimized Data Delivery Roadmap for NASA - July 2025](https://docs.google.com/presentation/d/1Ouo_9qJJuDBdrzDHpt2P-o1wGBPS1nvTjLRFAFGsYkU/edit?usp=sharing).
-
----
-
-## Legend
-
-- **✅ Complete** - Already delivered
-- **🚧 In Progress** - Active development
-- **🔄 Ongoing** - Ongoing work
-- **📅 Planned** - Scheduled for specific quarter
-- **🔮 Future** - Planned for future timeline
-
----
-
-![Services for CMR Granules](./category1-granules.svg)
-
-## Roadmap for Service Category 1: Services for CMR Granules
-
-### Access
-*N/A*
-
-### Visualization
-- **✅ Complete** titiler-cmr /tiles API + VEDA UI integration
-
-### Timeseries
-- **✅ Complete** titiler-cmr /timeseries/statistics API + VEDA UI integration
-
-### Additional Features
-- **🚧 26.1** Release /compatibility endpoint
-- **📅 26.2+** Develop support for more datasets, informed by compatibility testing in 26.1.
-
-### Dataset Support
-- **✅ Complete** Demonstrated with GPM IMERG, TROPESS O3 and MiCASA
-- **🚧 26.1** Compile a list of compatible datasets
-- **🚧 26.1** Develop support for EDL-based credential access, as an aternative to requester-pays and role-based access. 
To support NISAR (ASF) and GEDI L4B (ORNL DAAC) specifically.
-- **📅 26.2+** Test integration of new datasets as requester-pays is enabled for more buckets.
-
-### Performance + Operations
-- **🚧 26.1** Deploy monitoring + performance evaluation via service tracing (OpenTelemetry)
-- **📅 26.1** MCP Production deployment
-- **📅 26.2** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis
-
-### Ecosystem Development
-- **📅 26.2** Share compatible dataset list with NASA product teams for potential integration (i.e. Worldview)
-- **📅 26.2+** Continued documentation to support self-service use of titiler-cmr.
-
----
-
-![Services for Datacubes](./category2-datacubes.svg)
-
-## Roadmap for Service Category 2: Services for Datacubes
-
-### Access
-- **✅ Complete** Lazy loading/intelligent subsetting/intelligent access for varied data formats (GRIB, COG, NetCDF-4, HDF5 via VirtualiZarr)
-- **📅 26.1** Support adoption of Virtual Zarr through library maintenance, improved documentation, and user support
-- **📅 26.2** Support for arbitrary [chunk-grids (variable chunking)](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#chunk-grids)
-- **📅 26.2** Explore virtualization methods for alternate grid structures (i.e., healpix, cubegrid)
-
-### Visualization
-- **📅 26.1** Virtual container (Icechunk) integration in titiler-multidim to support /tiles endpoints
-- **📅 26.1** Identify additional I/O parameters to allow for per-dataset optimizations
-- **📅 26.1** Test VEDA UI integration of /tiles for a virtual dataset (e.g. NLDAS)
-- **📅 26.2** Additional performance improvements (e.g. obstore integration)
-
-### Timeseries
-- **📅 26.1** Design the timeseries/statistics endpoint to support datacubes (i.e. 
could be an asynchronous API outside the titiler ecosystem)
-- **📅 26.2** Develop the timeseries/statistics endpoint
-- **📅 26.2** Integrate the timeseries/statistics endpoint into VEDA UI
-
-### Datasets
-- **✅ Complete** Prototyped virtual (Icechunk) stores for NLDAS, RASI, HRRR, MUR SST
-- **📅 26.1** Demonstrate publication and tiling of NLDAS virtual store (💧 Water Insight)
-- **📅 26.1** Architecture + documentation for generalizing STAC publication and VEDA UI /tiles integration
-- **📅 26.2** HydroGlobe 5km and 10km virtual stores (💧 Water Insight)
-- **📅 26.2** CarbonTracker-CH₄, EPA Gridded CH₄ Emissions Inventory virtual stores (🏭 GHGCenter)
-- **📅 26.3** Documentation for STAC publication and VEDA UI /timeseries/statistics integration
-- **📅 26.3** CarbonTracker-CH₄, EPA Gridded CH₄ Emissions Inventory tiles and timeseries integrations (🏭 GHGCenter)
-- **📅 26.3** TROPESS NOx, TROPESS O3, JPL MOMO Chem, GEOS CF virtual stores, tiles and timeseries integrations (💨 Air Quality)
-
-### Operations
-- **📅 26.2** Monitoring + Performance evaluation via service tracing (OpenTelemetry)
-- **📅 26.3** MCP deployment
-- **📅 26.2** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis
-
-### Ecosystem Development
-- **📅 26.1** Create template data ingestion pipeline for virtualizing datasets
-- **📅 26.3+** Moving towards self-service integration
-
----
-
-![Services for Non-Datacubes](./category3-nondatacubes.svg)
-
-## Roadmap for Service Category 3: Services for Non-Datacubes
-
-### Access
-- **🚧 26.1-26.3** Prototyping creating a query engine using a Zarr provider for data fusion
-
-### Visualization
-- **🔮 26.4 or FY 27** Tiling endpoints in near-term, direct client approaches in long-term
-
-### Timeseries
-- **🔮 26.4 or FY 27** Timeseries API
-
-### Datasets
-- **📅 26.1** Prototype HLS store
-- **📅 26.3+** 
Prototype NISAR and/or Opera stores
-
-### Operations
-- **🔮 26.4 or FY 27** Operational deployment + documentation
-- **🔮 26.4 or FY 27** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis
-
-### Ecosystem Development
-- **🔮 26.4 or FY 27** Develop ecosystem, moving towards self-service adoption within VEDA and broader community
-
-## Roadmap for Service Category 4: Foundational Work (including Technical Debt)
-
-- **🔄 26.1+** Establish areas for consolidation in the TiTiler ecosystem. Similar features across applications should rely on shared upstream libraries. The ODD team continuously identifying similar features and proactively DRY up codebases.
diff --git a/docs/index.md b/docs/index.md
index 1acf3d2..3faa5d0 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,4 +4,4 @@ Welcome to the documentation for the Optimized Data Delivery (ODD) team, working
 
 ## ODD FY26 Roadmap
 
-For a digest of what the team plans to work on this next year, please visit our [Fiscal Year (FY) 2026 Roadmap](./fy26-roadmap.md).
+For a digest of what the team plans to work on this next year, please visit our [Roadmap](./roadmap.md).
diff --git a/docs/roadmap.md b/docs/roadmap.md
new file mode 100644
index 0000000..a6c8988
--- /dev/null
+++ b/docs/roadmap.md
@@ -0,0 +1,158 @@
+# ODD roadmap
+
+This page explains the motivations behind ODD's daily work. It connects what we're building to why we're building it. The primary audience is the ODD team. The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture.
+
+## Vision
+
+If we are successful, we imagine users will be able to:
+
+1. **Ask questions in plain language and reproduce the response:** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual. 
I want each response to link to the source code that produced the analysis, so I can verify and reproduce it.
+2. **Explore in the browser:** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account.
+3. **Research at scale:** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling.
+4. **Operate in near-real time:** As an operational application, I need products like HLS for disaster response, or sea surface temperature for maritime operations, available in near-real time.
+
+## The gap
+
+NASA already serves these users, but current services have limits that grow more acute as data volumes grow:
+
+| User story | Today's services | Where they fall short |
+| --- | --- | --- |
+| Ask in plain language | Earth Information Explorer | Limited dataset access; datasets must be curated into the system |
+| Explore in the browser | Worldview / GIBS | Not configurable by users; pre-rendered layers don't scale to new datasets or rendering needs |
+| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers, requiring heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs |
+| Operate in near-real time | LANCE + HLS | Hard to keep metadata and data in sync; no reliable notification system for new data landing in Earthdata Cloud buckets |
+| Data discovery | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic |
+
+## Our pillars
+
+We address these gaps through 
four pillars:
+
+1. **Open standards & FAIR data:** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems.
+2. **Performance, cost & scale:** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes.
+3. **Empowered users:** Users (both data providers and data consumers) can use and apply the solutions we build without us.
+4. **Trusted & reliable data:** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata.
+
+Further, we maintain high standards for the software we develop or reuse, while never intending to duplicate effort. All software we develop or use should be of high quality, under an open source license, and developed and adopted by a broad community.
+
+## Roadmap
+
+The table below lists the technologies and technical components this team is contributing to or plans to contribute to. We believe these components will advance the vision and pillars described above.
+
+Below, the **[Roadmap Items in Detail](#roadmap-items-in-detail)** section provides a brief description of each roadmap item.
+
+* **Now · mature** means this is a mature technology. We are currently working on it but it is ready for adoption.
+* **Next · developing** means this is a developing technology. We are currently working on it so that it becomes ready for adoption. The timeline for maturity and adoption readiness varies.
+* **Later · future** means this is a technology we are not actively developing. We would like to work on it but other technologies in active development take precedence.
+
+The **◆** designation represents a category of ongoing work. 
+
+| Pillar | Now · mature | Next · developing | Later · future |
+| --- | --- | --- | --- |
+| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Zarr ecosystem sustainability · Codec re-architecture · Variable chunking | Conventions + CRS utilities |
+| **Performance, cost & scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization · Caching |
+| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration |
+| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Virtual stores for ongoing datasets · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates |
+
+## How we work
+
+ODD is a research and development team, not an operations or continued-maintenance team. Success for any item on this roadmap is *graduating off of it*, not staying on it indefinitely.
+
+### Lifecycle
+
+We expect work to move through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else).
+
+An item is ready to hand off when it passes three tests:
+
+1. 
**Someone else can do it.** Documentation, tooling, and skills exist so that a data provider or partner can reproduce the work without us.
+2. **Someone else owns it.** A named owner (a DAAC, a mission team, or community maintainers) has accepted responsibility.
+3. **We've stopped learning.** Our remaining contribution is maintenance, not discovery.
+
+Using virtual data stores as an example: today we generate stores ourselves (learning). Next,
+we will ship developer docs and optimization skills (enabling). Then store generation
+graduates to data providers. While we will continue to work on underlying tooling, several
+roadmap items (documentation, decision tooling, and ecosystem sustainability) are not just projects but
+handoff methods.
+
+The above steps and example are notional and not established through practice.
+
+### Prioritization
+
+Objectives we take on must also balance "utopian" goals (like a unified Zarr model)
+with the necessity of supporting legacy patterns and other formats.
+
+When evaluating new candidate work, we apply these criteria:
+
+- **Vision alignment:** Does it serve at least one vision story and satisfy all appropriate pillars?
+- **Adoption readiness:** How quickly can the ecosystem absorb it? Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk.
+- **Cost:** What does adoption cost in compute, energy, and onboarding (users and systems)?
+- **Handoff path:** Can we state who would eventually own this?
+
+## Roadmap Items in Detail
+
+Below, each technical component is briefly explained.
+
+### Open standards & FAIR data
+
+**◆ Array format stewardship:** The foundational format for cloud-native array data is Zarr. This component comprises ongoing maintenance and stewardship, including convening the community, e.g. 
Zarr Summit '26/27, to unblock progress on technical features and convention adoption.
+
+**◆ Geospatial conventions:** Zarr conventions for geospatial metadata (GeoZarr) are essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Success is trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent platform to build client applications on.
+
+**Ecosystem sustainability:** Zarr will support growing, complex use cases through a sustainable maintainer ecosystem. That ecosystem includes the work detailed in the zarr-python roadmap plus maintainer onboarding.
+
+**Codec re-architecture:** The Zarr v2 -> v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs), alternative client implementations in Rust and TypeScript, and fixes for quirky data (CF codecs and concatenating arrays with varied codecs).
+
+**Conventions + CRS utilities:** Utilities and guidance on CF and GeoZarr conventions will keep virtual store metadata aligned with tooling. This work will unblock tools that rely on those conventions from using compliant virtual stores.
+
+### Performance, cost & scale
+
+**Data virtualization:** Data virtualization enables access to archival data through the Zarr API without duplicating it. Work includes VirtualiZarr parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) and transitioning maintenance to partners.
+
+**Object-store access:** Libraries such as obstore provide high-performance object storage access for the Python geospatial stack.
+
+**Dynamic tiling:** Dynamic tiling enables visualization without maintaining static image pyramids. Future work includes supporting additional datasets and integrations, for example WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. 
+
+**Lazy array analytics:** Instantly materialize massive lazy multi-dimensional arrays (time, band, x, y) from metadata stores (e.g. lazycogs and lazymerge). These libraries provide a scalable replacement for stackstac/odc-stac.
+
+**Variable chunking:** Variable chunk support in VirtualiZarr + xarray will unlock virtualizing more datasets.
+
+**Analytics-scale metadata:** EOSDIS has identified pressure on CMR as a significant risk. We are prototyping collection-level stores using GeoParquet/Iceberg and DataFusion to understand performance, cost, and scaling, and to contribute to the relevant open-source libraries.
+
+**Storage model evaluation:** We will evaluate emerging storage models and their trade-offs, such as the [S3 Files synchronization system](https://aws.amazon.com/s3/features/files/).
+
+**Resampling/warp tooling:** A composable, Rust-based resampling/warp library will reduce dependence on GDAL's monolithic toolchain. Such a library would be useful for server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. This idea is still in the design and ecosystem assessment phase.
+
+**Query at scale:** We are demonstrating query and access at scale through a single interface (zarr-datafusion-search). This library demonstrates a Zarr interface for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem.
+
+**Storage cost optimization:** Data virtualization addresses the growing cost of data volumes in Earthdata Cloud by accessing archival data through the Zarr API without duplicating it. Future work includes applying other storage cost strategies as evaluated in the work item listed above.
+
+### Empowered users
+
+**Cloud-native guidance:** The CNG guide unblocks people confused about which formats exist, why, and when to use each.
+
+**Science support:** We continue to work with the dedicated science support team to provide cloud-optimized data guidance. 
+
+**Format evaluation:** We continue to evaluate mission data formats and recommend improvements that enable optimized access patterns.
+
+**In-browser rendering:** We are developing in-browser GPU rendering of COGs and Zarr via direct data access (e.g. deck.gl-raster + Lonboard). Users customize rendering without re-fetching data.
+
+**Virtual store documentation:** Virtual store documentation (how to build virtual stores, with or without agents) will unblock DAACs and science teams as virtual store developers.
+
+**Cloud-optimized decision framework:** The cloud-optimized data decision tree will guide format and chunking decisions. This will also serve as the foundation for AI-assisted optimization.
+
+**Improved access & auth libraries:** We provide development support to libraries that get data and credentials into users' hands (e.g. earthaccess).
+
+**AI-assisted optimization (skills + tooling):** A CLI and agentic skill for data structure optimization will build on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure.
+
+**Dataset + tooling coverage metrics:** An assessment of how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) will provide metrics for improvement and impact.
+
+**ESRI / ArcGIS integration:** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems. We need to ensure our cloud-native outputs are consumable through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr).
+
+### Trusted & reliable data
+
+**◆ Transactional Zarr:** Checksum verification and ACID transactions for Zarr stores, via Icechunk, provide reliability.
+
+**Near-real time virtual stores:** We will keep stores current as data arrives. This work will serve anyone doing historical or NRT sea surface temperature analysis.
+
+**Synchronized metadata + data:** Keep metadata in sync with data to ensure analyses are valid. 
+
+**Event-driven NRT updates:** Stores such as Icechunk make all store updates trackable by listening to changes in object storage keys. Simple event-driven pipelines will enable dynamically updated pyramids (e.g., for Worldview), summary statistics, and pre-computed time series. This is the path to keeping virtual stores current with incoming data streams.
diff --git a/mkdocs.yml b/mkdocs.yml
index 2113ff9..be2fa6e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -2,7 +2,7 @@ site_name: Optimized Data Delivery
 repo_name: NASA-IMPACT/veda-odd
 site_author: ODD Team
 docs_dir: docs
-site_url: !ENV [READTHEDOCS_CANONICAL_URL, 'https://nasa-impact.github.io/veda-odd/']
+site_url: !ENV [READTHEDOCS_CANONICAL_URL, "https://nasa-impact.github.io/veda-odd/"]
 
 extra:
   version:
@@ -11,7 +11,7 @@ extra:
 
 nav:
   - "index.md"
-  - FY26 Roadmap: "fy26-roadmap.md"
+  - Roadmap: "roadmap.md"
   - PI Objectives: "objectives.md"
   - ODD Products: "products.md"
   - Tech Tips: "tech-tips.md"
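
Because this change renames `docs/fy26-roadmap.md` to `docs/roadmap.md`, any leftover reference to the old filename would become a dead link. The following is a minimal sketch, not part of the change itself, of a check for relative Markdown links that point at missing files. The helper name `find_broken_links` and the regular expression are assumptions made for illustration; the sketch only handles inline `[text](target)` links and skips URLs and same-page anchors.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of this repository: matches inline Markdown
# links and images such as [text](./roadmap.md) or ![alt](./img.svg).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Targets that start with a URL scheme (https:, mailto:, ...) are external.
SCHEME_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(docs_dir):
    """Return (markdown file, link target) pairs whose target file is missing."""
    docs_dir = Path(docs_dir)
    broken = []
    for md_file in sorted(docs_dir.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            # Skip external URLs and same-page anchors.
            if SCHEME_PATTERN.match(target) or target.startswith("#"):
                continue
            # Drop any "#section" fragment before resolving the path.
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.relative_to(docs_dir).as_posix(), target))
    return broken


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for source, target in find_broken_links("docs"):
        print(f"{source}: broken link -> {target}")
```

Run against `docs/` after the rename, a check along these lines should report nothing if every reference to the old filename was updated. It does not inspect `mkdocs.yml`, so the `nav` entry would still need a separate look.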