Support discovery of index tables available from idc-index-data #198

Copilot · 2025-11-26T21:59:31Z

Index tables were hardcoded in the package. This PR enables automatic discovery of available indices from idc-index-data release assets, with descriptions populated from the accompanying JSON schema files.

Changes

Automatic discovery (_discover_available_indices): Queries GitHub releases API to dynamically discover all available indices from parquet files in release assets - no hardcoded index list
Disk caching: Indices and schemas are cached to disk on first initialization and loaded from cache on subsequent calls. Cache is versioned and automatically refreshed when idc-index-data version changes
Schema API (get_index_schema): New method to retrieve full schema (table_description, columns with types/descriptions). Uses in-memory cache populated during discovery instead of re-fetching
Refresh capability (refresh_indices_overview): Force re-discovery of available indices and refresh both disk and memory caches
Improved fetch_index: Loads already-installed indices from disk instead of just warning; handles missing files gracefully
Unified processing: All indices (including bundled ones like index and prior_versions_index) get their descriptions from release asset schema files

Usage

from idc_index import IDCClient

client = IDCClient()

# Indices are now auto-discovered with descriptions from schema files
# On first init: fetches from API and caches to disk
# On subsequent inits: loads from disk cache (unless version changed)
print(client.indices_overview)
# {'index': {'description': 'This is the main metadata table...', 'installed': True, ...},
#  'sm_index': {'description': 'This table contains metadata about slide microscopy...', ...},
#  'collections_index': {...}, 'analysis_results_index': {...}, ...}

# Get full schema for any index (uses cached schemas from discovery)
schema = client.get_index_schema("sm_index")
print(schema["table_description"])
print(schema["columns"])  # [{name, type, mode, description}, ...]

# Force refresh of available indices (clears cache and re-fetches)
client.refresh_indices_overview()

Discovered Indices

Dynamically discovers all indices from the idc-index-data release assets. When the GitHub API is unavailable, falls back to bundled indices (index, prior_versions_index) with descriptions fetched from direct download URLs.

Fixes Support discovery of index tables available from idc-index-data #197

Original prompt

This section details the original issue you should resolve

<issue_title>Support discovery of index tables available from idc-index-data</issue_title>
<issue_description>This python package provides access to a set of tables loaded from parquet. These are listed in https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L128. Two of the tables, index and prior_versions_index, are included in the package and are installed automatically. The remaining indices are hard-coded. We need to switch from the hard-coded list of indices to automatic discovery of those tables based on the attachments for the idc-index-data dependency. Those tables are available as release attachments in parquet files, and are accessible at https://github.com/ImagingDataCommons/idc-index-data/releases/tag/<idc-index-data version> (for example, see https://github.com/ImagingDataCommons/idc-index-data/releases/download/23.0.1/analysis_results_index.parquet). The list of available index tables should be automatically discovered during package initialization and used to populate indices_overview.

Descriptions of the individual index tables should be populated from the schema files that are available from idc-index-data along with the parquet files. For example, analysis_results_index.parquet will be accompanied by analysis_results_index.json sibling attachment at https://github.com/ImagingDataCommons/idc-index-data/releases/download/23.0.1/analysis_results_index.json. Table description is available in the table_description attribute. The entire schema for the individual tables should also be cached, and made available to the user via idc-index API.

Discovered index list should be cached, the function that populates it should have an option to override and refresh the cached content.
</issue_description>
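The sibling-attachment convention described in the issue can be sketched as follows. This is a minimal illustration, not the package's code: `schema_url_for` and `table_description` are hypothetical helper names, and the only assumption about the schema file is the `table_description` attribute the issue mentions.

```python
import json

RELEASE_BASE = "https://github.com/ImagingDataCommons/idc-index-data/releases/download"


def schema_url_for(parquet_url: str) -> str:
    """Return the URL of the JSON schema attached next to a parquet asset."""
    suffix = ".parquet"
    if not parquet_url.endswith(suffix):
        raise ValueError(f"not a parquet asset: {parquet_url}")
    return parquet_url[: -len(suffix)] + ".json"


def table_description(schema_text: str) -> str:
    """Read the table_description attribute from schema JSON text (hypothetical helper)."""
    return json.loads(schema_text).get("table_description", "")


parquet_url = f"{RELEASE_BASE}/23.0.1/analysis_results_index.parquet"
print(schema_url_for(parquet_url))
```

Keeping the URL derivation separate from the download step means the naming rule can be checked without any network access.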

Comments on the Issue (you are @copilot in this section)




fedorov · 2025-11-26T22:29:19Z

idc_index/index.py

 aws_endpoint_url = "https://s3.amazonaws.com"
 gcp_endpoint_url = "https://storage.googleapis.com"
 asset_endpoint_url = f"https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}"
+github_api_url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"


@copilot Endpoint URL is switched from github.com to api.github.com. Is there a rate limit? Is rate limit applied to both endpoints?

Yes, both endpoints have rate limits for unauthenticated access:

api.github.com (REST API): 60 requests/hour for unauthenticated users

github.com/.../releases/download/ (direct download): No strict rate limit, but subject to abuse detection

The implementation handles this by:

First trying the API endpoint to discover all available indices dynamically

If API returns 403 (rate limited), falling back to a known list of indices while still fetching descriptions directly from the schema JSON files via the download endpoint

This fallback approach ensures the package works even when rate limited - users get the known indices with descriptions fetched directly rather than failing entirely.

fedorov · 2025-11-26T22:33:12Z

@copilot In idc_index/index.py:29 endpoint is changed from github.com to api.github.com. Are there any request rate limits that are applied with unauthenticated access? Are those limits applied for both endpoints?

fedorov · 2025-11-26T22:40:43Z

@copilot Please switch back to using direct download endpoints so that rate limits do not apply. There is no real need to use the API endpoint.

Copilot · 2025-11-26T22:45:02Z

@copilot Please switch back to using direct download endpoints so that rate limits do not apply. There is no real need to use the API endpoint.

Switched to using only direct download endpoints in commit 9fb9a62. Removed the github_api_url variable and simplified _discover_available_indices() to use the asset_endpoint_url for fetching schema JSONs directly without going through the API.

fedorov · 2025-12-02T18:00:18Z

idc_index/index.py

+            "clinical_index",
+            "collections_index",
+            "analysis_results_index",
+        ]


Do not hard-code known_remote_indices.

Removed the hardcoded known_remote_indices list in commit d693536. Indices are now dynamically discovered from the GitHub API by iterating through all parquet assets in the release.
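As a rough sketch of what iterating over parquet assets might look like: the function below takes an already-decoded release payload and relies only on the `assets`, `name`, and `browser_download_url` fields of the GitHub release response. The function name `indices_from_release` and the shape of the returned dictionary are hypothetical, not taken from the PR.

```python
from typing import Dict, Optional


def indices_from_release(release: dict) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each parquet asset in a release payload to its URL and sibling schema URL."""
    assets = {a["name"]: a["browser_download_url"] for a in release.get("assets", [])}
    found: Dict[str, Dict[str, Optional[str]]] = {}
    for name, url in assets.items():
        if not name.endswith(".parquet"):
            continue  # ignore schemas and any other attachments
        stem = name[: -len(".parquet")]
        # The schema is optional here: None when no <stem>.json asset exists.
        found[stem] = {"url": url, "schema_url": assets.get(f"{stem}.json")}
    return found


base = "https://github.com/ImagingDataCommons/idc-index-data/releases/download/23.0.1"
sample_release = {
    "tag_name": "23.0.1",
    "assets": [
        {"name": "sm_index.parquet", "browser_download_url": f"{base}/sm_index.parquet"},
        {"name": "sm_index.json", "browser_download_url": f"{base}/sm_index.json"},
        {"name": "idc_index.parquet", "browser_download_url": f"{base}/idc_index.parquet"},
        {"name": "notes.txt", "browser_download_url": f"{base}/notes.txt"},
    ],
}
discovered = indices_from_release(sample_release)
print(sorted(discovered))
```

The asset stem is used as the key unchanged, so any renaming (for example between an asset called `idc_index` and an index exposed as `index`) would have to happen in a separate step.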

fedorov · 2025-12-02T18:01:20Z

idc_index/index.py

+
+                    # Skip bundled indices
+                    if index_name in ("idc_index", "prior_versions_index"):
+                        continue


Do not skip bundled indices while initializing description and schema from the release assets.

Fixed in commit d693536. Bundled indices (idc_index, prior_versions_index) are no longer skipped - they are now processed along with all other indices from the API, getting their descriptions and schemas from the release assets.


Copilot · 2025-12-02T18:10:44Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/23.0.1
- Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_refresh_indices_overview tests/idcindex.py::TestIDCClient::test_indices_urls -v (http block)
- Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_get_index_schema_caching tests/idcindex.py::TestIDCClient::test_refresh_indices_overview tests/idcindex.py::TestIDCClient::test_indices_urls -v (http block)
- Triggering command: /usr/bin/python3 python3 -c from idc_index import IDCClient client = IDCClient() print('Discovered indices:') for name, info in client.indices_overview.items(): print(f' {name}: installed={info["installed"]}, has_description={bool(info["description"])}') (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

fedorov · 2025-12-02T19:34:45Z

idc_index/index.py

+        if index_name == "index":
+            schema_url = f"{asset_endpoint_url}/idc_index.json"
+
+        schema = self._fetch_index_schema_from_url(schema_url)


There is no need to re-fetch schema from the URL, since _discover_available_indices was called in __init__ and the schema should be available for all indices.

Fixed in commit ee35afb. get_index_schema() now uses schemas cached during _discover_available_indices() instead of re-fetching from URL. Schemas are stored in _index_schemas during discovery and reused.
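The idea of serving schemas from what discovery already fetched can be illustrated with a small stand-alone class. Only the names `_index_schemas` and `get_index_schema` come from this thread; the `SchemaCache` class, its `discover` method, and the injected `fetch_schema` callable are hypothetical.

```python
from typing import Callable, Dict, Iterable


class SchemaCache:
    """Holds schemas fetched once during discovery and serves them from memory."""

    def __init__(self, fetch_schema: Callable[[str], dict]):
        self._fetch_schema = fetch_schema
        self._index_schemas: Dict[str, dict] = {}

    def discover(self, index_names: Iterable[str]) -> None:
        # One fetch per index, performed up front.
        for name in index_names:
            self._index_schemas[name] = self._fetch_schema(name)

    def get_index_schema(self, name: str) -> dict:
        # No network access here: the schema must already be in memory.
        try:
            return self._index_schemas[name]
        except KeyError:
            raise KeyError(f"unknown index {name!r}; run discovery first") from None


fetch_log = []


def fake_fetch(name: str) -> dict:
    # Stand-in for downloading <name>.json from the release assets.
    fetch_log.append(name)
    return {"table_description": f"description of {name}", "columns": []}


cache = SchemaCache(fake_fetch)
cache.discover(["index", "sm_index"])
cache.get_index_schema("sm_index")
cache.get_index_schema("sm_index")
print(fetch_log)
```

Repeated lookups leave the fetch log unchanged, which is the behavior the review comment asks for.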

fedorov · 2025-12-02T19:36:58Z

idc_index/index.py

        # ... and check it can be executed
        subprocess.check_call([self.s5cmdPath, "--help"], stdout=subprocess.DEVNULL)

+    def _discover_available_indices(self, refresh: bool = False) -> dict:


Improve this function to add disk caching. On the first init, the schema should be fetched and saved to disk. On subsequent calls, it should be loaded from disk instead of being re-fetched. The cache should be matched against the version of idc-index-data and refreshed if the installed version does not match. The user should have an option to force a cache refresh if needed.

Implemented disk caching in commit ee35afb. Added _load_indices_cache_from_disk() and _save_indices_cache_to_disk() methods. Schemas are:

Fetched and saved to disk on first init (in indices_data_dir/indices_cache.json)

Loaded from disk on subsequent calls

Cache matched against idc-index-data version and refreshed if version doesn't match

User can force refresh via refresh=True parameter in _discover_available_indices() or by calling refresh_indices_overview()
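The caching behavior listed above can be sketched with a single function. This is an illustration, not the code from commit ee35afb: `load_or_refresh`, the `discover` callable, and the `{"version": ..., "indices": ...}` file layout are assumptions; only the `indices_cache.json` filename and the `refresh` flag are taken from the thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_refresh(
    cache_path: Path,
    version: str,
    discover: Callable[[], dict],
    refresh: bool = False,
) -> dict:
    """Return cached indices if the cache matches `version`, else rediscover and save."""
    if not refresh and cache_path.exists():
        try:
            cached = json.loads(cache_path.read_text())
        except (OSError, json.JSONDecodeError):
            cached = None  # unreadable or corrupt cache: treat as a miss
        if (
            isinstance(cached, dict)
            and cached.get("version") == version
            and "indices" in cached
        ):
            return cached["indices"]
    indices = discover()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps({"version": version, "indices": indices}))
    return indices


calls = []


def fake_discover() -> dict:
    # Stand-in for the network discovery step.
    calls.append("discover")
    return {"sm_index": {"description": "slide microscopy"}}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "indices_cache.json"
    first = load_or_refresh(path, "23.0.1", fake_discover)   # miss: discovers and saves
    second = load_or_refresh(path, "23.0.1", fake_discover)  # hit: served from disk
    third = load_or_refresh(path, "23.0.2", fake_discover)   # version changed: refreshes
    forced = load_or_refresh(path, "23.0.2", fake_discover, refresh=True)
print(len(calls))
```

Storing the version inside the cache file keeps the staleness check to a single comparison, and a corrupt file degrades to a normal refresh instead of an error.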


Copilot · 2025-12-02T19:45:04Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/23.0.1
- Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_get_index_schema_caching tests/idcindex.py::TestIDCClient::test_refresh_indices_overview -v (http block)
- Triggering command: /usr/bin/python3 python3 (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Initial plan

b61636e

Copilot AI assigned Copilot and fedorov Nov 26, 2025

Copilot started work on behalf of fedorov November 26, 2025 21:59 View session

Copilot AI and others added 3 commits November 26, 2025 22:14

Implement automatic discovery of index tables from idc-index-data rel…

8f20347

…eases Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>

Add missing index attribute initializations in __init__

33cdb8f

Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>

Address code review feedback: fix potential recursion and remove unus…

de287b9

…ed variable Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Support discovery of index tables from idc-index-data~~ Support discovery of index tables available from idc-index-data Nov 26, 2025

Copilot AI requested a review from fedorov November 26, 2025 22:25

Copilot finished work on behalf of fedorov November 26, 2025 22:25

fedorov reviewed Nov 26, 2025


Copilot started work on behalf of fedorov November 26, 2025 22:33 View session

Copilot AI requested a review from fedorov November 26, 2025 22:40

Copilot finished work on behalf of fedorov November 26, 2025 22:40

Copilot started work on behalf of fedorov November 26, 2025 22:41 View session

Copilot finished work on behalf of fedorov November 26, 2025 22:46

fedorov force-pushed the copilot/support-discovery-index-tables branch from 9fb9a62 to de287b9 Compare November 27, 2025 22:33

fedorov requested changes Dec 2, 2025


Copilot started work on behalf of fedorov December 2, 2025 18:03 View session

Remove hardcoded known_remote_indices and process all indices includi…

d693536

…ng bundled ones from API Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>

Copilot AI requested a review from fedorov December 2, 2025 18:10

Copilot finished work on behalf of fedorov December 2, 2025 18:10

fedorov added the pre-commit.ci autofix label Dec 2, 2025

pre-commit-ci bot removed the pre-commit.ci autofix label Dec 2, 2025

style: pre-commit fixes

41d7785

fedorov marked this pull request as ready for review December 2, 2025 19:24

fedorov requested changes Dec 2, 2025


Copilot started work on behalf of fedorov December 2, 2025 19:37 View session

Add disk caching for indices and schemas, use cached schemas in get_i…

ee35afb

…ndex_schema Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>

Copilot finished work on behalf of fedorov December 2, 2025 19:45

Copilot AI requested a review from fedorov December 2, 2025 19:45

fedorov added the pre-commit.ci autofix label Dec 2, 2025

pre-commit-ci bot removed the pre-commit.ci autofix label Dec 2, 2025

style: pre-commit fixes

8774cb6
