
Conversation

Contributor

Copilot AI commented Nov 26, 2025

Index tables were hardcoded in the package. This PR enables automatic discovery of available indices from idc-index-data release assets, with descriptions populated from the accompanying JSON schema files.

Changes

  • Automatic discovery (_discover_available_indices): Queries GitHub releases API to dynamically discover all available indices from parquet files in release assets - no hardcoded index list
  • Disk caching: Indices and schemas are cached to disk on first initialization and loaded from cache on subsequent calls. Cache is versioned and automatically refreshed when idc-index-data version changes
  • Schema API (get_index_schema): New method to retrieve full schema (table_description, columns with types/descriptions). Uses in-memory cache populated during discovery instead of re-fetching
  • Refresh capability (refresh_indices_overview): Force re-discovery of available indices and refresh both disk and memory caches
  • Improved fetch_index: Loads already-installed indices from disk instead of just warning; handles missing files gracefully
  • Unified processing: All indices (including bundled ones like index and prior_versions_index) get their descriptions from release asset schema files

Usage

from idc_index import IDCClient

client = IDCClient()

# Indices are now auto-discovered with descriptions from schema files
# On first init: fetches from API and caches to disk
# On subsequent inits: loads from disk cache (unless version changed)
print(client.indices_overview)
# {'index': {'description': 'This is the main metadata table...', 'installed': True, ...},
#  'sm_index': {'description': 'This table contains metadata about slide microscopy...', ...},
#  'collections_index': {...}, 'analysis_results_index': {...}, ...}

# Get full schema for any index (uses cached schemas from discovery)
schema = client.get_index_schema("sm_index")
print(schema["table_description"])
print(schema["columns"])  # [{name, type, mode, description}, ...]

# Force refresh of available indices (clears cache and re-fetches)
client.refresh_indices_overview()

Discovered Indices

The client dynamically discovers all indices from the idc-index-data release assets. When the GitHub API is unavailable, it falls back to the bundled indices (index, prior_versions_index), with descriptions fetched from direct download URLs.

Original prompt

This section details the original issue you should resolve

<issue_title>Support discovery of index tables available from idc-index-data</issue_title>
<issue_description>This python package provides access to a set of tables loaded from parquet. These are listed in https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L128. Two of the tables, index and prior_versions_index, are included in the package and are installed automatically. The remaining indices are hard-coded. We need to switch from the hard-coded list of indices to automatic discovery of those tables based on the attachments for the idc-index-data dependency. Those tables are available as release attachments in parquet files, and are accessible at https://github.com/ImagingDataCommons/idc-index-data/releases/tag/<idc-index-data version> (for example, see https://github.com/ImagingDataCommons/idc-index-data/releases/download/23.0.1/analysis_results_index.parquet). The list of available index tables should be automatically discovered during package initialization and used to populate indices_overview.

Descriptions of the individual index tables should be populated from the schema files that are available from idc-index-data along with the parquet files. For example, analysis_results_index.parquet will be accompanied by analysis_results_index.json sibling attachment at https://github.com/ImagingDataCommons/idc-index-data/releases/download/23.0.1/analysis_results_index.json. Table description is available in the table_description attribute. The entire schema for the individual tables should also be cached, and made available to the user via idc-index API.

Discovered index list should be cached, the function that populates it should have an option to override and refresh the cached content.
</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits November 26, 2025 22:14
…eases

Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>
Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>
…ed variable

Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>
Copilot AI changed the title [WIP] Support discovery of index tables from idc-index-data Support discovery of index tables available from idc-index-data Nov 26, 2025
Copilot AI requested a review from fedorov November 26, 2025 22:25
aws_endpoint_url = "https://s3.amazonaws.com"
gcp_endpoint_url = "https://storage.googleapis.com"
asset_endpoint_url = f"https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}"
github_api_url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
Member

@fedorov fedorov Nov 26, 2025


@copilot Endpoint URL is switched from github.com to api.github.com. Is there a rate limit? Is rate limit applied to both endpoints?

Contributor Author


Yes, unauthenticated access is constrained on both endpoints, though in different ways:

  1. api.github.com (REST API): 60 requests/hour for unauthenticated users
  2. github.com/.../releases/download/ (direct download): No strict rate limit, but subject to abuse detection

The implementation handles this by:

  • First trying the API endpoint to discover all available indices dynamically
  • If API returns 403 (rate limited), falling back to a known list of indices while still fetching descriptions directly from the schema JSON files via the download endpoint

This fallback approach ensures the package works even when rate limited - users get the known indices with descriptions fetched directly rather than failing entirely.
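The fallback described above could be sketched roughly as follows. This is not the code from the PR: `discover_index_names`, `list_release_assets`, and `BUNDLED_INDICES` are hypothetical names, the asset listing is injected as a callable so the sketch makes no network calls itself, and the fallback list is assumed to be the two bundled indices named in the PR description.

```python
from typing import Callable, Iterable
from urllib.error import HTTPError

# Bundled indices named in the PR description; used here only as a fallback.
BUNDLED_INDICES = ("index", "prior_versions_index")


def discover_index_names(
    list_release_assets: Callable[[], Iterable[str]],
) -> list:
    """Return index names from release assets, or bundled names if rate limited.

    `list_release_assets` is a hypothetical callable that returns asset file
    names (e.g. "sm_index.parquet") and raises HTTPError on failure.
    """
    try:
        names = [
            name[: -len(".parquet")]
            for name in list_release_assets()
            if name.endswith(".parquet")
        ]
    except HTTPError as err:
        # GitHub signals rate limiting with 403 or 429; re-raise anything else.
        if err.code not in (403, 429):
            raise
        return list(BUNDLED_INDICES)
    return sorted(names)
```

Re-raising on other status codes keeps genuine failures (for example a 404 for a missing release tag) visible instead of silently degrading to the fallback list.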

@fedorov
Member

fedorov commented Nov 26, 2025

@copilot In idc_index/index.py:29 endpoint is changed from github.com to api.github.com. Are there any request rate limits that are applied with unauthenticated access? Are those limits applied for both endpoints?

@fedorov
Member

fedorov commented Nov 26, 2025

@copilot Please switch back to using direct download endpoints so that rate limits do not apply. There is no real need to use the API endpoint.

Contributor Author

Copilot AI commented Nov 26, 2025

@copilot Please switch back to using direct download endpoints so that rate limits do not apply. There is no real need to use the API endpoint.

Switched to using only direct download endpoints in commit 9fb9a62. Removed the github_api_url variable and simplified _discover_available_indices() to use the asset_endpoint_url for fetching schema JSONs directly without going through the API.
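For illustration, fetching a schema through the direct download endpoint might look like the sketch below. The URL pattern and the `table_description` attribute come from the issue text, and the `index` to `idc_index.json` mapping from a diff hunk in this thread; the function names `schema_url` and `fetch_table_description` and the `timeout` parameter are assumptions, not the PR's actual helpers.

```python
import json
from urllib.request import urlopen

RELEASE_DOWNLOAD_BASE = (
    "https://github.com/ImagingDataCommons/idc-index-data/releases/download"
)


def schema_url(version: str, index_name: str) -> str:
    """Build the direct-download URL of an index's schema JSON."""
    # The review thread shows the main "index" table published as idc_index.json.
    asset_name = "idc_index" if index_name == "index" else index_name
    return f"{RELEASE_DOWNLOAD_BASE}/{version}/{asset_name}.json"


def fetch_table_description(
    version: str, index_name: str, timeout: float = 30.0
) -> str:
    """Download the schema JSON and return its table_description (network call)."""
    with urlopen(schema_url(version, index_name), timeout=timeout) as response:
        schema = json.load(response)
    return schema.get("table_description", "")
```

Note that this approach can only fetch schemas for index names it already knows; it cannot enumerate the assets of a release on its own.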

@fedorov fedorov force-pushed the copilot/support-discovery-index-tables branch from 9fb9a62 to de287b9 Compare November 27, 2025 22:33
"clinical_index",
"collections_index",
"analysis_results_index",
]
Member


Do not hard-code known_remote_indices.

Contributor Author


Removed the hardcoded known_remote_indices list in commit d693536. Indices are now dynamically discovered from the GitHub API by iterating through all parquet assets in the release.
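Iterating over the parquet assets of a release could be sketched like this. It assumes the payload has the shape returned by GitHub's "get a release by tag name" REST endpoint, that is, an `assets` list whose entries carry `name` and `browser_download_url`. The function name `indices_from_release` and the keys of the returned dictionaries are hypothetical.

```python
def indices_from_release(release: dict) -> dict:
    """Map index name -> parquet and schema download URLs from a release payload."""
    urls = {
        asset["name"]: asset["browser_download_url"]
        for asset in release.get("assets", [])
    }
    indices = {}
    for name, url in urls.items():
        if not name.endswith(".parquet"):
            continue
        index_name = name[: -len(".parquet")]
        indices[index_name] = {
            "parquet_url": url,
            # The schema is the sibling .json asset; None if the release lacks one.
            "schema_url": urls.get(f"{index_name}.json"),
        }
    return indices
```

Pairing each parquet file with its sibling JSON in one pass means a newly published index needs no change in the client.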


# Skip bundled indices
if index_name in ("idc_index", "prior_versions_index"):
    continue
Member


Do not skip bundled indices while initializing description and schema from the release assets.

Contributor Author


Fixed in commit d693536. Bundled indices (idc_index, prior_versions_index) are no longer skipped - they are now processed along with all other indices from the API, getting their descriptions and schemas from the release assets.

…ng bundled ones from API

Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 2, 2025

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/23.0.1
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_refresh_indices_overview tests/idcindex.py::TestIDCClient::test_indices_urls -v (http block)
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_get_index_schema_caching tests/idcindex.py::TestIDCClient::test_refresh_indices_overview tests/idcindex.py::TestIDCClient::test_indices_urls -v (http block)
    • Triggering command: /usr/bin/python3 python3 -c from idc_index import IDCClient client = IDCClient() print('Discovered indices:') for name, info in client.indices_overview.items(): print(f' {name}: installed={info["installed"]}, has_description={bool(info["description"])}') (http block)

If you need me to access, download, or install something from one of these locations, you can either:

@fedorov fedorov marked this pull request as ready for review December 2, 2025 19:24
if index_name == "index":
    schema_url = f"{asset_endpoint_url}/idc_index.json"

schema = self._fetch_index_schema_from_url(schema_url)
Member


There is no need to re-fetch schema from the URL, since _discover_available_indices was called in __init__ and the schema should be available for all indices.

Contributor Author


Fixed in commit ee35afb. get_index_schema() now uses schemas cached during _discover_available_indices() instead of re-fetching from URL. Schemas are stored in _index_schemas during discovery and reused.
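The idea of serving schemas from memory after a single discovery pass can be illustrated with a small stand-in class. Only the names `_index_schemas` and `get_index_schema` come from the thread; `SchemaStore`, `discover`, the injected `fetch_schema` callable, and the choice of `ValueError` for unknown names are assumptions made for this sketch.

```python
class SchemaStore:
    """Hypothetical stand-in for the client's schema bookkeeping."""

    def __init__(self, fetch_schema):
        self._fetch_schema = fetch_schema  # callable: index name -> schema dict
        self._index_schemas = {}

    def discover(self, index_names):
        """Fetch every schema once and keep it in memory."""
        for name in index_names:
            self._index_schemas[name] = self._fetch_schema(name)

    def get_index_schema(self, index_name):
        """Return the cached schema without fetching it again."""
        try:
            return self._index_schemas[index_name]
        except KeyError:
            raise ValueError(f"Unknown index: {index_name!r}") from None
```

With this layout, repeated `get_index_schema` calls cost a dictionary lookup, and the fetch count is bounded by the number of discovered indices.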

# ... and check it can be executed
subprocess.check_call([self.s5cmdPath, "--help"], stdout=subprocess.DEVNULL)

def _discover_available_indices(self, refresh: bool = False) -> dict:
Member


Improve this function to add disk caching. On the first init, the schema should be fetched and saved to disk. On subsequent calls, it should be loaded from disk instead of being re-fetched. The cache should be matched against the version of idc-index-data and refreshed if the installed version does not match. The user should have an option to force a cache refresh if needed.

Contributor Author


Implemented disk caching in commit ee35afb. Added _load_indices_cache_from_disk() and _save_indices_cache_to_disk() methods. Schemas are:

  • Fetched and saved to disk on first init (in indices_data_dir/indices_cache.json)
  • Loaded from disk on subsequent calls
  • Cache matched against idc-index-data version and refreshed if version doesn't match
  • User can force refresh via refresh=True parameter in _discover_available_indices() or by calling refresh_indices_overview()
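A version-keyed disk cache along the lines listed above might be sketched as follows. The on-disk layout (a JSON object with `version` and `indices` keys) and the function names are assumptions for illustration; the thread only states that the cache lives in `indices_cache.json` and is matched against the idc-index-data version.

```python
import json
from pathlib import Path


def load_indices_cache(cache_file: Path, installed_version: str):
    """Return cached indices, or None if the cache is missing, corrupt, or stale."""
    try:
        payload = json.loads(cache_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # unreadable file or invalid JSON
        return None
    if not isinstance(payload, dict) or payload.get("version") != installed_version:
        return None
    return payload.get("indices")


def save_indices_cache(cache_file: Path, installed_version: str, indices: dict) -> None:
    """Write the indices together with the version they were discovered for."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    payload = {"version": installed_version, "indices": indices}
    cache_file.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def get_indices(cache_file: Path, installed_version: str, discover, refresh: bool = False):
    """Use the disk cache when valid; otherwise call `discover` and re-save."""
    if not refresh:
        cached = load_indices_cache(cache_file, installed_version)
        if cached is not None:
            return cached
    indices = discover()  # hypothetical callable doing the actual discovery
    save_indices_cache(cache_file, installed_version, indices)
    return indices
```

Treating a corrupt or unreadable cache the same as a missing one means a damaged file costs one extra discovery pass rather than an exception at import time.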

…ndex_schema

Co-authored-by: fedorov <313942+fedorov@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 2, 2025

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/23.0.1
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/idcindex.py::TestIDCClient::test_list_indices tests/idcindex.py::TestIDCClient::test_discovered_indices_have_descriptions tests/idcindex.py::TestIDCClient::test_get_index_schema tests/idcindex.py::TestIDCClient::test_get_index_schema_caching tests/idcindex.py::TestIDCClient::test_refresh_indices_overview -v (http block)
    • Triggering command: /usr/bin/python3 python3 (http block)

If you need me to access, download, or install something from one of these locations, you can either:


Development

Successfully merging this pull request may close these issues.

Support discovery of index tables available from idc-index-data

2 participants