[Feature][DataLoader] Add catalog for loading tables #454
Merged
robreeves merged 27 commits into linkedin:main on Feb 21, 2026
Introduces OpenHouseTableCatalog for loading Iceberg table metadata from the OpenHouse Tables Service, following the Java OpenHouseCatalog auth pattern (optional Bearer token and SSL trust store). Adds a TableCatalog protocol so the DataLoader can accept any PyIceberg Catalog implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DataLoader now takes a TableCatalog as its first argument, making it catalog-agnostic. Callers provide a pre-built catalog (e.g. OpenHouseTableCatalog or any PyIceberg Catalog), keeping auth and catalog-specific config outside the DataLoader. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the TableCatalog protocol with direct inheritance from PyIceberg's Catalog ABC. This makes OpenHouseTableCatalog interchangeable with any PyIceberg catalog, using the same (name, **properties) constructor pattern as the Java OpenHouseCatalog. The DataLoader now types its catalog parameter as Catalog directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the Java class name and align the class description. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace generic exceptions with OpenHouseCatalogError for all failure cases in load_table: 404, other HTTP errors (with response body), missing tableLocation, and metadata read failures. Extract tableLocation to a constant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check metadata, io, and catalog on the returned Table in addition to name and metadata_location. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove redundant inheritance tests, set response.ok explicitly, make patching consistent across tests, refactor helper to accept response overrides, add Content-Type and empty tableLocation edge case tests, rename bad_metadata test to unreadable_metadata for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace internal mock-based testing (_make_catalog_with_mock_session) with the responses library for HTTP mocking, testing external behavior instead of internal implementation details. Extract shared test constants, add a mock_iceberg_io fixture with documentation explaining why file I/O is mocked, and organize auth tests into a separate test class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap ValueError from response.json() in OpenHouseCatalogError so callers get a consistent exception type for all catalog failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
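A minimal sketch of the wrapping described in this commit, assuming the body is parsed with the standard `json` module rather than `response.json()`. `OpenHouseCatalogError` here is a local stand-in for the PR's exception type, and `parse_table_response` is a hypothetical helper name, not code from the PR.

```python
import json


class OpenHouseCatalogError(Exception):
    """Local stand-in for the exception type named in this PR."""


def parse_table_response(body: str) -> dict:
    """Parse a response body, converting JSON errors to the catalog error type.

    Hypothetical helper: the PR wraps the ValueError raised by
    response.json(); json.loads is used here so the sketch needs no HTTP stack.
    """
    try:
        return json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        # Chain the original error so the parse failure stays visible to callers.
        raise OpenHouseCatalogError(f"Response is not valid JSON: {exc}") from exc
```

Chaining with `from exc` keeps the original parse error available on `__cause__` while callers only need to catch one exception type.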
Rename identifier_to_database_and_table to _parse_identifier and return a TableIdentifier dataclass instead of a raw tuple, reusing the existing table_identifier module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…error responses
- Add __enter__/__exit__/close() so the catalog session is properly closed (similar to AutoCloseable in Java)
- Narrow metadata read catch from Exception to OSError
- Truncate response text in error messages to 500 chars with debug log
- Use context manager in all tests
- Use constants in error match strings and BASE_URL in trailing slash test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
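The `__enter__`/`__exit__`/`close()` pattern mentioned in this commit could look roughly like the sketch below. `FakeSession` and `ClosableCatalog` are hypothetical names used only for illustration; the real catalog holds an HTTP session, which is replaced here by a tiny stand-in so the example runs without third-party packages.

```python
class FakeSession:
    """Stand-in for an HTTP session that records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ClosableCatalog:
    """Illustrative catalog-like object that owns a session and closes it."""

    def __init__(self, session: FakeSession) -> None:
        self._session = session

    def close(self) -> None:
        # Release the underlying session; safe to call more than once here.
        self._session.close()

    def __enter__(self) -> "ClosableCatalog":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Always close, and return None so exceptions are not suppressed.
        self.close()
```

Used as `with ClosableCatalog(session) as catalog: ...`, the session is closed when the block exits, including when the block raises.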
Replace debug log with inline "(truncated, showing x/n characters)" in the error message itself so users see the context directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
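To make the inline truncation note concrete, here is a small sketch. The 500-character limit and the "(truncated, showing x/n characters)" wording come from the commit messages; the function name `truncate_response_text` and the exact placement of the note after the text are assumptions.

```python
MAX_ERROR_BODY_CHARS = 500  # limit stated in the commit messages


def truncate_response_text(text: str, limit: int = MAX_ERROR_BODY_CHARS) -> str:
    """Return text unchanged if short, else its first `limit` characters plus a note.

    Hypothetical helper: the note format follows the commit message, but
    where the PR places it in the final error string is not shown here.
    """
    if len(text) <= limit:
        return text
    return f"{text[:limit]} (truncated, showing {limit}/{len(text)} characters)"
```

Putting the note in the message itself means a caller who only sees the raised exception still knows the body was cut and how long it originally was.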
…ors propagate Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…leIO Delegates FileIO creation to the Catalog base class's _load_file_io(), which infers the appropriate implementation from the metadata location's URI scheme and catalog properties. This enables S3, HDFS, GCS, and other storage backends without code changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from Catalog._load_file_io() to pyiceberg.io.load_file_io() to avoid depending on a private method from an external library. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…table Reuse the Catalog base class utility instead of a custom static method, as suggested in PR review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use named parameters (uri, auth_token, trust_store, timeout_seconds) instead of a properties dict. Adds a configurable HTTP timeout defaulting to 30 seconds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use pyiceberg's NoSuchTableError instead of OpenHouseCatalogError for missing tables, consistent with other catalog implementations and allowing callers to handle this case specifically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
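The split between a missing table and other HTTP failures could be sketched as follows. Both exception classes are local stand-ins (PyIceberg provides `NoSuchTableError` in `pyiceberg.exceptions`; `OpenHouseCatalogError` is the PR's own type), and `check_table_response` with its message wording is a hypothetical helper, not the PR's code.

```python
class NoSuchTableError(Exception):
    """Local stand-in for pyiceberg.exceptions.NoSuchTableError."""


class OpenHouseCatalogError(Exception):
    """Local stand-in for the exception type named in this PR."""


def check_table_response(status_code: int, body: str, identifier: str) -> None:
    """Raise a specific error for 404 and a general catalog error for other failures."""
    if status_code == 404:
        # Missing tables get their own type so callers can handle them separately.
        raise NoSuchTableError(f"Table does not exist: {identifier}")
    if status_code >= 400:
        raise OpenHouseCatalogError(
            f"Failed to load table {identifier}: HTTP {status_code}: {body}"
        )
```

A caller can then write `except NoSuchTableError:` for the "table not found" case and let every other catalog failure propagate as `OpenHouseCatalogError`.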
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cbb330 reviewed Feb 20, 2026
More precise name since requests.Session.verify only accepts PEM certificate bundles or OpenSSL-hashed directories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Forwards additional properties (e.g., py-io-impl for custom FileIO) to the Catalog base class, enabling custom storage backends without changing the catalog code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a standalone integration test that exercises OpenHouseCatalog.load_table against a real OpenHouse instance in Docker, verifying the full path from HTTP request through metadata fetch to table object construction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ShreyeshArangath previously approved these changes Feb 20, 2026
cbb330 approved these changes Feb 21, 2026
Summary
This creates a Python OpenHouseCatalog that supports loading tables. Authentication and logging are modeled after the Java implementation of the OpenHouse catalog.
I chose to make this inherit from the Iceberg catalog so that OpenHouseDataLoader can accept any Iceberg catalog and is not restricted to OpenHouse only.
The next step is to have OpenHouseDataLoader use the catalog to load the table and plan the splits. Retries for transient OH server calls will happen in that layer (e.g. retry for OSError exceptions). Here is an in progress PR for the next change.
Changes
New Features:
- OpenHouseCatalog - Inherits from PyIceberg Catalog, implements load_table via GET /v1/databases/{database}/tables/{table}. Accepts uri, auth-token, and trust-store properties matching the Java catalog.
- OpenHouseDataLoader now takes a catalog: Catalog as a required first parameter, making it catalog-agnostic.
Testing Done
All tests pass via make verify.
I also added a new integration test framework for data loader and made a catalog e2e test. Output:
Additional Information
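As a rough illustration of the load_table request path listed under Changes, here is a minimal sketch of how the GET URL could be built from the catalog uri. The function name `table_url`, the stripping of a trailing slash, and the percent-encoding of the path segments are assumptions for illustration, not taken from the PR.

```python
from urllib.parse import quote


def table_url(base_uri: str, database: str, table: str) -> str:
    """Build the Tables Service URL for one table.

    Hypothetical helper: follows the GET /v1/databases/{database}/tables/{table}
    path from the PR description. Encoding the segments is an assumption.
    """
    base = base_uri.rstrip("/")  # tolerate a trailing slash on the configured uri
    db_segment = quote(database, safe="")
    table_segment = quote(table, safe="")
    return f"{base}/v1/databases/{db_segment}/tables/{table_segment}"
```

For example, `table_url("http://localhost:8000/", "db", "tbl")` yields the same URL whether or not the base uri ends in a slash.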