[Feature][DataLoader] Add catalog for loading tables#454

Merged
robreeves merged 27 commits into linkedin:main from robreeves:planner
Feb 21, 2026

@robreeves robreeves commented Feb 13, 2026

Summary

This creates a Python OpenHouseCatalog that supports loading tables. Authentication and logging are modeled after the Java implementation of the OpenHouse catalog.

I chose to make this inherit from the PyIceberg Catalog so that OpenHouseDataLoader can accept any Iceberg catalog and is not restricted to OpenHouse.

The next step is to have OpenHouseDataLoader use the catalog to load the table and plan the splits. Retries for transient OH server calls will happen in that layer (e.g. retrying on OSError exceptions). Here is an in-progress PR for the next change.
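
As a rough illustration of the retry layer described above, here is a minimal sketch that retries a callable when it raises OSError, with exponential backoff. The helper name `retry_on_oserror`, its parameters, and the backoff policy are assumptions made for this example; they are not taken from the DataLoader code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_oserror(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff when it raises OSError.

    Hypothetical helper: the name, defaults, and backoff policy are assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base, 2*base, 4*base, ... before the next attempt.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller could wrap the table load, for example `retry_on_oserror(lambda: catalog.load_table("db.tbl"))`. Non-OSError exceptions, such as a missing table, are not retried and propagate immediately.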

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

New Features:

  • OpenHouseCatalog - Inherits from PyIceberg Catalog, implements load_table via GET /v1/databases/{database}/tables/{table}. Accepts uri, auth-token, and trust-store properties matching the Java catalog.
  • OpenHouseDataLoader now takes a catalog: Catalog as a required first parameter, making it catalog-agnostic.
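
To make the request shape concrete, the sketch below builds the GET URL and the optional Bearer header from the properties listed above. The function names `table_url` and `auth_headers` are hypothetical, and URL-quoting the path segments is an assumption of this example rather than something the PR states.

```python
from typing import Dict, Optional
from urllib.parse import quote


def table_url(base_uri: str, database: str, table: str) -> str:
    """Build the load-table URL, tolerating a trailing slash on base_uri."""
    return (
        f"{base_uri.rstrip('/')}/v1/databases/"
        f"{quote(database, safe='')}/tables/{quote(table, safe='')}"
    )


def auth_headers(auth_token: Optional[str]) -> Dict[str, str]:
    """Return a Bearer Authorization header only when a token is configured."""
    if auth_token is None:
        return {}
    return {"Authorization": f"Bearer {auth_token}"}
```

For example, `table_url("http://localhost:8000/", "d_e2e", "t_catalog")` yields `http://localhost:8000/v1/databases/d_e2e/tables/t_catalog` (the host and port here are placeholders).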

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

All tests pass via make verify

I also added a new integration test framework for the data loader and made a catalog e2e test. Output:

  $ make integration-test TOKEN_FILE=../../../tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token
  uv run python tests/integration_test_catalog.py ../../../tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token
  Created table d_e2e.t_catalog
  Copied metadata from container to /tmp/d_e2e/t_catalog-87baa9a1-9f15-4c11-98bb-f35c0e70a592/00000-5be7fde5-61d4-4035-9cc0-95bdb0dde147.metadata.json
  load_table returned table with name=('d_e2e', 't_catalog'),
  metadata_location=file:/tmp/d_e2e/t_catalog-87baa9a1-9f15-4c11-98bb-f35c0e70a592/00000-5be7fde5-61d4-4035-9cc0-95bdb0dde147.metadata.json
  load_table correctly raised NoSuchTableError for nonexistent table
  Deleted table d_e2e.t_catalog
  All integration catalog tests passed

Output:

  Checking tables service is up...
    Service is up: {'status': 'UP'}

  Creating table test_db.test_table via REST API...
    Table already exists, fetching it...
    tableLocation: file:/tmp/test_db/test_table-5d385e0f-282c-4c85-b7a6-b086f1851fea/00000-dc25a2ea-5be7-40b9-888d-33a500143d63.metadata.json

  Testing OpenHouseCatalog.load_table('test_db.test_table')...
    Success!
    name: ('test_db', 'test_table')
    metadata_location: file:/tmp/test_db/test_table-5d385e0f-282c-4c85-b7a6-b086f1851fea/00000-dc25a2ea-5be7-40b9-888d-33a500143d63.metadata.json
    schema: table {
    1: id: required string
    2: name: required string
    3: ts: required timestamp
  }

  Testing load of nonexistent table...
    Correctly raised OpenHouseCatalogError: Table no_db.no_table does not exist

  Cleaning up table test_db.test_table...
    Deleted.

  All tests passed!

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

@robreeves robreeves changed the title Add OpenHouseCatalog for Python DataLoader [Feature][DataLoader] Add OpenHouseCatalog Feb 13, 2026
@robreeves robreeves changed the title [Feature][DataLoader] Add OpenHouseCatalog [Feature][DataLoader] Add catalog for loading tables Feb 18, 2026
@robreeves robreeves marked this pull request as ready for review February 18, 2026 01:07
robreeves and others added 16 commits February 19, 2026 08:23
Introduces OpenHouseTableCatalog for loading Iceberg table metadata from
the OpenHouse Tables Service, following the Java OpenHouseCatalog auth
pattern (optional Bearer token and SSL trust store). Adds a TableCatalog
protocol so the DataLoader can accept any PyIceberg Catalog implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DataLoader now takes a TableCatalog as its first argument, making it
catalog-agnostic. Callers provide a pre-built catalog (e.g.
OpenHouseTableCatalog or any PyIceberg Catalog), keeping auth and
catalog-specific config outside the DataLoader.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the TableCatalog protocol with direct inheritance from
PyIceberg's Catalog ABC. This makes OpenHouseTableCatalog interchangeable
with any PyIceberg catalog, using the same (name, **properties) constructor
pattern as the Java OpenHouseCatalog. The DataLoader now types its catalog
parameter as Catalog directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the Java class name and align the class description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace generic exceptions with OpenHouseCatalogError for all failure
cases in load_table: 404, other HTTP errors (with response body), missing
tableLocation, and metadata read failures. Extract tableLocation to a
constant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check metadata, io, and catalog on the returned Table in addition to
name and metadata_location.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove redundant inheritance tests, set response.ok explicitly, make
patching consistent across tests, refactor helper to accept response
overrides, add Content-Type and empty tableLocation edge case tests,
rename bad_metadata test to unreadable_metadata for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace internal mock-based testing (_make_catalog_with_mock_session) with
the responses library for HTTP mocking, testing external behavior instead
of internal implementation details. Extract shared test constants, add a
mock_iceberg_io fixture with documentation explaining why file I/O is
mocked, and organize auth tests into a separate test class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap ValueError from response.json() in OpenHouseCatalogError so callers
get a consistent exception type for all catalog failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
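
The wrapping described in this commit might look roughly like the sketch below, which also covers the missing or empty tableLocation case mentioned in earlier commits. `OpenHouseCatalogError` is redefined here as a stand-in so the example runs on its own, and the function name `extract_table_location` and the message wording are assumptions.

```python
import json

TABLE_LOCATION = "tableLocation"


class OpenHouseCatalogError(Exception):
    """Stand-in for the catalog's error type so this sketch is self-contained."""


def extract_table_location(body: str) -> str:
    """Return tableLocation from a JSON response body, or raise OpenHouseCatalogError."""
    try:
        payload = json.loads(body)
    except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
        raise OpenHouseCatalogError(f"Response is not valid JSON: {e}") from e
    location = payload.get(TABLE_LOCATION) if isinstance(payload, dict) else None
    if not location:  # covers a missing key, None, and an empty string
        raise OpenHouseCatalogError(f"Response is missing '{TABLE_LOCATION}'")
    return location
```

Chaining with `from e` keeps the original decoding error available to callers who want the underlying cause.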
Rename identifier_to_database_and_table to _parse_identifier and return
a TableIdentifier dataclass instead of a raw tuple, reusing the existing
table_identifier module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…error responses

- Add __enter__/__exit__/close() so the catalog session is properly
  closed (similar to AutoCloseable in Java)
- Narrow metadata read catch from Exception to OSError
- Truncate response text in error messages to 500 chars with debug log
- Use context manager in all tests
- Use constants in error match strings and BASE_URL in trailing slash test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
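
The context-manager wiring from this commit can be sketched as below. `ClosableCatalog` is a hypothetical class showing only the close logic; it accepts any object with a `close()` method (such as a `requests.Session`) and is not the actual OpenHouseCatalog.

```python
class ClosableCatalog:
    """Hypothetical minimal class showing only the close/context-manager wiring."""

    def __init__(self, session):
        # `session` is any object with a close() method, e.g. a requests.Session.
        self._session = session

    def close(self) -> None:
        """Release the underlying HTTP session."""
        self._session.close()

    def __enter__(self) -> "ClosableCatalog":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Always close; returning None lets any in-flight exception propagate.
        self.close()
```

With this in place, `with ClosableCatalog(session) as catalog: ...` closes the session on exit, whether the block finishes normally or raises.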
Replace debug log with inline "(truncated, showing x/n characters)"
in the error message itself so users see the context directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
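
The inline truncation note could be produced by a helper along these lines. The function name `format_error_body` and the constant name are assumptions; the 500-character limit and the message wording come from the commit messages above.

```python
MAX_ERROR_BODY_CHARS = 500


def format_error_body(text: str, limit: int = MAX_ERROR_BODY_CHARS) -> str:
    """Return text unchanged if short, else its first `limit` chars plus a note."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]} (truncated, showing {limit}/{len(text)} characters)"
```

Putting the note in the message itself means a user who only sees the exception still knows the body was cut and how long it was.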
…ors propagate

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 4 commits February 19, 2026 10:01
…leIO

Delegates FileIO creation to the Catalog base class's _load_file_io(),
which infers the appropriate implementation from the metadata location's
URI scheme and catalog properties. This enables S3, HDFS, GCS, and other
storage backends without code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from Catalog._load_file_io() to pyiceberg.io.load_file_io() to
avoid depending on a private method from an external library.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…table

Reuse the Catalog base class utility instead of a custom static method,
as suggested in PR review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use named parameters (uri, auth_token, trust_store, timeout_seconds)
instead of a properties dict. Adds a configurable HTTP timeout
defaulting to 30 seconds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 2 commits February 19, 2026 10:38
Use pyiceberg's NoSuchTableError instead of OpenHouseCatalogError for
missing tables, consistent with other catalog implementations and
allowing callers to handle this case specifically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 5 commits February 20, 2026 09:58
More precise name since requests.Session.verify only accepts PEM
certificate bundles or OpenSSL-hashed directories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Forwards additional properties (e.g., py-io-impl for custom FileIO) to
the Catalog base class, enabling custom storage backends without
changing the catalog code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a standalone integration test that exercises OpenHouseCatalog.load_table
against a real OpenHouse instance in Docker, verifying the full path from
HTTP request through metadata fetch to table object construction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves merged commit c7e55a6 into linkedin:main Feb 21, 2026
1 check passed