
feat: Add coordinator implementation of CLP connector. #15

Merged
wraymo merged 16 commits into release-0.293-clp-connector from clp_integration-0.293 on Jun 19, 2025
wraymo commented Jun 13, 2025

Description

This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.

This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to presto-native-execution.

Like other connectors, we have created a presto-clp module and implemented all required connector interfaces. The plan optimizer will be added in a future PR.

The important classes in the connector are described below.


Core Classes in Java

ClpConfig

The configuration class for CLP. Currently, we support the following properties:

  • clp.metadata-expire-interval: Defines the time interval after which metadata entries are considered expired and removed from the cache.
  • clp.metadata-refresh-interval: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
  • clp.polymorphic-type-enabled: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
  • clp.metadata-provider-type: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
  • clp.metadata-db-url: The connection URL for the metadata database, used when clp.metadata-provider-type is configured to use a database.
  • clp.metadata-db-name: The name of the metadata database.
  • clp.metadata-db-user: The database user with access to the metadata database.
  • clp.metadata-db-password: The password for the metadata database user.
  • clp.metadata-table-prefix: A prefix applied to table names in the metadata database.
  • clp.split-provider-type: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.
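To make the property list concrete, here is a minimal sketch that assembles a set of catalog properties and renders them in `key=value` form. The keys are the ones listed above; every value (the provider name `mysql`, the JDBC URL, database name, credentials, and prefix) is a placeholder assumption, and the two interval properties are left out because their expected value format is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClpCatalogPropertiesExample {
    // Keys come from the property list above; all values are hypothetical placeholders.
    static Map<String, String> exampleProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("clp.polymorphic-type-enabled", "true");
        props.put("clp.metadata-provider-type", "mysql");              // assumed spelling of the default
        props.put("clp.metadata-db-url", "jdbc:mysql://localhost:3306"); // placeholder URL
        props.put("clp.metadata-db-name", "clp_db");                   // placeholder
        props.put("clp.metadata-db-user", "clp_user");                 // placeholder
        props.put("clp.metadata-db-password", "change-me");            // placeholder
        props.put("clp.metadata-table-prefix", "clp_");                // placeholder
        props.put("clp.split-provider-type", "mysql");                 // assumed spelling of the default
        return props;
    }

    // Renders the map as one key=value pair per line, in insertion order.
    static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(exampleProperties()));
    }
}
```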

ClpSchemaTree

A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., a.b.c), handles name/type conflicts when the clp.polymorphic-type-enabled option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of ClpColumnHandle instances, including RowType for nested structures, making it suitable for dynamic or semi-structured data formats.
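The idea of turning hierarchical column names into a nested structure can be sketched as follows. This is not the ClpSchemaTree code: `SchemaTreeSketch`, `Node`, `addColumn`, and `render` are hypothetical names, types are plain strings rather than Presto types, and a name that is both a leaf and a prefix of another column is not handled.

```java
import java.util.Map;
import java.util.TreeMap;

class SchemaTreeSketch {
    // A node is a leaf carrying a type name, or an inner node with named children.
    static final class Node {
        final Map<String, Node> children = new TreeMap<>();
        String type; // stays null for inner nodes
    }

    // Walks (and creates) one node per dot-separated segment, then sets the leaf type.
    static void addColumn(Node root, String dottedName, String type) {
        Node current = root;
        for (String part : dottedName.split("\\.")) {
            current = current.children.computeIfAbsent(part, k -> new Node());
        }
        current.type = type;
    }

    // Renders leaves as their type name and inner nodes as ROW(child type, ...).
    static String render(Node node) {
        if (node.children.isEmpty()) {
            return node.type;
        }
        StringBuilder sb = new StringBuilder("ROW(");
        boolean first = true;
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(' ').append(render(e.getValue()));
            first = false;
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Node root = new Node();
        addColumn(root, "a.b.c", "bigint");
        addColumn(root, "a.b.d", "varchar");
        addColumn(root, "e", "double");
        for (Map.Entry<String, Node> e : root.children.entrySet()) {
            System.out.println(e.getKey() + ": " + render(e.getValue()));
        }
    }
}
```

Each top-level child of the root corresponds to one column, so `a.b.c` and `a.b.d` collapse into a single column `a` with a nested row type.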

When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a Varstring (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as a_bigint and a_varchar in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.
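The suffixing rule can be illustrated with a small sketch. It assumes the suffix is the Presto type name joined with an underscore, as in the a_bigint / a_varchar example; `PolymorphicNameSketch` and `resolveNames` are hypothetical names, and the connector's actual logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PolymorphicNameSketch {
    // Each input entry is {columnName, prestoTypeName}.
    // Names that occur with more than one type get a "_<type>" suffix; others are kept as-is.
    static List<String> resolveNames(List<String[]> columns) {
        Map<String, Set<String>> typesByName = new LinkedHashMap<>();
        for (String[] column : columns) {
            typesByName.computeIfAbsent(column[0], k -> new LinkedHashSet<>()).add(column[1]);
        }
        List<String> resolved = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : typesByName.entrySet()) {
            if (e.getValue().size() == 1) {
                resolved.add(e.getKey());
            } else {
                for (String type : e.getValue()) {
                    resolved.add(e.getKey() + "_" + type);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String[]> columns = Arrays.asList(
                new String[] {"a", "bigint"},
                new String[] {"a", "varchar"},
                new String[] {"b", "double"});
        System.out.println(resolveNames(columns));
    }
}
```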


ClpMetadataProvider

An interface responsible for retrieving metadata from a specified source.

public interface ClpMetadataProvider {
    List<ClpColumnHandle> listColumnHandles(SchemaTableName schemaTableName);  
    List<ClpTableHandle> listTableNames(String schema);  
}

We provide a default implementation called ClpMySqlMetadataProvider, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named default, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| name | VARCHAR(255) | PRIMARY KEY |
| archive_storage_type | VARCHAR(4096) | NOT NULL |
| archive_storage_directory | VARCHAR(4096) | NOT NULL |

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| name | VARCHAR(512) | NOT NULL |
| type | TINYINT | NOT NULL |

Primary key: (name, type)

ClpSplitProvider

In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The ClpSplitProvider interface, shown below, defines how to retrieve split information from a specified source:

public interface ClpSplitProvider {
    List<ClpSplit> listSplits(ClpTableLayoutHandle clpTableLayoutHandle);  
}

We provide a default implementation called ClpMySqlSplitProvider. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| pagination_id | BIGINT | AUTO_INCREMENT PRIMARY KEY |
| id | VARCHAR(128) | NOT NULL |
| ... | ... | ... |

By concatenating the table path (archive_storage_directory) and the archive ID (id), we can retrieve all split paths for a table.
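A minimal sketch of that concatenation, assuming the directory and the archive ID are joined with a single `/`. `SplitPathSketch` and `splitPaths` are hypothetical names, and the example directory and IDs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitPathSketch {
    // Joins the table's archive_storage_directory with each archive id.
    // The "/" separator is an assumption; a trailing slash on the directory is tolerated.
    static List<String> splitPaths(String archiveStorageDirectory, List<String> archiveIds) {
        String base = archiveStorageDirectory.endsWith("/")
                ? archiveStorageDirectory
                : archiveStorageDirectory + "/";
        List<String> paths = new ArrayList<>();
        for (String id : archiveIds) {
            paths.add(base + id);
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(splitPaths("/var/clp/archives/t1", Arrays.asList("id-1", "id-2")));
    }
}
```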


ClpMetadata

This class is the interface through which Presto accesses the connector's metadata, such as tables and columns. All requests are delegated to ClpMetadataProvider.

For metadata management, it also maintains two caches and periodically refreshes the metadata.

  • columnHandleCache: A LoadingCache<SchemaTableName, List<ClpColumnHandle>> that maps a SchemaTableName to its corresponding list of ClpColumnHandle objects.
  • tableHandleCache: A LoadingCache<String, List<ClpTableHandle>> that maps a schema name (String) to a list of ClpTableHandle objects.
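To illustrate the caching behaviour without pulling in a library, here is a standard-library stand-in for a loading cache with time-based expiry. It is a simplification, not the connector's code: `ExpiringCacheSketch` is a hypothetical class, it models only expiry (in the spirit of clp.metadata-expire-interval) and not the periodic background refresh, and the clock is injected so the behaviour can be checked deterministically.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

class ExpiringCacheSketch<K, V> {
    // A cached value together with the time at which it was loaded.
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // stands in for a call to the metadata provider
    private final long expireAfterMillis;
    private final LongSupplier clock;      // injected time source, in milliseconds

    ExpiringCacheSketch(Function<K, V> loader, long expireAfterMillis, LongSupplier clock) {
        this.loader = loader;
        this.expireAfterMillis = expireAfterMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it if it is missing or older than the expiry interval.
    V get(K key) {
        final long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old == null || now - old.loadedAtMillis >= expireAfterMillis) {
                return new Entry<V>(loader.apply(k), now);
            }
            return old;
        });
        return entry.value;
    }

    public static void main(String[] args) {
        ExpiringCacheSketch<String, String> cache = new ExpiringCacheSketch<>(
                schema -> "tables of " + schema, 1000L, System::currentTimeMillis);
        System.out.println(cache.get("default"));
    }
}
```

In the connector itself, the two caches are keyed by SchemaTableName and by schema name respectively, with the loader delegating to ClpMetadataProvider.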

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • All unit tests passed successfully.
  • End-to-end tests verified, including Velox workers and Prestissimo protocol integration.

Summary by CodeRabbit

  • New Features

    • Introduced the CLP Connector for querying CLP-S archives in Presto with support for MySQL metadata and split providers.
    • Added support for nested and polymorphic column types, mapping CLP data types to Presto types including complex ROW structures.
    • Provided extensive configuration options for metadata refresh, expiration intervals, polymorphic type support, and provider selection.
    • Integrated lifecycle management and seamless plugin framework support within Presto.
  • Tests

    • Added tests covering metadata and split provider functionality, validating schema, table, and split retrieval.
    • Included utilities for setting up and tearing down H2-based test metadata databases.
  • Documentation

    • Delivered comprehensive CLP Connector documentation detailing setup, configuration, data type mappings, and usage.
    • Updated connector index to include the new CLP Connector.
  • Chores

    • Updated build and provisioning configurations to incorporate the new CLP Connector module and plugin packaging.

3 participants