feat: Add coordinator implementation of CLP connector. by wraymo · Pull Request #15 · y-scope/presto

wraymo · 2025-06-13T02:55:38Z

Description

This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.

This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to presto-native-execution.

Like other connectors, we have created a presto-clp module and implemented all required connector interfaces. The plan optimizer will be added in a future PR.

The important classes in the connector are described below.

Core Classes in Java

`ClpConfig`

The configuration class for CLP. Currently, we support the following properties:

- `clp.metadata-expire-interval`: Defines the time interval after which metadata entries are considered expired and removed from the cache.
- `clp.metadata-refresh-interval`: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
- `clp.polymorphic-type-enabled`: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
- `clp.metadata-provider-type`: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
- `clp.metadata-db-url`: The connection URL for the metadata database, used when `clp.metadata-provider-type` is configured to use a database.
- `clp.metadata-db-name`: The name of the metadata database.
- `clp.metadata-db-user`: The database user with access to the metadata database.
- `clp.metadata-db-password`: The password for the metadata database user.
- `clp.metadata-table-prefix`: A prefix applied to table names in the metadata database.
- `clp.split-provider-type`: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.
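
As a rough illustration of how these properties fit together, the sketch below parses a catalog-style snippet with `java.util.Properties`. The property keys come from the list above; every value (URL, database name, credentials, prefix, and the `mysql` provider string) is a placeholder assumed for the example, and the class `ClpCatalogPropertiesExample` is hypothetical. The connector itself reads these keys through `ClpConfig`, not through code like this.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ClpCatalogPropertiesExample {
    // Placeholder values for illustration only; they are not defaults from this PR.
    static final String SAMPLE = String.join("\n",
            "clp.metadata-provider-type=mysql",
            "clp.metadata-db-url=jdbc:mysql://localhost:3306",
            "clp.metadata-db-name=clp_db",
            "clp.metadata-db-user=clp_user",
            "clp.metadata-db-password=change-me",
            "clp.metadata-table-prefix=clp_",
            "clp.split-provider-type=mysql",
            "clp.polymorphic-type-enabled=true");

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Treats a missing key as disabled; the connector's actual default may differ.
    static boolean polymorphicTypesEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty("clp.polymorphic-type-enabled", "false"));
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println(props.getProperty("clp.metadata-db-url"));
        System.out.println(polymorphicTypesEnabled(props));
    }
}
```

The interval properties are left out of the sample because their value format is not stated here.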

`ClpSchemaTree`

A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., a.b.c), handles name/type conflicts when the clp.polymorphic-type-enabled option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of ClpColumnHandle instances, including RowType for nested structures, making it suitable for dynamic or semi-structured data formats.
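
To make the idea of hierarchical column names concrete, here is a minimal sketch of turning dotted names into a nested structure. It is not the code in `ClpSchemaTree`: the class `SchemaTreeSketch`, its methods, and the use of plain type-name strings in place of Presto `Type` objects are all assumptions made for illustration, and it ignores name/type conflicts entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaTreeSketch {
    // A node is either a leaf carrying a type name or an inner node with children.
    static final class Node {
        String type; // stays null for inner nodes
        final Map<String, Node> children = new LinkedHashMap<>();
    }

    private final Node root = new Node();

    // Adds a column such as "a.b.c" with a type name such as "bigint".
    void addColumn(String dottedName, String type) {
        Node current = root;
        for (String part : dottedName.split("\\.")) {
            current = current.children.computeIfAbsent(part, k -> new Node());
        }
        current.type = type;
    }

    // Renders each top-level column as "name type", using row(...) for nested fields.
    List<String> topLevelColumns() {
        List<String> columns = new ArrayList<>();
        for (Map.Entry<String, Node> e : root.children.entrySet()) {
            columns.add(e.getKey() + " " + render(e.getValue()));
        }
        return columns;
    }

    private static String render(Node node) {
        if (node.children.isEmpty()) {
            return node.type;
        }
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            fields.add(e.getKey() + " " + render(e.getValue()));
        }
        return "row(" + String.join(", ", fields) + ")";
    }

    public static void main(String[] args) {
        SchemaTreeSketch tree = new SchemaTreeSketch();
        tree.addColumn("a.b.c", "bigint");
        tree.addColumn("a.d", "varchar");
        tree.addColumn("e", "double");
        // prints [a row(b row(c bigint), d varchar), e double]
        System.out.println(tree.topLevelColumns());
    }
}
```

Only the top-level names become columns; everything beneath them is folded into a row type, which mirrors the flat list of handles with `RowType` described above.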

When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a Varstring (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as a_bigint and a_varchar in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.
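
The renaming rule can be sketched as follows. This is an illustrative reading of the paragraph above, not the connector's implementation: `PolymorphicNameSketch` and `resolve` are hypothetical names, and columns are modeled as simple (name, Presto type name) pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PolymorphicNameSketch {
    // Input: {column name, Presto type name} pairs in schema order.
    // Output: column names, with "_<type>" appended only where one name maps to several types.
    static List<String> resolve(List<String[]> columns) {
        Map<String, Set<String>> typesByName = new LinkedHashMap<>();
        for (String[] column : columns) {
            typesByName.computeIfAbsent(column[0], k -> new LinkedHashSet<>()).add(column[1]);
        }
        List<String> resolved = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : typesByName.entrySet()) {
            boolean conflict = e.getValue().size() > 1;
            for (String type : e.getValue()) {
                resolved.add(conflict ? e.getKey() + "_" + type : e.getKey());
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String[]> columns = new ArrayList<>();
        columns.add(new String[] {"a", "bigint"});
        columns.add(new String[] {"a", "varchar"});
        columns.add(new String[] {"b", "double"});
        // prints [a_bigint, a_varchar, b]
        System.out.println(resolve(columns));
    }
}
```

Names without a conflict are left unchanged, so only the ambiguous fields pay the cost of a longer name.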

`ClpMetadataProvider`

An interface responsible for retrieving metadata from a specified source.

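public interface ClpMetadataProvider {
    // Returns the column handles for the given table.
    List<ClpColumnHandle> listColumnHandles(SchemaTableName schemaTableName);

    // Returns the table handles that belong to the given schema.
    List<ClpTableHandle> listTableNames(String schema);
}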

We provide a default implementation called ClpMySqlMetadataProvider, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named default, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| `name` | `VARCHAR(255)` | `PRIMARY KEY` |
| `archive_storage_type` | `VARCHAR(4096)` | `NOT NULL` |
| `archive_storage_directory` | `VARCHAR(4096)` | `NOT NULL` |
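
A hedged sketch of how table names might be read from this table over JDBC is shown below. It assumes the physical table is named by prepending the `clp.metadata-table-prefix` value to `datasets`, which is not confirmed here; `DatasetsQuerySketch` and its methods are hypothetical and are not the code in `ClpMySqlMetadataProvider`.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class DatasetsQuerySketch {
    // Assumption: the datasets table is named "<prefix>datasets".
    static String listTablesQuery(String tablePrefix) {
        // The prefix is spliced into SQL, so restrict it to identifier characters.
        if (!tablePrefix.matches("[A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Unexpected characters in prefix: " + tablePrefix);
        }
        return "SELECT name, archive_storage_directory FROM " + tablePrefix + "datasets";
    }

    // Reads every dataset name; each one would surface as a table in the "default" schema.
    static List<String> listTableNames(Connection connection, String tablePrefix) throws SQLException {
        List<String> names = new ArrayList<>();
        try (Statement statement = connection.createStatement();
                ResultSet rows = statement.executeQuery(listTablesQuery(tablePrefix))) {
            while (rows.next()) {
                names.add(rows.getString("name"));
            }
        }
        return names;
    }
}
```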

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| `name` | `VARCHAR(512)` | `NOT NULL` |
| `type` | `TINYINT` | `NOT NULL` |
| Primary Key | (`name`, `type`) | |

`ClpSplitProvider`

In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The ClpSplitProvider interface, shown below, defines how to retrieve split information from a specified source:

public interface ClpSplitProvider {
    // Returns one split per archive for the table described by the layout handle.
    List<ClpSplit> listSplits(ClpTableLayoutHandle clpTableLayoutHandle);
}

We provide a default implementation called ClpMySqlSplitProvider. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

| Column Name | Data Type | Constraints |
| --- | --- | --- |
| `pagination_id` | `BIGINT` | `AUTO_INCREMENT PRIMARY KEY` |
| `id` | `VARCHAR(128)` | `NOT NULL` |
| ... | ... | ... |

By concatenating the table path (archive_storage_directory) and the archive ID (id), we can retrieve all split paths for a table.
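
The concatenation step could look roughly like this. The sketch assumes a `/` separator between the directory and the archive ID, which is not stated here, and `SplitPathSketch` is a hypothetical helper, not part of `ClpMySqlSplitProvider`.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPathSketch {
    // Builds one path per archive ID under the table's archive_storage_directory.
    static List<String> splitPaths(String archiveStorageDirectory, List<String> archiveIds) {
        // Normalize so exactly one "/" separates the directory and the ID (assumed separator).
        String base = archiveStorageDirectory.endsWith("/")
                ? archiveStorageDirectory
                : archiveStorageDirectory + "/";
        List<String> paths = new ArrayList<>();
        for (String id : archiveIds) {
            paths.add(base + id);
        }
        return paths;
    }
}
```

Each resulting path would back one split, so the number of splits for a table equals the number of archive rows returned for it.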

`ClpMetadata`

This interface enables Presto to access various metadata. All requests are delegated to ClpMetadataProvider.

For metadata management, it also maintains two caches and periodically refreshes the metadata.

- `columnHandleCache`: A `LoadingCache<SchemaTableName, List<ClpColumnHandle>>` that maps a `SchemaTableName` to its corresponding list of `ClpColumnHandle` objects.
- `tableHandleCache`: A `LoadingCache<String, List<ClpTableHandle>>` that maps a schema name (`String`) to a list of `ClpTableHandle` objects.
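
To show the caching behavior in isolation, the sketch below models a load-on-miss cache whose entries expire after a fixed interval, in the spirit of `clp.metadata-expire-interval`. It is a simplified standard-library stand-in, not the `LoadingCache` used by `ClpMetadata`, and it does not model periodic refresh. The class `ExpiringCacheSketch` and the injected clock are assumptions made so the example stays self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

public class ExpiringCacheSketch<K, V> {
    // A cached value together with the time it was loaded.
    private static final class CacheEntry<T> {
        final T value;
        final long loadedAtMillis;

        CacheEntry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, CacheEntry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final long expireAfterMillis;
    private final LongSupplier clock;

    ExpiringCacheSketch(Function<K, V> loader, long expireAfterMillis, LongSupplier clock) {
        this.loader = loader;
        this.expireAfterMillis = expireAfterMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it when it is missing or older than the expiry interval.
    V get(K key) {
        long now = clock.getAsLong();
        CacheEntry<V> entry = entries.compute(key, (k, old) ->
                (old == null || now - old.loadedAtMillis >= expireAfterMillis)
                        ? new CacheEntry<>(loader.apply(k), now)
                        : old);
        return entry.value;
    }
}
```

In this picture, the loader for a table-handle cache would call something like `listTableNames(schema)` on the metadata provider, so repeated lookups within the expiry window never reach the metadata database.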

Checklist

- The PR satisfies the contribution guidelines.
- This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
- Necessary docs have been updated, OR no docs need to be updated.

Validation performed

All unit tests passed successfully.
End-to-end tests verified, including Velox workers and Prestissimo protocol integration.

Summary by CodeRabbit

New Features
- Introduced the CLP Connector for querying CLP-S archives in Presto with support for MySQL metadata and split providers.
- Added support for nested and polymorphic column types, mapping CLP data types to Presto types including complex ROW structures.
- Provided extensive configuration options for metadata refresh, expiration intervals, polymorphic type support, and provider selection.
- Integrated lifecycle management and seamless plugin framework support within Presto.
Tests
- Added tests covering metadata and split provider functionality, validating schema, table, and split retrieval.
- Included utilities for setting up and tearing down H2-based test metadata databases.
Documentation
- Delivered comprehensive CLP Connector documentation detailing setup, configuration, data type mappings, and usage.
- Updated connector index to include the new CLP Connector.
Chores
- Updated build and provisioning configurations to incorporate the new CLP Connector module and plugin packaging.

wraymo merged 16 commits into release-0.293-clp-connector from clp_integration-0.293