feat: Add coordinator implementation of CLP connector. #15
Merged
Description
This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.
This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to `presto-native-execution`.
Like other connectors, we have created a `presto-clp` module and implemented all required connector interfaces. The plan optimizer will be added in a future PR.
The important classes in the connector are described below.
Core Classes in Java
ClpConfig
The configuration class for CLP. Currently, we support the following properties:
- `clp.metadata-expire-interval`: Defines the time interval after which metadata entries are considered expired and removed from the cache.
- `clp.metadata-refresh-interval`: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
- `clp.polymorphic-type-enabled`: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
- `clp.metadata-provider-type`: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
- `clp.metadata-db-url`: The connection URL for the metadata database, used when `clp.metadata-provider-type` is configured to use a database.
- `clp.metadata-db-name`: The name of the metadata database.
- `clp.metadata-db-user`: The database user with access to the metadata database.
- `clp.metadata-db-password`: The password for the metadata database user.
- `clp.metadata-table-prefix`: A prefix applied to table names in the metadata database.
- `clp.split-provider-type`: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.
ClpSchemaTree
A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., `a.b.c`), handles name/type conflicts when the `clp.polymorphic-type-enabled` option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of `ClpColumnHandle` instances, including `RowType` for nested structures, making it suitable for dynamic or semi-structured data formats.
When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a `Varstring` (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as `a_bigint` and `a_varchar` in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.
ClpMetadataProvider
An interface responsible for retrieving metadata from a specified source.
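As a rough illustration of what a metadata-retrieval interface of this kind might look like, the sketch below defines a hypothetical `MetadataProviderSketch` with two lookups (table names per schema, column names per table) and an in-memory implementation. The interface name, the method names, and the use of plain strings instead of the connector's handle classes are all assumptions made for this example; they are not the connector's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: names and signatures are illustrative only.
interface MetadataProviderSketch {
    List<String> listTableNames(String schemaName);

    List<String> listColumnNames(String schemaName, String tableName);
}

// In-memory stand-in for a metadata source such as a database.
final class InMemoryMetadataProvider implements MetadataProviderSketch {
    // schema name -> (table name -> column names), in insertion order
    private final Map<String, Map<String, List<String>>> columnsByTableBySchema =
            new LinkedHashMap<>();

    void addTable(String schemaName, String tableName, List<String> columnNames) {
        columnsByTableBySchema
                .computeIfAbsent(schemaName, k -> new LinkedHashMap<>())
                .put(tableName, new ArrayList<>(columnNames));
    }

    @Override
    public List<String> listTableNames(String schemaName) {
        Map<String, List<String>> tables = columnsByTableBySchema.get(schemaName);
        // Unknown schemas yield an empty list rather than an error.
        return tables == null ? new ArrayList<>() : new ArrayList<>(tables.keySet());
    }

    @Override
    public List<String> listColumnNames(String schemaName, String tableName) {
        Map<String, List<String>> tables = columnsByTableBySchema.get(schemaName);
        if (tables == null || !tables.containsKey(tableName)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(tables.get(tableName));
    }
}

class MetadataProviderDemo {
    public static void main(String[] args) {
        InMemoryMetadataProvider provider = new InMemoryMetadataProvider();
        // "logs" is a made-up table; the column names reuse the example above.
        provider.addTable("default", "logs", Arrays.asList("a_bigint", "a_varchar"));
        System.out.println(provider.listTableNames("default"));
        System.out.println(provider.listColumnNames("default", "logs"));
    }
}
```

Keeping the lookups behind an interface means a different metadata source can be swapped in without touching the code that consumes the results.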
We provide a default implementation called `ClpMySqlMetadataProvider`, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named `default`, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

| Column | Type | Constraints |
| --- | --- | --- |
| `name` | `VARCHAR(255)` | `PRIMARY KEY` |
| `archive_storage_type` | `VARCHAR(4096)` | `NOT NULL` |
| `archive_storage_directory` | `VARCHAR(4096)` | `NOT NULL` |

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

| Column | Type | Constraints |
| --- | --- | --- |
| `name` | `VARCHAR(512)` | `NOT NULL` |
| `type` | `TINYINT` | `NOT NULL` |

Key: (`name`, `type`)
ClpSplitProvider
In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The `ClpSplitProvider` interface defines how to retrieve split information from a specified source.
We provide a default implementation called `ClpMySqlSplitProvider`. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

| Column | Type | Constraints |
| --- | --- | --- |
| `pagination_id` | `BIGINT` | `AUTO_INCREMENT PRIMARY KEY` |
| `id` | `VARCHAR(128)` | `NOT NULL` |

By concatenating the table path (`archive_storage_directory`) and the archive ID (`id`), we can retrieve all split paths for a table.
ClpMetadata
This interface enables Presto to access various metadata. All requests are delegated to `ClpMetadataProvider`. For metadata management, it also maintains two caches and periodically refreshes the metadata.
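To make the caching idea concrete, here is a minimal sketch of a loading cache that reloads an entry once it is older than an expiry interval. It uses only the Java standard library and is a simplified stand-in, not the connector's code: the class name `ExpiringLoadingCache`, the injectable clock, and the expire-on-read policy are assumptions for illustration, and background refresh (as configured by `clp.metadata-refresh-interval`) is not modelled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical, simplified loading cache: entries are reloaded on read
// once they are at least expireAfterMillis old.
final class ExpiringLoadingCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> loader;
    private final long expireAfterMillis;
    private final LongSupplier clock; // injectable so the demo is deterministic

    ExpiringLoadingCache(Function<K, V> loader, long expireAfterMillis, LongSupplier clock) {
        this.loader = loader;
        this.expireAfterMillis = expireAfterMillis;
        this.clock = clock;
    }

    synchronized V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        // Load on first access, or when the cached entry has expired.
        if (entry == null || now - entry.loadedAtMillis >= expireAfterMillis) {
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }
}

class CacheDemo {
    public static void main(String[] args) {
        long[] now = {0L};   // fake clock, advanced manually
        int[] loads = {0};   // counts calls to the loader
        ExpiringLoadingCache<String, String> cache = new ExpiringLoadingCache<>(
                schema -> {
                    loads[0]++;
                    return "tables of " + schema;
                },
                1000L,
                () -> now[0]);

        cache.get("default");   // first access: loads
        now[0] = 500L;
        cache.get("default");   // still fresh: served from the cache
        now[0] = 1000L;
        cache.get("default");   // expired: loads again
        System.out.println(loads[0]); // prints 2
    }
}
```

In this sketch the loader plays the role of the call that is delegated to the metadata provider, so repeated metadata requests within the expiry window do not reach the backing store.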
- `columnHandleCache`: A `LoadingCache<SchemaTableName, List<ClpColumnHandle>>` that maps a `SchemaTableName` to its corresponding list of `ClpColumnHandle` objects.
- `tableHandleCache`: A `LoadingCache<String, List<ClpTableHandle>>` that maps a schema name (`String`) to a list of `ClpTableHandle` objects.
Checklist
breaking change.
Validation performed