Clp integration fix #6

Closed
wraymo wants to merge 7 commits into master from clp_integration_fix

@wraymo commented May 9, 2025

Description

This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.

This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to presto-native-execution.

Like other connectors, we have created a presto-clp module and implemented all required connector interfaces, as well as a few extras to support query pushdown.

The important classes in the connector are described below.


Core Classes in Java

ClpConfig

The configuration class for CLP. Currently, we support the following properties:

  • clp.metadata-expire-interval: Defines the time interval after which metadata entries are considered expired and removed from the cache.
  • clp.metadata-refresh-interval: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
  • clp.polymorphic-type-enabled: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
  • clp.metadata-provider-type: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
  • clp.metadata-db-url: The connection URL for the metadata database, used when clp.metadata-provider-type is configured to use a database.
  • clp.metadata-db-name: The name of the metadata database.
  • clp.metadata-db-user: The database user with access to the metadata database.
  • clp.metadata-db-password: The password for the metadata database user.
  • clp.metadata-table-prefix: A prefix applied to table names in the metadata database.
  • clp.split-provider-type: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.

ClpSchemaTree

A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., a.b.c), handles name/type conflicts when the clp.polymorphic-type-enabled option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of ClpColumnHandle instances, including RowType for nested structures, making it suitable for dynamic or semi-structured data formats.

When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a Varstring (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as a_bigint and a_varchar in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.
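The renaming rule can be sketched as follows. This is only an illustration of the idea, not the actual ClpSchemaTree code: the class name PolymorphicNameSketch, the resolve method, and the representation of a column as a {name, prestoTypeName} pair are all hypothetical, and the suffix is assumed to be the lower-case Presto type name, as in the a_bigint / a_varchar example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of the presto-clp module.
final class PolymorphicNameSketch {
    // Each input element is a {columnName, prestoTypeName} pair.
    // Returns the column names Presto would see, in first-seen order.
    static List<String> resolve(List<String[]> columns) {
        // Collect the distinct types observed for every column name.
        Map<String, Set<String>> typesByName = new LinkedHashMap<>();
        for (String[] column : columns) {
            typesByName.computeIfAbsent(column[0], k -> new LinkedHashSet<>()).add(column[1]);
        }
        List<String> resolved = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : typesByName.entrySet()) {
            if (entry.getValue().size() == 1) {
                // No conflict: keep the original name.
                resolved.add(entry.getKey());
            } else {
                // Conflict: append an assumed type-specific suffix to each variant.
                for (String type : entry.getValue()) {
                    resolved.add(entry.getKey() + "_" + type);
                }
            }
        }
        return resolved;
    }
}
```

With the inputs (a, bigint), (a, varchar), and (b, double), the sketch yields a_bigint, a_varchar, and b.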


ClpMetadataProvider

An interface responsible for retrieving metadata from a specified source.

public interface ClpMetadataProvider {
    List<ClpColumnHandle> listColumnHandles(SchemaTableName schemaTableName);  
    List<ClpTableHandle> listTableNames(String schema);  
}

We provide a default implementation called ClpMySqlMetadataProvider, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named default, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

| Column Name | Data Type | Constraints |
|---|---|---|
| name | VARCHAR(255) | PRIMARY KEY |
| archive_storage_type | VARCHAR(4096) | NOT NULL |
| archive_storage_directory | VARCHAR(4096) | NOT NULL |

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

| Column Name | Data Type | Constraints |
|---|---|---|
| name | VARCHAR(512) | NOT NULL |
| type | TINYINT | NOT NULL |

Primary key: (name, type)

ClpSplitProvider

In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The ClpSplitProvider interface, shown below, defines how to retrieve split information from a specified source:

public interface ClpSplitProvider {
    List<ClpSplit> listSplits(ClpTableLayoutHandle clpTableLayoutHandle);  
}

We provide a default implementation called ClpMySqlSplitProvider. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

| Column Name | Data Type | Constraints |
|---|---|---|
| pagination_id | BIGINT | AUTO_INCREMENT PRIMARY KEY |
| id | VARCHAR(128) | NOT NULL |
| ... | ... | ... |

By concatenating the table path (archive_storage_directory) and the archive ID (id), we can retrieve all split paths for a table.
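A minimal sketch of that concatenation is shown below. The class and method names are hypothetical, and the "/" separator between the directory and the archive ID is an assumption; the actual ClpMySqlSplitProvider may join the two differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of the presto-clp module.
final class SplitPathSketch {
    // Builds one split path per archive ID under the table's storage directory.
    static List<String> splitPaths(String archiveStorageDirectory, List<String> archiveIds) {
        // Assumption: path segments are separated by a single '/'.
        String base = archiveStorageDirectory.endsWith("/")
                ? archiveStorageDirectory
                : archiveStorageDirectory + "/";
        List<String> paths = new ArrayList<>();
        for (String id : archiveIds) {
            paths.add(base + id);
        }
        return paths;
    }
}
```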


ClpMetadata

This interface enables Presto to access various metadata. All requests are delegated to ClpMetadataProvider.

For metadata management, it also maintains two caches and periodically refreshes the metadata.

  • columnHandleCache: A LoadingCache<SchemaTableName, List<ClpColumnHandle>> that maps a SchemaTableName to its corresponding list of ClpColumnHandle objects.
  • tableHandleCache: A LoadingCache<String, List<ClpTableHandle>> that maps a schema name (String) to a list of ClpTableHandle objects.
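To illustrate the expiry behaviour that clp.metadata-expire-interval controls, here is a standard-library stand-in for a loading cache. It is not Guava's LoadingCache and not the connector's code: the class ExpiringCacheSketch, its constructor parameters, and the injected clock are hypothetical, and periodic background refresh is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a loading cache with expire-after-write semantics.
final class ExpiringCacheSketch<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // e.g. a call into a metadata provider
    private final long expireAfterMillis;  // plays the role of the expire interval
    private final LongSupplier clock;      // injected so the behaviour is testable

    ExpiringCacheSketch(Function<K, V> loader, long expireAfterMillis, LongSupplier clock) {
        this.loader = loader;
        this.expireAfterMillis = expireAfterMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it when it is missing or expired.
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old == null || now - old.loadedAtMillis >= expireAfterMillis)
                        ? new Entry<>(loader.apply(k), now)
                        : old);
        return entry.value;
    }
}
```

In the connector, the loader would correspond to a call such as listColumnHandles or listTableNames on the ClpMetadataProvider.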

ClpConnectorPlanOptimizer and ClpFilterToKqlConverter

Presto exposes PlanNode to the connector, allowing the connector to push down relevant filters to CLP for improved query performance.

There are three main steps:

  1. Transforming filter predicates into KQL queries
  2. Adding the generated KQL query to ClpTableLayoutHandle and constructing a new TableScanNode
  3. Reconstructing a FilterNode with any remaining predicates and the new TableScanNode

ClpFilterToKqlConverter implements RowExpressionVisitor<ClpExpression, Void> and handles expression transformation and pushdown. Since KQL is not SQL-compatible, only certain types of filters can be converted, including:

  • String exact match
  • LIKE predicate (the "^%[^%_]*%$" pattern is not pushed down)
  • Numeric comparisons
  • Logical operators (AND, OR, NOT)
  • IS NULL
  • substr(a, start, length) = 'abc' and substr(a, start) = 'abc' forms

For comparison or match expressions, one side must contain a VariableReferenceExpression, and the other must be a ConstantExpression, specifically a string or numeric literal. Expressions like a > c or substr(a, 2) = lower(c) are not eligible for pushdown. In simple cases without logical operators, the SQL predicate can be translated directly into a KQL query. For expressions involving logical operators, however, the pushable and non-pushable parts must be separated carefully so that the query's semantics are preserved.

A naive approach would be to convert only the pushdown-eligible parts of the SQL query into KQL, letting Presto or Prestissimo handle the rest. But this can lead to incorrect execution plans and unintended behavior.

Consider the following query:

SELECT * FROM clp_table WHERE regexp_like(a, '\d+b') OR b = 2

Since CLP doesn’t currently support regexp_like, if we simply push down b = 2 to CLP, Presto will only receive results where b = 2, effectively changing the query semantics to regexp_like(a, '\d+b') AND b = 2.

To prevent such issues, the pushdown logic follows these rules:

  • OR pushdown: An OR condition is pushable if and only if all child expressions can be pushed down. In this case, all child expressions are pushed down together with the OR operator.
  • AND pushdown: An AND condition is not pushable if and only if none of its child expressions are pushable. Otherwise, pushable expressions are pushed down with the AND operator, while non-pushable expressions remain in the original plan.
  • General Pushdown Condition: An expression is considered pushable if it’s a CallExpression and can be transformed into KQL syntax.
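The OR and AND rules can be sketched over a toy expression tree. This is not the ClpFilterToKqlConverter implementation: the classes below (PushdownSketch, Leaf, And, Or, Result) are hypothetical, leaves carry a pre-computed KQL string (or null when no KQL equivalent exists), a null kql stands for "nothing pushed down", and composite output is parenthesized for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the pushdown rules; not the connector's classes.
final class PushdownSketch {
    static final class Result {
        final String kql;        // null when nothing can be pushed down
        final String remaining;  // null when the expression is pushed down completely

        Result(String kql, String remaining) {
            this.kql = kql;
            this.remaining = remaining;
        }
    }

    interface Expr {
        String sql();
        Result convert();
    }

    static final class Leaf implements Expr {
        private final String sql;
        private final String kql; // null: this predicate has no KQL equivalent

        Leaf(String sql, String kql) {
            this.sql = sql;
            this.kql = kql;
        }

        public String sql() { return sql; }

        public Result convert() {
            return kql != null ? new Result(kql, null) : new Result(null, sql);
        }
    }

    static final class Or implements Expr {
        private final List<Expr> children;

        Or(Expr... children) { this.children = Arrays.asList(children); }

        public String sql() { return join(sqlOf(children), " OR "); }

        public Result convert() {
            List<String> kqlParts = new ArrayList<>();
            for (Expr child : children) {
                Result r = child.convert();
                // One child that is not fully pushable keeps the whole OR in Presto.
                if (r.remaining != null) {
                    return new Result(null, sql());
                }
                kqlParts.add(r.kql);
            }
            return new Result(join(kqlParts, " OR "), null);
        }
    }

    static final class And implements Expr {
        private final List<Expr> children;

        And(Expr... children) { this.children = Arrays.asList(children); }

        public String sql() { return join(sqlOf(children), " AND "); }

        public Result convert() {
            List<String> kqlParts = new ArrayList<>();
            List<String> remainingParts = new ArrayList<>();
            for (Expr child : children) {
                Result r = child.convert();
                if (r.kql != null) kqlParts.add(r.kql);                   // pushed to CLP
                if (r.remaining != null) remainingParts.add(r.remaining); // stays in the FilterNode
            }
            return new Result(
                    kqlParts.isEmpty() ? null : join(kqlParts, " AND "),
                    remainingParts.isEmpty() ? null : join(remainingParts, " AND "));
        }
    }

    private static List<String> sqlOf(List<Expr> exprs) {
        List<String> out = new ArrayList<>();
        for (Expr e : exprs) out.add(e.sql());
        return out;
    }

    // A single part is returned as-is; several parts are joined and parenthesized.
    private static String join(List<String> parts, String op) {
        return parts.size() == 1 ? parts.get(0) : "(" + String.join(op, parts) + ")";
    }
}
```

Applied to the query above, the OR of a non-convertible regexp_like leaf and a convertible b = 2 leaf produces no KQL and keeps the whole OR for Presto, whereas the same two leaves under AND push down only the b = 2 part.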

Example transformations:

  1. SQL WHERE Clause: a = 2 AND b = '3'

    • KQL: a: 2 AND b: "3"
    • Effect: The FilterNode is removed.
  2. SQL WHERE Clause: a.b LIKE '%string another_string' OR "c.d" = TRUE

    • KQL: a.b: "*string another_string" OR c.d: true
    • Effect: The FilterNode is removed.
  3. SQL WHERE Clause: a > 2 OR regexp_like(b, '\d+b')

    • KQL: "*"
    • Effect: a > 2 OR regexp_like(b, '\d+b') remains in the FilterNode.
  4. SQL WHERE Clause: a > 2 AND regexp_like(b, '\d+b')

    • KQL: a > 2
    • Effect: regexp_like(b, '\d+b') remains in the FilterNode.
  5. SQL WHERE Clause: NOT substr(a, 2, 5) = 'Hello' AND b IS NULL

    • KQL: NOT a: "?Hello*" AND NOT b: *
    • Effect: The FilterNode is removed.
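Examples 2 and 5 suggest how LIKE patterns and substr comparisons map to KQL wildcards: % becomes *, _ becomes ?, and the characters skipped by substr become leading ? placeholders. The sketch below reproduces just those two examples. The class and method names are hypothetical, escaping of KQL special characters is ignored, the substr(a, start) form without a trailing * is an assumption, and substr calls whose length differs from the literal's length are simply treated as not handled here.

```java
// Hypothetical helper: not the connector's ClpFilterToKqlConverter.
final class KqlPatternSketch {
    // Translates column LIKE pattern into KQL, or returns null when not pushed down.
    static String likeToKql(String column, String likePattern) {
        // Patterns of the form ^%[^%_]*%$ are not pushed down.
        if (likePattern.matches("%[^%_]*%")) {
            return null;
        }
        String wildcard = likePattern.replace('%', '*').replace('_', '?');
        return column + ": \"" + wildcard + "\"";
    }

    // Translates substr(column, start[, length]) = literal into KQL.
    // Pass length == null for the two-argument substr form.
    static String substrEqualsToKql(String column, int start, Integer length, String literal) {
        if (start < 1) {
            return null; // not handled in this sketch
        }
        if (length != null && length != literal.length()) {
            return null; // not handled in this sketch
        }
        StringBuilder pattern = new StringBuilder();
        for (int i = 1; i < start; i++) {
            pattern.append('?'); // one placeholder per skipped character
        }
        pattern.append(literal);
        if (length != null) {
            pattern.append('*'); // anything may follow the extracted substring
        }
        return column + ": \"" + pattern + "\"";
    }
}
```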

Motivation and Context

See the associated RFC.

Impact

This module is independent from other modules and will not affect any existing functionality.

Test Plan

Unit tests are included in this PR, and we have also done end-to-end tests.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Connector Changes
* Add coordinator support for the CLP connector :pr:`24868`
* Add documentation for the CLP connector :doc:`../connector/clp` :pr:`24868`
