Delta column mapping mode caused wrong result with partition column filters #10511

@sezruby

Description

Backend

VL (Velox)

Bug description

spark.sql(s"""
  |create table delta_cm2 (id int, name string) using delta
  |partitioned by (id)
  |tblproperties ("delta.columnMapping.mode" = "name")
  |""".stripMargin)
spark.sql(s"""
  |insert into delta_cm2 values (1, "v1"), (2, "v2"), (3, "v3")
  |""".stripMargin)
spark.sql("select name from delta_cm2 where id > 2")

[Expected behavior]
returns ["v3"]

[Actual behavior]
returns ["v1", "v2", "v3"]

  == Physical Plan ==
  VeloxColumnarToRow
  +- ^(32) ProjectExecTransformer [col-4aba4d55-65eb-461f-9a50-154cb6b1c6ec#6087 AS name#6087]
     +- ^(32) FileScanTransformer parquet spark_catalog.default.delta_cm2[col-4aba4d55-65eb-461f-9a50-154cb6b1c6ec#6087,col-9a0726cb-9edd-4f44-b7f7-a1ccfc5474c3#6086] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[file:/root/gluten/backends-velox/spark-warehouse/org.apache.glute..., PartitionFilters: [isnotnull(col-9a0726cb-9edd-4f44-b7f7-a1ccfc5474c3#6086), (col-9a0726cb-9edd-4f44-b7f7-a1ccfc547..., PushedFilters: [], ReadSchema: struct<col-4aba4d55-65eb-461f-9a50-154cb6b1c6ec:string> NativeFilters: []
  
  == Results ==
  !== Correct Answer - 1 ==   == Gluten Answer - 3 ==
   struct<>                   struct<>
  ![v3]                       [v1]
  !                           [v2]
  !                           [v3] (GlutenQueryTest.scala:437)

This happens because Delta's file filtering expects logical column names in the filters it receives:
https://github.com/delta-io/delta/blob/44c619e51846f3e98aa6605d13fa4517de049281/spark/src/main/scala/org/apache/spark/sql/delta/stats/PrepareDeltaScan.scala#L356

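Gluten, however, has already rewritten logical column names to physical column names for the Parquet reader and filter pushdown, so the partition filters reach Delta with physical names (see PartitionFilters in the plan above).
In vanilla JVM Spark, Delta makes that switch later, when it creates the Parquet reader: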
https://github.com/delta-io/delta/blob/44c619e51846f3e98aa6605d13fa4517de049281/spark/src/main/scala/org/apache/spark/sql/delta/DeltaParquetFileFormat.scala#L179
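
The mismatch can be illustrated with a toy model. This is only a sketch and not Delta's or Gluten's code: the DataFile and GreaterThan types and the prune and scan functions are hypothetical, and the rule that an unresolvable filter column prunes nothing is an assumption chosen to match the observed result. The physical names are copied from the plan above.

```scala
// Toy model only: none of these types exist in Delta or Gluten.
object ColumnMappingMismatch {
  // Logical -> physical names, copied from the physical plan above.
  val logicalToPhysical: Map[String, String] = Map(
    "id"   -> "col-9a0726cb-9edd-4f44-b7f7-a1ccfc5474c3",
    "name" -> "col-4aba4d55-65eb-461f-9a50-154cb6b1c6ec"
  )

  // One data file per partition value, keyed by the LOGICAL partition column.
  final case class DataFile(partition: Map[String, Int], rows: Seq[String])

  val files: Seq[DataFile] = Seq(
    DataFile(Map("id" -> 1), Seq("v1")),
    DataFile(Map("id" -> 2), Seq("v2")),
    DataFile(Map("id" -> 3), Seq("v3"))
  )

  // A "column > bound" partition filter.
  final case class GreaterThan(column: String, bound: Int)

  // ASSUMPTION: a filter whose column cannot be resolved prunes nothing.
  def prune(filter: GreaterThan): Seq[DataFile] =
    files.filter { f =>
      f.partition.get(filter.column) match {
        case Some(value) => value > filter.bound
        case None        => true
      }
    }

  // Read every row of every file that survives pruning.
  def scan(filter: GreaterThan): Seq[String] = prune(filter).flatMap(_.rows)

  def main(args: Array[String]): Unit = {
    // Filter on the logical name: only the id = 3 file survives.
    println(scan(GreaterThan("id", 2)))                    // List(v3)
    // Filter on the physical name: nothing is pruned.
    println(scan(GreaterThan(logicalToPhysical("id"), 2))) // List(v1, v2, v3)
  }
}
```

In this model the second call reproduces the three-row answer from the report, because the filter column no longer matches any partition column known to the pruning step.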

We need to add a fallback to vanilla Spark for tables that use column mapping, in both id and name modes.
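
A minimal sketch of the check such a fallback might start from, assuming the table properties are available as a plain Map. The object and function names are hypothetical and are not part of Gluten or Delta. Only the property key (taken from the repro above) and the mode values id and name come from this report. Normalising case and whitespace is an assumption.

```scala
import java.util.Locale

// Hypothetical helper, not an existing Gluten or Delta API.
object ColumnMappingFallback {
  // Table property key used in the repro's tblproperties clause.
  val ColumnMappingModeKey: String = "delta.columnMapping.mode"

  // True when the table uses id or name column mapping, i.e. when the
  // scan should be left to vanilla Spark under the proposal above.
  // ASSUMPTION: the value is compared case-insensitively after trimming.
  def usesColumnMapping(tableProperties: Map[String, String]): Boolean =
    tableProperties
      .get(ColumnMappingModeKey)
      .map(_.trim.toLowerCase(Locale.ROOT)) match {
      case Some("id") | Some("name") => true
      case _                         => false
    }

  def main(args: Array[String]): Unit = {
    println(usesColumnMapping(Map(ColumnMappingModeKey -> "name"))) // true
    println(usesColumnMapping(Map(ColumnMappingModeKey -> "id")))   // true
    println(usesColumnMapping(Map.empty))                           // false
  }
}
```

Where this check would be wired into Gluten's scan validation is left open here; the sketch only shows the decision itself.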

Gluten version

Gluten-1.3

Spark version

Spark-3.5.x

Spark configurations

Spark 3.5 and Delta 3.2

System information

No response

Relevant logs

Labels: bug, triage
