Parquet source: Expand column stats support, fix bugs in schema conversion #805

the-other-tim-brown wants to merge 4 commits into apache:main
Conversation
```diff
@@ -65,12 +65,6 @@ public static ParquetStatsExtractor getInstance() {
   private static PathBasedPartitionSpecExtractor partitionSpecExtractor =
```
partitionValueExtractor (line 63), partitionSpecExtractor (line 65), and getMaxFromColumnStats (line 68) appear to be dead code after the removal of toInternalDataFile. These should be cleaned up along with the unused import org.apache.hadoop.fs.*; on line 32.
```java
ColumnChunkMetaData first = chunks.get(0);
InternalField internalField =
    SchemaFieldFinder.getInstance()
        .findFieldByPath(internalSchema, first.getPath().toDotString());
```
SchemaFieldFinder.findFieldByPath() returns null if the column path doesn't exist in the schema (per its Javadoc). This null would silently propagate into the ColumnStat. Consider adding Objects.requireNonNull(internalField, "No field found for path: " + first.getPath().toDotString()).
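The suggested guard can be sketched with a plain map standing in for the schema lookup (`FieldLookupCheck`, `lookupField`, and the map contents are illustrative, not part of the PR):

```java
import java.util.Map;
import java.util.Objects;

public class FieldLookupCheck {
    // Stand-in for SchemaFieldFinder.findFieldByPath: returns null for unknown paths.
    // requireNonNull turns the silent null into an immediate, descriptive failure.
    static String lookupField(Map<String, String> schema, String path) {
        return Objects.requireNonNull(
            schema.get(path), "No field found for path: " + path);
    }

    public static void main(String[] args) {
        Map<String, String> schema = Map.of("user.id", "INT64");
        System.out.println(lookupField(schema, "user.id")); // prints INT64
        // lookupField(schema, "user.name") would throw NullPointerException
        // with message "No field found for path: user.name"
    }
}
```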
```java
long totalSize = chunks.stream().mapToLong(ColumnChunkMetaData::getTotalSize).sum();
Object globalMin =
    chunks.stream()
        .map(c -> convertStatsToInternalType(primitiveType, c.getStatistics().genericGetMin()))
```
c.getStatistics().genericGetMin() returns null for all-null columns (hasNonNullValue() == false). This would NPE inside convertStatsToInternalType. Consider filtering out chunks where stats have no non-null values before the min/max aggregation.
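One way to apply the suggested filter, with a minimal `Stats` record standing in for Parquet's `Statistics` object (the record and its values are illustrative assumptions, not the PR's code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class MinOverNonNullStats {
    // Minimal stand-in for org.apache.parquet.column.statistics.Statistics:
    // when hasNonNullValue is false, min is null (as genericGetMin() would be).
    record Stats(Long min, boolean hasNonNullValue) {}

    // Skip all-null chunks before aggregating, so null never reaches the converter.
    static Optional<Long> globalMin(List<Stats> chunks) {
        return chunks.stream()
            .filter(Stats::hasNonNullValue)
            .map(Stats::min)
            .min(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<Stats> chunks = List.of(
            new Stats(5L, true),
            new Stats(null, false), // all-null chunk: min is null
            new Stats(2L, true));
        System.out.println(globalMin(chunks)); // prints Optional[2]
    }
}
```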
```java
getInternalDataFiles(getParquetFiles(hadoopConf, basePath));
InternalTable table = getMostRecentTable(getParquetFiles(hadoopConf, basePath));
Stream<InternalDataFile> internalDataFiles =
    getInternalDataFiles(getParquetFiles(hadoopConf, basePath), table.getReadSchema());
```
getParquetFiles() is called 3 times in getCurrentSnapshot() (lines 186, 188, 193), triggering 3 separate filesystem scans. This is inefficient and could yield inconsistent results if files change between calls. The comment on line 185 says "call the method twice" but it's actually 3 times now. Consider collecting to a list once and creating new streams from it?
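The scan-once approach can be sketched as follows; `listFiles` is a hypothetical stand-in for `getParquetFiles`, and the counter only exists to make the number of scans observable:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ScanOnce {
    static final AtomicInteger scans = new AtomicInteger();

    // Hypothetical stand-in for getParquetFiles(): each call is one filesystem scan.
    static Stream<String> listFiles() {
        scans.incrementAndGet();
        return Stream.of("a.parquet", "b.parquet");
    }

    public static void main(String[] args) {
        // Materialize the listing once, since streams are single-use...
        List<String> files = listFiles().collect(Collectors.toList());
        // ...then derive as many streams as needed from the same snapshot.
        long count = files.stream().count();
        String first = files.stream().findFirst().orElseThrow();
        System.out.println(scans.get() + " scan, " + count + " files, first=" + first);
        // prints: 1 scan, 2 files, first=a.parquet
    }
}
```

Reusing one materialized list also guarantees that all three consumers see the same set of files, which a repeated scan cannot.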
```java
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.*;
```
Nit: the wildcard import org.apache.parquet.schema.*; should be expanded into explicit imports. The explicit import org.apache.parquet.schema.MessageType on line 49 is also redundant, since the wildcard already covers it.
What is the purpose of the pull request
This adds support for more column types in the Parquet source. While adding tests for that functionality, other issues with the schema conversion were discovered and fixed.
Closes #748
Brief change log
Verify this pull request
New unit tests are added for the new functionality, and existing tests are updated to ensure good coverage of the schema conversion logic.