[bug] Parquet Source: snapshot sync fails on multiple commits with partitions #806

Draft
the-other-tim-brown wants to merge 3 commits into apache:main from the-other-tim-brown:parquet-source-snapshot-failure

the-other-tim-brown commented Feb 22, 2026

What is the purpose of the pull request

Fixes snapshot sync failures in the Parquet source when a partitioned table has multiple commits.

Closes #807

Brief change log

  • Updated integration test

Verify this pull request

The integration test now makes two commits and verifies the results for both partitioned and non-partitioned tables.

return partitionConfigs.stream()
    .flatMap(
        partitionConfig ->
            Stream.of(SyncMode.FULL) // Incremental sync is not yet supported
Contributor Author

This provides the setup for the incremental sync mode, so the tests can be quickly updated to cover those cases once incremental sync is supported.
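For illustration, a minimal standalone sketch of how the parameter stream could expand once another mode is added. The `SyncMode` enum, the `combinations` helper, and the placeholder config strings below are hypothetical stand-ins, not the project's actual test code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SyncModeArguments {
  // Hypothetical stand-in for the project's SyncMode enum.
  enum SyncMode {
    FULL,
    INCREMENTAL
  }

  // Builds one "config/mode" label per (partition config, sync mode) pair.
  static List<String> combinations(List<String> partitionConfigs) {
    return partitionConfigs.stream()
        .flatMap(
            partitionConfig ->
                // Only FULL is listed today; adding SyncMode.INCREMENTAL here
                // would double the generated cases without other changes.
                Stream.of(SyncMode.FULL)
                    .map(syncMode -> partitionConfig + "/" + syncMode))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Placeholder config names, chosen only for this example.
    System.out.println(combinations(Arrays.asList("unpartitioned", "partitioned")));
  }
}
```

Keeping the mode list inside the `flatMap` means the partition configs and the sync modes stay independent, so covering a new mode should be a one-line change.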

the-other-tim-brown added the bug label Feb 22, 2026
    .toArray(String[]::new);
// add partition columns to dataframe
for (String partitionCol : partitionCols) {
  if (partitionCol.equals("year")) {
Contributor

The if/else if chain only handles "year", "month", and "day". If the partition config ever contains a different column, no column is added to the DataFrame but partitionBy still references it, which causes a confusing AnalysisException. Add an else branch that throws new IllegalArgumentException("Unsupported partition column: " + partitionCol) to fail fast.
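A minimal sketch of the suggested fail-fast branch, kept free of Spark so it runs on its own. The `datePatternFor` helper and the patterns it returns are assumptions made for this example; the real test adds DataFrame columns rather than returning strings.

```java
final class PartitionColumnCheck {
  // Hypothetical helper: maps a supported partition column to a date pattern.
  // The patterns are illustrative and not taken from the test under review.
  static String datePatternFor(String partitionCol) {
    if (partitionCol.equals("year")) {
      return "yyyy";
    } else if (partitionCol.equals("month")) {
      return "MM";
    } else if (partitionCol.equals("day")) {
      return "dd";
    } else {
      // Fail fast so an unknown column is reported here instead of
      // surfacing later as an error from the write path.
      throw new IllegalArgumentException("Unsupported partition column: " + partitionCol);
    }
  }

  public static void main(String[] args) {
    System.out.println(datePatternFor("year"));
    try {
      datePatternFor("hour");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With the final else in place, an unsupported name is rejected at the point where the columns are derived, and the message names the offending column.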

  .formatName(formatName)
  // set the metadata path to the data path as the default (required by Hudi)
- .basePath(table.getDataPath())
+ .basePath(table.getBasePath())
Contributor

Nit: the comment on line 154 still says "set the metadata path to the data path" but the code now uses getBasePath(). Update the comment to match.

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Snapshot sync with Parquet source does not work on partitioned tables

2 participants