Parquet: Fix readers crashing on 2-level (Thrift) list encoding#15747
Open
josephperez3 wants to merge 2 commits intoapache:mainfrom
Open
Parquet: Fix readers crashing on 2-level (Thrift) list encoding#15747josephperez3 wants to merge 2 commits intoapache:mainfrom
josephperez3 wants to merge 2 commits intoapache:mainfrom
Conversation
481d251 to
0844e46
Compare
pvary
reviewed
Mar 24, 2026
|
|
||
| return new ArrayReader<>( | ||
| repeatedD, repeatedR, ParquetValueReaders.option(elementType, elementD, elementReader)); | ||
| if (ParquetSchemaUtil.isOldListElementType(array)) { |
Contributor
There was a problem hiding this comment.
What about having this duplicated code in a single place, like:
public static <E> ResolvedList<E> resolveList(
MessageType fileType,
GroupType array,
String[] repeatedPath,
String[] elementPath,
ParquetValueReader<E> elementReader) {
Type elementType = ParquetSchemaUtil.determineListElementType(array);
int elementD = fileType.getMaxDefinitionLevel(elementPath) - 1;
if (ParquetSchemaUtil.isOldListElementType(array)) {
int elementR = fileType.getMaxRepetitionLevel(elementPath) - 1;
return new ResolvedList<>(elementD, elementR, elementReader);
} else {
int repeatedD = fileType.getMaxDefinitionLevel(repeatedPath) - 1;
int repeatedR = fileType.getMaxRepetitionLevel(repeatedPath) - 1;
return new ResolvedList<>(
repeatedD, repeatedR, option(elementType, elementD, elementReader));
}
}
Author
There was a problem hiding this comment.
Yup that's cleaner thanks. Just pushed a fix.
60d36b5 to
2e31e94
Compare
2e31e94 to
a1e23a1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #9497
Problem
PR #3774 added
visitTwoLevelList()to correctly detect 2-level lists during schema traversal, but thelist()methods in the reader builders were not updated.visitTwoLevelList()does not push the repeated field name onto thefieldNamesstack, socurrentPath()is one segment short. This producesdefLevel=0andrepLevel=-1instead of the correct values, causingRepeatedReaderto loop indefinitely and crash withParquetDecodingException: Reading past RLE/BitPacking stream.Fix
In each
list()method, detect 2-level lists viaisOldListElementType()and compute def/rep levels from the element path instead ofcurrentPath(). We can skip theOptionReaderwrapping since elements are always non-null by nature of two level encoding.Fixed for both Flink and Spark read paths.
Testing
New tests in
TestSparkParquetReaderandTestFlinkParquetReader: write a 2-level Parquet file which includes some empty lists and verifies each row.Existing tests all pass, and without the fix, the new tests crash: