Skip to content

Parquet: Fix readers crashing on 2-level (Thrift) list encoding#15747

Open
josephperez3 wants to merge 2 commits intoapache:mainfrom
josephperez3:fix-two-level-list-reading
Open

Parquet: Fix readers crashing on 2-level (Thrift) list encoding#15747
josephperez3 wants to merge 2 commits intoapache:mainfrom
josephperez3:fix-two-level-list-reading

Conversation

@josephperez3
Copy link
Copy Markdown

Closes #9497

Problem

PR #3774 added visitTwoLevelList() to correctly detect 2-level lists during schema traversal, but the list() methods in the reader builders were not updated. visitTwoLevelList() does not push the repeated field name onto the fieldNames stack, so currentPath() is one segment short. This produces defLevel=0 and repLevel=-1 instead of the correct values, causing RepeatedReader to loop indefinitely and crash with ParquetDecodingException: Reading past RLE/BitPacking stream.

Fix

In each list() method, detect 2-level lists via isOldListElementType() and compute def/rep levels from the element path instead of currentPath(). We can skip the OptionReader wrapping since elements are always non-null by nature of two level encoding.

Fixed for both Flink and Spark read paths.

Testing

New tests in TestSparkParquetReader and TestFlinkParquetReader: write a 2-level Parquet file which includes some empty lists and verifies each row.

Existing tests all pass, and without the fix, the new tests crash:

ParquetDecodingException: Can't read value in column [names, names_tuple]
repeated binary names_tuple (STRING) at value 7 out of 7 in current page.
repetition level: 0, definition level: 2

Caused by: could not read bytes at offset 40


return new ArrayReader<>(
repeatedD, repeatedR, ParquetValueReaders.option(elementType, elementD, elementReader));
if (ParquetSchemaUtil.isOldListElementType(array)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about having this duplicated code in a single place, like:

  public static <E> ResolvedList<E> resolveList(
      MessageType fileType,
      GroupType array,
      String[] repeatedPath,
      String[] elementPath,
      ParquetValueReader<E> elementReader) {
    Type elementType = ParquetSchemaUtil.determineListElementType(array);
    int elementD = fileType.getMaxDefinitionLevel(elementPath) - 1;

    if (ParquetSchemaUtil.isOldListElementType(array)) {
      int elementR = fileType.getMaxRepetitionLevel(elementPath) - 1;
      return new ResolvedList<>(elementD, elementR, elementReader);
    } else {
      int repeatedD = fileType.getMaxDefinitionLevel(repeatedPath) - 1;
      int repeatedR = fileType.getMaxRepetitionLevel(repeatedPath) - 1;
      return new ResolvedList<>(
          repeatedD, repeatedR, option(elementType, elementD, elementReader));
    }
  }

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup that's cleaner thanks. Just pushed a fix.

@josephperez3 josephperez3 force-pushed the fix-two-level-list-reading branch from 60d36b5 to 2e31e94 Compare March 24, 2026 21:08
@josephperez3 josephperez3 force-pushed the fix-two-level-list-reading branch from 2e31e94 to a1e23a1 Compare March 24, 2026 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants