[8016] Support writing hydrated REE arrays to Parquet #10064
Open
Rich-T-kid wants to merge 3 commits into
Rich-T-kid
commented
Jun 4, 2026
With this PR, it's now possible to take an REE array and write it out to Parquet. Tests and benchmarks are included in the PR. I also ran a couple of local tests and used parquetReader to validate the output.
I tried to break the commits into 3 independent pieces: tests & benchmarks, implementation, and more tests/edge cases & benchmarks (null density).
Which issue does this PR close?
This PR works towards an initial solution for closing #8016.
Rationale for this change
Currently arrow_writer does not support writing Run End Encoded columns out to Parquet. This PR works towards solving this by first expanding the REE array to its value type and then writing that out to Parquet. Once it is possible to write REE to Parquet, we can work on optimizing it by keeping the compact encoding intact.
What changes are included in this PR?
arrow_writer() now supports writing Run End Encoded (REE) arrays to Parquet by hydrating them to their underlying value type before encoding. This is an initial, correctness-first implementation. A follow-up can/should optimize to preserve the compacted structure.
parquet/src/arrow/arrow_writer/mod.rs: generate a value-type arrow-column writer, plus tests.
parquet/src/arrow/arrow_writer/levels.rs: core writer logic updated to detect REE columns and expand them to their flat value type before the existing write path.
parquet/src/arrow/schema/mod.rs: schema conversion updated to map RunEndEncodedType to an appropriate Parquet physical type.
parquet/benches/arrow_writer.rs: REE write benchmarks added with low and high null density scenarios, now unblocked by the implementation.
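For readers unfamiliar with the term, here is a minimal sketch of what "hydrating" an REE column means. It is not the code in this PR and does not use the arrow-rs types: `RunEndEncoded` and `hydrate` are hypothetical names made up for illustration, and the sketch assumes run ends are strictly increasing, exclusive logical end offsets.

```rust
/// Hypothetical, simplified stand-in for a run-end encoded column
/// (NOT the arrow-rs type). `run_ends[i]` is the exclusive logical end
/// index of run `i`; `values[i]` is the value repeated across that run,
/// with `None` standing in for a null run.
struct RunEndEncoded<T> {
    run_ends: Vec<i32>,
    values: Vec<Option<T>>,
}

/// Expand ("hydrate") the runs into one flat value per logical row.
/// Assumes `run_ends` is strictly increasing and as long as `values`.
fn hydrate<T: Clone>(ree: &RunEndEncoded<T>) -> Vec<Option<T>> {
    // The last run end is the logical length of the column.
    let len = ree.run_ends.last().copied().unwrap_or(0) as usize;
    let mut flat = Vec::with_capacity(len);
    let mut start = 0usize;
    for (end, value) in ree.run_ends.iter().zip(&ree.values) {
        let end = *end as usize;
        // Run `i` covers logical rows `start..end`.
        flat.extend(std::iter::repeat(value.clone()).take(end - start));
        start = end;
    }
    flat
}
```

With run ends `[2, 5, 6]` and values `["a", null, "b"]`, this produces six rows: `a, a, null, null, null, b`. The flat output is what an existing value-type write path can consume, at the cost of losing the compaction, which is the trade-off the follow-up work is meant to address.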
Are these changes tested?
Yes
Are there any user-facing changes?
Users will be able to write their REE columns out to Parquet using arrow_writer.