Closes #2880: Add parallel writing when writing pdarrays to Parquet #2881
bmcdonald3 wants to merge 3 commits into Bears-R-Us:master from
Conversation
stress-tess
left a comment
A whole lot of comments from me all just to say... LGTM lol
Nice work ben! This looks great!! 🚀
```chapel
              dtype, compression,
              errMsg): int;
var dtypeRep = toCDtype(dtype);
var doParallel = if A.size > parallelWriteThreshold then true else false;
```
Okay, this is pedantic, but this could just be `var doParallel = A.size > parallelWriteThreshold;` right?
```chapel
var fileSizes: [0..#loc.maxTaskPar] int = locDom.size/loc.maxTaskPar;
// First file has the extra elements if it isn't evenly divisible by maxTaskPar
fileSizes[0] += locDom.size - ((locDom.size/loc.maxTaskPar)*loc.maxTaskPar);
```
I had to convince myself that the integer division would always round down. This seems to check out, and the adjustment gave what I expected on a small example!
```chapel
var fileSizes: [0..#6] int = 10/6; // 1.6666, verify this rounds down
writeln(fileSizes);
var leftOver = 10 - ((10/6)*6);
writeln(leftOver);
```
which prints:
```
1 1 1 1 1 1
4
```
I will say, in my small example this resulted in a pretty unbalanced distribution, but I think that in a real case `locDom.size` would be large enough relative to `loc.maxTaskPar` that it would be pretty uniform... I'm just now realizing that's probably part of the motivation for having a `parallelWriteThreshold` lol
Since `leftOver` is the remainder of `locDom.size/loc.maxTaskPar`, it should be guaranteed to be less than `fileSizes.size`. So if we wanted to distribute the remainder more evenly, we could do something like:
```chapel
var fileSizes: [0..#6] int = 10/6;
writeln(fileSizes);
var leftOver = 10 - ((10/6)*6);
writeln(leftOver);
fileSizes[0..#leftOver] += 1;
writeln(fileSizes);
```
which prints:
```
1 1 1 1 1 1
4
2 2 2 2 1 1
```
But this is probably overengineering something that isn't actually a problem.
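For reference, here's a minimal standalone sketch of that remainder-spreading idea; `numTasks` and `numElems` are hypothetical stand-ins for `loc.maxTaskPar` and `locDom.size`, not names from the PR:

```chapel
config const numTasks = 6;   // stand-in for loc.maxTaskPar
config const numElems = 10;  // stand-in for locDom.size

// Integer division truncates, so every file starts at the floor of the average
var fileSizes: [0..#numTasks] int = numElems / numTasks;
// Equivalent to numElems - ((numElems/numTasks)*numTasks), and always < numTasks
const leftOver = numElems % numTasks;
// Spread the remainder one extra element per file instead of putting it all on file 0
fileSizes[0..#leftOver] += 1;

writeln(fileSizes); // 2 2 2 2 1 1
```

Since `leftOver` is always less than `numTasks`, the resulting file sizes differ by at most one element.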
There are pathological cases where this could make a big difference, so I lean towards implementing the change from Tess's comment.
Tagging @e-kayrakli, who's working on Parquet improvements and may not be following this PR.
```chapel
var suffix = '%04i'.format(idx): string;
var parSuffix = '%04i'.format(i): string;
const parFilename = filename + "_LOCALE" + suffix + "_CORE" + parSuffix + ".parquet";
var oi = if i == 0 then i else offsets[i-1];
```
Instead of indexing back by one, couldn't you do
```chapel
var offsets = (+ scan fileSizes) - fileSizes;
forall (i, off, len) in zip(fileSizes.domain, offsets, fileSizes) {
  ...
```
I don't think this would have any performance difference, but this is more similar to how we calculate offsets in other places. I normally prefer looping variables over indexing when possible because it makes it easier for me to tell at a glance what's local, but that doesn't apply here. So there's def no need to change.
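For anyone unfamiliar with the idiom, here's a small standalone demo of computing offsets this way; the sizes and loop body are made up for illustration:

```chapel
var fileSizes = [4, 4, 1, 1];
// An inclusive scan minus the element itself yields an exclusive scan,
// i.e. each offset is the sum of all preceding file sizes
var offsets = (+ scan fileSizes) - fileSizes; // 0 4 8 9

// Output order may vary since forall iterations run in parallel
forall (i, off, len) in zip(fileSizes.domain, offsets, fileSizes) {
  writeln("file ", i, " covers elements ", off, "..", off + len - 1);
}
```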
Since @bmcdonald3 is out and none of my comments are blocking, I'll go ahead and merge this. Thanks again ben!!! EDIT: apparently ben is in and he wants to hold off for string support
This PR adds support for a `parallelWriteThreshold` flag that allows a user to determine the size of the Parquet files to be written and then write those files in parallel, appending a `_CORE####` suffix to the end of each file name. By running the Arkouda server with `./arkouda_server --ParquetMsg.parallelWriteThreshold=<num>`, a user is able to control the size of the files that are going to be written. This is currently only supported on pdarrays of natively-supported datatypes (meaning not strings or dataframes), but follow-up work is on the way.
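As a rough illustration of the intended usage (the threshold value and file names below are made up, but the names follow the `_LOCALE####`/`_CORE####` pattern described above):

```
$ ./arkouda_server --ParquetMsg.parallelWriteThreshold=1000
# Writing a pdarray larger than 1000 elements to "myfile" could then produce
# one file per core on each locale, e.g.:
#   myfile_LOCALE0000_CORE0000.parquet
#   myfile_LOCALE0000_CORE0001.parquet
#   ...
```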
Force-pushed from 929f948 to 159db71
Reviewing PRs from our team today, I came across this one and wondered about its status. My understanding is that it's being held for string support.
Tagging @e-kayrakli for awareness and @jhh67 due to his recent Parquet work (albeit in the runtime rather than module-level code).