-
Notifications
You must be signed in to change notification settings - Fork 38
feat(c/sedona-libgpuspatial): Add GPU-accelerated spatial join library #310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pwrliang now the CI is working |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing! It has been very cool to watch this project evolve over the last six months and I know this represents a huge amount of work.
This is a large change and I wanted to leave a few high-level things to think about while you're polishing this up.
- I see a bit of commented-out code in some of the files...feel free to file GitHub issues if that code represents a future piece of work that needs doing (and remove the commented-out code)!
- I see CUDA-specific tests, which are great! If there are portions of the code that aren't well-covered by tests we do need to add them (we can open follow-on issues and do this in follow-up PRs too)
- Because we're an Apache project we need the licensing and provenance of the files to be clear. I see some copyright notices from Nvidia...there's a place in LICENSE.md to acknowledge subdirectories where code was copied. We also need license headers on all the files (there's a script in scripts/ that can help do mass addition of the license header to a bunch of files at once).
- It looks like you've done a great job ensuring the casual contributor doesn't have to deal with the GPU build complexity using
default-members. That was one of my initial concerns but it looks great so far.
Give me a ping when you're ready for me to take a look!
@paleolimbot Hi, Dewey, thanks for your attention. Currently, this PR has resolved the license issues, and can pass almost all of the jobs in the CI. I'd like to hear any suggestions from you. @zhangfengcdt has written the Rust part to hook up sedona-db to libgpuspatial, so the credits for building go to him. |
|
@pwrliang Thanks for opening this PR! As per our discussion, we could break down this large PR into smaller ones to make it manageable and reviewable. Ideally, (1)libgpuspatial (c++ and cuda) with tests (2) gpu spatial join module in rust (3) build pipelines and e2e tests. Let me know if you can reduce this to only include (1) with the proper cleanup and apache license headers. We can continue (2) and (3) once this one merges. Thanks! |
@zhangfengcdt Hi Feng, I have changed this PR to only keep the libgpuspatial part according to your suggestions. Please take a look. |
@pwrliang Could you please take a look at the failing lints and tests in the ci? Thanks! |
@zhangfengcdt I have fixed the CI. It failed because the runner did not have NVIDIA drivers, and the tests need to link libcuda.so, which comes from the drivers. After installing them, it succeeded. |
zhangfengcdt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @pwrliang ! It looks good to me now.
|
Thanks Feng for taking a look! I mentioned this to Feng offline, but we're preparing for the 0.2.0 release and I'd like to wait on merging this until we create the release branch. I'll also take a quick look for repo-level stuff before we merge. |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay reviewing this!
Thank you for handling most of the licensing and commented out code issues! I still have a few questions about both (most of the below comments).
After this PR is merged I would like to further separate the C library build such that the mechanism for loading libgpuspatial is deferred to runtime (i.e., sd.load_extension("/path/to/installed/libgpuspatial.so"). This would mean that the Rust build and the C library build are completely separated (and would make it much easier to use this from Python). This is how DuckDB handles extensions and is how I'm planning to handle the s2geography dependency since that's causing some build problems for contributors.
c/sedona-libgpuspatial/libgpuspatial/include/gpuspatial/geom/geometry_collection.cuh
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe our top level LICENSE has lines like this in it. Can these attributions be inlined into that file so they are all in one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will merge the NOTICE file into LICENSE. However, I found a problem that there's no "licenses" folder at all
Line 205 in d53d6a1
| This section summarizes those components and their licenses. See licenses/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was copied over from apache/sedona and doesn't apply here...if we get complaints that the LICENSES file is unweildy or inappropriately formatted we can update its structure. Some Apache projects use licenses/, some use NOTICE.txt, or some inline everything into LICENSE. The important part is that vendored dependencies are acknowledged somewhere.
This file is also missing geoarrow-c-geos.
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Peter Nguyen <petern0408@gmail.com> Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Fix building issue Fix building issue fix test
|
Thank you @paleolimbot I have resolved most of the issues. Here's the list that remains unsolved. License: We need a Line 205 in d53d6a1
Some commented test cases of RelateEngine cannot pass yet. I removed them and opened an issue The spatial join function in https://github.com/apache/sedona-db/pull/310/files/57d9c8b75e991f39332d71bbd034b94a74931ed3#diff-6eb1464960f8bc6d254ac43cb8509261f81845a5da98c29bd55a3278c7ac3fd9 does not have tests. I opened an issue for it #414 |
|
Thank you for these updates! I'll take a look tomorrow morning. Looking forward to getting this merged! |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI failure is because my merge attempt on the lockfile wasn't quite right (sorry!):
error: the lock file /home/runner/work/sedona-db/sedona-db/Cargo.lock needs to be updated but --locked was passed to prevent this
If you want to try to generate the lock file without accessing the network, remove the --locked flag and use --offline instead.
...any run of cargo should fix that.
License: We need a licenses folder to store all the original license files. LICENSE says there's one but it's not there
I left a note inline...for the purposes of this PR just put everything in LICENSE (and add geoarrow-c-geos).
I think that's it from my end!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was copied over from apache/sedona and doesn't apply here...if we get complaints that the LICENSES file is unweildy or inappropriately formatted we can update its structure. Some Apache projects use licenses/, some use NOTICE.txt, or some inline everything into LICENSE. The important part is that vendored dependencies are acknowledged somewhere.
This file is also missing geoarrow-c-geos.
c/sedona-libgpuspatial/libgpuspatial/include/gpuspatial/relate/im.cuh
Outdated
Show resolved
Hide resolved
|
|
||
| #include <errno.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've copied enough of it here that it needs to be acknowledged in LICENSE. I filed geoarrow/geoarrow-c-geos#6 so that hopefully we can avoid a vendor in the future.
|
We should revert this PR and do it again with the real authors. This PR has be co-authored by 13 people adding 18k lines for everyone where most never even contributed to this PR. This greatly increases the lines added and status of 13 people and masks their real efforts. @jiayuasu normally sends out an email introducing new committers and the top statistic he usually mentions is how many lines committers have added. Padding peoples rankings is really opposite to the Apache Way which promotes a high amount of public contributions over a long period of time. |
…n library (apache#310)" This reverts commit 0a2a499.
I have created a PR to revert this PR. #310 (comment) |
|
Reverting the PR will not fix the issue unless we reset the commit history. Because the commit is already there. And we currently disallowed such reset operation on the main. I am ok with either keeping or resetting the commit history. I am curious what other Sedona committers/ PMC members think. |
|
This one is my fault since I'm the one that hit the merge button with the co-authors listed in the PR description. Liang did a great job keeping that huge PR up-to-date while we all merged PRs around him...the resulting code change was a lot of work to review and resulted in phenomenal code getting merged into the Apache Sedona project. I'd like to keep the commit history as is. Apache Sedona is a tightly-knit community and we are all aware of each others' contributions. Co-authorship is also given on commits to those who fix typos via "suggest" on a review, and I'm not aware of any decisions that are made based on lines of code committed by a co-author any any Apache repo. We have a lot of work to do and I don't think changing the history is a reasonable use of any of our time. |
Since it has already been merged into the main branch, and they don't impact the actual code or performance, I am inclined to accept the commit as is. |
This PR adds a GPU-accelerated spatial join library called
libgpuspatialunderc/sedona-libgpuspatial/libgpuspatialto support SedonaDB's GPU-acceleration feature. The GPU execution is automatically enabled when available and provides significant performance improvements for large-scale spatial joins. A Rust package calledsedona-libgpuspatialfor building and using the library is included underc/sedona-libgpuspatial. Currently, there's no GitHub action runner that has a GPU, so we only build the library and tests without actually running them.