Commit 68cf66e

feat(rust/sedona-spatial-join): Add a bounding box sampler for building spatial partitioners or other purposes (#442)
This implements part of #436.

The spatial partitioner is the core component of the spatial-partitioned spatial join. To build the spatial partitioning grid, we need to collect samples of geospatial objects. The goal is to collect enough samples to create a high-quality spatial partitioning even for small datasets, without collecting so many samples from large datasets that we run out of memory. The sampling should be uniform, so that the collected samples faithfully represent the distribution of the entire dataset. The sampler should go through the sampled stream in a single pass, since evaluating the stream multiple times may trigger repeated computation in upstream physical operators.

The sampling algorithm we adopted is a combination of reservoir sampling and Bernoulli sampling: it collects at least $N_\text{min}$ and at most $N_\text{max}$ samples per partition, and makes sure that the sampling rate won’t go below $R$ before hitting $N_\text{max}$. The algorithm maintains a set of sampled envelopes $S$, and goes through 4 stages as the number of rows seen, $k$, grows:

- **Stage 1 - Filling the small reservoir**: when $k < N_\text{min}$, simply add the envelope of the geometry to $S$.
- **Stage 2 - Small reservoir sampling**: when $N_\text{min} \leq k < \dfrac{N_\text{min}}{R}$, use the [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) method to maintain a fixed number of samples ($N_\text{min}$) in $S$.
- **Stage 3 - Bernoulli sampling**: when $k \geq \dfrac{N_\text{min}}{R} \land ||S|| < N_\text{max}$, use Bernoulli sampling with rate $R$ to decide whether to accept the next sample. $S$ starts to grow in this stage.
- **Stage 4 - Large reservoir sampling**: when $||S|| = N_\text{max}$, use the reservoir sampling method to maintain a fixed number of samples ($N_\text{max}$) in $S$.

This algorithm guarantees that:

1. **Collect enough samples even for small partitions**: if the number of rows in a partition is not less than $N_\text{min}$, at least $N_\text{min}$ samples will be collected. If the number of rows in a partition is less than $N_\text{min}$, all rows will be collected as samples.
2. **Won’t collect too many samples for large partitions**: $||S||$ will never exceed $N_\text{max}$, no matter how large the partition is.
3. **Uniform sampling**: the samples are uniformly sampled even though the algorithm is composed of 4 distinct stages. This is trivial to prove.
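
The four stages described above can be sketched in a few dozen lines of Rust. This is a minimal illustration under stated assumptions, not the code added in `bbox_sampler.rs`: the names `BboxSampler`, `Envelope`, `push`, and `SplitMix64` are hypothetical, $k$ is taken to be the number of rows seen before the current one, and a small inline SplitMix64 generator stands in for the `fastrand` dependency so that the block compiles without external crates.

```rust
/// Hypothetical stand-in for an RNG crate: a small SplitMix64 generator.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Integer in [0, n). The modulo bias is negligible for a sketch.
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }

    /// Float in [0, 1).
    fn unit_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical bounding box type; the real envelope type may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Envelope {
    min_x: f64,
    min_y: f64,
    max_x: f64,
    max_y: f64,
}

/// Illustrative single-pass sampler following the four stages.
struct BboxSampler {
    n_min: usize,           // N_min
    n_max: usize,           // N_max
    rate: f64,              // R
    seen: u64,              // rows consumed so far
    samples: Vec<Envelope>, // S
    rng: SplitMix64,
}

impl BboxSampler {
    fn new(n_min: usize, n_max: usize, rate: f64, seed: u64) -> Self {
        assert!(n_min > 0 && n_min <= n_max);
        assert!(rate > 0.0 && rate <= 1.0);
        Self { n_min, n_max, rate, seen: 0, samples: Vec::new(), rng: SplitMix64(seed) }
    }

    fn push(&mut self, env: Envelope) {
        let k = self.seen; // rows seen before this one
        self.seen += 1;
        if k < self.n_min as u64 {
            // Stage 1: fill the small reservoir.
            self.samples.push(env);
        } else if self.samples.len() == self.n_max {
            // Stage 4: reservoir sampling with capacity N_max.
            let j = self.rng.below(self.seen);
            if j < self.n_max as u64 {
                self.samples[j as usize] = env;
            }
        } else if (k as f64) < self.n_min as f64 / self.rate {
            // Stage 2: reservoir sampling with capacity N_min.
            let j = self.rng.below(self.seen);
            if j < self.n_min as u64 {
                self.samples[j as usize] = env;
            }
        } else if self.rng.unit_f64() < self.rate {
            // Stage 3: Bernoulli sampling with rate R; S grows.
            self.samples.push(env);
        }
    }
}
```

The Stage 4 check comes before the Stage 2 check so that the degenerate case $N_\text{min} = N_\text{max}$ goes straight from filling the reservoir to the large-reservoir stage. When $N_\text{min}/R$ is not an integer, the handover from Stage 2 to Stage 3 in this sketch is only approximately at rate $R$.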
1 parent ea4ed97 commit 68cf66e

4 files changed: +559 -0 lines changed
Cargo.lock

Lines changed: 1 addition & 0 deletions

rust/sedona-spatial-join/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ wkb = { workspace = true }
 geo-index = { workspace = true }
 geos = { workspace = true }
 float_next_after = { workspace = true }
+fastrand = { workspace = true }
 
 [dev-dependencies]
 criterion = { workspace = true }

rust/sedona-spatial-join/src/utils.rs

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
+pub(crate) mod bbox_sampler;
 pub(crate) mod concurrent_reservation;
 pub(crate) mod init_once_array;
 pub(crate) mod join_utils;
