Commit da88968

Merge branch 'main' into chore/upgrade-rust-2024-group-5
2 parents 25c88f9 + e8384fb commit da88968

54 files changed: +2142 -1501 lines changed

benchmarks/README.md

Lines changed: 58 additions & 13 deletions
@@ -119,7 +119,6 @@ You can also invoke the helper directly if you need to customise arguments furth
 ./benchmarks/compile_profile.py --profiles dev release --data /path/to/tpch_sf1
 ```

-
 ## Benchmark with modified configurations

 ### Select join algorithm
@@ -147,6 +146,19 @@ To verify that datafusion picked up your configuration, run the benchmarks with

 ## Comparing performance of main and a branch

+For TPC-H:
+```shell
+./benchmarks/compare_tpch.sh main mybranch
+```
+
+For TPC-DS:
+To get data in `DATA_DIR` for TPCDS, please follow the instructions in `./benchmarks/bench.sh data tpcds`
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/compare_tpcds.sh main mybranch
+```
+
+Alternatively, you can compare manually by following the example below:
+
 ```shell
 git checkout main

@@ -299,7 +311,6 @@ This will produce output like:
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```

-
 # Benchmark Runner

 The `dfbench` program contains subcommands to run the various
@@ -339,24 +350,28 @@ FLAGS:
 ```

 # Profiling Memory Stats for each benchmark query
+
 The `mem_profile` program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.

 Subcommands supported by mem_profile are the subset of those in `dfbench`.
-Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch
+Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch, TPCDS

 Before running benchmarks, `mem_profile` automatically compiles the benchmark binary (`dfbench`) using `cargo build`. Note that the build profile used for `dfbench` is not tied to the profile used for running `mem_profile` itself. We can explicitly specify the desired build profile using the `--bench-profile` option (e.g. release-nonlto). By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.

 Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.

-Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.
+Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.
+
+The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.

-The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.
+Example:

-Example:
 ```shell
 datafusion$ cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
 ```
+
 Example Output:
+
 ```
 Query Time (ms) Peak RSS Peak Commit Major Page Faults
 ----------------------------------------------------------------
@@ -385,19 +400,21 @@ Query Time (ms) Peak RSS Peak Commit Major Page Faults
 ```

 ## Reported Metrics
+
 When running benchmarks, `mem_profile` collects several memory-related statistics using the mimalloc API:

-- Peak RSS (Resident Set Size):
-The maximum amount of physical memory used by the process.
-This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
+- Peak RSS (Resident Set Size):
+The maximum amount of physical memory used by the process.
+This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.

 - Peak Commit:
-The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
-This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
+The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
+This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.

 - Major Page Faults:
-The number of major page faults triggered during execution.
-This metric is obtained from the operating system and is not mimalloc-specific.
+The number of major page faults triggered during execution.
+This metric is obtained from the operating system and is not mimalloc-specific.
+
 # Writing a new benchmark

 ## Creating or downloading data outside of the benchmark
@@ -586,6 +603,34 @@ This benchmarks is derived from the [TPC-H][1] version
 [2]: https://github.com/databricks/tpch-dbgen.git,
 [2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

+## TPCDS
+
+Run the tpcds benchmark.
+
+For data, clone the `datafusion-benchmarks` repo, which contains predefined Parquet data at SF1.
+
+```shell
+git clone https://github.com/apache/datafusion-benchmarks
+```
+
+Then run the benchmark with the following command:
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds
+```
+
+Alternatively, benchmark a specific query:
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds 30
+```
+
+For more help:
+
+```shell
+cargo run --release --bin dfbench -- tpcds --help
+```
+
 ## External Aggregation

 Run the benchmark for aggregations with limited memory.

benchmarks/bench.sh

Lines changed: 59 additions & 0 deletions
@@ -87,6 +87,9 @@ tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB),
 tpch_csv10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single csv file per table, hash join
 tpch_mem10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory

+# TPC-DS Benchmarks
+tpcds: TPCDS inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
+
 # Extended TPC-H Benchmarks
 sort_tpch: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=1)
 sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
@@ -220,6 +223,9 @@ main() {
 tpch_csv10)
     data_tpch "10" "csv"
     ;;
+tpcds)
+    data_tpcds
+    ;;
 clickbench_1)
     data_clickbench_1
     ;;
@@ -388,6 +394,7 @@ main() {
     run_external_aggr
     run_nlj
     run_hj
+    run_tpcds
     ;;
 tpch)
     run_tpch "1" "parquet"
@@ -407,6 +414,9 @@ main() {
 tpch_mem10)
     run_tpch_mem "10"
     ;;
+tpcds)
+    run_tpcds
+    ;;
 cancellation)
     run_cancellation
     ;;
@@ -601,6 +611,24 @@ data_tpch() {
     exit 1
 }

+# Points to TPCDS data generation instructions
+data_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo ""
+        echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
+        echo " git clone https://github.com/apache/datafusion-benchmarks"
+        echo ""
+        return 1
+    fi
+
+    echo ""
+    echo "TPC-DS data already exists in ${TPCDS_DIR}"
+    echo ""
+}
+
 # Runs the tpch benchmark
 run_tpch() {
     SCALE_FACTOR=$1
@@ -634,6 +662,37 @@ run_tpch_mem() {
     debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
 }

+# Runs the tpcds benchmark
+run_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo " ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    # Check if directory contains parquet files
+    if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then
+        echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo " ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    RESULTS_FILE="${RESULTS_DIR}/tpcds_sf1.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running tpcds benchmark..."
+
+    debug_run $CARGO_COMMAND --bin dfbench -- tpcds --iterations 5 --path "${TPCDS_DIR}" --query_path "../datafusion/core/tests/tpc-ds" --prefer_hash_join "${PREFER_HASH_JOIN}" -o "${RESULTS_FILE}" ${QUERY_ARG}
+}
+
 # Runs the compile profile benchmark helper
 run_compile_profile() {
     local profiles=("$@")
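
Review note on the parquet check in `run_tpcds`: it pipes `find ... -print -quit` into `grep -q .`, so it succeeds as soon as one matching file is printed and does not walk the rest of the tree. The sketch below isolates that idiom so it can be tried on a throwaway directory. The helper name `has_parquet_files` and the demo paths are hypothetical and not part of this commit; `-quit` is an extension found in GNU and BSD `find`, not POSIX.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in bench.sh): succeeds when the given directory
# contains at least one *.parquet file anywhere beneath it.
has_parquet_files() {
    local dir="$1"
    # -print -quit makes find stop after the first match;
    # grep -q . succeeds only if find printed at least one character.
    find "${dir}" -name "*.parquet" -print -quit | grep -q .
}

# Demonstration on a temporary directory (assumed layout, for illustration only).
demo_dir=$(mktemp -d)

if has_parquet_files "${demo_dir}"; then empty_result="found"; else empty_result="missing"; fi

mkdir -p "${demo_dir}/store_sales"
touch "${demo_dir}/store_sales/part-0.parquet"

if has_parquet_files "${demo_dir}"; then filled_result="found"; else filled_result="missing"; fi

echo "empty: ${empty_result}, populated: ${filled_result}"
rm -rf "${demo_dir}"
```

Because the search is recursive, a nested layout such as one directory per table passes the check just as a flat layout does.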

benchmarks/compare_tpcds.sh

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-DS benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    echo ""
+    echo "Note: set DATA_DIR to the TPC-DS data directory (see ./benchmarks/bench.sh data tpcds)"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-DS benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpcds
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpcds
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"
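
Review note: with `set -e`, a failing benchmark run ends the script before the final `git checkout "$CURRENT_BRANCH"`, which leaves the working tree on whichever branch was being benchmarked. One possible way to guard against that is an `EXIT` trap inside a subshell, sketched below. `run_on_branches`, `fake_bench`, and the temporary repository are hypothetical names for illustration and are not part of this commit; the sketch assumes `git` is installed and that the starting point is a named branch (on a detached HEAD, `git rev-parse --abbrev-ref HEAD` prints `HEAD` and nothing is restored).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the commit): run a command on each branch and
# always return to the starting branch, even if one of the runs fails.
# Usage: run_on_branches <command-name> <branch>...
run_on_branches() {
    local bench_cmd="$1"; shift
    local start_branch
    start_branch=$(git rev-parse --abbrev-ref HEAD)
    (
        # The EXIT trap fires when this subshell ends, on success or failure.
        trap 'git checkout --quiet "${start_branch}"' EXIT
        for branch in "$@"; do
            git checkout --quiet "${branch}" || exit 1
            "${bench_cmd}" "${branch}" || exit 1
        done
    )
}

# Demonstration in a throwaway repository (assumed setup, for illustration only).
demo_repo=$(mktemp -d)
cd "${demo_repo}" || exit 1
git init --quiet
git symbolic-ref HEAD refs/heads/base
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "init"
git branch feature
git checkout --quiet -b work

# Stand-in for ./benchmarks/bench.sh: fails on the "feature" branch.
fake_bench() { echo "benchmarking $1"; [ "$1" != "feature" ]; }

if run_on_branches fake_bench base feature; then run_status="ok"; else run_status="failed"; fi
final_branch=$(git rev-parse --abbrev-ref HEAD)
echo "status: ${run_status}, branch: ${final_branch}"

cd / && rm -rf "${demo_repo}"
```

The explicit `|| exit 1` is used instead of relying on `set -e`, since `set -e` is ignored inside a function called from an `if` condition.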

benchmarks/compare_tpch.sh

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-H benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-H benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpch
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpch
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"

benchmarks/src/bin/dfbench.rs

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,7 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

 use datafusion_benchmarks::{
-    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpch,
+    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpcds, tpch,
 };

 #[derive(Debug, StructOpt)]
@@ -48,6 +48,7 @@ enum Options {
     Nlj(nlj::RunOpt),
     SortTpch(sort_tpch::RunOpt),
     Tpch(tpch::RunOpt),
+    Tpcds(tpcds::RunOpt),
 }

 // Main benchmark runner entrypoint
@@ -64,5 +65,6 @@ pub async fn main() -> Result<()> {
         Options::Nlj(opt) => opt.run().await,
         Options::SortTpch(opt) => opt.run().await,
         Options::Tpch(opt) => Box::pin(opt.run()).await,
+        Options::Tpcds(opt) => Box::pin(opt.run()).await,
     }
 }

benchmarks/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -23,5 +23,6 @@ pub mod hj;
 pub mod imdb;
 pub mod nlj;
 pub mod sort_tpch;
+pub mod tpcds;
 pub mod tpch;
 pub mod util;
