Generalize presto-nvl72 slurm scripts for any cluster by misiugodfrey · Pull Request #352 · rapidsai/velox-testing

misiugodfrey · 2026-05-22T03:52:27Z

Summary

Make presto/slurm/presto-nvl72/ cluster-agnostic: all per-cluster values (partition, account, cpus-per-task, time limits, images, image paths, data root, etc.) now come from ~/.cluster_config.env. See cluster_config.env.example.
Introduce two shared libraries: launcher_common.sh (cluster-variant resolution, sbatch arg assembly, preflight checks, job monitoring) and slurm_common.sh (shared .slurm preamble). Eliminates ~80% of the per-launcher duplication.
Pre-flight checks in every launcher: missing image / data dir / analyzed metastore produce actionable errors pointing at the exact command to run next, before any sbatch submission.
Job state surfaced via sacct: launcher prints Job FAILED (state: …, exit: …) and always displays stderr, so silent failures don't masquerade as success.
launch-analyze-tables.sh is now always CPU (ANALYZE TABLE disables cudf regardless). New CLUSTER_DEFAULT_VARIANT setting lets CPU-only clusters drop the --cpu flag.
gen-data uses the same run_py_script.sh + miniforge3 pattern as analyze/benchmark instead of inventing its own pip flow.
Several CPU-variant fixes in functions.sh:run_worker (gate NVIDIA_VISIBLE_DEVICES export and GDS bind-mounts on VARIANT_TYPE=gpu so CPU nodes don't hit enroot hook failures).
README rewritten as a three-step workflow walkthrough; all 6 entry points now documented.
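
As a rough illustration of the config-driven approach described above, a launcher might load the per-cluster file and resolve the variant along these lines. This is a minimal sketch, not the code in launcher_common.sh: the function names and the keys CLUSTER_PARTITION, CLUSTER_ACCOUNT, and CLUSTER_DATA_ROOT are hypothetical, and the fallback to gpu when CLUSTER_DEFAULT_VARIANT is unset is an assumption. Only ~/.cluster_config.env, cluster_config.env.example, CLUSTER_DEFAULT_VARIANT, and the --cpu flag come from the description.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the three required keys are hypothetical.

# Source the per-cluster config and verify that the required keys are set.
load_cluster_config() {
    local config_file="${1:-$HOME/.cluster_config.env}"
    if [[ ! -f "$config_file" ]]; then
        echo "ERROR: $config_file not found. Copy cluster_config.env.example and edit it." >&2
        return 1
    fi
    # shellcheck disable=SC1090
    source "$config_file"

    local key missing=0
    for key in CLUSTER_PARTITION CLUSTER_ACCOUNT CLUSTER_DATA_ROOT; do
        # ${!key:-} reads the variable whose name is stored in $key.
        if [[ -z "${!key:-}" ]]; then
            echo "ERROR: $key is not set in $config_file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# An explicit --cpu flag wins; otherwise use CLUSTER_DEFAULT_VARIANT.
# Falling back to gpu when neither is given is an assumption.
resolve_variant() {
    local flag="${1:-}"
    if [[ "$flag" == "--cpu" ]]; then
        echo "cpu"
    else
        echo "${CLUSTER_DEFAULT_VARIANT:-gpu}"
    fi
}
```

With CLUSTER_DEFAULT_VARIANT=cpu in the config file, resolve_variant returns cpu even when no flag is passed, which is the behavior that lets CPU-only clusters drop --cpu.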

Test plan

./launch-gen-data.sh -s 1 on CPU cluster → 240 MB of parquet under ${DATA}/tpch-rs-1
./launch-analyze-tables.sh -s 1 on CPU cluster → tpchsf1/ populated in .hive_metastore/
./launch-run.sh -n 1 -s 1 --cpu on CPU cluster → all 22 TPC-H queries pass
./launch-run.sh -n 1 -s 1 on GPU cluster (job currently pending in batch queue)
End-to-end on NVL72 GPU/CPU for TPC-H sf1k: download image → generate data → analyze tables → run benchmark
Preflight failure modes: missing image / missing data / missing metastore each print correct actionable hint
Failure reporting: jobs that exit non-zero are surfaced as FAILED with stderr displayed
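
The last two checks above could be exercised against helpers shaped roughly like the following. This is a sketch under assumptions, not the code in launcher_common.sh: the function names, argument order, and exact message wording are hypothetical. The sacct options used (-j, -X, --noheader, --parsable2, --format) are standard Slurm options, and the hint commands are the launchers named in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and signatures are hypothetical.

# Fail before submission, naming the command that produces the missing input.
preflight_check() {
    local data_dir="$1" metastore_dir="$2" scale="$3"
    if [[ ! -d "$data_dir" ]]; then
        echo "ERROR: data directory $data_dir not found. Run: ./launch-gen-data.sh -s $scale" >&2
        return 1
    fi
    if [[ ! -d "$metastore_dir" ]]; then
        echo "ERROR: analyzed metastore $metastore_dir not found. Run: ./launch-analyze-tables.sh -s $scale" >&2
        return 1
    fi
}

# Turn one "State|ExitCode" line into a success or failure message.
summarize_job_state() {
    local line="$1" state exit_code
    IFS='|' read -r state exit_code <<< "$line"
    if [[ "$state" == "COMPLETED" && "$exit_code" == "0:0" ]]; then
        echo "Job COMPLETED"
        return 0
    fi
    echo "Job FAILED (state: ${state}, exit: ${exit_code})"
    return 1
}

# Query sacct for the job allocation, report its state, then show stderr.
report_job() {
    local job_id="$1" stderr_file="$2" line rc=0
    line="$(sacct -j "$job_id" -X --noheader --parsable2 --format=State,ExitCode | head -n 1)"
    summarize_job_state "$line" || rc=$?
    if [[ -s "$stderr_file" ]]; then
        echo "--- stderr ---"
        cat "$stderr_file"
    fi
    return "$rc"
}
```

Keeping the parsing in summarize_job_state separate from the sacct call means the failure message can be checked without a Slurm installation.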

copy-pr-bot · 2026-05-22T03:52:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kingcrimsontianyu · 2026-05-22T19:36:16Z

-# Use POSIX I/O instead of GDS
-./launch-run.sh -n 8 -s 3000 \
-    -w presto-native-worker-gpu-v1 -c presto-coordinator-v1 \
-    --disable-gds
-
-# Use nsys to profile query 5 and 6 for worker 2
-./launch-run.sh -n 8 -s 3000 \
-    -w presto-native-worker-gpu-v1 -c presto-coordinator-v1 \
-    -p --nsys-worker-id 2 -q 5,6


I would suggest keeping these two simple examples in Step 3 — Run benchmarks, minus the -w -c arguments.

For the "test plan", I'd suggest checking if --disable-gds, -p, --nsys-worker-id, -q <query_list> continue to work on the GPU cluster.

misiugodfrey added 2 commits May 21, 2026 16:45

Refactor to use cluster-spefic config files

2df51c7

testing on nvl72

0b5b0c1

misiugodfrey requested a review from a team as a code owner May 22, 2026 03:52

misiugodfrey requested a review from kingcrimsontianyu May 22, 2026 03:52

kingcrimsontianyu reviewed May 22, 2026

misiugodfrey wants to merge 2 commits into main from misiug/GeneralizeClusterScripts
