Add execution engine run scripts: Databricks, Spark Standalone, Dataproc#1

Open
Copilot wants to merge 4 commits into main from copilot/add-execution-engines-scripts

Copilot AI commented Apr 20, 2026

Adds shell scripts for submitting and managing Spark workloads across three execution engines, starting from an empty repo.

Structure

databricks/
  run_job.sh          # Trigger existing job by ID or name; poll until done
  submit_notebook.sh  # One-time notebook run on existing or ephemeral cluster
spark-standalone/
  start_cluster.sh    # Start master + N local workers (configurable cores/mem)
  stop_cluster.sh     # Gracefully stop workers then master
  submit_job.sh       # spark-submit wrapper (JAR + Python, --conf, deploy mode)
dataproc/
  create_cluster.sh   # Create cluster (machine types, auto-delete, labels, properties)
  delete_cluster.sh   # Delete cluster with optional confirmation
  submit_job.sh       # Submit spark/pyspark/hadoop/hive/pig/presto jobs

Key design points

  • Databricks: uses Jobs API 2.1 (run-now / runs/submit); run_job.sh resolves a job name to its ID through a URL-encoded list query
  • Dataproc: all Spark/job properties are collected into a single --properties key=val,key=val argument, because gcloud silently keeps only the last --properties flag when several are passed
  • All scripts: set -euo pipefail, named flags with defaults, --help, dependency checks (require_cmd), clear error messages on missing required inputs
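The single --properties argument described above can be sketched as follows. This is a minimal illustration, not the code from dataproc/submit_job.sh: the helper name join_properties and the sample property values are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from the PR): join key=val pairs into one
# comma-separated string, so that only one --properties flag is ever emitted.
join_properties() {
  local IFS=','        # "$*" joins arguments with the first character of IFS
  printf '%s' "$*"
}

# Example pairs; the real script would collect these from its own flags.
props=("spark.executor.memory=4g" "spark.executor.cores=2")
properties_arg="$(join_properties "${props[@]}")"

# The resulting argument would then be passed once to gcloud.
echo "--properties=${properties_arg}"
```

Property values that themselves contain commas would need extra handling, which this sketch does not attempt.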

Usage

# Databricks: trigger job by name and wait
DATABRICKS_HOST=https://adb-xxx.azuredatabricks.net DATABRICKS_TOKEN=dapi... \
./databricks/run_job.sh --job-name nightly_etl --params '{"date":"2024-01-01"}' --wait

# Spark Standalone: start cluster, submit, stop
SPARK_HOME=/opt/spark ./spark-standalone/start_cluster.sh --workers 2 --worker-cores 4 --worker-memory 8g
./spark-standalone/submit_job.sh --app /path/to/etl.py
./spark-standalone/stop_cluster.sh

# Dataproc: create, submit, teardown
GCP_PROJECT=my-project GCP_REGION=us-central1 \
./dataproc/create_cluster.sh --cluster-name etl --num-workers 4 --max-idle 30m
./dataproc/submit_job.sh --cluster-name etl --job-type pyspark --app gs://bucket/etl.py -- --date 2024-01-01
./dataproc/delete_cluster.sh --cluster-name etl --yes
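
The --job-name lookup in the Databricks example relies on a URL-encoded list query. A rough sketch of how such a query URL might be built in plain bash is shown here; the urlencode helper, the placeholder host, and the sample job name are assumptions for illustration, and the actual run_job.sh may do this differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from the PR): percent-encode an ASCII string
# for use as a URL query value. Non-ASCII input is out of scope for this sketch.
urlencode() {
  local LC_ALL=C       # make the character ranges below match ASCII only
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                   # unreserved: keep as-is
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;     # everything else: %XX
    esac
  done
  printf '%s' "$out"
}

# Placeholder host, used only when DATABRICKS_HOST is not set.
host="${DATABRICKS_HOST:-https://example.cloud.databricks.com}"
job_name="nightly etl (v2)"   # sample name containing characters that need encoding

# Jobs API 2.1 list endpoint with a name filter; a real script would send this
# with an "Authorization: Bearer $DATABRICKS_TOKEN" header and read the job ID
# from the JSON response.
list_url="${host%/}/api/2.1/jobs/list?name=$(urlencode "$job_name")"
echo "$list_url"
```

Encoding the name before it goes into the query string keeps spaces, ampersands, and similar characters in a job name from being misread as URL syntax.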
Original prompt

I want to add different execution engines run scripts like databricks, Spark-Standalone, Dataproc

@arshewarTarun arshewarTarun marked this pull request as ready for review April 20, 2026 07:03
Copilot AI review requested due to automatic review settings April 20, 2026 07:03
Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.



Copilot AI changed the title from "[WIP] Add different execution engines for running scripts" to "Add execution engine run scripts: Databricks, Spark Standalone, Dataproc" Apr 20, 2026
Copilot AI requested a review from arshewarTarun April 20, 2026 07:07