|
25 | 25 | ] |
26 | 26 | }, |
27 | 27 | { |
| 28 | + "attachments": {}, |
28 | 29 | "cell_type": "markdown", |
29 | 30 | "metadata": {}, |
30 | 31 | "source": [ |
|
43 | 44 | "\n", |
44 | 45 | "This notebook provides Apache Spark operations for customers by bridging the existing local PySpark workflows with cloud based capabilities. Data scientists can use their familiar local environments with JupyterLab and work with remote data and remote clusters simply by selecting a kernel. The operations that will be demonstrated are: how to use the interactive Spark environment and produce a Spark script; how to prepare and create an application; how to prepare and create a run; how to list existing dataflow applications; and how to retrieve and display the logs.\n", |
45 | 46 | "\n", |
46 | | - "The purpose of the `dataflow` module is to provide an efficient and convenient way for users to launch a Spark application and run Spark jobs. The interactive Spark kernel provides a simple and efficient way to edit and build your Spark script, and easy access to read from OCI Object Storage.\n", |
| 47 | + "The interactive Spark kernel provides a simple and efficient way to edit and build your Spark script, and easy access to read from OCI Object Storage.\n", |
47 | 48 | "\n", |
48 | | - "Compatible conda pack: [PySpark 2.4 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 3.0)\n", |
| 49 | + "Compatible conda pack: [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 2.0)\n", |
49 | 50 | "\n", |
50 | 51 | "---\n", |
51 | 52 | "\n", |
|
71 | 72 | "metadata": {}, |
72 | 73 | "outputs": [], |
73 | 74 | "source": [ |
74 | | - "import io\n", |
75 | 75 | "import matplotlib.pyplot as plt\n", |
76 | 76 | "import os\n", |
77 | 77 | "import pandas as pd\n", |
78 | | - "import tempfile\n", |
79 | | - "import uuid\n", |
80 | 78 | "\n", |
81 | | - "from ads.dataflow.dataflow import DataFlow\n", |
82 | | - "from os import path\n", |
83 | 79 | "from pyspark.sql import SparkSession" |
84 | 80 | ] |
85 | 81 | }, |
86 | 82 | { |
| 83 | + "attachments": {}, |
87 | 84 | "cell_type": "markdown", |
88 | 85 | "metadata": {}, |
89 | 86 | "source": [ |
|
111 | 108 | ] |
112 | 109 | }, |
113 | 110 | { |
| 111 | + "attachments": {}, |
114 | 112 | "cell_type": "markdown", |
115 | 113 | "metadata": {}, |
116 | 114 | "source": [ |
117 | | - "Load the Employee Attrition data file from Oracle Cloud Infrastructure Object Storage into an Apache Spark DataFrame" |
| 115 | + "Load the Employee Attrition data file from Oracle Cloud Infrastructure Object Storage into an Apache Spark DataFrame. You can configure your `core-site.xml` for accessing to Object Storage by `odsc core-site config` command. Running `odsc core-site config -h` in terminal to see the usage." |
118 | 116 | ] |
119 | 117 | }, |
120 | 118 | { |
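A minimal sketch of this load step, assuming `core-site.xml` has already been generated with `odsc core-site config`: the dataset URI is the public one used elsewhere in this notebook, while the session name and read options are illustrative assumptions rather than the notebook's exact cell.

```python
from pyspark.sql import SparkSession

# Reuse (or create) the Spark session provided by the PySpark kernel.
spark = SparkSession.builder.appName("emp-attrition-eda").getOrCreate()  # app name is illustrative

# Read the Employee Attrition CSV directly from OCI Object Storage.
# The oci:// filesystem works once core-site.xml is configured via `odsc core-site config`.
emp_attrition = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("multiLine", "true")
    .option("inferSchema", "true")
    .load("oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv")
)

emp_attrition.printSchema()
```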
|
137 | 135 | ] |
138 | 136 | }, |
139 | 137 | { |
| 138 | + "attachments": {}, |
140 | 139 | "cell_type": "markdown", |
141 | 140 | "metadata": {}, |
142 | 141 | "source": [ |
|
153 | 152 | ] |
154 | 153 | }, |
155 | 154 | { |
| 155 | + "attachments": {}, |
156 | 156 | "cell_type": "markdown", |
157 | 157 | "metadata": {}, |
158 | 158 | "source": [ |
|
194 | 194 | ] |
195 | 195 | }, |
196 | 196 | { |
| 197 | + "attachments": {}, |
197 | 198 | "cell_type": "markdown", |
198 | 199 | "metadata": {}, |
199 | 200 | "source": [ |
|
210 | 211 | ] |
211 | 212 | }, |
212 | 213 | { |
| 214 | + "attachments": {}, |
213 | 215 | "cell_type": "markdown", |
214 | 216 | "metadata": {}, |
215 | 217 | "source": [ |
|
239 | 241 | ] |
240 | 242 | }, |
241 | 243 | { |
| 244 | + "attachments": {}, |
242 | 245 | "cell_type": "markdown", |
243 | 246 | "metadata": {}, |
244 | 247 | "source": [ |
|
275 | 278 | ] |
276 | 279 | }, |
277 | 280 | { |
| 281 | + "attachments": {}, |
278 | 282 | "cell_type": "markdown", |
279 | 283 | "metadata": {}, |
280 | 284 | "source": [ |
281 | 285 | "Note: other compression formats Data Flow supports today include snappy parquet (example above) and gzip on both csv and parquet." |
282 | 286 | ] |
283 | 287 | }, |
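As a hedged illustration of the compression note above, the sketch below writes the loaded DataFrame back to Object Storage as Snappy-compressed Parquet and gzip-compressed CSV; the output bucket and namespace are placeholders, and `emp_attrition` stands for whichever DataFrame was loaded earlier in the notebook.

```python
# Placeholder output location -- substitute your own bucket and namespace.
out = "oci://<bucket>@<namespace>/emp_attrition"

# Snappy-compressed Parquet (the example referenced above).
emp_attrition.write.mode("overwrite").option("compression", "snappy").parquet(out + "/parquet_snappy")

# gzip is supported for both CSV and Parquet.
emp_attrition.write.mode("overwrite").option("compression", "gzip").csv(out + "/csv_gzip", header=True)
```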
284 | 288 | { |
| 289 | + "attachments": {}, |
285 | 290 | "cell_type": "markdown", |
286 | 291 | "metadata": {}, |
287 | 292 | "source": [ |
288 | | - "We have come to a query that we want to run in Data Flow from previous explorations. Please refer to the dataflow.ipynb on how to submit a job to dataflow" |
289 | | - ] |
290 | | - }, |
291 | | - { |
292 | | - "cell_type": "code", |
293 | | - "execution_count": null, |
294 | | - "metadata": {}, |
295 | | - "outputs": [], |
296 | | - "source": [ |
297 | | - "dataflow_base_folder = tempfile.mkdtemp()\n", |
298 | | - "data_flow = DataFlow(dataflow_base_folder=dataflow_base_folder)\n", |
299 | | - "print(\"Data flow directory: {}\".format(dataflow_base_folder))" |
300 | | - ] |
301 | | - }, |
302 | | - { |
303 | | - "cell_type": "code", |
304 | | - "execution_count": null, |
305 | | - "metadata": {}, |
306 | | - "outputs": [], |
307 | | - "source": [ |
308 | | - "pyspark_file_path = path.join(\n", |
309 | | - " dataflow_base_folder, \"example-{}.py\".format(str(uuid.uuid4())[-6:])\n", |
310 | | - ")\n", |
311 | | - "script = '''\n", |
312 | | - "from pyspark.sql import SparkSession\n", |
313 | | - "\n", |
314 | | - "def main():\n", |
315 | | - " \n", |
316 | | - " # Create a Spark session\n", |
317 | | - " spark = SparkSession \\\\\n", |
318 | | - " .builder \\\\\n", |
319 | | - " .appName(\"Python Spark SQL basic example\") \\\\\n", |
320 | | - " .getOrCreate()\n", |
321 | | - " \n", |
322 | | - " # Load a csv file from dataflow public storage\n", |
323 | | - " df = spark \\\\\n", |
324 | | - " .read \\\\\n", |
325 | | - " .format(\"csv\") \\\\\n", |
326 | | - " .option(\"header\", \"true\") \\\\\n", |
327 | | - " .option(\"multiLine\", \"true\") \\\\\n", |
328 | | - " .load(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv\")\n", |
329 | | - " \n", |
330 | | - " # Create a temp view and do some SQL operations\n", |
331 | | - " df.createOrReplaceTempView(\"emp_attrition\")\n", |
332 | | - " query_result_df = spark.sql(\"\"\"\n", |
333 | | - " SELECT \n", |
334 | | - " Age,\n", |
335 | | - " MonthlyIncome,\n", |
336 | | - " YearsInIndustry\n", |
337 | | - " FROM emp_attrition \n", |
338 | | - " \"\"\")\n", |
339 | | - " \n", |
340 | | - " # Convert the filtered Apache Spark DataFrame into JSON format\n", |
341 | | - " # Note: we are writing to the Spark stdout log so that we can retrieve the log later at the end of the notebook.\n", |
342 | | - " print('\\\\n'.join(query_result_df.toJSON().collect()))\n", |
343 | | - " \n", |
344 | | - "if __name__ == '__main__':\n", |
345 | | - " main()\n", |
346 | | - "'''\n", |
347 | | - "\n", |
348 | | - "with open(pyspark_file_path, \"w\") as f:\n", |
349 | | - " print(script.strip(), file=f)\n", |
350 | | - "\n", |
351 | | - "print(\"Script path: {}\".format(pyspark_file_path))" |
352 | | - ] |
353 | | - }, |
354 | | - { |
355 | | - "cell_type": "code", |
356 | | - "execution_count": null, |
357 | | - "metadata": {}, |
358 | | - "outputs": [], |
359 | | - "source": [ |
360 | | - "script_bucket = \"test\" # Update the value\n", |
361 | | - "logs_bucket = \"dataflow-log\" # Update the value\n", |
362 | | - "display_name = \"sample_Data_Flow_app\"\n", |
363 | | - "\n", |
364 | | - "app_config = data_flow.prepare_app(\n", |
365 | | - " display_name=display_name,\n", |
366 | | - " script_bucket=script_bucket,\n", |
367 | | - " pyspark_file_path=pyspark_file_path,\n", |
368 | | - " logs_bucket=logs_bucket,\n", |
369 | | - ")\n", |
370 | | - "\n", |
371 | | - "app = data_flow.create_app(app_config)\n", |
372 | | - "\n", |
373 | | - "run_display_name = \"sample_Data_Flow_run\"\n", |
374 | | - "run_config = app.prepare_run(run_display_name=run_display_name)\n", |
375 | | - "\n", |
376 | | - "run = app.run(run_config, save_log_to_local=True)" |
377 | | - ] |
378 | | - }, |
379 | | - { |
380 | | - "cell_type": "code", |
381 | | - "execution_count": null, |
382 | | - "metadata": {}, |
383 | | - "outputs": [], |
384 | | - "source": [ |
385 | | - "run.status" |
386 | | - ] |
387 | | - }, |
388 | | - { |
389 | | - "cell_type": "code", |
390 | | - "execution_count": null, |
391 | | - "metadata": {}, |
392 | | - "outputs": [], |
393 | | - "source": [ |
394 | | - "run.config" |
395 | | - ] |
396 | | - }, |
397 | | - { |
398 | | - "cell_type": "code", |
399 | | - "execution_count": null, |
400 | | - "metadata": {}, |
401 | | - "outputs": [], |
402 | | - "source": [ |
403 | | - "run.oci_link" |
| 293 | + "<a id='df_app'></a>\n", |
| 294 | + "## Create a Data Flow application\n", |
| 295 | + "`oracle-ads` provides different ways to submit your code to Data Flow for workloads that require larger resources. To learn more, read the [user guide](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/apachespark/dataflow.html#)." |
404 | 296 | ] |
405 | 297 | }, |
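The linked user guide documents the `ads.jobs` API for this step; the sketch below shows one possible way to create and run a Data Flow application with it. The compartment OCID, bucket URIs, shapes, and script path are placeholders, and only the `sample_Data_Flow_app` display name is reused from the removed cells; this is not the notebook's own code.

```python
from ads.jobs import DataFlow, DataFlowRuntime, Job

# Infrastructure: every OCID, shape, and bucket URI below is a placeholder.
infrastructure = (
    DataFlow()
    .with_compartment_id("ocid1.compartment.oc1..<unique_id>")
    .with_logs_bucket_uri("oci://<logs-bucket>@<namespace>/")
    .with_driver_shape("VM.Standard2.1")
    .with_executor_shape("VM.Standard2.1")
    .with_num_executors(2)
)

# Runtime: points at a PySpark script already stored in Object Storage.
runtime = DataFlowRuntime().with_script_uri("oci://<script-bucket>@<namespace>/example.py")

# Create the application, launch a run, and stream the driver log.
job = Job(name="sample_Data_Flow_app", infrastructure=infrastructure, runtime=runtime).create()
run = job.run()
run.watch()
```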
406 | 298 | { |
| 299 | + "attachments": {}, |
407 | 300 | "cell_type": "markdown", |
408 | 301 | "metadata": {}, |
409 | 302 | "source": [ |
|