Update to include AWS download instructions

fedorov · fedorov · commit 7b2b4ce15f35 · 2023-05-03T15:22:32.000-04:00
diff --git a/notebooks/getting_started/part3_exploring_cohorts.ipynb b/notebooks/getting_started/part3_exploring_cohorts.ipynb
@@ -1,5 +1,15 @@
 {
   "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part3_exploring_cohorts.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -23,7 +33,7 @@
         "---\n",
         "Initial version: Nov 2022\n",
         "\n",
-        "Updated: "
+        "Updated: May 2023"
       ]
     },
     {
@@ -105,12 +115,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 424
-        },
-        "id": "sVBOII5GP9cw",
-        "outputId": "fc199d98-c22c-4d88-a89a-962698ddd4ee"
+        "id": "sVBOII5GP9cw"
       },
       "outputs": [],
       "source": [
@@ -160,7 +165,7 @@
         "\n",
         "While defining your cohort, you may be looking for subsets of collections, patients or studies that meet your search critieria. But when it comes to downloading the cohort, you will always need to get the URLs of the individual files from the cohort.\n",
         "\n",
-        "The following cell uses the query from the above to get the list of study identifiers that meet our search critieria, and then selects all of the rows correspoing to the DICOM instances (files)that are included in those studies. The column that contains the URI that can be used to download the file is in the `gcs_url` column. We will also query the `instance_size` column, which we can use to calculate the size of the files corresponding to the cohort.\n",
+        "The following cell uses the query above to get the list of study identifiers that meet our search criteria, and then selects all of the rows corresponding to the DICOM instances (files) that are included in those studies. The `gcs_url` column contains the URI that can be used to download the file from the Google Cloud Storage (GCS) bucket, while the `aws_url` column contains the location of the same file in the AWS bucket. We will also query the `instance_size` column, which we can use to calculate the size of the files corresponding to the cohort.\n",
         "\n",
         "As we learned in the part 2 of this tutorial series, each row of the IDC metadata table corresponds to a single DICOM file, and the attributes we mentioned above are assigned at the granularity of the individual files.\n",
         "\n",
@@ -171,12 +176,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 441
-        },
-        "id": "D4dkFpEVfDO1",
-        "outputId": "8a5c6f34-d92f-4d6c-dccf-792819882901"
+        "id": "D4dkFpEVfDO1"
       },
       "outputs": [],
       "source": [
@@ -189,6 +189,7 @@
         "selection_query = \"\"\"\n",
         "SELECT\n",
         "  gcs_url,\n",
+        "  aws_url,\n",
         "  instance_size\n",
         "FROM\n",
         "  bigquery-public-data.idc_current.dicom_all\n",
@@ -238,69 +239,36 @@
         "id": "ojgbk_nMhbv0"
       },
       "source": [
-        "Now that we have URLs of the individual files, we can use `gsutil` command line tool from Google Cloud SDK to download them. When using Colab, Google Cloud SDK is pre-installed, but if you want to download files to your computer directly you will need to install SDK first! Note that download instructions are documented [here](https://learn.canceridc.dev/data/downloading-data).\n",
+        "Now that we have the URLs of the individual files, we can use the open-source `s5cmd` command-line tool to download the files from either the GCS or the AWS locations. Note that download instructions are documented [here](https://learn.canceridc.dev/data/downloading-data).\n",
         "\n",
-        "To download the files, we will first save GCS URLs into the manifest file, and then pass that manifest to `gsutil`."
+        "[s5cmd](https://github.com/peak/s5cmd) is a very fast, open-source S3 and local filesystem execution tool, which experiments showed to be significantly faster than the Google-provided `gsutil`. Let's install it first."
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "8CVLUiMMijjE",
-        "outputId": "3afc6cb2-5fea-4785-e900-65eaddeffc53"
+        "id": "mr-F6YXrWHOm"
       },
       "outputs": [],
       "source": [
-        "selection_df[\"gcs_url\"].to_csv(\"manifest.txt\", header=False, index=False)\n",
+        "!wget https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz && tar zxf s5cmd_2.0.0_Linux-64bit.tar.gz\n",
         "\n",
-        "!rm -rf downloaded_cohort_files && mkdir downloaded_cohort_files\n",
-        "!cat manifest.txt | gsutil -m cp -I downloaded_cohort_files"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "9yJD5pUSjXo0"
-      },
-      "source": [
-        "Once the cell above is done, you can expand the left-side \"Files\" panel in the Colab interface and confirm that `downloaded_cohort_files` is not empty."
+        "!./s5cmd --help"
       ]
     },
     {
       "cell_type": "markdown",
-      "metadata": {
-        "id": "70hNEc71T_VZ"
-      },
       "source": [
-        "### Faster download using `s5cmd`\n",
-        "\n",
-        "[s5cmd](https://github.com/peak/s5cmd) is an open source very fast S3 and local filesystem execution tool, which, as experiments showed, is significantly faster than the Google-provided `gsutil`."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
+        "To use `s5cmd`, we need to create a manifest that contains the commands to download the individual files. Note that the `gs` bucket prefix of the GCS locations is replaced with `s3`. This is because `s5cmd` is a provider-agnostic tool that implements the S3 API, and GCS buckets support the S3 API."
+      ],
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "mr-F6YXrWHOm",
-        "outputId": "de9cdb9c-af7a-44ac-8005-3e4076914a0a"
-      },
-      "outputs": [],
-      "source": [
-        "!wget https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz && tar zxf s5cmd_2.0.0_Linux-64bit.tar.gz\n",
-        "\n",
-        "!./s5cmd --help"
-      ]
+        "id": "gdrjpXYM--IH"
+      }
     },
     {
       "cell_type": "code",
-      "execution_count": 6,
+      "execution_count": 3,
       "metadata": {
         "id": "gZhsxNbuWXy7"
       },
@@ -314,7 +282,8 @@
         "\n",
         "selection_query = \"\"\"\n",
         "SELECT\n",
-        "  CONCAT(\"cp \",REPLACE(gcs_url, \"gs://\", \"s3://\"), \" .\") as s5cmd_command,\n",
+        "  CONCAT(\"cp \",REPLACE(gcs_url, \"gs://\", \"s3://\"), \" .\") as s5cmd_gcs_command,\n",
+        "  CONCAT(\"cp \",aws_url, \" .\") as s5cmd_aws_command,\n",
         "  StudyInstanceUID,\n",
         "FROM\n",
         "  bigquery-public-data.idc_current.dicom_all\n",
@@ -354,24 +323,45 @@
         "\n"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The next two cells demonstrate how to generate a manifest and download the files corresponding to your selection from either the GCS or the AWS buckets."
+      ],
+      "metadata": {
+        "id": "M94EtV6jBnXg"
+      }
+    },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "DvFRdvyPWpAE",
-        "outputId": "8c7d2194-78cd-4ff3-93ff-33546cf51197"
+        "id": "DvFRdvyPWpAE"
       },
       "outputs": [],
       "source": [
-        "selection_df[\"s5cmd_command\"].to_csv(\"/content/s5cmd_manifest.txt\", header=False, index=False)\n",
+        "# Download files from GCS\n",
+        "selection_df[\"s5cmd_gcs_command\"].to_csv(\"/content/s5cmd_gcs_manifest.txt\", header=False, index=False)\n",
         "\n",
-        "!rm -rf downloaded_cohort_files && mkdir downloaded_cohort_files\n",
-        "!cd downloaded_cohort_files && /content/s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run /content/s5cmd_manifest.txt"
+        "!rm -rf gcs_downloaded_cohort_files && mkdir gcs_downloaded_cohort_files\n",
+        "!cd gcs_downloaded_cohort_files && /content/s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run /content/s5cmd_gcs_manifest.txt"
       ]
     },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Download files from AWS\n",
+        "selection_df[\"s5cmd_aws_command\"].to_csv(\"/content/s5cmd_aws_manifest.txt\", header=False, index=False)\n",
+        "\n",
+        "!rm -rf aws_downloaded_cohort_files && mkdir aws_downloaded_cohort_files\n",
+        "!cd aws_downloaded_cohort_files && /content/s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run /content/s5cmd_aws_manifest.txt"
+      ],
+      "metadata": {
+        "id": "e6BjWfTTC5kS"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -394,12 +384,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 206
-        },
-        "id": "bW5ULTmfpa7g",
-        "outputId": "90d45223-f381-4d3f-e2f5-6eec2b653951"
+        "id": "bW5ULTmfpa7g"
       },
       "outputs": [],
       "source": [
@@ -460,11 +445,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "rmEGwPyUqrYr",
-        "outputId": "548d7f8c-4edc-4504-a671-bfb59c5c9331"
+        "id": "rmEGwPyUqrYr"
       },
       "outputs": [],
       "source": [
@@ -479,7 +460,7 @@
       "source": [
         "You can observe that there are two rows for the collection `qin_lung_ct`, which highlights important point that `collection_id` should be treated as a label grouping together both the items released by the original contributors of what initially formed the collection, but also the analysis results of the data in the original collection that might be contributed later. \n",
         "\n",
-        "![collection](https://www.dropbox.com/s/ot7dbz5umligji1/collection.png?raw=1)\n",
+        "![collection](https://www.dropbox.com/s/s6t44nqb2s4ye9l/tcia_collection.jpg?raw=1)\n",
         "\n",
         "In the example above, [`qin_lung_ct` collection ](https://doi.org/10.7937/K9/TCIA.2015.NPGZYZBZ) was complemented by the [segmentations of the lung nodules](https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI), with both images and segmentations becoming part of the same collection, but having distinct DOIs and attribution requirements."
       ]
@@ -512,11 +493,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "JVIqa3GWUV64",
-        "outputId": "0df93427-00b0-43c1-d9b9-1f12676cb49a"
+        "id": "JVIqa3GWUV64"
       },
       "outputs": [],
       "source": [
@@ -527,7 +504,6 @@
       ]
     },
     {
-      "attachments": {},
       "cell_type": "markdown",
       "metadata": {
         "id": "-zRHvX1ZK-Dn"
@@ -539,9 +515,10 @@
       ]
     },
     {
-      "attachments": {},
       "cell_type": "markdown",
-      "metadata": {},
+      "metadata": {
+        "id": "pvcX4ZhS-OEu"
+      },
       "source": [
         "We can use itkwidgets to view the full 3D model or view slices as well as choose whether or not we'd like to display segmentation label maps with our image data.\n",
         "\n",
@@ -552,11 +529,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "Bx9snM-MK-v9",
-        "outputId": "988ea7e4-8a6c-47fa-ebaf-67d98565bb82"
+        "id": "Bx9snM-MK-v9"
       },
       "outputs": [],
       "source": [
@@ -567,7 +540,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 12,
+      "execution_count": null,
       "metadata": {
         "id": "o9hS4wamM3jO"
       },
@@ -580,11 +553,7 @@
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "QEPR3kszK_o-",
-        "outputId": "8791fea9-0869-4cf6-c49f-6c4f6c53eda7"
+        "id": "QEPR3kszK_o-"
       },
       "outputs": [],
       "source": [
@@ -604,7 +573,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
+      "execution_count": null,
       "metadata": {
         "id": "Db18CEtoK_Md"
       },
@@ -616,7 +585,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 15,
+      "execution_count": null,
       "metadata": {
         "id": "FksVuvkgLbpG"
       },
@@ -641,7 +610,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 16,
+      "execution_count": null,
       "metadata": {
         "id": "dtNEiziLLlWv"
       },
@@ -654,7 +623,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 17,
+      "execution_count": null,
       "metadata": {
         "id": "Rvf1sEizLwyR"
       },
@@ -682,7 +651,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 18,
+      "execution_count": null,
       "metadata": {
         "colab": {
           "base_uri": "https://localhost:8080/",
@@ -1087,7 +1056,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 19,
+      "execution_count": null,
       "metadata": {
         "id": "sUi81pCUL5_0"
       },
@@ -1110,7 +1079,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 20,
+      "execution_count": null,
       "metadata": {
         "id": "XSxs_YqeL6XW"
       },
@@ -1172,7 +1141,8 @@
   ],
   "metadata": {
     "colab": {
-      "provenance": []
+      "provenance": [],
+      "include_colab_link": true
     },
     "gpuClass": "standard",
     "kernelspec": {