simplify the query to get all patients that have clinical data

fedorov · fedorov · commit 609b4d159e79 · 2023-05-22T17:41:46.000-04:00
diff --git a/notebooks/clinical_data_intro.ipynb b/notebooks/clinical_data_intro.ipynb
@@ -30,7 +30,7 @@
         "\n",
         "Prepared: July 2022\n",
         "\n",
-        "Updated: Dec 2022"
+        "Updated: May 2023"
       ]
     },
     {
@@ -76,7 +76,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 2,
       "metadata": {
         "id": "o8WdiIiBQwav"
       },
@@ -712,7 +712,9 @@
         "\n",
         "Sometime you may want to know whether specific patient has any clinical data available. One way to do this is to locate the collection that patient belongs to, and then check whether any of the clinical data tables (if any) that are available for that collection have that patient identifier.\n",
         "\n",
-        "Alternatively, we can build a complete list of patients that have clinical data by performing a union on all of the `dicom_patient_id` columns across all of the clinical data tables, which is what we do in the next cell."
+        "Alternatively, we can build a complete list of patients that have clinical data by performing a union on all of the `dicom_patient_id` columns across all of the clinical data tables, which is what we do in the next cell.\n",
+        "\n",
+        "This query relies on BigQuery's ability to [query multiple tables using a wildcard table](https://cloud.google.com/bigquery/docs/querying-wildcard-tables). Note that it refers to a specific version of the data (`idc_v14_clinical`), because the `idc_current_clinical` dataset contains views, and wildcard tables cannot query views."
       ]
     },
     {
@@ -725,30 +727,20 @@
       "source": [
         "import re\n",
         "\n",
-        "all_clinical_tables = column_metadata_df[\"table_name\"].unique()\n",
-        "query = \"with patients_unionized as (SELECT dicom_patient_id FROM \"+re.sub(\"idc_v[0-9]*_clinical\", \"idc_current_clinical\", all_clinical_tables[0])\n",
-        "for clinical_table in all_clinical_tables[1:]:\n",
-        "  query = query+\" UNION ALL SELECT dicom_patient_id FROM \"+re.sub(\"idc_v[0-9]*_clinical\", \"idc_current_clinical\", clinical_table)\n",
-        "\n",
-        "selection_query = query+\") select distinct(dicom_patient_id) from patients_unionized\"\n",
-        "\n",
-        "#print(selection_query)\n",
+        "selection_query = \"\"\"\n",
+        "SELECT\n",
+        "  DISTINCT(dicom_patient_id)\n",
+        "FROM\n",
+        "  `bigquery-public-data.idc_v14_clinical.*`\n",
+        "WHERE\n",
+        "  _TABLE_SUFFIX NOT IN (\"table_metadata\",\n",
+        "    \"column_metadata\" )\n",
+        "\"\"\"\n",
         "\n",
         "selection_result = bq_client.query(selection_query)\n",
-        "patients_df = selection_result.result().to_dataframe()\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "Ca63J0HWiXjH"
-      },
-      "outputs": [],
-      "source": [
-        "patients = patients_df[\"dicom_patient_id\"].unique().tolist()\n",
+        "patients_df = selection_result.result().to_dataframe()\n",
         "\n",
-        "print(\"\\n\".join(patients))"
+        "patients_df\n"
       ]
     },
     {