|
29 | 29 | "\n", |
30 | 30 | "In this notebook you will be introduced to how IDC organizes the metadata accompanying images available in IDC, and how that metadata can be used to define subsets of data.\n", |
31 | 31 | "\n", |
| 32 | + "This documentation page can be used as a complement if you would like to learn more about how IDC metadata is organized: https://learn.canceridc.dev/data/organization-of-data/files-and-metadata.\n", |
| 33 | + "\n", |
32 | 34 | "---\n", |
33 | 35 | "Initial version: Nov 2022\n", |
34 | 36 | "\n", |
35 | | - "Updated: \n" |
| 37 | + "Updated: Oct 2023\n" |
36 | 38 | ] |
37 | 39 | }, |
38 | 40 | { |
|
89 | 91 | "source": [ |
90 | 92 | "## Why do I need to search?\n", |
91 | 93 | "\n", |
92 | | - "Think of IDC as a library. Image files are books, and we have ~45 TB of those. When you go to a library, you want to check out just the books that you want to read. In order to find a book in a large library you need a catalog. \n", |
| 94 | + "Think of IDC as a library. Image files are books, and we have ~45 TB of those. When you go to a library, you want to check out just the books that you want to read. In order to find a book in a large library you need a catalog.\n", |
93 | 95 | "\n", |
94 | 96 | "Just as in the library, IDC maintains a catalog that indexes a variety of metadata fields describing the files we curate. That metadata catalog is accessible in a large database table that you should be using to search and subset the images. Each row in that table corresponds to a file, and includes the location of the file alongside the metadata attributes describing that file.\n" |
95 | 97 | ] |
|
102 | 104 | "source": [ |
103 | 105 | "## How do I search?\n", |
104 | 106 | "\n", |
105 | | - "When you search, or _query_ the IDC catalog, you specify the criteria that the metadata describing the selected files should satisfy. \n", |
| 107 | + "When you search, or _query_ the IDC catalog, you specify the criteria that the metadata describing the selected files should satisfy.\n", |
106 | 108 | "\n", |
107 | | - "Queries can be as simple as \n", |
| 109 | + "Queries can be as simple as\n", |
108 | 110 | "\n", |
109 | | - "* \"_everything in collection X_\", \n", |
| 111 | + "* \"_everything in collection X_\",\n", |
110 | 112 | "\n", |
111 | | - "or as complex as \n", |
| 113 | + "or as complex as\n", |
112 | 114 | "\n", |
113 | 115 | "* \"_files corresponding to CT images of female patients that are accompanied by annotations of lung tumors that are larger than 1500 mm^3 in volume_\".\n", |
114 | 116 | "\n", |
115 | 117 | "Although it would be very nice to just state what you need in free form, in practice queries need to be written in a formal way.\n", |
116 | 118 | "\n", |
117 | | - "IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of IDC data release v12, we index ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we index hundreds of such attributes). \n", |
| 119 | + "IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of IDC data release v12, we index ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we index hundreds of such attributes).\n", |
118 | 120 | "\n", |
119 | | - "IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), and does not require learning any IDC-specific API. \n", |
| 121 | + "IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), and does not require learning any IDC-specific API.\n", |
120 | 122 | "\n", |
121 | 123 | "In the following steps of the tutorial we will use just a few of the attributes (SQL table columns) to get started. You will be able to use the same principles and SQL queries to extend your search criteria to include any of the other attributes indexed by IDC." |
122 | 124 | ] |
|
134 | 136 | "As the very first query, let's get the list of all the image collections available in IDC. Here is that query:\n", |
135 | 137 | "\n", |
136 | 138 | "```sql\n", |
137 | | - "SELECT \n", |
138 | | - " DISTINCT(collection_id) \n", |
139 | | - "FROM \n", |
| 139 | + "SELECT\n", |
| 140 | + " DISTINCT(collection_id)\n", |
| 141 | + "FROM\n", |
140 | 142 | " bigquery-public-data.idc_current.dicom_all\n", |
141 | 143 | "```\n", |
142 | 144 | "\n", |
|
168 | 170 | "2. `idc_current` is a _dataset_ within the `bigquery-public-data` project. Think of BigQuery datasets as containers that are used to organize and control access to the tables within the project.\n", |
169 | 171 | "3. `dicom_all` is one of the tables within the `idc_current` dataset. As you spend more time learning about IDC, you will hopefully leverage other tables available in that dataset.\n", |
170 | 172 | "\n", |
171 | | - "If you now look back at the [BigQuery console](https://console.cloud.google.com/bigquery) and expand the list of datasets under the `bigquery-public-data` project, you will see that in addition to the `idc_current` dataset there are also datasets `idc_v14`, `idc_v13`, etc., all the way to `idc_v1`. Those datasets correspond to the IDC data release versions, with `idc_current` being an alias for the latest version of IDC data (at the time of writing, v14 is the latest release). \n", |
| 173 | + "If you now look back at the [BigQuery console](https://console.cloud.google.com/bigquery) and expand the list of datasets under the `bigquery-public-data` project, you will see that in addition to the `idc_current` dataset there are also datasets `idc_v14`, `idc_v13`, etc., all the way to `idc_v1`. Those datasets correspond to the IDC data release versions, with `idc_current` being an alias for the latest version of IDC data (at the time of writing, v14 is the latest release).\n", |
172 | 174 | "\n", |
173 | | - "We will not spend time discussing how IDC versioning works, but it is important to know that \n", |
| 175 | + "We will not spend time discussing how IDC versioning works, but it is important to know that\n", |
174 | 176 | "\n", |
175 | 177 | "1. IDC data is versioned;\n", |
176 | 178 | "2. queries against the `idc_current` dataset are equivalent to the queries against the latest version (currently, `idc_v14`) of IDC data;\n", |
|
207 | 209 | "bq_client = bigquery.Client(my_ProjectID)\n", |
208 | 210 | "\n", |
209 | 211 | "selection_query = \"\"\"\n", |
210 | | - "SELECT \n", |
211 | | - " DISTINCT(collection_id) \n", |
212 | | - "FROM \n", |
| 212 | + "SELECT\n", |
| 213 | + " DISTINCT(collection_id)\n", |
| 214 | + "FROM\n", |
213 | 215 | " bigquery-public-data.idc_current.dicom_all\n", |
214 | 216 | "\"\"\"\n", |
215 | 217 | "\n", |
|
362 | 364 | "\n", |
363 | 365 | "# Execution of this cell will fail unless you wrote the query below!\n", |
364 | 366 | "selection_query = \"\"\"\n", |
365 | | - "SELECT \n", |
| 367 | + "SELECT\n", |
366 | 368 | " DISTINCT(collection_id)\n", |
367 | 369 | "FROM\n", |
368 | 370 | " bigquery-public-data.idc_current.dicom_all\n", |
|
389 | 391 | "source": [ |
390 | 392 | "## DICOM data model: Patients, studies, series and instances\n", |
391 | 393 | "\n", |
392 | | - "Up to now we searched the data at the granularity of the collections. In practice, we often want to know how many patients meet our search criteria, or what are the specific images that we need to download. \n", |
| 394 | + "Up to now we searched the data at the granularity of the collections. In practice, we often want to know how many patients meet our search criteria, or what are the specific images that we need to download.\n", |
393 | 395 | "\n", |
394 | | - "IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by `PatientID`) undergo imaging exams (or _studies_, in DICOM nomenclature). \n", |
| 396 | + "IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by `PatientID`) undergo imaging exams (or _studies_, in DICOM nomenclature).\n", |
395 | 397 | "\n", |
396 | | - "Each patient will have one or more studies, with each study identified uniquely by the attribute `StudyInstanceUID`. During each of the imaging studies one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. Imaging series are uniquely identified by `SeriesInstanceUID`. \n", |
| 398 | + "Each patient will have one or more studies, with each study identified uniquely by the attribute `StudyInstanceUID`. During each of the imaging studies one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. Imaging series are uniquely identified by `SeriesInstanceUID`.\n", |
397 | 399 | "\n", |
398 | 400 | "Finally, each imaging series contains one or more _instances_, where each instance corresponds to a file. Most often, one instance corresponds to a single slice from a cross-sectional image. Individual instances are identified by unique `SOPInstanceUID` values.\n", |
399 | 401 | "\n", |
|
419 | 421 | "bq_client = bigquery.Client(my_ProjectID)\n", |
420 | 422 | "\n", |
421 | 423 | "selection_query = \"\"\"\n", |
422 | | - "SELECT \n", |
| 424 | + "SELECT\n", |
423 | 425 | " COUNT(DISTINCT(PatientID)) as patient_cnt\n", |
424 | 426 | "FROM\n", |
425 | 427 | " bigquery-public-data.idc_current.dicom_all\n", |
|
493 | 495 | "\n", |
494 | 496 | "In many cases, image analysis is done at the granularity of the individual DICOM series. In some cases a DICOM series corresponds to a single instance (e.g., for X-ray modalities), but in most cases imaging modalities are cross-sectional, containing multiple slices, with each slice stored in a separate instance (file), which can be reconstructed into a 3D volume.\n", |
495 | 497 | "\n", |
496 | | - "From the examples and queries above, you should have developed some understanding of the modalities and a few other collection-level characteristics of the data included in IDC. As an example, we know that IDC data contains MR images of Liver. \n", |
| 498 | + "From the examples and queries above, you should have developed some understanding of the modalities and a few other collection-level characteristics of the data included in IDC. As an example, we know that IDC data contains MR images of Liver.\n", |
497 | 499 | "\n", |
498 | 500 | "In the following query we select the UID of a sample MR series from the images covering Liver cancer." |
499 | 501 | ] |
|
521 | 523 | "WHERE\n", |
522 | 524 | " Modality = \"MR\" AND tcia_tumorLocation = \"Liver\"\n", |
523 | 525 | "\n", |
524 | | - "# note the use of this new operator that makes the query \n", |
525 | | - "# return just the first one of the matching rows \n", |
| 526 | + "# note the use of this new operator that makes the query\n", |
| 527 | + "# return just the first one of the matching rows\n", |
526 | 528 | "LIMIT\n", |
527 | 529 | " 1\n", |
528 | 530 | "\"\"\"\n", |
|
539 | 541 | "id": "ab2lCsz5dyxA" |
540 | 542 | }, |
541 | 543 | "source": [ |
542 | | - "The result of this query is the _unique identifier_ for a DICOM series that meets the selection criteria. " |
| 544 | + "The result of this query is the _unique identifier_ for a DICOM series that meets the selection criteria." |
543 | 545 | ] |
544 | 546 | }, |
545 | 547 | { |
|
607 | 609 | "id": "5cbTYu-4hxH_" |
608 | 610 | }, |
609 | 611 | "source": [ |
610 | | - "This query introduces a couple of more advanced concepts: \n", |
| 612 | + "This query introduces a couple of more advanced concepts:\n", |
611 | 613 | "* we use the `WITH` operator to define an intermediate query that writes the result into the `temp_result` table, which is then queried\n", |
612 | 614 | "* we capture all of the distinct values of `Modality` into an _array_ `modalities`, since we want to check for presence of both MR and SEG modalities in the study.\n", |
613 | 615 | "\n", |
|