DNAmArray_workflow/01_GettingStarted.Rmd at master · molepi/DNAmArray_workflow · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
```{r, child="_setup.Rmd"}
```

***

# Installation #

The [**DNAmArray**](https://github.com/molepi/DNAmArray) package can be installed in several ways, and has been tested for >= R-4.4.3 on various Linux-builds and Windows.

***

### Install using **devtools** ###

First, install [**devtools**](https://github.com/hadley/devtools), and then use the `install_github()` function to fetch the [**DNAmArray**](https://github.com/molepi/DNAmArray) package.

```{r 101devtools, eval=FALSE}
library(devtools)
install_github("molepi/DNAmArray")
```

***

### Install from source using `git/R` ###

Using [git](https://git-scm.com/), you can `git clone` our repository and then install the package, changing `_x.y.z.` to the relevant version.

```{r 102git, eval=FALSE}
git clone git@github.com/molepi/DNAmArray.git
R CMD build DNAmArray
R CMD INSTALL DNAmArray_x.y.z.tar.gz
```

***

# Loading packages #

First, load the packages that are required for this workflow:

* [**DNAmArray**](https://github.com/molepi/DNAmArray) - the main package, containing in-house build functions for the preprocessing of DNAm data
* [**MethylAid**](https://www.bioconductor.org/packages/release/bioc/html/MethylAid.html)<sup>10</sup> - for sample-level quality control
* [**omicsPrint**](https://bioconductor.org/packages/release/bioc/html/omicsPrint.html)<sup>12</sup> - in-house tool use to detect sample linkage errors and resolve them
* [**bacon**](http://bioconductor.org/packages/bacon)<sup>11</sup> - in-house tool for reducing bias and inflation in EWAS test statistics
* [**GEOquery**](https://bioconductor.org/packages/release/bioc/html/GEOquery.html)<sup>14</sup> - bridges the gap between [BioConductor](https://www.bioconductor.org) tools and [GEO](https://www.ncbi.nlm.nih.gov/geo/)
* [**tidyverse**](https://www.tidyverse.org/) - for data wrangling
* [**reshape2**](https://cran.r-project.org/web/packages/reshape2/index.html) - for data wrangling
* [**ggrepel**](https://ggrepel.slowkow.com/) - for data visualization
* [**ComplexHeatmap**](https://www.bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) - for creating heatmaps
* [**circlize**](https://cran.r-project.org/web/packages/circlize/index.html) - designed to allow circular plots, but also used to create custom colour palettes
* [**BiocParallel**](https://www.bioconductor.org/packages/release/bioc/html/BiocParallel.html) - to parallelize processing of genomic data
* [**IlluminaHumanMethylationEPICmanifest**](https://www.bioconductor.org/packages/release/data/annotation/html/IlluminaHumanMethylationEPICmanifest.html) - containing EPIC probe annotation for the example data (other packages are available for 450k or EPICv2)
* [**minfi**](https://bioconductor.org/packages/release/bioc/html/minfi.html)<sup>9</sup> - functions to preprocess DNAm data
* [**wateRmelon**](https://www.bioconductor.org/packages/devel/bioc/html/wateRmelon.html)<sup>30</sup> - functions to preprocess DNAm data
* [**snow**](https://cran.r-project.org/web/packages/snow/index.html) - required by MethyLImp2 for parallelization
* [**limma**](https://www.bioconductor.org/packages/release/bioc/html/limma.html)<sup>27</sup> - for EWAS analysis
* [**EpiDISH**](https://www.bioconductor.org/packages/release/bioc/html/EpiDISH.html)<sup>21</sup> - for cell count predictions
* [**sva**](https://bioconductor.org/packages/release/bioc/html/sva.html)<sup>23-25</sup> - for estimating latent factors
* [**methyLImp2**](https://bioconductor.org/packages/release/bioc/html/methyLImp2.html)<sup>20</sup>
* [**SummarizedExperiment**](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)

```{r 103packages, message=FALSE}
library(DNAmArray)
library(MethylAid)
library(omicsPrint)
library(bacon)

library(GEOquery)
library(tidyverse)
library(ggrepel)
library(reshape2)
library(ComplexHeatmap)
library(circlize)
library(BiocParallel)
library(IlluminaHumanMethylationEPICmanifest)
library(minfi)
library(wateRmelon)
library(snow)

library(limma)
library(EpiDISH)
library(sva)
library(methyLImp2)
library(SummarizedExperiment)
```


***

# Importing IDATs #

The first step in analysing microarray data is importing raw intensity files into your software program. In this example, we show how to import raw IDAT files from GEO into R, but similar strategies can be employed for all Illumina DNAm array files when using R.

Using the `getGEOSuppFiles()` function, supplementary data is downloaded to the current working directory. These consist of the raw IDAT files alongside relevant documentation.

```{r 104getgeofiles, eval=FALSE}
getGEOSuppFiles("GSE116339")
untar(tarfile="../GSE116339/GSE116339_RAW.tar", exdir="../GSE116339/IDATs")
```

The data can then be efficiently unpacked using the `gunzip()` function.

```{r 105idats, eval=FALSE}
setwd("../GSE116339/IDATs")
sapply(list.files(), gunzip)
```

`getGEO()` can then be used to import SOFT format microarray data into R as a large GSE-class list, and extract the metadata of interest.

```{r 106getgeo}
GSE116339 <- getGEO(filename='./data/GSE116339_series_matrix.txt.gz', getGPL = FALSE)
targets <- phenoData(GSE116339)@data
rm(GSE116339)
```

***

# Preparing `targets` #

Before progressing further, take some time to get familiar with the `targets` data frame, removing duplicate information and converting variables to relevant classes.

``` {r 107structure, warning=F}
targets <- targets %>% dplyr::select(sample_ID = `sample_id:ch1`,
                              geo_accession,
                              sex = `gender:ch1`,
                              age = `age:ch1`,
                              log_total_pbb = `ln(totalpbb):ch1`,
                              pbb_153 = `pbb-153:ch1`,
                              pbb_77 = `pbb-77:ch1`,
                              pbb_101 = `pbb-101:ch1`,
                              pbb_180 = `pbb-180:ch1`,
                              supplementary_file) %>%
  separate(supplementary_file, sep="_", remove=F, into=c(NA, "plate", "row", NA)) %>%
  mutate(age = as.numeric(age),
         log_total_pbb = as.numeric(log_total_pbb),
         pbb_153 = as.numeric(pbb_153),
         pbb_77 = as.numeric(pbb_77),
         pbb_101 = as.numeric(pbb_101),
         pbb_180 = as.numeric(pbb_180),
         row = as.numeric(substr(row, 3, 3)))

str(targets)
```

This data frame consists of 679 observations and phenotypic information for 51 variables is stored after cleaning. This includes:

* `sample_ID` - the ID of the sample
* `geo_accession` - the GEO accession number of the sample
* `sex` - male or female
* `age` - age of the individual
* `log_total_pbb`, `pbb_153`, `pbb_77`, `pb_101`, `pbb_180` - exposure levels
* `supplementary_file` - location of IDAT file
* `plate` - EPIC array number
* `row` - row number on the array (continuous)

***