| 1 | +--- |
| 2 | +title: "Omicslog" |
| 3 | +author: "Juan Henao" |
| 4 | +date: "2025-11-04" |
| 5 | +package: tidyomics |
| 6 | +tags: |
| 7 | + - tidyomics/tidyomicsBlog |
| 8 | + - logging |
| 9 | + - tidyverse |
| 10 | + - bioconductor |
| 11 | +description: "Providing logging capabilities for SummarizedExperiment objects." |
| 12 | +image: omicslog_blog_logo.png |
| 13 | +format: |
| 14 | + html: |
| 15 | + toc: true |
| 16 | + toc-float: true |
| 17 | +execute: |
| 18 | + freeze: true |
| 19 | +--- |
| 20 | + |
| 21 | +# Welcome to omicslog! |
| 22 | + |
| 23 | +I still remember being in front of my PI, trying to recall, or even worse, to guess the number of samples we had ignored when running a specific analysis, such as filtering low-count genes for DEG analysis or excluding biological samples collected outside a target time window when we aimed to discover biomarker candidates for early disease detection. I especially remember how that became even worse across the different projects we were working on simultaneously. |
| 24 | + |
| 25 | +The solution was always the same: rerun the entire script up to the line that could answer those questions. That was even more frustrating because, in many cases, the questions came from pure curiosity rather than from anything that would end up in the final publication. |
| 26 | + |
| 27 | +Inspired by the `lab notebook` from my wet lab colleagues and the `tidylog` package, we present `omicslog`, a package that provides logging capabilities for omics-oriented objects. Our goal is to establish a standard for tracking changes to these objects, acting as an automated dry lab notebook and improving the reproducibility of specific analyses. |
| 28 | + |
| 29 | +We started by enabling logging for the `SummarizedExperiment` class, powered by `tidyomics` functionalities. Each step in a pipeline is tracked, and the changes it makes are recorded and accumulated in the object's metadata. |
| 30 | + |
| 31 | +Let’s start with a practical example, beginning with the package installation and library loading: |
| 32 | + |
| 33 | +```r |
| 34 | +if (!require("devtools", quietly = TRUE)) |
| 35 | + install.packages("devtools") |
| 36 | + |
| 37 | +devtools::install_github("tidyomics/omicslog") |
| 38 | + |
| 39 | +library(SummarizedExperiment) |
| 40 | +library(tidySummarizedExperiment) |
| 41 | +library(omicslog) |
| 42 | +``` |
| 43 | + |
| 44 | +For this example, we work with the `airway` dataset. To extend `tidyomics` with `omicslog`, you only need to call `log_start()` before the first operation in the pipeline: |
| 45 | + |
| 46 | +```r |
| 47 | +data(airway, package = "airway") |
| 48 | + |
| 49 | +result <- |
| 50 | + airway |> |
| 51 | + log_start() |> # Starting the logging operations |
| 52 | + filter(dex == "trt") |> |
| 53 | + select(!albut) |> |
| 54 | + mutate(dex_upper = toupper(dex)) |> |
| 55 | + extract(col = dex, into = "treat") |> |
| 56 | + mutate(Run = tolower(Run)) |> |
| 57 | + filter(.feature == "ENSG00000000003") |> |
| 58 | + slice(3) |
| 59 | + |
| 60 | +result |
| 61 | +``` |
| 62 | + |
| 63 | +:::{.smaller} |
| 64 | +```r |
| 65 | +#> # A SummarizedExperiment-tibble abstraction: 1 × 1 |
| 66 | +#> # Features=1 | Samples=1 | Assays=counts |
| 67 | +#> .feature .sample counts SampleName cell treat Run avgLength Experiment |
| 68 | +#> <chr> <chr> <int> <fct> <fct> <chr> <chr> <int> <fct> |
| 69 | +#> 1 ENSG00000000… SRR103… 1047 GSM1275871 N080… trt srr1… 126 SRX384354 |
| 70 | +#> # ℹ 3 more variables: Sample <fct>, BioSample <fct>, dex_upper <chr> |
| 71 | +#> |
| 72 | +#> Operation log: |
| 73 | +#> [2025-12-17 13:21:30] filter: removed 4 samples (50%), 4 samples remaining |
| 74 | +#> [2025-12-17 13:21:31] select: removed 1 (11%), 8 column(s) remaining |
| 75 | +#> [2025-12-17 13:21:31] mutate: added 1 new column(s): dex_upper |
| 76 | +#> [2025-12-17 13:21:31] extract: extracted 'dex' into column: treat (original removed) |
| 77 | +#> [2025-12-17 13:21:31] mutate: modified column(s): Run |
| 78 | +#> [2025-12-17 13:21:31] filter: removed 64101 genes (100%), 1 genes remaining |
| 79 | +#> [2025-12-17 13:21:31] slice: Kept 1/4 rows (25.0%); removed 3 rows |
| 80 | +``` |
| 81 | +::: |
| 82 | + |
| 83 | +As a result, the object's `metadata` holds a short description of each modification the `SummarizedExperiment` underwent at every step of the pipeline, formatted as `[TIME] FUNCTION NAME: ONE-LINE DESCRIPTION`. |
| 84 | + |
| 85 | +`omicslog` also works with base R commands: pass the dataset to `log_start()` and then modify the returned object as usual: |
| 86 | + |
| 87 | +```r |
| 88 | +options(restore_SummarizedExperiment_show = TRUE) |
| 89 | + |
| 90 | +result_base <- log_start(airway) # Starting the logging operations |
| 91 | + |
| 92 | +result_base <- result_base[, colData(result_base)$dex == "trt"] |
| 93 | +colData(result_base)$dex_upper <- toupper(colData(result_base)$dex) |
| 94 | +colData(result_base)$Run <- tolower(colData(result_base)$Run) |
| 95 | +result_base <- result_base[rownames(result_base) == "ENSG00000000003", ] |
| 96 | + |
| 97 | +result_base |
| 98 | +``` |
| 99 | + |
| 100 | +:::{.smaller} |
| 101 | +```r |
| 102 | +#> class: SummarizedExperimentLogged |
| 103 | +#> dim: 1 4 |
| 104 | +#> metadata(1): '' |
| 105 | +#> assays(1): counts |
| 106 | +#> rownames(1): ENSG00000000003 |
| 107 | +#> rowData names(0): |
| 108 | +#> colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521 |
| 109 | +#> colData names(10): SampleName cell ... BioSample dex_upper |
| 110 | +#> |
| 111 | +#> Operation log: |
| 112 | +#> [2025-12-17 13:22:58] subset: removed 4 samples (50%), 4 samples remaining |
| 113 | +#> [2025-12-17 13:22:58] colData<-: added 1 new column(s): dex_upper |
| 114 | +#> [2025-12-17 13:22:58] colData<-: modified column 'Run' |
| 115 | +#> [2025-12-17 13:22:58] subset: removed 64101 genes (100%), 1 genes remaining |
| 116 | +``` |
| 117 | +::: |
| 118 | + |
| 119 | +# Behind the scenes |
| 120 | + |
| 121 | +How does `omicslog` operate? In essence, for every function you apply to a `SummarizedExperiment` object, it tracks changes in rows and columns and records a message describing those changes in a dedicated logging structure stored in the object’s `metadata`. |
| 122 | + |
| 123 | +Let us suppose we want to filter the `airway` dataset to retain only samples treated with dexamethasone (`dex == "trt"`): |
| 124 | + |
| 125 | +```r |
| 126 | +result1 <- airway |> filter(dex == "trt") |
| 127 | +``` |
| 128 | + |
| 129 | +How many samples did we keep? Let us find out: |
| 130 | + |
| 131 | +```r |
| 132 | +remaining_samples <- length(colData(result1)$Sample) |
| 133 | +remaining_samples |
| 134 | +``` |
| 135 | + |
| 136 | +:::{.smaller} |
| 137 | +```r |
| 138 | +#> [1] 4 |
| 139 | +``` |
| 140 | +::: |
| 141 | + |
| 142 | +What about the removed data? How many samples were discarded? |
| 143 | + |
| 144 | +```r |
| 145 | +samples_removed <- length(colData(airway)$Sample) - length(colData(result1)$Sample) |
| 146 | +samples_removed |
| 147 | +``` |
| 148 | + |
| 149 | +:::{.smaller} |
| 150 | +```r |
| 151 | +#> [1] 4 |
| 152 | +``` |
| 153 | +::: |
| 154 | + |
| 155 | +It is often useful to express this change as a percentage, since we may be discarding a substantial amount of information: |
| 156 | + |
| 157 | +```r |
| 158 | +percentage <- round(samples_removed / length(colData(airway)$Sample) * 100, 2) |
| 159 | +percentage |
| 160 | +``` |
| 161 | + |
| 162 | +:::{.smaller} |
| 163 | +```r |
| 164 | +#> [1] 50 |
| 165 | +``` |
| 166 | +::: |
| 167 | + |
| 168 | +At this point, we have a clear idea of how much the dataset has been modified. However, in practice, we often need to retrieve this kind of information repeatedly. To avoid manual bookkeeping, we would like to store it directly in the object itself, using the `metadata` slot. |
| 169 | + |
| 170 | +Before doing so, we need some additional context, such as *when* the operation was executed and *which* function was used: |
| 171 | + |
| 172 | +```r |
| 173 | +time <- Sys.time() |
| 174 | +func <- "filter" |
| 175 | +``` |
| 176 | + |
| 177 | +The most straightforward way to persist this information is to create a concise log message: |
| 178 | + |
| 179 | +```r |
| 180 | +result1@metadata$log_history <- paste(time, func,": removed", samples_removed, "samples", "(", percentage,"%)", remaining_samples, "samples remaining") |
| 181 | +result1@metadata$log_history |
| 182 | +``` |
| 183 | + |
| 184 | +:::{.smaller} |
| 185 | +```r |
| 186 | +#> [1] "2025-12-17 13:27:34 filter : removed 4 samples ( 50 %) 4 samples remaining" |
| 187 | +``` |
| 188 | +::: |
| 189 | + |
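The `paste()` call above inserts a space between every argument, which is why the message reads `filter : removed` and `( 50 %)`. As a minimal sketch, and not how `omicslog` itself builds its messages, a small helper based on `sprintf()` gives tighter control over the layout. The function name `format_log_entry()` and its arguments are hypothetical, introduced here only for illustration:

```r
# Hypothetical helper (not part of omicslog): build one log line from
# before/after counts, in the "[TIME] FUNCTION: DESCRIPTION" layout.
format_log_entry <- function(func, n_before, n_after,
                             unit = "samples", time = Sys.time()) {
  n_removed <- n_before - n_after
  # Percentage of the original items that were removed
  pct_removed <- round(n_removed / n_before * 100, 2)
  sprintf(
    "[%s] %s: removed %d %s (%s%%), %d %s remaining",
    format(time, "%Y-%m-%d %H:%M:%S"), func,
    as.integer(n_removed), unit, pct_removed,
    as.integer(n_after), unit
  )
}

# Same numbers as the filter step above: 8 samples before, 4 after
entry <- format_log_entry("filter", n_before = 8, n_after = 4)
entry
```

With these inputs the message reads `filter: removed 4 samples (50%), 4 samples remaining`, prefixed by the current timestamp in square brackets. The same helper covers column operations by passing `unit = "column(s)"`.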
| 190 | +Column-related operations follow the same logic. For example, let us remove the `albut` column, as we are not interested in samples treated with albuterol: |
| 191 | + |
| 192 | +```r |
| 193 | +result2 <- result1 |> |
| 194 | + select(!albut) |
| 195 | +``` |
| 196 | + |
| 197 | +Even though we know that exactly one column was removed, it is still valuable to keep track of *how* the dataset was modified, *when* the change occurred, and *which* function was responsible: |
| 198 | + |
| 199 | +```r |
| 200 | +columns_removed <- ncol(colData(result1)) - ncol(colData(result2)) |
| 201 | +columns_remaining <- ncol(colData(result2)) |
| 202 | +percentage <- 100 - round(ncol(colData(result2)) / ncol(colData(result1)) * 100,2) |
| 203 | +time <- Sys.time() |
| 204 | +func <- "select" |
| 205 | + |
| 206 | +result2@metadata$log_history <- c(result1@metadata$log_history, |
| 207 | +paste(time, func,": removed", columns_removed, "(", percentage,"%)", columns_remaining, "column(s) remaining") |
| 208 | +) |
| 209 | +result2@metadata$log_history |
| 210 | +``` |
| 211 | + |
| 212 | +:::{.smaller} |
| 213 | +```r |
| 214 | +#> [1] "2025-12-17 13:27:34 filter : removed 4 samples ( 50 %) 4 samples remaining" |
| 215 | +#> [2] "2025-12-17 13:28:38 select : removed 1 ( 11.11 %) 8 column(s) remaining" |
| 216 | +``` |
| 217 | +::: |
| 218 | + |
| 219 | +As shown above, we extract the same type of information and append a new log entry to the `metadata` slot, just as we did for the row-based operation. |
| 220 | + |
| 221 | +Too much work for a single data transformation? We agree. This is exactly where `omicslog` comes in: it handles all logging operations automatically, so you can focus on the analysis. |
| 222 | + |
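The manual steps above always follow the same pattern: measure the object, apply the transformation, measure again, and append a message. As a rough sketch of how that pattern can be automated (an illustration only, not `omicslog`'s actual implementation), the hypothetical `with_log()` wrapper below does this for any object that has `dim()`. It uses a plain matrix as a stand-in for a `SummarizedExperiment` and an attribute named `log_history` in place of the `metadata` slot; both choices are assumptions made to keep the example self-contained:

```r
# Hypothetical sketch (not omicslog's implementation): run a transformation
# and record how the number of rows and columns changed.
with_log <- function(x, func_name, f, ...) {
  before <- dim(x)
  out <- f(x, ...)
  after <- dim(out)
  entry <- sprintf(
    "[%s] %s: rows %d -> %d, columns %d -> %d",
    format(Sys.time(), "%Y-%m-%d %H:%M:%S"), func_name,
    before[1], after[1], before[2], after[2]
  )
  # Carry the previous log forward and append the new entry
  attr(out, "log_history") <- c(attr(x, "log_history"), entry)
  out
}

# Toy stand-in for an assay: 5 genes x 8 samples
counts <- matrix(1:40, nrow = 5, ncol = 8)

toy_result <- counts |>
  with_log("keep_samples", function(m) m[, 1:4, drop = FALSE]) |>
  with_log("keep_gene", function(m) m[1, , drop = FALSE])

attr(toy_result, "log_history")
```

Each call appends one entry, so the log ends up with two lines, one per step. A real implementation also has to decide what to report for changes to sample and feature annotations, which this sketch ignores.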
| 223 | +# We need your feedback! |
| 224 | + |
| 225 | +Besides the messages shown above, what other operation details might you be interested in logging for an omics-oriented project? |
| 226 | + |
| 227 | +Don’t hesitate to open an issue in the [omicslog](https://github.com/tidyomics/omicslog "logging capabilities for SummarizedExperiment objects") GitHub repo. |