-
Notifications
You must be signed in to change notification settings - Fork 0
Feature/refactor_DIMS_MakeInit #94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
e1e144e
f8fbc49
c4d4cf1
99860c5
6dd934e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
This file was deleted.
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| # define parameters | ||
| args <- commandArgs(trailingOnly = TRUE) | ||
|
|
||
| sample_sheet <- as.data.frame(read.csv(args[1], sep = "\t")) | ||
| preprocessing_scripts_dir <- args[2] | ||
|
|
||
| # load in function script | ||
| source(paste0(preprocessing_scripts_dir, "parse_samplesheet_functions.R")) | ||
|
|
||
| # generate the replication pattern | ||
| repl_pattern <- generate_repl_pattern(sample_sheet) | ||
|
|
||
| # write the replication pattern to text file for troubleshooting purposes | ||
| sink("replication_pattern.txt") | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would add something to the output name to more clearly indicate that this is a logging output, and not actual 'data'. e.g., |
||
| print(repl_pattern) | ||
| sink() | ||
|
|
||
| # save replication pattern to file | ||
| save(repl_pattern, file = "init.RData") | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,18 @@ | ||||||
| process ParseSamplesheet { | ||||||
| tag "DIMS ParseSamplesheet" | ||||||
| label 'ParseSamplesheet' | ||||||
| container = 'docker://umcugenbioinf/dims:1.3' | ||||||
| shell = ['/bin/bash', '-euo', 'pipefail'] | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Remove those here, shell directives are/should be declared in the main nextflow.config |
||||||
|
|
||||||
| input: | ||||||
| path(samplesheet) | ||||||
|
|
||||||
| output: | ||||||
| path('init.RData') | ||||||
| path('replication_pattern.txt') | ||||||
|
Comment on lines
+11
to
+12
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. outputs are missing emit statements |
||||||
|
|
||||||
| script: | ||||||
| """ | ||||||
| Rscript ${baseDir}/CustomModules/DIMS/ParseSamplesheet.R $samplesheet $params.preprocessing_scripts_dir | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using $params inside a module/process is not nextflow best practice. The best option is to add it as an input. So input becomes: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the Rscript referred from the baseDir? For stability sake, it it would make more sense to use a resource folder and put it in there. For example in the PRS repo: https://github.com/UMCUGenetics/DxNextflowPRS/blob/develop/modules/local/SNPlist/main.nf (this is python but it works in the same way) |
||||||
| """ | ||||||
| } | ||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,29 @@ | ||||||||||||||||||||||
| # function for parse_samplesheet | ||||||||||||||||||||||
| generate_repl_pattern <- function(sample_sheet) { | ||||||||||||||||||||||
| #' Generate replication pattern list based on information in sample_sheet | ||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So the function name should be |
||||||||||||||||||||||
| #' | ||||||||||||||||||||||
| #' @param sample_names: vector of sample names (vector of strings) | ||||||||||||||||||||||
| #' @param sample_sheet: matrix of file names and sample names | ||||||||||||||||||||||
| #' | ||||||||||||||||||||||
| #' @return ints_sorted: list of sample names with corresponding file names (technical replicates) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't these |
||||||||||||||||||||||
| # get the right columns from the samplesheet | ||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In-line comments are more usefull if they explain a choice, rather than stating what the line(s) below it does.
does not make it clearer, but if you explain the rationale then it does become clearer :) |
||||||||||||||||||||||
| file_name_col <- grep("File_Name|File Name", colnames(sample_sheet)) | ||||||||||||||||||||||
| sample_name_col <- grep("Sample_Name|Sample Name", colnames(sample_sheet)) | ||||||||||||||||||||||
| # get the unique sample names from the samplesheet | ||||||||||||||||||||||
| sample_names <- sort(unique(trimws(as.vector(unlist(sample_sheet[sample_name_col]))))) | ||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you consider using piping to make this more readable?
Suggested change
|
||||||||||||||||||||||
| # remove all characters from sample_names which are not letters, numbers, hyphens and periods | ||||||||||||||||||||||
| sample_names <- gsub("[^-.[:alnum:]]", "_", sample_names) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # create replication pattern (which technical replicates belong to which sample) | ||||||||||||||||||||||
| repl_pattern <- c() | ||||||||||||||||||||||
| for (sample_group in sample_names) { | ||||||||||||||||||||||
| file_indices <- which(sample_sheet[, sample_name_col] == sample_group) | ||||||||||||||||||||||
| file_names <- sample_sheet[file_indices, file_name_col] | ||||||||||||||||||||||
| repl_pattern <- c(repl_pattern, list(file_names)) | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
|
Comment on lines
+19
to
+24
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Of misschien?
Suggested change
|
||||||||||||||||||||||
| names(repl_pattern) <- sample_names | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| return(repl_pattern) | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,24 @@ | ||||||||||
| # unit tests for ParseSamplesheet | ||||||||||
| # function: generate_repl_pattern | ||||||||||
|
|
||||||||||
| # source all functions for PeakGrouping | ||||||||||
| source("../../preprocessing/parse_samplesheet_functions.R") | ||||||||||
|
Comment on lines
+4
to
+5
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This path seems wrong. Shouldn't it be one level down?
Suggested change
|
||||||||||
|
|
||||||||||
| # test generate_repl_pattern | ||||||||||
| testthat::test_that("replication pattern is correctly generated", { | ||||||||||
| # create sample sheet tot test on: | ||||||||||
| test_file_names <- paste0(rep("RES_20260101_", 6), sprintf("%03d", 1:6)) | ||||||||||
| test_sample_names <- sort(rep(c("C1", "P2", "P3"), 2)) | ||||||||||
| test_sample_sheet <- as.data.frame(cbind(File_Name = test_file_names, Sample_Name = test_sample_names)) | ||||||||||
|
|
||||||||||
| # test that a list of length 3 is generated | ||||||||||
| expect_length(generate_repl_pattern(test_sample_sheet), 3) | ||||||||||
| # test list names | ||||||||||
| expect_equal(names(generate_repl_pattern(test_sample_sheet)), unique(test_sample_names), TRUE) | ||||||||||
|
|
||||||||||
| # test what happens if any sample name is used twice | ||||||||||
| test_sample_names <- gsub("P3", "P2", test_sample_names) | ||||||||||
| test_sample_sheet <- as.data.frame(cbind(File_Name = test_file_names, Sample_Name = test_sample_names)) | ||||||||||
| expect_length(generate_repl_pattern(test_sample_sheet), 2) | ||||||||||
| expect_length(generate_repl_pattern(test_sample_sheet)$P2, 4) | ||||||||||
| }) | ||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes much more sense to put the argument parsing and the functions in the same script