Pipeline Overview & Command Line Options
The pipeline calls variants on Illumina paired-end (PE) or single-end (SE) reads provided in a directory and generates a phylogenetic tree from the recombination-filtered, high-quality variants found against the reference genome.
The pipeline is divided into two individual steps (-steps option) which should be run in sequential order (see below for command line arguments).
1. Variant Calling: This step will run all standard variant calling steps on sample reads residing in the input reads directory.
The possible options are:
Option All: This will run all variant calling steps from read trimming to variant calling.
Option clean,align,post-align,varcall,filter,stats: Use these options to run the individual variant calling steps: read cleaning/trimming, alignment to the reference genome, post-alignment SAM/BAM format conversion, variant calling, variant filtering, and finally generation of mapping/variant statistics.
You can run a part of the pipeline by passing only the steps you need to the -steps argument. For example, to skip the Trimmomatic cleaning step and start the pipeline from the alignment step, run it with the following option:
"align,post-align,varcall,filter,stats"
Note: The variant calling steps must be listed in sequential order. If you skip any steps, make sure the previous steps finished without errors and that their results are already present in the output folder.
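
The ordering rule in the note above can be checked before a run is launched. The following is a minimal sketch, not part of snpkit: `validate_steps` is a hypothetical helper, and it assumes that "sequential" means the chosen steps keep the same relative order as the full list clean,align,post-align,varcall,filter,stats.

```python
# Canonical order of the variant calling steps, as listed in the text above.
CANONICAL_STEPS = ["clean", "align", "post-align", "varcall", "filter", "stats"]


def validate_steps(steps_arg):
    """Hypothetical helper: parse a -steps value and check its ordering.

    Returns the list of steps, or raises ValueError if a step is unknown
    or the steps are out of order (assumption: relative order must match
    CANONICAL_STEPS and no step may be repeated).
    """
    steps = [s.strip() for s in steps_arg.split(",") if s.strip()]
    if not steps:
        raise ValueError("no steps given")

    unknown = [s for s in steps if s not in CANONICAL_STEPS]
    if unknown:
        raise ValueError("unknown step(s): " + ", ".join(unknown))

    # Each step must come strictly later in the canonical list than the one before it.
    positions = [CANONICAL_STEPS.index(s) for s in steps]
    if any(later <= earlier for earlier, later in zip(positions, positions[1:])):
        raise ValueError("steps must follow the canonical order without repeats")
    return steps


if __name__ == "__main__":
    print(validate_steps("align,post-align,varcall,filter,stats"))
```

A check like this only covers the ordering; it cannot tell whether the results of the skipped steps are actually present in the output folder.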

2. Generate core/non-core SNP consensus and SNP/Indel matrices, and filter recombination with Gubbins:
Option core_All: This will run all the core SNP consensus/matrix generating steps. Once the core variants and matrices are generated, the command will run the Gubbins and RAxML/IQ-TREE jobs.
Note: Samples not meeting minimum Depth of Coverage threshold (10X) will be removed before running this step.
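
To make the coverage rule concrete, here is a small sketch of that kind of filter. It is an illustration only and not snpkit's own code: `split_by_depth` and the sample names are hypothetical, and it assumes a sample at exactly 10X meets the threshold.

```python
MIN_DEPTH = 10.0  # the 10X minimum depth of coverage from the note above


def split_by_depth(depths, min_depth=MIN_DEPTH):
    """Hypothetical helper: split samples into kept and removed by mean depth.

    `depths` maps sample name -> mean depth of coverage.
    Assumption: a sample at exactly `min_depth` is kept.
    """
    kept = {name: d for name, d in depths.items() if d >= min_depth}
    removed = {name: d for name, d in depths.items() if d < min_depth}
    return kept, removed


if __name__ == "__main__":
    # Made-up depths for illustration.
    example = {"sample_A": 85.2, "sample_B": 9.4, "sample_C": 10.0}
    kept, removed = split_by_depth(example)
    print("kept:", sorted(kept))
    print("removed:", sorted(removed))
```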
Some of the types of results generated by this step are:
- prefix_core_results directory under the output directory which will be the final results folder.
- intermediate data files required for generating core SNP matrix/consensus in vcf/fasta format.
- Various data matrices that can be used for diagnosing variant filter criteria and their impact on the overall distribution of core variants.
- Gubbins recombination-filtered consensus fasta files and RAxML/IQ-TREE trees generated from this recombination-filtered consensus.
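
The two steps are run as two separate invocations, first with -steps All and then with -steps core_All. The sketch below only assembles and prints the two command lines. The flags are taken from the usage text in this document; the helper `snpkit_command`, the paths, the prefix, the index name, and calling the script as `snpkit.py` are all placeholder assumptions.

```python
import shlex


def snpkit_command(step, reads_dir, out_dir, prefix, index, read_type="PE"):
    """Hypothetical helper: build an argument list for one snpkit run.

    Only the required flags from the usage text are included. The program
    name "snpkit.py" is an assumption about how the script is invoked.
    """
    return [
        "snpkit.py",
        "-type", read_type,
        "-readsdir", reads_dir,
        "-index", index,
        "-steps", step,
        "-analysis", prefix,
        "-outdir", out_dir,
    ]


if __name__ == "__main__":
    # Step 1 (All) must finish before step 2 (core_All) is started.
    for step in ("All", "core_All"):
        cmd = snpkit_command(
            step,
            reads_dir="/path/to/reads",        # placeholder path
            out_dir="/path/to/output/demo",    # placeholder path
            prefix="demo",                     # placeholder prefix
            index="reference_name",            # placeholder index name
        )
        print(shlex.join(cmd))
```

Printing the commands first is a cheap way to review them; the -dryrun flag listed in the usage text offers a trial run on the snpkit side.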

usage: snpkit.py [-h] -type TYPE -readsdir DIR -index INDEX [-steps STEPS]
                 -analysis PREFIX -outdir OUTPUT [-cluster CLUSTER]
                 [-scheduler SCHEDULER] [-config CONFIG] [-suffix SUFFIX]
                 [-filenames FILENAMES] [-clean]
                 [-extract_unmapped EXTRACT_UNMAPPED] [-gubbins GUBBINS]
                 [-outgroup OUTGROUP] [-downsample DOWNSAMPLE]
                 [-coverage_depth COVERAGE_DEPTH] [-genomesize GENOMESIZE]
                 [-dryrun] [-mask] [-clip]

SNPKIT - A workflow for Microbial Variant Calling, Recombination detection and Phylogenetic tree reconstruction.

optional arguments:
  -h, --help            show this help message and exit

INPUT:
  -type TYPE            Type of illumina reads. Options: SE for single-end, PE for paired-end
  -readsdir DIR         path to input sequencing reads data folder.
  -index INDEX          Reference genome index name (prefix) as described in -config file.
  -steps STEPS          Run this part of snpkit. Options: All, core_All
                        All: run first part of snpkit - trimming, mapping, calling variants;
                        core_All: run second part of snpkit - filter variants, generate core/non-core multi-fasta alignments, SNP/Indel Matrices.
  -analysis PREFIX      prefix for output files.

OUTPUT:
  -outdir OUTPUT        output directory path ending with output directory name.

RESOURCES:
  -cluster CLUSTER      run snpkit in local or cluster mode. Default: local
  -scheduler SCHEDULER  Type of HPC job scheduler. Supports PBS, SLURM

SUBSAMPLE:
  -downsample DOWNSAMPLE
                        Subsample reads to a default depth of 100X or user specified -coverage_depth
  -coverage_depth COVERAGE_DEPTH
                        Downsample reads to this depth
  -genomesize GENOMESIZE
                        Genome size to calculate raw coverage

Optional:
  -config CONFIG        Path to YAML config file. Default: snpkit config - snpkit/config
  -suffix SUFFIX        Custom fastq reads suffix. Default: fastq.gz. supports *.fastq, *.fastq.gz, *.fq.gz, *.fq;
  -filenames FILENAMES  Run snpkit on a subset of files in -readsdir folder. The file should contain single-end filename per line.
  -clean                Delete all intermediate files. Default: yes Options: yes, no
  -gubbins GUBBINS      Run Gubbins. Options: yes,no. Default: no
  -dryrun               Perform a trial run without running any jobs.
  -mask                 Mask Gubbins detected recombinant region and run Iqtree on masked alignment
  -clip                 Filter SAM file for soft and hard clipped alignments. Default: OFF

Development Phase:
  -extract_unmapped EXTRACT_UNMAPPED
                        Extract unmapped reads, assemble it and detect AMR genes using ariba
  -outgroup OUTGROUP    Outgroup sample name. Alpha testing version. Not recommended.
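
The SUBSAMPLE options above relate read depth to genome size. The help text does not say how snpkit performs the calculation, so the following is only a sketch of the usual arithmetic: raw coverage is total sequenced bases divided by genome size, and the fraction of reads to keep is the target depth divided by the raw coverage. The function names and the example numbers are hypothetical; the 100X default comes from the -downsample description.

```python
def raw_coverage(total_bases, genome_size):
    """Raw depth of coverage: total sequenced bases / genome size (in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def downsample_fraction(total_bases, genome_size, target_depth=100.0):
    """Fraction of reads to keep to reach roughly `target_depth`.

    Assumption: samples already at or below the target are left untouched
    (fraction 1.0). This is a generic calculation, not snpkit's own code.
    """
    coverage = raw_coverage(total_bases, genome_size)
    if coverage <= target_depth:
        return 1.0
    return target_depth / coverage


if __name__ == "__main__":
    # Made-up example: 1.5 Gb of reads against a 5 Mb genome.
    total_bases = 1_500_000_000
    genome_size = 5_000_000
    print("raw coverage:", raw_coverage(total_bases, genome_size))
    print("fraction to keep:", round(downsample_fraction(total_bases, genome_size), 3))
```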