
Pipeline Overview & Command Line Options

Kyle Gontjes edited this page Dec 30, 2024 · 1 revision

Steps

The pipeline calls variants on Illumina paired-end (PE) / single-end (SE) reads provided in a directory and generates a phylogenetic tree from recombination-filtered, high-quality variants found against the reference genome.

The pipeline is divided into two individual steps (selected with the -steps option) which should be run in sequential order (see below for command line arguments).

1. Variant Calling: This step will run all standard variant calling steps on sample reads residing in the input reads directory.

The possible options are:

Option All: This will run all variant calling steps from read trimming to variant calling.

Option clean,align,post-align,varcall,filter,stats: Use these options if you want to run individual variant calling steps. This runs the steps in order, starting from read cleaning/trimming, then alignment to the reference genome, post-alignment SAM/BAM format conversion, variant calling/filtering, and finally generation of mapping/variant statistics.

You can run part of the pipeline by customizing the -steps argument. For example, to skip the Trimmomatic cleaning step and instead run the pipeline from the alignment step onward, run it with the following options:

"align,post-align,varcall,filter,stats"

Note: The order of variant calling steps needs to be sequential. If skipping any steps, make sure the previous steps finished without any errors and their results are already present in the output folder.
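As a sketch, the two invocations described above might look like the following. The paths, the -index name, and the -analysis prefix here are placeholders for illustration, not values shipped with snpkit:

```
# First pass: run every variant calling step, from trimming through statistics
python snpkit.py -type PE -readsdir /path/to/reads -index KPNIH1 \
    -steps All -analysis my_project -outdir /path/to/output -cluster local

# Resume from the alignment step, skipping Trimmomatic cleaning
python snpkit.py -type PE -readsdir /path/to/reads -index KPNIH1 \
    -steps align,post-align,varcall,filter,stats \
    -analysis my_project -outdir /path/to/output -cluster local
```

Both invocations must use the same -analysis prefix and -outdir so the resumed run finds the earlier results.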


2. Generate Core/Non-core SNP consensus sequences and SNP/Indel matrices, with recombination filtering by Gubbins:

Option core_All: This will run all the core SNP consensus/matrix generating steps. Once the core variants and matrices are generated, the command will run the Gubbins and RAxML/IQ-TREE jobs.

Note: Samples not meeting minimum Depth of Coverage threshold (10X) will be removed before running this step.

Some of the types of results generated by this step are:

  • a prefix_core_results directory under the output directory, which will be the final results folder.
  • intermediate data files required for generating the core SNP matrix/consensus in VCF/FASTA format.
  • various data matrices that can be used to diagnose variant filter criteria and their impact on the overall distribution of core variants.
  • Gubbins recombination-filtered consensus FASTA files and RAxML/IQ-TREE trees generated from this recombination-filtered consensus.
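Once all samples from the first pass have finished without errors, the second pass is launched against the same output directory. A minimal sketch, again with placeholder paths and index name:

```
# Second pass: core variant consensus, matrices, Gubbins, and tree building
python snpkit.py -type PE -readsdir /path/to/reads -index KPNIH1 \
    -steps core_All -analysis my_project -outdir /path/to/output \
    -gubbins yes -cluster local
```

Samples below the 10X depth-of-coverage threshold are dropped before this step, as noted above.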


Command line options


usage: snpkit.py [-h] -type TYPE -readsdir DIR -index INDEX [-steps STEPS]
                 -analysis PREFIX -outdir OUTPUT [-cluster CLUSTER]
                 [-scheduler SCHEDULER] [-config CONFIG] [-suffix SUFFIX]
                 [-filenames FILENAMES] [-clean]
                 [-extract_unmapped EXTRACT_UNMAPPED] [-gubbins GUBBINS]
                 [-outgroup OUTGROUP] [-downsample DOWNSAMPLE]
                 [-coverage_depth COVERAGE_DEPTH] [-genomesize GENOMESIZE]
                 [-dryrun] [-mask] [-clip]

SNPKIT - A workflow for Microbial Variant Calling, Recombination detection and Phylogenetic tree reconstruction.

optional arguments:
  -h, --help            show this help message and exit

INPUT:
  -type TYPE            Type of illumina reads. Options: SE for single-end, PE for paired-end
  -readsdir DIR         path to input sequencing reads data folder.
  -index INDEX          Reference genome index name (prefix) as described in -config file.
  -steps STEPS          Run this part of snpkit. Options: All, core_All
                        All: run first part of snpkit - trimming, mapping, calling variants;
                        core_All: run second part of snpkit - filter variants, generate core/non-core multi-fasta alignments, SNP/Indel Matrices.
  -analysis PREFIX      prefix for output files.

OUTPUT:
  -outdir OUTPUT        output directory path ending with output directory name.

RESOURCES:
  -cluster CLUSTER      run snpkit in local or cluster mode. Default: local
  -scheduler SCHEDULER  Type of HPC job scheduler. Supports PBS, SLURM

SUBSAMPLE:
  -downsample DOWNSAMPLE
                        Subsample reads to a default depth of 100X or user specified -coverage_depth
  -coverage_depth COVERAGE_DEPTH
                        Downsample reads to this depth
  -genomesize GENOMESIZE
                        Genome size to calculate raw coverage

Optional:
  -config CONFIG        Path to YAML config file. Default: snpkit config - snpkit/config
  -suffix SUFFIX        Custom fastq reads suffix. Default: fastq.gz. supports *.fastq, *.fastq.gz, *.fq.gz, *.fq; 
  -filenames FILENAMES  Run snpkit on a subset of files in -readsdir folder. The file should contain single-end filename per line.
  -clean                Delete all intermediate files. Default: yes Options: yes, no
  -gubbins GUBBINS      Run Gubbins. Options: yes,no. Default: no
  -dryrun               Perform a trial run without running any jobs.
  -mask                 Mask Gubbins detected recombinant region and run Iqtree on masked alignment
  -clip                 Filter SAM file for soft and hard clipped alignments. Default: OFF

Development Phase:
  -extract_unmapped EXTRACT_UNMAPPED
                        Extract unmapped reads, assemble it and detect AMR genes using ariba
  -outgroup OUTGROUP    Outgroup sample name. Alpha testing version. Not recommended.
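The -genomesize value is what lets snpkit estimate raw coverage before downsampling. A back-of-the-envelope sketch of that arithmetic, with made-up numbers (2.5 M read pairs of 2 × 150 bp against a 5 Mb genome):

```shell
# Raw coverage = total sequenced bases / genome size
total_bases=$((2500000 * 2 * 150))   # read pairs x 2 reads x read length
genome_size=5000000
echo "$((total_bases / genome_size))x raw coverage"   # prints: 150x raw coverage
```

If the estimate exceeds the target depth (the default 100X, or a user-specified -coverage_depth), -downsample subsamples the reads down to that depth before alignment.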
