06. Assembly

Krista Ternus edited this page May 31, 2019 · 8 revisions

Assembly

Workflow Overview

The tools in this workflow perform de novo metagenome assembly of trimmed Illumina paired-end reads with two assemblers: metaSPAdes (SPAdes version 3.11.1) and MEGAHIT (version 1.1.2). QUAST version 4.5 evaluates the assemblies, and MultiQC version 1.4 aggregates the QUAST reports for each assembler into a single visualization. This workflow has been tested to run offline on an air-gapped system after the Read Filtering Workflow has been executed.

Required Files

If you have not already done so, activate your metag environment and perform the Offline Setup for the assembly workflow:

[user@localhost ~]$ source activate metag 

(metag)[user@localhost ~]$ cd metagenomics/workflows

(metag)[user@localhost workflows]$ python download_offline_files.py --workflow assembly  

Singularity Images

In the metagenomics/container_images/ directory, you should see the following Singularity images, which are created when the Offline Setup is run with the assembly or all flag:

File Name File Size
spades-3.11.1--py36_zlib1.2.8_0.simg 52 MB
megahit-1.1.2--py35_0.simg 49 MB
quast-4.5--boost1.61_1.simg 657 MB
multiqc-1.4--py35_0.simg 474 MB

If you are missing any of these files, you should re-run the appropriate setup command, as per instructions in the Offline Setup.
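
A quick way to confirm that all four images are present is a small shell check. The sketch below is not part of the workflow: the function name check_images is hypothetical, and it only assumes the image file names listed in the table above.

```shell
# Hypothetical helper (not part of the workflow): report any Singularity
# image from the table above that is missing from the given directory.
check_images() {
  local dir="$1" missing=0 img
  for img in \
    "spades-3.11.1--py36_zlib1.2.8_0.simg" \
    "megahit-1.1.2--py35_0.simg" \
    "quast-4.5--boost1.61_1.simg" \
    "multiqc-1.4--py35_0.simg"
  do
    if [ ! -f "$dir/$img" ]; then
      echo "missing: $img"
      missing=1
    fi
  done
  # Exit status is 0 when every image was found, 1 otherwise.
  return "$missing"
}
```

From the metagenomics/workflows directory, calling check_images ../container_images prints one line per missing image and returns a non-zero status if any image is absent.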

Input Files

The assembly workflow uses the Illumina paired-end filtered reads (outputs from the Read Filtering Workflow) as its inputs. These files should be located in the metagenomics/workflows/data directory:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 381 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 374 MB
SRR606249_subset10_1_reads_trim30_1.fq.gz 327 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 313 MB

If these files are present, you may proceed to run the rest of the workflow offline.
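
Because both assemblers expect paired-end input, it can help to confirm that every forward-read file has its reverse-read mate before starting. The following is a minimal sketch, assuming the *_reads_trim*_1.fq.gz / *_2.fq.gz naming shown in the table above; the function name check_read_pairs is hypothetical.

```shell
# Hypothetical helper: for every forward-read file (*_1.fq.gz) in a directory,
# confirm that the matching reverse-read file (*_2.fq.gz) exists.
check_read_pairs() {
  local dir="$1" status=0 fwd rev
  for fwd in "$dir"/*_reads_trim*_1.fq.gz; do
    # An unmatched glob stays literal, so test that the file really exists.
    [ -e "$fwd" ] || { echo "no trimmed reads found in $dir"; return 1; }
    rev="${fwd%_1.fq.gz}_2.fq.gz"
    if [ -f "$rev" ]; then
      echo "ok: $(basename "$fwd") + $(basename "$rev")"
    else
      echo "unpaired: $(basename "$fwd")"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_read_pairs data from the metagenomics/workflows directory lists each pair it finds and returns a non-zero status if a mate is missing.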

Workflow Execution

Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.

After the config file is ready, be sure to specify the Singularity bindpath from the metagenomics/workflows directory before running the assembly workflow.

export SINGULARITY_BINDPATH="data:/tmp"  
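
The bind path uses a source:destination format, so data:/tmp makes the host's data directory visible as /tmp inside each container. Since the source is a relative path, the export only has the intended effect when it is issued from metagenomics/workflows. A guarded version might look like the sketch below; the function name set_assembly_bindpath is hypothetical and not part of the workflow.

```shell
# Hypothetical helper: export the bind path only when ./data exists, i.e. when
# the current directory looks like metagenomics/workflows.
set_assembly_bindpath() {
  if [ ! -d "data" ]; then
    echo "data/ not found in $(pwd); run this from metagenomics/workflows" >&2
    return 1
  fi
  # Format is source:destination; host ./data appears as /tmp in the container.
  export SINGULARITY_BINDPATH="data:/tmp"
  echo "SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH"
}
```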

You can then execute the workflows through snakemake using the following command:

snakemake --use-singularity {rules} {other options}

The following rules are available for execution in the assembly workflow. These rules and their parameters are also listed under "workflows" in your config/default_workflowconfig.settings config file:

| Rule | Description |
| --- | --- |
| assembly_metaspades_workflow | metaSPAdes assembles filtered reads |
| assembly_megahit_workflow | MEGAHIT assembles filtered reads |
| assembly_all_workflow | metaSPAdes and MEGAHIT both independently assemble filtered reads |
| assembly_quast_workflow | QUAST evaluates the assemblies |
| assembly_multiqc_workflow | MultiQC aggregates all QUAST reports into a single report |

These rules can be run independently, or run together by listing them back to back in the command, as follows:

snakemake --use-singularity assembly_all_workflow assembly_quast_workflow assembly_multiqc_workflow

The following command will run only the metaSPAdes assembler:

snakemake --use-singularity assembly_metaspades_workflow

The following command will run only the MEGAHIT assembler:

snakemake --use-singularity assembly_megahit_workflow

Both assemblers can be run in tandem with the assembly_all_workflow rule:

snakemake --use-singularity assembly_all_workflow 

To evaluate the assemblies, QUAST can be run with the assembly_quast_workflow rule:

snakemake --use-singularity assembly_quast_workflow

The assembly_multiqc_workflow rule concatenates all of the QUAST reports into a single report with MultiQC. This rule can also be used independently to execute the entire Assembly workflow:

snakemake --use-singularity assembly_multiqc_workflow

Additional options for snakemake can be found in the snakemake documentation.
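
One snakemake option that is often useful here is -n (dry run), which lists the jobs that would be executed without running them. The wrapper below is a sketch only: the function name run_assembly_rules and the SNAKEMAKE override variable are hypothetical, and it assumes snakemake is on your PATH inside the metag environment.

```shell
# Hypothetical wrapper: preview the jobs with a dry run, then execute them.
# SNAKEMAKE can be overridden to substitute another command for inspection.
run_assembly_rules() {
  local snakemake="${SNAKEMAKE:-snakemake}"
  if [ "$#" -eq 0 ]; then
    echo "usage: run_assembly_rules RULE [RULE...]" >&2
    return 2
  fi
  # -n performs a dry run: jobs are listed but nothing is executed.
  $snakemake --use-singularity -n "$@" || return 1
  # The real run starts only if the dry run succeeded.
  $snakemake --use-singularity "$@"
}
```

For example, run_assembly_rules assembly_all_workflow assembly_quast_workflow previews both rules and then runs them.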

To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture for more information.

Output

After successful execution of the assembly workflow, you will find all of your outputs in the metagenomics/workflows/data/ directory. You should expect to see the following files for each pair of trimmed reads:

| Tool | Output File Name | Description |
| --- | --- | --- |
| metaSPAdes | {sample}_1_reads_trim{quality threshold}.metaspades.contigs.fa | The final metaSPAdes assembled contigs, which is the output file used by downstream analysis tools |
| metaSPAdes | {sample}_1_reads_trim{quality threshold}.metaspades/ | Directory with additional outputs from the metaSPAdes assembler |
| MEGAHIT | {sample}_1_reads_trim{quality threshold}.megahit.contigs.fa | The final MEGAHIT assembled contigs, which is the output file used by downstream analysis tools |
| MEGAHIT | {sample}_1_reads_trim{quality threshold}.megahit/ | Directory with additional outputs from the MEGAHIT assembler |
| QUAST | {sample}_1_reads_trim{quality threshold}.{assembler}_quast/ | Directory with QUAST outputs for MEGAHIT or metaSPAdes |
| QUAST | {sample}_1_reads_trim{quality threshold}.{assembler}_quast/report.html | QUAST HTML report for MEGAHIT or metaSPAdes |
| MultiQC | {sample}_1_reads.{assembler}_multiqc_report.html | MultiQC HTML report, including multiple QUAST reports |
| MultiQC | {sample}_1_reads.{assembler}_multiqc_report_data/ | MultiQC directory with additional QUAST data and statistics |

The above files are the major outputs of the assembly workflow, and the *contigs.fa files are used as inputs to the Comparison and Functional Inference workflows.
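
To make the naming pattern concrete, the sketch below expands the {sample}, {quality threshold}, and {assembler} placeholders into the per-assembly output names from the table. The function name expected_assembly_outputs is hypothetical and only prints names; it does not check that the files exist.

```shell
# Hypothetical helper: print the main output names expected for one sample,
# one quality threshold, and one assembler, following the output table.
expected_assembly_outputs() {
  local sample="$1" threshold="$2" assembler="$3"
  local prefix="${sample}_1_reads_trim${threshold}.${assembler}"
  echo "${prefix}.contigs.fa"        # final assembled contigs
  echo "${prefix}/"                  # additional assembler outputs
  echo "${prefix}_quast/"            # QUAST output directory
  echo "${prefix}_quast/report.html" # QUAST HTML report
}
```

For the example dataset, expected_assembly_outputs SRR606249_subset10 2 megahit prints the four MEGAHIT-related names for the reads trimmed at a quality threshold of 2.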

Additional Information

Command Line Equivalents

To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.

The metaSPAdes assembly of reads filtered with a quality score threshold of 2 is equivalent to running this command:

metaspades.py -1 SRR606249_subset10_trim2_1.fq.gz -2 SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.metaspades

The metaSPAdes assembly of reads filtered with a quality score threshold of 30 is equivalent to running this command:

metaspades.py -1 SRR606249_subset10_trim30_1.fq.gz -2 SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.metaspades

The QUAST evaluations of the metaSPAdes assemblies are equivalent to running these commands:

quast.py SRR606249_subset10_trim2.metaspades.contigs.fa -o SRR606249_subset10_trim2.metaspades_quast
quast.py SRR606249_subset10_trim30.metaspades.contigs.fa -o SRR606249_subset10_trim30.metaspades_quast

The MultiQC aggregation of the metaSPAdes QUAST reports is equivalent to running this command:

multiqc SRR606249_subset10_trim2.metaspades_quast/report.tsv SRR606249_subset10_trim30.metaspades_quast/report.tsv -n SRR606249_subset10.metaspades_multiqc_report -o SRR606249_subset10.metaspades_multiqc_report

The MEGAHIT assembly of reads filtered with a quality score threshold of 2 is equivalent to running this command:

megahit -1 SRR606249_subset10_trim2_1.fq.gz -2 SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.megahit

The MEGAHIT assembly of reads filtered with a quality score threshold of 30 is equivalent to running this command:

megahit -1 SRR606249_subset10_trim30_1.fq.gz -2 SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.megahit

The QUAST evaluations of the MEGAHIT assemblies are equivalent to running these commands:

quast.py SRR606249_subset10_trim2.megahit.contigs.fa -o SRR606249_subset10_trim2.megahit_quast  
quast.py SRR606249_subset10_trim30.megahit.contigs.fa -o SRR606249_subset10_trim30.megahit_quast

The MultiQC aggregation of the MEGAHIT QUAST reports is equivalent to running this command:

multiqc SRR606249_subset10_trim2.megahit_quast/report.tsv SRR606249_subset10_trim30.megahit_quast/report.tsv -n SRR606249_subset10.megahit_multiqc_report -o SRR606249_subset10.megahit_multiqc_report
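
Note that the QUAST commands above read a *.contigs.fa file, whereas the assemblers write their contigs inside the -o directory. By default metaSPAdes writes contigs.fasta and MEGAHIT writes final.contigs.fa. When running the tools by hand, an intermediate copy step is therefore needed. The sketch below shows one way to do it; the function name collect_contigs is hypothetical, and this is not necessarily how the workflow itself produces these files.

```shell
# Hypothetical helper: copy an assembler's default contig file to the
# <prefix>.<assembler>.contigs.fa name used by the QUAST commands.
# Assumes the tools' default output names: contigs.fasta (metaSPAdes)
# and final.contigs.fa (MEGAHIT).
collect_contigs() {
  local prefix="$1" assembler="$2" src
  case "$assembler" in
    metaspades) src="${prefix}.metaspades/contigs.fasta" ;;
    megahit)    src="${prefix}.megahit/final.contigs.fa" ;;
    *) echo "unknown assembler: $assembler" >&2; return 2 ;;
  esac
  if [ ! -f "$src" ]; then
    echo "not found: $src" >&2
    return 1
  fi
  cp "$src" "${prefix}.${assembler}.contigs.fa"
}
```

For example, collect_contigs SRR606249_subset10_trim2 megahit would create SRR606249_subset10_trim2.megahit.contigs.fa next to the MEGAHIT output directory.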

Expected Output Files for the Example Dataset

Below is a more detailed description of the output files expected in the metagenomics/workflows/data/ directory after the assembly workflow has been successfully run.

Using the filtered reads generated by the Read Filtering Workflow:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 381 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 374 MB
SRR606249_subset10_1_reads_trim30_1.fq.gz 327 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 313 MB

The following files are produced by metaSPAdes after assembling the filtered reads with the assembly_metaspades_workflow or assembly_all_workflow rule*:

File Name File Size
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa 158 MB
SRR606249_subset10_1_reads_trim2.metaspades/ 4 KB
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa 146 MB
SRR606249_subset10_1_reads_trim30.metaspades/ 4 KB

The following files are produced by MEGAHIT after assembling the filtered reads with the assembly_megahit_workflow or assembly_all_workflow rule*:

File Name File Size
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa 132 MB
SRR606249_subset10_1_reads_trim2.megahit/ 4 KB
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa 119 MB
SRR606249_subset10_1_reads_trim30.megahit/ 4 KB

*Additional files generated by the metaSPAdes and MEGAHIT assemblers are saved in the sub-directories listed above.

The following files are produced by QUAST after evaluating the assemblies with the assembly_quast_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.metaspades_quast/ 4 KB
SRR606249_subset10_1_reads_trim2.metaspades_quast/report.html 685 KB
SRR606249_subset10_1_reads_trim30.metaspades_quast/ 4 KB
SRR606249_subset10_1_reads_trim30.metaspades_quast/report.html 670 KB
SRR606249_subset10_1_reads_trim2.megahit_quast/ 4 KB
SRR606249_subset10_1_reads_trim2.megahit_quast/report.html 687 KB
SRR606249_subset10_1_reads_trim30.megahit_quast/ 4 KB
SRR606249_subset10_1_reads_trim30.megahit_quast/report.html 668 KB

The following files are produced by MultiQC after aggregating the QUAST reports with the assembly_multiqc_workflow rule:

File Name File Size
SRR606249_subset10_1_reads.megahit_multiqc_report_data/ 4 KB
SRR606249_subset10_1_reads.megahit_multiqc_report.html 1 MB
SRR606249_subset10_1_reads.metaspades_multiqc_report_data/ 4 KB
SRR606249_subset10_1_reads.metaspades_multiqc_report.html 1 MB

The table below summarizes statistics from the QUAST evaluations of the SRR606249_subset10_1_reads assemblies, as shown in the final MultiQC report.

| Sample Name | N50 (Kbp) | N75 (Kbp) | L50 (K) | L75 (K) | Largest contig (Kbp) | Length (Mbp) |
| --- | --- | --- | --- | --- | --- | --- |
| SRR606249_subset10_1_reads_trim2.metaspades.contigs | 2.6 | 1.0 | 6.8 | 25,687.0 | 221.1 | 114.6 |
| SRR606249_subset10_1_reads_trim30.metaspades.contigs | 2.3 | 1.0 | 7.6 | 25,813.0 | 220.9 | 103.1 |
| SRR606249_subset10_1_reads_trim2.megahit.contigs | 2.9 | 1.1 | 6.1 | 22,984.0 | 264.4 | 109.6 |
| SRR606249_subset10_1_reads_trim30.megahit.contigs | 2.4 | 1.0 | 7.0 | 23,567.0 | 212.1 | 97.2 |

The statistics from the MEGAHIT and metaSPAdes assemblies for this sample are similar, although these statistics do not assess potential differences in the taxonomic or functional content of the assembled contigs.
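
For readers unfamiliar with these metrics: N50 is the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The sketch below illustrates the definition with two hypothetical helper functions. It is not how QUAST is implemented, it reports plain base pairs rather than Kbp, and it applies no minimum contig length filter, so its results may differ from a QUAST report.

```shell
# Print the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }' "$1"
}

# Read contig lengths (one per line) on stdin and print "N50 L50".
n50_l50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        running += len[i]
        # First contig at which the running total reaches half the assembly.
        if (running >= half) { print len[i], i; exit }
      }
    }'
}
```

As a worked example, contigs of lengths 2, 3, 4, 5, and 6 total 20 bases. Sorted from longest to shortest, the running total is 6 and then 11, which passes the halfway mark of 10 at the second contig, so N50 is 5 and L50 is 2. Piping fasta_lengths on a *.contigs.fa file into n50_l50 applies the same calculation to an assembly.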
