06. Assembly
The tools within this workflow perform metagenome assemblies with the de novo assemblers metaSPAdes in SPAdes version 3.11.1, as well as MEGAHIT version 1.1.2, on trimmed Illumina paired-end reads. QUAST version 4.5 is used to evaluate the assemblies, and MultiQC version 1.4 provides aggregated visualizations for the QUAST reports of each assembler. This workflow has been tested to run offline in an air-gapped system following the execution of the Read Filtering Workflow.

If you have not already, you will need to activate your metag environment and perform the Offline Setup for the assembly workflow:
[user@localhost ~]$ source activate metag
(metag)[user@localhost ~]$ cd metagenomics/workflows
(metag)[user@localhost workflows]$ python download_offline_files.py --workflow assembly
In the metagenomics/container_images/ directory, you should see the following Singularity images, which were created when the Offline Setup was run with the assembly or all flag:
| File Name | File Size |
|---|---|
| spades-3.11.1--py36_zlib1.2.8_0.simg | 52 MB |
| megahit-1.1.2--py35_0.simg | 49 MB |
| quast-4.5--boost1.61_1.simg | 657 MB |
| multiqc-1.4--py35_0.simg | 474 MB |
If you are missing any of these files, you should re-run the appropriate setup command, as per instructions in the Offline Setup.
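The check described above can also be scripted. The following is a minimal sketch, not part of the workflow itself: the helper name `missing_files` and the constant `EXPECTED_IMAGES` are hypothetical, and the relative path assumes the script is run from the directory that contains metagenomics/.

```python
from pathlib import Path

# Image names copied from the table above.
EXPECTED_IMAGES = [
    "spades-3.11.1--py36_zlib1.2.8_0.simg",
    "megahit-1.1.2--py35_0.simg",
    "quast-4.5--boost1.61_1.simg",
    "multiqc-1.4--py35_0.simg",
]


def missing_files(directory, expected):
    """Return the expected file names that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current directory is the parent of metagenomics/.
    missing = missing_files("metagenomics/container_images", EXPECTED_IMAGES)
    if missing:
        print("Missing images:", ", ".join(missing))
    else:
        print("All assembly images are present.")
```

The same helper could be pointed at metagenomics/workflows/data with the filtered read file names to confirm the workflow inputs before going offline.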
The assembly workflow uses the Illumina paired-end filtered reads (outputs from the Read Filtering Workflow) as its inputs. These files should be located in the metagenomics/workflows/data directory:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2_1.fq.gz | 381 MB |
| SRR606249_subset10_1_reads_trim2_2.fq.gz | 374 MB |
| SRR606249_subset10_1_reads_trim30_1.fq.gz | 327 MB |
| SRR606249_subset10_1_reads_trim30_2.fq.gz | 313 MB |
If all of these files are present, you may proceed to run the rest of the workflow offline.
Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.
After the config file is ready, be sure to specify the Singularity bindpath from the metagenomics/workflows directory before running the assembly workflow.
export SINGULARITY_BINDPATH="data:/tmp"
You can then execute any of the workflows through snakemake using the following command:
snakemake --use-singularity {rules} {other options}
The following rules are available for execution in the assembly workflow. These rules and their parameters are also listed under "workflows" in your config/default_workflowconfig.settings config file:
| Rule | Description |
|---|---|
| assembly_metaspades_workflow | metaSPAdes assembles filtered reads |
| assembly_megahit_workflow | MEGAHIT assembles filtered reads |
| assembly_all_workflow | metaSPAdes and MEGAHIT both independently assemble filtered reads |
| assembly_quast_workflow | QUAST evaluates the assemblies |
| assembly_multiqc_workflow | MultiQC aggregates all QUAST reports into a single report |
These rules can be run independently, or run together by listing them back to back in the command as such:
snakemake --use-singularity assembly_all_workflow assembly_quast_workflow assembly_multiqc_workflow
The following command will run only the metaSPAdes assembler:
snakemake --use-singularity assembly_metaspades_workflow
The following command will run only the MEGAHIT assembler:
snakemake --use-singularity assembly_megahit_workflow
Both assemblers can be run in tandem with the assembly_all_workflow rule:
snakemake --use-singularity assembly_all_workflow
To evaluate the assemblies, QUAST can be run with the assembly_quast_workflow rule:
snakemake --use-singularity assembly_quast_workflow
The assembly_multiqc_workflow rule aggregates all of the QUAST reports into a single report with MultiQC. This rule can also be used independently to execute the entire Assembly workflow:
snakemake --use-singularity assembly_multiqc_workflow
Additional options for snakemake can be found in the snakemake documentation.
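If you prefer to launch the rules from a script rather than typing the export and snakemake commands by hand, the two steps can be combined. This is a minimal sketch under stated assumptions: the function names `build_snakemake_call` and `run_rules` are hypothetical, and `run_rules` assumes snakemake is on the PATH and that `workdir` points at metagenomics/workflows.

```python
import os
import subprocess


def build_snakemake_call(rules, extra_options=(), bindpath="data:/tmp"):
    """Return (command, environment) for running the given rules.

    The bind path is passed through the SINGULARITY_BINDPATH environment
    variable, mirroring the export command shown earlier.
    """
    if not rules:
        raise ValueError("at least one rule is required")
    command = ["snakemake", "--use-singularity", *rules, *extra_options]
    env = dict(os.environ, SINGULARITY_BINDPATH=bindpath)
    return command, env


def run_rules(rules, extra_options=(), workdir="."):
    """Run snakemake from `workdir`; raises CalledProcessError on failure."""
    command, env = build_snakemake_call(rules, extra_options)
    return subprocess.run(command, cwd=workdir, env=env, check=True)
```

For example, `run_rules(["assembly_all_workflow", "assembly_quast_workflow"], extra_options=["-n"])` would ask snakemake for a dry run of both rules, since `-n` is snakemake's dry-run flag.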
To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture for more information.
After successful execution of the assembly workflow, you will find all of your outputs in the metagenomics/workflows/data/ directory. You should expect to see the following files for each pair of trimmed reads:
| Tool Output | File Name | Description |
|---|---|---|
| metaSPAdes | {sample}_1_reads_trim{quality threshold}.metaspades.contigs.fa | The final metaSPAdes assembled contigs, which is the output file used by downstream analysis tools |
| metaSPAdes | {sample}_1_reads_trim{quality threshold}.metaspades/ | Directory with additional outputs from the metaSPAdes assembler |
| MEGAHIT | {sample}_1_reads_trim{quality threshold}.megahit.contigs.fa | The final MEGAHIT assembled contigs, which is the output file used by downstream analysis tools |
| MEGAHIT | {sample}_1_reads_trim{quality threshold}.megahit/ | Directory with additional outputs from the MEGAHIT assembler |
| QUAST | {sample}_1_reads_trim{quality threshold}.{assembler}_quast/ | Directory with QUAST outputs for MEGAHIT or metaSPAdes |
| QUAST | {sample}_1_reads_trim{quality threshold}.{assembler}_quast/report.html | QUAST HTML report for MEGAHIT or metaSPAdes |
| MultiQC | {sample}_1_reads.{assembler}_multiqc_report.html | MultiQC HTML report, including multiple QUAST reports |
| MultiQC | {sample}_1_reads.{assembler}_multiqc_report_data/ | MultiQC directory with additional QUAST data and statistics |
The above files are the major outputs of the assembly workflow, and the *contigs.fa files are used as inputs to the Comparison and/or Functional Inference workflows.
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.
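The equivalent commands in this section follow one template per tool, so they can be generated rather than typed. The following is a rough sketch only: the function name `equivalent_commands` is hypothetical and not part of the workflow, and the file naming simply mirrors the commands listed in this section.

```python
def equivalent_commands(prefix, threshold, assembler):
    """Return the assembly and QUAST command strings for one set of reads.

    `prefix` is the read name prefix (for example "SRR606249_subset10"),
    `threshold` is the quality score threshold used for trimming, and
    `assembler` is either "metaspades" or "megahit".
    """
    programs = {"metaspades": "metaspades.py", "megahit": "megahit"}
    if assembler not in programs:
        raise ValueError(f"unknown assembler: {assembler}")
    reads = f"{prefix}_trim{threshold}"
    out = f"{reads}.{assembler}"
    assemble = (
        f"{programs[assembler]} -1 {reads}_1.fq.gz -2 {reads}_2.fq.gz -o {out}"
    )
    evaluate = f"quast.py {out}.contigs.fa -o {out}_quast"
    return [assemble, evaluate]


if __name__ == "__main__":
    for assembler in ("metaspades", "megahit"):
        for threshold in (2, 30):
            for line in equivalent_commands("SRR606249_subset10", threshold, assembler):
                print(line)
```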
The metaSPAdes assembly of reads filtered with a quality score threshold of 2 is equivalent to running this command:
metaspades.py -1 SRR606249_subset10_trim2_1.fq.gz -2 SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.metaspades
The metaSPAdes assembly of reads filtered with a quality score threshold of 30 is equivalent to running this command:
metaspades.py -1 SRR606249_subset10_trim30_1.fq.gz -2 SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.metaspades
The QUAST evaluations of the metaSPAdes assemblies are equivalent to running these commands:
quast.py SRR606249_subset10_trim2.metaspades.contigs.fa -o SRR606249_subset10_trim2.metaspades_quast
quast.py SRR606249_subset10_trim30.metaspades.contigs.fa -o SRR606249_subset10_trim30.metaspades_quast
The MultiQC aggregation of the metaSPAdes QUAST reports is equivalent to running this command:
multiqc SRR606249_subset10_trim2.metaspades_quast/report.tsv SRR606249_subset10_trim30.metaspades_quast/report.tsv -n SRR606249_subset10.metaspades_multiqc_report -o SRR606249_subset10.metaspades_multiqc_report
The MEGAHIT assembly of reads filtered with a quality score threshold of 2 is equivalent to running this command:
megahit -1 SRR606249_subset10_trim2_1.fq.gz -2 SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.megahit
The MEGAHIT assembly of reads filtered with a quality score threshold of 30 is equivalent to running this command:
megahit -1 SRR606249_subset10_trim30_1.fq.gz -2 SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.megahit
The QUAST evaluations of the MEGAHIT assemblies are equivalent to running these commands:
quast.py SRR606249_subset10_trim2.megahit.contigs.fa -o SRR606249_subset10_trim2.megahit_quast
quast.py SRR606249_subset10_trim30.megahit.contigs.fa -o SRR606249_subset10_trim30.megahit_quast
The MultiQC aggregation of the MEGAHIT QUAST reports is equivalent to running this command:
multiqc SRR606249_subset10_trim2.megahit_quast/report.tsv SRR606249_subset10_trim30.megahit_quast/report.tsv -n SRR606249_subset10.megahit_multiqc_report -o SRR606249_subset10.megahit_multiqc_report
Below is a more detailed description of the output files expected in the metagenomics/workflows/data/ directory after the assembly workflow has been successfully run.
Using the filtered reads generated by the Read Filtering Workflow:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2_1.fq.gz | 381 MB |
| SRR606249_subset10_1_reads_trim2_2.fq.gz | 374 MB |
| SRR606249_subset10_1_reads_trim30_1.fq.gz | 327 MB |
| SRR606249_subset10_1_reads_trim30_2.fq.gz | 313 MB |
The following files are produced by metaSPAdes after assembling the filtered reads with the assembly_metaspades_workflow or assembly_all_workflow rule*:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa | 158 MB |
| SRR606249_subset10_1_reads_trim2.metaspades/ | 4 KB |
| SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa | 146 MB |
| SRR606249_subset10_1_reads_trim30.metaspades/ | 4 KB |
The following files are produced by MEGAHIT after assembling the filtered reads with the assembly_megahit_workflow or assembly_all_workflow rule*:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2.megahit.contigs.fa | 132 MB |
| SRR606249_subset10_1_reads_trim2.megahit/ | 4 KB |
| SRR606249_subset10_1_reads_trim30.megahit.contigs.fa | 119 MB |
| SRR606249_subset10_1_reads_trim30.megahit/ | 4 KB |
*Additional files generated by the metaSPAdes and MEGAHIT assemblers are saved in the sub-directories listed above.
The following files are produced by QUAST after evaluating the assemblies with the assembly_quast_workflow rule:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2.metaspades_quast/ | 4 KB |
| SRR606249_subset10_1_reads_trim2.metaspades_quast/report.html | 685 KB |
| SRR606249_subset10_1_reads_trim30.metaspades_quast/ | 4 KB |
| SRR606249_subset10_1_reads_trim30.metaspades_quast/report.html | 670 KB |
| SRR606249_subset10_1_reads_trim2.megahit_quast/ | 4 KB |
| SRR606249_subset10_1_reads_trim2.megahit_quast/report.html | 687 KB |
| SRR606249_subset10_1_reads_trim30.megahit_quast/ | 4 KB |
| SRR606249_subset10_1_reads_trim30.megahit_quast/report.html | 668 KB |
The following files are produced by MultiQC after aggregating the QUAST reports with the assembly_multiqc_workflow rule:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads.megahit_multiqc_report_data/ | 4 KB |
| SRR606249_subset10_1_reads.megahit_multiqc_report.html | 1 MB |
| SRR606249_subset10_1_reads.metaspades_multiqc_report_data/ | 4 KB |
| SRR606249_subset10_1_reads.metaspades_multiqc_report.html | 1 MB |
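The file names in the listings above follow the naming patterns given earlier for each tool's outputs, so the full list can be derived from a sample name and its quality thresholds. The following is a minimal sketch, assuming only those patterns; the function name `expected_outputs` is hypothetical and the workflow does not provide it.

```python
def expected_outputs(sample, thresholds, assemblers=("metaspades", "megahit")):
    """List the output paths expected in metagenomics/workflows/data/.

    Names are built from the documented patterns, for example
    {sample}_1_reads_trim{quality threshold}.{assembler}_quast/report.html.
    """
    outputs = []
    for assembler in assemblers:
        for threshold in thresholds:
            stem = f"{sample}_1_reads_trim{threshold}.{assembler}"
            outputs.append(f"{stem}.contigs.fa")          # final contigs
            outputs.append(f"{stem}/")                    # assembler directory
            outputs.append(f"{stem}_quast/")              # QUAST directory
            outputs.append(f"{stem}_quast/report.html")   # QUAST HTML report
        # One MultiQC report per assembler, covering all thresholds.
        report = f"{sample}_1_reads.{assembler}_multiqc_report"
        outputs.append(f"{report}.html")
        outputs.append(f"{report}_data/")
    return outputs


if __name__ == "__main__":
    for path in expected_outputs("SRR606249_subset10", [2, 30]):
        print(path)
```

Comparing this list against the contents of the data directory is one way to confirm that every rule finished for every pair of trimmed reads.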
The table below summarizes statistics from the QUAST evaluations of the SRR606249_subset10_1_reads assemblies, as shown in the final MultiQC report.
| Sample Name | N50 (Kbp) | N75 (Kbp) | L50 (K) | L75 (K) | Largest contig (Kbp) | Length (Mbp) |
|---|---|---|---|---|---|---|
| SRR606249_subset10_1_reads_trim2.metaspades.contigs | 2.6 | 1.0 | 6.8 | 25,687.0 | 221.1 | 114.6 |
| SRR606249_subset10_1_reads_trim30.metaspades.contigs | 2.3 | 1.0 | 7.6 | 25,813.0 | 220.9 | 103.1 |
| SRR606249_subset10_1_reads_trim2.megahit.contigs | 2.9 | 1.1 | 6.1 | 22,984.0 | 264.4 | 109.6 |
| SRR606249_subset10_1_reads_trim30.megahit.contigs | 2.4 | 1.0 | 7.0 | 23,567.0 | 212.1 | 97.2 |
The statistics from the MEGAHIT and metaSPAdes assemblies for this sample are similar, although this does not assess potential differences in the taxonomic or functional content of the assembled contigs.
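For readers unfamiliar with the N50, N75, L50, and L75 columns, the sketch below computes them from a list of contig lengths using the standard definitions. It is an illustration only: the function name `nx_lx` is hypothetical, the example lengths are made up, and QUAST applies its own minimum contig length filter before computing these statistics, so values computed from a raw FASTA file may differ from its report.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Contigs are taken from longest to shortest until their combined length
    reaches `fraction` of the total assembly length. Lx is the number of
    contigs taken and Nx is the length of the last (shortest) one taken.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # made-up values, total 300
    print("N50, L50:", nx_lx(toy_lengths, 0.5))
    print("N75, L75:", nx_lx(toy_lengths, 0.75))
```

With the toy lengths, the two longest contigs (80 + 70) reach half of the total, giving N50 = 70 and L50 = 2; four contigs are needed to reach three quarters, giving N75 = 40 and L75 = 4.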