-
Notifications
You must be signed in to change notification settings - Fork 9
07. Comparison
In this workflow, sourmash version 2.0.0a3 performs multi-sample metagenome comparisons. This is done by first generating MinHash signatures on the filtered reads (i.e., the Read Filtering Workflow output) or the assembled contigs (i.e., the Assembly Workflow output) with sourmash compute. The sourmash compare functionality then compares signatures from multiple datasets to each other and outputs their pairwise comparisons as Jaccard indexes in a final *.csv file. Currently, this workflow is designed to compare MinHash signatures of different quality trimmed values and assemblers for the same sample and output results in csv files and heatmap visualizations. Different samples can also be compared to each other, although confounding factors like sequencing depth can obscure the direct comparison of k-mer content in different samples.

This workflow has been tested to run offline following the execution of the Read Filtering and Assembly workflows. If you have not already, you will need to perform the Offline Setup for the comparison workflow (note: make sure you have activated you metag environment):
[user@localhost ~]$ source activate metag
(metag)[user@localhost ~]$ cd metagenomics/workflows
(metag)[user@localhost ~]$ python download_offline_files.py --workflow comparisonIn the metagenomics/container_images/ directory, you should see the following Singularity images that were created when running the comparison or all flag during the Offline Setup:
| File Name | File Size |
|---|---|
sourmash-2.0.0a3--py36_0.simg |
495 MB |
If you are missing this image, you should re-run the setup command, as per instructions in the Offline Setup.
The comparison workflow can use filtered reads (outputs from the Read Filtering Workflow) and/or assembled contigs (outputs from the Assembly Workflow) as its inputs. These input files should be located in the metagenomics/workflows/data directory:
| File Name | File Size |
|---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz |
381 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz |
374 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz |
327 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz |
313 MB |
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa |
158 MB |
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa |
146 MB |
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa |
132 MB |
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa |
119 MB |
If these files look good to go, then you may proceed to run the rest of the workflow offline.
Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.
After the config file is ready, be sure to specify the Singularity bind path from the metagenomics/workflows directory before running the comparison workflow.
cd metagenomics/workflows
export SINGULARITY_BINDPATH="data:/tmp" Execution of the comparison workflow can then be performed with the following command:
snakemake --use-singularity {rules} {other options}The following rules are available for execution in the comparison workflow, and these rules are also listed with their parameters under "workflows" in the default config file -metagenomics/workflows/config/default_workflowconfig.settings:
| Rule | Description |
|---|---|
comparison_reads_workflow |
Sourmash computes the MinHash signatures of filtered reads, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_assembly_workflow |
Sourmash computes the MinHash signatures of assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_reads_assembly_workflow |
Sourmash computes the MinHash signatures of filtered reads and assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_output_heatmap_plots_reads_workflow |
Generates a heatmap plot to visualize the csv output from the compared reads |
comparison_output_heatmap_plots_assembly_workflow |
Generates a heatmap plot to visualize the csv output from the compared assemblies |
comparison_output_heatmap_plots_reads_assembly_workflow |
Generates a heatmap plot to visualize the csv output from the compared reads and assemblies |
comparison_output_heatmap_plots_all_workflow |
Generates a heatmap plot to visualize all csv outputs. When run alone, this rule essentially performs all the former rules to generate all comparisons and data outputs. |
For this workflow, both the filtered reads and the assembled contigs can be compared with the following command:
snakemake --use-singularity comparison_reads_assembly_workflowThe following command will compare only filtered reads:
snakemake --use-singularity comparison_reads_workflowThe following command will compare only assembled contigs:
snakemake --use-singularity comparison_assembly_workflowThe following command will view the read comparisons as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_reads_workflowThe following command will view the assembly comparisons as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_assembly_workflowThe following command will view the combined read and assembly comparison as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_reads_assembly_workflowIn the event that you wish to execute all the available options for the comparison workflow, you may do so by simply executing the following rule:
snakemake --use-singularity comparison_output_heatmap_plots_all_workflowAdditional options for snakemake can be found in the snakemake documentation.
To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture.
After successful execution of the comparison workflow, you will find all of your outputs in the metagenomics/workflows/data/ directory. You should expect to see the following files for each sample and quality threshold value:
| Tool Output | File Name | Description |
|---|---|---|
| Sourmash | {sample}_1_reads_trim{qual}_ scaled{scaled_value}.k{value_of_k}.sig |
MinHash signatures of filtered reads generated by sourmash compute |
| Sourmash | {sample}_1_reads_trim{qual}.{assembler}_ scaled{scaled_value}.k{value_of_k}.sig |
MinHash signatures of assembled contigs generated by sourmash compute |
| Sourmash | {sample}_1_reads_trim{qual}and{qual} _read_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads |
| Sourmash | {sample}_1_reads_trim{qual}and{qual} _assembly_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of assemblies |
| Sourmash | {sample}_1_reads_and{sample}_1_reads_ read_assembly_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads and assemblies |
| Custom R script | {sample}_1_reads_trim{qual}and{qual}_ read_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of filtered reads |
| Custom R script | {sample}_1_reads_trim{qual}and{qual}_ assembly_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of assembled contigs |
| Custom R script | {sample}_1_readsand{sample}_reads_ read_assembly_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of filtered reads and assembled contigs |
MinHash signature files are not meant for human interpretation, but these are critical to the comparison workflow, as well as to sourmash gather in the Taxonomic Classification workflow.
The first line of the output *.csv file with the Jaccard indexes describes the order of sample names in the comparison matrix. A sample compared to itself will have a Jaccard index of 1, and a sample with no similarity to another will have a Jaccard index of 0.
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools. Note that the file names in the below examples may not be exact replicates of the file naming conventions in the current workflows, but the commands are equivalent.
The comparison of filtered reads, which were previously generated with a quality score threshold of 2 and 30, is performed in a multiple step process. First, sourmash compute is run on both sets of paired-end reads to generate signatures, and then sourmash compare is run to measure the amount of similarity between these signature files. This is equivalent to running the commands listed below.
Note that MinHash signatures for different k-mer lengths (i.e., values of k) are merged in the workflow to produce a single signature file, but there is one command for each value of k in the examples below. The information is all the same, but more individual signature files are produced by the below example commands (i.e., one signature file for each value of k) than are generated by the snakemake workflow.
The MinHash signatures of reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute --merge {sample}_reads_trim2_1.fq.gz {sample}_reads_trim2_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_reads_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2_scaled10000.k51.sigThe MinHash signatures of reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute --merge {sample}_reads_trim30_1.fq.gz {sample}_reads_trim30_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30_scaled10000.k51.sigAfter multiple MinHash signatures have been generated from the filtered reads, they are compared with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_read_comparison.{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_read_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k31.sig SRR606249_subset10_trim30_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_read_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k51.sig SRR606249_subset10_trim30_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_read_comparison.k51.csvThe comparison of assembled contigs from both MEGAHIT and metaSPAdes is also performed in a multiple step process, first by running sourmash compute on both sets of assembled contigs to generate signatures, then by running sourmash compare to determine the similarity between these signature files. This is equivalent to running the commands below.
The MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute {sample}_trim2.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.megahit_scaled10000.k51.sigThe MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute {sample}_trim30.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim30_metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.megahit_scaled10000.k51.sigAfter multiple MinHash signatures have been generated from the assembled contigs, they are compared with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compare {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k51.csvThe comparison of both reads and assembled contigs is performed by first running all the sourmash compute commands listed above to generate signatures for all of the trimmed reads and assembled contigs. Then sourmash compare is run with all of these signature files, which is equivalent to running the following command:
sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}and{sample}_read_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.31.sig SRR606249_subset10_trim30_scaled10000.k31.sig SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.51.sig SRR606249_subset10_trim30_scaled10000.k51.sig SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k51.csvBelow is a more detailed description of the output files expected in the metagenomics/workflows/data/ directory after the comparison workflow has been successfully run.
Using the filtered reads generated by the Read Filtering Workflow and/or assembled contigs generated by the Assembly Workflow:
| File Name | File Size |
|---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz |
381 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz |
374 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz |
327 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz |
313 MB |
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa |
158 MB |
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa |
146 MB |
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa |
132 MB |
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa |
119 MB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read datasets with the comparison_workflow_reads rule:
| File Name | File Size |
|---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig |
1 MB |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig |
903 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k21.csv |
0.1 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k31.csv |
0.1 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k51.csv |
0.1 KB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different assembled contig datasets with the comparison_workflow_assembly rule:
| File Name | File Size |
|---|---|
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig |
702 KB |
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig |
646 KB |
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig |
605 KB |
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig |
547 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k21.csv |
0.4 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k31.csv |
0.4 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k51.csv |
0.4 KB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read and assembled contig datasets with the comparison_workflow_reads_assembly rule:
| File Name | File Size |
|---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig |
1_MB |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig |
903_KB |
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig |
702_KB |
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig |
646_KB |
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig |
605_KB |
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig |
547_KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k21.csv |
0.9_KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads read_assembly_comparison.k31.csv |
0.9_KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k51.csv |
0.9_KB |
The tables below summarize the Jaccard indexes in the final *.csv files after analyzing the SRR606249_subset10_1_reads dataset.
| k21 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
|---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.91 |
0.61 |
0.71 |
0.55 |
0.66 |
reads_trim30 |
0.91 |
1.00 |
0.71 |
0.81 |
0.66 |
0.78 |
megahit_trim2 |
0.61 |
0.71 |
1.00 |
0.82 |
0.89 |
0.82 |
metaspades_trim2 |
0.71 |
0.81 |
0.82 |
1.00 |
0.75 |
0.91 |
megahit_trim30 |
0.55 |
0.66 |
0.89 |
0.75 |
1.00 |
0.80 |
metaspades_trim30 |
0.66 |
0.78 |
0.82 |
0.91 |
0.80 |
1.00 |
| k31 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
|---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.90 |
0.60 |
0.70 |
0.55 |
0.64 |
reads_trim30 |
0.90 |
1.00 |
0.70 |
0.80 |
0.66 |
0.77 |
megahit_trim2 |
0.60 |
0.70 |
1.00 |
0.81 |
0.89 |
0.81 |
metaspades_trim2 |
0.70 |
0.80 |
0.81 |
1.00 |
0.75 |
0.90 |
megahit_trim30 |
0.55 |
0.66 |
0.89 |
0.75 |
1.00 |
0.80 |
metaspades_trim30 |
0.64 |
0.77 |
0.81 |
0.90 |
0.80 |
1.00 |
| k51 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
|---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.87 |
0.61 |
0.68 |
0.55 |
0.64 |
reads_trim30 |
0.87 |
1.00 |
0.69 |
0.76 |
0.67 |
0.76 |
megahit_trim2 |
0.61 |
0.69 |
1.00 |
0.81 |
0.87 |
0.81 |
metaspades_trim2 |
0.68 |
0.76 |
0.81 |
1.00 |
0.74 |
0.89 |
megahit_trim30 |
0.55 |
0.67 |
0.87 |
0.74 |
1.00 |
0.80 |
metaspades_trim30 |
0.64 |
0.76 |
0.81 |
0.89 |
0.80 |
1.00 |
Overall, this example showed the same trends in similarity across different k-mer lengths, regardless of trimming quality threshold or assembly status. Additional observations from this example include the following:
- Datasets showed slightly more similarity to each other when evaluated with smaller k-mer lengths (i.e., k = 21) than larger k-mer lengths (i.e., k = 51)
- Raw reads trimmed at different quality thresholds were slightly more similar to each other than the assembled contigs
- Assembled contigs shared slightly more of their k-mer content if they were created by the same assembler