This repository contains the complete analysis workflow used to benchmark the OptiFit algorithm in mothur and produce the accompanying manuscript. Find details on how to use OptiFit and descriptions of the parameter options on the mothur wiki: https://mothur.org/wiki/cluster.fit/.
Sovacool KL, Westcott SL, Mumphrey MB, Dotson GA, Schloss PD. 2022. OptiFit: An Improved Method for Fitting Amplicon Sequences to Existing OTUs. mSphere. http://dx.doi.org/10.1128/msphere.00916-21
A BibTeX entry for LaTeX users:
@article{sovacool_optifit_2022,
author = {Kelly L. Sovacool and Sarah L. Westcott and M. Brodie Mumphrey and Gabrielle A. Dotson and Patrick D. Schloss},
title = {OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs},
journal = {mSphere},
year = {2022},
doi = {10.1128/msphere.00916-21},
URL = {https://journals.asm.org/doi/10.1128/msphere.00916-21},
}
The workflow is split into five subworkflows:
- 0_prep_db — download & preprocess reference databases.
- 1_prep_samples — download, preprocess, & de novo cluster the sample datasets.
- 2_fit_reference_db — fit datasets to reference databases.
- 3_fit_sample_split — split datasets; cluster one fraction de novo and fit the remaining sequences to the de novo OTUs.
- 4_vsearch — run vsearch clustering for comparison.
The main workflow (Snakefile) creates plots from the results of
the subworkflows and renders the paper.
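As a rough sketch of working with one subworkflow at a time: the helper below is hypothetical (`run_subworkflow` is not part of this repository) and assumes each subworkflow can be invoked from its own directory, which this README does not confirm. It validates the name against the five subworkflows listed above and asks snakemake for a dry run (`-n`), so nothing is executed.

```shell
# Hypothetical helper, not part of the repository.
# Assumption: each subworkflow's Snakefile can be run from its own directory.
run_subworkflow() {
  name="$1"
  cores="${2:-4}"
  # Accept only the five subworkflows named in this README.
  case "$name" in
    0_prep_db|1_prep_samples|2_fit_reference_db|3_fit_sample_split|4_vsearch) ;;
    *) echo "unknown subworkflow: $name" >&2; return 1 ;;
  esac
  # -n requests a dry run: snakemake lists the jobs without running them.
  (cd "subworkflows/$name" && snakemake -n --cores "$cores")
}
```

For example, `run_subworkflow 0_prep_db 2` would list the jobs for the database preparation step using two cores. Drop the `-n` only once you have confirmed the subworkflow runs standalone in your checkout.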
- Before cloning, configure git symlinks:

      git config --global core.symlinks true

  Otherwise, git will create text files in place of symlinks.
- Clone this repository.

      git clone https://github.com/SchlossLab/Sovacool_OptiFit_mSphere_2022
      cd Sovacool_OptiFit_mSphere_2022

- Install the dependencies.
  Almost all of the dependencies are listed in the conda environment file, which covers everything needed to run the analysis workflow:

      conda env create -f config/env.simple.yaml
      conda activate optifit

  Additionally, I used a custom version of ggraph for the algorithm figure. You can install it with devtools from R:

      devtools::install_github('kelly-sovacool/ggraph', ref = 'iss-297_ggtext')

  If you do not already have LaTeX, you'll need to install a LaTeX distribution before rendering the manuscript as a PDF. You can use tinytex to do so:

      tinytex::install_tinytex()

  I also used latexdiffr to create a PDF with changes tracked prior to submitting revisions to the journal:

      devtools::install_github("hughjonesd/latexdiffr")
- Run the entire pipeline.

  Locally:

      snakemake --cores 4

  Or on an HPC running slurm:

      sbatch code/slurm/submit_all.sh

  (You will first need to edit your email and slurm account info in the submission script and cluster config.)
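Before launching a long run, it may help to confirm that the setup steps above took effect. The following is a minimal sketch with two hypothetical helpers (neither is part of this repository): one reports which command-line tools are missing from your PATH, and the other reports whether a given path was checked out as a real symlink or as a plain file.

```shell
# Hypothetical preflight helpers, not part of the repository.

# Print the name of each requested tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Report whether a path is a real symlink or was checked out as a plain file.
check_symlink() {
  if [ -L "$1" ]; then
    echo "symlink"
  else
    echo "not a symlink"
  fi
}
```

After `conda activate optifit`, running `missing_tools git conda snakemake` should print nothing if all three are available. `check_symlink` takes any path you expect to be a symlink in your clone; if it prints `not a symlink`, the `core.symlinks` setting was probably not in effect when you cloned.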
Directory structure:

    .
    ├── OptiFit.Rproj
    ├── README.md
    ├── Snakefile
    ├── code
    │   ├── R
    │   ├── bash
    │   ├── py
    │   ├── slurm
    │   └── tests
    ├── config
    │   ├── cluster.json
    │   ├── config.yaml
    │   ├── config_test.yaml
    │   ├── env.export.yaml
    │   ├── env.simple.yaml
    │   └── slurm
    │       └── config.yaml
    ├── docs
    │   ├── paper.md
    │   ├── paper.pdf
    │   └── slides
    ├── exploratory
    │   ├── 2018_fall_rotation
    │   ├── 2019_winter_rotation
    │   ├── 2020-05_May-Oct
    │   ├── 2020-11_Nov-Dec
    │   ├── 2021
    │   │   ├── figures
    │   │   ├── plots.Rmd
    │   │   └── plots.md
    │   ├── AnalysisRoadmap.md
    │   └── DeveloperNotes.md
    ├── figures
    ├── log
    ├── paper
    │   ├── figures.yaml
    │   ├── head.tex
    │   ├── msphere.csl
    │   ├── paper.Rmd
    │   ├── preamble.tex
    │   └── references.bib
    ├── results
    │   ├── aggregated.tsv
    │   ├── stats.RData
    │   └── summarized.tsv
    └── subworkflows
        ├── 0_prep_db
        │   ├── README.md
        │   └── Snakefile
        ├── 1_prep_samples
        │   ├── README.md
        │   ├── Snakefile
        │   ├── data
        │   │   ├── human
        │   │   │   └── SRR_Acc_List.txt
        │   │   ├── marine
        │   │   │   └── SRR_Acc_List.txt
        │   │   ├── mouse
        │   │   │   └── SRR_Acc_List.txt
        │   │   └── soil
        │   │       └── SRR_Acc_List.txt
        │   └── results
        │       ├── dataset_sizes.tsv
        │       └── opticlust_results.tsv
        ├── 2_fit_reference_db
        │   ├── README.md
        │   ├── Snakefile
        │   └── results
        │       ├── denovo_dbs.tsv
        │       ├── optifit_dbs_results.tsv
        │       └── ref_sizes.tsv
        ├── 3_fit_sample_split
        │   ├── README.md
        │   ├── Snakefile
        │   └── results
        │       ├── optifit_crit_check.tsv
        │       └── optifit_split_results.tsv
        └── 4_vsearch
            ├── README.md
            ├── Snakefile
            └── results
                └── vsearch_results.tsv
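
One way to use the tree above is as a checklist after a run. The sketch below is a hypothetical helper (not part of this repository) that prints any of a few top-level files from the tree that are absent. It assumes these particular files are what a completed run leaves in place, which this README does not state explicitly.

```shell
# Hypothetical helper, not part of the repository.
# Assumption: these files from the directory tree are produced by a full run.
missing_outputs() {
  root="${1:-.}"
  for f in results/aggregated.tsv results/summarized.tsv results/stats.RData docs/paper.pdf; do
    # Print the relative path of every expected file that does not exist.
    [ -e "$root/$f" ] || echo "$f"
  done
}
```

Calling `missing_outputs` from the repository root (or `missing_outputs /path/to/clone`) prints nothing when all four files are present.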
