From d1524b95f546f768f3535298baed0b058f40b364 Mon Sep 17 00:00:00 2001
From: ang037
Date: Sat, 26 Apr 2025 23:57:13 -0700
Subject: [PATCH] simplified readme

---
 README.md          | 153 ++++++++++++++++++++++++++-------------------
 docs/install.md    |  99 +++++++++++++++--------------
 docs/quickstart.md |  52 ++++-----------
 docs/usage.md      |  12 ++--
 4 files changed, 159 insertions(+), 157 deletions(-)

diff --git a/README.md b/README.md
index fb20f490..d7d9d094 100644
--- a/README.md
+++ b/README.md
@@ -25,21 +25,21 @@
 ## Table of Contents
 - [Introduction](#overview)
 - [Quick Install](#usage)
- - [Using ROADIES Bioconda package](#conda)
- - [Using DockerHub](#dockerhub)
- - [Using Docker locally](#docker)
- - [Using Installation Script](#script)
+ - [Option 1: Install via Bioconda (Recommended)](#conda)
+ - [Option 2: Install via DockerHub](#dockerhub)
+ - [Option 3: Install via Local Docker Build](#docker)
+ - [Option 4: Install via Source Script](#script)
 - [Quick Start](#start)
-- [Run ROADIES with your own datasets](#runpipeline)
+- [Running ROADIES on your own data](#runpipeline)
 - [Citing ROADIES](#citation)
 ## Introduction
-Welcome to the official repository of ROADIES, a novel pipeline designed for phylogenetic tree inference of the species directly from their raw genomic assemblies. ROADIES offers a fully automated, easy-to-use, scalable solution, eliminating any manual steps and providing unique flexibility in adjusting the tradeoff between accuracy and runtime.
+Welcome to the official repository of ROADIES, a novel pipeline for inferring phylogenetic species trees directly from raw genomic assemblies. ROADIES offers a fully automated, scalable, and easy-to-use solution, eliminating manual steps and allowing flexible control over the trade-off between accuracy and runtime.
-**For more detailed information on all the features and settings of ROADIES, please refer to our [Wiki](https://turakhialab.github.io/ROADIES/).**
+**For a detailed overview of ROADIES' features and configuration options, please visit our [Wiki](https://turakhialab.github.io/ROADIES/).**
@@ -56,11 +56,11 @@ Welcome to the official repository of ROADIES, a novel pipeline designed for phy
 ## Quick Install
-### Using ROADIES Bioconda package (recommended)
+Please follow any of the options below to install ROADIES in your system.
-To run ROADIES using Bioconda package, follow these steps:
+### Option 1: Install via Bioconda (Recommended)
-To install and use conda in Ubuntu machine, execute the set of commands below:
+1. Install Conda (if not installed):
 ```
 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
@@ -69,81 +69,79 @@ chmod +x Miniconda3-latest-Linux-x86_64.sh
 export PATH="$HOME/miniconda3/bin:$PATH"
 source ~/.bashrc
+```
+
+2. Configure Conda channels:
+```
 conda config --add channels defaults
 conda config --add channels bioconda
 conda config --add channels conda-forge
 ```
-After this, try running `conda` in your terminal to check if conda is properly installed. Once it is installed, follow the steps below:
+Verify the installation by running `conda` in your terminal
-1. Create and activate custom conda environment with Python version 3.9, ETE3 and Seaborn.
+3. Create and activate a custom environment:
 ```
-conda create -n myenv python=3.9 ete3 seaborn
-conda activate myenv
+conda create -n roadies_env python=3.9 ete3 seaborn
+conda activate roadies_env
 ```
-2. Install ROADIES bioconda package
+4. Install ROADIES:
 ```
 conda install roadies
 ```
-All files of ROADIES along with dependencies will be found in `/miniconda3/envs/myenv/ROADIES`.
+5. Locate the installed files:
+
+```
+cd $HOME/miniconda3/envs/roadies_env/ROADIES
+
+```
+
+Now you are ready to follow the [Quick Start](#start) section to run the pipeline.
-### Using DockerHub
+### Option 2: Install via DockerHub
-To run ROADIES using DockerHub, follow these steps:
+If you would like to install ROADIES using DockerHub, follow these steps:
-1. Pull the ROADIES Docker image from DockerHub:
+1. Pull the ROADIES image from DockerHub:
 ```
 docker pull ang037/roadies:latest
 ```
-2. Run the Docker container:
+2. Launch a container:
 ```
 docker run -it ang037/roadies:latest
 ```
-### Using Docker locally
+Once you are able to access the ROADIES repository, refer to the [Quick Start](#start) to run the pipeline.
+
+### Option 3: Install via Local Docker Build
-First, clone the repository (requires `git` to be installed in the system):
+1. Clone the ROADIES repository:
 ```
 git clone https://github.com/TurakhiaLab/ROADIES.git
 cd ROADIES
 ```
-Then build and run the Docker container:
+2. Build and run the Docker container:
 ```
 docker build -t roadies_image .
 docker run -it roadies_image
 ```
-### Using installation script (requires sudo access)
-
-First clone the repository:
-
-```
-git clone https://github.com/TurakhiaLab/ROADIES.git
-cd ROADIES
-```
-
-Then, execute the installation script:
-
-```
-chmod +x roadies_env.sh
-source roadies_env.sh
-```
+Once you are able to access the ROADIES repository, refer to [Quick Start](#start) instructions to run the pipeline.
-This will install and build all tools and dependencies. Once the setup is complete, it will print `Setup complete` in the terminal and activate the `roadies_env` environment with all Conda packages installed.
+### Option 4: Install via Source Script
-#### Required dependencies
+1. Install the following dependencies (**requires sudo access**):
-To run this script, ensure the following dependencies are installed:
 - Java Runtime Environment (Version 1.7 or higher)
 - Python (Version 3.9 or higher)
 - `wget` and `unzip` commands
@@ -151,7 +149,6 @@ To run this script, ensure the following dependencies are installed:
 - cmake (Download here: https://cmake.org/download/)
 - Boost library (Download here: https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/)
 - zlib (Download here: http://www.zlib.net/)
-- GLIBC (Version 2.29 or higher)
 For Ubuntu, you can install these dependencies with:
@@ -159,62 +156,86 @@ For Ubuntu, you can install these dependencies with:
 sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptools git default-jre libgomp1 libboost-all-dev cmake
 ```
+2. Clone the repository:
+
+```
+git clone https://github.com/TurakhiaLab/ROADIES.git
+cd ROADIES
+```
+
+3. Run the installation script:
+
+```
+chmod +x roadies_env.sh
+source roadies_env.sh
+```
+
+After successful setup (Setup complete message), your environment roadies_env will be activated. Proceed to [Quick Start](#start).
+
 **Note:** If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`.
 ## Quick Start
-Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine:
+After installing using one of the options mentioned in [Quick Install](#usage), you're ready to run ROADIES! To get started:
-1. Go to ROADIES repository directory if not there:
+1. Download the test dataset (11 Drosophila genomes):
 ```
-cd ROADIES
+mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
 ```
-2. Create a directory for the test data and download the test datasets (using the following one line command):
+This will save the datasets on a separate `test/test_data` folder within the repository
-```
-mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
-```
-3. Run the pipeline with the following command (from ROADIES directory):
+2. Run the pipeline
-#### NOTE: By default, ROADIES run multiple iterations to get you the most accurate tree. --noconverge is the recommended option if you want to only test the pipeline or if you know optimal gene count to get the accurate tree.
+#### IMPORTANT: ROADIES by default runs multiple iterations for generating highly accurate trees. For quick testing, use `--noconverge` to run a single iteration.
 ```
-python run_roadies.py --cores 16 (# for actual run)
+python run_roadies.py --cores 16 # Full run (multiple iterations)
 ```
 ```
-python run_roadies.py --cores 16 --noconverge (# for test run)
+python run_roadies.py --cores 16 --noconverge # Quick test run (one iteration)
 ```
-These commands will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. Then it will run ROADIES pipeline for those 11 Drosophila genomes and save the final **UNROOTED** newick tree as `roadies.nwk` in a separate `output_files` folder upon completion. If `--noconverge` flag is not set, ROADIES saves the output of all other iterations in a separate `converge_files` folder.
+3. Output:
+
+ - Final **UNROOTED** newick tree saved as `roadies.nwk` in a separate `output_files` folder.
+ - Intermediate files (if `--noconverge` not used) saved in a separate `converge_files` folder.
-#### NOTE: The final newick tree is unrooted by default. User needs to reroot the tree appropriately on their own. We provide a script saved in `ROADIES/workflow/scripts/reroot.py` which lets you reroot the tree given a reference rooted species tree as input.
+#### NOTE: ROADIES outputs unrooted trees by default. You can reroot trees on your own or use the provided `reroot.py` script in `workflow/scripts/` (given a reference rooted species tree as input).
-## Run ROADIES with your own datasets
+## Running ROADIES on your own data
+
+If you want to run ROADIES with your own datasets, follow these steps:
+
+1. Specify Input Dataset:
+
+- Edit `config.yaml` file (found in the ROADIES directory - `config` folder).
+- Update the `GENOMES` field with paths to your `.fa` or `.fa.gz` genome assemblies. Ensure all input genomic assemblies are in `.fa` or `.fa.gz` format and named according to the species' name (e.g., `Aardvark.fa`).
-To run ROADIES with your own datasets, follow these steps:
+**IMPORTANT**: Each file must contain only one species. If needed, split multi-species files with:
-1. **Specify Input Genomic Dataset**: Update the `config.yaml` file (found in the ROADIES directory - `config` folder) to include the path to your input datasets under the `GENOMES` parameter. Ensure all input genomic assemblies are in `.fa` or `.fa.gz` format and named according to the species' name (e.g., `Aardvark.fa`).
+```
+faSplit byname
+```
-**Note**: Each file should contain the genome assembly of one unique species. If a file contains multiple species, split it into individual genome files (`fasplit` can be used: `faSplit byname `).
+2. Configure Other Parameters:
-2. **Configure Other Parameters**: Adjust other parameters in `config.yaml` as needed. Detailed information on each parameter is available in the [`Usage` section](https://turakhialab.github.io/ROADIES/).
+- Modify other parameters in `config.yaml` as needed.
+- Refer to detailed settings on the [Wiki](https://turakhialab.github.io/ROADIES/).
-3. **Run the Pipeline**: Execute the pipeline with the following command (example for 16 cores):
+3. Run the Pipeline:
 ```
 python run_roadies.py --cores 16
 ```
-The output species tree (unrooted) in Newick format will be saved as `roadies.nwk` in the `output_files` folder.
-
-4. **Modes of operation**: ROADIES supports multiple modes of operation (`fast`, `balanced`, `accurate`) by controlling the accuracy-runtime tradeoff. Use any one of the following commands to select a mode (`accurate` mode is the default):
+**Modes of operation**: ROADIES supports multiple modes of operation (`fast`, `balanced`, `accurate`) by controlling the accuracy-runtime tradeoff. Use any one of the following commands to select a mode (`accurate` mode is the default):
 ```
@@ -225,7 +246,9 @@ python run_roadies.py --cores 16 --mode balanced
 python run_roadies.py --cores 16 --mode fast
 ```
-### For troubleshooting and contribution details (also to know the steps of running ROADIES in a multi-node SLURM based cluster), refer to [Wiki](https://turakhialab.github.io/ROADIES/)
+The output species tree (unrooted) in Newick format will be saved as `roadies.nwk` in the `output_files` folder.
+
+### For troubleshooting, contributing, or SLURM cluster usage, refer to [Wiki](https://turakhialab.github.io/ROADIES/)
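
Note on the README hunks above: the documented invocations differ only in the `--cores`, `--mode` and `--noconverge` flags. As a rough illustration, those combinations could be assembled with a small Python helper like the one below. This is a sketch only: `build_roadies_command` and `VALID_MODES` are hypothetical names that are not part of ROADIES, and the helper only builds the argument list without running anything.

```python
import shlex

# Mode names and flags are taken from the README text in this patch.
# The helper itself is hypothetical and is NOT part of the ROADIES code base.
VALID_MODES = ("fast", "balanced", "accurate")


def build_roadies_command(cores, mode=None, noconverge=False):
    """Return the argument list for one run_roadies.py invocation."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["python", "run_roadies.py", "--cores", str(cores)]
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        cmd += ["--mode", mode]
    if noconverge:
        # Per the README, this requests a single iteration (quick test run).
        cmd.append("--noconverge")
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(shlex.join(build_roadies_command(16, mode="fast", noconverge=True)))
```

Leaving `mode` unset keeps the flag off the command line, which the README describes as equivalent to `accurate` mode. The resulting list could be handed to `subprocess.run` from the ROADIES directory if actual execution is wanted.
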
diff --git a/docs/install.md b/docs/install.md
index 03b5aaf5..11aa8f3a 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -1,12 +1,10 @@
 # Installation Methods
-## Using ROADIES Bioconda package (Recommended)
+Please follow any of the options below to install ROADIES in your system.
-To run ROADIES using Bioconda package, follow these steps:
+## Option 1: Install via Bioconda (Recommended)
-**Note:** You need to have conda installed in your system.
-
-To install and use conda in Ubuntu machine, execute the set of commands below:
+1. Install Conda (if not installed):
 ```bash
 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
@@ -15,92 +13,86 @@ chmod +x Miniconda3-latest-Linux-x86_64.sh
 export PATH="$HOME/miniconda3/bin:$PATH"
 source ~/.bashrc
+```
+
+2. Configure Conda channels:
+```bash
 conda config --add channels defaults
 conda config --add channels bioconda
 conda config --add channels conda-forge
 ```
-After this, try running `conda` in your terminal to check if conda is properly installed. Once it is installed, follow the steps below:
+Verify the installation by running `conda` in your terminal
-1. Create and activate custom conda environment with Python version 3.9
+3. Create and activate a custom environment:
 ```bash
-conda create -n myenv python=3.9
-conda activate myenv
+conda create -n roadies_env python=3.9 ete3 seaborn
+conda activate roadies_env
 ```
-2. Install ROADIES bioconda package
+4. Install ROADIES:
-```
+```bash
 conda install roadies
 ```
-All files of ROADIES along with dependencies will be found in `/miniconda3/envs/myenv/ROADIES`.
+5. Locate the installed files:
+
+```bash
+cd $HOME/miniconda3/envs/roadies_env/ROADIES
+
+```
-## Using DockerHub
+Now you are ready to follow the Quick Start section to run the pipeline.
-To run ROADIES using DockerHub, follow these steps:
+## Option 2: Install via DockerHub
-1. Pull the ROADIES Docker image from DockerHub:
+If you would like to install ROADIES using DockerHub, follow these steps:
+
+1. Pull the ROADIES image from DockerHub:
 ```bash
 docker pull ang037/roadies:latest
 ```
-2. Run the Docker container:
+2. Launch a container:
 ```bash
 docker run -it ang037/roadies:latest
 ```
-## Using Docker locally
+Once you are able to access the ROADIES repository, refer to the Quick Start section to run the pipeline.
+
+## Option 3: Install via Local Docker Build
-First, clone the repository (requires `git` to be installed in the system):
+1. Clone the ROADIES repository:
 ```bash
 git clone https://github.com/TurakhiaLab/ROADIES.git
 cd ROADIES
 ```
-Then build and run the Docker container:
+2. Build and run the Docker container:
 ```bash
 docker build -t roadies_image .
 docker run -it roadies_image
 ```
-## Using installation script (requires sudo access)
-
-First clone the repository:
-
-```bash
-git clone https://github.com/TurakhiaLab/ROADIES.git
-cd ROADIES
-```
-
-Then, execute the installation script:
+Once you are able to access the ROADIES repository, refer to Quick Start instructions to run the pipeline.
-```bash
-chmod +x roadies_env.sh
-source roadies_env.sh
-```
-
-This will install and build all tools and dependencies. Once the setup is complete, it will print `Setup complete` in the terminal and activate the `roadies_env` environment with all Conda packages installed.
+## Option 4: Install via Source Script
-!!! Note
- ROADIES is built on [Snakemake (workflow parallelization tool)](https://snakemake.readthedocs.io/en/stable/). It also requires various tools (PASTA, LASTZ, RAxML-NG, MashTree, FastTree, ASTRAL-Pro3) to be installed before performing the analysis. To ease the process, instead of individually installing the tools, we provide `roadies_env.sh` script to automatically download all dependencies into the user system.
+1. Install the following dependencies (**requires sudo access**):
-### Required dependencies
-
-To run this script, ensure the following dependencies are installed:
-- Java Runtime Environment (version 1.7 or higher)
-- Python (version 3 or higher)
+- Java Runtime Environment (Version 1.7 or higher)
+- Python (Version 3.9 or higher)
 - `wget` and `unzip` commands
-- GCC (version 11.4 or higher)
+- GCC (Version 11.4 or higher)
 - cmake (Download here: https://cmake.org/download/)
 - Boost library (Download here: https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/)
 - zlib (Download here: http://www.zlib.net/)
-- GLIBC (Version 2.29 or higher)
 For Ubuntu, you can install these dependencies with:
@@ -108,5 +100,20 @@ For Ubuntu, you can install these dependencies with:
 sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptools git default-jre libgomp1 libboost-all-dev cmake
 ```
-!!! Warning
- If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`.
+2. Clone the repository:
+
+```bash
+git clone https://github.com/TurakhiaLab/ROADIES.git
+cd ROADIES
+```
+
+3. Run the installation script:
+
+```bash
+chmod +x roadies_env.sh
+source roadies_env.sh
+```
+
+After successful setup (Setup complete message), your environment roadies_env will be activated. Proceed to Quick Start.
+
+**Note:** If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`.
\ No newline at end of file
diff --git a/docs/quickstart.md b/docs/quickstart.md
index a0c9cece..7a14a631 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,59 +1,31 @@
-# Quick start (with provided test dataset)
+# Quick start
-Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine:
+After installing using one of the options mentioned in Quick Install, you're ready to run ROADIES! To get started:
-**Step 1:** Go to ROADIES repository directory if not there:
-
-```bash
-cd ROADIES
-```
-
-**Step 2:** Create a directory for the test data and download the test datasets (using the following one line command):
+1. Download the test dataset (11 Drosophila genomes):
 ```bash
 mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
 ```
-**Step 3:** Run the pipeline with the following command (from ROADIES directory):
-```bash
-python run_roadies.py --cores 16
-```
-
-Step 2 will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. Step 3 will run ROADIES for those 11 Drosophila genomes and save the final newick tree as `roadies.nwk` in a separate `output_files` folder for the current iteration. The final output files for all iterations will be saved in `converge_files` folder upon completion.
+This will save the datasets on a separate `test/test_data` folder within the repository
-## Running ROADIES with different modes of operation
+2. Run the pipeline
-To run ROADIES in various other modes of operation (fast, balanced, accurate) (description of these modes are mentioned in [Modes of operation](index.md#modes-of-operation) section), try the following commands:
+#### IMPORTANT: ROADIES by default runs multiple iterations for generating highly accurate trees. For quick testing, use `--noconverge` to run a single iteration.
 ```bash
-python run_roadies.py --cores 16 --mode accurate
+python run_roadies.py --cores 16 # Full run (multiple iterations)
 ```
-
 ```bash
-python run_roadies.py --cores 16 --mode balanced
+python run_roadies.py --cores 16 --noconverge # Quick test run (one iteration)
 ```
-```bash
-python run_roadies.py --cores 16 --mode fast
-```
-!!! Note
- Accurate mode is the default mode of operation. If you don't specify any particular mode using `--mode` argument, default mode will run.
-
-For each modes, the output files for all iterations will be saved in a separate `converge_files` folder. `output_files` will save the results of the last iteration. Species tree for all iterations will be saved in `converge_files` folder with the nomenclature `iteration_.nwk`.
+3. Output:
-## Running ROADIES in non converge mode (single iteration mode)
+ - Final **UNROOTED** newick tree saved as `roadies.nwk` in a separate `output_files` folder.
+ - Intermediate files (if `--noconverge` not used) saved in a separate `converge_files` folder.
-By default, ROADIES will run for multiple iteration until it gets a stable tree at the end (details mentioned in [convergence mechanism](index.md#convergence-mechanism) section). To run ROADIES with non converge mode (only for one iteration), execute the following command (notice the addition of `--noconverge` argument):
-
-```bash
-python run_roadies.py --cores 16 --noconverge
-```
-Try following commands for other modes:
+#### NOTE: ROADIES outputs unrooted trees by default. You can reroot trees on your own or use the provided `reroot.py` script in `workflow/scripts/` (given a reference rooted species tree as input).
-```bash
-python run_roadies.py --cores 16 --mode balanced --noconverge
-```
-```bash
-python run_roadies.py --cores 16 --mode fast --noconverge
-```
\ No newline at end of file
diff --git a/docs/usage.md b/docs/usage.md
index 069b9137..be6bf85e 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -142,7 +142,7 @@ Replace below lines:
 ]
 ```
-With below lines (you can change the value of `--jobs` based on your cluster configuration):
+With below lines (you can change the value of `--jobs` and other account details based on your cluster configuration):
 ```
 cmd = [
@@ -162,9 +162,9 @@ With below lines (you can change the value of `--jobs` based on your cluster con
 "--cluster",
 (
 "sbatch "
- "--job-name=ROADIES_run "
- "--partition=vgl_a "
- "--account=jarv_condo_bank "
+ "--job-name=XXX "
+ "--partition=XXX "
+ "--account=XXX "
 "--nodes=1 "
 "--ntasks-per-node=4 "
 "--cpus-per-task=8 "
@@ -172,7 +172,7 @@ With below lines (you can change the value of `--jobs` based on your cluster con
 "--mem-per-cpu=11G "
 "--output=%x_%j.out "
 "--error=%x_%j.err "
- "--mail-user=agupta02@rockefeller.edu "
+ "--mail-user=XXX "
 "--mail-type=ALL"
 )
 ]
@@ -182,7 +182,7 @@ After the above changes, save the following lines of code as separate file calle
 ```
 #! /bin/bash
 #SBATCH -J ROADIES_XXX
-#SBATCH -p vgl_a
+#SBATCH -p XXX
 #SBATCH --account=XXX
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=1