Skip to content
/ QTL-seq Public
forked from YuSugihara/QTL-seq

QTL-seq pipeline to identify causative mutations responsible for a phenotype in self and out-crosses

License

Notifications You must be signed in to change notification settings

bu20dy/QTL-seq

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

143 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

QTL-seq User Guide

This is an edited version of QTL-seq published by Sugihara et al 2022. Please visit their Github for the official release and more help.

The purpose of my version is to allow for QTL-seq to be used with self-crosses. The original QTL-seq pipeline performs a filtering step that removes all SNPs where the provided parent is heterozygous, because the assumption is you are working with two fully fixed homozygous parents. It also re-assigns SNP genotype calls to be in reference to the parent (for example: if bulk 1 shares the same genotype as the parent, bulk 1 will be 0/0 at that site). What I have added is a parameter "--parental_SNPfilter" to both qtlseq and qtlplot to specify if you want the default filtering process ("homo") or to filter for heterozygous sites ("hetero"). The following README will largely be a copy and paste of Sugihara's Github page for QTL-seq, but if and where I add edits, they will be in bold.

Table of contents

What is QTL-seq?

Bulked segregant analysis, as implemented in QTL-seq (Takagi et al., 2013), is a powerful and efficient method for identifying agronomically important loci in crop plants. QTL-seq has been adapted from MutMap to identify quantitative trait loci. It utilizes sequences pooled from two segregating progeny populations with extreme opposite traits (e.g. resistant vs susceptible) and a single whole-genome resequencing of either of the parental cultivars. I suggest using the parent with the trait of interest as your provided parent, to have QTL peaks and delta SNP values in the positive direction. If you use the parent without the trait of interest, you just need to understand the peaks will be upside-down.

Citation

Installation

Dependencies

Software

Python (>=3.5) libraries

  • matplotlib
  • numpy
  • pandas
  • seaborn (optional)

Manual installation

My version can only be installed via this Github page!

git clone https://github.com/bu20dy/QTL-seq.git
cd QTL-seq
pip install -e .

Then you need to install other dependencies yourself. We highly recommend installing SnpEff and Trimmomatic via bioconda.

conda install -c bioconda snpeff
conda install -c bioconda trimmomatic

After installation, please check whether SnpEff and Trimmomatic are working using the commands below.

snpEff --help
trimmomatic --help

Usage

If your reference genome contains more than 50 contigs, only the 50 longest contigs will be plotted.

qtlseq -h

usage: qtlseq -r <FASTA> -p <BAM|FASTQ> -b1 <BAM|FASTQ>
              -b2 <BAM|FASTQ> -n1 <INT> -n2 <INT> -o <OUT_DIR>
              [-F <INT>] [-T] [-e <DATABASE>]

QTL-seq version 2.2.9

options:
  -h, --help             show this help message and exit
  -r , --ref             Reference FASTA file.
  --parental_SNPfilter   Determines how QTL-seq handles SNP filtering and reassigning of
                         SNPs based on the provided parent.
                         Options:
                         "hetero" == keeps SNPs where the parent is heterozygous (e.g. self-cross).
                         removes homozygous SNPs. does not re-assign bulk genotypes.
                         "homo" == keep SNPs where the parent has 0/0 or 1/1 genotype
                         and removes sites where parent is heterozygous.
                         this will also re-record bulk genotype assignments to be in reference
                         to the parental genotype. For example, if bulk 1 has the same genotype as
                         the parent, the bulk genotype for that SNP will be 0/0.
                         THIS IS THE DEFAULT AND HOW QTL-SEQ BY SUGIHARA 2022 OPERATES. [homo]
  -p , --parent          FASTQ or BAM file of the parent. If specifying
                         FASTQ, separate paired-end files with a comma,
                         e.g., -p fastq1,fastq2. This option can be
                         used multiple times.
  -b1 , --bulk1          FASTQ or BAM file of bulk 1. If specifying
                         FASTQ, separate paired-end files with a comma,
                         e.g., -b1 fastq1,fastq2. This option can be
                         used multiple times.
  -b2 , --bulk2          FASTQ or BAM file of bulk 2. If specifying
                         FASTQ, separate paired-end files with a comma,
                         e.g., -b2 fastq1,fastq2. This option can be
                         used multiple times.
  -n1 , --N-bulk1        Number of individuals in bulk 1.
  -n2 , --N-bulk2        Number of individuals in bulk 2.
  -o , --out             Output directory. The specified directory must not
                         already exist.
  -F , --filial          Filial generation. This parameter must be greater
                         than 1. [2]
  -t , --threads         Number of threads. If a value less than 1 is specified,
                         QTL-seq will use the maximum available threads. [2]
  -w , --window          Window size in kilobases (kb). [2000]
  -s , --step            Step size in kilobases (kb). [100]
  -D , --max-depth       Maximum depth of variants to be used. [250]
  -d , --min-depth       Minimum depth of variants to be used. [8]
  -N , --N-rep           Number of replicates for simulations to generate
                         null distribution. [5000]
  -T, --trim             Trim FASTQ files using Trimmomatic.
  -a , --adapter         FASTA file containing adapter sequences. This option
                         is used when "-T" is specified for trimming.
  --trim-params          Parameters for Trimmomatic. Input parameters
                         must be comma-separated in the following order:
                         Phred score, ILLUMINACLIP, LEADING, TRAILING,
                         SLIDINGWINDOW, MINLEN. To remove Illumina adapters,
                         specify the adapter FASTA file with "--adapter".
                         If not specified, adapter trimming will be skipped.
                         [33,<ADAPTER_FASTA>:2:30:10,20,20,4:15,75]
  -e , --snpEff          Predict causal variants using SnpEff. Check
                         available databases in SnpEff.
  --line-colors          Colors for threshold lines in plots. Specify a
                         comma-separated list in the order of SNP-index,
                         p95, and p99. ["#FE5F55,#6FD08C,#E3B505"]
  --dot-colors           Colors for dots in plots. Specify a
                         comma-separated list in the order of bulk1,
                         bulk2, and delta. ["#74D3AE,#FFBE0B,#40476D"]
  --mem                  Maximum memory per thread when sorting BAM files;
                         suffixes K/M/G are recognized. [1G]
  -q , --min-MQ          Minimum mapping quality for mpileup. [40]
  -Q , --min-BQ          Minimum base quality for mpileup. [18]
  -C , --adjust-MQ       Adjust the mapping quality for mpileup. The default
                         setting is optimized for BWA. [50]
  -v, --version          show program's version number and exit

QTL-seq can be run from FASTQ (without or with trimming) and BAM. If you want to run QTL-seq from VCF, please use QTL-plot (example 5). Once you run QTL-seq, QTL-seq automatically completes the subprocesses.

Example 1 : run QTL-seq from FASTQ without trimming

qtlseq -r reference.fasta \
       -p parent.1.fastq.gz,parent.2.fastq.gz \
       -b1 bulk_1.1.fastq.gz,bulk_1.2.fastq.gz \
       -b2 bulk_2.1.fastq.gz,bulk_2.2.fastq.gz \
       -n1 20 \
       -n2 20 \
       -o example_dir

-r : reference fasta

-p : FASTQs of parent. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-b1 : FASTQs of bulk 1. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-b2 : FASTQs of bulk 2. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-n1 : number of individuals in bulk 1.

-n2 : number of individuals in bulk 2.

-o : name of output directory. The specified directory should not already exist.

Example 2 : run QTL-seq from FASTQ with trimming

qtlseq -r reference.fasta \
       -p parent.1.fastq.gz,parent.2.fastq.gz \
       -b1 bulk_1.1.fastq.gz,bulk_1.2.fastq.gz \
       -b2 bulk_2.1.fastq.gz,bulk_2.2.fastq.gz \
       -n1 20 \
       -n2 20 \
       -o example_dir \
       -T \
       -a adapter.fasta

-r : reference fasta

-p : FASTQs of parent. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-b1 : FASTQs of bulk 1. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-b2 : FASTQs of bulk 1. Please input paired-end reads separated by a comma. FASTQ files can be gzipped.

-n1 : number of individuals in bulk 1.

-n2 : number of individuals in bulk 2.

-o : name of output directory. The specified directory should not already exist.

-T : trim your reads using trimmomatic.

-a : FASTA of adapter sequences for trimmomatic.

If you are using TrueSeq3, you can find the adapter sequences in the Github page of Trimmomatic. This thread is also helpful to preprare the adapter file.

Example 3 : run QTL-seq from BAM

qtlseq -r reference.fasta \
       -p parent.bam \
       -b1 bulk_1.bam \
       -b2 bulk_2.bam \
       -n1 20 \
       -n2 20 \
       -o example_dir

-r : reference fasta

-p : BAM of parent.

-b1 : BAM of bulk 1.

-b2 : BAM of bulk 2.

-n1 : number of individuals in bulk 1.

-n2 : number of individuals in bulk 2.

-o : name of output directory. The specified directory should not already exist.

Example 4 : run QTL-seq from multiple FASTQs and BAMs

qtlseq -r reference.fasta \
       -p parent_1.1.fastq.gz,parent_1.2.fastq.gz \
       -p parent_1.bam \
       -b1 bulk_11.1.fastq.gz,bulk_11.2.fastq.gz \
       -b1 bulk_12.bam \
       -b1 bulk_13.bam \
       -b2 bulk_21.1.fastq.gz,bulk_21.2.fastq.gz \
       -b2 bulk_22.bam \
       -b2 bulk_23.bam \
       -n1 20 \
       -n2 20 \
       -o example_dir

QTL-seq automatically merges multiple FASTQ and BAM files. Of course, you can merge FASTQ or BAM files using cat or samtools merge before inputting them into QTL-seq. If you specify -p multiple times, please make sure that those files include only one individual. On the other hand, -b1 and -b2 can include more than one individuals because those are bulked samples. QTL-seq automatically classifies FASTQ and BAM files based on whether comma exits or not.

Example 5 : run QTL-plot from VCF

usage: qtlplot -v <VCF> -n1 <INT> -n2 <INT> -o <OUT_DIR>
               [-F <INT>] [-t <INT>] [-w <INT>] [-s <INT>] [-D <INT>]
               [-d <INT>] [-N <INT>] [-m <FLOAT>] [-S <INT>] [-e <DATABASE>]
               [--igv] [--indel]

QTL-plot version 2.2.9

options:
  -h, --help              show this help message and exit
  -v , --vcf              VCF file which contains parent, bulk1, and bulk2
                          in this order. This VCF file must have AD field.
  -n1 , --N-bulk1         Number of individuals in bulk 1.
  -n2 , --N-bulk2         Number of individuals in bulk 2.
  -o , --out              Output directory. The specified directory can already
                          exist.
  --parental_SNPfilter    Determines how QTL-seq handles SNP filtering and reassigning of
                          SNPs based on the provided parent.
                          Options:
                          "hetero" == keeps SNPs where the parent is heterozygous (e.g. self-cross).
                          removes homozygous SNPs. does not re-assign bulk genotypes.
                          "homo" == keep SNPs where the parent has 0/0 or 1/1 genotype
                          and removes sites where parent is heterozygous.
                          this will also re-record bulk genotype assignments to be in reference
                          to the parental genotype. For example, if bulk 1 has the same genotype as
                          the parent, the bulk genotype for that SNP will be 0/0.
                          THIS IS THE DEFAULT AND HOW QTL-SEQ BY SUGIHARA 2022 OPERATES. [homo]
  -F , --filial           Filial generation. This parameter must be
                          greater than 1. [2]
  -t , --threads          Number of threads. If a value less than 1 is specified,
                          QTL-seq will use the maximum available threads. [2]
  -w , --window           Window size (kb). [2000]
  -s , --step             Step size (kb). [100]
  -D , --max-depth        Maximum depth of variants to be used. [250]
  -d , --min-depth        Minimum depth of variants to be used. [8]
  -N , --N-rep            Number of replicates for simulations to generate
                          null distribution. [5000]
  -m , --min-SNPindex     Cutoff of minimum SNP-index for clear results. [0.3]
  -S , --strand-bias      Filter out spurious homozygous genotypes in the cultivar
                          based on strand bias. If ADF (or ADR) is higher than
                          this cutoff when ADR (or ADF) is 0, that SNP will be
                          filtered out. If you want to disable this filtering,
                          set this cutoff to 0. [7]
  -e , --snpEff           Predict causal variants using SnpEff. Check
                          available databases in SnpEff.
  --igv                   Output IGV format file to check results on IGV.
  --indel                 Plot SNP-index with INDEL.
  --line-colors           Colors for threshold lines in plots. Specify a
                          comma-separated list in the order of SNP-index,
                          p95, and p99. ["#C3310F,#009E72,#FDB003"]
  --dot-colors            Colors for dots in plots. Specify a
                          comma-separated list in the order of bulk1,
                          bulk2, and delta. ["#74D3AE,#FFBE0B,#B3B8DD"]
  --fig-width             Width allocated in chromosome figure. [7.5]
  --fig-height            Height allocated in chromosome figure. [4.0]
  --white-space           White space between figures. (This option
                          only affects vertical direction.) [0.6]
  -f , --format           Specify the format of an output image.
                          eps/jpeg/jpg/pdf/pgf/png/rgba/svg/svgz/tif/tiff
  --version               show program's version number and exit

QTL-plot is included in QTL-seq. QTL-seq run QTL-plot after making VCF. Then, QTL-plot will work with default parameters. If you want to change some parameters, you can use VCF inside of (OUT_DIR/30_vcf/QTL-seq.vcf.gz) to retry plotting process like below.

qtlplot -v OUT_DIR/30_vcf/QTL-seq.vcf.gz \
        -o ANOTHER_DIR_NAME \
        -n1 20 \
        -n2 20 \
        -w 2000 \
        -s 100

Using QTL-plot for a VCF which was made by yourself

In this case:

  1. Ensure that your VCF includes the AD format.
  2. Ensure that your VCF includes columns for the parent, bulk1, and bulk2 in this order

If you got an error, please try to run QTL-seq from FASTQ or BAM before asking in issues.

Outputs

Inside of OUT_DIR is like below.

├── 10_ref
│  ├── reference.fasta
│  ├── reference.fasta.amb
│  ├── reference.fasta.ann
│  ├── reference.fasta.bwt
│  ├── reference.fasta.fai
│  ├── reference.fasta.pac
│  └── reference.fasta.sa
├── 20_bam
│  ├── bulk1.bam
│  ├── bulk1.bam.bai
│  ├── bulk2.bam
│  ├── bulk2.bam.bai
│  ├── parent.bam
│  └── parent.bam.bai
├── 30_vcf
│  ├── qtlseq.vcf.gz
│  └── qtlseq.vcf.gz.tbi
├── 40_qtlseq
│  ├── bulk1_SNPindex.png
│  ├── bulk2_SNPindex.png
│  ├── delta_SNPindex.png
│  ├── sliding_window.tsv
│  ├── sliding_window.p95.tsv
│  ├── sliding_window.p99.tsv
│  ├── snp_index.tsv
│  ├── snp_index.p95.tsv
│  └── snp_index.p99.tsv
└── log
   ├── alignment.log
   ├── bcftools.log
   ├── bwa.log
   ├── samtools.log
   └── tabix.log
  • If you run QTL-seq with trimming, you will get the directory of 00_fastq which includes FASTQs after trimming.
  • You can check the results in 40_qtlseq.
    • snp_index.tsv : columns in this order.
      • CHROM : chromosome name
      • POSI : position in chromosome
      • VARIANT : SNP or INDEL
      • DEPTH 1 : depth of bulk 1
      • DEPTH 2 : depth of bulk 2
      • p99 : 99% confidence interval of simulated delta SNP-index (absolute value)
      • p95 : 95% confidence interval of simulated delta SNP-index (absolute value)
      • SNP-index 1 : real SNP-index of bulk 1
      • SNP-index 2 : real SNP-index of bulk 2
      • DELTA SNP-index : real delta SNP-index (bulk2 - bulk1)
    • sliding_window.tsv : columns in this order.
      • CHROM : chromosome name
      • POSI : central position of window
      • MEAN p99 : mean of p99 (absolute value)
      • MEAN p95 : mean of p95 (absolute value)
      • MEAN SNP-index 1 : mean SNP-index of bulk 1
      • MEAN SNP-index 2 : mean SNP-index of bulk 2
      • MEAN DELTA SNP-index : mean delta SNP-index
    • snp_index.p95.tsv and snp_index.p99.tsv contain only the SNPs that exceed the respective thresholds (95% or 99%). Similarly, sliding_window.p95.tsv and sliding_window.p99.tsv contain only the windows that exceed the respective thresholds.
    • delta_SNPindex.png : resulting plot (like below)
      • BLUE dot : variant
      • RED line : mean SNP-index
      • ORANGE line : mean p99
      • GREEN line : mean p95
    • If you run QTL-seq with SnpEff, the following additional outputs will be generated:
      • qtlseq.snpEff.vcf: The updated VCF file after annotation by SnpEff, located in the 40_qtlseq directory.
      • snp_index.tsv: This file will contain a new column, impact, which includes the mutation impact information predicted by SnpEff.
      • When plotting the results, variants classified as MODERATE by SnpEff are marked with a + symbol, while variants classified as HIGH are marked with an x symbol in the plot.

Additional Resources

QTL-seq commands breakdown

For a detailed breakdown of the commands used in QTL-seq, including explanations of each step, parameters, and best practices for troubleshooting, please refer to the QTL-seq Commands Breakdown document.

Build a custom SnpEff database

If you are working with a non-model organism or your own reference genome, you may need to build a custom SnpEff database. For detailed instructions on how to build a custom SnpEff database, please refer to the Build a Custom SnpEff Database document.

Frequently Asked Questions

Choosing the reference genome

You can use a line that was not involved in the cross as the reference genome. In the version of QTL-seq published by Takagi et al., 2013, the reference fasta was rebuilt using one of the parents' reads to create a consensus sequence. However, starting from version 2, that step has been omitted. Instead, the VCF file is used to determine which parent contributed each segregating mutation found in the progeny.

When there are many contigs

The current setting has been updated to pick the top 50 contigs based on length, so only these contigs will be displayed in the plot. However, the table contains SNP-index information for all contigs, allowing you to confirm significant SNPs even for contigs not shown in the plot. You can also generate plots for these SNPs independently if needed. Since contigs smaller than the sliding window size often produce less reliable results, excluding them from the analysis should not be an issue.

How to filter for higher-confidence SNPs

If the phenotype is clear and the sequence data is clean, the default settings should already show some results. If you want to focus on higher-confidence SNPs, you can sequence both parents and retain only the SNPs that are clearly 0/0 in one parent and 1/1 in the other in the VCF file, which should produce cleaner results. However, keep in mind that the current version of QTL-seq only allows input for one parent’s sequence. The linked page explains default QTL-seq commands, which involve only one parent in the workflow, but it may still be helpful as a reference when creating VCFs that include both parents.

About multiple testing correction

This function has been deprecated since v2.2.5. We highly recommend running QTL-seq without this function. However, if you would like to use this function, you can use it with versions of QTL-seq older than v2.2.5.

About

QTL-seq pipeline to identify causative mutations responsible for a phenotype in self and out-crosses

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.7%
  • Shell 0.3%