Skip to content

Latest commit

 

History

History
407 lines (321 loc) · 16.4 KB

File metadata and controls

407 lines (321 loc) · 16.4 KB

Practice Session #1: Sequencing (March 23, 2023)

In this session, we will learn how to convert raw unmapped read files (FASTQ) to analysis-ready files (VCF).
The overall process in this session is mainly based on the GATK Best Practice.
This document was created on March 23, 2023 and the following contents were tested on local WSL (Ubuntu 22.04.1 LTS) + GSDS Cluster.

0. Installing Linux and Anaconda in Windows

Using Linux has become easy in Windows with WSL.
To start, launch windows powershell in administration mode and run following.

wsl --install

After system restart, linux can be run from terminal app. (Note that hard drive is mounted under /mnt)


# find latest release for Linux-x86 and copy link from https://www.anaconda.com/products/distribution
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh

# install anaconda by running bash and follow instructions
bash Anaconda3-2022.10-Linux-x86_64.sh

Access GSDS cluster computational server via leelabguest account

ID : leelabguest

1. Setting up the environment

We will use the Anaconda environment on the GSDS cluster.
It is already created on the GSDS cluster, but you can create the environment on your local machine with the following command
In this session, OpenJDK, samtools, GATK and BWA are installed in creation of conda environment and Picard is downloaded as java package
All files and tools are included in '~/GCDA/1_sequencing/' folder

# Create conda environment and install softwares 
conda create -n SEQ samtools bwa -c anaconda -c bioconda
conda activate SEQ

# Install jdk 17 version (Required after picard 3.0.0)
wget https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/openjdk-17.0.2_linux-x64_bin.tar.gz
tar xvf openjdk-17.0.2_linux-x64_bin.tar.gz
export JAVA_HOME=$JAVA_HOME/:~/jdk-17.0.2/
export PATH=$JAVA_HOME/bin:$PATH
alias java17="~/jdk-17.0.2/bin/java"

# install gatk4 
conda install jatk4 -c bioconda
# Download Picard (Find Latest Release: https://github.com/broadinstitute/picard/releases/latest)
cd ~/GCDA/1_sequencing/utils
wget https://github.com/broadinstitute/picard/releases/download/3.0.0/picard.jar

2. Preparing data

To map our raw unmapped reads, we need the reference panel and the information for known variants.
Here, we will use the FASTA file of 1000 Genome Phase 3 (GRCh37 build) and the VCF file for known variants.
You can browse FTP server (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) of 1000 Genome Project.

cd ~/GCDA/1_sequencing/reference
# Download 1000 Genome reference panel
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gzip -d human_g1k_v37.fasta.gz

# Download VCF file and its index (tbi) file for known variants
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz.tbi

You can check the contents of the FASTA file by the following command:

# View first 200 rows of FASTA file
head -200 human_g1k_v37.fasta

We need to preprocess (create index/dict files) the reference genome files.

# Create index (.fai)
samtools faidx human_g1k_v37.fasta

# Create dict (.dict)
gatk CreateSequenceDictionary -R human_g1k_v37.fasta

# Construct files with Burrows-Wheeler Transformation (5 files)
bwa index -a bwtsw human_g1k_v37.fasta

And we need a sequence read file (FASTQ) for the sample individual (HG00096).

cd ~/GCDA/1_sequencing/raw_reads
# Download sequence read file
# sample HG00096
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634.filt.fastq.gz
gzip -d SRR062634.filt.fastq.gz
# sample HG00097
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00097/sequence_read/SRR741384.filt.fastq.gz
gzip -d SRR741384.filt.fastq.gz
# sample HG00099
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00099/sequence_read/SRR741411.filt.fastq.gz
gzip -d SRR741411.filt.fastq.gz

We can check the contents of the FASTQ file by head command, and you will see the following:

@SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2
TGATCATTTGATTAATACTGACATGTAGACAAGAAGAAAAGTATGTTTCATGCTATTTTGAGTAACTTCCATTTAGAAGCCTACTCCTGAGCACAACATT
+
B5=BD5DAD?:CBDD-DDDDDCDDB+-B:;?A?CCE?;D3A?B?DB??;DDDEEABD+>DAC?A-CD-=D?C5A@::AC-?AB?=:>CA@##########
@SRR062634.488 HWI-EAS110_103327062:6:1:1503:935/2
AATGTTATTAAAAATGGACACCTTTTTCTCACACATTCAGTTTCATTGTCTCGCACCCCATCGTTTTACTTTTCTTCCTTCAGAAAATGATAAATGTGGG
+
AAAA?5D?BD==ADBD:DBDDDDD5D=;@>AD-CD?D=C5=@4<7CCAA5?=?>5@BC?*<:=>>:D:B5?B?5?'3::5?5<:;*97:<A#########
@SRR062634.849 HWI-EAS110_103327062:6:1:1587:921/2
CAGATCAGAATAATTTTTGTGTTATGTACGTGTAAGAAAACATAGCTATTATGATATGGAAACTAGGAGTGAAATATGAGGAATTTGTGACTTTTCTGAA

A FASTQ file normally uses four lines per sequence.

  • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

You can find the information of SRR062634 in the Sequence Read Archive (SRA) page of NCBI website.

Here are the quality value characters in left-to-right increasing order of quality (ASCII):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

You can learn more about FASTQ files on Wikipedia.

1000 Genome Project at https://www.internationalgenome.org/faq/about-fastq-sequence-read-files/

Many of individuals in 1000 Genome Project have multiple FASTQ files, because many of them were sequenced using more than one run of a sequencing machine.
Each set of files named like SRR062634_1.filt.fastq.gz, SRR062634_2.filt.fastq.gz and SRR062634.filt.fastq.gz represent all the sequence from a sequencing run.

The labels with _1 and _2 represent paired-end files, and the files which do not have a number in their name are single-ended reads. (or if one of a pair of reads gets rejected the other read gets placed in the single file.)

3. Overall Process

This practice session consists of 4 steps.

  1. Preprocess the FASTQ file
  • FastqToSam
  • AddOrReplaceReadGroups
  • MarkIlluminaAdapters
  • SamToFastq
  1. Convert FASTQ to BAM
  1. Convert BAM to GVCF
  • HaplotypeCaller
  1. Convert GVCF to VCF
  • GenotypeGVCFs

4. (Optional) Preprocessing the FASTQ file

Convert the raw FASTQ file to an unmapped BAM file

Using FastqToSam function of Picard, we can convert the FASTQ file to an unmapped BAM file.

SID=HG00096
java17 -jar ~/GCDA/1_sequencing/utils/picard.jar FastqToSam \
F1=~/GCDA/1_sequencing/raw_reads/SRR062634.filt.fastq \
O=~/GCDA/1_sequencing/data/fastq_to_bam_${SID}.bam \
SM=${SID}

Add read group information in BAM file

We can add read groups in BAM file by the following command:

cd ~/GCDA/1_sequencing/data/
java17 -jar ~/GCDA/1_sequencing/utils/picard.jar AddOrReplaceReadGroups \
I=fastq_to_bam_${SID}.bam \
O=add_read_groups_${SID}.bam \
RGID=4 \
RGLB=lib1 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM=20

Mark adapter sequences

Using MarkIlluminaAdapters, we can mark adapter sequences.

java17 -Xmx8G -jar ~/GCDA/1_sequencing/utils/picard.jar MarkIlluminaAdapters \
I=add_read_groups_${SID}.bam \
O=mark_adapter_${SID}.bam \
M=mark_adapter_${SID}.metrics.txt

Convert the preprocessed BAM file to a FASTQ file

java17 -Xmx8G -jar ~/GCDA/1_sequencing/utils/picard.jar SamToFastq \
I=mark_adapter_${SID}.bam \
FASTQ=fastq_input_${SID}.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true

5. Convert the preprocessed FASTQ file to an aligned BAM file

Align reads using BWA MEM

GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm.

bwa mem -M -t 7 -p ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta ~/GCDA/1_sequencing/data/fastq_input_${SID}.fq > aligned_${SID}.sam

You can check the contents of the SAM file:

@SQ	SN:1	LN:249250621
@SQ	SN:2	LN:243199373
@SQ	SN:3	LN:198022430
...
SRR062634.10000020	16	1	246002445	60	100M	*	0	0	ACAGCACCAGGCCAGCCTTTTTATTTTATTTTAATTTTTATTATTTTGAGACATTCTCGCTCTTTCGCCCAGGCCGGACTGCAGTGGTGCTATCTCAGCT	<C==C:@?A=9:5,=?>=?C=?;EE<CCBCAC=/?@FFAEEEEBEDEEABCFCBBGGEBFEDFCEGFGDF?GAGGGGFGGGGGGGGGGGGFGGGFGGGGG	NM:i:0	MD:Z:100	AS:i:100	XS:i:32
SRR062634.10000713	16	2	39841450	60	100M	*	0	0	TTAGCCATTCTAGTAGCTGTGTAGCAATTATGCTAGTTAACTGGTCAAATCTAATAGAGATGCTATCTAAAATGTGTTATAAAGAATGTGACTTGAGAGT	==:=C@A??DC@CC@CEBCCCCEDEGD=EECEF?EEBEGEFEFEEEGFGEGEGGDGFGFGEDGFGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGG	NM:i:0	MD:Z:100	AS:i:100	XS:i:19
SRR062634.10000906	0	5	97444533	60	100M	*	0	0	CAGTTTGATCCTTCTGAATTAGATTTTCCATACATGAAGCCTATGGGACTCTGGTGGGCAGTAGAAGATAAACTGTAATTTAAGTGAGGTTTTTATAAGC	EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE5BDDCC?EEEEEEDBEEEEEEEDEEEEA?DEEEEBE@EEEEADEDEE@AD?E	NM:i:1	MD:Z:48T51	AS:i:95	XS:i:20

The header section must be prior to the alignment section if it is present. Headings begin with the '@' symbol, which distinguishes them from the alignment section.
Alignment sections have 11 mandatory fields, as well as a variable number of optional fields.
SAM format explained: https://samtools.github.io/hts-specs/SAMv1.pdf
SAM, BAM, and CRAM https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats

The information of some columns are as follows:

  • Column 1: Query Template Name (QNAME)
  • Column 2: Bitwise Flag
  • Column 3: Reference sequence name (RNAME), often contains the Chromosome name
  • Column 4: Leftmost position of where this alignment maps to the reference (POS)
  • Column 5: Mapping Quality
  • Column 6: Concise Idiosyncratic Gapped Alignment Report (CIGAR) string (Wikipedia)
  • Column 10: Segment sequence
  • Column 11: ASCII representation of phred-scale base quality

Add information to BAM file using MergeBamAlignment

java17 -Xmx10G -jar ~/GCDA/1_sequencing/utils/picard.jar MergeBamAlignment \
R=~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
UNMAPPED=add_read_groups_${SID}.bam \
ALIGNED=aligned_${SID}.sam \
O=preprocessed_${SID}.bam \
CREATE_INDEX=true \
ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false \
CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true \
MAX_INSERTIONS_OR_DELETIONS=-1  \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
ATTRIBUTES_TO_RETAIN=XS

Mark Duplicates

java17 -jar ~/GCDA/1_sequencing/utils/picard.jar MarkDuplicates \
I=preprocessed_${SID}.bam \
O=mark_dup_${SID}.bam \
M=mark_dup_${SID}.metrics.txt

Sort, index and convert alignment to a BAM using SortSam

java17 -jar ~/GCDA/1_sequencing/utils/picard.jar SortSam \
I=mark_dup_${SID}.bam \
O=sorted_${SID}.bam \
SO=coordinate 

Create Recalibration Table using BaseRecalibrator

gatk --java-options '-Xmx10g' BaseRecalibrator \
-I sorted_${SID}.bam \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
--known-sites ~/GCDA/1_sequencing/reference/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz \
-O recal_data_${SID}.table

Base Quality Score Recalibration (BQSR)

gatk --java-options '-Xmx10g' ApplyBQSR \
-I sorted_${SID}.bam \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
--bqsr-recal-file recal_data_${SID}.table \
-O bqsr_${SID}.bam

IGV Viewer (Software for visualization of BAM file)

We can visualize the aligned BAM file with the IGV viewer.

For example, we can observe high coverage around SUMO1P1 gene. (HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam)

cd ~/GCDA/1_sequencing/raw_reads
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/exome_alignment/HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/exome_alignment/HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam.bai
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00097/exome_alignment/HG00097.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00097/exome_alignment/HG00097.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam.bai
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00099/exome_alignment/HG00099.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00099/exome_alignment/HG00099.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam.bai

6. Converting BAM to GVCF

Convert the individual BAM files to GVCF files

In this section, we will use the real data of 3 individuals in 1000 Genome Project (HG00096, HG00097, HG00099).
We can convert these BAM files to GVCF files.

cd ~/GCDA/1_sequencing/data
# Convert BAM for the first individual
gatk --java-options "-Xms4g" HaplotypeCaller \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
-I ~/GCDA/1_sequencing/raw_reads/HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam \
-L 20 \
-ERC GVCF \
-O sample01_20.g.vcf
# Convert BAM for the second individual
gatk --java-options "-Xms4g" HaplotypeCaller \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
-I ~/GCDA/1_sequencing/raw_reads/HG00097.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam \
-L 20 \
-ERC GVCF \
-O sample02_20.g.vcf
# Convert BAM for the third individual
gatk --java-options "-Xms4g" HaplotypeCaller \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
-I ~/GCDA/1_sequencing/raw_reads/HG00099.chrom20.ILLUMINA.bwa.GBR.exome.20130415.bam \
-L 20 \
-ERC GVCF \
-O sample03_20.g.vcf

Combine individual GVCF files

gatk CombineGVCFs \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
--variant sample01_20.g.vcf \
--variant sample02_20.g.vcf \
--variant sample03_20.g.vcf \
-O sample_all.g.vcf.gz

7. Converting GVCF to VCF

gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R ~/GCDA/1_sequencing/reference/human_g1k_v37.fasta \
-V sample_all.g.vcf.gz \
-O sample_all.vcf

You can check the contents of the final VCF file.

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099
20	61795	.	G	T	39.24	.	AC=2;AF=1.00;AN=2;DP=2;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=19.62;SOR=0.693	GT:AD:DP:GQ:PL	1/1:0,2:2:6:49,6,0	./.:0,0:0:0:0,0,0	./.:0,0:0:0:0,0,0
20	68749	.	T	C	193.81	.	AC=2;AF=0.500;AN=4;DP=11;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=25.36;SOR=3.611	GT:AD:DP:GQ:PL	1/1:0,5:5:15:208,15,0	./.:0,0:0:0:0,0,0	0/0:4,0:4:12:0,12,115
20	76962	.	T	C	19086.73	.	AC=6;AF=1.00;AN=6;BaseQRankSum=1.80;DP=539;ExcessHet=0.0000;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=59.41;MQRankSum=-3.240e-01;QD=28.73;ReadPosRankSum=1.42;SOR=0.260	GT:AD:DP:GQ:PL	1/1:0,357:357:99:14330,1073,0	1/1:1,65:66:99:2125,188,0	1/1:0,84:84:99:2645,250,0

We use this VCF file in the analysis!

8. Variant Calling with Deepvariant (Optional)

DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file. (https://github.com/google/deepvariant)

The simple concept of Deepvariant is explained and visualized in following post: Looking Through DeepVariant's Eyes