Skip to content

yswutan/Smart-seq-Processing-Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 

Repository files navigation

Smart-seq-Processing-Pipeline

This is the pipeline of downstream analysis for Smart-seq data, including QC and profiling.

Part I Introduction

i. Workflow

The Smart-seq data processing workflow begins with the raw FASTQ files, which are first assessed for quality using FastQC to identify sequencing issues and overall data quality. Next, UMI-tools is applied to extract the UMI sequences from the reads, preparing them for duplicate removal. Adapter sequences and low-quality bases are then trimmed using TrimGalore, followed by a second round of quality assessment with FastQC to evaluate the cleaned reads. The processed reads are aligned to the reference genome using STAR, generating BAM files, which are indexed with samtools. UMI-tools is subsequently used to deduplicate the aligned BAM, removing PCR duplicates, and the deduplicated BAM is again indexed with samtools. Gene-level counts are obtained from the deduplicated BAM using featureCounts, while RSEM can be applied to the same BAM to calculate TPM values for genes or transcripts, completing the Smart-seq quantification workflow. We will proceed by structuring our workflow according to (ii).

ii. File Structure

Smart-seq
├─ 1.rawdata
│    ├─ *rep1_1.fastq
│    ├─ *rep1_2.fastq
│    ├─ *rep2_1.fastq
│    └─ *rep2_2.fastq
├─ 2.fastqc
│    ├─ *rep1_1_fastqc.html
│    ├─ *rep1_2_fastqc.zip
│    ├─ *rep2_1_fastqc.html
│    └─ *rep2_2_fastqc.zip
├─ 3.ectractumi
│    ├─ *rep1_1_umiext.fastq
│    ├─ *rep1_2_umiext.fastq
│    ├─ *rep2_1_umiext.fastq
│    └─ *rep2_2_umiext.fastq
├─ 4.trim 
│    ├─ *rep1_1_umiext_cuta.fastq
│    ├─ *rep2_1_umiext_cuta.fastq
│    ├─ *rep1_1_umiext_cuta.html
│    ├─ *rep1_1_umiext_cuta.zip
│    ├─ *rep2_1_umiext_cuta.html
│    └─ *rep2_1_umiext_cuta.zip
├─ 5.star
├─ 6.featureCounts
├─ 7.RSEM
└─ 8.bw

iii. Software Required

fastqc
umi_tools
trim_galore
STAR
samtools
RSEM
featureCounts
deepTools

Part II Codes

Step 1: Preprocessing

In this section, you will convert the raw FASTQ files into .bam files, which can be used for subsequent analysis.

umiextractdir=3.ectractumi
fastqcdir=2.fastqc
rawdatadir=1.rawdata
trimdir=4.trim 
stardir=5.star
starindex=starindex
cd ${rawdatadir}
for id in `ls * | sort -d '_' -f 1 | uniq`
do
    ### step1: fastqc
		fastqc -o ${fastqcdir} -t 1 -q ${rawdatadir}/${id}_1.fastq
		### step2: extract umi
		umi_tools extract -I ${rawdatadir}/${id}_2.fastq -S ${umiextractdir}/${id}_2_umiext.fastq --read2-in=${rawdatadir}/${id}_1.fastq --read2-out=${umiextractdir}/${id}_1_umiext.fastq --bc-pattern=NNNNNNNN
		### step3: trim R1
		trim_galore -q 20 --phred33 --length 20 -e 0.1 --stringency 3 -j 30 --fastqc ${umiextractdir}/${id}_1_umiext_cuta.fastq -o ${trimdir}
		fastqc -o ${trimdir} -t 1 -q ${trimdir}/${id}_1_umiext_cuta.fq
		####### step4: mapping
		STAR --runThreadN 20 --genomeDir ${starindex} --readFilesIn ${trimdir}/${id}_1_umiext_cuta.fq --outFileNamePrefix ${stardir}/${id}_1. --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 --quantMode TranscriptomeSAM GeneCounts --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 15
		####### step5: deduplication
		samtools index ${stardir}/${id}_1.Aligned.sortedByCoord.out.bam
		umi_tools dedup -I ${stardir}/${id}_1.Aligned.sortedByCoord.out.bam -S ${stardir}/${id}_1.deduplicated.bam
		samtools index ${stardir}/${id}_1.deduplicated.bam
done

Step 2: compute gene counts

Get gene count profile

featureCountsdir=6.featureCounts
refgtf=gtf
featureCounts -a ${refgtf} -s 2 -B -T 30 -o ${featureCountsdir}/TotalCounts.txt ${stardir}/*_1.deduplicated.bam > ${featureCountsdir}/TotalCounts.log 2>&1 &

Step 3: compute gene tpm

Get gene count profile

rawdatadir=1.rawdata
rsemdir=7.RSEM
rsemindex=rsemindex
cd ${rawdatadir}
for id in `ls * | sort -d '_' -f 1 | uniq`
do
  rsem-calculate-expression --no-bam-output --alignments -p 12 -q ${stardir}/${id}_1.deduplicated.bam ${rsemindex} ${rsemdir}/${id}
done

Step 4: convert to bw for visualization

Get gene count profile

rawdatadir=1.rawdata
bwdir=8.bw
cd ${rawdatadir}
for id in `ls * | sort -d '_' -f 1 | uniq`
do
  bamCoverage -b ${stardir}/${id}_1.deduplicated.bam -o ${stardir}/${id}_1.fwd.bw --samFlagExclude 16 -p 4 --normalizeUsing CPM
  bamCoverage -b ${stardir}/${id}_1.deduplicated.bam -o ${stardir}/${id}_1.rev.bw --samFlagInclude 16 -p 4 --normalizeUsing CPM
done

About

This is the pipeline of downstream analysis for Smart-seq data, including QC and profiling.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors