This is the downstream analysis pipeline for Smart-seq data, covering quality control and expression profiling.
The Smart-seq data processing workflow begins with the raw FASTQ files, which are first assessed with FastQC to flag sequencing issues and gauge overall data quality. UMI-tools is then applied to extract the UMI sequences from the reads, preparing them for later duplicate removal. Adapter sequences and low-quality bases are trimmed with Trim Galore, followed by a second round of FastQC on the cleaned reads. The processed reads are aligned to the reference genome with STAR, producing BAM files that are indexed with samtools. UMI-tools is subsequently used to deduplicate the aligned BAM, removing PCR duplicates, and the deduplicated BAM is re-indexed with samtools. Gene-level counts are obtained from the deduplicated BAM with featureCounts, while RSEM, run on STAR's transcriptome-coordinate BAM, calculates TPM values for genes and transcripts, completing the Smart-seq quantification workflow. We will structure the working directory as shown below.
Smart-seq
├─ 1.rawdata
│ ├─ *rep1_1.fastq
│ ├─ *rep1_2.fastq
│ ├─ *rep2_1.fastq
│ └─ *rep2_2.fastq
├─ 2.fastqc
│ ├─ *rep1_1_fastqc.html
│ ├─ *rep1_2_fastqc.zip
│ ├─ *rep2_1_fastqc.html
│ └─ *rep2_2_fastqc.zip
├─ 3.ectractumi
│ ├─ *rep1_1_umiext.fastq
│ ├─ *rep1_2_umiext.fastq
│ ├─ *rep2_1_umiext.fastq
│ └─ *rep2_2_umiext.fastq
├─ 4.trim
│ ├─ *rep1_1_umiext_cuta.fastq
│ ├─ *rep2_1_umiext_cuta.fastq
│ ├─ *rep1_1_umiext_cuta.html
│ ├─ *rep1_1_umiext_cuta.zip
│ ├─ *rep2_1_umiext_cuta.html
│ └─ *rep2_1_umiext_cuta.zip
├─ 5.star
├─ 6.featureCounts
├─ 7.RSEM
└─ 8.bw
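The stage directories in the layout above can be created up front; a minimal sketch using the names from the tree:

```shell
# create one working directory per pipeline stage
for d in 1.rawdata 2.fastqc 3.ectractumi 4.trim 5.star 6.featureCounts 7.RSEM 8.bw; do
  mkdir -p "Smart-seq/$d"
done
```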
fastqc
umi_tools
trim_galore
STAR
samtools
RSEM
featureCounts
deepTools
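Before running anything, it is worth confirming that the tools listed above are on `$PATH` (a simple availability check; the executable names are the standard ones each package installs):

```shell
# report whether each required tool is available on PATH
for tool in fastqc umi_tools trim_galore STAR samtools rsem-calculate-expression featureCounts bamCoverage; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
  fi
done
```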
In this section, you will convert the raw FASTQ files into .bam files, which can be used for subsequent analysis.
umiextractdir=3.ectractumi
fastqcdir=2.fastqc
rawdatadir=1.rawdata
trimdir=4.trim
stardir=5.star
starindex=starindex   # path to the STAR genome index
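The loops below derive one sample ID per replicate by stripping the `_1.fastq` suffix from the R1 file names. The derivation can be checked in isolation with hypothetical file names:

```shell
# hypothetical FASTQ names following the *rep1_1.fastq convention above;
# only R1 names survive the sed filter, and the suffix is stripped off
printf '%s\n' sampleA_rep1_1.fastq sampleA_rep1_2.fastq sampleA_rep2_1.fastq sampleA_rep2_2.fastq \
  | sed -n 's/_1\.fastq$//p' | sort -u
# prints: sampleA_rep1 and sampleA_rep2, one per line
```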
# derive one sample ID per replicate from the R1 file names
# (no cd: all paths below are relative to the project root)
for id in $(ls ${rawdatadir}/*_1.fastq | sed 's!.*/!!; s/_1\.fastq$//' | sort -u)
do
### step1: fastqc
fastqc -o ${fastqcdir} -t 2 -q ${rawdatadir}/${id}_1.fastq ${rawdatadir}/${id}_2.fastq
### step2: extract umi
umi_tools extract -I ${rawdatadir}/${id}_2.fastq -S ${umiextractdir}/${id}_2_umiext.fastq --read2-in=${rawdatadir}/${id}_1.fastq --read2-out=${umiextractdir}/${id}_1_umiext.fastq --bc-pattern=NNNNNNNN
### step3: trim R1 (only R1 carries cDNA; the UMI now lives in the read names)
trim_galore -q 20 --phred33 --length 20 -e 0.1 --stringency 3 -j 30 ${umiextractdir}/${id}_1_umiext.fastq -o ${trimdir}
# trim_galore names its single-end output *_trimmed.fq; rename to match the layout above
mv ${trimdir}/${id}_1_umiext_trimmed.fq ${trimdir}/${id}_1_umiext_cuta.fq
fastqc -o ${trimdir} -t 1 -q ${trimdir}/${id}_1_umiext_cuta.fq
####### step4: mapping
STAR --runThreadN 20 --genomeDir ${starindex} --readFilesIn ${trimdir}/${id}_1_umiext_cuta.fq --outFileNamePrefix ${stardir}/${id}_1. --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 --quantMode TranscriptomeSAM GeneCounts --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 15
####### step5: deduplication
samtools index ${stardir}/${id}_1.Aligned.sortedByCoord.out.bam
umi_tools dedup -I ${stardir}/${id}_1.Aligned.sortedByCoord.out.bam -S ${stardir}/${id}_1.deduplicated.bam
samtools index ${stardir}/${id}_1.deduplicated.bam
done

Get gene count profile

featureCountsdir=6.featureCounts
stardir=5.star
refgtf=gtf   # path to the reference annotation GTF
# -s 2: reverse-stranded counting; the BAMs are single-end (R1 only)
featureCounts -a ${refgtf} -s 2 -T 30 -o ${featureCountsdir}/TotalCounts.txt ${stardir}/*_1.deduplicated.bam > ${featureCountsdir}/TotalCounts.log 2>&1
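featureCounts prepends a program-info comment line and five annotation columns (Chr/Start/End/Strand/Length) before the per-sample counts, so the table can be reduced to a plain gene-by-sample matrix with coreutils. A sketch on a fabricated two-sample table (the file and sample names here are made up):

```shell
# fabricate a tiny tab-separated table mirroring featureCounts' output layout
printf '# Program:featureCounts v2.0\nGeneid\tChr\tStart\tEnd\tStrand\tLength\trep1.bam\trep2.bam\nGeneA\tchr1\t1\t100\t+\t100\t10\t7\n' > demo_TotalCounts.txt
# drop the comment line, keep Geneid plus the count columns
tail -n +2 demo_TotalCounts.txt | cut -f1,7-
```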
Get TPM profile

rawdatadir=1.rawdata
rsemdir=7.RSEM
stardir=5.star
rsemindex=rsemindex   # path to the RSEM reference index
for id in $(ls ${rawdatadir}/*_1.fastq | sed 's!.*/!!; s/_1\.fastq$//' | sort -u)
do
# RSEM requires transcriptome-coordinate alignments (STAR's --quantMode TranscriptomeSAM output);
# note that the UMI deduplication above was applied to the genome-coordinate BAM only
rsem-calculate-expression --no-bam-output --alignments -p 12 -q ${stardir}/${id}_1.Aligned.toTranscriptome.out.bam ${rsemindex} ${rsemdir}/${id}
done
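RSEM's `.genes.results` files are tab-separated with a fixed column order (gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM), so per-gene TPM values can be pulled out with `cut`. A sketch on a fabricated one-gene file (the gene and transcript names are made up):

```shell
# fabricate one record in RSEM's .genes.results layout; TPM is column 6
printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\nGeneA\ttxA\t1000\t900\t50\t12.50\t10.10\n' > demo.genes.results
cut -f1,6 demo.genes.results
```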
Get bigWig signal tracks

rawdatadir=1.rawdata
bwdir=8.bw
stardir=5.star
for id in $(ls ${rawdatadir}/*_1.fastq | sed 's!.*/!!; s/_1\.fastq$//' | sort -u)
do
# split coverage by strand: SAM flag 16 marks reverse-strand alignments
bamCoverage -b ${stardir}/${id}_1.deduplicated.bam -o ${bwdir}/${id}_1.fwd.bw --samFlagExclude 16 -p 4 --normalizeUsing CPM
bamCoverage -b ${stardir}/${id}_1.deduplicated.bam -o ${bwdir}/${id}_1.rev.bw --samFlagInclude 16 -p 4 --normalizeUsing CPM
done
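The strand split above relies on SAM flag bit 16 (0x10), which marks a read aligned to the reverse strand. The bit test itself can be sanity-checked with shell arithmetic on a few example flag values:

```shell
# flag bit 16 set => reverse strand; 83 = 64+16+2+1 (set), 99 = 64+32+2+1 (not set)
for flag in 0 16 83 99; do
  if [ $(( flag & 16 )) -ne 0 ]; then
    echo "$flag: reverse"
  else
    echo "$flag: forward"
  fi
done
# prints: 0: forward / 16: reverse / 83: reverse / 99: forward
```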