Description of the methods and R scripts used to analyze trimming and tailing of miRNA isoforms (isomiRs), as well as the nucleotide composition of non-templated tails.
- GEO accession GSE139567
- Currently only available under reviewer's token (10/29/2019)
- The Cancer Genome Atlas (TCGA) miRNA-seq datasets used to analyze the impact of tumor mutations can be retrieved under a dbGaP license (phs000178).
This code was tested under:
- MacBook Pro (15-inch, 2016)
- Processor: 2.7 GHz Intel Core i7
- Memory: 16 GB 2133 MHz LPDDR3
- R version 3.5.1 (2018-07-02)
- RStudio version 1.1.456 – © 2009-2018
- Download and installation
- Expected run time: 30-40 min
All the cloud computing tools can be found in the Cancer Genomics Cloud (CGC).
The Cancer Genomics Cloud (CGC), powered by Seven Bridges, is one of three systems funded by the National Cancer Institute to explore the paradigm of colocalizing massive public datasets, like The Cancer Genome Atlas (TCGA), alongside secure and scalable computational resources to analyze them.
- QuagmiR
- Expected run time: ~20 min per sample
- Picard Sam-to-Fastq
- Bowtie
“The Seven Bridges Cancer Genomics Cloud has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Contract No. HHSN261201400008C and ID/IQ Agreement No. 17X146 under Contract No. HHSN261201500003I.”
The small RNA sequencing data were analyzed using an in-house pipeline. Briefly, adaptors were removed, and reads were mapped using Bowtie and visualized using IGV. A more detailed study of the isomiR profile was done using QuagmiR. This software uses a unique algorithm to pull specific reads and align them against a consensus sequence in the middle of a miRNA, allowing mismatches at the ends to capture 3' isomiRs. The reports included a tabulated analysis of miRNA expression, length, number of nucleotides trimmed and tail composition at the individual read level.
In this manuscript, QuagmiR's "Levenshtein or edit distance" parameters for the 5' and 3' segments were set to 2 and -1 (no restriction), respectively. This setting gave high stringency in identifying the miRNA while leaving the 3' end of the miRNA unrestrained, so that any trimming and/or tailing event could be detected.
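To make the edit-distance setting concrete, here is a minimal Python sketch of a Levenshtein check in which a threshold of -1 disables the restriction. It only illustrates the concept: it is not QuagmiR's code, and the function names and the example segment are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def passes(segment: str, reference: str, max_distance: int) -> bool:
    """True if the segment is within max_distance edits; -1 means no restriction."""
    if max_distance == -1:
        return True
    return levenshtein(segment, reference) <= max_distance


# Hypothetical 5' segment with one substitution relative to the reference.
print(passes("UGCAAGAC", "UGGAAGAC", 2))   # stringent 5' check
print(passes("AAAA", "UUUU", -1))          # unrestricted 3' check
```

With a threshold of 2 on the 5' side, a read must closely resemble the reference to be assigned to the miRNA, while the -1 on the 3' side accepts any 3' end, trimmed or tailed.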
Customized R scripts were used to calculate percentages of canonical miRNA (defined as the most abundant templated read) and 3' isomiRs, as well as percentages of tailing and trimming. Long tail composition was calculated by counting the number of non-templated nucleotides present in the tail of each isomiR read. Reads with an equal number of non-templated nucleotides in the tail were added together, and the cumulative distribution was calculated for all the oligo-tailed isomiRs, going from those with longer tails to those with shorter tails.
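The cumulative distribution described above can be sketched as follows. This is a Python illustration under stated assumptions, not the R scripts themselves: the input format (pairs of tail length and read count) and the function name are hypothetical.

```python
from collections import defaultdict


def cumulative_tail_distribution(reads):
    """Pool reads by tail length, then accumulate fractions from longest to shortest tail.

    reads: iterable of (tail_length, read_count) pairs (hypothetical input format).
    Returns a list of (tail_length, cumulative_fraction) tuples.
    """
    totals = defaultdict(int)
    for tail_length, count in reads:
        totals[tail_length] += count  # reads with equal tail length are added together

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    cumulative = 0
    distribution = []
    for tail_length in sorted(totals, reverse=True):  # longer tails first
        cumulative += totals[tail_length]
        distribution.append((tail_length, cumulative / grand_total))
    return distribution


# Made-up counts, for illustration only.
example = [(19, 100), (7, 100), (7, 100), (3, 100), (2, 100)]
for tail_length, fraction in cumulative_tail_distribution(example):
    print(tail_length, fraction)
```

Each output row gives the fraction of tailed reads whose tail is at least that long, so the last row always reaches 1.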
Tumor samples from TCGA bearing genomic mutations in either AGO1 or AGO2 that lead to missense and synonymous amino acid changes were identified from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/, accessed in May 2019). GDC uses combined reports from several variant callers (MuTect2, VarScan, MuSE and SomaticSniper).
Selected case IDs were: P295L TCGA-53-A4EZ, R315M TCGA-HU-A4G8 and E299K TCGA-Z6-A8JE (AGO2), F310L TCGA-94-7033 (AGO1). The selected patient samples were also analyzed using QuagmiR, after converting the BAM files to FASTQ files with Picard Sam-to-Fastq, using Amazon cloud instances through the Seven Bridges Genomics implementation of the NCI Cancer Genomics Cloud. Mutations were mapped onto the PDB structures of AGO1 and AGO2 using PyMOL.
- AGO2
- AGO1
The examples shown here only illustrate the logic of the analysis and calculations implemented in the R scripts.
miRBase reference
>hsa-miR-7-5p MIMAT0000252 (mature miRNA)
UGGAAGACUAGUGAUUUUGUUGUU
>hsa-mir-7-1 MI0000263 (pri-miRNA paralog 1)
<--mature-miRNA--------><---------templated (genomic reference)---------------------->
UGGAAGACUAGUGAUUUUGUUGUUUUUAGAUAACUAAAUCGACAACAAAUCACAGUCUGCCAUAUGGCACAGGCCAUGCCUCUACAG
>hsa-mir-7-2 MI0000264 (pri-miRNA paralog 2)
<--mature-miRNA--------><---------templated (genomic reference)--------------->
UGGAAGACUAGUGAUUUUGUUGUUGUCUUACUGCGCUCAACAACAAAUCCCAGUCUACCUAAUGGUGCCAGCCAUCGCA
>hsa-mir-7-3 MI0000265 (pri-miRNA paralog 3)
<--mature-miRNA--------><---------templated (genomic reference)---------------->
UGGAAGACUAGUGAUUUUGUUGUUCUGAUGUACUACGACAACAAGUCACAGCCGGCCUCAUAGCGCAGACUCCCUUCGAC
Minimum number of "N" nucleotides in tail
Example long-tailed read:
<--templated-----------><--non-templated-->
UGGAAGACUAGUGAUUUUGUUGUUUUUUUUUAAUUUUGUCUUU
........................UUUUUUUAAUUUUGUCUUU
Number of U in tail: 15
Number of A in tail: 2
Number of G in tail: 1
Number of C in tail: 1
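The counts above can be reproduced with a short Python sketch. It assumes, for simplicity, that the read starts with the complete templated sequence (no trimming); the function name is hypothetical and this is not the R code used in the analysis.

```python
from collections import Counter

TEMPLATED = "UGGAAGACUAGUGAUUUUGUUGUU"  # hsa-miR-7-5p mature sequence from the example
READ = "UGGAAGACUAGUGAUUUUGUUGUUUUUUUUUAAUUUUGUCUUU"  # example long-tailed read


def tail_composition(read, templated):
    """Return the tail and its nucleotide counts.

    Simplifying assumption: the read begins with the full templated sequence.
    """
    if not read.startswith(templated):
        raise ValueError("read does not start with the templated sequence")
    tail = read[len(templated):]
    counts = Counter(tail)
    return tail, {nucleotide: counts.get(nucleotide, 0) for nucleotide in "UAGC"}


tail, composition = tail_composition(READ, TEMPLATED)
print(tail)         # UUUUUUUAAUUUUGUCUUU
print(composition)  # {'U': 15, 'A': 2, 'G': 1, 'C': 1}
```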
Weighted Average of the Minimum number of U in oligo-tail
Example long-tailed reads:
<--templated-----------><--non-templated-->  U_in_tail  Counts  Fraction  Weighted_U_in_tail
UGGAAGACUAGUGAUUUUGUUGUU
........................UUUUUUUAAUUUUGUCUUU  15         100     0.2       3
........................UUUAUUU              6          100     0.2       1.2
........................UUUUUUU              7          100     0.2       1.4
........................UUU                  3          100     0.2       0.6
........................UU                   2          100     0.2       0.4
Weighted Average of the Minimum number of U in oligo-tail: 6.6
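The same arithmetic as a Python sketch: each tail's U count is weighted by that read's share of the total counts. The function name and input format are hypothetical, and the tails and counts are the ones from the example above.

```python
def weighted_average_u(tails_with_counts):
    """Weighted average of the number of U in each tail, weighted by read counts.

    tails_with_counts: list of (tail_sequence, read_count) pairs (hypothetical format).
    """
    total_counts = sum(count for _, count in tails_with_counts)
    if total_counts == 0:
        raise ValueError("no reads to average")
    weighted_sum = sum(tail.count("U") * count for tail, count in tails_with_counts)
    return weighted_sum / total_counts


example_tails = [
    ("UUUUUUUAAUUUUGUCUUU", 100),  # 15 U
    ("UUUAUUU", 100),              # 6 U
    ("UUUUUUU", 100),              # 7 U
    ("UUU", 100),                  # 3 U
    ("UU", 100),                   # 2 U
]
print(round(weighted_average_u(example_tails), 2))  # 6.6
```

Dividing the weighted sum by the total counts is equivalent to summing the per-read `Fraction * U_in_tail` terms listed in the example.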
Target RNAs with extensive 3' pairing to a miRNA could dislodge the 3' end of the miRNA from the PAZ domain and thereby induce decay through trimming and tailing. Such target RNAs were predicted bioinformatically with the following algorithm:
- RNAs with a 7mer seed were selected from TargetScan7.2 list of human 3'UTRs.
- RNAduplex from the ViennaRNA Package 2.0 was used to calculate the minimum free energy (MFE) of hybridization between each miRNA and target RNA.
- MFE for each miRNA-RNA hybrid was plotted against the abundance of the target RNA in HEK293 cells, as previously reported by Yang et al. Mol Cell (2019), data available at GEO:GSE121327.
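As an illustration of the first step only, the Python sketch below shows what a 7mer seed match looks like for hsa-miR-7-5p, using the two standard 7mer site types (7mer-m8: pairing to miRNA positions 2-8; 7mer-A1: pairing to positions 2-7 followed by an A in the target). In the analysis the targets were taken from the TargetScan7.2 list, not computed this way; the function names and the toy 3'UTR are hypothetical, and the RNAduplex MFE step is not reproduced here.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence (A, C, G, U only)."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna):
    """Target-side sequences (5'->3') of the two 7mer site types."""
    return {
        "7mer-m8": reverse_complement(mirna[1:8]),        # miRNA positions 2-8
        "7mer-A1": reverse_complement(mirna[1:7]) + "A",  # positions 2-7, then A
    }


def find_7mer_sites(mirna, utr):
    """Return (site_type, start_index) for every 7mer site found in the 3'UTR."""
    hits = []
    for site_type, site in seed_sites(mirna).items():
        start = utr.find(site)
        while start != -1:
            hits.append((site_type, start))
            start = utr.find(site, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


MIR_7_5P = "UGGAAGACUAGUGAUUUUGUUGUU"
toy_utr = "CCCGUCUUCCGGG"  # made-up sequence containing one 7mer-m8 site
print(seed_sites(MIR_7_5P))
print(find_7mer_sites(MIR_7_5P, toy_utr))
```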
- QuagmiR: A Cloud-based Application for IsomiR Big Data Analytics. Bofill-De Ros X, Chen K, Chen S, Tesic N, Randjelovic D, Skundric N, Nesic S, Varjacic V, Williams EH, Malhotra R, Jiang M, Gu S. Bioinformatics. 2018 Oct 8. doi: 10.1093/bioinformatics/bty843.
- The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized-A New Paradigm in Large-Scale Computational Research. Lau JW, Lehnert E, Sethi A, Malhotra R, Kaushik G, Onder Z, Groves-Kirkby N, Mihajlovic A, DiGiovanna J, Srdic M, Bajcic D, Radenkovic J, Mladenovic V, Krstanovic D, Arsenijevic V, Klisic D, Mitrovic M, Bogicevic I, Kural D, Davis-Dusenbery B; Seven Bridges CGC Team. Cancer Res. 2017 Nov 1;77(21).
- ViennaRNA Package 2.0. Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. Algorithms Mol Biol. 2011 Nov 24;6:26. doi: 10.1186/1748-7188-6-26. PubMed PMID: 22115189; PubMed Central PMCID: PMC3319429.