Question

miRNAseq isoform-to-mature miRNA quantification

8

9.9 years ago

TJ ▴ 80

Hello,

I have a conceptual question about miRNAseq data from TCGA and the relationship between isoform quantification and gene expression. I want to get the total number of reads for each mature miRNA sequence. So, for hsa-let-7a-5p (MIMAT0000062), which has 3 isoforms (i.e. let-7a-1, let-7a-2, and let-7a-3), I want to sum the read counts across the isoforms to get an aggregate number.

I have downloaded the PRAD data from GDAC Firehose, which is in TCGA format but has all samples in the same file. When I sum the read counts for each miR isoform, and sum the reads-per-million the same way, the summed RPM matches the GDAC Firehose mature pre-process file, which has RPM data for each mature miRNA. This suggests I am analyzing the data correctly to get count data, assuming the Broad people know their stuff. I've also read a paper that does the same. What I don't understand is how an identical sequence can be mapped back to a unique location, and I don't want to double- or triple-count (in this case) reads for each isoform when assigning counts to a mature miRNA.

For example, these regions for let-7a-5p are identical mature miRNA sequences:

hsa-let-7a-1 isoform: hg19:9:96938244-96938265:+ (21362 reads in the miR isoform file)

hsa-let-7a-3 isoform: hg19:22:46508632-46508653:+ (21189 reads in the miR isoform file)

In fact, there are no identical read counts between the two isoforms, and these reads are not flagged as cross-mapped. How are the reads assigned to the correct isoform?

Why are the read counts not identical, since the sequences are identical and of the same length? There are reads for hg19:9:96938244-96938266:+ and other adjacent sequences, so it is not like additional nucleotides are being used to help map the sequence.

I have searched Biostars, the TCGA website, the GDAC website, and Google to no avail. I even read the data processing description on the Synapse website, but that didn't help, given that the reads are not flagged as cross-mapped.

If nobody knows the answer, any ideas on where to ask next?

Thanks in advance!

RNA-Seq • 6.0k views

Ram · Answer 1 · 2015-07-29

6

8.7 years ago

r.ptashkin ▴ 60

This may be useful in working with isoform level miRNASeq data from TCGA: https://github.com/rptashkin/TCGA_miRNASeq_matrix

Ram · Answer 2 · 2017-10-19

I think there is a misconception here. According to miRBase (http://www.mirbase.org/help/nomenclature.shtml), "Distinct precursor sequences and genomic loci that express identical mature sequences get names of the form hsa-mir-121-1 and hsa-mir-121-2". So let-7a-1, let-7a-2, and let-7a-3 are stem-loop (precursor) sequences from different genomic loci, not isoforms of the mature miRNA.

Isoforms of miRNAs (isomiRs) are variants of the mature sequence that differ by one or two nucleotides at either end, or that have a few nucleotides substituted. "IsomiRs appear as a variation in length from the canonical sequence annotated in miRBase, due to an addition or deletion of one or more nucleotides at the 5' or 3' ends or both." (https://www.ncbi.nlm.nih.gov/pubmed/26277662). This Nature Reviews article has a more comprehensive definition of miRNA isoforms (https://www.nature.com/articles/nrm3611).
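To make the length-variation definition concrete, here is a rough sketch that compares two coordinate strings in the format quoted in the question and reports how many nucleotides were added or removed at each end. It assumes, purely for illustration, that the 22-nt locus hg19:9:96938244-96938265:+ is the canonical mature form. The helper names `parse_coords` and `end_offsets` are made up for this example.

```python
from typing import NamedTuple, Tuple


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_coords(coords: str) -> Locus:
    """Parse a string like 'hg19:9:96938244-96938265:+' into a Locus."""
    _build, chrom, span, strand = coords.split(":")
    start, end = (int(x) for x in span.split("-"))
    return Locus(chrom, start, end, strand)


def end_offsets(canonical: str, observed: str) -> Tuple[int, int]:
    """Return (5' offset, 3' offset) of `observed` relative to `canonical`.

    Positive values mean extra nucleotides at that end, negative values
    mean trimmed nucleotides. Which genomic coordinate is the 5' end
    depends on the strand.
    """
    ref, obs = parse_coords(canonical), parse_coords(observed)
    if (ref.chrom, ref.strand) != (obs.chrom, obs.strand):
        raise ValueError("loci are on different chromosomes or strands")
    left = ref.start - obs.start   # extension towards lower coordinates
    right = obs.end - ref.end      # extension towards higher coordinates
    return (left, right) if ref.strand == "+" else (right, left)


# Assumed canonical locus versus the adjacent locus mentioned in the question.
offsets = end_offsets("hg19:9:96938244-96938265:+",
                      "hg19:9:96938244-96938266:+")
print(offsets)
```

With these inputs the second locus has the same 5' end and one extra nucleotide at the 3' end, i.e. it would count as a 3' isomiR of the same mature miRNA under the definition above.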

Therefore I think the correct way to process the isoform data is to sum (or take the max of) all counts associated with each mature miRNA ID (e.g. MIMAT0000062), regardless of which stem-loop each row is attributed to. Here is my script: https://github.com/teng-gao/genomics_utils/blob/master/README.md#process-tcga-mirnaseq-isoform-quantifications
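For anyone who just wants the idea without reading the linked script, here is a minimal sketch of the sum variant. It is not the linked script. It assumes a per-sample, tab-separated isoform table with a `read_count` column and a `miRNA_region` column whose mature rows look like `mature,MIMAT0000062`; check those names against your own file, and note that the all-samples-in-one-file Firehose layout would need splitting by sample first. The two let-7a rows use the counts quoted in the question; the third row is invented to show that non-mature rows are skipped.

```python
import csv
import io
from collections import defaultdict


def sum_mature_counts(handle):
    """Sum read counts per mature miRNA accession (MIMAT ID).

    Rows whose region is not annotated as 'mature,<accession>' are ignored.
    Column names are assumptions about the input file.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        region = row["miRNA_region"]
        if not region.startswith("mature,"):
            continue
        accession = region.split(",", 1)[1]
        totals[accession] += int(row["read_count"])
    return dict(totals)


# Tiny in-memory example; the 'precursor' row is hypothetical.
EXAMPLE = (
    "miRNA_ID\tisoform_coords\tread_count\tmiRNA_region\n"
    "hsa-let-7a-1\thg19:9:96938244-96938265:+\t21362\tmature,MIMAT0000062\n"
    "hsa-let-7a-3\thg19:22:46508632-46508653:+\t21189\tmature,MIMAT0000062\n"
    "hsa-let-7a-1\thg19:9:96938200-96938220:+\t5\tprecursor\n"
)

totals = sum_mature_counts(io.StringIO(EXAMPLE))
print(totals)
```

Grouping on the MIMAT accession rather than on the stem-loop name is what collapses let-7a-1, let-7a-2, and let-7a-3 into a single let-7a-5p total. Swapping the `+=` for a `max(...)` update would give the max variant instead.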

Please correct me if you think I'm wrong!