Question: miRNAseq isoform-to-mature miRNA quantification
TJ (United States) wrote, 6.7 years ago:


I have a conceptual question regarding miRNAseq data from TCGA and the relationship between isoform quantitation and gene expression. I want to get the total number of reads for each mature miRNA sequence. So, for hsa-let-7a-5p (MIMAT0000062), which has 3 isoforms (i.e. let-7a-1, let-7a-2, and let-7a-3), I want to sum the read counts for each isoform to get an aggregate number.

I have downloaded data for PRAD from GDAC Firehose, which is in TCGA format, but all samples are in the same file. When I sum the read counts for each miR isoform, and the reads-per-million in the same manner, the latter matches the GDAC Firehose mature pre-process file, which has RPM data for a particular mature miRNA. This suggests I am analyzing the data correctly to get count data, assuming the Broad people know their stuff. I've read a paper that does the same.

What I don't understand is how it is possible to map an identical sequence back to a unique location, and I don't want to double- or triple-count (in this case) for each isoform when assigning counts to a mature miRNA.

For example, these regions for let-7a-5p are identical mature miRNA sequences:

hsa-let-7a-1 isoform: hg19:9:96938244-96938265:+ (21362 reads in the miR isoform file)

hsa-let-7a-3 isoform: hg19:22:46508632-46508653:+ (21189 reads in the miR isoform file)

In fact, there are no identical read counts between the two isoforms, and these reads are not flagged as cross-mapped.  How are the reads assigned to the correct isoform?

Why are the read counts not identical, since the sequences are identical and of the same length?  There are reads for hg19:9:96938244-96938266:+ and other adjacent sequences, so it is not like additional nucleotides are being used to help map the sequence. 
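To make the comparison above concrete, here is a minimal sketch that splits the coordinate strings quoted in this post into their parts. The function name `parse_isoform_coords` is hypothetical, and the `build:chrom:start-end:strand` layout is inferred only from the strings shown here. Whether the end coordinate is inclusive is not assumed, so only `end - start` is compared.

```python
def parse_isoform_coords(coords):
    """Split 'hg19:9:96938244-96938265:+' into (build, chrom, start, end, strand)."""
    build, chrom, span, strand = coords.split(":")
    start, end = (int(x) for x in span.split("-"))
    return build, chrom, start, end, strand


# The three coordinate strings quoted in the question.
rows = [
    "hg19:9:96938244-96938265:+",
    "hg19:22:46508632-46508653:+",
    "hg19:9:96938244-96938266:+",
]

parsed = [parse_isoform_coords(c) for c in rows]

# Raw coordinate difference per row (no assumption about inclusive ends).
spans = [end - start for _, _, start, end, _ in parsed]

# Each row is a distinct (chrom, start, end) locus, so the file tracks them separately.
loci = {(chrom, start, end) for _, chrom, start, end, _ in parsed}

print(spans)
print(len(loci))
```

The first two rows have the same span but sit on different chromosomes, and the third shares a start with the first but extends one position further, which is why each appears as its own row with its own count.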

I have searched Biostars, the TCGA website, the GDAC website, and Google to no avail. I even read the data processing description on the Synapse website, but that didn't help me, given that the reads are not flagged as cross-mapped.

If nobody knows the answer, any ideas on where to ask next?

Thanks in advance!

r.ptashkin (United States) wrote, 5.5 years ago:

This may be useful in working with isoform level miRNASeq data from TCGA:

gaoteng wrote, 3.3 years ago:

I think there is a misconception here. According to miRBase, "Distinct precursor sequences and genomic loci that express identical mature sequences get names of the form hsa-mir-121-1 and hsa-mir-121-2". So let-7a-1, let-7a-2, and let-7a-3 are stem-loop precursor sequences, not isoforms of the mature miRNA.

Isoforms of miRNAs (isomiRs) are simply variants that differ by one or two nucleotides at either end of the mature sequence, or that have a few nucleotides substituted. "IsomiRs appear as a variation in length from the canonical sequence annotated in miRBase, due to an addition or deletion of one or more nucleotides at the 5' or 3' ends or both." This Nature article has a more comprehensive definition of miRNA isoforms.

Therefore I think the correct way to process the isoform data is to take the max or sum for all counts associated with each mature transcript ID. Here is my script:

Please correct me if you think I'm wrong!
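For illustration only, here is a minimal sketch of the "sum or max per mature ID" idea described in this answer. It is not the script referred to above. It assumes a tab-separated isoform file with `read_count` and `miRNA_region` columns, where mature rows carry a value such as `mature,MIMAT0000062`. Those column names, the function name `aggregate_mature_counts`, and the last two example rows (counts 500 and 7) are assumptions made up for the example. Only the first two counts come from the question.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in an assumed TCGA-style isoform layout.
# Rows 1-2 use the counts quoted in the question; rows 3-4 are invented.
EXAMPLE = """miRNA_ID\tisoform_coords\tread_count\tcross-mapped\tmiRNA_region
hsa-let-7a-1\thg19:9:96938244-96938265:+\t21362\tN\tmature,MIMAT0000062
hsa-let-7a-3\thg19:22:46508632-46508653:+\t21189\tN\tmature,MIMAT0000062
hsa-let-7a-1\thg19:9:96938244-96938266:+\t500\tN\tmature,MIMAT0000062
hsa-let-7a-1\thg19:9:96938230-96938240:+\t7\tN\tprecursor
"""


def aggregate_mature_counts(handle):
    """Return {mature accession: {"sum": total reads, "max": largest single row}}."""
    totals = defaultdict(lambda: {"sum": 0, "max": 0})
    for row in csv.DictReader(handle, delimiter="\t"):
        region = row["miRNA_region"]
        if not region.startswith("mature,"):
            continue  # skip rows not annotated as mature (e.g. precursor)
        accession = region.split(",", 1)[1]
        count = int(row["read_count"])
        totals[accession]["sum"] += count
        totals[accession]["max"] = max(totals[accession]["max"], count)
    return dict(totals)


result = aggregate_mature_counts(io.StringIO(EXAMPLE))
print(result)
```

Grouping on the mature accession rather than on the stem-loop name means reads assigned to different precursor loci still land in the same bucket. Whether the sum or the max is the appropriate summary remains the open question raised in this answer.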
