Question

RSEM, interpreting quantification results

0

Entering edit mode

3.4 years ago

argonvibio • 0

Hello everyone!

Last week I have been trying to compare RNA-seq isoform-level quantification to Nanostring data in order to assess the reproducibility between both platforms. I have two samples with both types of data available and a gene signature in which I'm specially interested. The tools I have been using are STAR for mapping and RSEM for quantification, with hg38 as reference genome and hg38.ensGene.gtf (downloaded from UCSC site).

The pipeline runs without problems but the results do not match my expectations at all. For some of this genes, I have observed that the quantification of "expected_counts" and "TPM" is 0, even though when I open the bam files in IGV I can see reads mapping to these isoforms. An example is VEGFA, for which I link a screenshot.

VEGFA isoforms and mapping reads seen with IGV

Reads are clearly mapping to VEGFA, yet some of the isoforms have 0 as their TPM (the one marked in grey for example, which is the largest). This results reproduce when using MapSplice as alignment algorithm instead of STAR.

Why is this happening? Why is RSEM assigning reads to some isoforms and not others when they are so similar? Please help.

Yours truly, Arturo

RNA-Seq rna-seq RSEM • 2.0k views

ADD COMMENT • link updated 3.4 years ago by i.sudbery 19k • written 3.4 years ago by argonvibio • 0

score 4 · Accepted Answer · 2020-12-18

The purpose of RSEM, over a simplier quantification methodology, such as counting reads that overlap the genomic region of the gene, is to estimate which isoforms of a gene reads are coming from.Methods like RSEM, Salmon and Kalisto do this using a statistical approach called expectation maximalization (EM). They basically choose the transcript composition that best explains the data.

In the highlighted transcript in your IGV screen short notice that what sets it apart from other transcripts is its 6th exon. While other transcripts do have exons that overlap this exon (and indeed, shared a 5' end), the 3' exon junction varies. Now note that very few reads map to the region of this exon, and many of those are not compatible with having come from that exon (they overlap the intron as well as the exon). When choosing which of the transcripts - the highlighted one, and the three below it, are the most likely source of the few reads that do overlap this region, RSEM is effectively looking at the ratio of 5' spliced reads to 3' spliced reads - it has probably found that on the evidence of the 3' spliced reads, the other transcripts better account for the reads mapping to the region of the exon than the highlighted transcript. (the most obvious combination would be that there are no reads that support the 3' end of the highlighted transcript, and the number of reads that support the 3' end of the other transcripts sum to the number of reads supporting the shared 5' junction).

Its worth noting that this sort of EM estimation of transcript expression is quite impressive, but it is not 100% accurate. Most these estimates come with error bars, and estimate of transcript expression is definitely not as good as at the gene level.