RSEM, interpreting quantification results
1
0
Entering edit mode
10 months ago
argonvibio • 0

Hello everyone!

Last week I have been trying to compare RNA-seq isoform-level quantification to Nanostring data in order to assess the reproducibility between both platforms. I have two samples with both types of data available and a gene signature in which I'm specially interested. The tools I have been using are STAR for mapping and RSEM for quantification, with hg38 as reference genome and hg38.ensGene.gtf (downloaded from UCSC site).

The pipeline runs without problems but the results do not match my expectations at all. For some of this genes, I have observed that the quantification of "expected_counts" and "TPM" is 0, even though when I open the bam files in IGV I can see reads mapping to these isoforms. An example is VEGFA, for which I link a screenshot.

VEGFA isoforms and mapping reads seen with IGV

Reads are clearly mapping to VEGFA, yet some of the isoforms have 0 as their TPM (the one marked in grey for example, which is the largest). This results reproduce when using MapSplice as alignment algorithm instead of STAR.

Why is this happening? Why is RSEM assigning reads to some isoforms and not others when they are so similar? Please help.

Yours truly, Arturo

RNA-Seq rna-seq RSEM • 515 views
ADD COMMENT
3
Entering edit mode
10 months ago

The purpose of RSEM, over a simplier quantification methodology, such as counting reads that overlap the genomic region of the gene, is to estimate which isoforms of a gene reads are coming from.Methods like RSEM, Salmon and Kalisto do this using a statistical approach called expectation maximalization (EM). They basically choose the transcript composition that best explains the data.

In the highlighted transcript in your IGV screen short notice that what sets it apart from other transcripts is its 6th exon. While other transcripts do have exons that overlap this exon (and indeed, shared a 5' end), the 3' exon junction varies. Now note that very few reads map to the region of this exon, and many of those are not compatible with having come from that exon (they overlap the intron as well as the exon). When choosing which of the transcripts - the highlighted one, and the three below it, are the most likely source of the few reads that do overlap this region, RSEM is effectively looking at the ratio of 5' spliced reads to 3' spliced reads - it has probably found that on the evidence of the 3' spliced reads, the other transcripts better account for the reads mapping to the region of the exon than the highlighted transcript. (the most obvious combination would be that there are no reads that support the 3' end of the highlighted transcript, and the number of reads that support the 3' end of the other transcripts sum to the number of reads supporting the shared 5' junction).

Its worth noting that this sort of EM estimation of transcript expression is quite impressive, but it is not 100% accurate. Most these estimates come with error bars, and estimate of transcript expression is definitely not as good as at the gene level.

ADD COMMENT
0
Entering edit mode

As a small aside: is there any reason to prefer RSEM over Salmon/Kallisto?

ADD REPLY
0
Entering edit mode

I seem to remember that the latest paper from Rob Patro's lab showed that most of the differences between STAR/RSEM and Salmon were down to the STAR part, rather than the RSEM part, and that the latest version of salmon, which includes decoy sequences, is more similar than earlier verisons.

ADD REPLY
0
Entering edit mode

Thank you for your clear and concise answer. It was very helpful!

ADD REPLY

Login before adding your answer.

Traffic: 2531 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6