isoform composition (psi) question
18 months ago
shinyjj ▴ 50

Here are a couple of thoughts on what I am trying to figure out. I want to estimate the proportion of isoforms (psi) in a biological sample given RNA-seq data. Basically, I'm trying to simulate a psi value.

What would be a good way to know the realistic psi? How do tools like salmon and kallisto measure psi?

What I understand is that these applications don’t really know “true” psi except by alignment to the reference. They often multi-map overlapping portions of transcript spliceoforms to infer relative expression.

Is it really impossible to know without long-read sequencing? Is it all a guess?

18 months ago

Okay, firstly PSI does not measure isoform composition. I don't know if there is a standard term for transcript composition.

PSI (percent spliced in) measures the usage of an individual splice junction. It is measured as the number of reads that support the usage of a junction divided by the sum of (the number of reads that support the usage of the junction and the number of reads that are not compatible with the usage of the junction).

Consider the following gene and transcript expressions:

Transcript A:    |>>>1>>>|-------|>>>2>>>|------------|>>3>>>|---------|>>>4>>>>>>>>>>>>>>>>5>>>>>>>>|  TPM = 10
Transcript B:    |>>>1>>>|----------------------------|>>3>>>|---------|>>>4>>>>>>>|                    TPM = 7
Transcript C:    |>>>1>>>|-------|>>>2>>>|-----------------------------|>>>4>>>>>>>|                    TPM = 3


The composition is 50% transcript A, 35% transcript B and 15% transcript C.

The PSIs are exon 1: 100%, exon 2: 65%, exon 3: 85%, exon part 4: 100% and exon part 5: 50%.
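The numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic for the toy gene; the dictionary names are mine, and it treats exon PSI as the fraction of total TPM coming from transcripts that contain the exon.

```python
# Toy gene from the diagram: TPM per transcript and the exon parts each one contains.
tpm = {"A": 10.0, "B": 7.0, "C": 3.0}
exons = {
    "A": {"1", "2", "3", "4", "5"},
    "B": {"1", "3", "4"},
    "C": {"1", "2", "4"},
}

total = sum(tpm.values())

# Transcript composition: each transcript's share of the gene's total expression.
composition = {t: tpm[t] / total for t in tpm}

# Exon PSI: share of expression from transcripts that include the exon.
psi = {
    e: sum(tpm[t] for t in tpm if e in exons[t]) / total
    for e in ["1", "2", "3", "4", "5"]
}

for t, frac in composition.items():
    print(f"transcript {t}: {frac:.0%}")
for e, value in psi.items():
    print(f"exon {e}: PSI = {value:.0%}")
```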

Measuring PSIs is much easier than measuring transcript composition. As I said, you count the reads that are compatible with a particular splice event and the ones that are incompatible with it, and then PSI = compatible/(compatible + incompatible). There are some corrections to apply, but that is the basic idea. Splice-junction-focused tools such as MISO and rMATS calculate PSI and test for differences in it.
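As a minimal sketch of that basic idea, assuming you already have the two read counts for an event (the function name and the counts are made up for illustration, and none of the corrections the real tools apply are included):

```python
def psi_from_counts(compatible, incompatible):
    """Return PSI as a fraction in [0, 1], or None if no read is informative."""
    total = compatible + incompatible
    if total == 0:
        # No reads cover the event, so PSI is undefined rather than zero.
        return None
    return compatible / total


# Hypothetical counts: 130 reads support inclusion of the exon, 70 skip it.
example_psi = psi_from_counts(130, 70)
print(f"PSI = {example_psi:.2f}")
```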

Measuring transcript composition is much harder. kallisto, Salmon and RSEM work in subtly different ways, but a simplified version (as best as I understand it) is:

Reads are assigned to "compatibility groups" based on which transcripts they are compatible with. So a read mapping entirely to exon 1 or to exon 4 might be in the (A,B,C) group. A read mapping entirely to exon 2 would be in the (A,C) group. A read mapping to both exon 1 and exon 3 (but not exon 2) would be in the (B) group.

Having done this, the tools then use the Expectation Maximisation (EM) method to find the set of transcript expressions that is most compatible with the distribution of reads between "compatibility groups".
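To make the EM step concrete, here is a stripped-down sketch. It is not how any of these tools is actually implemented: it ignores transcript (effective) length, fragment length, bias models and mapping probabilities, and the group counts are invented. It only shows the core loop of splitting each group's reads between its transcripts in proportion to the current abundance estimates, then re-estimating the abundances.

```python
def em_abundances(group_counts, transcripts, n_iter=200):
    """Estimate relative abundances from read counts per compatibility group.

    group_counts maps a tuple of transcript names to a read count.
    Simplified: every transcript is assumed equally likely to produce a read
    per unit of abundance (no length correction).
    """
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: share each group's reads out by current abundance.
        for members, count in group_counts.items():
            denom = sum(theta[t] for t in members)
            if denom == 0:
                continue
            for t in members:
                assigned[t] += count * theta[t] / denom
        # M-step: abundances are the normalised expected read counts.
        total = sum(assigned.values())
        theta = {t: assigned[t] / total for t in transcripts}
    return theta


# Hypothetical read counts per compatibility group for the toy gene.
group_counts = {
    ("A", "B", "C"): 600,  # e.g. reads entirely within exon 1 or exon 4
    ("A", "C"): 130,       # e.g. reads entirely within exon 2
    ("B",): 70,            # e.g. reads spanning the exon 1 to exon 3 junction
    ("A",): 100,           # e.g. reads within exon part 5
}
estimate = em_abundances(group_counts, ["A", "B", "C"])
for name, value in estimate.items():
    print(f"{name}: {value:.3f}")
```

Note that in this toy input transcript C has no group of its own, so its estimate depends entirely on how the shared groups are split. That is the assignment uncertainty these tools have to deal with.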

Is it all a guess?

The answer is yes. And it's well known that it's often not that great a guess. This is why both Salmon and kallisto can produce uncertainty estimates for the quantifications. But it's worse than this, because it's a guess even with long-read sequencing. The problem with long-read sequencing is that the depth is quite low compared to short-read sequencing, so while assignment uncertainty (which transcript did this read come from?) is much lower, sampling error is much higher.
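A rough way to see the sampling-error side of that trade-off is to assume every read is assigned perfectly and treat the number of reads from one transcript as binomial. The read depths below are invented purely for illustration; they are not typical figures for any platform.

```python
import math


def binomial_se(p, n_reads):
    """Standard error of an estimated proportion p from n_reads independent reads."""
    return math.sqrt(p * (1.0 - p) / n_reads)


true_fraction = 0.15  # transcript C in the toy example above

# Hypothetical numbers of reads covering the gene at two sequencing depths.
for label, n_reads in [("low depth", 20), ("high depth", 2000)]:
    se = binomial_se(true_fraction, n_reads)
    print(f"{label}: {n_reads} reads, standard error = {se:.3f}")
```

Under these assumptions, 100 times fewer reads gives a standard error 10 times larger, even though each individual read is unambiguous.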

Pretty much everything in biology is a guess to some degree. We have almost no "true" values for anything quantitative, only values that are close enough to address the hypotheses of interest and values that aren't.