7.6 years ago
Ekarl2 ▴ 120

Hi,

I have a de novo RNA-seq dataset on which I ran RSEM at the gene level, obtaining FPKM values for all Trinity genes in the assembly. Unfortunately, the assembler split a real gene (say, a transporter) into two separate Trinity genes, so the first Trinity gene contains the first half of the sequence and the second Trinity gene contains the second half. Kind of like this:

-------------------------------------------------
|--------------------|   |------------------------|


I have separate FPKM values for both of these Trinity genes. How can I calculate a combined FPKM for the two? Can FPKM values simply be added, so that FPKM(tot) = FPKM(Trinity gene 1) + FPKM(Trinity gene 2)? Or does it require a more complicated procedure?

I ask because I am not sure whether it matters in which order the normalization (for length and library size) and the combination of expression data from the two Trinity genes are done. Are these operations commutative? If not, is it better to first add the raw expression values from the two Trinity genes and then normalize?

FPKM RNA-seq • 2.5k views
7.6 years ago
michael.ante ★ 3.8k

Let's say you have 1M reads in total, and two assemblies (1kb and 2kb). Each assembly has 100 reads:

FPKM1 = 100/(1*1) = 100
FPKM2 = 100/(2*1) = 50


The sum would be 150.

Merging the two assemblies into a single 3 kb assembly (plus a little gap in between):

FPKM3 = 200/(3*1) = 66.67


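The two results differ (150 vs. 66.67), so FPKM values cannot simply be added; you need to re-calculate your expression values from the combined read count and the combined length. If you have several of these wrongly separated assemblies, you could use bedtools cluster to group all assemblies that lie within a certain distance of each other.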
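The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the numbers from the example: the helper `fpkm` is a hypothetical name, and it uses plain sequence lengths and raw read counts, whereas RSEM works with effective lengths and expected counts, so real values will differ somewhat.

```python
def fpkm(reads, length_kb, total_reads_millions):
    """Reads per kilobase of sequence per million reads in the library.

    Hypothetical helper for illustration; uses plain length, not the
    effective length that RSEM reports.
    """
    return reads / (length_kb * total_reads_millions)


total_m = 1.0               # 1M reads in total, as in the example
len1_kb, len2_kb = 1.0, 2.0  # the two assemblies (1 kb and 2 kb)
reads1, reads2 = 100, 100    # 100 reads on each assembly

fpkm1 = fpkm(reads1, len1_kb, total_m)
fpkm2 = fpkm(reads2, len2_kb, total_m)

# Adding the two normalized values
naive_sum = fpkm1 + fpkm2

# Adding raw reads and lengths first, then normalizing
merged = fpkm(reads1 + reads2, len1_kb + len2_kb, total_m)

# A length-weighted mean of the separate FPKMs gives the merged value,
# which is why the order of "normalize" and "combine" matters.
weighted = (len1_kb * fpkm1 + len2_kb * fpkm2) / (len1_kb + len2_kb)

print(f"FPKM1={fpkm1:.2f} FPKM2={fpkm2:.2f}")
print(f"sum={naive_sum:.2f} merged={merged:.2f} weighted={weighted:.2f}")
```

Because the merged value equals the length-weighted mean of the two separate values, it can be recovered from the per-gene FPKMs and lengths alone when the raw counts are not at hand, under the same plain-length assumption.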

Cheers,
Michael


This was very helpful! Does the same apply to cases where I would like to combine FPKM for two separate paralogs that are both full length (say 2000 nt each)? In those cases, I do not see an intuitive way to calculate a new combined length. Would I just divide the combined number of reads by 2*1 (assuming 1M reads in total) instead of 4*1?