How to do FPKM differential analysis?
2
4
Entering edit mode
3.0 years ago

Hello, I want to do diffrential analysis on rnaseq FPKM data (can't get count). DEseq and edgR need raw count as input, so how should I do? Is it suitable for limma package?

RNA-Seq FPKM DEseq limma • 8.7k views
3
Entering edit mode

Yes you can use limma although It is not as good as if you had counts - so double check you cannot get the counts and try writing the people behind the data to see if they will provide the counts (most people are quite friendly in my experince). If you cannot get the counts you can log2 transform the FPKM values (use a pseudocount of 1) and use the limma trend approach described on page 71 of the limma vignette. For a comparison of limma-trend vs limma-voom take a look at this article.

Hope this helps

0
Entering edit mode

Thanks a lot. It really helps☺

0
Entering edit mode

Hi, You can get read count or No. of mapped reads by using samtools idxstats command, where the 3rd column in the output file is the read count.

The command is :

\$ samtools idxstats input.bam > output.txt

(Note: The input bam is the alignment bam file generated by any alignment tool (Hisat2) or assembly tool (Spades) which is coordinate sorted and indexed)

3
Entering edit mode

I think that you may have mis-understood the question. The user does not want total and aligned read counts per sample. They need raw read counts per gene per sample.

4
Entering edit mode
3.0 years ago

I agree with Lila - you should not perform differential expression analysis on FPKM expression levels. If you don't believe me, then take it from the developer of limma, where some suggestions are also made: https://support.bioconductor.org/p/56275/#56299

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Remarkably, I still see publications coming out where people are comparing groups of samples based on FPKM counts, even though this makes no sense. As an example, FPKM of 10 in one sample may be the equivalent of 50 in another, due to the way that FPKM counts are produced, i.e., with no cross-library / sample normalisation.

0
Entering edit mode

If you cannot get counts-limma trend it is a viable approach (see the voom article by the limma authors). Please also note that the Bioconductor post you refrence only states you should not use limma-voom on FPKM values. Actually in that exact post Gordon highlights (as option 3) if you only have FPKM you can use limma-trend via a log2 transformation with a pseudocount.

If you have different library sizes you are right they are not comparable (also what the voom article shows) but then you can just do a inter-library normalization of the FPKM counts first (exactly like you are doing for count data).

1
Entering edit mode

Yes, but, that is the third option in a list that is ordered "in decreasing order of desirability"

2
Entering edit mode
3.0 years ago
Lila M ★ 1.2k

Hi, I do not recommend to use FPKM for DE analysis. Is better to use the normalized counts (vst/vsd) generated by DESeq2. (For more info have a look in here). However, there is a function in DESeq2 called fpkm() that allow you to do that

1
Entering edit mode

Is better to use the normalized counts (vst/vsd) generated by DESeq2.

Better for what? DE?

However, there is a function in DESeq2 called fpkm() that allow you to do that

This is only for calculation of FPKM not for DE.

0
Entering edit mode

In most published articles and studies is hardly recommended to use normalized counts rather than fpkm for DE analysis. You do not agree with that?

0
Entering edit mode

Absolutely not. Please list the respective articles. FPKM regularily fails in benchmarking studies towards differential analysis. Use the established tools to perform differential analysis. The only true advantage of per-million scaling is that you can later add new samples without that norm. counts of other samples change which is not the case with e.g. TMM from edgeR. This is why databases usually use TPM or RPKM/FPKM to illustrate counts.+

Still, it is not that per-million scaling is completely useless. If library compositions between samples do not dramatically change, a simple FPKM might be sufficient. If I am not wrong cuffdiff used FPKM as the standard normalization method and I've seen datasets where agreement between FPKM values and TMM was pretty good. Still, as there are advanced methods, why not using them. I always start from raw counts as this ensures that I can stand up for the lowlevel pipeline. if you are provided with FPKM you have no idea how they were created.

0
Entering edit mode

Hi, If I understand right, you mean I can merge two FPKM data from two batches without considering the batch effect? I have two batch rnaseq FPKM data now (counts not avaliable) and I want to merge them into a whole dataset. I know Deseq can deal with it by treat batch as variable for counts data. So what about FPKM? Thanks