Question

How to do FPKM differential analysis?

5

Entering edit mode

4.8 years ago

zhaoliang0302 ▴ 50

Hello, I want to do diffrential analysis on rnaseq FPKM data (can't get count). DEseq and edgR need raw count as input, so how should I do? Is it suitable for limma package?

RNA-Seq FPKM DEseq limma • 15k views

ADD COMMENT • link updated 19 months ago by ATpoint 81k • written 4.8 years ago by zhaoliang0302 ▴ 50

3

Entering edit mode

Yes you can use limma although It is not as good as if you had counts - so double check you cannot get the counts and try writing the people behind the data to see if they will provide the counts (most people are quite friendly in my experince). If you cannot get the counts you can log2 transform the FPKM values (use a pseudocount of 1) and use the limma trend approach described on page 71 of the limma vignette. For a comparison of limma-trend vs limma-voom take a look at this article.

Hope this helps

ADD REPLY • link 4.8 years ago by Kristoffer Vitting-Seerup ★ 4.0k

2

Entering edit mode

Page 71, limma, where do they state you can use rpm/fpkm/tpm for differential expression analysis.

15.4 Differential expression: limma-trend

If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential exis to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold. In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:

> logCPM <- cpm(dge, log=TRUE, prior.count=3)

The prior count is used here to damp down the variances of logarithms of low counts. The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:

> fit <- lmFit(logCPM, design)
> fit <- eBayes(fit, trend=TRUE)
> topTable(fit, coef=ncol(design))

Or, to give more weight to fold-changes in the gene ranking, one might use:

> fit <- lmFit(logCPM, design)
> fit <- treat(fit, lfc=log2(1.2), trend=TRUE)
> topTreat(fit, coef=ncol(design))

15.5 Differential expression: voom

When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:

v <- voom(dge, design, plot=TRUE)

The voom transformation uses the experiment design matrix, and produces an EList object. It is also possible to give a matrix of counts directly to voom without TMM normalization, by

> v <- voom(counts, design, plot=TRUE)

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:

> v <- voom(counts, design, plot=TRUE, normalize="quantile")

After this, the usual limma pipelines for differential expression can be applied, for example:

> fit <- lmFit(v, design)
> fit <- eBayes(fit)
> topTable(fit, coef=ncol(design))

Or, to give more weight to fold-changes in the ranking, one could use say:

> fit <- treat(fit, lfc=log2(1.2))
> topTreat(fit, coef=ncol(design))

ADD REPLY • link 19 months ago by DareDevil ★ 4.3k

1

Entering edit mode

That is wrong. If you read that closely you see they use cpm. That is a critical difference because CPM is the properly-normalized counts without length correction is fine for limma-trend while length-corrected data such as FPKM completely distort the mean-variance trend which is the whole point of methods such as limma which gain power by sharing information across genes. At no point they recommend FPKM for anything.

ADD REPLY • link 19 months ago by ATpoint 81k

0

Entering edit mode

Thanks a lot. It really helps☺

ADD REPLY • link 4.8 years ago by zhaoliang0302 ▴ 50

0

Entering edit mode

Hi, You can get read count or No. of mapped reads by using samtools idxstats command, where the 3rd column in the output file is the read count.

The command is :

$ samtools idxstats input.bam > output.txt

(Note: The input bam is the alignment bam file generated by any alignment tool (Hisat2) or assembly tool (Spades) which is coordinate sorted and indexed)

Hope this will help you!!

ADD REPLY • link 4.8 years ago by krishnajandhyalaa • 0

3

Entering edit mode

I think that you may have mis-understood the question. The user does not want total and aligned read counts per sample. They need raw read counts per gene per sample.

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

score 4 · Answer 1 · 2019-07-16

I agree with Lila - you should not perform differential expression analysis on FPKM expression levels. If you don't believe me, then take it from the developer of limma, where some suggestions are also made: https://support.bioconductor.org/p/56275/#56299

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Remarkably, I still see publications coming out where people are comparing groups of samples based on FPKM counts, even though this makes no sense. As an example, FPKM of 10 in one sample may be the equivalent of 50 in another, due to the way that FPKM counts are produced, i.e., with no cross-library / sample normalisation.

score 2 · Answer 2 · 2019-07-16

2

Entering edit mode

4.8 years ago

Lila M ★ 1.2k

Hi, I do not recommend to use FPKM for DE analysis. Is better to use the normalized counts (vst/vsd) generated by DESeq2. (For more info have a look in here). However, there is a function in DESeq2 called fpkm() that allow you to do that

ADD COMMENT • link 4.8 years ago by Lila M ★ 1.2k

1

Entering edit mode

Is better to use the normalized counts (vst/vsd) generated by DESeq2.

Better for what? DE?

However, there is a function in DESeq2 called fpkm() that allow you to do that

This is only for calculation of FPKM not for DE.

ADD REPLY • link 4.8 years ago by ATpoint 81k

0

Entering edit mode

In most published articles and studies is hardly recommended to use normalized counts rather than fpkm for DE analysis. You do not agree with that?

ADD REPLY • link 4.8 years ago by Lila M ★ 1.2k

0

Entering edit mode

Absolutely not. Please list the respective articles. FPKM regularily fails in benchmarking studies towards differential analysis. Use the established tools to perform differential analysis. The only true advantage of per-million scaling is that you can later add new samples without that norm. counts of other samples change which is not the case with e.g. TMM from edgeR. This is why databases usually use TPM or RPKM/FPKM to illustrate counts.+

Still, it is not that per-million scaling is completely useless. If library compositions between samples do not dramatically change, a simple FPKM might be sufficient. If I am not wrong cuffdiff used FPKM as the standard normalization method and I've seen datasets where agreement between FPKM values and TMM was pretty good. Still, as there are advanced methods, why not using them. I always start from raw counts as this ensures that I can stand up for the lowlevel pipeline. if you are provided with FPKM you have no idea how they were created.

ADD REPLY • link 4.8 years ago by ATpoint 81k

0

Entering edit mode

Hi, If I understand right, you mean I can merge two FPKM data from two batches without considering the batch effect? I have two batch rnaseq FPKM data now (counts not avaliable) and I want to merge them into a whole dataset. I know Deseq can deal with it by treat batch as variable for counts data. So what about FPKM? Thanks

ADD REPLY • link 4.8 years ago by zhaoliang0302 ▴ 50