I want to apply ComBat function in the sva package to an RNA-Seq dataset containing FPKM values. I first added 1 to all counts and then log-transformed the data followed by calling the ComBat function. However, I have no actual zero counts in the cleaned data while there were many zeros in the original data. This is expected since ComBat standardizes the data. All zeros are mapped to values between -0.36 and 4.45 (after exp-transformation and subtracting 1), and there are no exact zeros. However, it is kind of weird to have negative values and also no zero counts in the RNASeq data. So, my question is "what is the best way to use ComBat on RNA-Seq data?". Thanks.
Logging FPKM counts does not make things better. The combination of using ComBat and FPKM data is, in addition, akin to throwing your data in the trash and testing noise.
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.
Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units
The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
Its natural to have negative values after using Combat.
Some of the posts that could be helpful : https://support.bioconductor.org/p/88522/ Remove Batch Effect From RNAseq with SVAseq and Combat Combat normalization returns negative values
Can you specifify whether you are using counts or FPKM ? You did not mention FPKM after first sentence.
Yes, I am using FPKM. So the values are continuous, but there are also actual zeros in the original data containing FPKMs.
Here is a paper here using log2(FPKM+1) as input for Combat with some discussion. I think there is no need to worry about below zero values, and I will recommend keep using log-transformed values. The real problem with Combat may be whether the prior distribution fits RNA-seq data well (check the plot). Also make sure you only use Combat corrected value for exploratory purpose (PCA, clustering), not for differential expression analysis.
One does have to worry about the negative values from ComBat. Negative values make no sense in RNA-seq (but they make sense in microarray studies).
In the manuscript that you mention, the following is stated:
They then go on to normalise by TMM, as per EdgeR.
In the reviewers' comments, it becomes even more interesting.
Lior Pachter writes:
The text in bold appears to see Pachter admitting that FPKM is neither optimal for differential expression analysis nor clustering.
Mick Watson writes:
Rafael Irizarry writes: