I want to apply the ComBat function from the sva package to an RNA-Seq dataset containing FPKM values. I first added 1 to all values, log-transformed the data, and then called ComBat. However, the cleaned data contain no exact zeros, while there were many zeros in the original data. This is expected, since ComBat standardizes the data. After back-transforming (exponentiating and subtracting 1), all of the original zeros are mapped to values between -0.36 and 4.45. Still, it seems odd to have negative values and no zeros in RNA-Seq data. So my question is: what is the best way to use ComBat on RNA-Seq data? Thanks.
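To see why the zeros disappear, here is a minimal Python sketch. It does not call ComBat. It assumes a much simpler per-batch location/scale adjustment (no empirical Bayes shrinkage, no covariates) applied to hypothetical FPKM values for a single gene, which is enough to reproduce the effect. All names and numbers are made up for illustration.

```python
import math
from statistics import mean, pstdev

# Hypothetical FPKM values for ONE gene in two batches (illustrative only).
batch_a = [0.0, 0.0, 2.0, 5.0]
batch_b = [0.0, 4.0, 9.0, 20.0]


def log1p_all(values):
    """log(x + 1) transform, as described in the question."""
    return [math.log1p(v) for v in values]


def adjust(batch_log, grand_mean, grand_sd):
    """Standardize within the batch, then rescale to the pooled mean and SD.

    This is a simplified stand-in for a location/scale batch correction,
    NOT the ComBat algorithm. It assumes the batch is not constant
    (otherwise the batch SD would be zero).
    """
    m = mean(batch_log)
    s = pstdev(batch_log)
    return [(x - m) / s * grand_sd + grand_mean for x in batch_log]


log_a = log1p_all(batch_a)
log_b = log1p_all(batch_b)

# Pooled mean and SD across both batches on the log scale.
pooled = log_a + log_b
grand_mean, grand_sd = mean(pooled), pstdev(pooled)

# Adjust each batch, then back-transform with exp(x) - 1.
adj_a = [math.expm1(x) for x in adjust(log_a, grand_mean, grand_sd)]
adj_b = [math.expm1(x) for x in adjust(log_b, grand_mean, grand_sd)]

print("batch A:", [round(v, 3) for v in adj_a])
print("batch B:", [round(v, 3) for v in adj_b])
```

In this toy example the zeros in `batch_a` come back as small positive numbers and the zero in `batch_b` comes back negative. A value is shifted by its batch's mean and scaled by its batch's spread, so nothing forces an original zero to land on zero again.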
Log-transforming FPKM values does not make things better. In addition, combining ComBat with FPKM data is akin to throwing your data in the trash and testing noise.
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units
The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
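A small Python sketch of what "relative measurement" means in practice. The gene names, lengths, and read counts are hypothetical, and the `fpkm` helper is just the textbook formula (reads per kilobase of transcript per million mapped reads), not code from any particular package.

```python
def fpkm(counts, lengths_bp):
    """FPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


# Hypothetical gene lengths in base pairs.
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1500}

# Two hypothetical samples: only geneC differs between them.
sample1 = {"geneA": 100, "geneB": 200, "geneC": 300}
sample2 = {"geneA": 100, "geneB": 200, "geneC": 3000}

f1 = fpkm(sample1, lengths)
f2 = fpkm(sample2, lengths)

# geneA has the same read count in both samples, but its FPKM differs
# because the denominator (total mapped reads) is dominated by geneC.
print("geneA FPKM, sample 1:", round(f1["geneA"], 1))
print("geneA FPKM, sample 2:", round(f2["geneA"], 1))
print("ratio:", round(f1["geneA"] / f2["geneA"], 2))
```

Here geneA's FPKM drops by a factor of 5.5 between the two samples even though its read count is unchanged. The change comes entirely from geneC inflating the total, which is why these units need between-sample normalization before samples can be compared.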