Question

How to handle the 'Inf' and '-Inf' in cufflinks output when use clusterprofiler gseGO

0

Entering edit mode

6.5 years ago

yaotianran • 0

Hi, I often use cufflinks and R/clusterProfiler to analyze RNA-seq data.

In the cuffdiff output file gene_exp.diff, sometime lots of genes ( both significant yes and significant no ) have 'Inf' or '-Inf' as log2(fold_change). If I leave them there and use them as a input of gene set enrichment analysis, clusterprofiler gseGO function constantly raises an error:

Error in if (maxP > -minP) { : missing value where TRUE/FALSE needed

If I remove the genes whose log2(fold_change) is 'Inf' or '-Inf' from input gseGO will work. But should they really be removed ? Or is there any better way to handle this situation ?

RNA-Seq clusterProﬁler cufflinks gseGO GSEA • 4.9k views

ADD COMMENT • link updated 6.2 years ago by garcia.nacho • 0 • written 6.5 years ago by yaotianran • 0

score 3 · Answer 1 · 2017-10-18

I'd recommend changing infinities to the corresponding maximum positive and negative finite values, may be multiplied by a constant > 1.

GSEA here is about weighted ranking: you can assign it whatever you like and it would be fine from the method's standpoint. However, when you remove some genes based on their rank (e.g. when you remove infinities), there statistics will get wrong.

score 1 · Answer 2 · 2017-10-09

1

Entering edit mode

6.5 years ago

Kevin Blighe 87k

I do not see any problem in removing these. A log base 2 of nil (zero or nought) results in a negative infinity (-inf). We can probably debate all day about what 'negative infinity' actually means, but, for your purposes, I believe that these are best removed. It is a common practice in expression studies to 'clean' the dataset prior to final analysis by removing zero or low-count transcripts.

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks for the suggestion. I also consulted with my colleagues. Still no consensus on the solutions. But I read the article from the guy that invented the GSEA ( 2005-Gene set enrichment analysis - A knowledge-based approach for interpreting genome-wide expression profiles ). It seems the fold change is only used for ranking the universal/background gene list. So what about replacing the 'Inf' and '-Inf' with some big numbers, say '99999' and '-99999' ?

Nonetheless since I can't read the source code of clusterProfiler and cufflinks I will remove the 'Inf' and '-Inf' gene from GSEA input.

ADD REPLY • link 6.5 years ago by yaotianran • 0

1

Entering edit mode

Hey,

That is one approach, but normally you replace the infinities with half the lowest non-zero value. Other people 'impute' these numbers with the median value, which is only valid if you are aiming to do univariate analysis. For expression data, my view is that they should be recorded and then removed from the analysis.

You will not find consensus on this issue. Even if you ask 2 statisticians on how best to deal with this type of problem, they'll give separate answers.

The conclusion, therefore, is to try them all. If your experimental set-up is solid and you have normalised correctly your data, then you should get more or less the same result.

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

score 0 · Answer 3 · 2018-02-02

I would assume that the 0.00000 value means:

The real value is below the detection level (that is common for ChIP-Seq, I think)
The real value is an absolute 0 (like in 0 molecules of mRNA in the cell)

In both cases replacing the 0.000000 by 0.000001 (or adding 0.000001 to each value of whole dataset) doesn't look incorrect to me and you end up with non-infinite values.

I think that removing the infinite values is an incorrect approach: imagine that you have a huge effect that makes the values for one condition dropping below the detection levels (I'm currently working in a dataset like that) if you remove those values you could be missing part of the effect. Same applies for replacing those values by the median, although I think that that is even more incorrect since you are introducing noise in your data.

Anyway I have to say that I'm not a statistician.