Dealing With Null Expression Values
4
7
9.7 years ago
Pasta ★ 1.3k

Hi there,

This question might sound trivial, but how do you deal with null gene expression values, e.g. in RNA-seq, when you are calculating fold changes? It's a simple question that I try to settle before each of my analyses.

Do you simply set them to some low arbitrary value? If this value is set to 0.1, you can get pretty high fold changes from low expression values, like 10/0.1 = 100-fold! Maybe set this arbitrary value to 1, then...
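To make the arithmetic above concrete, here is a minimal sketch of how the choice of floor value drives the resulting fold change. The function name `floored_fold_change` is hypothetical, and replacing values below the floor with the floor itself is just one reading of "set them to some low arbitrary value".

```python
def floored_fold_change(a, b, floor):
    """Fold change a/b after raising any value below `floor` up to `floor`.

    Hypothetical helper for illustration only; the floor is arbitrary.
    """
    return max(a, floor) / max(b, floor)


# Same pair of measurements (10 vs. a null value), two different floors.
for floor in (0.1, 1.0):
    print(f"floor={floor}: fold change = {floored_fold_change(10, 0, floor):g}")
```

With a floor of 0.1 the pair (10, 0) gives a 100-fold change; with a floor of 1 it gives 10-fold, so the reported number mostly reflects the chosen floor.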

Or do you simply discard genes with a null value in at least one condition?

Thanks

gene analysis microarray rna • 12k views
7
9.7 years ago
Alf ▴ 490

Several alternatives (pick whichever you prefer), some of them already mentioned:

• As someone already suggested, ignore the expression profile for that gene.
• Substitute the value by zero or the minimum expression value (or by any other arbitrary value).
• Substitute the value by the average of the values in the column (better than the previous one).
• Do something more sophisticated, like using the Expectation-Maximization algorithm.
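The simpler options in this list can be sketched as follows. This is only an illustration under assumptions: missing values are represented as `None`, a "column" is a plain list of expression values for one sample, and the function name `impute` and its strategy labels are made up for the example. The Expectation-Maximization option is not shown.

```python
def impute(column, strategy):
    """Handle missing values (None) in one column of expression values.

    strategy: "drop" removes the entries with a missing value,
              "min"  fills them with the smallest observed value,
              "mean" fills them with the average of the observed values.
    Hypothetical helper for illustration only.
    """
    observed = [v for v in column if v is not None]
    if strategy == "drop":
        return observed
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "min":
        fill = min(observed)
    elif strategy == "mean":
        fill = sum(observed) / len(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


column = [2.0, None, 4.0]
for strategy in ("drop", "min", "mean"):
    print(strategy, impute(column, strategy))
```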

There are many survey papers in statistics, ML and bioinformatics. Just search for "missing values".

(Edited) I found this paper: "Missing value estimation methods for DNA microarrays", Bioinformatics 2001, >1000 citations. It is perhaps the best answer to your question: http://www.ncbi.nlm.nih.gov/pubmed/11395428

6
9.7 years ago

I do not have a simple answer, but rather some questions for you to think about.

1. Fold change is not very informative, for this very reason: when expression is low, the ratio is far too noisy. A p-value is more important; p-values deal with noisy data better.
2. Ask yourself why the expression of geneA is null. Is it because there is actually no RNA for that gene, none at all? Or because, by chance, you didn't sequence it? And the gene you did sequence, how many times did you see it? 5? 10? What is the probability that, although the two samples have the SAME expression, you observe a DIFFERENT number of reads?
3. How many biological replicates do you have? What sort of variability do you observe within the test and the controls?
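The probability asked about in point 2 can be explored with a small sketch. It assumes, purely for illustration, that read counts follow a Poisson distribution with the same expected count in both samples; real RNA-seq counts are usually more variable than that, so these numbers are optimistic. The function names are hypothetical.

```python
import math


def prob_zero_reads(expected_reads):
    """P(0 reads) for a gene, assuming Poisson sampling (an assumption)."""
    return math.exp(-expected_reads)


def prob_counts_differ(expected_reads, max_count=500):
    """P(two samples with the same expected count show different read counts).

    Computed as 1 - sum_k P(k)^2 under the Poisson assumption; the sum is
    truncated at max_count, which is adequate for small expected counts.
    """
    p = math.exp(-expected_reads)  # P(count = 0)
    same = p * p
    for k in range(1, max_count + 1):
        p *= expected_reads / k  # P(count = k) from P(count = k - 1)
        same += p * p
    return 1.0 - same


for lam in (0.5, 2.0, 5.0):
    print(f"expected={lam}: P(zero)={prob_zero_reads(lam):.3f}, "
          f"P(differ)={prob_counts_differ(lam):.3f}")
```

Even with identical underlying expression, a lowly expressed gene can easily show zero reads in one sample and a few reads in the other, which is why a ratio alone is hard to interpret there.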
0

You raise interesting questions, thanks. Usually I do work with replicates, but for some reason my lab did not bother doing replicates for the last experiment... So here I am, trying to analyze what I can with no replicates :( Moreover, I am comparing 2 very different conditions (mRNA). In the first condition the cell is grown in a "normal" medium; in the 2nd condition the cell is almost dormant, i.e. with few mRNAs transcribed. The latter condition explains why I got plenty of null values.

0

No replicates? Then the data is not worth your time. Your collaborator should do things seriously or not at all.

3
9.7 years ago
Neilfws 49k

The simplest approach is to discard null values. As Stefano says, if you don't know why they are null, it's difficult to justify assigning values in the absence of a statistical model.

2
9.7 years ago
seidel 7.6k

I have a simple, pragmatic approach, purely for the purpose of being able to generate a ratio: add a small number to all values. E.g. for RNA-seq data I might add 0.01 to all values. You can then calculate a ratio such that large values are virtually unaffected, and small values are also not affected very much. If you then plot the data in log2 space (i.e. using an MA plot, where M is log2(value1/value2) on the y-axis and A is log2(sqrt(value1*value2)) on the x-axis), you can evaluate ratios, and the magnitude of the numbers that make up those ratios, and weight things accordingly.
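A minimal sketch of this pseudocount-plus-MA idea, using the formulas given above. The function name `ma_values` and the example numbers are made up for illustration; the default pseudocount of 0.01 is the value mentioned in the text.

```python
import math


def ma_values(value1, value2, pseudocount=0.01):
    """Return (M, A) for one gene after adding a pseudocount to both values.

    M = log2(value1 / value2)         -> log ratio (y-axis of an MA plot)
    A = log2(sqrt(value1 * value2))   -> average log magnitude (x-axis)
    """
    v1 = value1 + pseudocount
    v2 = value2 + pseudocount
    m = math.log2(v1 / v2)
    a = math.log2(math.sqrt(v1 * v2))
    return m, a


# Hypothetical measurements: (condition 1, condition 2).
examples = {
    "large vs zero": (100, 0),
    "tiny vs zero": (0.05, 0),
    "large vs large": (100, 50),
}
for label, (v1, v2) in examples.items():
    m, a = ma_values(v1, v2)
    print(f"{label}: M={m:.2f}, A={a:.2f}")
```

Both zero-containing pairs give a positive M, but the "tiny vs zero" pair has a much lower A, which is how the plot separates it from a ratio with one substantial component.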

I disagree that fold change is not very informative, and that null values for one of the conditions should be discarded or ignored. As a biologist and an experimentalist, I can get a lot of insight if a given gene has no counts in one sample, and many counts in another, and indeed it's what I might expect from a gene differentially regulated under a given set of conditions. Using the MA plot, one can also avoid the folly of large ratios that come from small numbers (i.e. I can distinguish a ratio composed of one reasonable component and one small component - which is interesting, from a ratio that comes from two small but different components - where a large ratio may result but neither component would be a trustworthy measurement).

Of course, if you have p-values, then those are what should be used for evaluation. Also, the trick above is simply pragmatic so the ratios can be evaluated for ideas, and should be explicitly noted as such. I leave the individual measurements untouched, so that once an interesting set of ratios is selected, the genes can be evaluated for those that have zero counts.

This approach is likely to make a statistician groan in pain, but not more loudly than the experimentalist being told that he can't evaluate the data because one of the measurements was zero. (As an analogy, if the scientific question is: am I wealthy? And I'm trying to detect any fold change between myself and a person standing next to me, if I have $100 in my pocket and they have zero, it's more useful to add a penny to both our pockets and answer the question, than to be told the question can't be evaluated because their pockets are empty.)

0

The approach from the last section is implemented in the limma package in R. @pasta - see limma's page on Bioconductor, especially the backgroundCorrect function with the normexp method and the offset option.