Question: Dealing With Null Expression Values
Pasta1.3k wrote:

Hi there,

This question might sound trivial, but how do you deal with null gene expression values, e.g. from RNA-seq, when you are calculating fold changes? It's a simple question I try to figure out before each of my analyses.

Do you simply set them to some low arbitrary value? If this value is set to 0.1, you can get pretty high fold changes from low expression values, e.g. 10/0.1 = a 100-fold change! Maybe set this arbitrary value to 1, then...

Or do you simply discard genes with a null value in at least one condition?

Thanks

gene analysis rna microarray • 8.9k views
ADD COMMENTlink modified 8.2 years ago by Alf450 • written 8.2 years ago by Pasta1.3k
Stefano Berri4.1k wrote:

I do not have a simple answer, but rather some questions for you to think about.

  1. Fold change on its own is not very informative, for this very reason: when expression is low, the ratio is far too noisy. A p-value is more important; p-values deal with noisy data better.
  2. Ask yourself why you see that your gene expression from geneA is null. Because there is actually no RNA for that gene? None at all? Or because you, by chance, didn't sequence it? And the one you did sequence, how many times? 5? 10? What is the probability that, although they have the SAME expression, you observe a DIFFERENT number of reads?
  3. How many biological replicates do you have? What sort of variability do you observe within the test and the controls?
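To put rough numbers on question 2, here is a minimal sketch. It assumes read counts for a gene follow a Poisson distribution with the same mean in both samples; that model is an assumption for illustration only (real RNA-seq counts are usually more variable), and the function names are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    # Poisson probability of observing exactly k reads (mean must be > 0).
    # Computed in log space so large k does not overflow.
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def prob_zero_reads(mean):
    # Chance of seeing no reads at all even though the gene is expressed.
    return math.exp(-mean)

def prob_counts_differ(mean, max_k=200):
    # Chance that two independent samples with the SAME expected count
    # show DIFFERENT read counts: 1 - sum_k P(k)^2.
    # max_k truncates the sum; it is adequate for small means only.
    same = sum(poisson_pmf(k, mean) ** 2 for k in range(max_k + 1))
    return 1.0 - same

if __name__ == "__main__":
    for mean in (1, 5, 10):
        print(mean, prob_zero_reads(mean), prob_counts_differ(mean))
```

Under this assumed model, a gene expected to yield only a handful of reads will quite often give zero in one sample, and two samples with identical expression will usually not show identical counts.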
ADD COMMENTlink written 8.2 years ago by Stefano Berri4.1k

You raise interesting questions, thanks. Usually I do work with replicates, but for some reason my lab did not bother doing replicates for the last experiment... So here I am, trying to analyze what I can with no replicates :( Moreover, I am comparing 2 very different conditions (mRNA). In the first condition the cell is grown in a "normal" medium; in the 2nd condition the cell is almost dormant, i.e. with few mRNAs transcribed. The second condition explains why I got plenty of null values.

ADD REPLYlink written 8.2 years ago by Pasta1.3k

no replicates? tell your boss to stop wasting your time.

ADD REPLYlink written 8.2 years ago by Yannick Wurm2.3k

no replicates? then the data is not worth your time. Your collaborator should do things seriously or not at all

ADD REPLYlink written 8.2 years ago by Yannick Wurm2.3k
Alf450 wrote:

Several alternatives (pick whichever you prefer), some of them already mentioned in other answers:

  • As someone told you, ignore the expression profile for that gene.
  • Substitute the value by zero or the minimum expression value (or by any other arbitrary value).
  • Substitute the value by the average of the values in the column (better than the previous one).
  • Do something more sophisticated, like using the Expectation-Maximization algorithm.
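The simpler substitutions in the list above can be sketched as follows. This is only an illustration: it assumes missing measurements are encoded as None in a single column of values, and the function name impute_column and its strategy labels are hypothetical. The Expectation-Maximization option is not covered here.

```python
def impute_column(values, strategy="mean"):
    # values: list of expression values for one column, None = missing.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")

    if strategy == "zero":
        fill = 0.0
    elif strategy == "min":
        # Minimum observed expression value in the column.
        fill = min(observed)
    elif strategy == "mean":
        # Average of the observed values in the column.
        fill = sum(observed) / len(observed)
    else:
        raise ValueError("unknown strategy: " + strategy)

    # Replace only the missing entries; observed values stay untouched.
    return [fill if v is None else v for v in values]

if __name__ == "__main__":
    column = [4.0, None, 8.0]
    for strategy in ("zero", "min", "mean"):
        print(strategy, impute_column(column, strategy))
```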

There are many survey papers in statistics, ML and bioinformatics. Just search for "missing values".

(Edited) Found this paper, "Missing value estimation methods for DNA microarrays", Bioinformatics 2001, >1000 citations. It is perhaps the best answer to your question: http://www.ncbi.nlm.nih.gov/pubmed/11395428

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Alf450
seidel6.8k wrote:

I have a simple, pragmatic approach, purely for the purpose of being able to generate a ratio: add a small number to all values. For RNA-seq data, for example, I might add 0.01 to all values. You can then calculate a ratio such that large values are virtually unaffected, and small values are also not affected very much. If you then plot the data in log2 space (i.e. using an MA plot, where M is log2(value1/value2) on the y-axis and A is log2(sqrt(value1*value2)) on the x-axis), you can evaluate the ratios and the magnitude of the numbers that make up those ratios, and weight things accordingly.
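A minimal sketch of that calculation, using the M and A definitions given above. The function name ma_values and the example numbers are made up for illustration; the default of 0.01 is the offset mentioned in this answer, not a recommended value.

```python
import math

def ma_values(value1, value2, pseudocount=0.01):
    # Add the same small offset to both values so a zero never ends up
    # in the denominator or inside a logarithm.
    a = value1 + pseudocount
    b = value2 + pseudocount
    m = math.log2(a / b)                  # M: log2 ratio (y-axis)
    a_val = math.log2(math.sqrt(a * b))   # A: log2 geometric mean (x-axis)
    return m, a_val

if __name__ == "__main__":
    # One reasonable component and one zero, two tiny components,
    # and two large components that the offset barely changes.
    for v1, v2 in [(100, 0), (0.05, 0), (1000, 500)]:
        m, a_val = ma_values(v1, v2)
        print(v1, v2, round(m, 3), round(a_val, 3))
```

Reading M together with A is what separates a large ratio built on one solid measurement from a large ratio built on two tiny ones: the second case sits much further left on the A axis.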

I disagree that fold change is not very informative, and that null values for one of the conditions should be discarded or ignored. As a biologist and an experimentalist, I can get a lot of insight if a given gene has no counts in one sample, and many counts in another, and indeed it's what I might expect from a gene differentially regulated under a given set of conditions. Using the MA plot, one can also avoid the folly of large ratios that come from small numbers (i.e. I can distinguish a ratio composed of one reasonable component and one small component - which is interesting, from a ratio that comes from two small but different components - where a large ratio may result but neither component would be a trustworthy measurement).

Of course, if you have p-values, then those are what should be used for evaluation. Also, the trick above is simply pragmatic, so that the ratios can be evaluated for ideas, and it should be explicitly noted as such. I leave the individual measurements untouched, so that once an interesting set of ratios is selected, the genes can be checked for those that have zero counts.

This approach is likely to make a statistician groan in pain, but not more loudly than the experimentalist being told that he can't evaluate the data because one of the measurements was zero. (as an analogy, if the scientific question is: am I wealthy? And I'm trying to detect any fold change between myself and a person standing next to me, if I have $100 in my pocket and they have zero, it's more useful to add a penny to both our pockets and answer the question, than to be told the question can't be evaluated because their pockets are empty).

ADD COMMENTlink written 8.2 years ago by seidel6.8k

The approach from the last section is implemented in the limma package in R. @pasta - see limma's page on Bioconductor, especially the backgroundCorrect function with the normexp method and the offset option.

ADD REPLYlink written 7.9 years ago by boczniak767680
Neilfws48k wrote:

The simplest approach is to discard null values. As Stefano says, if you don't know why they are null, it's difficult to justify assigning values in the absence of a statistical model.
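As a rough sketch of the discard option, assuming expression data is held as a dictionary that maps each gene name to its values across conditions (the layout, the gene names, and the function name drop_genes_with_null are all hypothetical):

```python
def drop_genes_with_null(expression):
    # Keep only genes whose value is present and non-zero in every condition.
    return {
        gene: values
        for gene, values in expression.items()
        if all(v is not None and v > 0 for v in values)
    }

if __name__ == "__main__":
    expression = {
        "geneA": [12.0, 0.0],   # null in the second condition: dropped
        "geneB": [3.5, 7.1],    # measured in both conditions: kept
        "geneC": [None, 4.0],   # missing in the first condition: dropped
    }
    print(drop_genes_with_null(expression))
```

Note that this throws away exactly the on/off genes that seidel's answer argues are the most interesting, so it is worth recording how many genes the filter removes.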

ADD COMMENTlink written 8.2 years ago by Neilfws48k