Question

Normal Distribution Of (Log) Microarray Results

1

13.4 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

I have a problem interpreting my results and would like to ask for your help.

I have a table of expression values which looks like that:

           ctrl.high    ctrl.low    log_ratio
  gene1     9.572083    6.461176    3.1109074
  gene2     2.725700    3.354198   -0.6284985
  gene3    10.002005    8.190133    1.8118717
  gene4     3.812149    1.90948     1.9026686
  gene5     5.561375    3.16058     2.4007949
  gene6     5.515633    3.394174    2.1214594

The goal is to identify the differentially regulated genes between the two fractions (high vs. low). To do so, I calculated the log-ratio for each gene (high - low, as these are log values) to get the fold-change between the two.
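As a minimal sketch of that calculation: the difference of two log-scale values is the log of their ratio, and it can be converted back to a linear fold change. The values are copied from the table above; base 2 is an assumption (the post does not state the base), and the function names are only illustrative.

```python
# Values copied from the table in the question.
# Assumption: they are log2 intensities (the base is not stated in the post).
expression = {
    "gene1": (9.572083, 6.461176),
    "gene2": (2.725700, 3.354198),
    "gene3": (10.002005, 8.190133),
}


def log_ratio(high, low):
    """The log-ratio of two log-scale values is simply their difference."""
    return high - low


def fold_change(log2_ratio):
    """Convert a log2 ratio back to a linear fold change (assumes base 2)."""
    return 2.0 ** log2_ratio


for gene, (high, low) in expression.items():
    lr = log_ratio(high, low)
    print(f"{gene}: log_ratio={lr:.4f} fold_change={fold_change(lr):.2f}")
```

A negative log-ratio (gene2 here) corresponds to a fold change below 1, i.e. lower expression in the high fraction.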

At first we thought about taking the mean+/- twice the standard deviation of the means as a threshold to decide which genes are significantly deregulated, mainly because this is how the biologist wanted it to be analyzed, but after looking at the distribution of the data I am not certain anymore that this is the right choice.

So I have a couple of questions regarding this kind of analysis:

1 - Is it possible to discriminate differentially regulated genes based on the mean of the log-values of their expression? Are there any papers for or against this method of calculation?

2 - I uploaded an image of the distribution of the log-values. I expected it to be normally distributed around 1, but it looks like there is a second peak on the left side of the plot. I know that this is probably a very vague question, but is there a way to explain this kind of second peak or to find out how it arises?

Thanks

A.

microarray • 5.7k views

ADD COMMENT • link updated 13.4 years ago by Stefano Berri 4.4k • written 13.4 years ago by Assa Yeroslaviz ★ 1.9k

1

My recommendation: tell the biologists they need to invest a bit more in replication, otherwise the experiment cannot be analyzed and published(!). Refuse to analyze the experiment otherwise; it is not worth wasting your and your client's time on sub-par analysis attempts when investing a few hundred euros/dollars gets you so much more.

ADD REPLY • link 13.4 years ago by Michael 56k

0

Good news about the replicates; follow Stefano's recommendations then. Start from normalization and then run the statistical test(s) from scratch if at all possible. Correct for multiple testing.
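For the multiple-testing step, one common choice is the Benjamini-Hochberg adjustment. The thread does not say which correction to use, so this is only an illustrative sketch, and the p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per gene, purely for illustration.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Genes would then be called significant when their adjusted p-value is below the chosen false discovery rate.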

ADD REPLY • link 13.4 years ago by Michael 56k

Ram · Answer 1 · 2012-01-31

5

13.4 years ago

Stefano Berri 4.4k

I am afraid I have bad news for you. You need biological replicates.

1 - I don't think there are papers that explain how to analyse the data without biological replicates...

2 - those genes in the second "bump" might well be those that are regulated, but could also be genes in a region of the array where the hybridisation didn't work very well.

About your approach, here is the catch. Even if no gene is regulated, you expect, by definition (for normally distributed values), that about 4.6% of the log(ratio) values fall outside the mean ± 2 SD.

And because you probably have a very large number of genes, that is a very high number of "candidates" which are there just by chance. Furthermore, the standard deviation of the signal depends on the intensity of the signal.
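To put a number on that, here is a small sketch of the tail fraction of a normal distribution beyond mean ± 2 SD and the resulting count of chance "candidates". The array size of 20000 genes is a hypothetical figure, not something stated in the thread.

```python
from statistics import NormalDist


def fraction_outside(k_sd=2.0):
    """Two-sided tail mass of a normal distribution beyond mean +/- k_sd * SD."""
    return 2.0 * (1.0 - NormalDist().cdf(k_sd))


n_genes = 20000  # hypothetical array size, not taken from the thread
frac = fraction_outside(2.0)
expected_by_chance = frac * n_genes

print(f"fraction outside mean +/- 2 SD: {frac:.4f}")
print(f"expected chance candidates among {n_genes} genes: {expected_by_chance:.0f}")
```

With an array of that size, roughly nine hundred genes would cross the threshold even if nothing were regulated.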

Also, the distribution of log2(ratio) should be centered on 0. If it is not, as in your case, it makes me suspect you haven't normalised the signal properly.
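A quick way to check the centering is to look at the median of the log-ratios. The sketch below uses only the six values shown in the question, which are far too few to be representative; subtracting the median merely illustrates the shift and is not a substitute for normalising the arrays together.

```python
from statistics import median

# log_ratio column copied from the table in the question (six genes only).
log_ratios = [3.1109074, -0.6284985, 1.8118717, 1.9026686, 2.4007949, 2.1214594]

# A centre far from 0 hints that the two fractions are not on the same scale.
center = median(log_ratios)
print(f"median log-ratio: {center:.4f}")

# Illustration only: shifting by the median re-centres the values on 0.
centered = [x - center for x in log_ratios]
print(f"median after shifting: {median(centered):.4f}")
```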

I hope this gives you some keywords to look up and some things to think about.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.4 years ago by Stefano Berri 4.4k

2

Try to get the data as "raw" as possible (CEL files for Affymetrix, intensity values otherwise). Then search for a tutorial on microarray analysis and/or have a look at LIMMA (http://www.bioconductor.org/packages/release/bioc/html/limma.html). Maybe somebody here can suggest a nice tutorial about microarray analysis. Take a couple of weeks to study, practise and understand how the analysis is done. For microarrays there are a few well-accepted workflows, depending on the platform.

ADD REPLY • link 13.4 years ago by Stefano Berri 4.4k

0

This is a great answer. Another document on some best practices that might be a good read: http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf

ADD REPLY • link 13.4 years ago by Malachi Griffith 20k

0

It's like with cars, you need at least 2 wheels (then it's a motorbike), better have 3 to 4 wheels...

ADD REPLY • link 13.4 years ago by Michael 56k

0

I have replicates! Each of these values is the mean of three replicates, which were normalized. The values in the first two columns are the values after normalization.

The only difficulty is that they were normalized separately, as they were compared with other arrays. In the original assay, ctrl.high was compared with treated.high and not with ctrl.low. Do I need to normalize them again? Should I run the complete workflow, including normalization, from the beginning?

Btw, is it a problem to normalize the data more than once?

ADD REPLY • link 13.4 years ago by Assa Yeroslaviz ★ 1.9k

0

I have a more general question - does it make sense at all to calculate the differentially regulated genes using the mean+SD or is it better to use the fold-change?

ADD REPLY • link 13.4 years ago by Assa Yeroslaviz ★ 1.9k

0

You need to do a statistical test. Fold change will not do it. Mean+SD is closer, but you need a measure of statistical certainty. I'd suggest taking a couple of weeks to work through some microarray analysis tutorials. Better yet, work with a bioinformatics person who has worked with microarrays before, at least for your first try.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

0

I second Sean, except that I don't think mean+SD is any better. Basically, you have to ask this question: working under the null hypothesis that no gene is regulated, what is the probability that a gene which should have log(ratio) = 0 produces three log(ratio) values like the ones observed? You answer that with a t-test (t.test), and finally you correct for the multiple tests.
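A rough sketch of that test for a single gene, in plain Python: a one-sample t-test of three replicate log-ratios against 0. The replicate values are made up, and the p-value formula used here is the closed form of Student's t distribution with exactly 2 degrees of freedom, so it only applies to three replicates. In practice a package such as LIMMA, mentioned earlier in this thread, would normally do this across all genes.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0=0.0):
    """t statistic for H0: the true mean of `values` equals mu0.

    Requires at least two values with non-zero variance.
    """
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - mu0) / standard_error


def two_sided_p_df2(t):
    """Exact two-sided p-value for Student's t with 2 degrees of freedom.

    Only valid for df = 2, i.e. exactly three replicates.
    """
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)


# Made-up log-ratios from three replicates of one gene.
replicate_log_ratios = [3.0, 3.2, 3.1]

t_stat = one_sample_t(replicate_log_ratios)
p_value = two_sided_p_df2(t_stat)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.6f}")
```

The per-gene p-values obtained this way are what the multiple-testing correction is then applied to.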

ADD REPLY • link 13.4 years ago by Stefano Berri 4.4k