Question: Normal Distribution Of (Log) Microarray Results
1
6.7 years ago, Assa Yeroslaviz (1.1k, Munich) wrote:

Hi,

I have a problem interpreting my results and would like to ask for your help.

I have a table of expression values which looks like this:

           ctrl.high    ctrl.low    log_ratio
  gene1     9.572083    6.461176    3.1109074
  gene2     2.725700    3.354198   -0.6284985
  gene3    10.002005    8.190133    1.8118717
  gene4     3.812149    1.90948     1.9026686
  gene5     5.561375    3.16058     2.4007949
  gene6     5.515633    3.394174    2.1214594

The goal is to identify the differentially regulated genes between the two fractions (high vs. low). To do so, I calculated the log-ratio for each gene (high - low, as these are log values) to get the fold-change between the two.
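
For illustration, here is a minimal Python sketch of that calculation, using the first three rows of the table. It assumes the expression values are log2-transformed (the base is not stated in the post, so `base=2.0` is an assumption); the function names are made up for this example.

```python
# Log-scale expression values (ctrl.high, ctrl.low) for the first three genes
# of the table above. The log base is ASSUMED to be 2.
expression = {
    "gene1": (9.572083, 6.461176),
    "gene2": (2.725700, 3.354198),
    "gene3": (10.002005, 8.190133),
}


def log_ratio(log_high, log_low):
    """log(high / low) equals log(high) - log(low) for values already on a log scale."""
    return log_high - log_low


def fold_change(log_high, log_low, base=2.0):
    """Convert the log ratio back to a linear fold change (base is an assumption)."""
    return base ** log_ratio(log_high, log_low)


for gene, (high, low) in expression.items():
    print(gene, round(log_ratio(high, low), 6), round(fold_change(high, low), 2))
```

A positive log ratio means higher expression in ctrl.high, a negative one higher expression in ctrl.low.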

At first we thought about taking the mean+/- twice the standard deviation of the means as a threshold to decide which genes are significantly deregulated, mainly because this is how the biologist wanted it to be analyzed, but after looking at the distribution of the data I am not certain anymore that this is the right choice.

So I have a couple of questions regarding this kind of analysis:

  1. Is it possible to discriminate differentially regulated genes based on the mean of the log-values of their expression? Are there any papers for or against this method of calculation?

  2. I have uploaded an image of the distribution of the log-values. I expected it to be normally distributed around 1, but it looks like there is a second peak on the left side of the plot. I know that this is probably a very vague question, but is there a way to explain this kind of second peak or to find out how it arises?

Thanks

A.

microarray • 2.5k views
modified 6.7 years ago by Stefano Berri (4.0k) • written 6.7 years ago by Assa Yeroslaviz (1.1k)
1

My recommendation: tell the biologists they need to invest a bit more in replication, otherwise the experiment cannot be analyzed and published(!). Refuse to analyze the experiment otherwise. It is not worth wasting your and your client's time on sub-par analysis attempts when, by investing a few hundred euros/dollars, you can get so much more.

written 6.7 years ago by Michael Dondrup (44k)

Good news about the replicates; follow Stefano's recommendations then. Start with normalization and then run the statistical test(s) from scratch if at all possible. Correct for multiple testing.
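
As a rough sketch of what "correct for multiple testing" can look like, here is the Benjamini-Hochberg adjustment in plain Python. The choice of Benjamini-Hochberg is an assumption (the comment does not name a method), and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Genes whose adjusted p-value is below the chosen false discovery rate (often 0.05) would then be reported.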

written 6.6 years ago by Michael Dondrup (44k)
5
6.7 years ago, Stefano Berri (4.0k, Cambridge, UK) wrote:

I am afraid I have bad news for you. You need biological replicates.

1 - I don't think there are papers that explain how to analyse the data without biological replicates...

2 - Those genes in the second "bump" might well be those that are regulated, but they could also be genes in a region of the array where the hybridisation didn't work very well.

About your approach, here is the catch. Even if no gene is regulated, you expect about 4.6% of the log(ratio) values to fall outside the mean +- 2SD, simply because that is a property of the normal distribution.

And because you probably have a very large number of genes, that is a very high number of "candidates" which are just there by chance. Furthermore, the standard deviation of the signal depends on the intensity of the signal, so a single threshold for all genes treats low- and high-intensity genes unequally.
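
To put a number on this, here is a small Python sketch that computes the two-sided normal tail beyond 2 SD and the resulting count of chance "candidates". It assumes the log ratios are normally distributed; the array size of 20000 genes is hypothetical.

```python
import math


def fraction_outside(k):
    """Two-sided tail probability of a normal distribution beyond k standard deviations."""
    return math.erfc(k / math.sqrt(2.0))


def expected_chance_hits(n_genes, k=2.0):
    """Expected number of genes beyond mean +- k*SD when no gene is regulated."""
    return n_genes * fraction_outside(k)


n_genes = 20000  # hypothetical number of genes on the array
print(f"fraction outside 2 SD: {fraction_outside(2.0):.4f}")
print(f"expected chance candidates: {expected_chance_hits(n_genes):.0f}")
```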

Also, the distribution of log2(ratio) should be centered on 0. If it is not, as in your case, it makes me suspect you haven't normalised the signal properly.
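
A quick way to check the centering is to look at the median log ratio. The Python sketch below does this for the six values from the question and then applies a global median shift. This is only a diagnostic under simplifying assumptions: six genes are not representative of an array, and median centering is a stand-in here, not the normalisation a full microarray workflow would use.

```python
import statistics


def median_center(log_ratios):
    """Shift all log ratios so that their median becomes 0."""
    shift = statistics.median(log_ratios)
    return [x - shift for x in log_ratios]


# log_ratio column from the table in the question.
log_ratios = [3.1109074, -0.6284985, 1.8118717, 1.9026686, 2.4007949, 2.1214594]

centered = median_center(log_ratios)
print("median before:", statistics.median(log_ratios))
print("median after: ", statistics.median(centered))
```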

I hope this gives you some keywords to look up or think about.

modified 6.7 years ago • written 6.7 years ago by Stefano Berri (4.0k)
2

Try to get the data as "raw" as possible (CEL files for Affymetrix, intensity values otherwise). Then search for a tutorial on microarray analysis and/or have a look at LIMMA (http://www.bioconductor.org/packages/release/bioc/html/limma.html). Maybe somebody here can suggest a nice tutorial about microarray analysis. Take a couple of weeks to study, practise and understand how the analysis is done. For microarrays there are a few well-accepted workflows, depending on the platform.

written 6.6 years ago by Stefano Berri (4.0k)

This is a great answer. Another document on some best practices that might be a good read: http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf

written 6.7 years ago by Malachi Griffith (16k)

It's like with cars: you need at least 2 wheels (then it's a motorbike), but you'd better have 3 or 4 wheels...

written 6.7 years ago by Michael Dondrup (44k)

I have replicates! Each of these values is the mean of three replicates, which were normalized. The values in the first two columns are the values after normalization.

The only difficulty is that they were normalized separately, as they were compared with other arrays. In the original assay, ctrl.high was compared with treated.high and not with ctrl.low. Do I need to normalize them again? Should I run the complete workflow, including normalization, from the beginning?

Btw, is it a problem to normalize the data more than once?

written 6.6 years ago by Assa Yeroslaviz (1.1k)

I have a more general question - does it make sense at all to calculate the differentially regulated genes using the mean+SD or is it better to use the fold-change?

written 6.6 years ago by Assa Yeroslaviz (1.1k)

You need to do a statistical test. Fold change will not do it. Mean+SD is closer, but you need a measure of statistical certainty. I'd suggest taking a couple of weeks to work through some microarray analysis tutorials. Better yet, work with a bioinformatics person who has worked with microarrays before, at least for your first try.

written 6.6 years ago by Sean Davis (25k)

I second Sean, except that I don't think Mean+SD is any better. Basically, you have to ask this question. You work with the null hypothesis that no gene is regulated. Then you ask: what is the probability that a gene which should have log(ratio) = 0 produces three log(ratio) values like the ones observed? You answer that with a t-test (t.test), and finally you correct for multiple testing.
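
To make the idea concrete, here is a minimal Python sketch of a one-sample t-test against log(ratio) = 0 for exactly three replicates. It uses the closed-form CDF of the Student t distribution with 2 degrees of freedom, so it does not generalise to other replicate counts. The replicate values and gene names are hypothetical; in practice a library routine (for example `t.test` in R or `scipy.stats.ttest_1samp` in Python) or LIMMA would be the usual choice.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """t statistic for the null hypothesis that the true mean of `values` is `mu`."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("replicates are identical; t statistic is undefined")
    return (statistics.mean(values) - mu) / (sd / math.sqrt(n))


def p_value_three_replicates(values, mu=0.0):
    """Two-sided p-value, valid ONLY for exactly three replicates (df = 2)."""
    if len(values) != 3:
        raise ValueError("the closed form used here only holds for n = 3")
    t = one_sample_t(values, mu)
    # Student t CDF with 2 degrees of freedom: F(t) = 1/2 + t / (2 * sqrt(2 + t^2)),
    # so the two-sided p-value is 1 - |t| / sqrt(2 + t^2).
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)


# Hypothetical per-replicate log ratios (not taken from the thread).
replicate_log_ratios = {
    "geneA": [3.0, 3.2, 3.1],
    "geneB": [-0.5, 0.9, 0.1],
}

for gene, values in replicate_log_ratios.items():
    print(gene, round(p_value_three_replicates(values), 4))
```

The resulting p-values, one per gene, are what would then go into the multiple-testing correction.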

written 6.6 years ago by Stefano Berri (4.0k)