Question: Easy Microarray Statistics Question
8.3 years ago by
User 73520
User 73520 wrote:

Let's say I'm looking at 5 independent microarrays, and some number of genes are upregulated on each microarray. If 200 of the same genes are upregulated on every microarray, what's the statistical test to prove that it's a significant enrichment? What if the genes are upregulated on 4 out of 5 microarrays?

microarray • 2.3k views
ADD COMMENTlink written 8.3 years ago by User 73520

I think Stefano's answer is on the right track. The reason: to find up-regulation you need to run a statistical test across replicates (e.g. t-test, limma, ANOVA) if all arrays stem from the same experiment. Then your test options are more or less used up, as there is no way to test whether a gene is regulated on one array only. However, did you instead intend to check differential regulation between unrelated arrays without replicates? Not a good idea.

ADD REPLYlink written 8.3 years ago by Michael Dondrup46k

But, on the other hand, setting aside whether or not it actually makes sense, the hypergeometric distribution could be applicable.

ADD REPLYlink written 8.3 years ago by Michael Dondrup46k

I agree: if your 5 arrays are simply replicates of the same condition, then Michael and Stefano are correct. However, if you wanted to check whether, say, 5 different conditions affect the same gene set, then it's a question of set overlap, and a different test is used. For instance, if you independently knocked out 5 genes thought to be part of the same protein complex, and then did microarrays (with replicates for each condition), you might expect similar genes to be affected for each condition, and it would be a very interesting question to compare them.

ADD REPLYlink written 8.3 years ago by seidel6.8k

So, I'm asking a slightly different question. I'm not looking to identify genes which are upregulated; I can do that fine. What I want to show is that in these 5 different yeast strains, a significant number of the same genes are upregulated. So, to simplify, let's say that the array has 5000 genes. On each array 500 genes are upregulated. 100 of the same genes are upregulated on every array. If you find 100 of the same genes upregulated on two arrays, you can use the hypergeometric test to show that that's significant. But how do you factor in all 5 arrays?
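For a rough sense of scale, here is a minimal R sketch using the numbers in this comment (5000 genes, 500 up per array, 100 shared). It assumes the null model is "each array's list is a random draw of 500 genes", which is only one possible null. The simulation settings (`n_sim`, the seed) and all variable names are illustrative and not from the thread. The pairwise part is the hypergeometric test mentioned above; for five arrays it simply simulates random lists and counts how many genes land in all five, or in at least four.

```r
# Numbers taken from the comment above
n_genes  <- 5000   # genes on the array
n_up     <- 500    # genes called "up" on each array
n_shared <- 100    # genes observed as "up" on every array
n_arrays <- 5

# Two arrays: P(overlap >= n_shared) if both lists were random draws.
# phyper(q, ..., lower.tail = FALSE) is P(X > q), hence the "- 1".
p_pair <- phyper(n_shared - 1, n_up, n_genes - n_up, n_up, lower.tail = FALSE)
expected_pair <- n_up * n_up / n_genes   # overlap expected by chance

# Five arrays: simulate random lists (assumed null model) and count genes
# that appear in all five lists, and in at least four of them.
set.seed(1)        # arbitrary seed, only for reproducibility
n_sim <- 2000      # arbitrary number of simulations
sim <- replicate(n_sim, {
  draws <- unlist(lapply(seq_len(n_arrays), function(i) sample.int(n_genes, n_up)))
  hits  <- tabulate(draws, nbins = n_genes)   # number of lists containing each gene
  c(all = sum(hits == n_arrays), at_least_4 = sum(hits >= n_arrays - 1))
})

# Empirical p-value for "n_shared or more genes in all lists" (+1 correction)
p_all <- (sum(sim["all", ] >= n_shared) + 1) / (n_sim + 1)

print(c(p_pair = p_pair, expected_pair = expected_pair, p_all = p_all))
print(rowMeans(sim))   # chance-level counts for "all 5" and ">= 4 of 5"
```

Under this null a given gene is in all five lists with probability 0.1^5, so only about 0.05 genes are expected in the five-way overlap by chance. The empirical p-value cannot go below 1/(n_sim + 1), so it should be read as "smaller than the simulation can resolve". For the 4-out-of-5 version of the question, compare the observed count against the `at_least_4` row instead.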

ADD REPLYlink written 8.3 years ago by User 73520

Yes, Seidel, that's the question I'm trying to answer

ADD REPLYlink written 8.3 years ago by User 73520

That cracks me up, our comments are 1 second apart, and I was using yeast as an example - and that's what you're actually using. I think to extend the analysis across the 5 data sets you multiply the p-values, because each is like asking for the probability of a given event, and you have 4 events (so in my mind it seems like the odds of 4 successive dice rolls).

ADD REPLYlink written 8.3 years ago by seidel6.8k

If the experiments are comparable and you have different strains, you could still treat them as biological replicates. Just ask a slightly different question: what genes are regulated among these strains of yeast? If the log2 ratio of expression of geneX across the different strains is statistically different from zero, it would be picked up.

ADD REPLYlink written 8.3 years ago by Stefano Berri4.1k
8.3 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

Maybe I am getting this wrong, but I do not think the hypergeometric is the way to go. Am I right that you are talking about 5 microarrays for the same "experiment", like 5 biological replicates?

Install LIMMA from Bioconductor, load the microarray data, follow the documentation and perform a "standard" analysis. It is a linear model, and it does not use the hypergeometric, but the t-test (or a derivative of it). If your arrays are Affymetrix, use the affy package first and then LIMMA.

The hypergeometric doesn't take into account HOW MUCH they are upregulated nor how consistent your up-regulation is. The t-test does.

Then, of course, correct for multiple testing.
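To make the contrast with gene counting concrete, here is a rough base-R sketch: a one-sample t-test per gene on log2 ratios across the 5 arrays, followed by Benjamini-Hochberg correction. This is a plain t-test standing in for limma's moderated statistic, not limma itself, and the data are simulated. Every name and number in it is made up for illustration.

```r
set.seed(42)                      # arbitrary seed for reproducibility
n_genes  <- 1000                  # hypothetical number of genes
n_arrays <- 5                     # one log2 ratio per gene per array

# Simulated log2 ratios: pure noise around 0 for most genes ...
log_ratio <- matrix(rnorm(n_genes * n_arrays, mean = 0, sd = 0.3),
                    nrow = n_genes)
# ... and a strong, consistent shift for the first 50 genes
log_ratio[1:50, ] <- log_ratio[1:50, ] + 3

# One-sample t-test per gene: is the mean log2 ratio different from 0?
p_raw <- apply(log_ratio, 1, function(x) t.test(x, mu = 0)$p.value)

# Correct for multiple testing (Benjamini-Hochberg false discovery rate)
p_adj <- p.adjust(p_raw, method = "BH")

# Genes called "up": significant after correction and positive on average
up <- which(p_adj < 0.05 & rowMeans(log_ratio) > 0)
print(length(up))
```

Unlike a count of "up on k of 5 arrays", the test statistic here depends on both the size of the change and its consistency across arrays.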

I would use the hypergeometric only when comparing results of different experiments (using different platform or different conditions), but it does not sound like your case.

In general, try to learn about microarray analysis as much as you can before starting the analysis.

Good luck

ADD COMMENTlink written 8.3 years ago by Stefano Berri4.1k

If the arrays are simply 5 replicates of the same thing, then of course you are right: the hypergeometric idea is NOT the way to test for significance of a set of genes. I assumed he was talking about 5 different conditions, for instance asking whether a similar set of genes is changed when 5 different yeast mutants are each compared to wt.

ADD REPLYlink written 8.3 years ago by seidel6.8k
8.3 years ago by
United States
seidel6.8k wrote:

I think I know the answer, but let me say up front: I'm not a statistician. I think you use the hypergeometric distribution, with the first array forming the basis of a question that you then evaluate against the other arrays. Using the phyper function in R, you can calculate the probability of obtaining the same gene set between two array results, and I think you then simply repeat the process and multiply the resulting p-values (the same way you would multiply the odds of a given repeated dice roll).

The help for phyper uses the urn analogy, so that's what I'll use. Say that a given array has 10,000 spots, and you identify 300 top genes. Then you perform a second array, and you also select 300 top genes. When you examine the overlap, it is 200 genes. What is the likelihood of getting a 200 gene overlap by chance? The first array sets up the urn as follows: there are 10,000 balls total, 300 of them white. Doing the second array asks the question: what is the likelihood of drawing 300 balls from such an urn and having 200 of them be white? (Or more generally, for a top gene set of a given size from the second array, what are the chances that 200 of them will be white?)

In R, the phyper function takes the arguments x = # of white balls drawn (number of genes from array 2 that were found in common with array 1), m = # of white balls total in the urn (size of the original top gene set from array 1), n = # of black balls total in the urn (# of array spots minus the top gene set from array 1), k = # of balls drawn (size of the top gene set from array 2). So the answer for the overlap between array 1 and 2 is:

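# phyper function in R for the hypergeometric distribution
# P(overlap >= x): pass x - 1, because the upper tail of phyper is P(X > q)
phyper(x - 1, m, n, k, lower.tail = FALSE)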

Then you calculate the same thing for array 1 and 3, and 1 and 4, and 1 and 5, and then you multiply them. That's my guess.
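As a hedged sketch of that recipe in R: the first part computes the pairwise p-value for the example in the text (10,000 spots, two lists of 300, overlap 200). For the combining step, note that a raw product of p-values is not itself a p-value, so the sketch swaps in Fisher's method, a standard way to combine independent p-values; that is my substitution, not something this answer proposes. The helper `overlap_p`, the overlaps for arrays 3 to 5, and the assumption that the four comparisons are independent are all illustrative.

```r
# P(overlap >= observed) for two "top gene" lists drawn from n_spots spots.
# Hypothetical helper; the name and argument names are illustrative.
overlap_p <- function(observed, size_1, size_2, n_spots, log.p = FALSE) {
  # phyper's upper tail is P(X > q), so pass observed - 1
  phyper(observed - 1, size_1, n_spots - size_1, size_2,
         lower.tail = FALSE, log.p = log.p)
}

# Example from the text: 10,000 spots, two lists of 300, overlap of 200
p_12 <- overlap_p(200, 300, 300, 10000)
print(p_12)

# Made-up overlaps of array 1 with arrays 2, 3, 4 and 5
overlaps <- c(200, 180, 150, 120)

# Work on the log scale: multiplying such small numbers directly underflows
log_p <- sapply(overlaps, overlap_p, size_1 = 300, size_2 = 300,
                n_spots = 10000, log.p = TRUE)

# Fisher's method: -2 * sum(log p) is chi-squared with 2k degrees of
# freedom under the null, assuming the k p-values are independent
fisher_stat  <- -2 * sum(log_p)
log_combined <- pchisq(fisher_stat, df = 2 * length(log_p),
                       lower.tail = FALSE, log.p = TRUE)
print(log_combined)   # natural log of the combined p-value
```

Because hypergeometric p-values are discrete, Fisher's method is only approximate here, so treat the combined value as a rough guide.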

ADD COMMENTlink written 8.3 years ago by seidel6.8k

I have the impression that the hypergeometric distribution is not appropriate to model this case. Try to apply the urn model to it; it simply doesn't fit. In fact, you need to know how many truly regulated vs. non-regulated cases there are. Further, the model doesn't apply to the idea of the same set of elements being drawn each time.

ADD REPLYlink modified 7.6 years ago • written 8.3 years ago by Michael Dondrup46k

I assumed he was comparing 5 independent conditions (not 5 replicates of the same thing as Stefano mentioned above). If the conditions are independent, and we simply want to compare two of them for gene set overlap, I thought the hypergeometric was the way to go, so I just extended the idea. The array has a set number of spots, and you may choose different size "top sets" from each (e.g. 300 from one, 500 from the other), but once you observe the overlap (200), you can apply a statistical test to assess it.
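The same pairwise question can also be phrased as a one-sided Fisher's exact test on a 2x2 table, which gives the same number as the hypergeometric tail. A small sketch in R with the sizes mentioned here (top sets of 300 and 500 genes, overlap of 200) and an assumed 10,000 spots, as in the example in the answer above; the variable names are mine.

```r
n_spots <- 10000   # assumed array size
size_a  <- 300     # top set from the first condition
size_b  <- 500     # top set from the second condition
both    <- 200     # genes found in both top sets

# 2x2 table: rows = in / not in set B, columns = in / not in set A
tab <- rbind(
  in_b  = c(in_a = both,          not_a = size_b - both),
  not_b = c(in_a = size_a - both, not_a = n_spots - size_a - size_b + both)
)

# One-sided Fisher's exact test: is the overlap larger than expected?
p_fisher <- fisher.test(tab, alternative = "greater", conf.int = FALSE)$p.value

# The same tail probability from the hypergeometric distribution
p_hyper <- phyper(both - 1, size_a, n_spots - size_a, size_b,
                  lower.tail = FALSE)

print(c(fisher = p_fisher, hypergeometric = p_hyper))
```

The two top sets do not need to be the same size for either formulation.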

ADD REPLYlink written 8.3 years ago by seidel6.8k