
I think I know the answer, but let me say up front, I'm not a statistician. I think you use the hypergeometric distribution, and the first array forms the basis of a question that you then use to evaluate against the other arrays. Using the phyper function in R, you can calculate the probability of obtaining the same gene set between two array results, and I think you then simply repeat the process and multiply the resulting p-values (the same way you would multiply the odds of a given repeated dice roll).

The help for phyper uses the Urn analogy, so that's what I'll use. Say that a given array has 10,000 spots, and you identify 300 top genes. Then you perform a second array, and you also select 300 top genes. When you examine the overlap, it is 200 genes. What is the likelihood of getting a 200 gene overlap by chance?

The first array sets up the Urn as follows: there are 10,000 balls total, and 300 of them are white. Doing the second array asks the question: what is the likelihood of drawing 300 balls from such an Urn and having 200 of them be white? (Or more generally, for a top gene set of a given size from the second array, what are the chances that 200 of them will be white?)

In R, the phyper function takes these arguments:

x = # of white balls drawn (number of genes from array 2 that were found in common with array 1; R's own help page names this argument q)
m = # of white balls total in the Urn (size of the original top gene set from array 1)
n = # of black balls total in the Urn (# of array spots minus the top gene set from array 1)
k = # of balls drawn (size of the top gene set from array 2)

So the answer for the overlap between array 1 and 2 is:

```
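# phyper function in R for the hypergeometric distribution
# P(overlap >= x): pass x - 1, because the upper tail of phyper is P(X > q)
phyper(x - 1, m, n, k, lower.tail = FALSE)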
```

Then you calculate the same thing for array 1 and 3, and 1 and 4, and 1 and 5, and then you multiply them. That's my guess.
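
To make that concrete, here is a rough sketch in R using the numbers from the example above (10,000 spots, 300 top genes per array, 200 in common). The overlaps for arrays 3, 4, and 5 are made-up values for illustration only, and the last step is just the multiplication guessed at above, not an established way of combining tests. It works on the log scale because each p-value is tiny and their product would underflow to zero.

```r
# Urn set-up from array 1 (numbers taken from the example in the text)
total_spots <- 10000
m <- 300                 # white balls: top genes on array 1
n <- total_spots - m     # black balls: all other spots
k <- 300                 # balls drawn: top genes on the other array

# Overlap with array 1; 200 is from the text, the other three are hypothetical
overlaps <- c(array2 = 200, array3 = 180, array4 = 150, array5 = 210)

# log P(overlap >= x): the upper tail of phyper is P(X > q), hence x - 1
log_p <- phyper(overlaps - 1, m, n, k, lower.tail = FALSE, log.p = TRUE)
log10_p <- log_p / log(10)

# Overlap expected by chance alone
expected_overlap <- k * m / total_spots

# Multiplying p-values is the same as summing their logs
combined_log10_p <- sum(log10_p)

print(expected_overlap)
print(log10_p)
print(combined_log10_p)
```

The expected overlap by chance is only 9 genes here, which is why an overlap of 200 gives such an extreme p-value.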

I think Stefano's answer is on track. The reason: to find up-regulation you need to run a statistical test across replicates (e.g., t-test, limma, ANOVA) if all arrays stem from the same experiment. Then your test options are more or less used up, as there is no way to test whether a gene is regulated on one array only. However, did you instead intend to check differential regulation between unrelated arrays without replicates? Not a good idea.

But, on the other hand, ignoring whether or not it actually makes sense, the hypergeometric distribution could be applicable.

I agree, if your 5 arrays are simply replicates of the same condition then Micheal and Stefano are correct. However, if you wanted to check if, say, 5 different conditions affect the same gene set, then it's a question of set overlap, and a different test is used. For instance, if you independently knocked out 5 genes thought to be part of the same protein complex, and then did microarrays (with replicates for each condition), you might expect similar genes to be affected for each condition, and it would be a very interesting question to compare them.

So, I'm asking a slightly different question. I'm not looking to identify genes which are upregulated; I can do that fine. What I want to show is that in these 5 different yeast strains, a significant number of the same genes are upregulated. So, to simplify, let's say that the array has 5000 genes. On each array 500 genes are upregulated. The same 100 genes are upregulated on every array. If you find the same 100 genes upregulated on two arrays, you can use the hypergeometric test to show that that's significant. But how do you factor in all 5 arrays?
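
For the two-array version of this example, a minimal sketch in R with the numbers given (5000 genes, 500 upregulated per array, 100 shared) might look like this. It covers only the pairwise case that the hypergeometric test handles directly, and the variable names are just for illustration; it does not answer the five-array question.

```r
total_genes <- 5000
up_a <- 500      # upregulated on the first array (white balls in the urn)
up_b <- 500      # upregulated on the second array (balls drawn)
shared <- 100    # upregulated on both

# Overlap expected by chance alone
expected_shared <- up_a * up_b / total_genes

# P(overlap >= shared) under the hypergeometric null
p_pair <- phyper(shared - 1, up_a, total_genes - up_a, up_b, lower.tail = FALSE)

print(expected_shared)
print(p_pair)
```

With these numbers the chance expectation is 50 shared genes, so 100 is twice what random selection would give.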

Yes, Seidel, that's the question I'm trying to answer.

That cracks me up: our comments are 1 second apart, and I was using yeast as an example, and that's what you're actually using. I think to extend the analysis across the 5 data sets you multiply the p-values, because each is like asking for the probability of a given event, and you have 4 events (so in my mind it seems like the odds of 4 successive dice rolls).

If the experiments are comparable and you have different strains, you could still treat them as biological replicates. Just ask yourself a slightly different question: what genes are regulated among these strains of yeast? If the log2 of the expression ratio of geneX across the different strains is statistically different from zero, it would be picked up.
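
As a sketch of that idea in R, a one-sample t-test on the log2 ratios of a single gene across the five strains checks whether the mean differs from zero. The values below are invented for illustration, and a real analysis would more likely run something like limma across all genes at once.

```r
# Hypothetical log2 expression ratios for one gene (geneX) in 5 strains
log2_ratio <- c(strain1 = 1.2, strain2 = 0.9, strain3 = 1.5,
                strain4 = 1.1, strain5 = 1.3)

# One-sample t-test: is the mean log2 ratio different from zero?
res <- t.test(log2_ratio, mu = 0)

print(res$estimate)
print(res$p.value)
```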
