I am curious if anyone has good ideas for this problem I have.
I have a starting data set of ~4000 genes and 113 experimental conditions (data_frame of 'data'). I also have four predicted regulons made up of genes from the starting data set (26 genes ('a'), 16 genes ('b'), 16 genes ('c'), and 6 genes ('d')). My idea was that if the predicted regulons show a good correlation of expression with each other across all the experiments, they are more likely to be real than those genes in a predicted regulon with a weaker correlation.
What I have done is compute the correlation in R:
data.cor <- cor(data) a.cor <- cor(a) b.cor <- cor(b) c.cor <- cor(c) d.cor <- cor(d)
and then used a t.test to compare the correlations of the predicted regulons to the overall data, with the idea that if the predicted regulons are statistically different than the overall data, they are more likely to be real:
This provides p.values very much below 0.05. However, I am concerned that this is likely due to comparing two groups of very different sizes (~4000 rows in 'data' vs 26 rows in 'a' or 6 in 'd').
Can anyone recommend a better way to compare these groups in R? Are the t.test results reliable? I've done Wilcoxon tests too and gotten the same results. Any help or advice would be greatly appreciated!