Question: Statistical Tests for Two Data Tables
kmyers2 (University of Wisconsin-Madison) wrote 18 months ago:

I am curious if anyone has good ideas for this problem I have.

I have a starting data set of ~4000 genes and 113 experimental conditions (a data frame called 'data'). I also have four predicted regulons made up of genes from the starting data set: 26 genes ('a'), 16 genes ('b'), 16 genes ('c'), and 6 genes ('d'). My idea was that if the genes in a predicted regulon show well-correlated expression with each other across all the experiments, that regulon is more likely to be real than one whose genes are only weakly correlated.

What I have done is compute the correlation in R:

data.cor <- cor(data)
a.cor <- cor(a)
b.cor <- cor(b)
c.cor <- cor(c)
d.cor <- cor(d)
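
A note on orientation, since the thread does not say whether genes are rows or columns: R's cor() computes correlations between the columns of its argument. The sketch below uses a made-up matrix (the name expr and its 50 x 113 size are assumptions for illustration, not the poster's data) to show that a genes-in-rows table has to be transposed to get a gene-by-gene correlation matrix.

```r
# Toy stand-in for the real table: 50 genes (rows) x 113 conditions (columns).
set.seed(1)
expr <- matrix(rnorm(50 * 113), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("cond", 1:113)))

# cor() correlates COLUMNS, so this is condition-by-condition (113 x 113).
cond.cor <- cor(expr)

# Transpose first to get the gene-by-gene matrix (50 x 50).
gene.cor <- cor(t(expr))

dim(cond.cor)
dim(gene.cor)
```

If the real tables already have genes as columns, cor(data) and cor(a) as written above give gene-by-gene matrices and no transpose is needed.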

and then used t.test to compare the correlations of the predicted regulons to those of the overall data, with the idea that if the predicted regulons are statistically different from the overall data, they are more likely to be real:

t.test(data.cor, a.cor)

This gives p-values far below 0.05. However, I am concerned that this is likely due to comparing two groups of very different sizes (~4000 rows in 'data' vs 26 rows in 'a' or 6 in 'd').

Can anyone recommend a better way to compare these groups in R? Are the t.test results reliable? I've done Wilcoxon tests too and gotten the same results. Any help or advice would be greatly appreciated!


Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote 18 months ago:

The t-test is applicable if your data's distribution is reasonably close to normal. It makes no assumption about sample size, only about variance. If you think the variances differ between the two groups, use Welch's version of the t-test (in R, set var.equal = FALSE, which is actually the default). It is often recommended to always use Welch's test, whether or not the variances are equal.
If you're still concerned, you could also just do a bootstrap test.
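
One way to act on the resampling suggestion is sketched below. To be plain about the swap: this is a random-gene-set (Monte Carlo) comparison rather than a textbook bootstrap. It asks how often a random set of genes of the same size is as well correlated as the regulon. The data are simulated, and the names expr, regulon and mean.pair.cor are hypothetical; genes are assumed to be rows.

```r
set.seed(42)
# Toy data: 200 genes x 113 conditions; the first 10 genes share a signal.
n.genes <- 200
n.cond  <- 113
expr    <- matrix(rnorm(n.genes * n.cond), nrow = n.genes)
shared  <- rnorm(n.cond)
regulon <- 1:10   # hypothetical regulon membership (row indices)
expr[regulon, ] <- expr[regulon, ] +
  matrix(shared, nrow = length(regulon), ncol = n.cond, byrow = TRUE)

# Mean pairwise correlation among a set of genes, each pair counted once.
mean.pair.cor <- function(rows, x) {
  m <- cor(t(x[rows, , drop = FALSE]))
  mean(m[upper.tri(m)])
}

observed <- mean.pair.cor(regulon, expr)

# Null distribution: random gene sets of the same size as the regulon.
n.rand <- 999
null <- replicate(n.rand,
                  mean.pair.cor(sample(n.genes, length(regulon)), expr))

# One-sided empirical p-value (+1 so it is never exactly zero).
p.value <- (sum(null >= observed) + 1) / (n.rand + 1)
```

Because the reference sets have the same size as the regulon, the unequal group sizes that worried the original poster do not enter the comparison.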
If I understand your data correctly, each gene is represented by a 113-d vector, so cor(a) returns a matrix. However, t.test(data.cor, a.cor) will flatten the matrices to vectors, meaning that every off-diagonal value is duplicated because the matrices are symmetric. While this doesn't affect the mean, it does affect the variance.
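
A minimal sketch of one way around the duplication, assuming genes are rows of a simulated matrix (expr and regulon.rows are made-up names): keep only the upper triangle of each correlation matrix before testing. This also drops the diagonal of self-correlations, which are all 1 and would otherwise be included in the flattened vectors.

```r
set.seed(7)
expr <- matrix(rnorm(100 * 113), nrow = 100)   # toy: 100 genes x 113 conditions
regulon.rows <- 1:8                            # hypothetical regulon membership

all.cor <- cor(t(expr))
reg.cor <- cor(t(expr[regulon.rows, ]))

# Keep each gene pair once and drop the diagonal of self-correlations.
all.pairs <- all.cor[upper.tri(all.cor)]
reg.pairs <- reg.cor[upper.tri(reg.cor)]

length(all.pairs)  # choose(100, 2) pairs
length(reg.pairs)  # choose(8, 2) pairs

# Welch's t-test (var.equal = FALSE is the default, spelled out for clarity).
res <- t.test(reg.pairs, all.pairs, var.equal = FALSE)
res$p.value
```

Two caveats on this sketch: the regulon pairs are also contained in the all-gene pairs, and pairwise correlations that share a gene are not independent of each other, so the resulting p-value is best read as a rough indication.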



You are interpreting my data correctly, yes. Is there a better way to compare two matrices?

(Reply written 18 months ago by kmyers2)

At the moment you're comparing the average correlation of the two groups, which is simple but may not be entirely satisfying because, for large sample sizes, even small differences become significant. So unless there is a large difference, the p-value is, in my view, meaningless in terms of biological relevance. I think you can use Mantel's test by comparing a matrix of regulon assignment (i.e. using a 0/1 coding) to the correlation matrix. The question this answers is: are genes in the same regulon also similar in terms of their expression pattern? Another option would be a feature selection approach to find out which experimental conditions are good indicators/predictors of regulon membership.
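
The Mantel idea can be sketched in base R as below. This is an illustration on simulated data, not the exact procedure used in this thread: the names expr, in.regulon, member and mantel.stat are hypothetical, and genes are assumed to be rows. The vegan package's mantel() function offers a packaged permutation test of the same kind.

```r
set.seed(3)
# Toy data: 40 genes x 113 conditions; genes 1-8 form a hypothetical regulon.
n.genes <- 40
n.cond  <- 113
expr <- matrix(rnorm(n.genes * n.cond), nrow = n.genes)
in.regulon <- seq_len(n.genes) <= 8
expr[in.regulon, ] <- expr[in.regulon, ] +
  matrix(rnorm(n.cond), nrow = 8, ncol = n.cond, byrow = TRUE)

gene.cor <- cor(t(expr))

# 0/1 co-membership matrix: 1 when both genes are in the regulon.
member <- outer(in.regulon, in.regulon, "&") * 1

# Mantel statistic: correlation between the two matrices' upper triangles.
mantel.stat <- function(a, b) cor(a[upper.tri(a)], b[upper.tri(b)])

observed <- mantel.stat(gene.cor, member)

# Null distribution: permute gene labels (rows and columns together).
n.perm <- 999
perm <- replicate(n.perm, {
  idx <- sample(n.genes)
  mantel.stat(gene.cor, member[idx, idx])
})

# One-sided empirical p-value.
p.value <- (sum(perm >= observed) + 1) / (n.perm + 1)
```

Both matrices must cover the same genes in the same order. With 999 permutations the smallest attainable p-value is 0.001.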

(Reply written 18 months ago by Jean-Karim Heriche)

Thanks! I hadn't heard of Mantel's test. I've tried it but need to read more to understand the results. I compared the matrix of all data and the matrix of the individual regulon and got a significance of 0.001 and a statistic r of 0.9113. I'll read more, but thanks for your help!

(Reply written 18 months ago by kmyers2)