Question

Statistical Tests for Two Data Tables

1

5.2 years ago

kmyers2 ▴ 80

I am curious if anyone has good ideas for this problem I have.

I have a starting data set of ~4000 genes and 113 experimental conditions (a data frame called 'data'). I also have four predicted regulons made up of genes from the starting data set (26 genes ('a'), 16 genes ('b'), 16 genes ('c'), and 6 genes ('d')). My idea was that if the genes in a predicted regulon show a good correlation of expression with each other across all the experiments, that regulon is more likely to be real than one whose genes are more weakly correlated.

What I have done is compute the correlation in R:

data.cor <- cor(data)
a.cor <- cor(a)
b.cor <- cor(b)
c.cor <- cor(c)
d.cor <- cor(d)

and then used t.test to compare the correlations of the predicted regulons to those of the overall data, with the idea that if the predicted regulons are statistically different from the overall data, they are more likely to be real:

t.test(data.cor, a.cor)

This gives p-values far below 0.05. However, I am concerned that this is likely due to comparing two groups of very different sizes (~4000 rows in 'data' vs 26 rows in 'a' or 6 in 'd').

Can anyone recommend a better way to compare these groups in R? Are the t.test results reliable? I've done Wilcoxon tests too and gotten the same results. Any help or advice would be greatly appreciated!

Thanks!

score 4 · Answer 1 · 2019-02-07

4

5.2 years ago

Jean-Karim Heriche 27k

The t-test is applicable if your data's distribution is reasonably close to normal. It doesn't make any assumption about sample size, only about variance. If you think the variances differ between the two groups, use Welch's version of the t-test (in R, var.equal = FALSE, which is already the default for t.test). It is often recommended to always use Welch's test, regardless of whether the variances are equal or not.
If you're still concerned, you could also just do a bootstrap test.
If I understand your data correctly, each gene is represented by a 113-dimensional vector, so cor(a) returns a matrix. However, t.test(data.cor, a.cor) flattens the matrices to vectors, so every off-diagonal value is counted twice because the matrices are symmetric. While this doesn't affect the mean, it does affect the variance.
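
A minimal sketch of how this could look, on simulated data. The matrix `expr`, the gene names and the planted regulon are all made up for illustration, and genes are assumed to be in rows (hence the `t()` before `cor()`). It keeps each gene pair once by taking the upper triangle, runs Welch's test, and adds a simple resampling check that compares the regulon's mean pairwise correlation with that of random gene sets of the same size. The random sets are drawn without replacement, so strictly speaking this is a Monte Carlo resampling test rather than a textbook bootstrap.

```r
set.seed(1)

# Simulated stand-in for the real data (assumption: genes in rows,
# conditions in columns). Replace with your own matrix.
n_genes <- 200
n_cond  <- 113
expr <- matrix(rnorm(n_genes * n_cond), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)), NULL))

# Hypothetical regulon: 26 genes that share a common expression signal.
regulon_genes <- rownames(expr)[1:26]
shared <- rnorm(n_cond)
expr[regulon_genes, ] <- expr[regulon_genes, ] +
  matrix(shared, nrow = 26, ncol = n_cond, byrow = TRUE)

# Gene-by-gene correlations. cor() correlates columns, so transpose first.
# upper.tri() keeps each gene pair once and drops the diagonal of 1s.
pairwise_cor <- function(m) {
  cm <- cor(t(m))
  cm[upper.tri(cm)]
}

reg_cor <- pairwise_cor(expr[regulon_genes, ])
all_cor <- pairwise_cor(expr)

# Welch's t-test (var.equal = FALSE is the default, spelled out here).
welch <- t.test(reg_cor, all_cor, var.equal = FALSE)

# Resampling check: how often does a random gene set of the same size
# reach a mean pairwise correlation at least as high as the regulon's?
n_resamples <- 500
obs_mean <- mean(reg_cor)
null_means <- replicate(n_resamples, {
  idx <- sample(n_genes, length(regulon_genes))
  mean(pairwise_cor(expr[idx, ]))
})
p_resample <- (sum(null_means >= obs_mean) + 1) / (n_resamples + 1)

print(welch$p.value)
print(p_resample)
```

Pairwise correlations that share a gene are not independent of each other, which is one reason the resampling p-value may be easier to defend than the t-test p-value.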

0

Thanks.

You are interpreting my data correctly, yes. Is there a better way to compare two matrices?

Reply • 5.2 years ago by kmyers2 ▴ 80

0

At the moment you're comparing the average correlation of the two groups, which is simple but may not be entirely satisfying because for large sample sizes even small differences become significant. So unless there is a large difference, the p-value is, in my view, meaningless in terms of biological relevance. I think you can use Mantel's test by comparing a matrix of regulon assignment (i.e. using a 0/1 coding) to the correlation matrix. The question this answers is: are genes in the same regulon also similar in terms of their expression pattern? Another option would be a feature selection approach to find out which experimental conditions are good indicators/predictors of regulon membership.
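
To make the Mantel idea concrete, here is a rough base-R sketch on simulated data. Everything in it is an assumption for illustration: the matrix `expr`, the `regulon` labels and the helper `mantel_perm` are hypothetical names, and the helper is a hand-rolled permutation version of the test, not the implementation from any package (the vegan package provides a `mantel` function if you prefer a maintained one). The 0/1 matrix codes whether two genes belong to the same predicted regulon, and it is compared with the gene-by-gene correlation matrix.

```r
set.seed(2)

# Simulated expression data (assumption: genes in rows, conditions in columns).
n_genes <- 60
n_cond  <- 113
expr <- matrix(rnorm(n_genes * n_cond), nrow = n_genes)

# Hypothetical regulon labels; "none" marks genes not assigned to a regulon.
regulon <- rep(c("a", "b", "none"), times = c(15, 10, 35))

# Plant a shared signal in each regulon so there is something to detect.
for (r in c("a", "b")) {
  idx <- which(regulon == r)
  expr[idx, ] <- sweep(expr[idx, ], 2, rnorm(n_cond), "+")
}

# 0/1 coding: 1 if both genes are in the same regulon, 0 otherwise.
in_reg <- regulon != "none"
same_regulon <- (outer(regulon, regulon, "==") &
                 outer(in_reg, in_reg, "&")) * 1

# Gene-by-gene correlation matrix.
cor_mat <- cor(t(expr))

# Mantel-style permutation test: correlate the upper triangles, then
# permute rows and columns of one matrix together to build the null.
mantel_perm <- function(x, y, n_perm = 999) {
  ut  <- upper.tri(x)
  obs <- cor(x[ut], y[ut])
  perm <- replicate(n_perm, {
    p <- sample(nrow(y))
    cor(x[ut], y[p, p][ut])
  })
  list(statistic = obs,
       p.value   = (sum(perm >= obs) + 1) / (n_perm + 1))
}

res <- mantel_perm(same_regulon, cor_mat)
print(res)
```

Note that with 999 permutations the smallest p-value this can return is 1/1000 = 0.001, so a reported 0.001 only means that no permutation matched or beat the observed statistic.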

Reply • 5.2 years ago by Jean-Karim Heriche 27k

0

Thanks! I hadn't heard of Mantel's test. I've tried it but need to read more to understand the results. I compared the matrix of all data and the matrix of the individual regulon and got a significance of 0.001 and a statistic r of 0.9113. I'll read more, but thanks for your help!

Reply • 5.2 years ago by kmyers2 ▴ 80