Correlation test for multiple variables and adjusted p values
4.8 years ago
ASid ▴ 40

I am performing a Pearson correlation test on data for 500 genes. I want to check the association between every pair of genes, which gives an r value and a p value for each pair. In total that is 500 * 499/2 = 124750 tests (gene pairs).
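As a quick check of that count, here is a minimal sketch that enumerates unordered gene pairs. The gene names are placeholders invented for illustration, not the real data.

```python
from itertools import combinations
from math import comb

# Placeholder gene identifiers (hypothetical, for illustration only).
genes = [f"gene{i}" for i in range(1, 501)]

# Number of unordered pairs: n * (n - 1) / 2.
n_pairs = comb(len(genes), 2)
print(n_pairs)

# combinations() yields each unordered pair exactly once.
first_pairs = list(combinations(genes[:3], 2))
print(first_pairs)
```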

I know the next step is to correct for multiple comparisons using an FDR or Bonferroni procedure. Let's say we have chosen FDR to get adjusted p values.

My question is about the order of filtering and calculating adjusted p values. Suppose we first filter the comparisons on a specific r value such as 0.4 (assume for a moment that 1000 comparisons pass the filter because their absolute r value is greater than 0.4) and then run the FDR correction. It will of course use only those 1000 p values. Are we introducing bias here? Is this actually allowed? We really performed 124750 tests, so I am not sure I am going about this the right way.

correlation gene to gene associations • 2.6k views
4.8 years ago

You should use all p-values for the multiple testing correction.
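A minimal sketch of what that looks like, assuming the Benjamini-Hochberg procedure is the FDR method in use. The function name `bh_adjust` and the toy p-values are made up for illustration; in practice an established implementation from a statistics library is the safer choice.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example: pass the p-values of ALL tests, not a pre-filtered subset.
all_p = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(all_p)
print(adj)
```

The key point is that `m` is the total number of tests performed. Any selection of interesting pairs happens afterwards, on the adjusted values.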

One good idea is to plot the nominal p-values and check the shape of the distribution. See here for an explanation: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/


OK, thank you. But I gave the 500 genes number just as an example. My actual data is much larger: it covers 20k genes. In that case, what do you suggest?


The suggestion is to make a histogram of all your p-values and, if the shape of that histogram doesn't indicate any issue, apply a correction using all p-values.
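A rough, dependency-free sketch of such a histogram. The helper `pvalue_histogram` is a hypothetical name, and the uniform random numbers below are only a stand-in for real p-values; a plotting library would do the same job.

```python
import random

def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Stand-in data: uniform p-values, as expected when every null is true.
random.seed(1)
demo_p = [random.random() for _ in range(10000)]
counts = pvalue_histogram(demo_p)

# Crude text rendering: one row per bin, bar length scaled to the tallest bin.
for b, count in enumerate(counts):
    lo, hi = b / len(counts), (b + 1) / len(counts)
    print(f"{lo:.2f}-{hi:.2f} {'#' * (count * 40 // max(counts))} {count}")
```

A flat histogram with a peak near zero is the well-behaved shape. Other shapes are discussed in the article linked above.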

4.8 years ago

Filtering data before statistical testing to increase sensitivity is often done, but it is tricky if one still wants to adequately control the false positive rate. See for example this paper. I would hesitate to do it, and would only consider filtering on information that is clearly independent of the test statistics. So in your case, definitely don't filter on r.
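To illustrate why filtering on r is a problem, here is a small simulation sketch under stated assumptions: every pair is truly uncorrelated (so p-values are uniform), all pairs share the same sample size (so a cutoff on |r| is equivalent to a cutoff on p), and Benjamini-Hochberg is the FDR method. The cutoff of 0.01 and the function name are hypothetical.

```python
import random

def bh_reject_count(pvalues, alpha=0.05):
    """Number of hypotheses rejected by Benjamini-Hochberg at level alpha."""
    m = len(pvalues)
    k_max = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        # Largest rank k whose sorted p-value is below alpha * k / m.
        if p <= alpha * k / m:
            k_max = k
    return k_max

random.seed(0)
n_tests = 124750  # 500 * 499 / 2 gene pairs, all null in this simulation
null_p = [random.random() for _ in range(n_tests)]

# Assumption: with a fixed sample size, keeping |r| above a threshold
# is the same as keeping p below some cutoff (0.01 is a made-up value).
cutoff = 0.01
filtered = [p for p in null_p if p < cutoff]

n_all = bh_reject_count(null_p)         # correction over all tests
n_filtered = bh_reject_count(filtered)  # correction after filtering on r

print("rejected, all tests:    ", n_all)
print("rejected, after filter: ", n_filtered, "of", len(filtered))
```

Every p-value that survives the filter is below 0.01, so after filtering the procedure rejects all of them, even though no real correlation exists in the simulated data.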
