I am running regressions using limma on 450k array DNA methylation data.
I'm interested in the relationship between circulating levels of a hormone and DNAm at ~144,000 CpG sites I've preselected based on variability (non-variable sites likely won't tell us anything about our variable of interest).
I run the regressions with limma and plot the p-values, but they don't look uniform at all. The odd thing is that the pattern is exactly the same for all 3 hormone measurements. I tried removing covariates in case collinearity was the problem, but many different versions of these models give the same picture.
What would explain this kind of distribution - sparse at high and low p-values, but uniform across the rest?
Is there some way I can 'fix' this? More robust tests I can run etc. to see if there is a relationship between my hormones and DNAm?
Hey, the histogram is quite 'blocky' - can you increase the bin resolution? Also, I am not sure that there is anything inherently incorrect about the distribution. Have you additionally checked a QQ plot?
The 'cause' could be this:
You should probably not filter on variance, because limma's empirical Bayes approach 'feeds on' this variation. If you remove it before the fit, the derived test statistics will be affected.
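For the QQ-plot check mentioned above, here is a minimal sketch in plain Python of how the expected and observed quantiles can be paired up, plus a rough inflation/deflation summary. Everything in it is an assumption for illustration: the p-values are simulated rather than taken from a limma fit, the helper names `qq_points` and `genomic_lambda` are made up, and converting p-values to 1-df chi-square statistics is only a common convention, not something limma reports itself.

```python
import math
import random
from statistics import NormalDist, median

def qq_points(pvals):
    """Pair expected and observed -log10(p) values, both sorted ascending."""
    n = len(pvals)
    observed = sorted(-math.log10(p) for p in pvals)
    # Expected quantiles of n uniform p-values: (i - 0.5) / n
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

def genomic_lambda(pvals):
    """Median 1-df chi-square implied by the p-values over its null median.

    Values near 1 suggest well-calibrated p-values, below 1 deflation,
    above 1 inflation. Requires 0 < p <= 1.
    """
    nd = NormalDist()
    chisq = [nd.inv_cdf(p / 2) ** 2 for p in pvals]
    null_median = nd.inv_cdf(0.75) ** 2  # median of chi-square(1), about 0.455
    return median(chisq) / null_median

random.seed(0)
# Simulated stand-ins for real results (hypothetical data)
null_p = [1.0 - random.random() for _ in range(50_000)]  # uniform on (0, 1]
deflated_p = [p ** 0.8 for p in null_p]                  # pushed toward 1

lam_null = genomic_lambda(null_p)
lam_deflated = genomic_lambda(deflated_p)
points = qq_points(null_p)

print("lambda (uniform): ", round(lam_null, 3))
print("lambda (deflated):", round(lam_deflated, 3))
```

Plotting `points` with expected on the x-axis and observed on the y-axis gives the usual QQ plot; points running under the diagonal correspond to a lambda below 1.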
Thanks for your reply!
I filter on variability AFTER the Bayes variance adjustments. I probably didn't need to mention the 'filtered on variability' thing; I just thought someone might wonder why there are ~150k p-values rather than ~450k.
Yes, I looked at QQ plots: some are OK, some look a bit deflated. They run under the line as they move up the expected quantiles, and maybe a few start to run 'under' the confidence intervals.
But most of the problem (I figured it out) was that my histogram limits extended past 0 and 1, so the first and last bins only partly covered the range of possible p-values and looked lower than expected. Once I fix this, those bins no longer look low.
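For anyone hitting the same thing, the effect is easy to reproduce. This is a small sketch in plain Python with simulated uniform p-values (not the actual data from this thread); the `histogram` helper and the bin settings are assumptions chosen for illustration.

```python
import random

def histogram(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value equal to hi lands in the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

random.seed(1)
pvals = [random.random() for _ in range(100_000)]  # simulated null p-values

# Limits cross 0 and 1: 11 bins of width 0.1 from -0.05 to 1.05.
# The first and last bins each cover only half a bin's worth of [0, 1].
overhang = histogram(pvals, -0.05, 1.05, 11)

# Limits aligned with the p-value range: 10 bins of width 0.1 on [0, 1].
aligned = histogram(pvals, 0.0, 1.0, 10)

print("overhanging bins:", overhang)
print("aligned bins:    ", aligned)
```

With overhanging limits the two end bins come out at roughly half the height of the interior bins even though the p-values are perfectly uniform, which matches the 'sparse at high and low p-values' look. Aligning the breaks with 0 and 1 removes it.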