Question

Confounding Factors In Microarray Analysis

2

10.2 years ago

int11ap1 ▴ 470

Hello,

I was told that by plotting a histogram of the p-values I get after analysing my data, I would be able to tell whether there are confounding factors in my data. These are my histograms:

[Figure: histograms of p-values, one per comparison]

Do you see any confounding factor here?

microarray • 4.2k views

updated 10.2 years ago by matted 7.8k • written 10.2 years ago by int11ap1 ▴ 470

3

Wouldn't it make much more sense to use the SVA package?

10.2 years ago by Devon Ryan 104k

0

Yes, I did! However, the distributions are more or less the same.

10.2 years ago by int11ap1 ▴ 470

0

Err, does that mean that SVA indicated that there might be a significant component to control for or not?

I should note that lacking any other information, one can't simply look at these plots and determine what might have gone wrong. All we can say is that something is off. Perhaps you have a batch effect. Perhaps you forgot to include an interaction in the model matrix. Perhaps most of those coefficients are trying to measure the same thing, which will also cause issues. These plots won't magically answer these questions.

10.2 years ago by Devon Ryan 104k

0

What confounding factors are you afraid of? To me it doesn't make sense to plot histograms of p-values. You could instead plot the p-values (or better, -log10(p-value)) against each gene's mean expression across the complete cohort, to check that low-expression genes are not overrepresented among the low p-values.
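
For what it's worth, the same check can be done without a plot. This is only a minimal sketch on made-up data: the function name, the bin count and the 0.05 cutoff are my own choices, and the synthetic p-values are deliberately skewed for low-expression genes so there is something to see. It bins genes by mean expression and reports the fraction of small p-values in each bin.

```python
import random

def small_p_fraction_by_expression(mean_expr, pvalues, n_bins=4, alpha=0.05):
    """Split genes into equal-sized bins by mean expression (lowest first)
    and return, for each bin, the fraction of genes with p < alpha."""
    order = sorted(range(len(pvalues)), key=lambda i: mean_expr[i])
    size = len(order) // n_bins
    fractions = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder.
        end = (b + 1) * size if b < n_bins - 1 else len(order)
        idx = order[start:end]
        fractions.append(sum(pvalues[i] < alpha for i in idx) / len(idx))
    return fractions

# Synthetic data (assumption, not real microarray values):
# log2-scale mean expression, and p-values skewed towards 0
# for genes with mean expression below 5.
rng = random.Random(2)
n_genes = 4000
mean_expr = [rng.uniform(2, 14) for _ in range(n_genes)]
pvalues = [rng.random() ** 2 if e < 5 else rng.random() for e in mean_expr]

fractions = small_p_fraction_by_expression(mean_expr, pvalues)
for b, frac in enumerate(fractions):
    print(f"expression bin {b} (low to high): {frac:.3f} of genes have p < 0.05")
```

If the lowest bin shows a clearly higher fraction than the others, low-expression genes are overrepresented among the small p-values, which is what the scatter plot would show visually.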

10.2 years ago by Irsan ★ 7.8k

2

We'd need to know more about your experiment. If your six experimental groups of interest are alpha, cdc15, cdc38, clb, cln, and elu, and you want to know if there were other confounding variables in your experiment that affected comparisons between those groups, then you cannot determine that from looking at these histograms alone. For example, were the cln samples collected or measured in a different batch than the alpha samples? If so, that would be a potential confounder. The histograms tell you about the distribution of p-values from your particular comparisons, but say nothing about confounding variables in your experimental design.
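
As a rough illustration of the batch question, the sample sheet itself can be checked. The batch labels and sample layout below are invented for the example (only the group names come from this thread), and the helper names are hypothetical. The idea is that a comparison between two groups is completely confounded with batch when the two groups share no batch at all.

```python
from collections import defaultdict

def batches_per_group(samples):
    """Map each experimental group to the set of batches it appears in."""
    seen = defaultdict(set)
    for group, batch in samples:
        seen[group].add(batch)
    return dict(seen)

def comparison_confounded(samples, group_a, group_b):
    """True if the two groups have no batch in common, so that any
    group difference cannot be separated from a batch difference."""
    seen = batches_per_group(samples)
    return seen[group_a].isdisjoint(seen[group_b])

# Hypothetical sample sheet: (group, batch) per array.
samples = [
    ("alpha", "batch1"), ("alpha", "batch1"),
    ("cln", "batch2"), ("cln", "batch2"),
    ("clb", "batch1"), ("clb", "batch2"),
]

print(batches_per_group(samples))
print("alpha vs cln confounded:", comparison_confounded(samples, "alpha", "cln"))
print("alpha vs clb confounded:", comparison_confounded(samples, "alpha", "clb"))
```

This only catches the extreme case of complete confounding. Partial imbalance between groups and batches still needs to be handled in the model.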

10.2 years ago by Ahill ★ 1.9k

score 4 · Answer 1 · 2014-02-01

It's better to understand the reasoning or assumptions behind the procedures you're implementing. In this case, I can only guess that someone was telling you this because with a proper statistical model (matched to the obtained data without too many problems such as confounding factors), p-values obtained under the null hypothesis should follow a uniform distribution.

A secondary assumption may be that many or most genes follow the null hypothesis (and therefore have uniformly-distributed p-values), and that only a small set depart from that (see e.g. "Towards the uniform distribution of null P values on Affymetrix microarrays"). With these two assumptions, you could graphically interpret the p-value histograms to see if any depart seriously from uniformity.
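
To make those two assumptions concrete, here is a small simulation sketch. Everything in it is synthetic: the z-test, the 10% of genes with a real effect, and the inflated-variance case are assumptions chosen for illustration, not a model of your arrays. It prints ten-bin histogram counts for p-values when every gene is null, when a minority of genes have a real effect, and when the test statistics are more variable than the model assumes.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a test statistic assumed to be N(0, 1) under the null."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))

def histogram(pvalues, n_bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

rng = random.Random(0)
n_genes = 10000

# Every gene null and the model correct: statistics are pure N(0, 1) noise.
null_p = [two_sided_p(rng.gauss(0, 1)) for _ in range(n_genes)]

# Mostly null, but the first 10% of genes have a true shift of 3.
mixed_p = [two_sided_p(rng.gauss(3 if i < n_genes // 10 else 0, 1))
           for i in range(n_genes)]

# Every gene null, but the statistics have a larger spread than the model
# assumes (one illustrative way a misspecified model can show up).
inflated_p = [two_sided_p(rng.gauss(0, 1.5)) for _ in range(n_genes)]

for name, ps in [("all null", null_p),
                 ("10% real effects", mixed_p),
                 ("inflated statistics", inflated_p)]:
    print(f"{name:>20}: {histogram(ps)}")
```

In the first case the counts are roughly flat. In the second the excess sits in the first bin while the rest stays flat. In the third the whole histogram slopes, which is the kind of departure from uniformity that suggests the model is off.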

Assuming that whoever advised you had this semi-standard practice in mind, I can say that your plots of "alpha vs. clb" and "alpha vs. cln" are the only ones that look somewhat reasonable (i.e. "close" to uniform). Of course, the details of your experiment and statistical analysis could change this completely... this is only a guess based on the limited context you've given.

As an aside, this is the same motivation for looking at q-q plots in genome-wide association studies, where only a small set of SNPs are assumed to relate to the phenotype of interest. There are many methods to control confounding effects and correct biased p-value distributions, but the broad approach of "test statistic correction" (e.g. "Genomic control for association studies") is maybe the most instructive here, if you're interested in more reading.
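
A rough sketch of the test statistic correction idea, under simplifying assumptions: each p-value is treated as coming from a two-sided test and converted to an equivalent 1-degree-of-freedom chi-square statistic, the inflation factor is estimated as the observed median statistic divided by the median expected under the null, and every statistic is divided by that factor. The function names and simulated data are mine, and this is not a substitute for reading the genomic control paper or for fixing the model itself.

```python
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = STD_NORMAL.inv_cdf(0.75) ** 2

def p_to_chi2(p):
    """Two-sided p-value -> equivalent 1-df chi-square statistic."""
    p = min(max(p, 1e-300), 1.0)  # keep inv_cdf inside its valid domain
    z = STD_NORMAL.inv_cdf(p / 2.0)
    return z * z

def chi2_to_p(x):
    """1-df chi-square statistic -> two-sided p-value."""
    return 2.0 * STD_NORMAL.cdf(-(x ** 0.5))

def inflation_factor(pvalues):
    """Observed median statistic divided by the median expected under the null."""
    return median([p_to_chi2(p) for p in pvalues]) / CHI2_1DF_MEDIAN

def genomic_control(pvalues):
    """Divide every statistic by the inflation factor; return new p-values and the factor."""
    lam = inflation_factor(pvalues)
    return [chi2_to_p(p_to_chi2(p) / lam) for p in pvalues], lam

# Simulated null statistics with too much spread (standard deviation 1.5, not 1).
rng = random.Random(1)
raw_p = [chi2_to_p(rng.gauss(0, 1.5) ** 2) for _ in range(5000)]

corrected_p, lam = genomic_control(raw_p)
print("inflation factor before correction:", round(lam, 2))
print("inflation factor after correction: ", round(inflation_factor(corrected_p), 2))
```

An inflation factor near 1 is what you would expect from a well-behaved analysis where most tests are null. A value well above 1 says the statistics are systematically too large.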

But to summarize, you should really understand the statistical details of what you're trying to do, and then the motivation and procedures should (will?) be much clearer.

score 1 · Answer 2 · 2014-02-01

1

10.2 years ago

Neilfws 49k

You were misinformed :) or at least, you should have followed up that advice with the question "Why?"

Using R? Try searching the Bioconductor website for "batch effect" and try some of the packages that come up.