Confounding Factors In Microarray Analysis

9.4 years ago

int11ap1

Hello,

I was told that a histogram of the p-values from my analysis can reveal whether there are confounding factors in my data. These are my histograms:

Do you see any confounding factors here?

9.4 years ago

matted

It's better to understand the reasoning and assumptions behind the procedures you're using. In this case, I can only guess at why someone told you this: with a proper statistical model (one that fits the data and isn't badly affected by problems such as confounding factors), p-values obtained under the null hypothesis should follow a uniform distribution.

A secondary assumption may be that many or most genes follow the null hypothesis (and therefore have uniformly-distributed p-values), and that only a small set depart from that (see e.g. "Towards the uniform distribution of null P values on Affymetrix microarrays"). With these two assumptions, you could graphically interpret the p-value histograms to see if any depart seriously from uniformity.
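To make the two assumptions concrete, here is a minimal simulation sketch in Python (standard library only). It is not your analysis: the two-sided z-test, the 5% share of non-null genes, and the effect size of 3 are all assumptions chosen for illustration, and the function names are made up.

```python
import random
from statistics import NormalDist

def simulate_p_values(n_genes=5000, frac_non_null=0.05, effect=3.0, seed=0):
    """Two-sided z-test p-values: most genes are null, a few are shifted.

    All parameter values are illustrative assumptions, not estimates
    from any real microarray experiment.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_genes):
        shift = effect if rng.random() < frac_non_null else 0.0
        z = rng.gauss(shift, 1.0)
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values

def histogram(p_values, n_bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

n_bins = 10
counts = histogram(simulate_p_values(), n_bins)
for i, c in enumerate(counts):
    # Crude text histogram: one '#' per 25 genes.
    print(f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} {'#' * (c // 25)}")
```

With `frac_non_null=0.0` the bins should come out roughly flat; with a small non-null fraction you should see a flat histogram plus a spike near zero. Shapes other than those two are what would prompt a closer look.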

Assuming this is the semi-standard practice that someone was suggesting, I can say that your plots of "alpha vs. clb" and "alpha vs. cln" are the only ones that look somewhat reasonable (i.e. "close" to uniform). Of course, the details of your experiment and statistical analysis could change this completely... this is only a guess based on the limited context you've given.

As an aside, this is the same motivation for looking at q-q plots in genome-wide association studies, where only a small set of SNPs are assumed to relate to the phenotype of interest. There are many methods to control confounding effects and correct biased p-value distributions, but the broad approach of "test statistic correction" (e.g. "Genomic control for association studies") is maybe the most instructive here, if you're interested in more reading.
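As a rough illustration of the "test statistic correction" idea, the sketch below computes a genomic-control-style inflation factor (the median observed 1-df chi-square statistic divided by its expected median under the null) and rescales the statistics by it. The simulated data, where a hidden factor is assumed to inflate the spread of null z-scores from 1.0 to 1.3, and all function names are assumptions for illustration only.

```python
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def p_to_chi2(p):
    """Convert a two-sided p-value to the equivalent 1-df chi-square statistic."""
    p = min(max(p, 1e-300), 1.0)  # keep inv_cdf's argument inside (0, 1)
    return STD_NORMAL.inv_cdf(p / 2.0) ** 2

def inflation_factor(p_values):
    """Median observed chi-square divided by the median expected under the null."""
    return median([p_to_chi2(p) for p in p_values]) / CHI2_1DF_MEDIAN

def genomic_control(p_values):
    """Divide each chi-square statistic by the inflation factor (if above 1)."""
    lam = max(inflation_factor(p_values), 1.0)
    adjusted = []
    for p in p_values:
        z = (p_to_chi2(p) / lam) ** 0.5
        adjusted.append(2.0 * (1.0 - STD_NORMAL.cdf(z)))
    return adjusted

def simulated_p_values(sd, n=5000, seed=1):
    """Null z-scores with standard deviation `sd`; sd > 1 mimics inflation."""
    rng = random.Random(seed)
    return [2.0 * (1.0 - STD_NORMAL.cdf(abs(rng.gauss(0.0, sd)))) for _ in range(n)]

print("well-behaved:", round(inflation_factor(simulated_p_values(sd=1.0)), 2))
inflated = simulated_p_values(sd=1.3)
print("inflated:    ", round(inflation_factor(inflated), 2))
print("corrected:   ", round(inflation_factor(genomic_control(inflated)), 2))
```

An inflation factor near 1 is consistent with mostly null, well-calibrated tests; values clearly above 1 suggest systematic inflation. Note that this rescaling assumes most tests are null, which is exactly the secondary assumption above.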

But to summarize, you should really understand the statistical details of what you're trying to do, and then the motivation and procedures should (will?) be much clearer.

Well, you have to look at the assumptions I outlined and ask yourself if they apply to your experiment. I don't know your experimental setup or statistical procedure, so I can't answer that.

For example, if you're doing a test where it's reasonable to expect that most or all tests are significant, then the p-value distribution won't be uniform. If you expect it to be uniform and it isn't, then you should investigate confounding factors or other systematic problems.

In general, this isn't a rigorous approach... it's more of a graphical "sniff test" to see whether to dig deeper. So I wouldn't treat any results from it (positive or negative) as perfect truth without more thinking/analysis.

9.4 years ago

Neilfws

You were misinformed :) or at least, you should have followed up that advice with the question "Why?"

Using R? Try searching the Bioconductor website for "batch effect" and try some of the packages that come up.

Wouldn't it make much more sense to use the SVA package?

Yes, I did! However, distributions are more or less the same.

Err, does that mean that SVA indicated that there might be a significant component to control for or not?

I should note that lacking any other information, one can't simply look at these plots and determine what might have gone wrong. All we can say is that something is off. Perhaps you have a batch effect. Perhaps you forgot to include an interaction in the model matrix. Perhaps most of those coefficients are trying to measure the same thing, which will also cause issues. These plots won't magically answer these questions.

What confounding factors are you worried about? To me it doesn't make sense to plot histograms of p-values. It may be more useful to plot the p-values (or better, -log10(p-values)) against the mean expression across the complete cohort, to check that low-expression genes are not overrepresented among the low p-values.
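A tabular stand-in for that plot, as a minimal sketch: sort genes by mean expression, split them into equal-sized bins, and report the fraction of small p-values in each bin. The function name, the four bins, the 0.05 cutoff, and the toy numbers are all assumptions for illustration.

```python
def small_p_fraction_by_expression(mean_expr, p_values, n_bins=4, alpha=0.05):
    """Fraction of p-values below `alpha` in each mean-expression bin.

    Bins are equal-sized groups of genes ordered from lowest to highest
    mean expression.
    """
    if len(mean_expr) != len(p_values):
        raise ValueError("mean_expr and p_values must have the same length")
    if len(mean_expr) < n_bins:
        raise ValueError("need at least one gene per bin")
    order = sorted(range(len(mean_expr)), key=lambda i: mean_expr[i])
    fractions = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        idx = order[start:stop]
        fractions.append(sum(p_values[i] < alpha for i in idx) / len(idx))
    return fractions

# Toy data (made up): the two lowest-expressed genes both have small p-values.
mean_expr = [1, 2, 3, 4, 5, 6, 7, 8]
p_values = [0.01, 0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 0.03]
print(small_p_fraction_by_expression(mean_expr, p_values))
# → [1.0, 0.0, 0.0, 0.5]
```

If the first bin (lowest expression) carries a much larger share of small p-values than the others, that is a hint to look at filtering or variance modelling for low-expression genes.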

We'd need to know more about your experiment. If your experimental groups of interest are alpha, cdc15, cdc38, clb, cln, and elu, and you want to know if there were other confounding variables in your experiment that affected comparisons between those groups, then you cannot determine that from looking at these histograms alone. For example, were the cln samples collected or measured in a different batch than the alpha samples? If so, that would be a potential confounder. The histograms tell you about the distribution of p-values from your particular comparisons, but say nothing about confounding variables in your experimental design.
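Checking the design itself is straightforward once you have a sample sheet. The sketch below cross-tabulates group against batch and flags a comparison as confounded when the two groups share no batch. The sample sheet is entirely hypothetical (the batch labels `b1` and `b2` are invented), as are the function names.

```python
from collections import Counter

def cross_tabulate(groups, batches):
    """Count samples for every (group, batch) combination."""
    return Counter(zip(groups, batches))

def comparison_confounded(groups, batches, group_a, group_b):
    """True if the two groups share no batch, so group and batch
    effects cannot be separated for that comparison."""
    batches_a = {b for g, b in zip(groups, batches) if g == group_a}
    batches_b = {b for g, b in zip(groups, batches) if g == group_b}
    return batches_a.isdisjoint(batches_b)

# Hypothetical sample sheet: one entry per array.
groups = ["alpha", "alpha", "clb", "clb", "cln", "cln"]
batches = ["b1", "b1", "b1", "b1", "b2", "b2"]

for (group, batch), n in sorted(cross_tabulate(groups, batches).items()):
    print(group, batch, n)

print("alpha vs. cln confounded:", comparison_confounded(groups, batches, "alpha", "cln"))
print("alpha vs. clb confounded:", comparison_confounded(groups, batches, "alpha", "clb"))
```

In this made-up layout, every cln sample sits in a batch that contains no alpha samples, so any alpha vs. cln difference could equally be a batch difference. No p-value histogram can tell those apart; only the design can.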