Question

Inconsistencies and QQ plot inflation in CMH Test Results for case-control Pool Seq: Underestimating Geographical Effects?

6 months ago

C • 0

Hello,

I'm working on a project that involves paired (case-control) pool sequencing to investigate the genetic factors in trees. I've observed inflation in QQ plots and some inconsistencies in my results when using the Cochran-Mantel-Haenszel (CMH) test, and I'm seeking guidance and suggestions.

Pooling Design & Sequencing:

I have 12 pools in total, arranged as 6 case-control pairs (6 case pools and 6 control pools). Each pool comprises 11-13 samples. Within each pair, we tried to match samples as closely as possible by DNA quality and sampling location. We performed whole-genome sequencing on these pools, achieving an average coverage of around 15X.

Issues:

After processing the raw data and running the CMH test with Popoolation 2 (--population 1-2,3-4,5-6,7-8,9-10,11-12), the Manhattan plot seemed promising, identifying 17 potential genes. However, the QQ plot was inflated.

[Figure: QQ plot of CMH test on all pools]
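To make clear what the paired design is testing at each SNP, here is a minimal sketch of the textbook CMH statistic for one SNP across the six case-control pairs. It is only an illustration: the function name `cmh_test` and the count layout are my own choices, no continuity correction is applied, and it is not necessarily identical to what Popoolation 2 computes internally.

```python
import math

def cmh_test(tables):
    """Textbook CMH test for one SNP (illustrative helper, not Popoolation 2 code).

    tables: one (a, b, c, d) tuple of allele counts per case-control pair, where
      a = case pool, allele 1      b = case pool, allele 2
      c = control pool, allele 1   d = control pool, allele 2
    Returns (chi2, p) with 1 degree of freedom, no continuity correction.
    """
    diff_sum = 0.0  # sum over pairs of (observed a - expected a)
    var_sum = 0.0   # sum over pairs of the hypergeometric variance of a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a pair with fewer than 2 reads carries no information
        expected = (a + b) * (a + c) / n
        variance = (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        diff_sum += a - expected
        var_sum += variance
    if var_sum == 0:
        return 0.0, 1.0  # SNP is monomorphic in every pair
    chi2 = diff_sum ** 2 / var_sum
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical read counts for one SNP in six pairs (made-up numbers)
example = [(9, 6, 5, 10), (8, 7, 6, 9), (10, 5, 7, 8),
           (7, 8, 6, 9), (9, 6, 8, 7), (8, 7, 5, 10)]
chi2, p = cmh_test(example)
print(f"chi2 = {chi2:.3f}, p = {p:.4g}")
```

Because each pair is its own stratum, a frequency difference that exists only between pairs (for example, between locations) does not contribute to the statistic; only differences within a pair do. Note also that the counts here are reads, not independently sampled chromosomes, so the test does not account for the extra variance from pooling and sequencing.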

After some reading, I suspect that this inflation might be due to confounding factors that the model does not account for. My primary suspicion is a geographical effect. PCA on low-coverage individual sequencing of these samples revealed some clustering by location, and PCA on the Pool-Seq data clearly separated pools 1 and 2 from the other pools.

I attempted the CMH test on:

A subset of pools from the same region (Pools 1-2, 3-4, 9-10).
A subset of pools excluding pools 1 and 2.

In both scenarios, the QQ plots remained inflated. Additionally, the SNPs identified by the different CMH tests varied. For instance, the CMH test on the regional subset (Pools 1-2, 3-4, 9-10) yielded strong signals in the Manhattan plot, identifying 24 potential genes. However, only three of these overlapped with the results from the analysis of all 12 pools.

[Figure: QQ plot of CMH test on regional subset]

I also tried stricter quality-control filters, but the QQ plots did not improve. So I think read quality is probably not the main issue here.
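To compare the runs with something more objective than eyeballing the QQ plots, the inflation can be summarised by the genomic inflation factor (lambda), i.e. the median observed chi-square statistic divided by the median expected under the null. Below is a minimal sketch, assuming the CMH p-values have already been extracted into a plain list; the function name `genomic_inflation` is hypothetical and the example p-values are made up.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(p_values):
    """Genomic inflation factor (lambda) from a list of p-values.

    Each p-value is converted back to a 1-df chi-square statistic via
    chi2 = z**2, where z is the standard normal quantile at p / 2.
    Values outside (0, 1] are skipped.
    """
    normal = NormalDist()
    chi2_values = []
    for p in p_values:
        if 0.0 < p <= 1.0:
            z = normal.inv_cdf(p / 2.0)  # lower tail; sign is irrelevant after squaring
            chi2_values.append(z * z)
    if not chi2_values:
        raise ValueError("no usable p-values")
    return median(chi2_values) / CHI2_1DF_MEDIAN

# Made-up p-values: evenly spread over (0, 1), as expected under the null
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(null_like), 3))
```

A lambda close to 1 indicates little overall inflation, while values well above 1 indicate that the bulk of the test statistics, not just the top hits, is shifted upward. Computing it for the full run, the regional subset, and the run without pools 1 and 2 would show whether any of the subsets actually reduces the inflation.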

Questions:

Could we have underestimated the impact of geographical factors when designing the pools, leading to inconsistent results?

How can I account for this potential confounder? In traditional GWAS, principal components are often used as covariates. However, it seems that Popoolation2 doesn't offer the option to include such covariates.

P.S. I am now trying to run separate CMH tests, splitting the pools by whether they contain samples from site_A: (1) pools 1-2 (7/11 samples from site_A) and 3-4 (2/11 samples from site_A); (2) pools 5-6, 7-8, 9-10, and 11-12, which contain no samples from site_A. I am not sure how much difference this will make, though, and it is certainly not what we intended when we designed the pools.

I would greatly appreciate any insights, recommendations, or relevant references. I aim to understand the possible shortcomings of our methodology and determine how best to address them in our analysis and subsequent manuscript.

Thank you!

QQ-plot CMH-test Pool-Seq • 345 views
