Hello,
I'm working on a project that involves paired (case-control) pool sequencing to investigate the genetic factors in trees. I've observed inflation in QQ plots and some inconsistencies in my results when using the Cochran-Mantel-Haenszel (CMH) test, and I'm seeking guidance and suggestions.
Pooling Design & Sequencing:
I have 12 pools in total, forming 6 case-control pairs (6 case pools and 6 control pools). Each pool comprises 11-13 samples. When assigning samples to pools, we tried to group samples with similar DNA quality and from the same locations as closely as possible. We performed whole-genome sequencing on these pools, achieving an average coverage of around 15X.
Issues:
After processing the raw data and running the CMH test with PoPoolation2 (--population 1-2,3-4,5-6,7-8,9-10,11-12), the Manhattan plot looked promising and pointed to 17 potential genes. However, the QQ plot was inflated.
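For anyone who wants to check the logic of the test outside PoPoolation2, here is a minimal sketch of the CMH statistic for one SNP across the paired pools. It is not PoPoolation2's own code: the function name `cmh_test` and the `(a, b, c, d)` table layout are made up for illustration, and it assumes read counts are treated as if they were independently sampled alleles, which is itself a simplification for pooled data.

```python
import math

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test over K paired 2x2 tables (one per pool pair).

    Each table is (a, b, c, d) for a single SNP (hypothetical layout):
        a = case reads, allele 1       b = case reads, allele 2
        c = control reads, allele 1    d = control reads, allele 2
    Returns (chi-square statistic with 1 df, p-value).
    """
    diff = 0.0  # sum over pairs of observed minus expected count in cell a
    var = 0.0   # sum over pairs of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # this pair carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        return 0.0, 1.0  # SNP is monomorphic in every pair
    # Continuity correction, applied only when the deviation is at least 0.5
    yates = 0.5 if abs(diff) >= 0.5 else 0.0
    stat = (abs(diff) - yates) ** 2 / var
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square(1)
    return stat, p_value

# Six pairs with the same direction of allele-frequency difference (toy counts)
stat, p = cmh_test([(15, 5, 5, 15)] * 6)
print(f"CMH chi-square = {stat:.2f}, p = {p:.2e}")
```

Because the variance term shrinks as read depth grows, any extra variation between pools that is not captured by the pairing (for example, location) would push the statistic up across the whole genome.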
After some reading, I suspect that this inflation is due to confounding factors that the model does not account for. My primary suspicion is a geographical effect. PCA on low-coverage individual sequencing of these samples revealed some clustering by location, and PCA on the Pool-Seq data clearly separated pools 1 and 2 from the other pools.
I attempted the CMH test on:
- A subset of pools from the same region (Pools 1-2, 3-4, 9-10).
- A subset of pools excluding pools 1 and 2.
In both scenarios, the QQ plots remained inflated. Additionally, the identified SNPs from the different CMH tests varied. For instance, the CMH test on the regional subset (Pools 1-2, 3-4, 9-10) yielded strong signals in the Manhattan plot, identifying 24 potential genes. However, only three of these overlapped with the results from the analysis of all 12 pools.
I also tried stricter quality-control filters, but the QQ plots did not improve, so I think read quality is probably not the main issue here.
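To compare the runs by more than eye, the inflation in each QQ plot can be summarised with the genomic inflation factor (lambda): the median observed chi-square statistic divided by the median expected under the null. A minimal sketch, assuming the p-values have already been read from the CMH output into a list; `genomic_inflation` is a hypothetical helper name, not part of any tool mentioned here.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Genomic inflation factor: median observed chi-square(1) / expected median."""
    std_normal = NormalDist()
    # A chi-square(1) quantile is the square of a standard-normal quantile,
    # so the statistic that corresponds to p is inv_cdf(p / 2) ** 2.
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values if 0.0 < p <= 1.0]
    if not chi2:
        raise ValueError("no usable p-values")
    expected_median = std_normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median

# Toy input: evenly spaced p-values behave like a well-calibrated test
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(uniform_p), 3))
```

A value near 1 indicates a well-calibrated test, and values well above 1 correspond to the kind of inflation seen in the QQ plots. Computing it for the full run and for each subset would show whether dropping pools 1 and 2 reduces the inflation at all.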
Questions:
Could we have underestimated the impact of geographical factors when designing the pools, leading to inconsistent results?
How can I account for this potential confounder? In traditional GWAS, principal components are often used as covariates. However, PoPoolation2 does not seem to offer an option to include such covariates.
P.S. I am now trying CMH tests that split the pools by whether they contain samples from site_A: (1) pools 1-2 (7/11 samples from site_A) and 3-4 (2/11 samples from site_A); (2) pools 5-6, 7-8, 9-10, and 11-12, which contain no samples from site_A. I am not sure how much difference this will make, and it is certainly not what we intended when we designed the pools...
I would greatly appreciate any insights, recommendations, or relevant references. I aim to understand the possible shortcomings of our methodology and determine how best to address them in our analysis and subsequent manuscript.
Thank you!