PLINK: batch effect issues after merging two datasets
7 months ago

Hi,

I merged two PLINK datasets using:

# keep only SNPs present in both datasets
plink --keep-allele-order --bfile dataset_A --extract snp_in_common.txt --make-bed --out dataset_A_common
plink --keep-allele-order --bfile dataset_B --extract snp_in_common.txt --make-bed --out dataset_B_common

echo dataset_A_common > merge.txt
echo dataset_B_common >> merge.txt

# merge datasets
plink --merge-list merge.txt --make-bed --out dataset_merge

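# filter out SNPs with low frequency, low genotyping rate, or strong deviation from HWE
# (--make-bed writes the filtered fileset; without it the filters are not saved)
plink --bfile dataset_merge --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out dataset_merge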


I performed a PCA (after pruning the merged dataset):

# pruning
plink --bfile dataset_merge --exclude high-ld-regions.txt --range --indep-pairwise 50 5 0.2 --out dataset_merge
plink --bfile dataset_merge --extract dataset_merge.prune.in --make-bed --out dataset_merge_pruned

# pca
plink --pca --bfile dataset_merge_pruned --out dataset_merge_pruned


When I plot the PCA, it clearly shows a strong batch effect between the two datasets.

I continued the analysis by running a logistic regression:

plink --bfile dataset_merge --covar pca_file.txt --covar-name PC1,PC2 --logistic --out dataset_merge


Looking at the Manhattan plot and the p-value histogram, something is clearly wrong: most of the p-values are close to 1.
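For readers who want to reproduce the p-value histogram without plotting, here is a minimal sketch. It assumes the standard PLINK 1.9 `.assoc.logistic` layout with a header containing `TEST` and `P` columns. The function name `pvalue_histogram` is made up for illustration. Only the `ADD` rows (the SNP effect) are counted, since with `--covar` PLINK also reports one row per covariate.

```shell
# Hypothetical helper: 10-bin histogram of the ADD-test p-values.
# $1 = PLINK .assoc.logistic file (assumed layout: header row with TEST and P columns)
pvalue_histogram() {
  awk '
    NR == 1 {
      # Locate the TEST and P columns from the header instead of hard-coding them.
      for (i = 1; i <= NF; i++) { if ($i == "TEST") t = i; if ($i == "P") p = i }
      next
    }
    # Keep only the SNP effect rows; covariate rows and NA results are skipped.
    $t == "ADD" && $p != "NA" {
      bin = int($p * 10)
      if (bin > 9) bin = 9        # p = 1 goes into the last bin
      count[bin]++
    }
    END {
      for (b = 0; b < 10; b++)
        printf "%.1f-%.1f\t%d\n", b / 10, (b + 1) / 10, count[b]
    }
  ' "$1"
}
```

With the commands above it would be called as `pvalue_histogram dataset_merge.assoc.logistic`. Under the null the ten counts should be roughly equal; a pile-up in the `0.9-1.0` bin is the skew described here.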

Any idea how to solve this?

Thank you

P.S.: I also posted this on the PLINK Google group. Sorry for the cross-post; I can remove this post if needed.


I am not an expert at all, but it looks like the PCs explain your dataset perfectly, so the mutations no longer matter: with the PCs as covariates, the mutations are not needed for discrimination anymore. Maybe PC1 does not actually separate these two datasets, and correcting for PC2 alone is enough? Sorry if this is not so smart. What does a PCA of the 0/1s look like? If it separates well along PC1, then correcting for PC1 removes all the signal carried by the mutations.


Here the adjustment for PCs is meant to account for population stratification. I would not expect such strong discrimination between the two datasets, as both are based on germline data (already pre-filtered for Caucasian ancestry).


I would say (from my experience) that such a large difference can be explained by different enrichment kits having been used to generate the two datasets. A PCA of a EUR population usually looks like an angle |_, so the picture you have is not typical of population separation; it looks more like a technical batch. But the most important point is how your cases and controls are distributed across this merged dataset. I would plot them in different colors. Since you have such large p-values, I bet the cases and controls are split across batches along the PC1 axis, so PC1 already explains the case/control separation and no variance remains to be explained by the mutations.
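A quick numeric complement to coloring the plot is to count cases and controls per source dataset. The sketch below assumes the phenotype is stored in column 6 of each `.fam` file with the usual PLINK coding (1 = control, 2 = case, anything else = missing). The function name `pheno_by_batch` is hypothetical.

```shell
# Hypothetical helper: count cases, controls and missing phenotypes per .fam file.
# Arguments: one or more PLINK .fam files (column 6 = phenotype, 1 = control, 2 = case).
pheno_by_batch() {
  for fam in "$@"; do
    awk -v name="$fam" '
      $6 == 2 { cases++ }
      $6 == 1 { controls++ }
      $6 != 1 && $6 != 2 { missing++ }
      END {
        printf "%s\tcases=%d\tcontrols=%d\tmissing=%d\n", name, cases, controls, missing
      }
    ' "$fam"
  done
}
```

For this thread it would be run as `pheno_by_batch dataset_A_common.fam dataset_B_common.fam`. If one file contains only cases and the other only controls, batch and phenotype are completely confounded.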


Thanks German.M.Demidov. One important piece of information I left out of my original post is that dataset_A contains the cases and dataset_B the controls in my case/control logistic analysis. I'm still struggling to understand why the p-value distribution is so skewed towards 1 (I thought p-values should be uniformly distributed under the null). Thanks


Oh, then it will be problematic. The logistic regression checks whether a frequency difference in mutation X can discriminate between cases and controls. But when it is given PC2 as a covariate, it does not need mutation X at all to discriminate cases from controls; it says "everything with PC2 > 0 is a case, everything below is a control". That is already enough for the logistic regression. Thus, the p-values are shifted towards 1 because PC2 already separates the two sets and no mutation is needed!
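Whether a PC really splits cases from controls this cleanly can be checked directly. The sketch below compares the range of one PC in cases and in controls; non-overlapping ranges mean complete separation. In that situation logistic regression estimates are known to become unstable, and Wald p-values for the remaining terms tend towards 1. Assumptions: the merged `.fam` holds the phenotype in column 6 (1 = control, 2 = case), the `.eigenvec` file has no header (FID, IID, PC1, PC2, ...), and the function name `pc_separation` is made up for illustration.

```shell
# Hypothetical helper: does a single PC completely separate cases from controls?
# $1 = merged .fam, $2 = PLINK .eigenvec (no header: FID IID PC1 PC2 ...), $3 = PC number
pc_separation() {
  awk -v pc="$3" '
    NR == FNR { pheno[$1 SUBSEP $2] = $6; next }   # first file: phenotype per sample
    {
      g = pheno[$1 SUBSEP $2]
      v = $(pc + 2) + 0                            # PCs start in column 3
      if (g != 1 && g != 2) next                   # skip missing phenotypes
      if (!(g in n) || v < min[g]) min[g] = v
      if (!(g in n) || v > max[g]) max[g] = v
      n[g]++
    }
    END {
      if (!(1 in n) || !(2 in n)) { print "need both cases and controls"; exit }
      sep = (max[1] < min[2] || max[2] < min[1]) ? "yes" : "no"
      printf "controls=[%g,%g] cases=[%g,%g] separated=%s\n", min[1], max[1], min[2], max[2], sep
    }
  ' "$1" "$2"
}
```

With the files from the question this could be run as `pc_separation dataset_merge.fam dataset_merge_pruned.eigenvec 2` (and again with `1` for PC1). A result of `separated=yes` is consistent with the explanation above.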

I am afraid this is a situation without a good solution. If the cases come from one population and the controls from another, there is no way to distinguish real case/control differences from population differences...