Batch effect in population stratification with 1000 Genomes Project data
2.2 years ago

Hi all,

I have an exome VCF dataset that I am trying to plot in a PCA together with data from the 1000 Genomes Project Phase 3 (1000G_2504_high_coverage, WGS). Using plink, I converted my dataset into .bed/.bim/.fam and filtered both datasets with --maf 0.1 and --indep 50 5 1.5. I merged the two datasets on their common variants and ran a PCA with plink. However, when I plot the results, my samples do not overlap with any data points from the reference panel.
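To make the "common variants" step concrete, here is a minimal Python sketch of one way to intersect two .bim files. It is an illustration under assumptions, not the exact procedure used above: the function names `read_bim` and `shared_variants` are made up for this example, variants are matched by chromosome and position rather than by ID (IDs often differ between an exome call set and the 1000 Genomes panel), and both files are assumed to use the same chromosome coding (e.g. `1`, not `chr1`) and the same reference build.

```python
def read_bim(handle):
    """Parse plink .bim lines: chrom, variant ID, cM, bp position, allele 1, allele 2."""
    variants = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, vid, _cm, pos, a1, a2 = fields
        # Key by (chromosome, position); a later duplicate position overwrites an earlier one.
        variants[(chrom, int(pos))] = (vid, a1.upper(), a2.upper())
    return variants


def shared_variants(bim_a, bim_b):
    """Return sorted (chrom, pos) keys present in both dicts with the same unordered allele pair."""
    shared = []
    for key, (_vid, a1, a2) in bim_a.items():
        if key in bim_b:
            _other_vid, b1, b2 = bim_b[key]
            if {a1, a2} == {b1, b2}:
                shared.append(key)
    return sorted(shared)
```

Each file can be read with `read_bim(open(path))`. Sites that share a position but not an allele pair are dropped here on purpose, since those are the ones worth inspecting separately before merging.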

I am using variants from all chromosomes except chromosome X, and I have tried different filtering thresholds in plink, but I still get this batch effect:

[Image: PCA plot]

Any suggestions on what I am doing wrong?

Thank you very much for your help!

pca plink 1kg • 618 views

My guess is strand flips. Did you check that the reference alleles are the same between the two datasets?
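A rough way to check this is to compare the allele pair of each shared site in the two .bim files. The Python sketch below is only an illustration: `classify` and its category labels are hypothetical names, it handles biallelic SNPs only, and it cannot resolve A/T or C/G sites, because those look identical on both strands.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(a1, a2, b1, b2):
    """Compare the allele pair (a1, a2) from one dataset with (b1, b2) from the other."""
    a = (a1.upper(), a2.upper())
    b = (b1.upper(), b2.upper())
    if any(x not in COMPLEMENT for x in a + b):
        return "non_snp"  # indels, missing alleles ("0"), or other codes
    if set(a) == set(b) and COMPLEMENT[a[0]] == a[1]:
        return "ambiguous"  # A/T or C/G: strand cannot be inferred from alleles alone
    if a == b:
        return "match"
    if a == b[::-1]:
        return "swapped"  # same strand, allele order reversed
    flipped = (COMPLEMENT[b[0]], COMPLEMENT[b[1]])
    if a == flipped:
        return "strand_flip"
    if a == flipped[::-1]:
        return "strand_flip_swapped"
    return "mismatch"  # different alleles altogether
```

If many sites come back as `strand_flip` or `mismatch`, the merge is suspect. Flipped sites can be corrected with plink's `--flip`, and ambiguous A/T and C/G sites are commonly dropped before merging.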


I checked for strand issues, but there weren't any in either dataset. I also checked and adjusted the reference build, but I still get the same isolated cluster for my dataset. One remark: before working with plink, the exome dataset was joint-genotyped with GLnexus and then normalized and decomposed. I also imputed it using BEAGLE. Is there a chance that I introduced a technical error before converting the VCF dataset into plink format?
