Question: A Shift In PCA Plot For Population Stratification
3.7 years ago, User 1933 (300) wrote:

I want to compare the population structure of my patient cohort with 1000 Genomes (1KG). First, I converted both VCF files to PED format:

vcftools --gzvcf 1000g/1000g_myvariants.vcf.gz --plink --out 1000g_myvariants
vcftools --vcf myvariants.vcf --plink --out myvariants

Then I extracted the variants that have an rs ID:

grep -o 'rs[0-9]*' > rs.snplist.raw

then sorted the list and removed the duplicates:

sort rs.snplist.raw | uniq > rs.snplist.dedup
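The `grep` line above does not show its input file, so the following is only a minimal sketch of how such a list could be built. It assumes the IDs come from column 2 of the PLINK .map files written by vcftools; the function names `rs_ids` and `shared_rs_ids` are hypothetical helpers, not part of any tool. It also keeps only IDs present in both datasets, since the same list is later passed to `--extract` for both.

```shell
#!/bin/sh
# rs_ids MAPFILE: print the unique rs IDs found in column 2 of a PLINK .map file.
# The pattern requires at least one digit, so a bare "rs" is not accepted.
rs_ids() {
    awk '$2 ~ /^rs[0-9]+$/ { print $2 }' "$1" | sort -u
}

# shared_rs_ids MAP_A MAP_B: print the rs IDs that occur in both .map files.
shared_rs_ids() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    rs_ids "$1" > "$tmp_a"
    rs_ids "$2" > "$tmp_b"
    comm -12 "$tmp_a" "$tmp_b"   # lines common to both sorted lists
    rm -f "$tmp_a" "$tmp_b"
}

# Example (file names follow the question, adjust as needed):
#   shared_rs_ids myvariants.map 1000g_myvariants.map > rs.snplist.dedup
```

`sort -u` does the same job as `sort | uniq` in one step.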

Then I removed the SNPs whose allele codes did not match between the two datasets (listed in all.missnp):

plink --file myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out myvariants.subset
plink --file 1000g_myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out 1000g_myvariants.subset

and finally I merged them

plink --file 1000g_myvariants.subset --merge myvariants.subset.ped myvariants.subset.map --recode --out all

and created an MDS plot:

plink --file all --read-genome all.genome --cluster --mds-plot 2 --out all_mds_2
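Note that `--read-genome all.genome` expects a file that an earlier `--genome` run has already written; that step is not shown in the post. A minimal sketch of the two steps together, assuming the merged prefix `all` from above. The `run` and `mds_steps` helpers are hypothetical, and by default they only print the commands, so nothing is executed unless `RUN_PLINK=1` is set.

```shell
#!/bin/sh
# run CMD...: execute the command when RUN_PLINK=1, otherwise just print it.
run() {
    if [ "${RUN_PLINK:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# mds_steps PREFIX: pairwise IBS/IBD estimation followed by MDS on the result.
mds_steps() {
    prefix=$1
    # Step 1: writes ${prefix}.genome
    run plink --file "$prefix" --genome --out "$prefix"
    # Step 2: reads that file and writes ${prefix}_mds_2.mds
    run plink --file "$prefix" --read-genome "${prefix}.genome" \
        --cluster --mds-plot 2 --out "${prefix}_mds_2"
}

# Example: mds_steps all
```

With `--out all_mds_2`, the coordinates end up in `all_mds_2.mds` rather than `plink.mds`.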

and plotted component 2 versus component 1

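tab = read.table("all_mds_2.mds", h = T)  # PLINK writes <out>.mds, and --out was all_mds_2
# labels assume the 1212 1KG samples come first in the file, followed by the 285 cohort samples
tab$pop = factor(c(rep("1KG", 1212), rep("mycohort", 285)))
plot(tab$C1, tab$C2, col=as.integer(tab$pop), xlab="component 1", ylab="component 2")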
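The `rep()` call labels samples purely by row position, which gives wrong colours if the merged file is not ordered exactly as 1212 1KG samples followed by 285 cohort samples. One possible safeguard, sketched here with a hypothetical `label_mds` helper, is to look up each FID/IID pair from the two .ped files and append a POP column to the .mds file before reading it into R. The labels "1KG" and "mycohort" are taken from the R code; everything else is an assumption.

```shell
#!/bin/sh
# label_mds MDS_FILE REF_PED COHORT_PED
# Prints the .mds table with an extra POP column, matched on FID and IID.
label_mds() {
    awk -v ref_label="1KG" -v cohort_label="mycohort" '
        FILENAME == ARGV[1] { pop[$1 SUBSEP $2] = ref_label; next }     # reference .ped
        FILENAME == ARGV[2] { pop[$1 SUBSEP $2] = cohort_label; next }  # cohort .ped
        FNR == 1 { print $0, "POP"; next }                              # .mds header
        {
            key = $1 SUBSEP $2
            print $0, ((key in pop) ? pop[key] : "NA")                  # NA if unmatched
        }
    ' "$2" "$3" "$1"
}

# Example (file names follow the question):
#   label_mds all_mds_2.mds 1000g_myvariants.subset.ped myvariants.subset.ped > all_mds_2.labelled.mds
```

In R the new column can then replace the `rep()` line, for example `tab$pop = factor(tab$POP)`.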

and here is what the result looks like:


Basically, there is a shift between the two groups, and I am curious what the reason could be. Do I have to filter more SNPs to get the right match? Are there tools other than PLINK for running PCA?

Are the 1000 Genomes variants somehow normalized while the other cohort is not?

exome-sequencing pca • 3.8k views
modified 2.5 years ago by Zhenyu (160) • written 3.7 years ago by User 1933 (300)

Are the black dots individuals from 1000genomes? Which dataset are you using, exactly? Check if the separate groups are due to different sequencing technology.

written 3.7 years ago by Giovanni M Dall'Olio (25k)

Yes, the 1212 individuals from 1KG are shown in black in the plot. Hmm, the 1KG sequencing was done with both Illumina and ABI platforms; I feel I should have normalized the two cohorts separately somehow beforehand.

written 3.7 years ago by User 1933 (300)

Have you tried doing factor analysis to see which SNPs are underlying this? You could also just look at the rotated data (if you were to do the PCA with the prcomp() function in R, this would be output$x). That's the next thing I would try.

modified 3.7 years ago • written 3.7 years ago by Devon Ryan (73k)
2.5 years ago, Zhenyu (160), United States, wrote:

Hi. I am wondering if you have figured out the reason. I recently did a similar smartpca analysis and also see the same kind of shift between the populations.

written 2.5 years ago by Zhenyu (160)

OK, let me answer my own question. I filtered the SNPs on Hardy-Weinberg equilibrium (HWE), and the plot looks fine now.
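The reply does not say which threshold or tool was used for the HWE filter. For anyone wanting to try something similar with PLINK, one possible sketch is to run `plink --file all --hardy --out all` and then list the SNPs that fail a chosen cutoff. The `hwe_fail_list` helper below is hypothetical, the 1e-6 cutoff in the example is only an illustrative assumption, and the column layout (SNP in column 2, TEST in column 3, P in column 9) is what PLINK 1.x writes to the .hwe file as far as I know, so check the header of your own output first.

```shell
#!/bin/sh
# hwe_fail_list HWE_FILE THRESHOLD
# Prints the IDs of SNPs whose HWE p-value is below THRESHOLD,
# using only the rows whose TEST column starts with "ALL".
hwe_fail_list() {
    awk -v thr="$2" '
        NR == 1 { next }                                   # skip the header line
        $3 ~ /^ALL/ && $9 != "NA" && ($9 + 0) < (thr + 0) { print $2 }
    ' "$1"
}

# Example (cutoff is an assumption, not taken from the thread):
#   hwe_fail_list all.hwe 1e-6 > hwe_fail.snplist
#   plink --file all --exclude hwe_fail.snplist --recode --out all.hwe_filtered
```

Writing the list out first, instead of filtering directly with `--hwe`, makes it possible to check which SNPs are being dropped and whether they come mostly from one dataset.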

written 2.5 years ago by Zhenyu (160)

Also, you have to merge the 1KG data with your own and run a single PCA on the combined set, not two separate ones.

written 2.3 years ago by Quak (240)