How to properly merge data with 1000 Genomes data for PCA
8.1 years ago
anitabow ▴ 10

Hi all,

I apologize in advance if this seems like an elementary question. I am trying to run a PCA of my admixed case/control samples together with two "parental" populations from 1000 Genomes (CEU and YRI in this case). If I use EIGENSOFT's smartpca to run a PCA of CEU, YRI, and ASW, I get exactly what I would expect. However, if I add my own samples, which are ethnically very similar to ASW, I get a complete mess. I've tried limiting the SNP set to those found in both the 1000 Genomes data and my samples using bcftools isec, but that PCA looked terrible too. Am I completely missing something? I really just want to be able to show that the ancestry of my cases and controls is the same. Both should fall along a cline between CEU and YRI.
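For illustration only, here is a minimal Python sketch of one sanity check that can be run on the shared SNP set before merging: comparing the alleles recorded at each shared position. It is not what bcftools isec does internally. The function name `classify_sites` and the tuple layout `(chrom, pos, ref, alt)` are assumptions made for this example, and it only handles biallelic SNPs.

```python
from typing import List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); hypothetical layout

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_sites(panel: Set[Variant], study: Set[Variant]):
    """Sort study variants that share a position with the panel into
    exact matches, possible strand flips, and allele mismatches."""
    # Index the reference panel by position so alleles can be compared.
    by_pos = {}
    for chrom, pos, ref, alt in panel:
        by_pos.setdefault((chrom, pos), set()).add((ref, alt))

    shared: List[Variant] = []
    flipped: List[Variant] = []
    mismatched: List[Variant] = []
    for chrom, pos, ref, alt in sorted(study):
        panel_alleles = by_pos.get((chrom, pos))
        if panel_alleles is None:
            continue  # position absent from the panel; not part of the intersection
        if (ref, alt) in panel_alleles:
            shared.append((chrom, pos, ref, alt))
        elif (COMPLEMENT.get(ref), COMPLEMENT.get(alt)) in panel_alleles:
            # Same site reported on the opposite strand.
            flipped.append((chrom, pos, ref, alt))
        else:
            mismatched.append((chrom, pos, ref, alt))
    return shared, flipped, mismatched
```

If the `flipped` or `mismatched` lists are large, the two datasets may be coding alleles differently, which is worth ruling out before interpreting the PCA.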

pca eigensoft SNP vcf ancestry • 5.3k views

smartpca gets complicated. What data format did you start with (are your samples and the 1000 Genomes data in VCF or PLINK format)? Did you use the lsqproject option? It's also a good idea to take PCA shrinkage into account if your samples have a small number of SNPs relative to the reference samples. In that case you should subsample SNPs from the reference samples and plot only those.
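As a rough sketch of what using lsqproject looks like, the Python helper below writes out a smartpca parameter file together with a population list. The parameter names (`genotypename`, `snpname`, `indivname`, `evecoutname`, `evaloutname`, `numoutevec`, `poplistname`, `lsqproject`) are smartpca options; the helper name `smartpca_parfile`, the file prefix, and the output file names are placeholders invented for this example.

```python
from typing import List, Tuple


def smartpca_parfile(prefix: str, ref_pops: List[str], n_pcs: int = 10) -> Tuple[str, str]:
    """Return (parfile text, population list text) for a projection run.

    PCs are computed from the populations named in the poplist file;
    all other samples are projected onto them.
    """
    par_lines = [
        f"genotypename: {prefix}.geno",   # EIGENSTRAT genotype file
        f"snpname: {prefix}.snp",
        f"indivname: {prefix}.ind",       # third column holds the population label
        f"evecoutname: {prefix}.evec",
        f"evaloutname: {prefix}.eval",
        f"numoutevec: {n_pcs}",
        f"poplistname: {prefix}.refpops.txt",  # hypothetical file name
        "lsqproject: YES",                # least-squares projection of the other samples
    ]
    return "\n".join(par_lines) + "\n", "\n".join(ref_pops) + "\n"


par_text, pop_text = smartpca_parfile("merged", ["CEU", "YRI"])
```

The two strings would then be written to disk and the parameter file passed to smartpca with `-p`. The population labels in the poplist have to match the labels in the `.ind` file exactly.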


Both my samples and the 1000 Genomes samples were in VCF format. I was aware there might be issues with simply merging the two datasets, since the 1000 Genomes data is phased and mine is not. After taking only the intersection of SNPs found in both 1KG and my data, I pruned for LD, wrote the result out as PLINK map/ped files, and used convertf to turn those into EIGENSTRAT format. I'm not sure PCA shrinkage is the issue here, but when I calculated the principal components using only YRI and CEU and then projected ASW and my own samples onto them, the result looked fine and made sense. Is this the way I should have been doing it all along?
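To make the "compute PCs on the reference populations, then project everything else" idea concrete, here is a minimal NumPy sketch. It is not smartpca's implementation: it assumes a complete genotype matrix of 0/1/2 allele counts with no missing data, and the function names `fit_reference_pcs` and `project` are made up for this example.

```python
import numpy as np


def fit_reference_pcs(ref_geno: np.ndarray, n_pcs: int = 2):
    """Compute SNP loadings from reference samples only.

    ref_geno: (n_samples, n_snps) array of 0/1/2 allele counts, no missing values.
    """
    freq = ref_geno.mean(axis=0) / 2.0
    keep = (freq > 0) & (freq < 1)          # drop SNPs monomorphic in the reference
    p = freq[keep]
    scale = np.sqrt(p * (1 - p))
    z = (ref_geno[:, keep] - 2 * p) / scale  # center and scale each SNP
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[:n_pcs].T                  # (n_kept_snps, n_pcs)
    return keep, p, scale, loadings


def project(geno: np.ndarray, keep, p, scale, loadings) -> np.ndarray:
    """Project samples onto the reference PCs, normalizing with the
    reference allele frequencies rather than the samples' own."""
    z = (geno[:, keep] - 2 * p) / scale
    return z @ loadings
```

In this setup `fit_reference_pcs` would be called on the CEU and YRI genotypes only, and `project` on ASW and the study samples, so the admixed samples have no influence on where the axes point.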


Yes. However, be aware that the projections of the 1000 Genomes samples and of your own samples might not be directly comparable, as they don't come from the same source. I had the same problem when merging my samples with the POPRES database.
