Question

PCA on subset of patients, from 22 separate chromosome vcf files

0

Entering edit mode

3.2 years ago

vincentpluimakers • 0

Dear all,

I would like to perform a principal component analysis using plink, in order to use these PCs to adjust for genetic ancestry in a GWAS. I have 22 vcf files (per chromosome), with genotype data of ~6000 people. I would like to perform the following steps, but I struggle a bit with the correct order and commands.

I want to subset to ~4500 people of interest. I think I can use bcftools -S for this, using a .txt file with patient IDs that I want to include. Is this correct?
I am not sure whether to prune first or to merge the 22 files first. My guess is I prune the 22 chromosomes using plink first.
Then I would like to merge the files to yield one file with all genotype data. The options merge and concatenate confuse me a little bit, but I think that for the purpose of performing a PCA I need to go for (bcftools) merge, is this right?
I can then perform a pca in plink, and use the eigenvectors of the first n principal components as covariates in my GWAS. Is there a common number of PCs to be used or do I determine this based on the eigenvalues?

Hope you can help me out

Thanks in advance,
Vincent

principal-component-analysis vcf PCA • 2.1k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.2 years ago by vincentpluimakers • 0

score 0 · Answer 1 · 2021-02-08

0

Entering edit mode

3.2 years ago

4galaxy77 2.8k

I'd do something like this

Convert all your 22 vcfs to plink format - e.g. plink --vcf data.chr1.vcf --make-bed --out data.chr1
Merge all chromosomes together e.g. plink --bfile data.chr1 --merge-list mylist.txt --make-bed --out data.allChr
Subset your samples using plink - e.g. plink --bfile data.allChr --keep samples.txt --make-bed --out data.allChr.subset
Then LD prune your merged in plink using --indep-pairwise command in plink
Then run your PCA using plink on the merged/pruned file

other things you might want to consider are filtering for minor allele frequency --maf 0.01 and also doing some IBD checks using --genome. This is a useful tutorial

Hope that helps.

ADD COMMENT • link 3.2 years ago by 4galaxy77 2.8k

0

Entering edit mode

Thank you very much! I'm almost there. At step 4, do I choose for --make-bed as well? This gives me three files pruned.bed, pruned.bim and pruned.fam. If I use these for pca, I get the impression that all variants are still used in pca, not only the ones that remained after pruning. But if I don't use --make-bed, and try to use the file .prune.in for pca, there is an error with not finding a .map file.

ADD REPLY • link 3.2 years ago by vincentpluimakers • 0

0

Entering edit mode

Step 4 will give you a list of SNPs which are in linkage equilibrium (i.e. the ones you want to keep) in a file called xxx.prune.in. You can then keep this snps by running something like plink --bfile data --extract xxx.prune.in --make-bed --out data.pruned, which will create a new set of plink files with the base-name data.prune. You should then use this for your PCA. The steps in the practical I linked should give some more detailed info.

ADD REPLY • link 3.2 years ago by 4galaxy77 2.8k

0

Entering edit mode

Yes that's it. I tried it with one of my vcf files, as they do in the practical, which worked. For the plink files I missed this extra conversion step. Thank you!

ADD REPLY • link 3.2 years ago by vincentpluimakers • 0