Dear all, I would like to run a principal component analysis (PCA) in plink and use the resulting PCs to adjust for genetic ancestry in a GWAS. I have 22 VCF files (one per chromosome) with genotype data for ~6000 people. I would like to perform the following steps, but I am unsure about the correct order and commands.
- I want to subset to the ~4500 people of interest. I think I can use bcftools view -S for this, with a .txt file listing the patient IDs I want to keep (one per line). Is this correct?
- I am not sure whether to LD-prune first or to combine the 22 files first. My guess is that I should prune each of the 22 chromosome files with plink first.
- Then I would like to combine the files into a single file with all genotype data. The difference between bcftools merge and bcftools concat confuses me a little, but I think that for a PCA I need bcftools merge. Is this right?
- I can then run the PCA in plink and use the eigenvectors of the first n principal components as covariates in my GWAS. Is there a commonly used number of PCs, or should I choose n based on the eigenvalues?
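
The steps above can be sketched as a shell script. This is a minimal, untested outline under stated assumptions, not a verified pipeline. The file names (chr1.vcf.gz to chr22.vcf.gz, keep_ids.txt) and the helper names (run, pca_pipeline, eigenval_share) are hypothetical. The plink flags are those of PLINK 1.9, and the MAF and pruning thresholds are illustrative placeholders. The sketch uses bcftools concat rather than merge, because concat stacks files that hold the same samples over different regions, while merge combines files that hold different samples. It also combines the files before pruning, which is one possible order, not the only one.

```shell
#!/bin/sh
set -eu

# Echo each command; skip execution when DRY_RUN=1 (useful for checking the plan).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# $1: text file with one sample ID per line (hypothetical, e.g. keep_ids.txt).
pca_pipeline() {
  keep_ids=$1
  subset_files=""
  chr=1

  # 1. Subset each per-chromosome VCF to the samples of interest.
  while [ "$chr" -le 22 ]; do
    run bcftools view -S "$keep_ids" -Oz -o "chr${chr}.sub.vcf.gz" "chr${chr}.vcf.gz"
    subset_files="$subset_files chr${chr}.sub.vcf.gz"
    chr=$((chr + 1))
  done

  # 2. Same samples, different chromosomes: stack the files with concat.
  #    $subset_files is left unquoted on purpose so it splits into 22 arguments.
  run bcftools concat -Oz -o all.sub.vcf.gz $subset_files

  # 3. Convert to PLINK binary format; unnamed variants get a chr:pos ID.
  run plink --vcf all.sub.vcf.gz --double-id --set-missing-var-ids '@:#' \
    --make-bed --out all

  # 4. LD-prune common variants (thresholds are illustrative, not recommendations).
  run plink --bfile all --maf 0.05 --indep-pairwise 50 5 0.2 --out all

  # 5. PCA on the pruned variants; writes all_pca.eigenvec and all_pca.eigenval.
  run plink --bfile all --extract all.prune.in --pca 20 --out all_pca
}

# Share of each eigenvalue relative to the sum of the eigenvalues in the file.
# Note: this is relative to the PCs that were computed, not to total variance.
eigenval_share() {
  awk '{ v[NR] = $1; total += $1 }
       END { for (i = 1; i <= NR; i++) printf "PC%d\t%.3f\n", i, v[i] / total }' "$1"
}
```

Calling `DRY_RUN=1 pca_pipeline keep_ids.txt` prints the planned commands without running them. After a real run, `eigenval_share all_pca.eigenval` gives a quick view of how fast the eigenvalues drop off, which is one input for choosing n.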
Hope you can help me out, Thanks in advance, Vincent