PCA on VCF
6.8 years ago
Picasa ▴ 630

Is it possible to produce this kind of PCA:

https://rstudio-pubs-static.s3.amazonaws.com/89838_c06c544a19f94599aa856576e7c08e2b.html

without EIGENSOFT? (For some reason I can't install it on my computer.)

How about using PLINK to generate a genotype matrix from the VCF files and then running PCA on it?
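To show what "generating the matrix" means here, this is a minimal Python sketch and not what PLINK does internally. It assumes an uncompressed, tab-delimited VCF with a GT entry in the FORMAT column. The names `gt_to_dosage` and `vcf_to_matrix` and the toy records are made up for illustration, and any non-reference allele is counted as "alt", which is a simplification for multi-allelic sites.

```python
def gt_to_dosage(gt):
    """Turn a GT string such as '0/1' or '1|1' into an alt-allele count.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_matrix(lines):
    """Return (sample_names, rows), where rows is a SNPs x samples list."""
    samples, rows = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        gt_idx = fields[8].split(":").index("GT")  # position of GT in FORMAT
        rows.append([gt_to_dosage(f.split(":")[gt_idx]) for f in fields[9:]])
    return samples, rows


# Toy records, invented for this example
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0|1:12\t./.:0\t1|1:9",
]
samples, rows = vcf_to_matrix(toy_vcf)
print(samples)  # ['S1', 'S2', 'S3']
print(rows)     # [[0, 1, 2], [1, None, 2]]
```

Missing genotypes come back as `None`, so they would still need to be imputed or filtered before any PCA.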

PLINK has a lot of tools. Which one are you referring to? Is it pseq proj v-matrix ...?

I am not sure, actually. But I once saw PLINK used to generate a numerical SNP matrix. With that, a PCA would be easy.
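For reference, once a numerical matrix exists the PCA itself is short. Below is a minimal sketch with NumPy, assuming a samples x SNPs matrix of alt-allele counts (0, 1, 2) with no missing values. The function name `genotype_pca` and the toy data are hypothetical, and this is a plain standardize-then-SVD approach, not necessarily what PLINK or EIGENSOFT do internally.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """PCA of a samples x SNPs matrix of alt-allele counts (0, 1, 2)."""
    g = np.asarray(genotypes, dtype=float)
    # Drop monomorphic SNPs: they have zero variance and cannot be standardized
    g = g[:, g.std(axis=0) > 0]
    # Standardize each SNP column to mean 0 and variance 1
    z = (g - g.mean(axis=0)) / g.std(axis=0)
    # SVD of the standardized matrix; u * s gives the sample coordinates
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    pcs = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return pcs, explained[:n_components]


# Toy data, invented for this example: two groups of three samples, four SNPs
toy = [
    [0, 0, 2, 1],
    [0, 1, 2, 1],
    [0, 0, 2, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 1],
    [2, 2, 0, 2],
]
pcs, explained = genotype_pca(toy)
print(pcs[:, 0])   # the two groups fall on opposite sides of PC1
print(explained)   # fraction of variance explained by PC1 and PC2
```

The columns of `pcs` can then be plotted against each other, coloured by population, to get the kind of figure in the linked page.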

Does it perform LD pruning?

Not sure. You might need to check that yourself because I haven't tried it. But I would recommend going with @Philipp's and @Michs' answers.

6.8 years ago

GAPIT can do this for you, too, but it needs a different input format: http://www.maizegenetics.net/#!gapit/cmkv

For the conversion of VCF to HapMap format, have a look at the thread "Convert Plink Ped Format Into Hapmap Format?"

You can also use FlashPCA, especially because that one shows how to do LD pruning of SNPs. You can then use the output pcs.txt in the R script from your link.

Thanks for your link. Just one thing: why do we have to perform LD pruning?

SNPs in LD are not independent observations and result in spurious inflation of the distance in PCA.

Oh, I didn't realize that. I thought the whole point of PCA was to transform correlated, non-independent variables into a finite number of dimensions using a covariance matrix. I didn't realize it mattered whether two SNPs were correlated because they were close to each other on the chromosome or because they both conferred some advantage in a certain environment.
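To make the pruning idea concrete, here is a simplified pure-Python sketch of greedy r²-based pruning. It is not the exact procedure used by PLINK or FlashPCA. The names `r_squared` and `ld_prune`, the window counted in SNP indices, and the 0.2 threshold are illustrative assumptions, and the toy genotypes are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two SNPs' genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, window=50, r2_max=0.2):
    """Greedy pruning over SNPs listed in chromosome order.

    A SNP is kept only if its r^2 with every already-kept SNP within
    the previous `window` positions (by index) is at most `r2_max`.
    Returns the indices of the kept SNPs.
    """
    kept = []
    for i, geno in enumerate(snps):
        nearby = [j for j in kept if i - j <= window]
        if all(r_squared(geno, snps[j]) <= r2_max for j in nearby):
            kept.append(i)
    return kept


# Toy data, invented: each row is one SNP's alt-allele counts in six samples
snps = [
    [0, 0, 1, 2, 2, 1],  # SNP 0
    [0, 0, 1, 2, 2, 1],  # SNP 1: identical to SNP 0 (perfect LD)
    [0, 1, 1, 2, 2, 1],  # SNP 2: strongly correlated with SNP 0
    [1, 0, 2, 0, 1, 2],  # SNP 3: uncorrelated with SNP 0
]
print(ld_prune(snps))  # [0, 3]
```

Without pruning, SNPs 0 to 2 would contribute essentially the same signal three times, so that one genomic region would weigh three times as much in the PCA as the independent SNP 3.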

6.8 years ago
Mitch Bekritsky ★ 1.3k

Illumina has a C++ package that does partial PCA on a population VCF directly: https://github.com/Illumina/akt

(In the interest of full disclosure, I work at Illumina, but do not work on this tool)

6 months ago
hewm2008 ▴ 40

I recently developed a new PCA analysis tool, MingPCACluster, that goes directly from VCF to PCA and a plot (VCF2PCA and figure). It is very fast, uses little memory, and is accurate and precise.

https://github.com/hewm2008/MingPCACluster

### run without pop.info
# ./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT
### run with pop.info
./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info