Question: PCA from VCF files
6.1 years ago by
Rubal7690 wrote:

Can anyone recommend good software for doing Principal Component Analysis (PCA) on data in VCF format, or the most straightforward format to convert the VCF into for doing PCA? I hear that PLINK is quite suitable for this. I also have some experience using EIGENSTRAT for SNP data, but no experience using it with whole-genome VCF data. Any tips or experience appreciated.

Many thanks,


vcf genome pca • 15k views
modified 2.0 years ago by mkulecka300 • written 6.1 years ago by Rubal7690

I agree, it would be nice to have a tool that does this. Meanwhile, you could create your own file with values of 0, 1, or 2 for homozygous ref, het, and homozygous alt. Then it'd be simple to use a standard PCA library to do the reduction.
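A rough Python sketch of the 0/1/2 encoding described above, assuming diploid calls, biallelic sites, and GT as the first FORMAT subfield. The function names and the demo records are made up for illustration. Missing calls come back as None, and the resulting matrix would still need to be handed to a PCA routine of your choice.

```python
def gt_to_dosage(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or "" in alleles:
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_to_dosages(lines):
    """Turn VCF text lines into (sample names, one dosage row per variant)."""
    samples, rows = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        if "," in fields[4]:
            continue  # skip multiallelic sites in this simple sketch
        rows.append([gt_to_dosage(call.split(":")[0]) for call in fields[9:]])
    return samples, rows


# Tiny made-up example: two variants, three samples
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0|1:12\t./.:0\t1|1:9",
]
samples, dosages = vcf_to_dosages(demo)
print(samples)
print(dosages)
```

Each row is one variant and each column one sample, so the matrix usually needs transposing (samples as rows) and some handling of the None entries before PCA.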

written 6.1 years ago by brentp22k

Sure! - take a look here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

written 4 weeks ago by Kevin Blighe21k
6.1 years ago by
United States
Zev.Kronenberg11k wrote:

You can use VCFtools to make a PED and MAP file from VCF. This is PLINK format. Many PCA programs take PLINK input or offer conversion scripts.

I ended up using SNPRelate. After some silly errors here is how I got it to work:

library(SNPRelate)
# vcf.fn is the path to your VCF file
snpgdsVCF2GDS(vcf.fn, "ccm.gds",  method="biallelic.only")
genofile <- openfn.gds("ccm.gds")
ccm_pca <- snpgdsPCA(genofile)
plot(ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] ,col=as.numeric(substr(ccm_pca$sample, 1,3) == 'CCM')+3, pch=2)
modified 4.9 years ago • written 6.1 years ago by Zev.Kronenberg11k

1000 Genomes also has a tool for producing PLINK files.

written 6.1 years ago by Laura1.7k

tried it, and after ccm_pca<-snpgdsPCA(genofile) got: Removing 3604 non-autosomal SNPs. Error in snpgdsPCA(genofile) : There is no SNP!. Any idea?

written 4.4 years ago by Leszek3.9k

Experienced the same issue. It seems the problem is that by default the expected chromosome names are not of the form "chr1" etc., but just "1" etc. The solution is to use the function snpgdsOption() to redefine the chromosome names to whatever form they take in your VCF file:
snpgdsVCF2GDS(vcf, "ccm.gds", method="copy.num.of.ref", option=snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, chr8=8, chr9=9, chr10=10, chr11=11, chr12=12, chr13=13, chr14=14, chr15=15, chr16=16, chr17=17, chr18=18, chr19=19, chr20=20, chr21=21, chr22=22, chrX=23, chrY=24, chrM=25))

Another solution is to add autosome.only=FALSE in snpgdsPCA() - it then takes all your chromosomes whatever their names are.
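To see which naming scheme a VCF actually uses before picking one of these fixes, something like the following Python sketch could help. It is not part of SNPRelate; the function names and demo records are made up, and it only looks at the first column of each data line.

```python
from collections import Counter


def chromosome_counts(lines):
    """Count variant records per chromosome name (first VCF column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def has_chr_prefix(counts):
    """True if any chromosome name starts with 'chr' (e.g. 'chr1')."""
    return any(name.lower().startswith("chr") for name in counts)


# Made-up records for illustration
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1",
    "chrX\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/0",
]
counts = chromosome_counts(demo)
print(dict(counts), has_chr_prefix(counts))
```

If the names carry a "chr" prefix or are scaffold IDs, that would explain why every SNP gets dropped as non-autosomal.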

modified 4.2 years ago by Leszek3.9k • written 4.3 years ago by jockbanan370

You meant autosome.only=FALSE? TRUE returns the same error ("Removing 362090 non-autosomal SNPs. - there is no SNP")

written 4.2 years ago by User 1933320

that works. thx!

written 4.2 years ago by Leszek3.9k

Hi All,


Can you please explain this line:

plot(ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] ,col=as.numeric(substr(ccm_pca$sample, 1,3) == 'CCM')+3, pch=2)
written 4.0 years ago by always_learning890
ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] means that you are plotting eigenvector 1 against eigenvector 2, i.e. PC1 vs. PC2.
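The col= part picks the colour: substr(..., 1, 3) == 'CCM' is TRUE for samples whose name starts with "CCM", as.numeric() turns that into 1 or 0, and adding 3 gives colour 4 (blue in R's default palette) for CCM samples and colour 3 (green) for the rest. pch=2 draws open triangles. The same arithmetic spelled out in Python, purely as an illustration (colour_index and the sample names are made up):

```python
def colour_index(sample_name, prefix="CCM"):
    """Mimic as.numeric(substr(name, 1, 3) == 'CCM') + 3 from the R call."""
    return int(sample_name[:3] == prefix) + 3


# Hypothetical sample names
for name in ["CCM001", "CTL001"]:
    print(name, colour_index(name))
```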
written 3.9 years ago by geek_y8.6k

Thank you - this was very helpful. After struggling with Eigenstrat, I managed to produce a nice graph with this R package. I am relatively new to R and this is my question: Is there a way of labelling the different individuals in order to be able to distinguish the outliers in the graph?

written 3.7 years ago by zinzin.steenkamp0

Have you found out how to manage it?

written 2.2 years ago by Picasa350

I was able to run the PCA without errors and I got a nice plot. But I want to specify my two populations. How can I create the population file? Or do I need to add this info in the GDS file? Thanks!

modified 24 months ago • written 24 months ago by CB10

Hi, I found this thread really helpful. My data is grouped into 4 different populations and I managed to get them labelled as such by doing something similar to the following, following on from Zev's code above:

>genofile <- snpgdsOpen("ccm.gds")
>sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
>sample.id
[1] "Sample1"
[2] "Sample2"
[3] "Sample3"
[4] "Sample4"
[5] "Sample5"

In a text file, list the group name of each sample in sample.id, one group per line, with each line giving the group of the corresponding sample. For example, if Sample1 and Sample2 in sample.id were in 'Group1' and Sample3 to Sample5 were in 'Group2', you'd have:

Group1
Group1
Group2
Group2
Group2

Save the file as 'pops.txt' and then:

>pop_code <- scan("pops.txt", what=character())
>cbind(sample.id, pop_code)

>ccm_pca<-snpgdsPCA(genofile, autosome.only=FALSE, num.thread=4)

>tab <- data.frame(sample.id = ccm_pca$sample.id,
              pop = factor(pop_code)[match(ccm_pca$sample.id, sample.id)],
              EV1 = ccm_pca$eigenvect[,1],    # the first eigenvector
              EV2 = ccm_pca$eigenvect[,2],    # the second eigenvector
              stringsAsFactors = FALSE)

>plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))

And that should do it!

modified 22 months ago • written 22 months ago by graham.etherington0

Just tried it but got some error...

tab <- = ccm_pca$ Error in (tab <- = ccm_pca$ : object 'tab' not found

What is exactly?

written 15 months ago by nwhza12_psk0150
5.8 years ago by
USA, Cambridge
sa9800 wrote:

SNPRelate is an R package that can read VCF files directly and perform PCA and IBD/IBS. According to the documentation, its PCA and IBD implementations run 10-45x faster than those of EIGENSTRAT (v3.0) and PLINK (v1.07), respectively.

Update (Oct 2014): The package seems to have moved to GitHub (link)

modified 3.7 years ago • written 5.8 years ago by sa9800

thanks, that's a good find!

written 5.8 years ago by brentp22k

Just tried it. It hates my Unified genotyper VCF??

written 5.0 years ago by Zev.Kronenberg11k

What is the error message? Can you post part of the VCF?

written 5.0 years ago by sa9800

There was no problem with the function, it was user error :-).

modified 5.0 years ago • written 5.0 years ago by Zev.Kronenberg11k

Can you elaborate? I get "file has different number of columns" with a UnifiedGenotyper VCF. Is there a fix, or is this type of VCF not suitable?

written 4.9 years ago by Neilfws48k

Neilfws I used a multi-individual Unified Genotyper VCF as an input. I can't quite remember what the error was, but I can remember how I got it to work. See revised post above.

written 4.9 years ago by Zev.Kronenberg11k

Thanks for that. I had no joy with the VCF; I converted to PLINK bed format using vcftools then used snpgdsBED2GDS() in SNPRelate. That converted to GDS no problem.

written 4.9 years ago by Neilfws48k

This sounds great

written 4.2 years ago by Rubal7690

Which version of R can it be run in? I got this error:

library("SNPRelate")
Error in library("SNPRelate") : there is no package called ‘SNPRelate’
install.packages("SNPRelate")
Installing package into ‘/Users/ib7/Library/R/3.4/library’ (as ‘lib’ is unspecified)
Warning in install.packages : package ‘SNPRelate’ is not available (for R version 3.4.2)

any advice? thanks


written 7 months ago by ibseq120

SNPRelate has been removed from CRAN. You need to install it from Bioconductor now.

written 7 months ago by Neilfws48k
2.0 years ago by
European Union
mkulecka300 wrote:

While this thread is very old, I think it is useful to add that PLINK now supports VCF files directly: link to the new (1.9) version of PLINK.

written 2.0 years ago by mkulecka300

It makes things easier than ever:
plink --pca --allow-extra-chr --vcf samples.vcf
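As far as I know, PLINK 1.9 writes the coordinates to plink.eigenvec as whitespace-separated rows of FID, IID and then one column per component, with no header line by default. Assuming that layout, a small Python sketch for reading the file could look like this (parse_eigenvec and the demo rows are made up for illustration):

```python
def parse_eigenvec(text):
    """Return {(FID, IID): [PC1, PC2, ...]} from plink.eigenvec-style text."""
    coords = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] in ("FID", "#FID"):
            continue  # skip blank lines and an optional header row
        coords[(parts[0], parts[1])] = [float(x) for x in parts[2:]]
    return coords


# Made-up rows standing in for the contents of plink.eigenvec
demo_text = "fam1 s1 0.12 -0.30\nfam1 s2 -0.05 0.22\n"
coords = parse_eigenvec(demo_text)
print(coords[("fam1", "s1")])
```

For a real run you would pass open("plink.eigenvec").read() instead of the demo string.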

written 18 months ago by Xiaowei 0

How would you plot the resulting output?

written 3 months ago by BeetleCheers0

With R it's pretty easy:


library(ggplot2)
library(ggrepel) #provides geom_text_repel()

df<-read.delim("plink.eigenvec") #read in eigenvectors
eigens<-read.delim("plink.eigenval",header=F) #read in eigenvalues

sum_eig<-sum(eigens$V1) #sum of the eigenvalues in the file
sum_eigs<-lapply(eigens$V1,function(x){ round((x/sum_eig)*100) }) #percentage per component

ggplot(df, aes(PC1, PC2, color=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",sum_eigs[[1]],"% variance")) +
  ylab(paste0("PC2: ",sum_eigs[[2]],"% variance")) +
  geom_text_repel(aes(label=SampleID), size=3) #adds easy-to-read labels

I slightly modified my eigenvectors file: I added SampleID and condition columns.
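The percentages on the axis labels are just each eigenvalue divided by the sum of the eigenvalues in plink.eigenval. A small Python sketch of that arithmetic, with made-up eigenvalues and hypothetical function names. One caveat, as far as I understand it: PLINK only writes the requested number of components (20 by default), so these are shares of the retained components rather than of the total variance.

```python
def read_eigenvalues(text):
    """Parse plink.eigenval-style text: one eigenvalue per line."""
    return [float(tok) for tok in text.split()]


def percent_variance(eigenvalues):
    """Share of the summed eigenvalues carried by each component, in percent."""
    total = sum(eigenvalues)
    return [round(100 * v / total, 1) for v in eigenvalues]


# Made-up eigenvalues standing in for the contents of plink.eigenval
demo_eigenval = "5.0\n3.0\n1.5\n0.5\n"
shares = percent_variance(read_eigenvalues(demo_eigenval))
labels = [f"PC{i}: {s}% variance" for i, s in enumerate(shares, start=1)]
print(labels[:2])
```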

modified 3 months ago • written 3 months ago by mkulecka300