Question: Multi-Sample Vcf To Phylogenetic Tree.
5.9 years ago by
William4.4k
Europe
William4.4k wrote:

How can I construct a phylogenetic tree based on the SNPs shared between strains? I have whole-genome SNP calls for 10 different strains in a multi-sample VCF.

Are there any tools that can take the VCF as input for creating phylogenetic trees? Or do I need to convert the multi-sample VCF to another matrix? Which kind of matrix would that be, and how can I create it from the VCF?

Is there a list somewhere of popular packages that can be used for creating phylogenetic trees? Or are there guides on how to go from a multi-sample VCF to a phylogenetic tree?

ADD COMMENTlink modified 3 months ago by rm.umayal240 • written 5.9 years ago by William4.4k

You can use SNPhylo --> http://chibba.pgml.uga.edu/snphylo/

The one thing you need to do to get it to work is make sure the chromosome IDs in your VCF file are plain numbers (1, 2, 3, 4, ...) and not names like Chr1 or Gm01.
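If the chromosome names in the VCF only differ from plain numbers by a prefix, the renaming can be done in a few lines of base R. This is a minimal sketch, not part of SNPhylo: `numeric_chrom` is a hypothetical helper name, and it assumes each chromosome name is a non-numeric prefix followed by the chromosome number (scaffold names that do not fit that pattern are left unchanged). It only rewrites the CHROM column of the data lines, so any `##contig` header lines would still carry the old names.

```r
# Hypothetical helper: rewrite the CHROM column so that names such as
# "Chr1" or "Chr02" become "1" and "2". Header lines (starting with "#")
# and names that are not "prefix + number" are returned unchanged.
numeric_chrom <- function(vcf_lines) {
  is_data <- !startsWith(vcf_lines, "#")
  vcf_lines[is_data] <- sub("^[^\t0-9]*0*([0-9]+)\t", "\\1\t", vcf_lines[is_data])
  vcf_lines
}

# Small in-memory example instead of a real file.
example_vcf <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
  "Chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
  "Chr02\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1",
  "scaffold_X\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/0"
)
converted <- numeric_chrom(example_vcf)
```

For a real file, something like `writeLines(numeric_chrom(readLines("my.vcf")), "my.numeric.vcf")` should work for VCFs that fit in memory (both file names are placeholders).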

 

ADD REPLYlink written 4.2 years ago by mcff2360
5.8 years ago by
William4.4k
Europe
William4.4k wrote:

Here is what I did with the SNPRelate package to get a dendrogram and a PCA from my multi-sample VCF file:

library(gdsfmt)
library(SNPRelate)

# VCF to GDS
snpgdsVCF2GDS("my.vcf", "my.gds")
snpgdsSummary("my.gds")
genofile <- snpgdsOpen("my.gds")

# dendrogram
dissMatrix <- snpgdsDiss(genofile, sample.id=NULL, snp.id=NULL, autosome.only=TRUE,
    remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=10, verbose=TRUE)

# cluster on the dissimilarity object returned by snpgdsDiss (not base R's dist function)
snpHCluster <- snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.25)

cutTree <- snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm=5000, samp.group=NULL,
    col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL, label.H=FALSE, label.Z=TRUE,
    verbose=TRUE)

# PCA
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

# no population annotation here, so the sample IDs double as population codes
pop_code <- read.gdsn(index.gdsn(genofile, "sample.id"))

pca <- snpgdsPCA(genofile)

tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, sample.id)],
    EV1 = pca$eigenvect[,1], EV2 = pca$eigenvect[,2], stringsAsFactors = FALSE)

plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topleft", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
ADD COMMENTlink written 5.8 years ago by William4.4k

Hi, it seems like a good tool, but how should I proceed if I want to construct a phylogenetic tree from .vcf files from different samples? Do I have to concatenate them to create a multi-sample VCF, or can I handle them independently to create the tree?

Any advice is helpful.

Best, Celso

ADD REPLYlink written 5.8 years ago by Cortes0

You could use either approach.

1) Use a tool to create a multi-sample VCF (e.g. VCFtools)

or

2) Use snpgdsVCF2GDS() to read in each VCF, then merge in R using snpgdsCombineGeno().

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Neilfws48k

Thanks Neil, I'll try your recommendations.

ADD REPLYlink written 5.8 years ago by Cortes0

When I execute the R script I get the following error: "Removing 181 non-autosomal SNPs Error in snpgdsDiss(genofile) : There is no SNP!" Obviously, the error occurs because no SNPs are left. Which parameter do I need to change to avoid this problem? Thanks!

ADD REPLYlink written 5.7 years ago by Diego D.50

As I understand it, this creates a PCA from all of the SNPs in the VCF. How can one filter a VCF so as to keep only unlinked variants?

ADD REPLYlink written 16 months ago by wrab42510

Where are you defining the dist object in the code above? It's throwing an error. Am I supposed to replace dist with dissMatrix?

Also, this line needs another closing parenthesis:

pop_code <- read.gdsn(index.gdsn(genofile, "sample.id")

Oh, and this line is messed up too; it needs an opening parenthesis after data.frame:

tab <- data.framesample.id = pca$sample.id,pop = factor(pop_code)[match(pca$sample.id, sample.id)],EV1 = pca$eigenvect[,1],EV2 = pca$eigenvect[,2],stringsAsFactors = FALSE)

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by greymonroe0
5.9 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I'd look at the R package SNPRelate. It will read VCF files, create various matrices and plot dendrograms using e.g. an identity-by-state matrix. See examples in the vignette PDF.

ADD COMMENTlink written 5.9 years ago by Neilfws48k

Thanks, the program worked really nicely for creating both a phylogenetic tree and a principal component analysis plot almost directly from the VCF file.

ADD REPLYlink written 5.9 years ago by William4.4k
2.9 years ago by
Sergey Naumenko350 wrote:

Hi!

I had SNPs for 39 WES samples, some of them from related individuals. I wanted to check kinship to see whether there had been any mislabelling during sample processing.

In the end I built a nice tree with the code below.

It was validated against independent observations (family diagrams, ancestry). All the unrelated individuals were connected above the FC (first cousins) line; all sibs, half-sibs, and other relatives were where they should be.

The crucial steps were using the IBS function to calculate distances and taking LD into account.
Without these two I got only misleading trees. The default LD threshold (0.2) removed too many SNPs, so I increased it to 0.5 to achieve higher sensitivity. LD filtering reduced 500K SNPs to 16K.

#install SNPRelate as described here:
#http://www.bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelateTutorial.html#installation-of-the-package-snprelate
#prepare multisample vcf with bcftools merge 

library(gdsfmt)
library(SNPRelate)
setwd("[your dir here]")

#biallelic by default
snpgdsVCF2GDS("dataset1.vcf", "dataset1.gds")
snpgdsSummary("dataset1.gds")
genofile = snpgdsOpen("dataset1.gds")

#LD based SNP pruning
set.seed(1000)
snpset = snpgdsLDpruning(genofile,ld.threshold = 0.5)
snp.id=unlist(snpset)

# distance matrix - use IBS
dissMatrix  =  snpgdsIBS(genofile , sample.id=NULL, snp.id=snp.id, autosome.only=TRUE, 
    remove.monosnp=TRUE,  maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)
snpgdsClose(genofile)

snpHCluster =  snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)

cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL, 
    col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE, 
    verbose=TRUE)

snpgdsDrawTree(cutTree, main = "Dataset 1", edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"),
    y.label.kinship=TRUE, leaflab="perpendicular")
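
To make the IBS step less of a black box, here is a toy base R sketch of the same idea: compute the mean identity-by-state between each pair of samples, turn it into a distance (1 - IBS), and cluster with average linkage. The genotype matrix and the helper name `ibs_pair` are made up for illustration, and as far as I understand this mirrors what snpgdsIBS followed by snpgdsHCluster does, but it is not the SNPRelate implementation.

```r
# Toy genotype matrix: rows are samples, columns are SNPs,
# entries are alternate-allele counts (0, 1, 2); NA marks a missing call.
geno <- rbind(
  s1 = c(0, 1, 2, 0, 1, 2),
  s2 = c(0, 1, 2, 0, 1, 1),
  s3 = c(2, 1, 0, 2, 1, 0),
  s4 = c(2, 2, 0, 2, NA, 0)
)

# Mean identity-by-state between two samples: each SNP contributes
# 1 - |g1 - g2| / 2 (that is 1, 0.5 or 0), averaged over SNPs called in both.
ibs_pair <- function(g1, g2) {
  ok <- !is.na(g1) & !is.na(g2)
  mean(1 - abs(g1[ok] - g2[ok]) / 2)
}

# Fill the full sample-by-sample IBS matrix.
n <- nrow(geno)
ibs <- matrix(1, n, n, dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    ibs[i, j] <- ibs_pair(geno[i, ], geno[j, ])
  }
}

# Turn similarity into distance and cluster with average linkage (UPGMA).
tree <- hclust(as.dist(1 - ibs), method = "average")
groups <- cutree(tree, k = 2)
```

Calling `plot(as.dendrogram(tree))` draws the toy tree; the height at which two samples join is their average 1 - IBS distance.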

I hope this will be helpful for somebody.

Sergey

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Sergey Naumenko350
5.1 years ago by
Norway
Louis Boumans20 wrote:

Thanks for the tip! I want to draw a dendrogram based on RAD-tag SNPs from a non-model organism, using the VCF file output by Stacks. I managed to do this (only) with IBS, using these commands:

 

snpgdsVCF2GDS("batch_511.vcf", "batch_511.gds")
snpgdsSummary("batch_511.gds")
genofile <- openfn.gds("batch_511.gds")
set.seed(100) 
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
rv <- snpgdsCutTree(ibs.hc)
plot(rv$dendrogram, leaflab="perpendicular", main="Batch 511")

 

I suppose the dendrogram is based on distance clustering, but that's not clear from the documentation. Does anyone know? And what are the units of the scalebar in the resulting graph?

Finally, I haven't yet succeeded in getting any results with the ML alternative in SNPRelate. Should it be possible at all without phased data?

 

Louis Boumans

 

 

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Louis Boumans20
4.4 years ago by
danrdanny60
danrdanny60 wrote:

I just finished installing SNPRelate on R 3.1.1, OS X Yosemite. To save yourself some time, do the following.

First, make sure you have gfortran installed. Follow the instructions to install gfortran from here: http://www.thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error

Second, download and install both gdsfmt and SNPRelate from source: https://github.com/zhengxwen/SNPRelate

ADD COMMENTlink written 4.4 years ago by danrdanny60
3.7 years ago by
user230613280
Europe
user230613280 wrote:

Hi all,
My question is related to this one, which is the reason I'm writing here and not in a new post. Is there a way to make a phylogenetic tree, just as in William's answer, but using a maximum likelihood method? There is a function in the SNPRelate package called snpgdsIBDMLE which performs this task, but I'm not able to get a phylogenetic tree image out of it.

Any suggestion?

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by user230613280
3 months ago by
rm.umayal240 wrote:

Hi all,

To create a phylogenetic tree from a whole-genome SNP file, including human data, the following software may be helpful.

It is called VCF2PopTree and is available at http://sankarsubramanian.net/dat/index.html. It has no dependencies: a single HTML file is enough to get the tree.

It is simple and straightforward, and I highly recommend it to evolutionary biologists and population geneticists.

ADD COMMENTlink written 3 months ago by rm.umayal240