Question: Recommendations about phylogenetic analysis tools for RFLP/AFLP/RAPD data
0
5 weeks ago by
JL0
CIC bioGUNE
JL0 wrote:

Hello Biostars!

Can anybody recommend phylogenetic analysis tools for building trees from RFLP/AFLP/RAPD-type data, apart from Treecon? (My datasets seem to be too big for that software.)

Thanks in advance

JL

phylogenetic tree tools • 137 views
ADD COMMENTlink modified 5 weeks ago by Burnedthumb90 • written 5 weeks ago by JL0

In case it helps readers understand the kind of data I need to analyze/cluster, here is a sample:

##SNP   St1 St2 St3 St4 St5
1284995 0   0   0   1   0
1285001 1   1   1   0   1
1285017 0   0   0   0   0
1285034 0   0   1   0   0
1285040 0   1   0   0   0
1285070 0   0   0   0   1

Thanks once more

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by JL0
3
5 weeks ago by
Burnedthumb90
Netherlands
Burnedthumb90 wrote:

You could use R in combination with the proxy package:

Assuming your data is saved as a tab-delimited file "dataset.txt":

SNP St1 St2 St3 St4 St5 St6
1284995 0   0   0   1   0   0
1285001 1   1   1   0   1   1
1285017 0   0   0   0   0   0
1285034 0   0   1   0   0   0
1285040 0   1   0   0   0   0
1285070 0   0   0   0   1   1

Then do this in R (you may want to read up on the Jaccard similarity; I am not entirely sure it is the best measure to use here).

install.packages("proxy")  # only needed once
library(proxy)

## Load the dataset (markers in rows, samples in columns):
dataset <- read.table(file="dataset.txt", sep="\t", header=TRUE, row.names=1)

## Calculate the distance between samples using the Jaccard method.
## Note that I transpose the dataset, otherwise I would cluster the markers:
d <- dist(t(dataset), method="Jaccard")

## Hierarchically cluster the samples:
hc <- hclust(d)

## Plot the dendrogram:
plot(hc)

The result is a dendrogram of the samples.
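To make the Jaccard distance more concrete, here is a minimal sketch that computes it by hand for two samples and compares it with base R. It assumes the six-sample example table above, typed in as a matrix; `jaccard_dist` is a hypothetical helper written for illustration, not part of any package. For 0/1 data, base R's `dist(..., method="binary")` is documented as the proportion of positions where only one sample is "on" among those where at least one is "on", which is the Jaccard distance, so it can serve as a cross-check without the proxy package.

```r
## Example table from above: markers in rows, samples in columns
snp <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("1284995", "1285001", "1285017", "1285034", "1285040", "1285070"),
    paste0("St", 1:6)
  )
)

## Hypothetical helper: Jaccard distance between two 0/1 vectors,
## i.e. 1 - (markers present in both) / (markers present in at least one)
jaccard_dist <- function(a, b) {
  n_union <- sum(a | b)
  if (n_union == 0) return(0)  # neither sample has any marker
  1 - sum(a & b) / n_union
}

## By hand for St1 vs St2: they share one marker out of two present in either
d_manual <- jaccard_dist(snp[, "St1"], snp[, "St2"])

## Base R equivalent for all pairs of samples (note the transpose)
d_base <- dist(t(snp), method = "binary")
print(d_manual)
print(round(as.matrix(d_base), 3))
```

Samples with identical marker patterns (St5 and St6 here) end up at distance 0, and samples that share no markers end up at distance 1.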

ADD COMMENTlink written 5 weeks ago by Burnedthumb90

@ Burnedthumb,

Thank you very much for your insight. I will try this right away. I just have one extra question: do you think R's graphical devices will be able to handle a dataset with hundreds or thousands of rows and columns? I ask because, in my experience, plotting such big datasets is not an easy task for R...

Thanks in advance for your kind help!

ADD REPLYlink written 5 weeks ago by JL0
1

R itself will handle data up to a couple of gigabytes just fine. However, if you plot hundreds or thousands of samples, the image becomes unreadable. Instead of plotting to the screen, you could write the dendrogram to a file like this:

## Plot the data to image with 1000 pixels width and height:
png(file="dendrogram.png", width=1000, height=1000)
plot(hc)
dev.off()

What are the dimensions of your data?
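For larger trees, one possible variation is to scale the output size with the number of samples and draw the tree sideways so the labels do not overlap. The sketch below is only an illustration: the data is randomly generated, and the 0.15 inches per leaf, the margins, and the temporary output path are arbitrary assumptions to adjust for your own data.

```r
## Made-up binary data for illustration: 200 markers x 50 samples
set.seed(1)
m <- matrix(rbinom(200 * 50, 1, 0.3), nrow = 200)
colnames(m) <- paste0("St", seq_len(ncol(m)))

## Cluster the samples (base R "binary" distance on 0/1 data)
hc <- hclust(dist(t(m), method = "binary"))

## Scale the page height with the number of leaves (assumed 0.15 inch each)
n_leaves <- length(hc$labels)
out_file <- file.path(tempdir(), "dendrogram.pdf")

pdf(out_file, width = 8, height = max(6, n_leaves * 0.15))
par(mar = c(4, 1, 1, 6))               # extra right margin for the labels
plot(as.dendrogram(hc), horiz = TRUE)  # horizontal tree, one leaf per row
dev.off()
```

A vector format such as PDF lets you zoom in on the labels, which a fixed-size PNG does not.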

ADD REPLYlink written 5 weeks ago by Burnedthumb90

Depending on whether I transpose the table (that is, whether I want to inspect the clustering of strains or of SNPs), I will have about 5000 strains and up to 3000 SNPs in some cases. So let's say 3000 x 5000 (rows x columns). Do you think that is a viable dataset for this task?

Thank you very much again!

ADD REPLYlink written 5 weeks ago by JL0
1

You can make the plot as big as you want; however, at that size it will be unreadable and you won't get any information out of it. You may want to filter the data first to keep the most interesting samples. Alternatively, you can cut [1] the tree at a specific height and plot the resulting subtrees separately.

[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cutree.html
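As a rough sketch of what cutting the tree could look like, the example below applies `cutree` to the small six-sample table from my answer, typed in as a matrix. The cut height of 0.9 is an arbitrary choice for this toy data, and base R's "binary" distance is used as a stand-in for the Jaccard distance on 0/1 data.

```r
## Example table: markers in rows, samples in columns
snp <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, paste0("St", 1:6))
)

## Cluster the samples
hc <- hclust(dist(t(snp), method = "binary"))

## Cut the tree at height 0.9 (arbitrary); returns one group id per sample
groups <- cutree(hc, h = 0.9)

## List the samples that fall into each group
members <- split(names(groups), groups)
print(members)

## Group sizes, useful for deciding which groups to plot on their own
print(table(groups))
```

Each group can then be re-clustered and plotted on its own by subsetting the columns, for example `snp[, members[[1]]]`. `cutree` also accepts `k` if you would rather ask for a fixed number of groups than choose a height.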

ADD REPLYlink written 5 weeks ago by Burnedthumb90
0
5 weeks ago by
genePod30
USA/Chicago
genePod30 wrote:

You can use MATLAB or Python to compute the dissimilarity matrix of the data first, and then draw the phylogenetic trees from it. I am not sure which method is best for computing the dissimilarity matrix of this data set.

ADD COMMENTlink written 5 weeks ago by genePod30

@ Channgchuan Yin, could you elaborate on your suggestion a little (do you use custom scripts, or is there a module/package/software you would recommend)?

ADD REPLYlink written 5 weeks ago by JL0
1

Sorry, I tried to post my answer to Biostar, but the message was not updated successfully.

You may need to define a distance between two SNPs, for example the Hamming distance. You may refer to the paper: Wang, C., Kao, W. H., & Hsiao, C. K. (2015). Using Hamming distance as information for SNP-sets clustering and testing in disease association studies. PLoS ONE, 10(8), e0135918 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135918). Some programming is needed.
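As a small illustration of the Hamming distance on 0/1 marker data, here is a sketch in R (to match the other answer) using the sample table from this thread. It is not the method from the paper above, only the basic distance; `hamming_dist` is a hypothetical helper name. For 0/1 data the Manhattan distance equals the Hamming distance, so base R's `dist` can compute all pairs at once.

```r
## Sample table from the thread: SNPs in rows, strains in columns
snp <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("1284995", "1285001", "1285017", "1285034", "1285040", "1285070"),
    paste0("St", 1:6)
  )
)

## Hypothetical helper: number of positions where two 0/1 vectors differ
hamming_dist <- function(a, b) sum(a != b)

## Between two strains (columns)
h_strains <- hamming_dist(snp[, "St1"], snp[, "St4"])

## All pairs at once: Manhattan distance on 0/1 data is the Hamming distance
d_strains <- dist(t(snp), method = "manhattan")  # between strains
d_snps    <- dist(snp, method = "manhattan")     # between SNPs

print(h_strains)
print(as.matrix(d_strains))
```

Either distance object can be passed to `hclust` to build a tree.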

ADD REPLYlink written 5 weeks ago by genePod30