Can anyone suggest how to cluster a set of bacterial genomes based on their Hamming or SNP distance?
More detail is really needed. What exactly is your problem? How to calculate the Hamming or SNP distance between two genomes? Which clustering algorithm to use once you've calculated the distances? Must it be Hamming or SNP distance, or are you in fact looking for a distance metric better suited to the problem you are trying to solve? How closely related are the genomes you want to cluster?
Just get a matrix of pairwise distances between the genomes and use simple Ward clustering, or you could even try MDS (multidimensional scaling). Both can be done in R.
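Ward clustering with Manhattan distance, for example:
library(pvclust)  # pvclust clusters the columns of its input, hence the t()
pvclust(data = t(mydata), method.hclust = "ward.D", method.dist = "manhattan", nboot = 10000)
In addition, pvclust reports bootstrap-based p-values (AU and BP) for each clade, reflecting how often that cluster is recovered across the bootstrap replicates.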
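For anyone who wants to see what the distance and clustering steps amount to outside R, here is a minimal Python sketch. It assumes the genomes have already been reduced to aligned SNP strings of equal length; the names `hamming`, `ward_clusters` and the toy `genomes` dictionary are made up for illustration. It applies the Lance-Williams update for Ward's method directly to the raw Hamming distances, so treat it as an illustration of the idea rather than a replacement for pvclust or hclust, and note that it does no bootstrapping.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def ward_clusters(genomes, k):
    """Merge genomes bottom-up until k clusters remain.

    genomes: dict mapping a name to its aligned SNP string (hypothetical input).
    Distances to a merged cluster follow the Lance-Williams update for
    Ward's method, applied to the raw Hamming distances.
    """
    names = list(genomes)
    if not 1 <= k <= len(names):
        raise ValueError("k must be between 1 and the number of genomes")
    # Every genome starts as its own cluster, keyed by an integer id.
    members = {i: [name] for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): float(hamming(genomes[names[i]], genomes[names[j]]))
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(members) > k:
        # Pick the closest pair of current clusters and merge them.
        pair = min(dist, key=dist.get)
        i, j = pair
        d_ij = dist.pop(pair)
        ni, nj = len(members[i]), len(members[j])
        merged = members.pop(i) + members.pop(j)
        # Update the distance from every remaining cluster to the merged one.
        for c, group in members.items():
            nc = len(group)
            d_ci = dist.pop(frozenset((c, i)))
            d_cj = dist.pop(frozenset((c, j)))
            dist[frozenset((c, next_id))] = (
                (ni + nc) * d_ci + (nj + nc) * d_cj - nc * d_ij
            ) / (ni + nj + nc)
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(group) for group in members.values())


# Toy data: two pairs of near-identical SNP profiles (invented for the example).
genomes = {
    "A1": "ACGTACGTAC",
    "A2": "ACGTACGTAT",
    "B1": "TTGTACCTGC",
    "B2": "TTGTACCTGA",
}
clusters = ward_clusters(genomes, 2)
print(clusters)
```

The number of clusters `k` is chosen by hand here; with real data you would more likely inspect the dendrogram or cut at a SNP threshold.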
cd-hit can be used for clustering.
It would be pointless to apply cd-hit to complete bacterial genome sequences (unless they were very similar, sharing the exact same gene order and so on). Perhaps a better strategy would be to build a distance matrix with, e.g., all-vs-all MUMmer alignments. Counting shared k-mers could also give a reasonably representative distance matrix.
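To make the shared k-mer idea concrete, here is a rough Python sketch that turns k-mer sets into a Jaccard distance matrix. The function names and the toy sequences are invented for the example, and the simplifications are deliberate: it ignores reverse complements, keeps every k-mer in memory, and uses a tiny k. For real bacterial genomes you would want a much larger k and a dedicated k-mer tool.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(set_a, set_b):
    """1 minus the fraction of k-mers shared between two k-mer sets."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def kmer_distance_matrix(genomes, k):
    """Return (names, matrix) with all-vs-all Jaccard distances.

    genomes: dict mapping a name to its sequence (hypothetical input).
    """
    names = sorted(genomes)
    profiles = [kmers(genomes[name], k) for name in names]
    matrix = [[jaccard_distance(p, q) for q in profiles] for p in profiles]
    return names, matrix


# Toy sequences, invented for the example.
names, matrix = kmer_distance_matrix(
    {"x": "ACGTACGT", "y": "ACGTACGT", "z": "TTTTTTTT"}, k=3
)
for name, row in zip(names, matrix):
    print(name, row)
```

The resulting square matrix can be passed to any hierarchical clustering or MDS routine that accepts precomputed distances.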