Question: Visualizing Pairwise Distance Between A Group Of Sequences
6.9 years ago by
David M550 wrote:

I have a library of ~2000 nucleotide sequences whose similarity I'd like to visualize. Because some of the sequences are very diverse, it's not feasible to perform a multiple sequence alignment and then build a phylogenetic tree, as (at least when I perform this process using MEGA) no common bases can be found.

I'm looking for a way to build a phylogenetic tree based on individual pairwise distances between each pair of sequences (aligned in isolation), and not on the pairwise distances of sequences taken from a multiple alignment. Building a tree this way should overcome the issue of some sequences not being aligned due to their extreme diversity.

Alternatively, I'd like to create a plot where distances between two points are proportional to the pairwise distances between the sequences they represent. In this way, I could start to visually identify clusters of sequences which might exist.


phylogenetics clustering • 5.3k views
modified 22 months ago by Biostar ♦♦ 20 • written 6.9 years ago by David M550

A question: how meaningful can a phylogenetic tree be when it is based on entities that have almost nothing in common? It is like making a phylogeny of a 'house', an 'airplane', a 'table', and a 'sunflower'. It's not that it is impossible to define a distance, but how meaningful could it be?

written 6.9 years ago by Michael Dondrup45k

I would echo Michael's comment above. While you can do something like UPGMA on e-values from an all-versus-all BLAST run, it sounds like you need to think carefully about the biological implications of your question. What are you trying to ask, and why do you need something like pairwise distances? Clustering will at least tell you which sequences should be grouped together and which should not. Keep in mind that multiple sequence alignment does contain a pairwise heuristic of some sort as the initial step.

written 6.9 years ago by Dan Gaston7.1k

In this case I'm trying to establish a rough classification of repeat elements. I want to see whether the sequences fall into relatively distinct groups, from which I can manually classify a few representatives to get an idea of the whole. I'm not actually going to infer a biological relationship from the distantly related sequences; it's more that I don't know which ones are distant and which ones are close.

written 6.9 years ago by David M550
6.9 years ago by
Cambridge, US
Christian2.7k wrote:

Sounds like a problem for hierarchical clustering. Pairwise bit-scores or e-values derived from an all-vs-all BLAST search could serve as the basis for a distance measure. I got good results with average-linkage clustering, for example using the UPGMA algorithm. Many tools can perform hierarchical clustering, for example MC-UPGMA or R ('hclust').
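
The idea above can be sketched in Python for small inputs: turn all-vs-all bit scores into distances, then repeatedly merge the two closest clusters by average linkage (UPGMA). The `hits` dictionary, the function names, and the normalisation by the smaller self-hit score are assumptions made for this illustration; they are not part of BLAST or MC-UPGMA.

```python
from itertools import combinations

def scores_to_distances(hits, names):
    """Convert bit scores to symmetric distances in [0, 1].

    `hits` maps (query, subject) -> bit score and must contain each self-hit.
    Pairs without a hit get score 0 (distance 1). Dividing by the smaller
    self-hit score is an assumption of this sketch.
    """
    dist = {}
    for a, b in combinations(names, 2):
        score = max(hits.get((a, b), 0.0), hits.get((b, a), 0.0))
        self_score = min(hits[(a, a)], hits[(b, b)])
        dist[frozenset((a, b))] = 1.0 - min(score / self_score, 1.0)
    return dist

def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Toy example with made-up scores: s1/s2 and s3/s4 are the similar pairs.
names = ["s1", "s2", "s3", "s4"]
hits = {(n, n): 100.0 for n in names}
hits.update({("s1", "s2"): 90.0, ("s3", "s4"): 80.0,
             ("s1", "s3"): 10.0, ("s2", "s4"): 10.0})
tree = upgma(names, scores_to_distances(hits, names))
print(tree)
```

This naive loop is only meant to show the mechanics. For a library of ~2000 sequences, an existing implementation such as R's `hclust(..., method = "average")` is likely the more practical choice.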

A plot can be created using multidimensional scaling, which can also be performed in R.
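
As a rough illustration of what multidimensional scaling does, here is a minimal classical (metric) MDS sketch in plain Python. It double-centres the squared distances and extracts the leading eigenvectors by power iteration. The function name and parameters are hypothetical, and the sketch assumes a symmetric distance matrix whose dominant eigenvalues are positive (roughly Euclidean distances), which BLAST-derived distances need not satisfy.

```python
import math
import random

def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Embed points so Euclidean distances approximate `dist` (a square list of lists)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (row means equal column means by symmetry).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        vec = [rng.random() - 0.5 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        # Power iteration towards the dominant eigenvector of B.
        for _ in range(iterations):
            nxt = [sum(bij * vj for bij, vj in zip(row, vec)) for row in b]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        bv = [sum(bij * vj for bij, vj in zip(row, vec)) for row in b]
        eigval = sum(v * x for v, x in zip(vec, bv))  # Rayleigh quotient
        if eigval <= 1e-9:
            continue  # no usable variance left; this axis stays at zero
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][k] = vec[i] * scale
        # Deflate so the next round finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords

# Toy check: distances between the corners of a 3 x 4 rectangle.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
```

The returned coordinates can be passed to any scatter-plot tool. The orientation and sign of the axes are arbitrary; only the distances between points carry meaning.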

written 6.9 years ago by Christian2.7k

Yes, it could work, but I suggest using tblastx.

written 6.9 years ago by Michael Dondrup45k

I'm not as familiar with R as I should be, so I'll be looking into CLANS first. Hierarchical clustering does sound exactly like what I'm trying to do, however. Thanks!

written 6.9 years ago by David M550
6.9 years ago by
Jan Kosinski1.6k wrote:


CLANS:

"The program takes unaligned fasta format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities in either 2D or 3D graphs. Contrary to phylogenetic inference methods this approach uses unaligned sequences"

In the output graph, similar sequences will group together spatially in clusters that are easy to analyze visually. You will also be able to select clusters and export the sequences that make up each cluster.

The prerequisite is that the sequences you want to consider similar are close enough for BLAST to align them and assign good P-values to the alignments.

Also see: What softwares can be used for clustering nucleic acid fragments??

written 6.9 years ago by Jan Kosinski1.6k

This is a very intuitive program that does almost everything I'm looking for. Thanks!

written 6.9 years ago by David M550
6.9 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

If the sequences are very distant, it might be better to work with translated amino acid sequences to pick up very weak similarity. Can you translate your sequences prior to analysis? Using a tblastx approach together with CLANS, as proposed by Jan, might be an option, though it inflates the search complexity.

Clustering and visualizing a distance matrix, e.g. in R, is easy; deriving the distance matrix is the real difficulty here. Other approaches for creating distance matrices include sequence composition, i.e. n-word counts such as di- or trinucleotide counts, or n-amino-acid counts.
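
A composition-based distance of the kind mentioned above might look like the following sketch, which compares trinucleotide frequency profiles. The function names and the choice of Euclidean distance between frequency vectors are assumptions for illustration; other word sizes and distance measures are equally plausible. Words containing characters other than A, C, G or T are ignored, and strand orientation (reverse complements) is not handled.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3):
    """Relative frequencies of all 4**k nucleotide words, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # only unambiguous words are counted
    return [counts[m] / total if total else 0.0 for m in kmers]

def profile_distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(seqs, k=3):
    """All-vs-all composition distances as a square list of lists."""
    profiles = [kmer_profile(s, k) for s in seqs]
    return [[profile_distance(p, q) for q in profiles] for p in profiles]

# Toy example: the first two sequences differ by one base, the third is unrelated.
seqs = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTTTTTTTTT"]
matrix = distance_matrix(seqs)
```

Unlike alignment-based scores, this yields a value for every pair of sequences, including pairs that BLAST would not align at all. Whether those values are biologically meaningful is a separate question.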

written 6.9 years ago by Michael Dondrup45k