Question: Visualizing Pairwise Distance Between A Group Of Sequences
6.9 years ago by
David M550 wrote:

I have a library of ~2000 nucleotide sequences for which I'd like to visualize the similarity. Because some of the sequences are very diverse, it's not feasible to perform a multiple sequence alignment and then build a phylogenetic tree, as (at least when I perform this process using MEGA) no common bases can be found.

I'm looking for a way to build a phylogenetic tree based on individual pairwise distances between each pair of sequences (aligned in isolation), and not on the pairwise distances of sequences taken from a multiple alignment. Building a tree this way should overcome the issue of some sequences not being aligned due to their extreme diversity.

Alternatively, I'd like to create a plot where distances between two points are proportional to the pairwise distances between the sequences they represent. In this way, I could start to visually identify clusters of sequences which might exist.


Something to consider: how meaningful can a phylogenetic tree be when it is based on entities that have almost nothing in common? It is like making a phylogeny of a 'house', an 'airplane', a 'table', and a 'sunflower'. It's not that a distance cannot be defined, but how meaningful could it be?

I would echo Michael's comment above. While you can do something like UPGMA on e-values from an all-versus-all BLAST run, it sounds like you need to think carefully about the biological implications of your question. What are you trying to ask, and why do you need something like pairwise distances? Clustering will at least tell you which sequences should be grouped together and which should not. Keep in mind that multiple sequence alignment does contain a pairwise heuristic of some sort as the initial step.

In this case I'm trying to establish a vague classification of repeat elements. I'm out to see if sequences form relatively distinct groups, from which I can manually classify a few representatives to get an idea of the whole. I'm not actually going to infer a biological relationship from the distantly related sequences; it's more that I don't know which ones are distant and which ones are close.

6.9 years ago by
Cambridge, US
Christian2.7k wrote:

Sounds like a problem for hierarchical clustering. Pairwise bit-scores or e-values derived from an all-vs-all BLAST search could serve as distance measures. I got good results with average-linkage clustering, for example using the UPGMA algorithm. There are many tools that can perform hierarchical clustering, for example MC-UPGMA or R ('hclust').
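To make the average-linkage idea concrete, here is a minimal sketch in plain Python. It is not the MC-UPGMA or R implementation; the function name `upgma`, the `frozenset`-keyed distance dictionary, and the toy distances are all assumptions made for illustration. It recomputes every cluster-to-cluster average at each step, so it is only suitable for small inputs; for ~2000 sequences a real tool such as R's `hclust(..., method = "average")` is the better choice.

```python
from itertools import combinations


def upgma(names, dist):
    """Naive average-linkage (UPGMA) clustering.

    names: list of sequence identifiers.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of names.
    Returns the merges in order as (cluster_a, cluster_b, average_distance),
    where each cluster is a tuple of member names.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            d = sum(dist[frozenset((x, y))] for x in a for y in b)
            d /= len(a) * len(b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical toy distances (not real data): s1/s2 are close, s3/s4 are close.
toy_names = ["s1", "s2", "s3", "s4"]
toy_dist = {
    frozenset(("s1", "s2")): 0.1,
    frozenset(("s3", "s4")): 0.2,
    frozenset(("s1", "s3")): 0.8,
    frozenset(("s1", "s4")): 0.9,
    frozenset(("s2", "s3")): 0.7,
    frozenset(("s2", "s4")): 0.8,
}
merges = upgma(toy_names, toy_dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The two tight pairs merge first, and the final merge joins the two groups at the average of all four cross-pair distances.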

A plot can be created using multidimensional scaling, which can also be performed in R.
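As a rough illustration of what multidimensional scaling does with a distance matrix, the sketch below implements classical (Torgerson) MDS in plain Python, using power iteration to find the leading eigenvectors. This is an assumption-laden toy, not what R does internally (R's `cmdscale` is the function to use in practice): the name `classical_mds` and the 3-4-5 triangle are made up for the example, and the power iteration can behave poorly when the leading eigenvalues are close together or when the distances are far from Euclidean.

```python
import math


def classical_mds(d, dims=2, iters=500):
    """Classical MDS on a symmetric distance matrix d (list of lists)."""
    n = len(d)
    # Double-centre the squared distances to obtain the Gram matrix B.
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        start = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
        v = [(i + 1) / start for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        # Negative eigenvalues (non-Euclidean distances) contribute nothing.
        scale = math.sqrt(lam) if lam > 0 else 0.0
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Hypothetical example: three points forming a 3-4-5 right triangle.
toy = [[0.0, 3.0, 4.0],
       [3.0, 0.0, 5.0],
       [4.0, 5.0, 0.0]]
coords = classical_mds(toy)
recovered = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in coords]
             for p in coords]
```

For distances that really do come from points in a plane, the recovered 2D distances match the input; for BLAST-derived distances the plot is only an approximation, and the coordinates can be written out and plotted with any tool.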

Yes, it could work, but I suggest using tblastx.

I'm not as familiar with R as I should be, so I'll be looking into CLANS first. Hierarchical clustering does sound exactly like what I'm trying to do, however. Thanks!

6.9 years ago by
Jan Kosinski1.6k wrote:


"The program takes unaligned fasta format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities in either 2D or 3D graphs. Contrary to phylogenetic inference methods this approach uses unaligned sequences"

In the output graph, similar sequences will group together spatially in clusters that are easy to analyze visually. You will also be able to select clusters and export the sequences that make them up.

The prerequisite is that the sequences you want to consider similar are actually similar enough for BLAST to align them and give good P-values to the alignments.
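The effect of that prerequisite on a distance matrix can be sketched as follows. This is not how CLANS works internally; it is a minimal illustration, assuming all-vs-all BLAST output in the standard 12-column tabular format (`-outfmt 6`, bit score in the last column) and assuming one particular normalization: the pair's best bit score divided by the smaller of the two self-hit scores. The function name `blast_to_distances` and the helper `hit` that fabricates toy rows are hypothetical. Pairs that BLAST never aligned simply get the maximum distance of 1.0.

```python
def blast_to_distances(lines):
    """Turn all-vs-all BLAST tabular hits into pairwise distances in [0, 1]."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        # Keep only the best-scoring hit for each (query, subject) pair.
        best[(query, subject)] = max(bits, best.get((query, subject), 0.0))

    names = sorted({name for pair in best for name in pair})
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = max(best.get((a, b), 0.0), best.get((b, a), 0.0))
            self_score = min(best.get((a, a), 0.0), best.get((b, b), 0.0))
            if self_score > 0:
                # Assumed normalization: 0 = as good as a self hit, 1 = no hit.
                dist[frozenset((a, b))] = 1.0 - min(score / self_score, 1.0)
            else:
                dist[frozenset((a, b))] = 1.0
    return names, dist


def hit(query, subject, bits):
    """Build a toy 12-column row; only the IDs and bit score are meaningful."""
    return "\t".join([query, subject] + ["0"] * 9 + [str(bits)])


# Hypothetical hits: s1 and s2 align to each other, s3 aligns only to itself.
toy_hits = [
    hit("s1", "s1", 100.0),
    hit("s2", "s2", 80.0),
    hit("s3", "s3", 90.0),
    hit("s1", "s2", 60.0),
]
names, dist = blast_to_distances(toy_hits)
```

In this toy input, s1 and s2 end up at distance 0.25, while s3 sits at distance 1.0 from both, which is all a clustering or layout method can know about a sequence that BLAST could not align to anything else.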

Also see: What softwares can be used for clustering nucleic acid fragments??

This is a very intuitive program that does almost everything I'm looking for. Thanks!

6.9 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

If the sequences are very distant, it might be better to work with translated amino acid sequences to pick up very weak similarity. Can you translate your sequences prior to analysis? Using a tblastx approach together with CLANS, as proposed by Jan, might be an option, although it inflates the search complexity.

Clustering and visualizing a distance matrix, e.g. in R, is easy; deriving the distance matrix is the real difficulty here. Other approaches for creating distance matrices include sequence composition, i.e. n-word counts such as di- or tri-nucleotide counts, or n-amino-acid counts.
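A composition-based distance needs no alignment at all. The sketch below is one simple way to do it, assuming tri-nucleotide frequencies and plain Euclidean distance between the frequency vectors; the function names and toy sequences are hypothetical, and other word lengths or distance measures may suit repeat elements better.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Relative frequencies of all 4**k nucleotide words of length k."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy sequences: the first two differ by one base.
toy_seqs = ["ACGTACGTACGT", "ACGTACGTACGA", "AAAAAAAAAAAA"]
profiles = [kmer_profile(s) for s in toy_seqs]
matrix = [[euclidean(p, q) for q in profiles] for p in profiles]
for row in matrix:
    print([round(x, 3) for x in row])
```

The resulting square matrix can be fed to hierarchical clustering or multidimensional scaling in the same way as BLAST-derived distances.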

