Visualizing Pairwise Distance Between A Group Of Sequences
9.3 years ago
David M ▴ 550

I have a library of ~2000 nucleotide sequences whose similarity I'd like to visualize. Because some of the sequences are very diverse, it's not feasible to perform a multiple sequence alignment and then build a phylogenetic tree, as (at least when I perform this process using MEGA) no common bases can be found.

I'm looking for a way to build a phylogenetic tree based on individual pairwise distances between each pair of sequences (aligned in isolation), and not on the pairwise distances of sequences taken from a multiple alignment. Building a tree this way should overcome the issue of some sequences not being aligned due to their extreme diversity.
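To make the idea of "each pair aligned in isolation" concrete, here is a minimal Python sketch. It is an assumption-laden stand-in, not the method of any tool named in this thread: it uses plain edit distance (a unit-cost global alignment) divided by the longer sequence length as the pairwise distance. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, i.e. a unit-cost global alignment of a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # gap in b
                curr[j - 1] + 1,           # gap in a
                prev[j - 1] + (ca != cb),  # match (0) or mismatch (1)
            ))
        prev = curr
    return prev[-1]


def pairwise_distances(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Normalized distance in [0, 1] for every unordered pair of sequences."""
    dist = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        longest = max(len(seq_a), len(seq_b)) or 1  # avoid division by zero
        dist[(name_a, name_b)] = edit_distance(seq_a, seq_b) / longest
    return dist


# Toy sequences, made up for illustration only.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
dist = pairwise_distances(seqs)
```

Each pair is scored independently, so no multiple alignment is needed. Pure Python will be slow for ~2000 sequences (about two million pairs), so treat this as an illustration of the idea rather than something to run at that scale.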

Alternatively, I'd like to create a plot where distances between two points are proportional to the pairwise distances between the sequences they represent. In this way, I could start to visually identify clusters of sequences which might exist.

Thanks!

phylogenetics clustering • 7.0k views

One question: how meaningful can a phylogenetic tree be if it is based on entities that have almost nothing in common? It would be like making a phylogeny of a 'house', an 'airplane', a 'table', and a 'sunflower'. It's not that a distance cannot be defined, but how meaningful would it be?


I would echo Micheal's comment above. While you can do something like UPGMA on e-values from an all-versus-all BLAST run, it sounds like you need to think carefully about the biological implications of your question. What are you trying to ask, and why do you need something like pairwise distances? Clustering will at least tell you which sequences should be grouped together and which should not. Keep in mind that multiple sequence alignment does contain a pairwise heuristic of some sort as the initial step.


In this case I'm trying to establish a rough classification of repeat elements. I want to see whether the sequences fall into relatively distinct groups, from which I can manually classify a few representatives to get an idea of the whole. I'm not actually going to infer a biological relationship from the distantly related sequences; it's more that I don't know which ones are distant and which ones are close.

9.3 years ago
Christian ★ 3.0k

Sounds like a problem for hierarchical clustering. Pairwise bit-scores or e-values derived from an all-vs-all BLAST search could serve as distance measures (bit-scores measure similarity, so they need to be converted to a distance first). I got good results with average-linkage clustering, for example using the UPGMA algorithm. There are many tools that can perform hierarchical clustering, for example MC-UPGMA or R ('hclust').
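As a rough illustration of this approach, here is a minimal Python sketch of average-linkage (UPGMA) clustering on distances derived from bit-scores. Several things are my assumptions, not part of the answer above: the conversion `1 - score / min(self-scores)`, the treatment of pairs without a hit as distance 1, the function names, and the toy hit list. It is not how MC-UPGMA or `hclust` are implemented.

```python
def scores_to_distance(hits):
    """Build dist(a, b) from (query, subject, bitscore) rows.

    Assumed conversion: 1 - best pair score / smaller self-hit score.
    Pairs with no hit, or sequences with no self-hit, get distance 1.
    """
    best = {}
    for q, s, score in hits:
        key = (q, s) if q <= s else (s, q)
        best[key] = max(best.get(key, 0.0), score)  # keep the better direction

    def dist(a, b):
        key = (a, b) if a <= b else (b, a)
        self_score = min(best.get((a, a), 0.0), best.get((b, b), 0.0))
        if self_score <= 0:
            return 1.0
        return max(0.0, 1.0 - best.get(key, 0.0) / self_score)

    return dist


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    d = {(i, j): dist(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        new = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted mean = average over all member pairs.
            new[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy hit list, made up for illustration only.
hits = [("A", "A", 100), ("B", "B", 100), ("C", "C", 100), ("D", "D", 100),
        ("A", "B", 90), ("B", "A", 88), ("C", "D", 80), ("A", "C", 10)]
dist = scores_to_distance(hits)
tree = upgma(["A", "B", "C", "D"], dist)
```

With real data the rows could be parsed from BLAST+ tabular output (`-outfmt 6`), where the bit-score is the last of the twelve default columns. Whether every sequence reports a hit against itself depends on the search settings, so check that before relying on the self-score normalization.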

A plot can be created using multidimensional scaling, which can also be performed in R.
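For readers who want to see what multidimensional scaling does, here is a minimal Python sketch of classical (Torgerson) MDS. It assumes a symmetric distance matrix and uses simple power iteration instead of a proper eigen-solver. The function name and the toy matrix are hypothetical, and a library routine (for example `cmdscale` in R) would be the better choice for real data.

```python
import math
import random


def classical_mds(D, dims=2, iters=500):
    """Classical MDS via power iteration; returns n x dims coordinates.

    Assumes D is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    grand = sum(row) / n
    # Double-centre the squared distances to get the Gram matrix B.
    B = [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
         for i in range(n)]
    rng = random.Random(0)  # fixed seed so the layout is reproducible
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient
        if eigval <= 1e-12:
            continue  # non-positive axis: leave this coordinate at zero
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eigval)
            for j in range(n):
                B[i][j] -= eigval * v[i] * v[j]  # deflate the axis just found
    return coords


# Toy distances of a 3-4-5 right triangle, made up for illustration only.
D = [[0.0, 3.0, 4.0],
     [3.0, 0.0, 5.0],
     [4.0, 5.0, 0.0]]
xy = classical_mds(D)
```

Plotting the two columns of `xy` as a scatter plot gives points whose separation approximates the input distances. BLAST-derived distances are generally not Euclidean, so some axes can come out negative; this sketch simply leaves such an axis at zero, which is a limitation rather than a recommended treatment.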


Yes, it could work, but I suggest using tblastx.


I'm not as familiar with R as I should be, so I'll be looking into CLANS first. Hierarchical clustering does sound exactly like what I'm trying to do, however. Thanks!

9.3 years ago
Jan Kosinski ★ 1.6k

Use CLANS:

The program takes unaligned FASTA format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities in either 2D or 3D graphs. Contrary to phylogenetic inference methods, this approach uses unaligned sequences.

In the output graph, similar sequences group together spatially in clusters that are easy to analyze visually. You will also be able to select clusters and export the sequences that make up each cluster.

The prerequisite is that the sequences you want to consider similar are similar enough for BLAST to align them and assign good P-values to the alignments.

Also see this post.


This is a very intuitive program that does almost everything I'm looking for. Thanks!

9.3 years ago

If the sequences are very distant, it might be better to work with translated amino acid sequences, to pick up very weak similarity. Can you translate your sequences prior to analysis? Using a tblastx approach together with CLANS, as proposed by Jan, might be an option, although it inflates the search complexity.

Clustering and visualizing a distance matrix, e.g. in R, is easy; deriving the distance matrix is the real difficulty here. Other approaches for creating distance matrices include sequence composition, i.e. n-word counts such as di-/tri-nucleotide counts or n-amino-acid counts.
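The composition idea can be sketched in a few lines of Python. This is a minimal example under my own assumptions: k-mer counts are turned into frequencies, and the Euclidean distance between frequency profiles is used as the distance. The function names and sequences are hypothetical, and other profile distances would be just as defensible.

```python
import math
from collections import Counter


def kmer_profile(seq, k=3):
    """Relative frequency of every overlapping k-mer (n-word) in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    kmers = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in kmers))


# Toy sequences, made up for illustration only.
a = kmer_profile("ACGTACGTACGT")
b = kmer_profile("ACGTACGTACGA")
c = kmer_profile("TTTTTTGGGGGG")
d_ab = profile_distance(a, b)
d_ac = profile_distance(a, c)
```

No alignment is involved, so this works even when sequences are too diverged to align, at the cost of ignoring the order of the words. The sketch does not treat ambiguous bases such as N specially.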
