Question: Cluster plot of nucleotide fasta sequences
3.7 years ago by
roblogan6 (30) wrote:

I have about 10,000 individual DNA sequences in FASTA format, both as a single large multi-FASTA file and as separate files per sequence, so I can be flexible with the input format. I should be able to work with FASTA format, though, and am not looking to compromise there.

I am looking for a way to display these sequences graphically as a cluster plot. I am generally unfamiliar with MATLAB and R, but familiar enough to know that they shine at graphical output yet tend to rely on numerical input (like .csv files). I can't figure out how to use R's hclust function, for example, with my FASTA file(s). This might just be because I don't know R very well.

A tree would be fine too, but it is extremely computationally heavy to align these sequences before putting them into a tree-building tool like RAxML. I tried to align all of these sequences with MAFFT on a remote server and the job timed out. In addition, I think a cluster plot is more visually appealing than a tree. However, if all I can get is a tree, that is better than what I have now!

Simply stated, the goal is to quickly see how many rough clusters of DNA sequences I have. Thanks for any help you can provide. -Rob

dna anlysis cluster fasta • 1.5k views
modified 3.7 years ago by piet (1.7k) • written 3.7 years ago by roblogan6 (30)

See also: Hierarchical Clustering

And the Muscle manual on large alignments:

You would need a distance matrix of some kind; however, as you experienced, a multiple alignment of 10,000 sequences is computationally heavy. CD-HIT might be your best bet. I would forget about plotting such a dendrogram for now, as dendrograms that large are not very useful, in my opinion.
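
To make the "distance matrix without a multiple alignment" idea concrete, here is a minimal Python sketch. It is an illustration only, not how CD-HIT works internally: it assumes a Jaccard distance on k-mer sets as a stand-in for sequence identity, and then applies a longest-first greedy grouping that is only loosely in the spirit of greedy clustering tools. The function names, `k=4`, and the `0.5` threshold are all hypothetical choices, and a pure-Python all-against-all comparison would be slow for 10,000 sequences.

```python
from itertools import combinations


def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into a list of (name, sequence)."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def kmer_set(seq, k=4):
    """Set of all overlapping k-mers in a sequence (assumed alignment-free profile)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 means identical k-mer content."""
    union = len(a | b)
    if not union:
        return 0.0
    return 1.0 - len(a & b) / union


def distance_matrix(records, k=4):
    """Symmetric all-against-all distance matrix as a list of lists."""
    profiles = [kmer_set(seq, k) for _, seq in records]
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = jaccard_distance(profiles[i], profiles[j])
    return dist


def greedy_clusters(records, threshold=0.5, k=4):
    """Longest-first greedy grouping: each sequence joins the first
    representative within `threshold`, otherwise it starts a new cluster."""
    order = sorted(records, key=lambda r: len(r[1]), reverse=True)
    reps = []  # list of (representative k-mer set, member names)
    for name, seq in order:
        profile = kmer_set(seq, k)
        for rep_profile, members in reps:
            if jaccard_distance(profile, rep_profile) <= threshold:
                members.append(name)
                break
        else:
            reps.append((profile, [name]))
    return [members for _, members in reps]


# Tiny made-up example; real input would come from open("your_file.fasta").
demo = """>a
ACGTACGTACGTACGTACGT
>b
ACGTACGTACGTACGTACGA
>c
TTTTGGGGCCCCTTTTGGGG
""".splitlines()

records = read_fasta(demo)
dist = distance_matrix(records)
clusters = greedy_clusters(records)
print(len(clusters), "rough clusters:", clusters)
```

The matrix returned by `distance_matrix` is plain numeric data, so it could be written out as a .csv and handed to a hierarchical clustering routine, while `greedy_clusters` answers the "how many rough clusters" question directly without any plotting.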

modified 3.7 years ago • written 3.7 years ago by Michael Dondrup (47k)

What kind of sequences are they? 16S?

written 3.7 years ago by Brian Bushnell (17k)
3.7 years ago by
planet earth
piet (1.7k) wrote:

"it is extremely computationally heavy to align these sequences"

It depends very much on the sequences, on their size and on how similar they are to each other. Does it even make sense to align them to each other?

Aligning all of them in one run is almost always impossible. But the 'trick' is to divide the problem into smaller chunks. This is often called profile alignment. In a first round you align subsets of sequences, and then you align the already-aligned subsets to each other.
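
The bookkeeping for this divide-and-merge approach can be sketched in a few lines of Python. This is only a rough outline under stated assumptions: the chunk size, the `chunk0`, `chunk1`, ... labels and both function names are made up for illustration, and the actual alignment and profile-profile merging would still be done by an external aligner, which is not called here. The sketch splits the records in input order; grouping similar sequences into the same chunk first would probably work better.

```python
def split_into_chunks(records, chunk_size):
    """Split a list of (name, sequence) records into consecutive subsets."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def to_fasta(records, width=60):
    """Render records as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def merge_plan(labels):
    """Pair up alignments round by round until a single one remains.

    Returns a list of rounds; each round is a list of (left, right) pairs
    that would be merged by a profile-profile alignment step.
    """
    rounds = []
    current = list(labels)
    while len(current) > 1:
        pairs, merged = [], []
        for i in range(0, len(current) - 1, 2):
            pairs.append((current[i], current[i + 1]))
            merged.append("(" + current[i] + "+" + current[i + 1] + ")")
        if len(current) % 2:
            merged.append(current[-1])  # odd one out waits for the next round
        rounds.append(pairs)
        current = merged
    return rounds


# Made-up records standing in for a parsed multi-FASTA file.
records = [("seq%d" % i, "ACGT" * 3) for i in range(10)]

# Round one: each chunk would be aligned on its own by the external tool.
chunks = split_into_chunks(records, 4)
labels = ["chunk%d" % i for i in range(len(chunks))]
chunk_fasta = {label: to_fasta(chunk) for label, chunk in zip(labels, chunks)}

# Later rounds: aligned chunks are merged pairwise, profile against profile.
plan = merge_plan(labels)
for number, pairs in enumerate(plan, start=1):
    print("merge round", number, pairs)
```

Each entry of `chunk_fasta` can be written to its own file and aligned separately, which keeps every individual job small enough to finish before a server timeout.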

written 3.7 years ago by piet (1.7k)