Cluster plot of nucleotide fasta sequences
7.8 years ago
roblogan6 ▴ 30

I have about 10,000 individual DNA sequences in FASTA format, both as a single large multi-FASTA file and as separate files per sequence, so I can be flexible with the input format. I do need to work with FASTA format, though, and am not looking to compromise there.

I am looking for a way to display these sequences graphically as a cluster plot. I am largely unfamiliar with MATLAB and R, but I know enough to see that they shine at graphical output and tend to rely on numerical input (such as .csv files). I can't figure out how to use the R function hclust, for example, with my FASTA file(s). This might just be because I don't know R very well.

A tree would be fine too, but aligning these sequences before passing them to a tree-building tool like RAxML is extremely computationally heavy. I tried to align all of these sequences with MAFFT on a remote server, and the job timed out. In addition, I think a cluster plot is more visually appealing than a tree. However, if all I can get is a tree, that is better than what I have now!

Simply stated, the goal is to quickly see how many rough clusters of DNA sequences I have. Thanks for any help you can provide. -Rob

fasta cluster anlysis DNA • 2.9k views

See also: Hierarchical Clustering

And the Muscle manual on large alignments: http://www.drive5.com/muscle/manual/bigalignments.html

You would need a distance matrix somehow; however, as you experienced, a multiple alignment of 10,000 sequences is computationally heavy. CD-HIT might be your best bet. I would forget about plotting such a dendrogram for now, as dendrograms that large are not very useful in my opinion.
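If the goal is only a rough cluster count, one way to get distances without any alignment is to compare k-mer content. The sketch below is an assumption-laden illustration, not how CD-HIT works internally: it uses Jaccard distance on 4-mer sets and a simple greedy pass (longest sequence first, each sequence joins the first representative within a threshold). The function names and the `k` and `max_dist` values are made up for this example and would need tuning on real data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_set(seq, k=4):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def greedy_clusters(records, k=4, max_dist=0.5):
    """Greedy clustering: longest sequences first, each sequence joins the
    first cluster whose representative is within max_dist, else starts one.
    k and max_dist are illustrative defaults, not recommended values."""
    reps = []  # list of (representative k-mer set, member headers)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        kmers = kmer_set(seq, k)
        for rep_kmers, members in reps:
            if jaccard_distance(kmers, rep_kmers) <= max_dist:
                members.append(header)
                break
        else:
            reps.append((kmers, [header]))
    return [members for _, members in reps]


if __name__ == "__main__":
    toy = """>a
ACGTACGTACGTACGT
>b
ACGTACGTACGTACGA
>c
TTTTGGGGCCCCTTTT
>d
TTTTGGGGCCCCTTTA
""".splitlines()
    clusters = greedy_clusters(list(read_fasta(toy)))
    print(len(clusters))  # 2 for this toy input
```

For a real file you would pass an open file handle to `read_fasta` instead of the toy lines. The greedy pass compares each sequence only against cluster representatives, so it avoids building the full 10,000 x 10,000 matrix; if you do want a matrix for hclust, the same `jaccard_distance` could be evaluated for every pair and written out as .csv.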


What kind of sequences are they? 16S?

7.8 years ago
piet ★ 1.8k

"it is extremely computationally heavy to align these sequences"

It depends very much on the sequences: on their size and on how similar they are to each other. Does it even make sense to align them to each other?

Aligning all of them in one run is almost always impossible. The 'trick' is to divide the problem into smaller chunks, which is often called profile alignment. In a first round you align subsets of sequences, and then you align the already aligned subsets to each other.
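The first step, splitting the input into subsets, can be sketched as follows. This only covers the chunking; aligning each chunk and then merging the resulting alignments would be done with your aligner of choice and is not shown. The function names and the chunk size are assumptions for illustration, and consecutive chunks are only a sensible grouping if similar sequences already sit next to each other in the file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def chunk_records(records, chunk_size):
    """Yield lists of at most chunk_size consecutive records."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def format_fasta(records, width=60):
    """Return FASTA text for the records, wrapping sequences at width."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


def split_fasta(in_path, out_prefix, chunk_size=500):
    """Write consecutive chunks to out_prefix_0000.fasta, out_prefix_0001.fasta, ...
    The naming scheme and default chunk size are hypothetical."""
    paths = []
    with open(in_path) as handle:
        for n, chunk in enumerate(chunk_records(read_fasta(handle), chunk_size)):
            path = f"{out_prefix}_{n:04d}.fasta"
            with open(path, "w") as out:
                out.write(format_fasta(chunk))
            paths.append(path)
    return paths
```

Each file written by `split_fasta` is small enough to align on its own in the first round; the second round then works on those per-chunk alignments rather than on the raw sequences.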
