Question: How To Cluster 454 Reads?
7.8 years ago by
Geneva, Switzerland
Panos1.6k wrote:

I have ~1.2 million 454 reads and I want to cluster them by DNA sequence (e.g. grouping those that share at least 90% identity over at least 70% of their length). I know that, at least for smaller datasets (a few thousand sequences), blastclust works well.

What happens though, if you have hundreds of thousands or even millions of sequences? What program(s) do you use?

I tried blastclust, but it has been running for more than 4 days without printing any progress messages, so I have no idea how long it will take.

I also tried what the authors of the CANGS pipeline suggest, but mafft-distance creates a distance matrix that is far too large (for ~50,000 sequences it has reached ~240 GB)! Even if this is normal, I don't have that much free hard drive space to store the file!

short clustering • 4.9k views
ADD COMMENTlink written 7.8 years ago by Panos1.6k

I think you can't do that, because the space a distance matrix requires is quadratic in the number of objects. If I haven't miscalculated (assuming a 32-bit float per value), you would need (32 bit * (1.2e6^2)/2)/(8 bit/byte * 1024^3) = 2682.209 GB of memory! There may be a different approach out there that doesn't require computing the whole distance matrix ahead of time. At the very least, you cannot hold it in memory.
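The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same estimate: the function name `half_matrix_bytes` is made up for illustration, and it assumes 4 bytes (32 bits) per distance and roughly n^2/2 stored values, exactly as in the calculation above.

```python
def half_matrix_bytes(n_sequences, bytes_per_value=4):
    """Bytes needed to store about n^2 / 2 pairwise distances.

    Assumes one fixed-size value per pair and a symmetric matrix,
    so only half of the n x n entries are kept.
    """
    return n_sequences * n_sequences // 2 * bytes_per_value


GIB = 1024 ** 3  # bytes per binary gigabyte, as in the formula above

size_gib = half_matrix_bytes(1_200_000) / GIB
print(f"1.2 million reads: {size_gib:.3f} GB")
```

Doubling the number of reads multiplies the requirement by four, which is why approaches that avoid the full matrix are needed at this scale.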

ADD REPLYlink written 7.8 years ago by Michael Dondrup44k
7.8 years ago by
iw9oel_ad6.0k wrote:

You can try cd-hit. I've used this to create non-redundant protein sets from very large inputs (millions of sequences). It also includes cd-hit-est which will cluster DNA sequences.

Also, cdhit-454 may be of use to you (I have not used it).
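To illustrate why this family of tools scales without a full distance matrix, here is a toy sketch of greedy incremental clustering, the general strategy cd-hit is based on: sequences are processed longest first, and each one either joins the first representative it matches or becomes a new representative. Everything here is an assumption made for illustration: the function names are hypothetical, and the `identity` measure is a naive ungapped, left-aligned comparison, not the alignment or word filtering that cd-hit actually performs.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence.

    Naive ungapped comparison starting at position 0 (illustration only).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, min_identity=0.90, min_coverage=0.70):
    """Greedy incremental clustering of a {name: sequence} dict.

    Returns {representative_name: [member_names]}. Each sequence is
    compared only against current representatives, never all-vs-all.
    """
    clusters = {}
    # Longest first, so every representative is at least as long as its members.
    for name in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        seq = seqs[name]
        for rep in clusters:
            rep_seq = seqs[rep]
            coverage = len(seq) / len(rep_seq)  # compared fraction of the representative
            if coverage >= min_coverage and identity(seq, rep_seq) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no representative matched: start a new cluster
    return clusters


reads = {
    "r1": "ACGTACGTAC",
    "r2": "ACGTACGTAA",  # 9/10 identical to r1
    "r3": "TTTTGGGGCC",  # unrelated
    "r4": "ACGTACGT",    # identical prefix of r1, covers 80% of it
    "r5": "ACGTA",       # covers only 50% of r1
}
result = greedy_cluster(reads)
print(result)
```

The number of comparisons grows with the number of clusters rather than with the square of the number of reads, which is the property that matters for millions of sequences.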

Further information: having just downloaded cdhit-v4.0-2010-04-20, I found that it fails to compile, with errors including:

cdhit-common.c++:1292: error: ‘uint64_t’ was not declared in this scope

It appears that a header file is not being included. Adding


to the top of cdhit-common.h allowed it to compile. I was about to delete my answer because of this...

However, it seems the latest code is actually here. This builds for me (Linux x86_64) without errors. They could really do with chasing down and removing their redundant pages that contain non-building code. The definitive cd-hit page now seems to be at the Weizhong lab.

ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by iw9oel_ad6.0k

Really useful program, Keith! One question, though: is it only for amino acid sequences? (I have nucleotide sequences.)

ADD REPLYlink written 7.8 years ago by Panos1.6k

Yes, see above re. cd-hit-est. It will also run in parallel if you have more than one CPU.

ADD REPLYlink written 7.8 years ago by iw9oel_ad6.0k

For some reason, cd-hit-est can't be compiled. When I run "make" it gives me multiple errors and no binaries are built (my OS is Ubuntu 10.04, 64-bit).

ADD REPLYlink written 7.8 years ago by Panos1.6k

This might be the error I describe above.

ADD REPLYlink written 7.8 years ago by iw9oel_ad6.0k

Yes Keith, it was exactly the same error. I downloaded the latest code from Google Code, as you said, and it compiles OK! Thanks!

ADD REPLYlink written 7.8 years ago by Panos1.6k
7.8 years ago by
Elipapa90 wrote:

The very versatile UCLUST program (recently renamed USEARCH) is much faster than CD-HIT. The 32-bit version can be installed for free if you are part of an academic institution. Once installed it's a matter of calling:

uclust --sort large.fasta --output large_sorted.fasta
uclust --input large_sorted.fasta --uc results.uc --id 0.90

Larger sets can be split and sorted using the option


as explained in the PDF copy of the manual.
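Once a `results.uc` file exists, the cluster memberships have to be read back out of it. The following is a minimal sketch, not an official parser: it assumes the tab-separated .uc layout in which column 1 is the record type (`S` for a new seed, `H` for a hit, `C` for a cluster summary), column 2 is the cluster number and column 9 is the query label. Please check these column positions against the manual for your version; the function name `parse_uc` and the read labels are made up for the example.

```python
from collections import defaultdict


def parse_uc(lines):
    """Group read labels by cluster number from .uc-style lines.

    Assumed layout (verify against your manual): tab-separated,
    field 1 = record type, field 2 = cluster number, field 9 = query label.
    Only seed (S) and hit (H) records are treated as cluster members.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a record in the assumed layout
        record_type, cluster_nr, query_label = fields[0], fields[1], fields[8]
        if record_type in ("S", "H"):
            clusters[int(cluster_nr)].append(query_label)
    return dict(clusters)


# Hypothetical records; in practice use: parse_uc(open("results.uc"))
example = [
    "# example header line",
    "S\t0\t250\t*\t*\t*\t*\t*\tsiteA_read1\t*",
    "H\t0\t240\t95.0\t+\t0\t0\t240M\tsiteB_read7\tsiteA_read1",
    "S\t1\t230\t*\t*\t*\t*\t*\tsiteB_read2\t*",
    "C\t0\t2\t95.0\t*\t*\t*\t*\tsiteA_read1\t*",
]
uc_clusters = parse_uc(example)
print(uc_clusters)
```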

ADD COMMENTlink written 7.8 years ago by Elipapa90
7.8 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

Vmatch has excellent clustering capabilities.

ADD COMMENTlink written 7.8 years ago by Jeremy Leipzig17k
7.7 years ago by
Queen Mary University London
Yannick Wurm2.3k wrote:

How about simply using Newbler to assemble them with those parameters? You can get the information about which reads assemble with which from one of the Newbler output files, and use that as your clustering info.

ADD COMMENTlink written 7.7 years ago by Yannick Wurm2.3k
6.4 years ago by
Kaiserslautern, Germany
Frédéric Mahé2.7k wrote:

DNAclust is a new clustering algorithm. The paper itself is a good entry point if you want to understand the principles and limitations of clustering algorithms. For instance, USEARCH is fast when you want to build clusters of distant sequences, while DNAclust is better when you want to build clusters of sequences with 98 or 99% identity. A warning, though: the results are not 100% guaranteed, and some identical sequences may not be clustered as one would expect.

ADD COMMENTlink written 6.4 years ago by Frédéric Mahé2.7k
7.8 years ago by
Cambridge, UK
Stefano Berri4.0k wrote:

I think you are not tackling the problem correctly. Why do you need to "cluster" these sequences? Couldn't you solve the problem (or at least make it easier) by aligning to a reference genome first? If a genome is not available, how about de novo assembly? It would tell you which sequences overlap. What are you sequencing? How many clusters do you expect? Normally I would expect clusters like the ones you describe either for very conserved gene families or for partially overlapping sequences. Both cases can be tackled differently.

Give us more details ;)

ADD COMMENTlink written 7.8 years ago by Stefano Berri4.0k

I have 2 sampling sites and one of the things I want to do is pool the sequences from the 2 samples, determine the clusters and see whether there are clusters that appear preferentially in either of the 2 sites.
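As a rough sketch of that last step, the snippet below tallies how many reads each site contributes to each cluster. It assumes the clustering result is already available as a mapping from cluster ID to read labels, and that every read label starts with its site name followed by an underscore (e.g. `siteA_read1`); both the naming scheme and the function names are hypothetical.

```python
from collections import Counter


def site_of(label):
    """Site name taken from a label like 'siteA_read1' (assumed naming scheme)."""
    return label.split("_", 1)[0]


def site_counts(clusters):
    """Return {cluster_id: Counter({site: number_of_reads})}."""
    return {cid: Counter(site_of(label) for label in labels)
            for cid, labels in clusters.items()}


def skewed_clusters(counts, min_fraction=0.8):
    """IDs of clusters in which a single site holds at least min_fraction of the reads."""
    skewed = []
    for cid, counter in counts.items():
        total = sum(counter.values())
        if total and max(counter.values()) / total >= min_fraction:
            skewed.append(cid)
    return skewed


pooled = {
    0: ["siteA_read1", "siteA_read2", "siteA_read3", "siteA_read4", "siteB_read1"],
    1: ["siteA_read5", "siteB_read2"],
    2: ["siteB_read3", "siteB_read4"],
}
counts = site_counts(pooled)
print(counts)
print(skewed_clusters(counts))
```

Raw fractions like these only point at candidates. Small clusters and unequal sequencing depth between the two sites can produce the same pattern by chance, so a statistical test on the counts would still be needed.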

ADD REPLYlink written 7.8 years ago by Panos1.6k

I think QIIME will be excellent for this! It does alpha and beta diversity plots and rarefaction curves etc. I highly recommend it! Let me know if you decide to go that way and if you need help setting it up ;-)

ADD REPLYlink written 6.4 years ago by Steve Moss2.2k
7.8 years ago by
Casbon3.2k wrote:

Accurate determination of microbial diversity from 454 pyrosequencing data

We present an algorithm, PyroNoise, that clusters the flowgrams of 454 pyrosequencing reads using a distance measure that models sequencing noise. This infers the true sequences in a collection of amplicons. We pyrosequenced a known mixture of microbial 16S rDNA sequences extracted from a lake and found that without noise reduction the number of operational taxonomic units is overestimated but using PyroNoise it can be accurately calculated.

ADD COMMENTlink written 7.8 years ago by Casbon3.2k
6.4 years ago by
United Kingdom
Steve Moss2.2k wrote:

What data do you have? E.g. which gene? You mention CANGSDB, so I'm assuming SSU rRNA.

I briefly looked at CANGS/CANGSDB, but have you considered QIIME? It uses UCLUST (which I think is the best at the moment) or cd-hit/cd-hit454 for clustering, amongst others, and above all it outputs some excellent, publication-quality visualisations of the data (check out their tutorial)!

Given your comments on an earlier post about what you are hoping to test, I think QIIME will be the ideal pipeline for performing the analyses you require on your data.

I have also used Metaxa, which uses HMM-profiles and HMMER, but the output needs to be parsed and analysed manually.

Papers are here:

Incidentally, I did this with ~455,000 sequences.

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Steve Moss2.2k