8
9
12.7 years ago
Panos ★ 1.8k

I have ~1.2 million 454 reads and I want to cluster them according to their DNA sequence (e.g. those that share at least 90% identity over at least 70% of their length)... I know that, at least for smaller datasets (a few thousand sequences), blastclust works well.

What happens though, if you have hundreds of thousands or even millions of sequences? What program(s) do you use?

I tried blastclust but it's been running for more than 4 days and it doesn't print any progress message, so I have no idea how long it will take...

I also tried what the authors of the CANGS pipeline suggest, but mafft-distance creates far too big a distance matrix (for ~50,000 sequences it had already reached ~240 GB)! Even if this is normal, I don't have that much free hard drive space to store the file!

clustering short
2

I think you can't do that, because a distance matrix's space requirement is quadratic in the number of objects: if I calculated correctly (assuming a float/double = 32 bits per entry), you would need (32 bit * (1.2e6^2)/2) / (8 bit/byte * 1024^3) = 2682.209 GB of memory! There may be approaches out there that avoid computing the whole distance matrix up front, but at the very least you cannot hold it in memory.
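For reference, a quick back-of-the-envelope sketch of that calculation (assuming 4-byte floats and storing only the upper triangle of the matrix):

```python
# Rough memory estimate for a pairwise distance matrix
# (assumes 4-byte floats and only the upper triangle is stored).
n = 1_200_000                  # number of sequences
entries = n * n / 2            # upper triangle, ~n^2/2 distances
gib = entries * 4 / 1024**3    # 4 bytes per float, converted to GiB
print(f"{gib:.1f} GiB")        # → 2682.2 GiB
```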

8
12.7 years ago

You can try cd-hit. I've used this to create non-redundant protein sets from very large inputs (millions of sequences). It also includes cd-hit-est, which will cluster DNA sequences.

Also, cdhit-454 may be of use to you (I have not used it).
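For the use case in the question (≥90% identity over ≥70% of the length), a cd-hit-est invocation might look roughly like this — filenames here are placeholders, and you should double-check the flag meanings against the documentation for your cd-hit version:

```shell
# Cluster nucleotide reads at >=90% identity covering >=70% of the
# shorter sequence's length. reads.fasta is a placeholder filename;
# -n 8 is a word size suitable for identity thresholds around 0.90,
# -M is the memory limit in MB and -T the number of threads.
cd-hit-est -i reads.fasta -o clusters90 -c 0.90 -aS 0.7 -n 8 -M 4000 -T 4
```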

Further information: having just downloaded cdhit-v4.0-2010-04-20, it fails to compile with errors including:

cdhit-common.c++:1292: error: ‘uint64_t’ was not declared in this scope


Adding

#include <stdint.h>


to the top of cdhit-common.h allowed it to compile. I was about to delete my answer because of this...

However, it seems the latest code is actually here. This builds for me (Linux x86_64) without errors. They could really do with chasing down and removing the redundant pages that contain non-building code. The definitive cd-hit page now seems to be at the Weizhong lab.

0

Really useful program, Keith! One question, though... is it only for amino acid sequences (I have nucleotide sequences)?

0

Yes, see above re. cd-hit-est. It will also run in parallel if you have more than one CPU.

0

For some reason, cd-hit-est won't compile. When I run "make" I get multiple errors and no binaries are produced (my OS is Ubuntu 10.04, 64-bit)...

0

This might be the error I describe above.

0

Yes Keith, it was exactly the same error. I downloaded the latest code from Google Code, as you said, and it compiles OK! Thanks!

8
12.7 years ago
Elipapa ▴ 90

The very versatile UCLUST program (recently renamed USEARCH) is much faster than CD-HIT. The 32-bit version can be installed for free if you are part of an academic institution.

Once installed it's a matter of calling:

uclust --sort large.fasta --output large_sorted.fasta
uclust --input large_sorted.fasta --uc results.uc --id 0.90


Larger sets can be split and sorted using the --mergesort option, as explained in the PDF copy of the manual.

2
12.7 years ago

Vmatch has excellent clustering capabilities.

2
12.5 years ago
Yannick Wurm ★ 2.4k

How about simply using Newbler to assemble them with those parameters? You can get the information about which reads assemble with which from one of the Newbler output files, and use that as your clustering info.

2
11.2 years ago

DNACLUST is a newer clustering algorithm. The paper itself is a good entry point if you want to understand the principles and limitations of clustering algorithms. For instance, USEARCH is fast when you want to build clusters of distant sequences, while DNACLUST is better when you want to build clusters of sequences at 98 or 99% identity. A warning, though: the results are not 100% guaranteed; some identical sequences may not be clustered as you would expect.

1
12.7 years ago

I think you are not tackling the problem correctly. Why do you need to "cluster" these sequences? Couldn't you solve the problem (or at least make it easier) by aligning to a reference genome first? If a genome is not available, how about de novo assembly? It would tell you which sequences overlap. What are you sequencing? How many clusters do you expect? Normally I would expect clusters as you describe either for very conserved gene families or for partially overlapping sequences. Both cases can be tackled differently.

Give us more details ;)

0

I have 2 sampling sites, and one of the things I want to do is pool the sequences from the 2 samples, determine the clusters, and see whether there are clusters that appear preferentially at either of the 2 sites.
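If you go the UCLUST route suggested above, a sketch like the following could tally cluster membership per site from the results.uc output. This ASSUMES you prefix each read name with its site before pooling (e.g. "siteA_read123" — a naming convention I'm inventing for illustration), and it relies on the documented .uc column layout: column 0 = record type (S = seed, H = hit), column 1 = cluster number, column 8 = query label.

```python
# Sketch: per-site membership counts for each cluster in a UCLUST/USEARCH
# .uc file, assuming read names carry a site prefix before the first "_".
from collections import defaultdict

def site_counts_per_cluster(uc_lines):
    counts = defaultdict(lambda: defaultdict(int))
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] not in ("S", "H"):   # skip C (cluster summary) records
            continue
        cluster = int(fields[1])          # column 1 = cluster number
        site = fields[8].split("_")[0]    # site prefix of the query label
        counts[cluster][site] += 1
    return counts

# Tiny fake example with one seed and one hit in the same cluster:
demo = [
    "S\t0\t250\t*\t*\t*\t*\t*\tsiteA_r1\t*",
    "H\t0\t248\t97.5\t+\t0\t0\t250M\tsiteB_r2\tsiteA_r1",
]
print(dict(site_counts_per_cluster(demo)[0]))  # → {'siteA': 1, 'siteB': 1}
```

From those per-cluster, per-site counts you could then test for clusters enriched at one site.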

0

I think QIIME will be excellent for this! It does alpha and beta diversity plots and rarefaction curves etc. I highly recommend it! Let me know if you decide to go that way and if you need help setting it up ;-)

1
12.6 years ago
Casbon ★ 3.2k

Accurate determination of microbial diversity from 454 pyrosequencing data

We present an algorithm, PyroNoise, that clusters the flowgrams of 454 pyrosequencing reads using a distance measure that models sequencing noise. This infers the true sequences in a collection of amplicons. We pyrosequenced a known mixture of microbial 16S rDNA sequences extracted from a lake and found that without noise reduction the number of operational taxonomic units is overestimated but using PyroNoise it can be accurately calculated.

1
11.2 years ago

What data do you have? E.g. which gene? You mention CANGSDB, so I'm assuming SSU rRNA.

I briefly looked at CANGS/CANGSDB, but have you considered QIIME? It uses UCLUST (which I think is the best at the moment) or cd-hit/cd-hit-454 for clustering, amongst others, and above all it outputs some excellent, publication-quality visualisations of the data (check out their tutorial)! Given your comments on an earlier post about what you are hoping to test, I think QIIME will be the ideal pipeline for performing the analyses you require on your data. I have also used Metaxa, which uses HMM profiles and HMMER, but its output needs to be parsed and analysed manually. Papers are here:

Incidentally, I did this with ~455,000 sequences.