I have ~1.2 million 454 reads and I want to cluster them by DNA sequence similarity (e.g. reads that share at least 90% identity over at least 70% of their length). I know that, at least for smaller datasets (a few thousand sequences), blastclust works well.
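To make the criterion concrete, here is a minimal sketch of the pairwise test I have in mind. The function name and its parameters are my own illustration, not blastclust's internals, and I am assuming that coverage is measured against the longer of the two reads and that the aligned length is a fair stand-in for the covered region.

```python
def meets_threshold(identical, aligned_len, len_a, len_b,
                    min_identity=0.90, min_coverage=0.70):
    """Return True if a pairwise alignment passes both cutoffs.

    identical   -- number of identical positions in the alignment
    aligned_len -- length of the aligned region
    len_a/len_b -- full lengths of the two reads
    """
    if aligned_len == 0:
        return False
    identity = identical / aligned_len
    # Assumption: the alignment must cover 70% of BOTH reads,
    # which is the same as covering 70% of the longer one.
    coverage = aligned_len / max(len_a, len_b)
    return identity >= min_identity and coverage >= min_coverage
```

Two reads connected by such a test would end up in the same cluster.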
What happens, though, if you have hundreds of thousands or even millions of sequences? What program(s) do you use?
I tried blastclust, but it has been running for more than 4 days without printing any progress messages, so I have no idea how long it will take.
I also tried what the authors of the CANGS pipeline suggest, but mafft-distance creates a distance matrix that is far too large (for ~50,000 sequences it has already reached ~240 GB)! Even if this is normal, I don't have that much free hard drive space to store the file.
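As a rough sanity check on why an all-against-all matrix blows up, the number of distinct pairs grows quadratically with the number of reads. The sketch below only does that arithmetic; the 4 bytes per distance is a hypothetical best case for a compact binary file, not what mafft-distance actually writes.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2

BYTES_PER_VALUE = 4  # hypothetical: one 32-bit float per distance

for n in (50_000, 1_200_000):
    pairs = pair_count(n)
    size_gb = pairs * BYTES_PER_VALUE / 1e9
    print(f"{n:>9,} reads: {pairs:,} pairs, ~{size_gb:,.1f} GB at 4 bytes each")
```

That is about 1.25 billion pairs for 50,000 reads and about 720 billion for the full 1.2 million, so even under this optimistic assumption the complete matrix would run to terabytes.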