Condense BLAST FASTA file
8.2 years ago
igor 13k

I am trying to reduce the size of a FASTA file that I got from the BLAST database archive. Some of the FASTA files they post already have identical sequences removed, but that still leaves a lot of very similar sequences. For example, I am working with "nt" and there are a lot of sequences in there that are very minor variations of each other or are overlapping. Is there a good way to combine those and eliminate "duplicate" entries?

blast fasta • 1.7k views
8.2 years ago

Try CD-HIT-EST.

CD-HIT is used to cluster protein sequences to make the UniRef databases. I presume you could do something similar for nucleotide sequences with CD-HIT-EST. Maybe someone already has.


Yes, there is CD-HIT, and also USEARCH and vsearch, but none of those can handle files this large.


Maybe you could use some kind of iterative strategy to make it more manageable? For example, break the database up into 100 parts, remove the redundancies within each part, then combine the reduced parts and remove the redundancies in the combination.

It really is a huge database, though, so maybe there isn't a computationally feasible way to do this. Or maybe you just need to use a cluster computer. I'm not sure.
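
To make the split-reduce-combine idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names are hypothetical, and `drop_contained` is only a toy stand-in that removes sequences that are exact substrings of a longer kept sequence. In practice that step would be replaced by a call to a real clustering tool such as CD-HIT-EST or vsearch on each part.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_contained(records: List[Record]) -> List[Record]:
    """Toy redundancy filter (NOT real clustering): process sequences from
    longest to shortest and keep one only if it is not an exact substring
    of a sequence that has already been kept."""
    kept: List[Record] = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


def split_into_parts(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Distribute records round-robin into n_parts lists."""
    return [records[i::n_parts] for i in range(n_parts)]


def iterative_reduce(
    records: List[Record],
    n_parts: int,
    reduce_fn: Callable[[List[Record]], List[Record]] = drop_contained,
) -> List[Record]:
    """Reduce each part separately, then reduce the combined survivors."""
    parts = split_into_parts(records, n_parts)
    survivors = [rec for part in parts for rec in reduce_fn(part)]
    return reduce_fn(survivors)


if __name__ == "__main__":
    fasta = [">a", "ACGTACGT", ">b", "ACGT", ">c", "ACGTACGT", ">d", "TTTT"]
    for header, seq in iterative_reduce(list(parse_fasta(fasta)), n_parts=2):
        print(f">{header}\n{seq}")
```

Two caveats: the final pass still has to handle all survivors at once, so this only helps if the first pass shrinks the data substantially, and with a greedy clustering tool the two-stage result may differ from what a single run over the whole file would give.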
