Question: Condense BLAST FASTA file
igor (United States) wrote 3.6 years ago:

I am trying to reduce the size of a FASTA file that I got from the BLAST database archive. Some of the FASTA files they post already have identical sequences removed, but that still leaves a lot of very similar sequences. For example, I am working with "nt" and there are a lot of sequences in there that are very minor variations of each other or are overlapping. Is there a good way to combine those and eliminate "duplicate" entries?

Sean R Johnson (United States) wrote 3.6 years ago:

Try CD-HIT-EST.

They use CD-HIT to cluster protein sequences to make the UniRef databases. I presume you could do something similar for nucleotide sequences. Maybe someone already has.

Yes, there are CD-HIT, USEARCH, and VSEARCH, but none of them can handle files this large.

written 3.6 years ago by igor

Maybe you could use some kind of iterative strategy to make it more manageable? For example, break the database into 100 parts, remove the redundancies within each part, then combine the reduced parts and remove the redundancies in the combined set.
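A rough sketch of that split, reduce, and merge idea in Python, in case it helps. Everything here is illustrative: `toy_reduce` is a made-up stand-in that only drops sequences that are exact substrings of a longer kept sequence, and the function names and batch size are assumptions. In practice you would swap `toy_reduce` for a call to a real clustering tool such as CD-HIT-EST on each part.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def toy_reduce(records):
    """Hypothetical stand-in for a real clusterer: keep the longest sequences
    and drop any sequence that is an exact substring of one already kept."""
    kept = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


def batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def iterative_reduce(records, batch_size, reduce_fn=toy_reduce):
    """Reduce each part on its own, then reduce the combined survivors."""
    survivors = []
    for batch in batches(records, batch_size):
        survivors.extend(reduce_fn(batch))
    return reduce_fn(survivors)


# Tiny made-up example: five records, processed two at a time.
example = """>a
ACGTACGTAA
>b
ACGTACGT
>c
TTTTGGGG
>d
CGTACG
>e
TTTTGGGGCC
""".splitlines()

result = iterative_reduce(read_fasta(example), batch_size=2)
for header, seq in result:
    print(f">{header}\n{seq}")
```

Note that redundancy between sequences that start out in different parts is only caught in the final pass, so the combined survivors still have to be small enough for the tool to handle.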

It really is a huge database, though. Maybe there isn't a computationally feasible way to do this, or maybe you just need to use a cluster computer. I'm not sure.

modified 3.6 years ago • written 3.6 years ago by Sean R Johnson