Question

Is there any tool for finding duplicated or redundant sequences in a database?

0

23 months ago

v.berriosfarias ▴ 140

Hello, I'm building a prokaryotic protein database from several source sequence databases, so it is likely that some sequences appear in my new database more than once. Is there any tool for estimating sequence similarity within a single FASTA file (my database)?

Thanks for your time.

fasta database • 702 views

updated 22 months ago by Hugo ▴ 380 • written 23 months ago by v.berriosfarias ▴ 140

0

See: How can I BLAST each sequence in a FASTA file against all the other sequences in the same file?

23 months ago by Pierre Lindenbaum 161k

0

22 months ago

Hugo ▴ 380

You can use SEDA (https://www.sing-group.org/seda/). Its "Remove Redundant Sequences" operation (https://www.sing-group.org/seda/manual/operations.html#remove-redundant-sequences) lets you do this.
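If you only need to see which entries are exact duplicates before reaching for a dedicated tool, a few lines of Python are enough. This is a minimal sketch and not how SEDA works internally: the function names `read_fasta` and `group_exact_duplicates` are made up for illustration, and it only catches identical sequences (case-insensitive), not near-duplicates or fragments contained in longer sequences.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_exact_duplicates(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each sequence that occurs more than once to its headers."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for header, seq in records:
        groups[seq.upper()].append(header)  # ignore case differences
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


# Hypothetical toy input; for a real file, pass an open file handle instead.
demo = [">a", "MKT", "AAL", ">b", "mktaal", ">c", "MKV"]
duplicates = group_exact_duplicates(read_fasta(demo))
for seq, ids in duplicates.items():
    print(f"{len(ids)} copies: {', '.join(ids)}")
```

For a real database you would call `group_exact_duplicates(read_fasta(open("my_db.fasta")))`, where `my_db.fasta` is a placeholder name. All sequences are held in memory, so for very large files a hash of each sequence could be stored instead.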

score 2 · Accepted Answer · 2022-05-04

2

23 months ago

GenoMax 141k

CD-HIT (LINK) or MMseqs2 cluster (LINK) can both help generate a non-redundant sequence set. In fact, NCBI is now using MMseqs2 to cluster nr for their web version.
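To make this concrete, here is a rough sketch of how either tool could be driven from Python. The wrapper functions, file names, and the 0.9 identity / 0.8 coverage thresholds are assumptions chosen for illustration; the flags used (`-i`, `-o`, `-c` for CD-HIT, and `easy-cluster` with `--min-seq-id` and `-c` for MMseqs2) are the commonly documented ones, but check each tool's help output for your installed version.

```python
import shutil
import subprocess
from typing import List, Optional


def cd_hit_command(fasta: str, out: str, identity: float = 0.9) -> List[str]:
    """Build a CD-HIT call: -c is the sequence identity threshold."""
    # Note: identity thresholds below 0.7 also need a smaller word size (-n).
    return ["cd-hit", "-i", fasta, "-o", out, "-c", str(identity)]


def mmseqs_command(
    fasta: str,
    out_prefix: str,
    tmp_dir: str,
    identity: float = 0.9,
    coverage: float = 0.8,
) -> List[str]:
    """Build an MMseqs2 easy-cluster call (here -c is alignment coverage)."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(identity),
        "-c", str(coverage),
    ]


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run the command if its executable is on PATH, else return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


# Placeholder file names; nothing is executed until run_if_available is called.
cdhit = cd_hit_command("my_db.fasta", "my_db_nr90.fasta")
mmseqs = mmseqs_command("my_db.fasta", "my_db_clu", "tmp")
print(" ".join(cdhit))
print(" ".join(mmseqs))
```

CD-HIT writes the representative sequences to the output file plus a `.clstr` file listing cluster members. MMseqs2 `easy-cluster` writes files starting with the given prefix, including a FASTA of representatives and a TSV of cluster membership.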
