Question

Remove duplicate protein sequences having different fasta identifiers

0

7.8 years ago

utkarsh.sood ▴ 40

Hello

I have 9058 .faa files. Some of them contain duplicate protein sequences that carry different FASTA identifiers. How can these duplicate sequences be removed?

Thanks!

sequence alignment • 4.0k views

updated 7.8 years ago by Sej Modha 5.3k • written 7.8 years ago by utkarsh.sood ▴ 40

1

The answer below is fantastic. I'd point out that, strictly speaking, these aren't necessarily duplicate sequences. They could be if someone simply made a mistake with the FASTA headers, but chances are they are sequences of the same protein from different organisms, strains, isolates, etc. They may be identical, and you may want only a single representative in those cases, in which case Sej's clustering answer below will solve it. But for clarity: identical sequences are not necessarily duplicate sequences.

7.8 years ago by DG 7.3k

score 4 · Answer 1 · 2016-06-16

4

7.8 years ago

Sej Modha 5.3k

You can use cd-hit with a sequence identity threshold of 1.0 (-c 1.0) and a redundancy tolerance of 1 (-t 1) to do this.
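
If only exact, full-length duplicates need to be collapsed, a small script may be enough instead of clustering. The following is a minimal sketch, not part of the original answer: the function names read_fasta and dedupe_exact are hypothetical, headers are assumed to be unique, and sequences are compared case-insensitively with any trailing "*" stop symbol ignored. It keeps the first record seen for each distinct sequence and remembers which identifiers were dropped, so the "identical but not duplicate" cases mentioned in the comment above can still be inspected.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def dedupe_exact(
    records: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], Dict[str, List[str]]]:
    """Keep the first record per distinct sequence.

    Returns the kept records and a mapping from each kept header to the
    headers of the identical records that were dropped.
    Assumption: headers are unique across the input.
    """
    kept_header_for: Dict[str, str] = {}
    merged: Dict[str, List[str]] = {}
    unique: List[Tuple[str, str]] = []
    for header, seq in records:
        # Assumption: case and a trailing '*' stop symbol do not matter.
        key = seq.upper().rstrip("*")
        if key in kept_header_for:
            merged[kept_header_for[key]].append(header)
        else:
            kept_header_for[key] = header
            merged[header] = []
            unique.append((header, seq))
    return unique, merged


if __name__ == "__main__":
    demo = [">protA strain1", "MKT", "AYI", ">protB strain2", "MKTAYI", ">protC", "MKTAYV"]
    unique, merged = dedupe_exact(read_fasta(demo))
    for header, seq in unique:
        print(f">{header}\n{seq}")
    print(merged)
```

To deduplicate across many files rather than within one, the records from all files could be fed through a single dedupe_exact call, for example by chaining the read_fasta generators with itertools.chain.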