Question: Remove duplicate protein sequences having different fasta identifiers
utkarsh.sood (India) wrote, 15 months ago:

Hello

I have 9058 .faa files. Some contain duplicate protein sequences that have different FASTA identifiers. How can these duplicate sequences be removed?

Thanks!


The answer below is fantastic. I'd point out that, strictly speaking, these aren't necessarily duplicate sequences. They could be, if someone simply made a mistake with the FASTA headers, but chances are they are sequences of the same protein from different organisms, strains, isolates, etc. They may be identical, and you may want only a single representative in those cases, in which case Sej's answer below with clustering will solve it. But for clarity: identical sequences are not necessarily duplicate sequences.
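
To check whether identical sequences really are duplicates before discarding anything, one option is to group the identifiers by sequence and look at the groups. This is only a minimal Python sketch, assuming plain (possibly multi-line) FASTA input; the function names `read_fasta` and `group_identical` and the sample records are made up for illustration and do not come from any library.

```python
from collections import defaultdict
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_identical(records):
    """Map each distinct sequence (case-insensitive) to the headers carrying it."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq.upper()].append(header)
    return groups


# Hypothetical records: two identifiers from different strains share one sequence.
example = io.StringIO(">strainA_p1\nMKT\nAYI\n>strainB_p7\nMKTAYI\n>strainA_p2\nMSLLE\n")
groups = group_identical(read_fasta(example))
for seq, headers in groups.items():
    if len(headers) > 1:
        print(f"{len(headers)} identifiers share one sequence: {', '.join(headers)}")
```

If the grouped identifiers point to different organisms or strains, you can then decide whether a single representative is really what you want.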

Comment written 15 months ago by Dan Gaston
Sej Modha (Glasgow, UK) wrote, 15 months ago:

You can use cd-hit with a sequence identity threshold of 1.0 (-c 1.0) and a redundancy tolerance of 1 (-t 1) to do this.
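
If installing cd-hit is not an option, exact duplicates can also be dropped with a few lines of Python. Note that this is plain exact string matching, a simpler technique swapped in for illustration and not what cd-hit does internally; unlike clustering, it will not merge a shorter fragment into a longer sequence that contains it. The names `iter_fasta` and `write_unique` and the sample records are hypothetical, and the sketch assumes the first record seen for each sequence is an acceptable representative.

```python
import hashlib
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_unique(handles, out):
    """Write the first record of each distinct sequence; return (kept, dropped)."""
    seen = set()  # digests only, so memory stays modest across many files
    kept = dropped = 0
    for handle in handles:
        for header, seq in iter_fasta(handle):
            digest = hashlib.sha256(seq.upper().encode()).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            # Sequence is written on a single line (no line wrapping).
            out.write(f">{header}\n{seq}\n")
            kept += 1
    return kept, dropped


# Hypothetical inputs standing in for two .faa files.
inputs = [
    io.StringIO(">a\nMKT\nAYI\n>b\nMSLLE\n"),
    io.StringIO(">c\nmktayi\n"),
]
out = io.StringIO()
kept, dropped = write_unique(inputs, out)
print(f"kept {kept}, dropped {dropped}")
```

For real data you would pass open file objects for the .faa files (for example from `open(path)`) and an output file in place of the `io.StringIO` objects.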


Yep, cluster the sequences. One representative per cluster will be kept.

Comment written 15 months ago by ALchEmiXt