Question

Remove duplicate sequences with same id from a fasta file

1

7.3 years ago

CB ▴ 10

Dear all, there are many posts about removing duplicate sequences from a FASTA file (https://www.biostars.org/p/3003/), but I want to remove only the duplicate sequences that also share the same ID.

My FASTA file contains many duplicate sequences with different IDs, and I want to keep those.

How can I remove only the duplicates that have the same ID? These are protein sequences, and each sequence is wrapped across multiple lines.

sequence fasta remove duplicate • 8.3k views

0

BBMap's Dedupe utility has a "requirematchingnames" flag. With it, Dedupe removes only duplicates that have both an identical sequence and an identical name. For example:

dedupe.sh in=file.fasta out=deduped.fasta ac=f requirematchingnames

One copy of each duplicate set will remain unless you add the "uniqueonly" flag, which discards every copy of a duplicated sequence.
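To illustrate the "identical sequence and identical name" rule without BBMap installed, here is a minimal awk sketch. It is not Dedupe's implementation: it compares headers and sequences as exact, case-sensitive strings, and the function name dedup_name_and_seq is made up for this example.

```shell
# Hypothetical helper, not part of BBMap: drop a record only when BOTH its
# full header line and its full (unwrapped) sequence were already seen.
dedup_name_and_seq() {
    awk '
        function emit() {
            # Print the buffered record if this (header, sequence) pair is new.
            if (hdr != "" && !seen[hdr, seq]++) printf "%s", rec
        }
        /^>/ { emit(); hdr = $0; seq = ""; rec = $0 "\n"; next }
        hdr != "" {
            seq = seq $0          # sequence joined across wrapped lines
            rec = rec $0 "\n"     # original line wrapping kept for output
        }
        END { emit() }
    ' "$@"
}
```

Usage would be dedup_name_and_seq file.fasta > deduped.fasta. Records with the same name but different sequences, or the same sequence under different names, are all kept.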

ADD REPLY • link 7.3 years ago by Brian Bushnell 20k

2

7.3 years ago

Alex Reynolds 35k

First convert your FASTA file to single-line format (one line for the header, one line for the sequence); see the Biostars post "Multiline Fasta To Single Line Fasta".
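In case the linked post is not to hand, the conversion can be sketched in awk as below. This is one possible approach rather than the method from that post, and linearize_fasta is a made-up name.

```shell
# Hypothetical helper: join wrapped sequence lines so that every record is
# exactly two lines (header, then sequence). Lines before the first header
# are discarded.
linearize_fasta() {
    awk '
        /^>/ {
            if (n++) print seq    # flush the previous record sequence
            print                 # header line, unchanged
            seq = ""
            next
        }
        { seq = seq $0 }          # append sequence line without a newline
        END { if (n) print seq }  # flush the last record
    ' "$@"
}
```

For example, linearize_fasta wrapped.fa > in.fa produces a file with one header line and one sequence line per record.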

Then:

$ awk '{ if (($0 ~ /^>/) && (!seen[$0]++)) { print $0; printSeq=1; } else if (($0 ~ /^[^>]/) && printSeq) { print $0; printSeq=0; }  }' in.fa > out.fa
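If you would rather skip the single-line conversion, a variant of the same idea can run directly on wrapped FASTA. The sketch below is an alternative to the command above, not a replacement endorsed by its author. Like that command, it treats the whole header line as the ID and keeps the first record for each header. The name dedup_by_header is made up.

```shell
# Hypothetical helper: keep only the first record for each distinct header
# line. Works on wrapped (multi-line) FASTA because the keep flag stays set
# until the next header is read.
dedup_by_header() {
    awk '
        /^>/ { keep = !seen[$0]++ }   # true only the first time a header appears
        keep                          # print the header and its sequence lines
    ' "$@"
}
```

For example, dedup_by_header in.fa > out.fa. Identical sequences under different headers are all kept, because only the header is compared.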

score 3 · Accepted Answer · 2017-01-09

3

7.3 years ago

shenwei356 8.4k

SeqKit's rmdup command does this; by default it identifies duplicates by sequence ID: http://bioinf.shenwei.me/seqkit/usage/#rmdup

0

It worked very well. It is very easy to use. Thanks!

ADD REPLY • link 7.3 years ago by CB ▴ 10