Question: Remove duplicate sequences with same id from a fasta file
CB wrote:

Dear all, there are many posts about removing duplicate sequences from a FASTA file (e.g. https://www.biostars.org/p/3003/), but I want to remove only the duplicate sequences that have the same IDs.

My FASTA file also contains many duplicate sequences with different IDs, and I want to keep those.

How can I remove only the duplicates that share the same ID? These are protein sequences, and each sequence is split across multiple lines.

duplicate sequence remove fasta • 598 views
modified 8 months ago by Alex Reynolds • written 8 months ago by CB

BBMap's Dedupe utility has a "requirematchingnames" flag. This will make it only remove duplicates that have identical sequence and identical names. For example:

dedupe.sh in=file.fasta out=deduped.fasta ac=f requirematchingnames

One copy of each duplicate set will remain, unless you add the "uniqueonly" flag.

modified 8 months ago • written 8 months ago by Brian Bushnell
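To make the criterion above concrete (drop a record only when both its name and its sequence match an earlier record), here is a minimal awk sketch. It is not how Dedupe works internally and is no substitute for it; it does plain, case-sensitive string comparison only. The file names `named.fa` and `named_dedup.fa` and the records are made up for illustration.

```shell
# Toy input: three records named "a" (two with the same sequence, one split
# across lines) and one record "b" that shares a sequence with "a".
cat > named.fa <<'EOF'
>a
MKT
AYI
>a
MKTAYI
>a
MSLL
>b
MKTAYI
EOF

awk '
  # Print the buffered record unless this exact header+sequence pair was seen before.
  function emit() {
    if (hdr != "" && !seen[hdr SUBSEP seq]++) { print hdr; print seq }
  }
  /^>/ { emit(); hdr = $0; seq = ""; next }   # new header: flush the previous record
       { seq = seq $0 }                       # sequence line: append to the current record
  END  { emit() }                             # flush the last record
' named.fa > named_dedup.fa

cat named_dedup.fa
```

In this toy input the second `a` record is dropped, while the `a` record with a different sequence and the `b` record with the same sequence are both kept. Note that the sketch writes each kept sequence on a single line.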
shenwei356 wrote:

http://bioinf.shenwei.me/seqkit/usage/#rmdup

written 8 months ago by shenwei356

It worked very well. It is very easy to use. Thanks!

written 8 months ago by CB
Alex Reynolds wrote:

Make your FASTA file single-line (one line for the header, one line for the sequence); see the post "Multiline Fasta To Single Line Fasta".
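The linked post is not reproduced here. As a minimal sketch of one way to do this step (not necessarily the method from that post), the awk command below joins the sequence lines of each record into one line. The file names `multi.fa` and `single.fa` and the records are made up for illustration.

```shell
# Toy multi-line protein FASTA.
cat > multi.fa <<'EOF'
>seq1 first protein
MKTAYIAK
QRQISFVK
>seq2 second protein
MSLLTEVE
EOF

awk '
  # Header line: print the sequence collected so far, then the header itself.
  /^>/ { if (seq != "") print seq; print; seq = ""; next }
  # Any other line: append it to the current sequence.
       { seq = seq $0 }
  # End of file: print the last sequence.
  END  { if (seq != "") print seq }
' multi.fa > single.fa

cat single.fa
```

Each record in `single.fa` then occupies exactly two lines, which is the layout the next command expects.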

Then:

$ awk '{ if (($0 ~ /^>/) && (!seen[$0]++)) { print $0; printSeq=1; } else if (($0 ~ /^[^>]/) && printSeq) { print $0; printSeq=0; }  }' in.fa > out.fa
modified 8 months ago • written 8 months ago by Alex Reynolds
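Since the sequences in the question are split across several lines, a variant of the same idea that skips the single-line conversion may be convenient. This is a sketch, not part of the answer above: it decides at each header line whether the record is new and then prints or skips every line up to the next header. The file names `in_multi.fa` and `out_multi.fa` and the records are made up for illustration.

```shell
# Toy input: "p1" appears twice; "p2" has the same sequence under a different ID.
cat > in_multi.fa <<'EOF'
>p1
MKTAYIAK
QRQISFVK
>p2
MKTAYIAK
QRQISFVK
>p1
MKTAYIAK
QRQISFVK
EOF

# On each header line, keep becomes 1 the first time that header is seen and 0 afterwards.
# The bare pattern "keep" prints the current line (header or sequence) only while keep is 1.
awk '/^>/ { keep = !seen[$0]++ } keep' in_multi.fa > out_multi.fa

cat out_multi.fa
```

Like the command above, this compares whole header lines, so the second `p1` record is dropped and `p2` is kept even though its sequence is identical. If headers carry descriptions after the ID and only the ID should be compared, keying on `seen[$1]` instead of `seen[$0]` should work, assuming the ID is the first whitespace-delimited field.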