Question: Remove duplicate sequences with same id from a fasta file
CB wrote (17 months ago):

Dear all, there are many posts about removing duplicate sequences from a FASTA file (https://www.biostars.org/p/3003/), but I want to remove only the duplicate sequences that share the same ID.

My FASTA file also contains many duplicate sequences with different IDs, and I want to keep those.

How can I remove only the duplicates with the same ID? These are protein sequences, and each sequence is split across multiple lines.


BBMap's Dedupe utility has a "requirematchingnames" flag. This will make it only remove duplicates that have identical sequence and identical names. For example:

dedupe.sh in=file.fasta out=deduped.fasta ac=f requirematchingnames

One copy of each duplicate set will remain, unless you add the "uniqueonly" flag.

written 17 months ago by Brian Bushnell
shenwei356 wrote (17 months ago):

SeqKit's rmdup command, which by default removes records with duplicate IDs: http://bioinf.shenwei.me/seqkit/usage/#rmdup


It worked very well. It is very easy to use. Thanks!

written 17 months ago by CB
Alex Reynolds wrote (17 months ago):

First make your FASTA file single-line (one line for the header, one line for the sequence); see the post "Multiline Fasta To Single Line Fasta".
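
One possible way to do the linearisation, as a minimal sketch rather than the method from the linked post: the awk filter below joins wrapped sequence lines so that each record becomes one header line plus one sequence line. The function name `linearize_fasta` and the file names in the usage comment are placeholders.

```shell
# Hypothetical helper: join wrapped sequence lines so that each record
# is one header line followed by one sequence line.
linearize_fasta() {
    awk '
        /^>/ {                          # header line
            if (seq != "") print seq    # flush the previous sequence
            print                       # print the header unchanged
            seq = ""
            next
        }
        { seq = seq $0 }                # append this chunk of sequence
        END { if (seq != "") print seq }  # flush the last sequence
    ' "$@"
}

# Usage (file names are placeholders):
# linearize_fasta in.multi.fa > in.fa
```

The function reads from the named files, or from standard input when no file is given.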

Then:

$ awk '{ if (($0 ~ /^>/) && (!seen[$0]++)) { print $0; printSeq=1; } else if (($0 ~ /^[^>]/) && printSeq) { print $0; printSeq=0; }  }' in.fa > out.fa
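
If linearising first is inconvenient, the same idea can probably be applied to wrapped FASTA directly. The following is an alternative sketch, not the command above: it keeps the first record for each distinct header line and drops later records with the same header, however many lines the sequence spans. Like the command above, it compares whole header lines only and never looks at the sequence, so records with identical sequences but different headers are all kept. The name `dedupe_by_header` and the file names are placeholders.

```shell
# Hypothetical helper: keep only the first record for each header line.
# Works on single-line or wrapped FASTA.
dedupe_by_header() {
    awk '
        /^>/ { keep = !seen[$0]++ }   # true only the first time a header appears
        keep                          # print lines belonging to kept records
    ' "$@"
}

# Usage (file names are placeholders):
# dedupe_by_header in.fa > out.fa
```

Here `keep` is set on each header line and stays in force for the sequence lines that follow it, which is why no single-line conversion is needed.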