Question: Remove duplicate sequences with same id from a fasta file
CB (US) wrote, 13 months ago:

Dear all, there are many posts about removing duplicate sequences from a FASTA file (https://www.biostars.org/p/3003/), but I want to remove only the duplicate sequences that share the same id.

My FASTA file contains many duplicate sequences that have different ids, and I want to keep those.

How can I remove only the duplicates that have the same id? These are protein sequences, and each sequence is split across multiple lines.

duplicate sequence remove fasta • 1.2k views

BBMap's Dedupe utility has a "requirematchingnames" flag. This will make it only remove duplicates that have identical sequence and identical names. For example:

dedupe.sh in=file.fasta out=deduped.fasta ac=f requirematchingnames

One copy of each duplicate set will remain, unless you add the "uniqueonly" flag.

Written 13 months ago by Brian Bushnell.
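
To make the rule described above concrete, here is a minimal Python sketch of it: a record is dropped only when both its name and its sequence have already been seen. This is not how Dedupe is implemented. The function names (`read_fasta`, `dedupe_same_name_and_sequence`) are made up for illustration, and the sketch assumes that the whole header line counts as the name and that sequences are compared exactly (case-sensitive).

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_same_name_and_sequence(
    records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Keep the first record for each (header, sequence) pair."""
    seen: Set[Tuple[str, str]] = set()
    for header, seq in records:
        key = (header, seq)  # assumption: both parts must match exactly
        if key not in seen:
            seen.add(key)
            yield header, seq


# Toy input: p1 appears twice with the same sequence (once wrapped),
# p2 has the same sequence under a different name.
example = """>p1
MKV
LLA
>p2
MKVLLA
>p1
MKVLLA
""".splitlines()

for header, seq in dedupe_same_name_and_sequence(read_fasta(example)):
    print(f">{header}\n{seq}")
```

In the toy input, the second `p1` record is removed, while `p2` is kept even though its sequence is identical, which is the behaviour the question asks for.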
shenwei356 (China) wrote, 13 months ago:

http://bioinf.shenwei.me/seqkit/usage/#rmdup


It worked very well. It is very easy to use. Thanks!

Written 13 months ago by CB.
Alex Reynolds (Seattle, WA USA) wrote, 13 months ago:

First make your FASTA file single-line (one line for the header, one line for the sequence); see the post "Multiline Fasta To Single Line Fasta".

Then:

$ awk '{ if (($0 ~ /^>/) && (!seen[$0]++)) { print $0; printSeq=1; } else if (($0 ~ /^[^>]/) && printSeq) { print $0; printSeq=0; }  }' in.fa > out.fa
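
If linearizing the file first is inconvenient, the same header-only filter can be sketched in Python so that it works directly on multi-line records. The function name `drop_repeated_headers` is hypothetical. Like the awk command, this sketch compares whole header lines and never looks at the sequence, so a record that reuses an id with a different sequence is dropped as well.

```python
from typing import Iterable, Iterator, Set


def drop_repeated_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines of records whose header has not been seen before."""
    seen: Set[str] = set()
    keep = False  # lines before the first header are discarded
    for line in lines:
        if line.startswith(">"):
            header = line.rstrip("\r\n")
            keep = header not in seen  # first occurrence of this header?
            seen.add(header)
        if keep:
            yield line


# Toy input: the second ">a" record is dropped although its sequence differs.
fasta = [">a\n", "MKV\n", "LLA\n", ">b\n", "MKV\n", ">a\n", "GGG\n"]
print("".join(drop_repeated_headers(fasta)), end="")
```

To run it on a real file, pass an open file handle instead of the list, for example `drop_repeated_headers(open("in.fa"))`, and write the yielded lines to the output file.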