Question: Remove duplicate sequences with same id from a fasta file
CB wrote, 2.4 years ago:

Dear all, there are many posts about removing duplicate sequences from a FASTA file (https://www.biostars.org/p/3003/), but I want to remove only the duplicate sequences that share the same ID.

My FASTA file also contains many duplicate sequences with different IDs, and I want to keep those.

How can I remove only the duplicates that have the same ID? I have protein sequences, and each sequence is split across multiple lines.


BBMap's Dedupe utility has a "requirematchingnames" flag. This will make it only remove duplicates that have identical sequence and identical names. For example:

dedupe.sh in=file.fasta out=deduped.fasta ac=f requirematchingnames

One copy of each duplicate set will remain, unless you add the "uniqueonly" flag.
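
For readers who want to see the rule itself spelled out, here is a minimal Python sketch of "drop a record only when both the ID and the sequence have been seen before". It is not how Dedupe is implemented; the function names `read_fasta` and `dedupe_same_id_and_sequence` are made up for illustration, the ID is assumed to be the header text up to the first whitespace, and sequences are compared exactly (case-sensitive). Wrapped sequence lines are joined only for the comparison and written back unchanged.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header_without_'>', sequence_lines) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, chunks
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence line belonging to the current record (blank lines skipped).
            chunks.append(line)
    if header is not None:
        yield header, chunks


def dedupe_same_id_and_sequence(lines: Iterable[str]) -> Iterator[str]:
    """Keep the first copy of each (ID, sequence) pair; drop later copies."""
    seen = set()
    for header, chunks in read_fasta(lines):
        fields = header.split()
        seq_id = fields[0] if fields else ""  # assumption: ID = first word of header
        key = (seq_id, "".join(chunks))
        if key in seen:
            continue
        seen.add(key)
        yield ">" + header
        yield from chunks
```

The generator accepts any iterable of lines, so an open file handle can be passed in directly and the yielded lines written out with a trailing newline each. Records with the same sequence but different IDs, or the same ID but different sequences, are all kept.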

written 2.4 years ago by Brian Bushnell
shenwei356 wrote, 2.4 years ago:
http://bioinf.shenwei.me/seqkit/usage/#rmdup


It worked very well. It is very easy to use. Thanks!

written 2.4 years ago by CB
Alex Reynolds wrote, 2.4 years ago:
Make your FASTA files single-line (one line for header, one line for sequence), as described in the post "Multiline Fasta To Single Line Fasta".

Then:

$ awk '{ if (($0 ~ /^>/) && (!seen[$0]++)) { print $0; printSeq=1; } else if (($0 ~ /^[^>]/) && printSeq) { print $0; printSeq=0; }  }' in.fa > out.fa
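
If converting to single-line FASTA first is inconvenient, the same idea can be sketched in Python so that it works on wrapped sequences directly. This is only an illustration, not part of the answer above: `dedupe_by_header` is a hypothetical name, and, like the awk one-liner, it treats the whole header line as the key (so two records with the same ID but different descriptions count as different) and keeps the first record for each header.

```python
from typing import Iterable, Iterator


def dedupe_by_header(lines: Iterable[str]) -> Iterator[str]:
    """Pass through the first record for each header line; skip later repeats."""
    seen = set()
    keep = False  # whether the record currently being read should be emitted
    for line in lines:
        if line.startswith(">"):
            header = line.rstrip("\r\n")
            keep = header not in seen
            seen.add(header)
        if keep:
            # Emits the header and every following sequence line, however many.
            yield line
```

Because lines are passed through untouched, an open input file can be fed in and the result handed to `writelines` on the output file. Note that this compares headers only; it never looks at the sequence, so a repeated header with a different sequence is also dropped.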