Removing duplicate protein sequences from fasta file
5.7 years ago

Hi, I want to run psiblast from the command line. Before running psiblast, I tried to build a BLAST database with the makeblastdb command, but it fails with the following error:

BLAST Database creation error: Error: Duplicate seq_ids are found:
REF|WP_003261842.1

My FASTA file is very large, so deleting the duplicate sequences manually is nearly impossible. How can I get rid of the duplicates?

Cheers

blast database makeblastdb • 2.8k views
5.7 years ago

Use seqkit. Its rmdup subcommand removes records with duplicate IDs by default:

$ seqkit rmdup input.fa > output.fa
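If seqkit is not installed, the same idea can be sketched with plain awk. This is only a minimal illustration, not what seqkit does internally: the function name dedup_fasta_by_id is made up for this example, the ID is assumed to be the first whitespace-delimited token of each header line, and the comparison is case-sensitive. Only the first record for each ID is kept.

```shell
# Hypothetical helper: keep only the first record for each sequence ID.
# Assumption: the ID is the first whitespace-delimited token of the header line.
dedup_fasta_by_id() {
    awk '
        /^>/ { keep = !seen[$1]++ }   # header line: keep the record only if this ID is new
        keep                          # print header and sequence lines of kept records
    ' "$@"
}

# Usage (file names are placeholders):
# dedup_fasta_by_id input.fa > output.fa
```

Because it compares IDs only, two records with the same ID but different sequences are still collapsed to the first one, so it is worth checking what was dropped before building the database.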

Thanks a lot! It works.


Then please accept his answer. There's a little tick box.


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

