Error: Duplicate seq_ids are found
2.3 years ago
rupjandu • 0

Hello everyone, I am using the web-based Galaxy tool, not the command-line version. I merged several FASTA files into one and am trying to build a BLAST database from these local sequences with makeblastdb. I get an error message that reads, "Error: Duplicate seq_ids are found: GNL|BL_ORD_ID|9650923".

Can anyone help me find a way to remove the duplicate seq IDs, preferably using the web-based Galaxy tool?

Thank you!

GALAXY BLAST • 1.1k views

If you need Galaxy-specific assistance, please post this on their help forum: https://help.galaxyproject.org/

  1. Check whether the Galaxy ToolShed has any FASTA ID/header repair tools, and whether they are already installed on your Galaxy instance.
  2. If they are not installed, install one if you are an admin or have access to an admin account. Otherwise, ask the admin to install such a tool.
  3. If steps 1 and 2 are not possible, download the FASTA file.
  4. Run this command on the downloaded file (input.fa): sed -nr '/^>/p' input.fa | sort -V | uniq -D | uniq -c. It should print each duplicated (identical) header along with its count.
  5. Download the seqkit tool and run seqkit rename -n input.fa -o output.fa. This generates a new file, output.fa, in which identical FASTA headers get a serial number appended. With -n, seqkit compares full header lines; omit -n to compare only the ID (the text up to the first whitespace).
  6. Run the same check on the new file (output.fa): sed -nr '/^>/p' output.fa | sort -V | uniq -D | uniq -c. This should not print any lines.
  7. Upload the new FASTA file to Galaxy and run makeblastdb on it.
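
If installing seqkit is not an option, the check-and-rename steps above can be approximated with a short Python script. This is only a sketch, not what seqkit does internally: it assumes the sequence ID is the first whitespace-delimited token after ">", and the "_2", "_3" suffix scheme and the function names (fasta_id, duplicate_ids, rename_duplicates, rewrite_file) are illustrative choices, not part of any tool.

```python
from collections import Counter


def fasta_id(header):
    """Return the ID of a FASTA header line: the first token after '>'."""
    parts = header[1:].split(None, 1)
    return parts[0] if parts else ""


def duplicate_ids(lines):
    """Map each ID that occurs more than once to its number of occurrences."""
    counts = Counter(fasta_id(line) for line in lines if line.startswith(">"))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def rename_duplicates(lines):
    """Yield the FASTA lines, adding a numeric suffix to repeated IDs.

    Assumption: the first record keeps its ID; later records with the same
    ID become ID_2, ID_3, ... Suffixed IDs are also checked against IDs
    already emitted, so the output IDs are always unique.
    """
    used = set()
    for line in lines:
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        parts = line.rstrip("\r\n")[1:].split(None, 1)
        seq_id = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        candidate, n = seq_id, 1
        while candidate in used:
            n += 1
            candidate = f"{seq_id}_{n}"
        used.add(candidate)
        suffix = f" {description}" if description else ""
        yield f">{candidate}{suffix}\n"


def rewrite_file(src_path, dst_path):
    """Read src_path and write a copy with unique IDs to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(rename_duplicates(src))
```

Calling rewrite_file("input.fa", "output.fa") would produce the file to upload in step 7; running duplicate_ids on the lines of output.fa should then return an empty dict, mirroring the check in step 6.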