Question: (Closed) please help with blast database creation error
0
gravatar for sohra
3.5 years ago by
sohra30
Switzerland
sohra30 wrote:

Hi folks,

I'm trying to make a protein blast database that it give me an error due to duplicate sequences;  I removed duplicate sequences with Dedupe tool (http://sourceforge.net/projects/bbmap/). But I still got the same error : "BLAST Database creation error: Error: Duplicate seq_ids are found:
LCL|51651|M".

I have also checked the fasta protein sequences for the blank lines, but there was no blank line. Could you please help me what is wrong here?

Thanks so much

blast fasta sequence database • 4.4k views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by sohra30

Hello sohra!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

ADD REPLYlink written 3.5 years ago by Michael Dondrup45k
1

oh, no close, please. As I mentioned, I removed duplicate sequences, but the error still exist.

ADD REPLYlink written 3.5 years ago by sohra30
1
grep "^>" deduped.fa | awk '{print $1}' | sort | uniq -c | sort -n -k1,1 | tail 

There should be no anything with a number on the left larger than 1. This is wrong: 

2   LCL|51651|M

And if you got it, then no matter which tool you used you are wrong about removing entries with duplicated names.  

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Darked894.2k

Thanks for your comment. Good to know its output, however, I got the an error:

sort: cannot read: k1,1: No such file or directory
uniq: invalid option -- 'n'
Try 'uniq --help' for more information

I tried with "unique", but the error remained. I would be highly appreciated if you could please review the command.

ADD REPLYlink written 3.5 years ago by sohra30

Sorry, two errors:

uniq -n instead of uniq -c 

sort -n k1,1 instead of sort -n -k1,1

Fixed above and tested. Best.

ADD REPLYlink written 3.5 years ago by Darked894.2k

Many thanks for your help. I used your proposed command on the fasta file before and after duplicated sequence removal, but the output was the same for the both file (original file and dedup file) as here:

      2 >70072|M.
      2 >70257|M.
      2 >70649|M.
      2 >70691|M.
      2 >70757|M.
      2 >70764|M.
      2 >70768|M.
      2 >70830|M.
      2 >70857|M.
      2 >70927|M.

However, dedup file has 139173 sequences less than original file. I'm totally confused what happened here?

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by sohra30
1

Without the original fasta file plus the exact version of the Dedupe program you have used it is impossible to figure it out remotely. But the number of already excluded duplicated sequence identifiers begs the question: why so many? how was this fasta file created in the first place?

Because you know, one guy on the next side of corridor (or other continent) may write some not so clever script outputting things like this:

>chr1|region02|gene protein kinase 2345
>chr1|region02|gene histone acetylase 9876

and some unlucky person removes "duplicates": chr1|region02|gene. O

I would go to the original fasta file and check what is going on there. Quick and dirty check: 

grep -A 3 "^>70927|M" original.fa  

You should get fasta headers plus 3 extra lines containing sequence of presumably two 70927|M entries. If the whole sequences are identical then obviously you do have duplicated entries. If there is some mismatch, filtering on sequence names is a wrong thing to do. In that case you may need to rename the all entries giving them unique name prefix first (sohra_00001_70927|M,  sohra_00002_70927|M, etc), then remove identical sequences if needed. 

ADD REPLYlink written 3.5 years ago by Darked894.2k

Thanks Darked for your great response. I downloaded the sequence file from phytozome database. The problem was exactly what you wisely mentioned it as the output of your commands was:

>70927|M. pusilla CCMP1545|(M=1) PTHR24361:SF61 - NIK-RELATED PROTEIN KINASE
MELPRSTRRSTVVGNYILGDEIGKGAHGQVYRAIDKRDGRVVAVKEIPLR
ATSAYDVDAIESECALLRSLSHRNVTRFLGTVRAPEHLYILLELAENGSL
AGVIKPSRFGPTPEPLAACYVAQLLDGLIYLHANGVTHRDIKGANVLATK
--
>70927|M. sp. RCC299|(M=4) PF07819 - PGAP1-like protein
KPAVVVLPGLGNCTEDYDEFASELELRGFSATVAAVGRPDWLRNAAGLTQ
LAYWQGTLEPRPTVDWYLSRIADAVAAAKEKSGADRVALVAHSAGGWMSR
VYMQDFGVDDIRCVVTLGSPLNAVPKDVPGVVDQTRGILTYVEANCAKPT

It means, there isn't really duplicate sequences in the original file, am I right? But, in your professional view, how I should deal with the error for making blast database?

Thanks in advance

ADD REPLYlink written 3.5 years ago by sohra30

I would simply reformat all fasta headers in the original file. But first check with the grep "^>" deduped.fa etc. using your original fasta before Dedup that what you got from  phytozome DB looks sane. Not very likely, I hope, but again, this time at the higher level you may get:

>gene1 Somevery_long speciesname and_a_long_strain_info foobar kinase 123 (name truncated)

>gene1 Somevery_long speciesname and_a_long_strain_info foobar kinase 123 (name truncated)

where in reality these entries may had "isoform A" and "isoform B" extra strings removed at some step. 

Add some simple counter as prefix, then replace all spaces in names with underscores. I am paranoid about non alpha-numeric characters, so I would probably replace all "() | / \ {}:;" as well. To be on the safe side, also check for the lengths of headers and  lengths of sequences themselves. 

Last but not least: command line hacks work for simple things, but it is better if you move to real scripting language (i.e. Python) plus use existing solutions (https://github.com/brentp/pyfasta/ for a start). 

ADD REPLYlink written 3.5 years ago by Darked894.2k
1

Thanks for your effort to solve the problem. I just recently come into this filed and working on a server where cannot simply install the tool as our admin takes several days off from work. So, it will be great if it's possible for you to put your helpful commands to do the suggested task for resolving the problem.

I don't know how to thank you

ADD REPLYlink written 3.5 years ago by sohra30

If makeblastdb says there are duplicate id entries, then there are. Why don't you simply look at the file with less, then see if the entry is there. If it is the only duplicate entry you can simply remove one entry in an editor. Did you remove identical sequences or did you remove sequences with identical ids?

ADD REPLYlink written 3.5 years ago by Michael Dondrup45k
Please log in to add an answer.
The thread is closed. No new answers may be added.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1929 users visited in the last hour