Makeblastdb Error: Protein Delta Sequences Are Not Supported
11.4 years ago
arnstrm ★ 1.9k

Hi all,

I am trying to create a BLAST database from protein sequences formatted like this:

>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

When I run this command:

makeblastdb -in bmori.fasta -dbtype prot -parse_seqids -out bmori

I am getting this error:

Error: NCBI C++ Exception:
"/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 303: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported.

I was able to run this on all other genomes without any problem. Can anybody explain why this is happening?

Thanks very much!

11.4 years ago

No problem with makeblastdb 2.2.28+:

$ cat in.fa
>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

$ makeblastdb -in in.fa -dbtype prot -parse_seqids -out bmori

Building a new DB, current time: 06/28/2013 18:42:58
New DB name:   bmori
New DB title:  jeter.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.00245404 seconds.

Unfortunately, there are thousands of sequences in that file, and I don't know which ones are causing this error. If I knew why the error occurs, I could look for those sequences and eliminate them.


Search for unusual amino-acid characters by counting every character in the sequence lines:

grep -v ">" *.fasta | sed "s/\(.\)/\1\n/g" | sort | uniq -c
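
A slightly more portable sketch of the same idea, in case your sed does not treat \n in the replacement as a newline (that behaviour is a GNU extension). It uses fold to split each line into single characters and grep -h so that file names are not prefixed when several files match. The file name example.fasta and its contents are made up for illustration.

```shell
# Hypothetical toy input; replace with your own FASTA file.
printf '>seq1\nMLP-IW\n>seq2\nMA*TR\n' > example.fasta

# Drop header lines, put one character per line, then count each character.
# The most frequent characters are listed first.
grep -hv '^>' example.fasta | fold -w 1 | sort | uniq -c | sort -rn > counts.txt
cat counts.txt
```

Anything in the tally that is not an amino-acid letter (here - and *) is a candidate for the problem.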

Thanks, I discovered that there were "-" characters in the protein sequences. Once I removed them, it was fine!


Can you please tell me exactly how you removed the "-" from the protein sequences? When you remove the "-" characters, how can you be sure you are not changing the actual sequence and thereby altering the output you get from BLAST?


The easiest way is tr:

tr -d '-' < FILE > FILE.2

This command is quick and simple; however, it will also remove every - from the sequence identifiers, not just from the sequences.
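
If the identifiers need to stay intact, one possible alternative is to restrict the deletion to lines that do not start with ">". This is only a sketch: the file names and the toy record are invented for the example.

```shell
# Hypothetical toy input with a hyphen in the identifier and in the sequence.
printf '>lcl|Bm-scaf_01\n-MLPIW-TSE\n' > example.fasta

# On every line that is NOT a header (does not start with ">"),
# delete all hyphens; header lines pass through unchanged.
sed '/^>/!s/-//g' example.fasta > example.nohyphen.fasta
cat example.nohyphen.fasta
```

Note that this only changes which lines are edited. Whether deleting the hyphens is appropriate for your data is a separate question.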

10.4 years ago
Shaun Jackman ▴ 420

Look for sequences that start with a hyphen (-) character:

grep '^-' FILE

Remove the leading hyphen:

sed 's/^-//' FILE > FILE.2

These sequences starting with a - character seem to signify a start codon that is created by RNA editing. Can anyone verify?
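
To see which records are affected before changing anything, a small awk sketch like the following could list the headers of sequences whose first line begins with a hyphen. The file names and records are made up for illustration, and it assumes each header is followed directly by its sequence lines.

```shell
# Hypothetical toy input: records b and c start with a hyphen, a does not.
printf '>a\nMLP\n>b\n-MAT\nRLD\n>c\n-**VW\n' > example.fasta

# Remember the latest header; when the first sequence line of that
# record starts with "-", print the header.
awk '/^>/ { hdr = $0; first = 1; next }
     first { if ($0 ~ /^-/) print hdr; first = 0 }' example.fasta > leading_hyphen_ids.txt
cat leading_hyphen_ids.txt
```

Unlike a plain grep '^-', this ignores continuation lines that happen to begin with a hyphen and reports the identifier rather than the sequence.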


I just encountered this same issue, and at least in my case, these sequences don't represent instances of RNA editing – instead, they're translated pseudogenes: stretches of nucleic acid whose putative translations are found by gene annotation methods that identify homology to known protein sequences, but for which the regions corresponding to the would-be ORF lack the appropriate start and / or stop codons, or have accumulated premature stop codons or frame-shift mutations in the coding sequence.

In my case, especially with sequences containing premature stop codons (indicated by *), simply removing - and * produces nonsensical sequences. I've decided to remove any pseudogene translations.
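
One way to drop such records entirely, rather than editing them, is sketched below with awk. It buffers each record and writes it out only if none of its sequence lines contains - or *. The file names and toy records are assumptions for the example. Be aware that it also discards records that merely end in a terminal *, so adjust the pattern if your file marks normal stop codons that way.

```shell
# Hypothetical toy input: the middle record looks like a pseudogene translation.
printf '>keep1\nMLPIW\nTSEFL\n>pseudo\n-**VWFF\nDLENV*C\n>keep2\nMATRL\n' > example.fasta

awk '
  # Print the buffered record unless it was flagged as bad.
  function emit() { if (hdr != "" && !bad) printf "%s\n%s", hdr, seq }
  # New header: output the previous record, then start a fresh buffer.
  /^>/ { emit(); hdr = $0; seq = ""; bad = 0; next }
  # Sequence line: flag the record if it contains "-" or "*".
  { if ($0 ~ /[-*]/) bad = 1; seq = seq $0 "\n" }
  END { emit() }
' example.fasta > example.filtered.fasta
cat example.filtered.fasta
```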

10.4 years ago
jcsoellner • 0

I have just come across the same issue, and thanks to this post I looked into unexpected characters.

I think my problems originate in this sequence:

>gi|651852200|ref|YP_008869149.1| DNA-binding protein [Streptococcus pneumoniae R6]
-**VWFFFSSVAHSFERIVDGSWMTAKFRSYFSQISVWIISKIVFKSISINFSWFCSFYLVV*LSCFLF*L*PAINGIT*
DLENV*CFCYTTCSLTIRENFFTKIY*ICHKKIIPH

This makes me suspect it is best to exclude these entries. Possibly not those with only a leading -, in case Shaun's suggestion regarding RNA editing is correct, but for examples like mine, cosmetic corrections will do more harm than good.
