Question

Makeblastdb Error: Protein Delta Sequences Are Not Supported

2

10.8 years ago

arnstrm ★ 1.8k

Hi all,

I am trying to create a BLAST database from protein sequences formatted like this:

>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

When I run this command:

makeblastdb -in bmori.fasta -dbtype prot -parse_seqids -out bmori

I am getting this error:

Error: NCBI C++ Exception:
"/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 303: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported.

I was able to run this on all other genomes without any problem. Can anybody explain why this is happening?

Thanks very much!

makeblastdb • 8.5k views

ADD COMMENT • link updated 9.9 years ago by jcsoellner • 0 • written 10.8 years ago by arnstrm ★ 1.8k

Ram · Answer 1 · 2013-06-28

2

10.8 years ago

Pierre Lindenbaum 161k

No problem with makeblastdb 2.2.28+:

$ cat in.fa                                             
>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC 
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV 
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW 
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI 
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL 
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

$ makeblastdb -in in.fa -dbtype prot -parse_seqids -out bmori                         

Building a new DB, current time: 06/28/2013 18:42:58
New DB name:   bmori
New DB title:  jeter.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.00245404 seconds.
[

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.8 years ago by Pierre Lindenbaum 161k

0

Unfortunately, there are thousands of sequences in that file, and I don't know which ones are causing this error. If I knew why the error occurs, I could look for those sequences and eliminate them.

ADD REPLY • link 10.8 years ago by arnstrm ★ 1.8k

1

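Search for unexpected amino acid characters by counting every character in the sequence lines (-h keeps grep from prefixing file names if the glob matches more than one file):

grep -hv ">" *.fasta | sed "s/\(.\)/\1\n/g" | sort | uniq -c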
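
The character tally shows which symbols occur, but not which records contain them. A minimal sketch for locating those records follows; the function name `flag_odd_records` is made up for illustration, and it assumes that anything other than a letter on a sequence line is worth inspecting (widen the character class if your data legitimately uses symbols such as `*`).

```shell
# Hypothetical helper: for every sequence line that contains a non-letter
# character, print the record's header, a tab, and the offending characters.
flag_odd_records() {
    LC_ALL=C awk '
        /^>/ { header = $0; next }        # remember the current record header
        /[^A-Za-z]/ {
            bad = $0
            gsub(/[A-Za-z]/, "", bad)     # keep only the unexpected characters
            print header "\t" bad
        }
    ' "$1"
}
```

Calling `flag_odd_records bmori.fasta` prints one line per offending sequence line, so a record with several such lines is listed several times.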

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.8 years ago by Pierre Lindenbaum 161k

0

Thanks, I discovered that there were "-" in protein sequences. Once I removed them, it was fine!

ADD REPLY • link 10.8 years ago by arnstrm ★ 1.8k

0

Can you please tell me exactly how you removed the "-" from the protein sequences? After removing it, how can you be sure you are not changing the actual sequence and thereby altering the output you get from BLAST?

ADD REPLY • link 10.2 years ago by deepak.datta007 • 0

0

Easiest way is tr:

tr -d '-' < FILE > FILE.2

This command is quick and simple; however, it will also remove - from the sequence identifiers.
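
If the identifiers need to stay intact, one option is to restrict the deletion to sequence lines. This is a minimal sketch, assuming header lines are exactly those that start with `>`; the function name `strip_gaps` is made up for illustration. Whether deleting the hyphens is biologically appropriate still depends on what they mean in your file.

```shell
# Hypothetical helper: delete '-' from sequence lines only.
# Lines starting with '>' (FASTA headers) pass through unchanged.
strip_gaps() {
    sed '/^>/!s/-//g' "$1"
}
```

For example, `strip_gaps FILE > FILE.2` writes a cleaned copy and leaves the original untouched.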

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.2 years ago by PoGibas 5.1k

Ram · Answer 2 · 2014-05-26

1

9.9 years ago

Shaun Jackman ▴ 420

Look for sequences that start with a hyphen - character:

grep '^-'

Remove the leading hyphen:

sed 's/^-//'

These sequences starting with a - character seem to signify a start codon that is created by RNA editing. Can anyone verify?

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Shaun Jackman ▴ 420

0

I just encountered this same issue, and at least in my case, these sequences don't represent instances of RNA editing – instead, they're translated pseudogenes: stretches of nucleic acid whose putative translations are found by gene annotation methods that identify homology to known protein sequences, but for which the regions corresponding to the would-be ORF lack the appropriate start and / or stop codons, or have accumulated premature stop codons or frame-shift mutations in the coding sequence.

In my case, especially with sequences containing premature stop codons (indicated by *), simply removing - and * produces nonsensical sequences. I've decided to remove any pseudogene translations.
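
Dropping whole records rather than editing them can be sketched as below. The function name `drop_flagged_records` is made up for illustration, and the sketch assumes that any `-` or `*` on a sequence line marks a record to discard. That includes a single terminal `*`, so adjust the pattern if your file uses one as an ordinary stop marker.

```shell
# Hypothetical helper: print only the FASTA records whose sequence lines
# contain neither '-' nor '*'; flagged records are dropped whole.
drop_flagged_records() {
    awk '
        function emit() { if (header != "" && keep) printf "%s", header "\n" body }
        /^>/ { emit(); header = $0; body = ""; keep = 1; next }   # start a new record
        { if ($0 ~ /[-*]/) keep = 0; body = body $0 "\n" }       # flag and buffer sequence lines
        END { emit() }                                            # flush the last record
    ' "$1"
}
```

Running `drop_flagged_records in.fa > clean.fa` (file names are placeholders) leaves the input untouched and writes only the unflagged records.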

ADD REPLY • link 5.9 years ago by ucpete ▴ 150

Ram · Answer 3 · 2014-06-19

I have just come across the same issue, and thanks to this post I looked into unexpected characters.

I think my problems originate in this sequence:

>gi|651852200|ref|YP_008869149.1| DNA-binding protein [Streptococcus pneumoniae R6]
-**VWFFFSSVAHSFERIVDGSWMTAKFRSYFSQISVWIISKIVFKSISINFSWFCSFYLVV*LSCFLF*L*PAINGIT*
DLENV*CFCYTTCSLTIRENFFTKIY*ICHKKIIPH

This makes me suspect it is best to exclude these entries. Possibly not those with only a leading -, in case Shaun's suggestion regarding RNA editing is correct, but for examples like mine, cosmetic corrections will do more harm than good.