Question: Makeblastdb Error: Protein Delta Sequences Are Not Supported
7.2 years ago by
arnstrm1.8k
Ames, IA
arnstrm1.8k wrote:

Hi all,

I am trying to create a BLAST database for protein sequences formatted like this:

>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

When I run this command:

makeblastdb -in bmori.fasta -dbtype prot -parse_seqids -out bmori

I am getting this error:

Error: NCBI C++ Exception:
"/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 303: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported.

I was able to run this on all other genomes without any problem. Can anybody explain why this is happening?

Thanks very much!

modified 6.3 years ago by jcsoellner0 • written 7.2 years ago by arnstrm1.8k
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

No problem with makeblastdb 2.2.28+:

$ cat in.fa
>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS

$ makeblastdb -in in.fa -dbtype prot -parse_seqids -out bmori

Building a new DB, current time: 06/28/2013 18:42:58
New DB name:   bmori
New DB title:  jeter.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.00245404 seconds.
modified 8 months ago by RamRS30k • written 7.2 years ago by Pierre Lindenbaum130k

Unfortunately, there are thousands of sequences in that file! I don't know which sequences are causing this error. If I knew why this error occurs, I could look for those sequences and eliminate them.

written 7.2 years ago by arnstrm1.8k

Search for a strange amino acid character using:

grep -v ">" *.fasta | sed "s/\(.\)/\1\n/g" | sort | uniq -c
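
The census above shows which characters occur, but not which records contain them. A minimal sketch of one way to get that, assuming awk is available and assuming (this is not confirmed by the thread) that any character other than a letter is suspect; the function name report_odd_residues is hypothetical:

```shell
# Print the header of every FASTA record whose sequence contains a
# non-letter character, followed by a tab and the offending characters.
# Assumption: every letter is acceptable; adjust the pattern if not.
report_odd_residues() {
    awk '
        function emit() {
            if (hdr != "" && bad != "") print hdr "\t" bad
        }
        /^>/ { emit(); hdr = $0; bad = ""; next }   # new record starts
        {
            line = $0
            gsub(/[A-Za-z]/, "", line)   # drop ordinary residue letters
            gsub(/[ \t\r]/, "", line)    # ignore whitespace and CR
            bad = bad line               # whatever is left is suspect
        }
        END { emit() }
    ' "$@"
}
```

For example, report_odd_residues bmori.fasta would list only the records worth inspecting by hand.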
modified 8 months ago by RamRS30k • written 7.2 years ago by Pierre Lindenbaum130k

Thanks, I discovered that there were "-" in protein sequences. Once I removed them, it was fine!

written 7.2 years ago by arnstrm1.8k

Can you please tell me exactly how you removed the "-" from the protein sequences? When you remove the "-" from the sequences, how can you be sure you are not changing the actual sequence and thereby altering the output you get from BLAST?

written 6.6 years ago by deepak.datta0070

Easiest way is tr:

tr -d '-' < FILE > FILE.2

This command is quick and simple; however, it will also remove - from the sequence identifiers (the header lines starting with >).
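
To avoid touching the identifiers, the hyphens can be stripped from sequence lines only. A minimal sketch, assuming awk is available; the function name strip_gaps and the output file name in the usage line are hypothetical. Whether deleting the hyphens is biologically appropriate is a separate question, as raised above.

```shell
# Remove "-" from sequence lines only; header lines (starting with ">")
# are printed unchanged, so identifiers keep their hyphens.
strip_gaps() {
    awk '
        /^>/ { print; next }          # header: pass through untouched
        { gsub(/-/, ""); print }      # sequence line: delete every hyphen
    ' "$@"
}
```

Usage would look like: strip_gaps bmori.fasta > bmori.nogaps.fasta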

modified 8 months ago by RamRS30k • written 6.6 years ago by PoGibas4.8k
6.3 years ago by
Shaun Jackman420
Vancouver, Canada
Shaun Jackman420 wrote:

Look for sequences that start with a hyphen - character:

grep '^-'

Remove the leading hyphen:

sed 's/^-//'

These sequences starting with a - character seem to signify a start codon that is created by RNA editing. Can anyone verify?

modified 8 months ago by RamRS30k • written 6.3 years ago by Shaun Jackman420

I just encountered this same issue, and at least in my case these sequences don't represent instances of RNA editing. Instead, they're translated pseudogenes: stretches of nucleic acid whose putative translations are found by gene annotation methods that identify homology to known protein sequences, but for which the regions corresponding to the would-be ORF lack the appropriate start and/or stop codons, or have accumulated premature stop codons or frameshift mutations in the coding sequence.

In my case, especially with sequences containing premature stop codons (indicated by *), simply removing - and * produces nonsensical sequences. I've decided to remove any pseudogene translations.
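
Dropping whole records, rather than editing characters out of them, could be sketched as follows. This assumes awk is available and treats any - or * anywhere in a sequence as the marker of a pseudogene translation, which is a simplification: a single trailing * is sometimes just a stop codon. The function name drop_flagged_records is hypothetical.

```shell
# Print only the FASTA records whose sequence lines contain neither
# "-" nor "*". Lines before the first header are discarded.
drop_flagged_records() {
    awk '
        function emit() {
            # body already ends with a newline for every sequence line
            if (hdr != "" && !flagged) printf "%s\n%s", hdr, body
        }
        /^>/ { emit(); hdr = $0; body = ""; flagged = 0; next }
        {
            if ($0 ~ /[-*]/) flagged = 1   # assumption: either char flags the record
            body = body $0 "\n"
        }
        END { emit() }
    ' "$@"
}
```

The filtered output can then be passed to makeblastdb in place of the original file.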

written 2.3 years ago by ucpete50
6.3 years ago by
jcsoellner0
Austria
jcsoellner0 wrote:

I have just come across the same issue, and thanks to this post I looked into unexpected characters.

I think my problems originate in this sequence:

>gi|651852200|ref|YP_008869149.1| DNA-binding protein [Streptococcus pneumoniae R6]
-**VWFFFSSVAHSFERIVDGSWMTAKFRSYFSQISVWIISKIVFKSISINFSWFCSFYLVV*LSCFLF*L*PAINGIT*
DLENV*CFCYTTCSLTIRENFFTKIY*ICHKKIIPH

Which makes me suspect it is best to exclude these entries. Possibly not those with only a leading -, in case Shaun's suggestion regarding RNA editing is correct, but for examples like mine, cosmetic corrections will do more harm than good.

modified 8 months ago by RamRS30k • written 6.3 years ago by jcsoellner0