Hi all,
I am trying to create a BLAST database for protein sequences formatted like this:
>lcl|Bm_nscaf1071_03
MLPIWTSEFLQAVSRMDPKFIALHLQEVGGKAYEKSMQYVKDFVQRLCDC
PELRLFDKIRIYLDEDFSSPEKFTALGNMYFAHSSLVDLKIWDFDLKAYV
DVVGKEIHSGNIENASTKEKAKFPQHFFPEVMSSLYLHLNVLR
>lcl|Bm_nscaf1071_04
MATRLDASWNRWMLTNSTAVPRMVQRVRRNPFRDHLHCNNTAPLGQRSLW
PVTLLAETWDPSFSQGLYPIQLYPYLTPLKTLSSITLTQSHGFKDSIQSN
SNPVSPL
>lcl|Bm_nscaf1071_05
MLQKVTEDLSAQRVQGGAEGGRLQYRRRADQRLVLTVGKKEFAHVDHQKI
FREPWVRDLPTPPSHYRLVSYQGQSGRLVVEGFLARRKRKPSYTASSRRL
QRYDRELEALRPHLFEFPVKFPPTYPFEEDVLLPTHYMKTRCPSWCDRVL
VSQAARPLLHDPPRHDTPRHDTPRHDRRSVTESTDSSSGRASSDSSPARS
When I run this command:
makeblastdb -in bmori.fasta -dbtype prot -parse_seqids -out bmori
I am getting this error:
Error: NCBI C++ Exception:
"/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/objtools/blast/seqdb_writer/build_db.cpp", line 303: Error: ncbi::s_FixBioseqDeltas() - Protein delta sequences are not supported.
I was able to run this on all other genomes without any problem. Can anybody explain why this is happening?
Thanks very much!
Unfortunately, there are thousands of sequences in that file! I don't know which sequences are causing this error. If I knew why this error occurs, I could look for those sequences and eliminate them.
search for a strange AA using
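One way to locate the offending records is to scan every sequence line for characters outside the expected amino-acid alphabet. The following is only a rough Python sketch, not the command originally suggested here: the function name find_odd_residues and the ALLOWED set are assumptions made for illustration, so adjust the alphabet to whatever your data legitimately contains.

```python
# Assumption: the 20 standard residues plus the codes B, J, O, U, X, Z and "*".
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")


def find_odd_residues(fasta_text):
    """Map each record header to the sorted list of unexpected characters it contains."""
    odd = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: remember the identifier, do not check its characters.
            header = line[1:]
        elif header is not None:
            # Sequence line: collect anything outside the allowed alphabet.
            bad = set(line.upper()) - ALLOWED
            if bad:
                odd.setdefault(header, set()).update(bad)
    return {name: sorted(chars) for name, chars in odd.items()}


# Example use on the file from the question:
# with open("bmori.fasta") as handle:
#     for name, chars in find_odd_residues(handle.read()).items():
#         print(name, "".join(chars), sep="\t")
```

Records that come back from this scan are the ones worth inspecting by hand before rerunning makeblastdb.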
Thanks, I discovered that there were "-" in protein sequences. Once I removed them, it was fine!
Can you please tell me exactly how you removed the "-" from the protein sequences? When you remove the "-" from the sequences, how can you be sure you are not changing the actual sequence and thereby altering the output you get after running BLAST?
The easiest way is tr. This command is quick and simple; however, it will also remove "-" from the sequence identifiers.
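To remove "-" from the residues without touching the identifiers, a script can treat header lines separately. The following is a minimal Python sketch, not the method used earlier in this thread; strip_gaps is a hypothetical helper name, and the output file name in the commented usage is likewise made up for illustration.

```python
def strip_gaps(fasta_text, gap="-"):
    """Remove gap characters from sequence lines only; header lines are kept unchanged."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Identifier line: keep it exactly as written, including any "-".
            cleaned.append(line)
        else:
            stripped = line.replace(gap, "")
            # Skip lines that were blank or consisted only of gap characters.
            if stripped.strip():
                cleaned.append(stripped)
    return "".join(entry + "\n" for entry in cleaned)


# Example use (the output file name is hypothetical):
# with open("bmori.fasta") as src, open("bmori.nogap.fasta", "w") as dst:
#     dst.write(strip_gaps(src.read()))
```

Since "-" conventionally marks an alignment gap rather than an amino acid, dropping it should leave the residues themselves unchanged, though it is still worth checking where the gaps in the file came from.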