Question: Missing HSPs from Blast xml output
0
gravatar for alissamariewilliams2015
2.1 years ago by

I am blasting a local fasta file of protein sequences against a local protein database. The Blast works fine and I am able to generate an XML file as my output. I would like to extract the query and top hit sequences from this XML file. However, my HSPs (high-scoring pair objects) in the XML file have missing information.

Here is the code that generates the XML file:

blast_handle = NcbiblastpCommandline(query='/PATH/to/proteins_of_interest.fasta', db='/PATH/to/local/db/protein.fa', outfmt=5, out='/PATH/to/results.xml')

stdout, stderr = blast_handle()

result_handle = open('/PATH/to/results.xml')
blast_records = NCBIXML.parse(result_handle)

I successfully get an XML file, and the following loop gives me the name of the top hit of each Blast (there were four protein sequences in my query fasta file):

for blast_record in blast_records:
    print(blast_record.alignments[0].hit_id)

However, my XML file is missing HSP data. This is a portion of the XML file:

<Hit>
  <Hit_num>9</Hit_num>
  <Hit_id>lcl|name_of_my_hit</Hit_id>
  <Hit_def>Unknown</Hit_def>
  <Hit_accession>Unknown</Hit_accession>
  <Hit_len>0</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>105.145</Hsp_bit-score>
      <Hsp_score>261</Hsp_score>
      <Hsp_evalue>9.96617e-26</Hsp_evalue>
      <Hsp_query-from>0</Hsp_query-from>
      <Hsp_query-to>0</Hsp_query-to>
      <Hsp_hit-from>0</Hsp_hit-from>
      <Hsp_hit-to>0</Hsp_hit-to>
      <Hsp_identity>60</Hsp_identity>
      <Hsp_qseq></Hsp_qseq>
      <Hsp_hseq></Hsp_hseq>
    </Hsp>
  </Hit_hsps>
</Hit>

As you can see, there is nothing between the Hsp_qseq and Hsp_hseq start and end tags. Does anyone know why the amino acid sequences aren't here? Thanks!

biopython blastp hsp blast • 940 views
ADD COMMENTlink written 2.1 years ago by alissamariewilliams20150

Update: I ran the same code, except outputted to a text file instead of an XML file. What I get for each alignment looks like this:

Sequence with id evm_27.model.AmTr_v1.0_scaffold00105.17 no longer exists in database...alignment skipped

So it appears that blast is looking for the alignment in the NCBI database in order to make the alignments, which I don't understand because I gave it a local database to search. And it clearly finds the best matches in the local database, hence the odd name of the match...so why isn't it just aligning the query to the hit it gets in my database?

ADD REPLYlink written 2.1 years ago by alissamariewilliams20150

Probably the error is not because BLAST is looking for the matched sequences in the NCBI database. BLAST is used by many to search local databases, there should not be such bug that breaks the software this way.

Sequence with id ... no longer exists error is reported in showalign.cpp in BLAST source code. Instead of reporting the actual error/exception they report the error we see. Most likely the error we see is the most common reason of failure in the two try-catch blocks that report this error. If you get the error consistently it should be possible to modify the source code to report the actual exception but it should be better to contact NCBI help desk first.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Mahmut Uludag10

Thank you! I will contact the help desk.

ADD REPLYlink written 2.1 years ago by alissamariewilliams20150

Great - please update this question when you hear back from the NCBI - thanks.

This is definitely a BLAST+ problem (and an unusual one), not something in Biopython.

ADD REPLYlink written 2.1 years ago by Peter5.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 707 users visited in the last hour