Question: Strange Behaviour Of Bioperl'S Bio::Searchio When Parsing Xml Blast Output
gravatar for rwn
7.2 years ago by
United Kingdom
rwn530 wrote:

Hello, I've noticed some strange behaviour when parsing BLAST .xml output files (-oufmt 5) using BioPerl's Bio::SearchIO library.

I have a simple parser script that looks something like:

#!/usr/bin/perl -w

use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO -> new (-format => 'blastxml', -file => "consensusSeqs.BLASTp.xml");
open (OUT, ">consensusSeqs.parse.OUT");

my $query_count = 1;

while( my $result = $in->next_result ) {
    print "Query count: ".$query_count."\n";
    print "Query name: ".$result->query_description."\n";
    print "Number of hits: ".$result->num_hits."\n";
    my $hit_count = 1;

    if (!defined $result->next_hit) {
        print OUT $result->query_description."\tNo hit\n";

    while ( my $hit = $result->next_hit ) {
        print "\tHit count: ".$hit_count."\n";
        while ( my $hsp = $hit->next_hsp ) {
            my @a = split /\|/, $hit->name;
            my $hit_accession = $a[3];
            # print "Hit name: ".$hit->name."\n";
            print "\tHit accession: ".$hit_accession."\n";

             ## get some stats of hit
            my $percent_id = sprintf("%.2f", $hsp->percent_identity);
            my $percent_q_coverage = sprintf("%.2f", ((($hsp->length('query')/($result->query_length)))*100)); 
            my @b = split /[\[\]]/, $hit->description;
            my $organism = $b[1];

            (my $short_desc = $hit->description) =~ s/\[.*//;

            print "\t\tE-value: ".$hsp->evalue."\n";
            print "\t\t% identity: ".$percent_id."\n";
            print "\t\t% coverage: ".$percent_q_coverage."\n";
            print "\t\tDescription: ".$hit->description."\n";
            if (!defined $short_desc) {
                $short_desc = "No hits found";
            print "\t\tShort description: ".$short_desc."\n";
            ## if there is nothing under [organism] 
            if (!defined $organism) {
                $organism = "not defined";
            print "\t\tOrganism: ".$organism."\n";
            print "\t~~~\n";

            ## print the hit to OUT
            print OUT $result->query_description."\t".$percent_id."\t".$percent_q_coverage."\t".$organism."\t".$short_desc."\n"; 
    print "\n";
    print OUT "~~~\n";

print "Finished\n\n";

With something like the above, I hope to capture some information for all query sequences, even if there is no hit. But I noticed that some queries were not in my output file. Looking more closely at the XML file itself, I think the issue only occurs when there is exactly one hit for a given query. In these cases, it seems to skip the while ( my $hit = $result->next_hit ) and while ( my $hsp = $hit->next_hsp ) loops and iterates directly to the next $result.

Has anyone else noticed this behaviour? Is it a bug, or is it something in the way I've set up my script? Any suggestions for a workaround would be much appreciated!


perl xml bioperl blast • 2.8k views
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by rwn530
gravatar for rwn
7.2 years ago by
United Kingdom
rwn530 wrote:

Turns out it was just me being an idiot. The piece of code:

if (!defined $result->next_hit) {
    print OUT $result->query_description."\tNo hit\n";

iterates through the list of possible $hit objects, such that by the time the while (my $hit = $result->next_hit) loop is reached, it is already on $hit number 2. To get around it, can use the result->rewind function to start from the beginning again, like this:

if (!defined $result->next_hit) {
  print OUT $result->query_description."\tNo hit\n";
# etc etc

You live and learn!

ADD COMMENTlink written 7.2 years ago by rwn530
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2497 users visited in the last hour