Errors In A Bioperl Script, Please Help Me To Have A Check, Thanks In Advance!
2
0
Entering edit mode
11.0 years ago
xiongtl2013 ▴ 40

Dear friends:

I have a blast report, and want to extract following information for each result(query): the Query_name, hit_number, name and description of the hit(HSP) with the highest identity.

my bioperl script is:

#!/usr/bin/perl -w
use strict;
use warnings;
use Bio::SearchIO;

if (@ARGV != 2){die "Usage: $0 <BLAST-report-file> <output-file>\n"};

my ($infile,$outfile) = @ARGV;
print "Parsing the BLAST result ...";
my $blast = new Bio::SearchIO(
-format => 'blast',
-file   => $infile);
open (OUT,">$outfile") or die "Cannot open $outfile: $!";

print OUT "query\tNumber of hits\tGreatest identity %\tAccession (identity %)\tDescription (identity %)\n";

while (my $result = $blast->next_result){
   print OUT $result->query_name . "\t";
   print OUT $result->num_hits. "\t";
        # don't need the '&':   if (my $hit = sort_hit($result))   
        if (my $hit = &sort_hit($result)){

                         if (my $hsp = $hit->hsp){
                                        print OUT $hsp->percent_identity. "\t"; 
                                        print OUT $hit->accession. "\t";
                                        print OUT $hit->description. "\n";
                            }                   
           }
    }
 close OUT;
 print " DONE!!!\n";
 # the problem starts here:
 # no matter what, this function will always return the first hit
 sub sort_hit{
                  my $result = shift;
                  my @hits;

                  # The following while loop is very clumsy, ResultI has a hits method:
                  # better:  my @hits = $result->hits();
                  while (my $hit = $result->next_hit) {
                        push @hits, $hit;
                        }
                  # here you sort the @hits, assign the result and do not use the result ever again.
                  # sorting the array has no effect on the original hits, that is also why I recommended to
                  # use $result->sort_hits instead
                  # minor: you don't need to multiply percent_identity by 100
                  my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;

                  # the following code is possibly meant to return the first element of the array, but
                  # is totally without use. You do not need to iterate again.
                  # Correct would be: 
                  return shift @hit_sort; # you do know shift, unshift, push and pop for array operations?
                  # rest of code doesn't do anything sensible:
                  $result->rewind;
                  my $flag=0;
                  my $temp_hit;
                  # At the end of this while loop, $temp_hit will always be the first hit!

                  while (my $hit=$result->next_hit){
                        if($flag==0){ # this is false already in the second iteration
                           $temp_hit=$hit;
                     $flag++;
                           }
                  return $temp_hit
                        }
         }

error information:

-------------EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Root/Root.pm:368
STACK: Bio::Search::Hit::GenericHit::hsp /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Search/Hit/GenericHit.pm:688
STACK: main::sort_hit xtl_new.pl:37
STACK: xtl_new.pl:20
-----------------------------------------------------------
Parsing the BLAST result ...

please give me some clues, all your words are welcome!

NOTE: My bioperl version is 1.6.1, and I use blastx 2.2.27+. My blast report(-outfmt 0) have many queries and each query have many hits(against nr database), so each hit maybe also have several HSPs.

Some more information: I tried the $hit->next_hsp before, but running the script still informed me the same error. However, today I have just reruned the script, both $hit->next_hsp and $hit->hsp are ok! They both outputed the first hit for each result. I don't know why. So I tried different sort rules as follows:

# my @hit_sort = sort { $b-> hsp ->hsp_length <=> $a-> hsp ->hsp_length } @hits;
# my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;
# my @hit_sort = sort { $b-> significance <=> $a-> significance } @hits;
# my @hit_sort = sort { $a-> accession cmp $b-> accession } @hits;

but different sort functions still generate the same output (still the first hit)! It seems that the sub-function does not work at all! oh, it really troubles me...

Here is my data structure:

Query= scaffold28:2776872-2777385(-)

Length=513
                                                                     Score     E
Sequences producing significant alignments:                          (Bits)  Value

gb|AEA49877.1|  RNA-dependent RNA polymerase [Drosophila ananassa...  90.5    3e-30
...
...
...
gb|AAA48442.1|  L protein [Vesicular stomatitis virus]                 115    2e-26


>gb|AEA49877.1| RNA-dependent RNA polymerase [Drosophila ananassae sigmavirus]
Length=615

Score = 90.5 bits (223),  Expect(2) = 3e-30, Method: Composition-based stats.
Identities = 40/57 (70%), Positives = 43/57 (75%), Gaps = 0/57 (0%)
Frame = +2

Query  173  ISIANHIDYEKWNAHQRGTGNDPCFKVMGQFLGLPNSILRTNEFFSKITFYYNQRPD  343
            I+IANHIDYEKWN HQRG  N P FKVMGQFLG PN I RT+EFF K   YY+ R D
Sbjct  554  ITIANHIDYEKWNNHQRGDANGPVFKVMGQFLGYPNLIARTHEFFEKSLIYYSDRAD  610


Score = 67.0 bits (162),  Expect(2) = 3e-30, Method: Compositional matrix adjust.
Identities = 39/54 (72%), Positives = 46/54 (85%), Gaps = 0/54 (0%)
Frame = +1

Query  1    FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTD  162
            FSLMSW LR+YFV TEYLIK HFVPLF+G+TMADDLTT+ +K+L +SSGQ   D
Sbjct  497  FSLMSWELRDYFVFTEYLIKTHFVPLFEGLTMADDLTTVTQKMLSSSSGQGNDD  550

>....
>....
>....

>gb|AAA48442.1| L protein [Vesicular stomatitis virus]
Length=2109

Score =  115 bits (289),  Expect = 2e-26, Method: Compositional matrix adjust.
Identities = 81/172 (47%), Positives = 104/172 (60%), Gaps = 3/172 (2%)
Frame = +1

Query  1    FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTDyrsyfy  180
            FSLMSWRLREYFVITEYLIK ++VPLFKG+TMADDLT++IKK++D+SSGQ L DY S   
Sbjct  541  FSLMSWRLREYFVITEYLIKTYYVPLFKGLTMADDLTSVIKKMMDSSSGQGLDDYSSVCL  600

Query  181  s*syrL*KMECSPTRNRE*SMFQGHGSIFRSAQLNTQN**IF**NYFLL*STTRPD-MIM  357
            +      K      +     +F+  G       L  +    F     L+    RPD M +
Sbjct  601  ANHIDYEKWNNHQRKESNGPIFRVMGQFLGYPSLIERTHEFF--EKSLIYYNGRPDLMTI  658

Query  358  NNGVIENLTDQRVCWNGQLGGQEGIGQRGCCVVNALAIERECQDQNIKVSLL  513
             NG + N T  RVCWNGQ GG EG+ Q+G  +VN L I+RE + +N  V +L
Sbjct  659  RNGTLCNSTKHRVCWNGQKGGLEGLRQKGWSIVNLLVIQREAKIRNTAVKVL  710



Lambda      K        H
   0.318    0.134    0.401 

Gapped
Lambda      K        H
0.267   0.0410    0.140 

Effective search space used: 304649109255
bioperl script blast • 5.6k views
ADD COMMENT
5
Entering edit mode

cross posted on seqanswers: http://seqanswers.com/forums/showthread.php?t=30888

Also, with similar problems it is absolutely necessary to post the following information:

  • version of Perl
  • version of BioPerl
  • a reproducible example input format
  • information about the software yielding the input (program, version, parameters)

The requirement to give sufficient version information and reproducible examples applies to all software related questions. I think, such question should be closed as "not enough detail/ no longer relevant (because they might refer to an outdated version of the software)" unless the information is provided in the future.

ADD REPLY
2
Entering edit mode

I feel very sorry for my mistakes! you've helped me in a good habit of posting a clear question, I really appreciate you and your tips!

The extra important information has been added, as shown above.

ADD REPLY
1
Entering edit mode

I have experienced some problems with older versions of BioPerl and blast+, do you have the possibility to test with 1.6.901? Otherwise I need to see your blast output. A single result (from Query= tag to next Query=)

ADD REPLY
0
Entering edit mode

So, according to your edit, the error message just "magically" vanished?

ADD REPLY
0
Entering edit mode

Indeed, I have compared the upload version above and my current local version (I just repeated to rerun it, no error informed), there are no difference in both! Maybe a temporary bug?

Now, the sort function still does not work (different rules have the same output, although values in report are obviously different [even I have tried to inverse $a and $b]). I am trying to work it out via modifying bit by bit. Do you have any ideas?

My data structure is showed above. Thank you very much!

ADD REPLY
0
Entering edit mode

I added some comments, please have a look at your sourcecode.

ADD REPLY
2
Entering edit mode
11.0 years ago
Michael 54k

Works for me with BioPerl 1.6.901 and a Blast file in default format, what is your version? It might be a problem with your blast output or you are using an old version, can you please post a reproducible example (a blast file with only one result entry, please)? Possibly your blast format doesn't contain HSPs, which version of Blast or other software are you using to generate your blast file? Some programs might make output that does not contain HSP information.

As a side note: you can have this a little bit easier because Bio::Search::Result::ResultI implements a function sort_hits already, from the documentation:

 Title        : sort_hits
 Usage        : $result->sort_hits(\&sort_function)
 Function    : Sorts the available hit objects by a user-supplied function. Defaults to sort
                  by descending score.
 Returns    : n/a
 Args        : A coderef for the sort function.  See the documentation on the Perl sort() 
                  function for guidelines on writing sort functions.  
 Note        : To access the special variables $a and $b used by the Perl sort() function 
                  the user function must access Bio::Search::Result::ResultI namespace. 
                  For example, use : 
                  $result->sort_hits( sub{$Bio::Search::Result::ResultI::a->length <=> 
                          $Bio::Search::Result::ResultI::b->length});
                   NOT $result->sort_hits($a->length <=>$b->length);
ADD COMMENT
1
Entering edit mode

It's almost certainly an issue with the BLAST output and as you say, most likely that the output contains no HSPs. We need more information to be sure.

ADD REPLY
0
Entering edit mode

Thank you for your suggestion, it is really nice of you!

The sort_hits function of Bio::Search::Result::ResultI perhaps (I am not sure) is only constrained in values generated by hit methods (such as: E-value, bit, raw score, and length or other string value of hit, not hsp).

Here, I want to sort hits by the identities of their hsps (any hsp is ok. Equivalently to say, comparing all hsp identities in a specific result and extracting the only hit [but maybe several hsps] with the highest hsp identity).

In addition, I have checked the sort_hsps in Bio::Search::Result::HitI, it can only sort the hsps in a specific hit.

Thank you very much! Please give me more comments...

ADD REPLY
0
Entering edit mode
10.9 years ago
Michael 54k

So now let's have a look at why your sort won't work. Honestly, you should improve your perl skills to be able to write efficient code. I put some comments into your code, to show you the worst mistakes.

ADD COMMENT

Login before adding your answer.

Traffic: 2363 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6