Question: Extracting FASTA Alignments From a Parsed BLAST XML File
Eric wrote, 7.3 years ago:

Hello,

I have cobbled together a small script that parses a BLAST XML file. Judging from what it prints to the screen, it parses the XML file just fine. The problem is that the hsp.fas alignment file is incomplete: it contains only one of the alignments in the BLAST output.

I would like to have all the alignments that I see in the BLAST output, including the query sequence in each individual alignment (for example, if I specify -m 2, I get a complete file from blastall).

Any suggestions? -Thanks!

# Shell step, run before the script (not part of the Perl code):
#   module load perl

use strict;
use warnings;
use Bio::SearchIO;
use Bio::AlignIO;

# Give the name of the BLAST XML file to parse on the '-file =>' line.
# Use -m 7 to generate XML output from blastall.
my $in = Bio::SearchIO->new(-format => 'blastxml',
                            -file   => 'BLASToutxml');
while ( my $result = $in->next_result ) {
  ## $result is a Bio::Search::Result::ResultI compliant object
  while ( my $hit = $result->next_hit ) {
    ## $hit is a Bio::Search::Hit::HitI compliant object
    while ( my $hsp = $hit->next_hsp ) {
      ## $hsp is a Bio::Search::HSP::HSPI compliant object
      # ENTER desired sequence length
      if ( $hsp->length('total') > 50 ) {
        # ENTER desired percent identity
        if ( $hsp->percent_identity >= 75 ) {
          print "Query=",   $result->query_name,
            " Hit=",        $hit->name,
            " Length=",     $hsp->length('total'),
            " Percent_id=", $hsp->percent_identity, "\n";
          # Print alignment to file
          # $aln will be a Bio::SimpleAlign object
          my $aln = $hsp->get_aln;

          # Changed msf to fasta and hsp.msf to hsp.fas; output is now a FASTA file.
          # Note: '>' opens hsp.fas for writing (truncating it), and this line runs once per HSP.
          my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");
          $alnIO->write_aln($aln);
        }
      }
    }
  }
}

Comment by Scott Cain (7.3 years ago): BioPerl mailing list: http://lists.open-bio.org/mailman/listinfo/bioperl-l
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 7.3 years ago:

I would use an XSLT stylesheet. For example, with the following BLAST XML result:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.26+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), &quot;A greedy algorithm for aligning DNA sequences&quot;, J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>26343</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>671</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>26343</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|118082669|ref|XM_416233.2|</Hit_id>
  <Hit_def>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA</Hit_def>
  <Hit_accession>XM_416233</Hit_accession>
  <Hit_len>2868</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>556.962</Hsp_bit-score>
      <Hsp_score>301</Hsp_score>
      <Hsp_evalue>3.58957e-158</Hsp_evalue>
      <Hsp_query-from>92</Hsp_query-from>
      <Hsp_query-to>395</Hsp_query-to>
      <Hsp_hit-from>2378</Hsp_hit-from>
      <Hsp_hit-to>2681</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>303</Hsp_identity>
      <Hsp_positive>303</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>304</Hsp_align-len>
      <Hsp_qseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCA</Hsp_qseq>
      <Hsp_hseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA</Hsp_hseq>
      <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gi|27881483|ref|NM_017590.4|</Hit_id>
  <Hit_def>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA</Hit_def>
  <Hit_accession>NM_017590</Hit_accession>
  <Hit_len>5868</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>366.757</Hsp_bit-score>
      <Hsp_score>198</Hsp_score>
      <Hsp_evalue>6.49273e-101</Hsp_evalue>
      <Hsp_query-from>100</Hsp_query-from>
      <Hsp_query-to>390</Hsp_query-to>
      <Hsp_hit-from>2608</Hsp_hit-from>
      <Hsp_hit-to>2898</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>264</Hsp_identity>
      <Hsp_positive>264</Hsp_positive>
      <Hsp_gaps>8</Hsp_gaps>
      <Hsp_align-len>295</Hsp_align-len>
      <Hsp_qseq>ATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGC-TATC-GCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGG</Hsp_qseq>
      <Hsp_hseq>ATGCAGCAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCA-TCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGG-GCCT-TCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG</Hsp_hseq>
      <Hsp_midline>|||||||||||||||||||||||||| ||| |||||||| || || |||||||||| ||||| | ||| ||| ||| || ||||||||||| ||||||||||||||||| ||||| || ||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||||||| ||||| ||||| || | || ||||||| ||||||||||||| |||||| || |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>3</Hit_num>
  <Hit_id>gi|194733718|ref|NM_001130695.1|</Hit_id>
  <Hit_def>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA</Hit_def>
  <Hit_accession>NM_001130695</Hit_accession>
  <Hit_len>5466</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>355.677</Hsp_bit-score>
      <Hsp_score>192</Hsp_score>
      <Hsp_evalue>1.40543e-97</Hsp_evalue>
      <Hsp_query-from>97</Hsp_query-from>
      <Hsp_query-to>390</Hsp_query-to>
      <Hsp_hit-from>2433</Hsp_hit-from>
      <Hsp_hit-to>2726</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>266</Hsp_identity>
      <Hsp_positive>266</Hsp_positive>
      <Hsp_gaps>12</Hsp_gaps>
      <Hsp_align-len>300</Hsp_align-len>
      <Hsp_qseq>GATATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCA-CTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTG-CTGGAGC-TATCGCTTCCCTATGGGCGAGTTCC-AGCTCTGTGAAAGG</Hsp_qseq>
      <Hsp_hseq>GATATGCAACAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCA-TCAGC-TCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAG-TGGCTGG-GCCTACCGATTCCCCATGGGCGAGTTCCGA-CTCTGTGACAGG</Hsp_hseq>
      <Hsp_midline>|||||||| |||||||||||||||||||| ||| |||||||| || |||||||| |||| ||||| | ||| ||| | || || ||||| ||||| ||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||| ||| |||| || |||| || || || ||||| ||||||||||||| | |||||||| |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
</Iteration_hits>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>18780</Statistics_db-num>
      <Statistics_db-len>25940078</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.46</Statistics_kappa>
      <Statistics_lambda>1.28</Statistics_lambda>
      <Statistics_entropy>0.85</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

and the following XSLT stylesheet: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2fasta.xsl


processing:

xsltproc --novalid  blast2fasta.xsl blast.xml


result:

>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA|len:303|ident:303
TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA
>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA|len:290|ident:264
ATGCAGCAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCATCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGGGCCTTCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG
>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA|len:293|ident:266
GATATGCAACAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCATCAGCTCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAGTGGCTGGGCCTACCGATTCCCCATGGGCGAGTTCCGACTCTGTGACAGG
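
For readers without xsltproc, the same kind of extraction can be sketched in Python using only the standard library. This is not the blast2fasta.xsl stylesheet and makes no claim to reproduce its header fields exactly. The function name `blast_xml_to_fasta` and the `with_query` option are hypothetical names chosen for illustration; the element names (`Iteration`, `Hit`, `Hsp`, `Hsp_qseq`, `Hsp_hseq`, and so on) are the ones visible in the XML above.

```python
import xml.etree.ElementTree as ET


def blast_xml_to_fasta(xml_source, out, with_query=False):
    """Write every HSP in a BLAST XML report to `out` as FASTA records.

    xml_source: path or file object holding BLAST XML output.
    out:        text file object, opened once by the caller.
    with_query: if True, also write the query row before each hit row and
                keep the gap characters so each pair stays aligned;
                if False, write only the hit sequence with gaps removed.
    Returns the number of HSPs written.
    """
    written = 0
    root = ET.parse(xml_source).getroot()
    for iteration in root.iter("Iteration"):
        query_def = iteration.findtext("Iteration_query-def", default="query")
        for hit in iteration.iter("Hit"):
            hit_def = hit.findtext("Hit_def", default="unknown hit")
            for hsp in hit.iter("Hsp"):
                qseq = hsp.findtext("Hsp_qseq", default="")
                hseq = hsp.findtext("Hsp_hseq", default="")
                ident = hsp.findtext("Hsp_identity", default="NA")
                if with_query:
                    out.write(f">{query_def}|vs:{hit_def}\n{qseq}\n")
                else:
                    hseq = hseq.replace("-", "")
                # "len" here is simply the length of the sequence as written
                # (an assumption of this sketch, not the stylesheet's rule).
                out.write(f">{hit_def}|len:{len(hseq)}|ident:{ident}\n{hseq}\n")
                written += 1
    return written
```

A call might look like `with open("hsp.fas", "w") as out: blast_xml_to_fasta("blast.xml", out, with_query=True)`. The output file is opened once, outside the loops, and every record goes to the same handle. Opening a file in write mode inside the innermost loop would truncate it on each pass and leave only the last record.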