Question: How to merge BLAST XML output from different database?
0
gravatar for tcf.hcdg
2.0 years ago by
tcf.hcdg60
European Union
tcf.hcdg60 wrote:

Hello I did the command line BLAST against NCBI nr database. Because of the limited computational resourses, administrator at our university reccomended me to split the query sequence as well as databases (120gb of nr). I splited the database in almost two equall parts and then did the blastx against each part. Now I have the blast results of each query sequence in tow parts. Each query have the maximum of 20 hits in each part of the database. I would like to use BLAST2GO for further annotation. But BLAST2GO can accept the multiple files for only one part of the database.. Therefore, I would like to merge the results of each query sequence from both part of the database. I could merge the tabulat output of blast and but with XML files it is not working. Could you please guide me how can I merge BLAST XML results from different databases.

merge blast blastxml • 965 views
ADD COMMENTlink modified 2.0 years ago by Pierre Lindenbaum119k • written 2.0 years ago by tcf.hcdg60

Now I have the blast results of each query sequence in tow parts

does that mean that the very same query is present in the two xml files ?

ADD REPLYlink written 2.0 years ago by Pierre Lindenbaum119k

Yes I use the same query sequence for both part of database.

ADD REPLYlink written 2.0 years ago by tcf.hcdg60

For example "TRINITY_DN99193_c0_g5" query against database1


http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>BLASTX 2.6.0+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/scratch/tb44227/db/nrdb_1/nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>TRINITY_DN99193_c0_g5_i1 len=580</Iteration_query-def>
  <Iteration_query-len>580</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|598008123|ref|XP_007370486.1|</Hit_id>
  <Hit_def>hypothetical protein DICSQDRAFT_93145, partial [Dichomitus squalens LYAD-421 SS1] >gi|395324364|gb|EJF56806.1| hypothetical protein DICSQDRAFT_93145, partial [Dichomitus squalens LYAD-421 SS1]</Hit_def>

. . .

AND "TRINITY_DN99193_c0_g5" query against database2

<BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>TRINITY_DN99193_c0_g5_i1 len=580</Iteration_query-def>
  <Iteration_query-len>580</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|973327510|gb|KUL87405.1|</Hit_id>
  <Hit_def>hypothetical protein ZTR_04698 [Talaromyces verruculosus]</Hit_def>
  <Hit_accession>KUL87405</Hit_accession>
  <Hit_len>171</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>131.724</Hsp_bit-score>
      <Hsp_score>330</Hsp_score>
      <Hsp_evalue>1.36501e-35</Hsp_evalue>
      <Hsp_query-from>134</Hsp_query-from>
      <Hsp_query-to>580</Hsp_query-to>
      <Hsp_hit-from>22</Hsp_hit-from>
      <Hsp_hit-to>170</Hsp_hit-to>
      <Hsp_query-frame>-1</Hsp_query-frame>
      <Hsp_hit-frame>0</Hsp_hit-frame>

. . .

ADD REPLYlink written 2.0 years ago by tcf.hcdg60

Is there only ONE query def / Iteration per XML file ?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Pierre Lindenbaum119k

No, ecah file has 50000 query sequences, am I have 62 files 31 for each.

ADD REPLYlink written 2.0 years ago by tcf.hcdg60
0
gravatar for Pierre Lindenbaum
2.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

Ok, I've quickly written something: https://github.com/lindenb/jvarkit/wiki/MergeBlastXml

$ java -jar dist/mergeblastxml.jar input1.blastn.xml  input2.blastn.xml  input2.blastn.xml > out.xml
ADD COMMENTlink written 2.0 years ago by Pierre Lindenbaum119k
java -jar dist/mergeblastxml.jar ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > output.xml

[INFO][MergeBlastXml]resolveEntity:-//NCBI//NCBI BlastOutput/EN/http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd/null
[INFO][MergeBlastXml]opening /home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2
[INFO][MergeBlastXml]resolveEntity:-//NCBI//NCBI BlastOutput/EN/http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd/null
[SEVERE][MergeBlastXml]ParseError at [row,col]:[840759,6]
Message: The processing instruction target matching "[xX][mM][lL]" is not allowed.
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[840759,6]
Message: The processing instruction target matching "[xX][mM][lL]" is not allowed.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.peek(XMLEventReaderImpl.java:276)
        at com.github.lindenb.jvarkit.tools.blast.MergeBlastXml.doWork(MergeBlastXml.java:210)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:747)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:861)
        at com.github.lindenb.jvarkit.tools.blast.MergeBlastXml.main(MergeBlastXml.java:273)
ADD REPLYlink written 2.0 years ago by tcf.hcdg60

what is the output of

xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 > /dev/null
xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > /dev/null

?

I suspect your process have concatenated some XML files

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Pierre Lindenbaum119k
I am sorry I couldn't respond earlier because of my limited posting option. 
Well, Itried and the output is like that:
tb44227@lido-gw02:~/nobackup/jvarkit>xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 > /dev/null
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1:822600: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
     ^
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1:822601: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm
^
tb44227@lido-gw02:~/nobackup/jvarkit>
tb44227@lido-gw02:~/nobackup/jvarkit>xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > /dev/null
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2:840759: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
     ^
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2:840760: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm
^
ADD REPLYlink written 2.0 years ago by tcf.hcdg60

I think the script should be like:

First it should merge all the resulted xml files that was generated for one database 1 let's say in this case merge all 31 files of database1

Then it should some how sort the xml files. The reason to sort is queries are not insorted format,because I used GNU parrallel to split the blast queries over nodes.

Doing the same exercise for the database2.

Now we will have two files that contained the results for each query against each database.

Therefore script should be trained in such a way that it takes the first query from first file and copy the results of this specific query from both files

    BUT BEFORE Pasting the results into new file, a little bit modification is required.
    Each query has a maximum of 20 hits. it means the hit starts from 
        <Hit_num>1</Hit_num>
        .   
        .
        .
        .and ends on 
        <Hit_num>20</Hit_num>

    The problem is the hit number always starts form 1 to 20 in both databases.

I wonder whether it will effect when I use this file for the BLAST2GO? or we need to change the number of second part of th edatabase from

        <Hit_num>1</Hit_num>     TO  <Hit_num>21</Hit_num>
        .   
        .
        .
        .and  
        <Hit_num>20</Hit_num>    TO <Hit_num>40</Hit_num>

    So, After merging each query will have 40 hits
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by tcf.hcdg60

these are not valid xml files.

ADD REPLYlink written 2.0 years ago by Pierre Lindenbaum119k

As I mentioned in the previous comment, because fo the GNU parallel each file has multiple blast xml header and footer.

I remove the repeated header and footer from the file and then tried your script, it worked and merged two files. But the resultant file in in one line

http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"><BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>BLASTX 2.6.0+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/scratch/tb44227/db/nrdb_1/nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations><Iteration><Iteration_iter-num>1459</Iteration_iter-num><Iteration_query-ID>Query_1459</Iteration_query-ID><Iteration_query-def>TRINITY_DN100341_c0_g1_i1 len=1240</Iteration_query-def><Iteration_query-len>1240</Iteration_query-len><Iteration_hits></Iteration_hits><Iteration_stat><Statistics><Statistics_db-num>112583993</Statistics_db-num><Statistics_db-len>41258651338</Statistics_db-len><Statistics_hsp-len>156</Statistics_hsp-len><Statistics_eff-space>6089755946510</Statistics_eff-space><Statistics_kappa>0.041</Statistics_kappa><Statistics_lambda>0.267</Statistics_lambda><Statistics_entropy>0.14</Statistics_entropy></Statistics></Iteration_stat><Iteration_message>No hits found</Iteration_message></Iteration><Iteration><Iteration_iter-num>1375</Iteration_iter-num><Iteration_query-ID>Query_1375</Iteration_query-ID><Iteration_query-def>TRINITY_DN100342_c0_g1_i1 len=3033</Iteration_query-def><Iteration_query-len>3033</Iteration_query-len><Iteration_hits><Hit><Hit_num>1</Hit_num><Hit_id>gi|168037340|ref|XP_001771162.1|</Hit_id><Hit_def>predicted protein [Physcomitrella patens] >gi|162677542|gb|EDQ64011.1| predicted protein [Physcomitrella patens]</Hit_def><Hit_accession>XP_001771162</Hit_accession><Hit_len>1226</Hit_len><Hit_hsps><Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>58.5362</Hsp_bit-score><Hsp_score>140</Hsp_score><Hsp_evalue>7.71095e-05</Hsp_evalue><Hsp_query-from>2883</Hsp_query-from><Hsp_query-to>2996</Hsp_query-to><Hsp_hit-from>573</Hsp_hit-from><Hsp_hit-to>611</Hsp_hit-to><Hsp_query-frame>3</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>27</Hsp_identity><Hsp_positive>30</Hsp_positive><Hsp_gaps>1</Hsp_gaps><Hsp_align-len>39</Hsp_align-len><Hsp_qseq>RSYKFILTCYLERYINRWNPFTAKAMEVSHGREA-LVPA</Hsp_qseq><Hsp_hseq>RSYNFIVTCGLERYINIWNPFTAKTMGYLHGHEASVMPA</Hsp_hseq><Hsp_midline>RSY FI+TC LERYIN WNPFTAK M   HG EA ++PA</Hsp_midline></Hsp></Hit_hsps></Hit><Hit><Hit_num>1</Hit_num><Hit_id>gi|1026767740|gb|OAE27996.1|</Hit_id><Hit_def>hypothetical protein AXG93_1062s1280 [Marchantia polymorpha subsp. polymorpha]</Hit_def><Hit_accession>OAE27996</Hit_accession><Hit_len>883</Hit_len><Hit_hsps><Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>43.8986</Hsp_bit-score><Hsp_score>102</Hsp_score><Hsp_evalue>6.4385e-09</Hsp_evalue><Hsp_query-from>2228</Hsp_query-from><Hsp_query-to>2305</Hsp_query-to><Hsp_hit-from>321</Hsp_hit-from><Hsp_hit-to>346</Hsp_hit-to><Hsp_query-frame>2</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>17</Hsp_identity><Hsp_positive>23</Hsp_positive><Hsp_gaps>0</Hsp_gaps><Hsp_align-len>26</Hsp_align-len><Hsp_qseq>ILNAPLCATWLRLVENDKLVYGYDLG</Hsp_qseq><Hsp_hseq>LVNAPMCATWLRVAESDKLLYGDDRG</Hsp_hseq><Hsp_midline>++NAP+CATWLR+ E+DKL+YG D G</Hsp_midline></Hsp><Hsp><Hsp_num>2</Hsp_num><Hsp_bit-score>39.2762</Hsp_bit-score><Hsp_score>90</Hsp_score><Hsp_evalue>6.4385e-09</Hsp_evalue><Hsp_query-from>2472</Hsp_query-from><Hsp_query-to>2627</Hsp_query-to><Hsp_hit-from>421</Hsp_hit-from><Hsp_hit-to>460</Hsp_hit-to><Hsp_query-frame>3</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>23</Hsp_identity><Hsp_positive>28</Hsp_positive><Hsp_gaps>12</Hsp_gaps><Hsp_align-len>52</Hsp_align-len><Hsp_qseq>KWAAIEHAHGIFGICGCEYCRWMFSLHRSMAKLLHCHYQRLGNRWTLDPFVA</Hsp_qseq><Hsp_hseq>KWTAKEH............
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by tcf.hcdg60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1206 users visited in the last hour