Question: How To Extract Desire Genes Blast Xml Result From A Big Blast Xml File
2
gravatar for KJ Lim
6.3 years ago by
KJ Lim120
KJ Lim120 wrote:

Dear community,

I have a XML file contained 50,000 genes Blast result with 10 hits for each gene. I want to mine desire genes Blast result from that XML file and the output file is still in XML format. The output file should have a full Blast result in XML form of desire genes.

I tired with the Python script shared by juefish on this posts, however, I only get partial of one gene Blast XML result parsed instead a list of my desire genes. I quote juefish's Python script here to make the discussion easier.

#!/usr/bin/env python

import sys
import os
import sets
import Bio

from sets import Set
from Bio.Blast import NCBIXML

# Usage.
if len(sys.argv) < 2:
        print ""
        print "This program extracts blast results from an xml file given a list of query sequences"
        print "Usage: %s -list file1 -xml file2 > outfile"
        print "-list: list of sequence names"
        print "-xml: blast xml output file"
        print ""
        sys.exit()

# Parse args.
for i in range(len(sys.argv)):
        if sys.argv[i] == "-list":
                infile1 = sys.argv[i+1]
        elif sys.argv[i] == "-xml":
                infile2 = sys.argv[i+1]

fls = [infile1,infile2]
results_handle = open(fls[1], "r")
fin1 = open(fls[0],"r")
geneContigs = Set([])

#establish list of names of queries to extract from xml file
for line in fin1:
    temp=line.lstrip('>').split()
    geneContigs.add(temp[0])
fin1.close()

genes2 = []
marker = False
header = True

#cycle through xml file and printing results that match to query list
for line in results_handle:
        line = line.rstrip('\r\n')
        if(header==True):
                if(line.count("<Iteration>")>0):
                        header=False
                        print "%s" % (line)
                        continue
                print "%s" % (line)
        else:
                    if(line.count("<Iteration>")>0):
                            genes2=[]
                    if(marker==False):
                            genes2.append(line)
                    if(marker==True):
                            if(line.count("</Iteration>")<1):
                                    print "%s" % (line)
                            elif(line.count("</Iteration>")>0):
                                    print "%s" % (line)
                                    marker=False
                    if(line.count("<Iteration_query-def>")>0):
                            split1 = line.split(">")
                            split2 = split1[1].split("<")
                            if(split2[0] in geneContigs):
                                    for i in range(len(genes2)):
                                            print "%s" % (genes2[i])
                                    marker=True
                            genes2=[]

I'm not good in Python nor BioPython, it would be great if the community could share with me your experience or suggestion.

Thanks for your time.

xml blast • 3.7k views
ADD COMMENTlink modified 3.6 years ago by Shyam130 • written 6.3 years ago by KJ Lim120

Hi Pierre; I am trying to compile using the command you mentioned. But it say javac: file not found: Biostar75204.java

Thanks

ADD REPLYlink written 3.6 years ago by Shyam130

did you just download the file Biostar75204.java ?

ADD REPLYlink written 3.6 years ago by Pierre Lindenbaum123k

thanks you Pierre. I used the script and was able to compile. Sorry it took a while to reply. I am trying on a large xml output.

ADD REPLYlink written 3.5 years ago by Shyam130
0
gravatar for Pierre Lindenbaum
6.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

"How to extract desire genes blast XML result from a big blast XML file":

create a java parser for blast:

 xjc -dtd -p gov.nih.nlm.ncbi.blast "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"

the following java file scan a blast xml file from stdin using a pull parser. Each time we find a 'Hit' we check if it is in the list and we include it in the resulting xml:

compile:

 javac Biostar75204.java gov/nih/nlm/ncbi/blast/*.java

execute:

 java Biostar75204 GENE1 GENE2 GENE3 GENE4  < blast.xml > result.xml

you can then extract the information using, for example, xslt: e.g. tools parsing NCBI blast -m 7 xml output format?

ADD COMMENTlink modified 3.6 years ago • written 6.3 years ago by Pierre Lindenbaum123k

Thanks Pierre. Thanks for your prompt replied. I will give it a try.

ADD REPLYlink written 6.3 years ago by KJ Lim120

Hi there,

Is it possible to use this script with a list of genes in a file? Thanks, Bernardo

ADD REPLYlink written 6.1 years ago by biotech540

yes replace the loop hitdefs.add(def); by reading a file: http://www.mkyong.com/java/how-to-read-file-from-java-bufferedreader-example/

ADD REPLYlink written 6.1 years ago by Pierre Lindenbaum123k

You can also read files using java.nio.file.Files.readAllBytes() which is faster than the BufferedReader approach in that link, especially for large files:

Here is a code snippet:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ReadFile_Files_ReadAllBytes {
  public static void main(String [] pArgs) throws IOException {
    String fileName = "c:\\temp\\sample-10KB.txt";
    File file = new File(fileName);

    byte [] fileBytes = Files.readAllBytes(file.toPath());
    char singleChar;
    for(byte b : fileBytes) {
      singleChar = (char) b;
      System.out.print(singleChar);
    }
  }
}
ADD REPLYlink written 18 months ago by darwincycling10800
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1896 users visited in the last hour