Question: calling a function using glob.glob
bio90029 wrote:

Hi, hopefully someone can help me with this. I have prepared a script to extract data from a file; this part works very well and does what I need. The problem comes when I am using glob.glob, and subprocess to call the function. I keep getting the error message below, and I do not know how to handle it. Error message:

  File "parsing_blast.py", line 45, in <module>
    my_file=subprocess.Popen(cmd)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Thanks for your help.

from Bio.Blast import NCBIXML
from Bio import SeqIO, SearchIO
import sys, glob, subprocess, os

folders = glob.glob('/home/me/my_folder/H*')
print folders
for folder in folders:
    my_files=glob.glob(folder + '/*.xml')
    print my_files
    def parsing_blast():

        results_handle=open(my_files[0])
        blast_results=NCBIXML.parse(results_handle)
        #blast_results=NCBIXML.parse(results_handle)
        output_handle=open(folder + ' my_data_parse.xml','w')   
        #to extract some information from the blast file
        for blast_result in blast_results:
            sequence_length=blast_result.query_letters #this is the length of the sequence
            gene=blast_result.query #gene name
            #print 'The length is:', sequence_length #check point
            #print gene         #check point
            for description in blast_result.descriptions:
                title=description.title  #query seq name
                #print description.title #check point
                for alignment in blast_result.alignments:               
                    for hsp in alignment.hsps:
                        identity=hsp.identities #matching bases
                        num_gaps=hsp.gaps       #number of gaps
                        #print identity         #check point
                        #print num_gaps         #check point

                        per_identities=float(identity)/float(sequence_length)*float(100) 
                        #print per_identities   #check point
                        #sys.exit()

                        extracted_data= (gene + ',' + title + ','+ 'number_gaps: ' + str(num_gaps) +','+ 'per_identity: '+ str(per_identities) +'\n')

                        output_handle.write(extracted_data)
        output_handle.close()               

                    #sys.exit()   
    parsing_blast()



    print 'The file has been created'
modified 2.2 years ago • written 2.2 years ago by bio90029

The problem comes when I am using glob.glob, and subprocess to call the function

Why do you use subprocess to call the function?!

written 2.2 years ago by WouterDeCoster

I have a hundred folders, all starting with H, and each of them contains an XML file I would like to parse. I do not want to do it one by one, so I want the script to extract the information I need from the file in each H* folder and store it in another file, then move on to the next folder once that file has been created, and so on. I used glob.glob and subprocess before but within a function. I just wanted to use it from outside the function so I could add another function.

written 2.2 years ago by bio90029

I used glob.glob and subprocess before but within a function.

As far as I understood this is an entirely different use case. Instead, you need the multiprocessing module for parallelizing a function across many files.

written 2.2 years ago by WouterDeCoster

Hi, I have re-edited the script, and now it works perfectly. But now I need to find out how to tell the program to store the created file within the H folders. Any help with that, please?

written 2.2 years ago by bio90029

I edited my answer to set the output_handle file to a file within the XML file source directory. Is that what you meant?

written 2.2 years ago by steve
Dan D wrote:

The error message you're seeing is unrelated to your use of the glob function.

This line:

cmd=['parsing_blast']

is equivalent to typing

parsing_blast

on the command line. There's apparently no executable by that name available.
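
A minimal sketch of the difference, assuming Python 3 and that no program called parsing_blast is installed on the PATH. The parsing_blast function here is only a stand-in for the one in the question:

```python
import subprocess


def parsing_blast():
    """Stand-in for the poster's function; the real one parses BLAST XML."""
    return "parsed"


# subprocess looks for an executable named 'parsing_blast' on the PATH.
# It knows nothing about the Python function defined above.
try:
    subprocess.Popen(["parsing_blast"])
    popen_error = None
except OSError as exc:
    # On Python 3 this is FileNotFoundError, a subclass of OSError.
    popen_error = exc

# Running the function is just an ordinary Python call.
result = parsing_blast()
```

The OSError caught here corresponds to the "[Errno 2] No such file or directory" in the traceback from the question.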

Are you trying to asynchronously call the parsing_blast function you've defined?

Some quick feedback while I'm looking at your code:

You can simplify your glob query:

my_files=glob.glob('/home/me/my_folder/H*/*.xml')

And it's more efficient to define your function outside of the loop. Else you're recreating it with every loop iteration, which seems unnecessary unless I'm missing something.
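
Both suggestions can be sketched together as follows. This assumes Python 3; the temporary folder layout and the output_path_for helper are made up for illustration, and the real parsing code would go where the output paths are computed:

```python
import glob
import os
import tempfile


def output_path_for(xml_file):
    """Hypothetical helper, defined once outside any loop.

    Returns a path next to the XML file it was derived from.
    """
    return os.path.join(os.path.dirname(xml_file), "my_data_parse.xml")


# Throwaway layout: parent/H1/a.xml, parent/H2/b.xml, parent/X1/c.xml
parent = tempfile.mkdtemp()
for folder, name in [("H1", "a.xml"), ("H2", "b.xml"), ("X1", "c.xml")]:
    os.makedirs(os.path.join(parent, folder))
    open(os.path.join(parent, folder, name), "w").close()

# One pattern covers every .xml file in every folder starting with H.
xml_files = sorted(glob.glob(os.path.join(parent, "H*", "*.xml")))
outputs = [output_path_for(xml_file) for xml_file in xml_files]
```

Deriving the output path from the directory of each XML file is one way to keep each result inside its own H folder.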

written 2.2 years ago by Dan D

Thanks, I had tried putting the whole path in my_files but I could not make it work. I would like the script to parse the XML files in my H* folders. I have a hundred folders, each containing an XML file. If I run just the function in a single folder it works perfectly; I am trying to get the script to extract the data from one folder after another. Thanks for your help.

written 2.2 years ago by bio90029

I have re-edited the script, and now it is working. Now I need to find out how to store the created file within my H folders.

written 2.2 years ago by bio90029
steve wrote:

In addition to the others' comments, if I were to try to accomplish the task you've described:

I have a hundred folders, all starting with H, and each of them contains an XML file I would like to parse.

I would use a script like this:

#!/usr/bin/env python

import os

def find_H_dirs(parent_dir):
    '''
    Find all the dirs in the parent_dir that start with H
    '''
    matches = []
    for item in os.listdir(parent_dir):
        item_path = os.path.join(parent_dir, item)
        # check the full path; a bare name would be resolved against the current working directory
        if os.path.isdir(item_path) and item.startswith("H"):
            matches.append(item_path)
    return matches

def find_XML_files(dir):
    '''
    Find all the .xml files in a dir
    '''
    matches = []
    for item in os.listdir(dir):
        item_path = os.path.join(dir, item)
        if os.path.isfile(item_path) and item.endswith(".xml"):
            matches.append(item_path)
    return matches

def process_XML_file(XML_file, output_handle):
    '''
    Do a thing to the XML file
    '''
    print("Put your code for processing the {0} file here.".format(XML_file))


parent_dir = "/path/to/parent_dir"
# output_handle = "/path/to/my_data_parse.xml" # if you want it to always go to the same file

H_dirs = find_H_dirs(parent_dir = parent_dir)

for H_dir in H_dirs:
    output_handle = os.path.join(H_dir, "my_data_parse.xml")
    for XML_file in find_XML_files(dir = H_dir):
        process_XML_file(XML_file = XML_file, output_handle = output_handle)

It may be technically less efficient, but it is much simpler to write and understand, and will be easier to expand and re-use in the future.

edit: updated output_handle as per request in the comments

modified 2.2 years ago • written 2.2 years ago by steve

Using almost entirely the same code with some multiprocessing code in the loop will allow you to run it in parallel.
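
A rough sketch of what that could look like, assuming Python 3. The process_xml_file worker below is a hypothetical stand-in that only computes an output path, and the paths are placeholders; the real parsing would replace the function body:

```python
import os
from multiprocessing import Pool


def process_xml_file(xml_file):
    """Hypothetical worker: stands in for the real BLAST parsing.

    Takes one XML path and returns the path the parsed output would go to.
    """
    return os.path.join(os.path.dirname(xml_file), "my_data_parse.xml")


if __name__ == "__main__":
    # Placeholder paths; in practice these would come from glob or os.listdir.
    xml_files = ["/path/to/parent_dir/H1/a.xml", "/path/to/parent_dir/H2/b.xml"]
    # Each worker process handles one file at a time; map keeps the input order.
    with Pool(processes=2) as pool:
        outputs = pool.map(process_xml_file, xml_files)
    print(outputs)
```

The worker has to be a top-level function so that it can be sent to the worker processes, and the `if __name__ == "__main__":` guard keeps those processes from re-running the pool setup when they import the script.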

written 2.2 years ago by WouterDeCoster
Rodrigo wrote:

It seems the problem is that cmd = ['parsing_blast'] names a Python function, not an executable program. The subprocess module is for spawning processes and working with their input/output, not for running functions, as is explained here.

written 2.2 years ago by Rodrigo