Question: Extract Specific Entries From Blastx Output File, Write To New File
0
gravatar for charles.bridges
6.8 years ago by
charles.bridges70 wrote:

Hello, I have created a script that successfully searches for keywords (specified by user) within a Blastx output file in XML format. Now, I need to write those records (query, hit, score, evalue, etc) that contain the keyword in the alignment title to a new file. I have created separate lists for each of the query titles, hit title, e-value and alignment lengths but cannot seem to write them to a new file. Problem #1: what if Python errors, and one of the lists is missing a value...? Then all the other lists will be giving wrong information in reference to the query ("line slippage", if you will...). Problem #2: even if Python doesn't error, and all the lists are the same length, how can I write them to a file so that the first item in each list is associated with each other (and thus, item #10 from each list is also associated?) Should I create a dictionary instead? Problem#3: dictionaries have only a single value for a key, what if my query has several different hits? Not sure if it will be overwritten or skipped, or if it will just error. Any suggestions? My current script:

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import re

#obtain full path to blast output file (*.xml)
outfile = input("Full path to Blast output file (XML format only): ")

#obtain string to search for
search_string = input("String to search for: ")

#open the output file
result_handle = open(outfile)

#parse the blast record
blast_records = NCBIXML.parse(result_handle)

#initialize lists
query_list=[]
hit_list=[]
expect_list=[]
length_list=[]

#create 'for loop' that loops through each HIGH SCORING PAIR in each ALIGNMENT from each RECORD
for record in blast_records:
        for alignment in record.alignments:     #for description in record.descriptions???
                for hsp in alignment.hsps:      #for title in description.title???

                        #search for designated string
                        search = re.search(search_string, alignment.title)

                        #if search comes up with nothing, end
                        if search is None:
                                print ("Search string not found.")
                                break

                        #if search comes up with something, add it to a list of entries that match search string
                        else:

                                #option to include an 'exception' (if it finds keyword then DOES NOT add that entry to list)
                                if search is "trichomonas" or "entamoeba" or "arabidopsis":
                                        print ("found exception.")
                                        break
                                else:

                                        query_list.append(record.query)
                                        hit_list.append(alignment.title)
                                        expect_list.append(expect_val)
                                        length_list.append(length)

                                        #explicitly convert 'variables' ['int' object or 'float'] to strings
                                        length = str(alignment.length)
                                        expect_val = str(hsp.expect)

                                        #print ("\nquery name: " + record.query)
                                        #print ("alignment title: " + alignment.title)
                                        #print ("alignment length: " + length)
                                        #print ("expect value: " + expect_val)
                                        #print ("\n***Alignment***\n")
                                        #print (hsp.query)
                                        #print (hsp.match)
                                        #print (hsp.sbjct + "\n\n")


                                        if query_len is not hit_len is not expect_len is not length_len:
                                                print ("list lengths don't match!")
                                                break
                                        else:

                                                qrylen = len(query_list)
                                                query_len = str(qrylen)
                                                hitlen = len(hit_list)
                                                hit_len = str(hitlen)
                                                expectlen = len(expect_list)
                                                expect_len = str(expectlen)
                                                lengthlen = len(length_list)
                                                length_len = str(lengthlen)
                                                outpath = str(outfile)

                                                #create new file
                                                outfile = open("__Blast_Parse_Search.txt", "w")
                                                outfile.write("File contains entries from [" + outpath + "] that contain [" + search_string + "]")
                                                outfile.close

                                                #write list to file
                                                i = 0
                                                list_len = int(query_len)
                                                for i in range(0, list_len):

                                                        #append new file
                                                        outfile = open("__Blast_Parse_Search.txt", "a")
                                                        outfile.writelines(query_list + hit_list + expect_list + length_list)
                                                        i = i + 1

                                                #write to disk, close file
                                                outfile.flush()
                                                outfile.close

print ("query list length " + query_len)
print ("hit list length " + hit_len)
print ("expect list length " + expect_len)
print ("length list length " + length_len + "\n\n")
print ("first record: " + query_list[0] + " " + hit_list[0] + " " + expect_list[0] + " " + length_list[0])
print ("last record: " + query_list[-1] + " " + hit_list[-1] + " " + expect_list[-1] + " " + length_list[-1])
print ("\nFinished.\n")
python list biopython blast • 2.1k views
ADD COMMENTlink written 6.8 years ago by charles.bridges70
2
gravatar for Istvan Albert
6.8 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

The solution is to keep all related objects as a single tuple and append that to the list.. For example:

   out = []
   out.append( (record.query, alignment_title, expect_value) )

when you write it out all you need is a pattern to format each element. This would be a solution to all of your questions plus allows you to get rid if about 90% of the code above that has to deal with unnecessary details.

ADD COMMENTlink written 6.8 years ago by Istvan Albert ♦♦ 80k
0
gravatar for charles.bridges
6.7 years ago by
charles.bridges70 wrote:

Thanks, that works. I have been able to 'nest' my lists (tuples?) to contain all the information. Now I'm having trouble writing this information to a file. I have tried

for item in output:
        outfile.write("%s\n" % item)

but I get the error:

outfile.write("%s" % item)
TypeError: not all arguments converted during string formatting
ADD COMMENTlink written 6.7 years ago by charles.bridges70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 797 users visited in the last hour