I've run a local BLAST search using a bunch of query sequences. My aim now is to identify any query sequences that did not have any hits. I have the query sequences in FASTA format and the BLAST results in tabular (with comments) format. Here is the relevant part of the code that is not working as it should (cut out and simplified):
from Bio import SeqIO
flanks = list(SeqIO.parse("flanks.fasta", "fasta"))
blast_results = open('test.txt')
nohits = 0
for record in flanks:
    if record.id not in blast_results:  # I've also tried with blast_results.read()
        nohits += 1
blast_results.close()
print len(flanks), nohits
In the end, len(flanks) and nohits are always the same. It is basically saying that no query ID from flanks.fasta is ever in the BLAST output file, even though I can take a random ID from flanks.fasta and manually find it multiple times in test.txt.
EDIT: Realised there's an alternate approach here: http://biopython.org/wiki/Retrieve_nonmatching_blast_queries
But I don't see why there isn't a simpler method similar to what I'm trying to do.
EDIT2: This now works:
for record in flanks:
    if record.id not in blast_results.read():
        nohits += 1
    blast_results.seek(0)
blast_results.close()
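With blast_results.read(), the first search read through the entire file and left the file position at the end, so every later read() returned an empty string and all the following IDs were searched for in nothing. I had to return the file position to the beginning with seek(0) after each search.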
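For what it's worth, re-reading the whole file for every ID gets slow on large outputs, and a plain substring test can also match an ID that is only part of a longer one (`seq1` inside `seq10`). If the commented format also names every query in its comment lines, a query without hits could still be found as a substring. Below is a minimal sketch of an alternative, assuming the query ID is the first tab-separated column of each hit line and that comment lines start with `#`. The helper names `ids_with_hits` and `count_nohits` and the sample data are made up for illustration.

```python
def ids_with_hits(lines):
    """Collect query IDs from tabular hit lines (assumes query ID is column 1)."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.add(line.split("\t")[0])
    return hits


def count_nohits(query_ids, lines):
    """Count query IDs that never appear as the query of a hit line."""
    hits = ids_with_hits(lines)
    return sum(1 for qid in query_ids if qid not in hits)


# Made-up lines standing in for test.txt, and made-up IDs for flanks.fasta
example_lines = [
    "# Query: seq1",
    "seq1\tsubjA\t98.5\t200",
    "# Query: seq2",
    "# 0 hits found",
    "# Query: seq10",
    "seq10\tsubjB\t91.0\t150",
]
example_ids = ["seq1", "seq2", "seq10"]
nohits_example = count_nohits(example_ids, example_lines)
```

With the real files this would be roughly `count_nohits((r.id for r in flanks), open('test.txt'))`, since iterating over an open file yields its lines one at a time and the file is only read once.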
Thanks for the reply, I'll just use that script for now.
For 1) though, I know blast_results is just a file handle. However, Python should be able to recognise it as text. I also tried converting record.id to a string with str() once, and Python should definitely be able to just search for a string in a text file, right? What's the deal with needing to specifically find the location of each ID based on the format?
And I'm confident my IDs in the flank file are in the correct format, because I generated them myself without any spaces.
And even if I hadn't, record.id and the BLAST results should still have the same relevant parts.
Thanks for the suggestion to get rid of list(). That's just something I got used to when I was using two nested for loops to check every possible combination of IDs between two files. They needed to be lists because, as generator objects, I couldn't get the inner loop to start reading from the beginning of the file on every new iteration, so it just stopped after reading through one ID in one file and the entirety of the other file. You don't need list() if there's just one for loop.
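To illustrate what I mean, here is a small sketch with a plain generator standing in for the lazy parser (the function `make_ids` and the IDs are made up): the inner loop only produces items during the first pass of the outer loop, unless the inner sequence has been turned into a list.

```python
def make_ids():
    # stand-in for a parser that yields records lazily
    for name in ["a", "b", "c"]:
        yield name


outer = ["x", "y"]

# Inner sequence is a generator: it is used up during the first outer pass
inner_gen = make_ids()
pairs_gen = []
for o in outer:
    for i in inner_gen:
        pairs_gen.append((o, i))

# Inner sequence is a list: it can be looped over again for every outer item
inner_list = list(make_ids())
pairs_list = []
for o in outer:
    for i in inner_list:
        pairs_list.append((o, i))
```

Here `pairs_gen` only contains the combinations for "x", while `pairs_list` contains all six.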
I just found a solution