Question

Trying to check if a query id appears in tabular blast output but not getting any positives even for IDs I know are in the file?

0

Entering edit mode

6.8 years ago

rmartson • 0

I've run a local BLAST search using a bunch of query sequences. My aim now is to identify any query sequences that did not have any hits. I have the query sequences in fasta format and the blast results in tabular (with comments) format. Here is the relevant part of the code that is not working as it should be (cut out and simplified):

from Bio import SeqIO
flanks = list(SeqIO.parse("flanks.fasta", "fasta"))
blast_results = open('test.txt')
nohits = 0
for record in flanks:
    if record.id not in blast_results: #I've also tried with blast_results.read()
        nohits+=1
blast_results.close()

print len(flanks), nohits

In the end, len(flanks) and nohits are always the same. It is basically saying that no query ID from flanks.fasta is ever in the blast output file, even if I can take a random ID from flanks.fasta and find it multiple times in test.txt manually.

EDIT: Realised there's an alternate approach here: http://biopython.org/wiki/Retrieve_nonmatching_blast_queries

But I don't see how there isn't a simpler method similar to what I'm trying to do

EDIT2: This now works:

for record in flanks:
    if record.id not in blast_results.read():
        nohits +=1
    blast_results.seek(0)

blast_results.close()

It just read through the entire file on the first search and then ended up searching for all the following IDs in an empty string so I had to return it to the beginning.

biopython blast • 2.1k views

ADD COMMENT • link 6.8 years ago by rmartson • 0

score 0 · Answer 1 · 2017-07-11

0

Entering edit mode

6.8 years ago

Philipp Bayer 8.3k

A few things have happened here:

1) blast_results is just a file handler, there are no IDs stored. So when you check whether the record ID is in the blast results all you'll check with is this thing: <open file 'test.txt', mode 'r' at 0x2aaaab9958a0>. You'll have to store the blast results in a set or a list first, like

blast_results = set()
with open('test.txt') as f:
  for line in f:
    if line.startswith('#'): continue
    linelist = line.split('\t')
    query, subject = linelist[0], linelist[1]
    blast_results.add(query)

Alternatively, Biopython also has Bio.SearchIO.BlastIO with support for tabular output.

2) You may run into problems with record.id vs. record.name vs. record.description in Biopython - everything before the space in the name line is the ID, the ID + everything after that is the name, example:

>hello there

Here record.id is 'hello' and record.name is 'hello there'. So if you have spaces in your names the ID may not cut it depending on your BLAST output settings.

3) If you run list(SeqIO.parse(..)) you actually lose some time - the first time running list() will iterate over the whole fasta file, but then you iterate over that list again so you run over the file a second time. You can remove the list() and get the same at half the time.

ADD COMMENT • link 6.8 years ago by Philipp Bayer 8.3k

0

Entering edit mode

Thanks for the reply, I'll just use that script for now.

For 1) though, I know blast_results is just a file handler. However Python should be able to recognise it as text. I read redcord.id as a string with str() once too, and Python should definitely be able to just search for a string in a text file, right? What's the deal with needing to specifically find the location of each ID based on the format?

And I'm confident my IDs in the flank file are in the correct format because I generated them myself without any spaces

AC068207.60 [...]

AP006305.11

And even if I didn't, record.id and the blast results should still have the same relevant parts

Thanks for the suggestion to get rid of list(), that's just something I got used to when I was using two for loops to check every possible combination of IDs between two files and they needed to be lists because as generator objects I couldn't get the loop to start reading from the beginning of the file for every new iteration so it just stopped after reading through one ID in one file and the entirety of the other file. You don't need it if there's just one for loop