Question

Code for retrieving hit count from tabular BLAST format (outfmt=7 on Biopython) not working

0

Entering edit mode

6.8 years ago

rmartson • 0

So I needed to retrieve the number of hits for each query sequence from tabular blast output (with comments).

I used the script given here but changed a little: A: write a Python program to determine the average number of hits per sequence in

fh = open('BLASTED.txt')
oh = open('BLASTED.txt.out', 'w')
for qid, hsps in groupby(fh, lambda l: l.split()[0]):
    if qid.startswith('#'): continue
    hits = len(set([l.split()[1] for l in hsps]))
    hits_no += hits
    queries_no += 1
    oh.write('{0}\t{1}\n'.format(qid, hits))
oh.close()
fh.close()

So the relevant part of my script is:

blast = open('testing.txt')
output = open("hitcount.txt", 'w')
queries_no = 0
lowhitqueries_no = 0
for qid, hsps in groupby(blast, lambda l: l.split()[0]):
    if qid.startswith('#'): continue
    hits = len(set([l.split()[1] for l in hsps]))
    queries_no += 1
    #Used "print qid, hits" here to check what kind of hits I'm getting
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))
blast.close()

However, I noticed that the value for hits does not match what's actually in testing.txt. There are some query sequences with thousands of hits and print hits will only return a value of maybe 24.

How is this script meant to work and why is it failing?

EDIT: I found this method that I think is much simpler

import re
blast = open("testing.txt")
output = open("hitcount" + str(dataset) + ".txt", 'w')
lines = blast.readlines()
queries_no = 0
lowhitqueries_no = 0
for index, line in enumerate(lines):
    if re.match("# Query:", line):
        queries_no +=1
        hits = int(re.search(r'\d+', lines[index+3]).group()) 
        qid = line[9:].rstrip()
        print qid, hits #Just to check it's working
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))

blast.close()

Look at how much less space it uses up

biopython blast • 2.2k views

ADD COMMENT • link 6.8 years ago by rmartson • 0

0

Entering edit mode

You're just looking for the number of times a query id shows up in your blast report? I'm not sure why you tagged Biopython, I don't see that you're importing anything from the module, unless you ran the blast via biopython.

This will give you the query ids and counts per id:

cut -f 1 test.txt | sort | uniq -c > hitcount.txt

I think starting off writing your own program, instead of trying to adapt something written by someone else would be a better learning experience, unless this is homework and you have to use pythong/biopython.

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I tagged biopython because I wanted to use biopython to do it, but I found a solution anyway using re.match() to find query id lines, and readlines() and index to find the lines which give matching hit counts

It's in the edit

I think I found that same commandline thing written slightly differently but it didn't work. This one does create a file but its format isn't as I'd want it. The biopython script lets me decide what goes in there and in what format because I have a filtering step too

ADD REPLY • link 6.8 years ago by rmartson • 0

0

Entering edit mode

Ok. Instead of changing the question title to 'resolved', provide your own answer with what you did, and why it worked with what you are looking for. Then you accept your answer.

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

My answer is in the edit. If anyone skips straight to the answers without reading the issue then that's their fault. Accepting my own answer sounds weird but I'll do it this once

ADD REPLY • link 6.8 years ago by rmartson • 0

score 1 · Accepted Answer · 2017-07-12

import re
blast = open("testing.txt")
output = open("hitcount" + str(dataset) + ".txt", 'w')
lines = blast.readlines()
queries_no = 0
lowhitqueries_no = 0
for index, line in enumerate(lines):
    if re.match("# Query:", line):
        queries_no +=1
        hits = int(re.search(r'\d+', lines[index+3]).group()) 
        qid = line[9:].rstrip()
        print qid, hits #Just to check it's working
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))

blast.close()

Here's an alternate route