Question: Code for retrieving hit count from tabular BLAST format (outfmt=7 on Biopython) not working
0
gravatar for rmartson
2.3 years ago by
rmartson0
rmartson0 wrote:

So I needed to retrieve the number of hits for each query sequence from tabular blast output (with comments).

I used the script given here but changed a little: A: write a Python program to determine the average number of hits per sequence in

fh = open('BLASTED.txt')
oh = open('BLASTED.txt.out', 'w')
for qid, hsps in groupby(fh, lambda l: l.split()[0]):
    if qid.startswith('#'): continue
    hits = len(set([l.split()[1] for l in hsps]))
    hits_no += hits
    queries_no += 1
    oh.write('{0}\t{1}\n'.format(qid, hits))
oh.close()
fh.close()

So the relevant part of my script is:

blast = open('testing.txt')
output = open("hitcount.txt", 'w')
queries_no = 0
lowhitqueries_no = 0
for qid, hsps in groupby(blast, lambda l: l.split()[0]):
    if qid.startswith('#'): continue
    hits = len(set([l.split()[1] for l in hsps]))
    queries_no += 1
    #Used "print qid, hits" here to check what kind of hits I'm getting
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))
blast.close()

However, I noticed that the value for hits does not match what's actually in testing.txt. There are some query sequences with thousands of hits and print hits will only return a value of maybe 24.

How is this script meant to work and why is it failing?

EDIT: I found this method that I think is much simpler

import re
blast = open("testing.txt")
output = open("hitcount" + str(dataset) + ".txt", 'w')
lines = blast.readlines()
queries_no = 0
lowhitqueries_no = 0
for index, line in enumerate(lines):
    if re.match("# Query:", line):
        queries_no +=1
        hits = int(re.search(r'\d+', lines[index+3]).group()) 
        qid = line[9:].rstrip()
        print qid, hits #Just to check it's working
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))

blast.close()

Look at how much less space it uses up

blast biopython • 741 views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by rmartson0

You're just looking for the number of times a query id shows up in your blast report? I'm not sure why you tagged Biopython, I don't see that you're importing anything from the module, unless you ran the blast via biopython.

This will give you the query ids and counts per id:

cut -f 1 test.txt | sort | uniq -c > hitcount.txt

I think starting off writing your own program, instead of trying to adapt something written by someone else would be a better learning experience, unless this is homework and you have to use pythong/biopython.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by st.ph.n2.5k

I tagged biopython because I wanted to use biopython to do it, but I found a solution anyway using re.match() to find query id lines, and readlines() and index to find the lines which give matching hit counts

It's in the edit

I think I found that same commandline thing written slightly differently but it didn't work. This one does create a file but its format isn't as I'd want it. The biopython script lets me decide what goes in there and in what format because I have a filtering step too

ADD REPLYlink written 2.3 years ago by rmartson0

Ok. Instead of changing the question title to 'resolved', provide your own answer with what you did, and why it worked with what you are looking for. Then you accept your answer.

ADD REPLYlink written 2.3 years ago by st.ph.n2.5k

My answer is in the edit. If anyone skips straight to the answers without reading the issue then that's their fault. Accepting my own answer sounds weird but I'll do it this once

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by rmartson0
1
gravatar for rmartson
2.3 years ago by
rmartson0
rmartson0 wrote:
import re
blast = open("testing.txt")
output = open("hitcount" + str(dataset) + ".txt", 'w')
lines = blast.readlines()
queries_no = 0
lowhitqueries_no = 0
for index, line in enumerate(lines):
    if re.match("# Query:", line):
        queries_no +=1
        hits = int(re.search(r'\d+', lines[index+3]).group()) 
        qid = line[9:].rstrip()
        print qid, hits #Just to check it's working
    if hits<2:
        lowhitqueries_no +=1
        output.write('{0}\t{1}\n'.format(qid, hits))

blast.close()

Here's an alternate route

ADD COMMENTlink written 2.3 years ago by rmartson0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2088 users visited in the last hour