I'm parsing out of a large blast tabular file spurious hits which are not exact gene-for-gene matches.
I've written the following but my logic is wrong as I'm not getting just the ones I'm looking for I think.
So far, having the align length ==
the query end works reasonably well, but I'm finding a few cases where it doesn't. Its a monday so I'm struggling with this!
What else should I be checking for equivalency to guarantee I'm getting only full length matches?
The reason I'm specifying the columns specifically via options, is that I'm dealing with customised blast files which have additional columns.
I'm guessing you guys won't need my input file to figure out where I might be going wrong/how to improve - but I'll edit and add it in if needed.
# Script to find only exact matches in BLAST tab output.
# Requirement to specify column means the script can be
# used with custom column layouts.
import sys
import argparse
import traceback
import warnings
import csv
##################
#################
def main():
##########################################################
# Parse command line arguments
# Add some defaults:
query = int(0)
subject = int(1)
perc_id = int(2)
aln_len = int(3)
mismatches = int(4)
gaps = int(5)
qstart = int(6)
qend = int(7)
sstart = int(8)
send = int(9)
eval = int(10)
bitscore = int(11)
try:
parser = argparse.ArgumentParser(description='This script finds only exact matches in BLAST tabular outputs.')
parser.add_argument(
'-i',
'--infile',
action='store',
help='The BLAST tabular input file.')
parser.add_argument(
'-a',
'--aln_col',
action='store',
default=aln_len,
help='The column number which corresponds to the alignment length (if different from default).')
parser.add_argument(
'-q',
'--qend_col',
action='store',
default=qend,
help='The column number which corresponds to the query end index (if different from default).')
args = parser.parse_args()
except:
print "An exception occured with argument parsing. Check your provided options."
traceback.print_exc()
infile = args.infile
length = int(args.aln_col)
qend = int(args.qend_col)
# Main code:
with open(infile, 'r') as ifh:
tsvin = csv.reader(ifh, delimiter='\t')
for row in tsvin:
if row[length] == row[qend]:
print('\t'.join(row))
if __name__ == "__main__":
main()
You want qstart == 1 and qend == qlen, as well as int(pident) == 100.
When you say qlen, do you mean alignment length (column 4 in std output)?
No, that is alen, which will only give you the aligned part + gaps. qlen can be added with the -outfmt option and represents the length of the query sequence. If you already have the blast output and cannot rerun it (because it takes too long or w/ever) then you might need to obtain the query sequence length from an external source (e.g. the query fasta). If you don't have that length information (or, alternatively the qcovs or qcovhsp fields), you cannot deduce whether the query is fully covered or not.
Ah ok, I can re-run it no problem. I thought there might be enough in the default output. Cheers
I wish... always found it a bit strange, that you cannot see the query coverage by default, especially since NCBI blast shows this automatically...
qlen doesn't seem to be an option in the blast parameters https://www.ncbi.nlm.nih.gov/books/NBK279675/
I suppose I could calculate it explicitly from the input file. I don't really understand it, but it looks to me like
qcovs
andqcovshsp
should be representing that kind of information?Can someone explain these for me? I read this thread (http://seqanswers.com/forums/showthread.php?t=32352) as they are trying to do the same thing, but they mention that repeats etc can throw this off?
qlen appears in the format options when you call blastn/x/p/... -help
Ah great, found it! NCBIs page is evidently out of date..
If you are looking for 100% alignments, perhaps using global alignment is faster/better option than using BLAST. Something like VSEARCH can do this fairly quickly. Of course, if you are trying to parse a huge BLAST alignment this one time than follow @cschu1981 additions.
For what I'm doing BLAST is speedy enough. It's a real pain that blastn just doesn't have an exact match option like -perc_identity across a whole sequence - would save me all this parsing.
Yes, but the reason is because it is doing local alignment.... global alignment does exactly that compares across the entire sequence.
Is there a reason you're using python? A few bash commands would make quick work of this since you know the cols. you want to look at for querying 100% matches.
Nope not especially. I just quite like python and am more familiar with it. I realise a bit of bash and awk magic would do it without a problem - my awk knowledge is abysmal however :p