parsing only 100% identical hits out of BLAST tab
0
0
Entering edit mode
7.7 years ago
Joe 21k

I'm parsing out of a large blast tabular file spurious hits which are not exact gene-for-gene matches.

I've written the following but my logic is wrong as I'm not getting just the ones I'm looking for I think.

So far, having the align length == the query end works reasonably well, but I'm finding a few cases where it doesn't. Its a monday so I'm struggling with this!

What else should I be checking for equivalency to guarantee I'm getting only full length matches?

The reason I'm specifying the columns specifically via options, is that I'm dealing with customised blast files which have additional columns.

I'm guessing you guys won't need my input file to figure out where I might be going wrong/how to improve - but I'll edit and add it in if needed.

# Script to find only exact matches in BLAST tab output.
# Requirement to specify column means the script can be
# used with custom column layouts.

import sys
import argparse
import traceback
import warnings
import csv

##################


#################
def main():
##########################################################
# Parse command line arguments

# Add some defaults:
    query = int(0)
    subject = int(1)
    perc_id = int(2)
    aln_len = int(3)
    mismatches = int(4)
    gaps = int(5)
    qstart = int(6)
    qend = int(7)
    sstart = int(8)
    send = int(9)
    eval = int(10)
    bitscore = int(11)

    try:
        parser = argparse.ArgumentParser(description='This script finds only exact matches in BLAST tabular outputs.')
        parser.add_argument(
            '-i',
            '--infile',
            action='store',
            help='The BLAST tabular input file.')
        parser.add_argument(
            '-a',
            '--aln_col',
            action='store',
            default=aln_len,
            help='The column number which corresponds to the alignment length (if different from default).')
        parser.add_argument(
            '-q',
            '--qend_col',
            action='store',
            default=qend,
            help='The column number which corresponds to the query end index (if different from default).')
        args = parser.parse_args()
    except:
        print "An exception occured with argument parsing. Check your provided options."
        traceback.print_exc()

    infile = args.infile
    length = int(args.aln_col)
    qend = int(args.qend_col)

# Main code:
    with open(infile, 'r') as ifh:
        tsvin = csv.reader(ifh, delimiter='\t')

        for row in tsvin:
            if row[length] == row[qend]:
                print('\t'.join(row))

if __name__ == "__main__":
    main()
blast tabular parsing • 3.1k views
ADD COMMENT
1
Entering edit mode

You want qstart == 1 and qend == qlen, as well as int(pident) == 100.

ADD REPLY
0
Entering edit mode

When you say qlen, do you mean alignment length (column 4 in std output)?

ADD REPLY
0
Entering edit mode

No, that is alen, which will only give you the aligned part + gaps. qlen can be added with the -outfmt option and represents the length of the query sequence. If you already have the blast output and cannot rerun it (because it takes too long or w/ever) then you might need to obtain the query sequence length from an external source (e.g. the query fasta). If you don't have that length information (or, alternatively the qcovs or qcovhsp fields), you cannot deduce whether the query is fully covered or not.

ADD REPLY
0
Entering edit mode

Ah ok, I can re-run it no problem. I thought there might be enough in the default output. Cheers

ADD REPLY
1
Entering edit mode

I wish... always found it a bit strange, that you cannot see the query coverage by default, especially since NCBI blast shows this automatically...

ADD REPLY
0
Entering edit mode

qlen doesn't seem to be an option in the blast parameters https://www.ncbi.nlm.nih.gov/books/NBK279675/

I suppose I could calculate it explicitly from the input file. I don't really understand it, but it looks to me like qcovs and qcovshsp should be representing that kind of information?

Can someone explain these for me? I read this thread (http://seqanswers.com/forums/showthread.php?t=32352) as they are trying to do the same thing, but they mention that repeats etc can throw this off?

ADD REPLY
0
Entering edit mode

qlen appears in the format options when you call blastn/x/p/... -help

ADD REPLY
0
Entering edit mode

Ah great, found it! NCBIs page is evidently out of date..

ADD REPLY
0
Entering edit mode

If you are looking for 100% alignments, perhaps using global alignment is faster/better option than using BLAST. Something like VSEARCH can do this fairly quickly. Of course, if you are trying to parse a huge BLAST alignment this one time than follow @cschu1981 additions.

ADD REPLY
1
Entering edit mode

For what I'm doing BLAST is speedy enough. It's a real pain that blastn just doesn't have an exact match option like -perc_identity across a whole sequence - would save me all this parsing.

ADD REPLY
0
Entering edit mode

Yes, but the reason is because it is doing local alignment.... global alignment does exactly that compares across the entire sequence.

ADD REPLY
0
Entering edit mode

Is there a reason you're using python? A few bash commands would make quick work of this since you know the cols. you want to look at for querying 100% matches.

ADD REPLY
0
Entering edit mode

Nope not especially. I just quite like python and am more familiar with it. I realise a bit of bash and awk magic would do it without a problem - my awk knowledge is abysmal however :p

ADD REPLY

Login before adding your answer.

Traffic: 804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6