Question: parsing only 100% identical hits out of BLAST tab
0
gravatar for jrj.healey
21 months ago by
jrj.healey8.6k
United Kingdom
jrj.healey8.6k wrote:

I'm parsing out of a large blast tabular file spurious hits which are not exact gene-for-gene matches.

I've written the following but my logic is wrong as I'm not getting just the ones I'm looking for I think.

So far, having the align length == the query end works reasonably well, but I'm finding a few cases where it doesn't. Its a monday so I'm struggling with this!

What else should I be checking for equivalency to guarantee I'm getting only full length matches?

The reason I'm specifying the columns specifically via options, is that I'm dealing with customised blast files which have additional columns.

I'm guessing you guys won't need my input file to figure out where I might be going wrong/how to improve - but I'll edit and add it in if needed.

# Script to find only exact matches in BLAST tab output.
# Requirement to specify column means the script can be
# used with custom column layouts.

import sys
import argparse
import traceback
import warnings
import csv

##################


#################
def main():
##########################################################
# Parse command line arguments

# Add some defaults:
    query = int(0)
    subject = int(1)
    perc_id = int(2)
    aln_len = int(3)
    mismatches = int(4)
    gaps = int(5)
    qstart = int(6)
    qend = int(7)
    sstart = int(8)
    send = int(9)
    eval = int(10)
    bitscore = int(11)

    try:
        parser = argparse.ArgumentParser(description='This script finds only exact matches in BLAST tabular outputs.')
        parser.add_argument(
            '-i',
            '--infile',
            action='store',
            help='The BLAST tabular input file.')
        parser.add_argument(
            '-a',
            '--aln_col',
            action='store',
            default=aln_len,
            help='The column number which corresponds to the alignment length (if different from default).')
        parser.add_argument(
            '-q',
            '--qend_col',
            action='store',
            default=qend,
            help='The column number which corresponds to the query end index (if different from default).')
        args = parser.parse_args()
    except:
        print "An exception occured with argument parsing. Check your provided options."
        traceback.print_exc()

    infile = args.infile
    length = int(args.aln_col)
    qend = int(args.qend_col)

# Main code:
    with open(infile, 'r') as ifh:
        tsvin = csv.reader(ifh, delimiter='\t')

        for row in tsvin:
            if row[length] == row[qend]:
                print('\t'.join(row))

if __name__ == "__main__":
    main()
parsing blast tabular • 792 views
ADD COMMENTlink modified 21 months ago by Biostar ♦♦ 20 • written 21 months ago by jrj.healey8.6k
1

You want qstart == 1 and qend == qlen, as well as int(pident) == 100.

ADD REPLYlink written 21 months ago by cschu1811.4k

When you say qlen, do you mean alignment length (column 4 in std output)?

ADD REPLYlink written 21 months ago by jrj.healey8.6k

No, that is alen, which will only give you the aligned part + gaps. qlen can be added with the -outfmt option and represents the length of the query sequence. If you already have the blast output and cannot rerun it (because it takes too long or w/ever) then you might need to obtain the query sequence length from an external source (e.g. the query fasta). If you don't have that length information (or, alternatively the qcovs or qcovhsp fields), you cannot deduce whether the query is fully covered or not.

ADD REPLYlink written 21 months ago by cschu1811.4k

Ah ok, I can re-run it no problem. I thought there might be enough in the default output. Cheers

ADD REPLYlink written 21 months ago by jrj.healey8.6k
1

I wish... always found it a bit strange, that you cannot see the query coverage by default, especially since NCBI blast shows this automatically...

ADD REPLYlink written 21 months ago by cschu1811.4k

qlen doesn't seem to be an option in the blast parameters https://www.ncbi.nlm.nih.gov/books/NBK279675/

I suppose I could calculate it explicitly from the input file. I don't really understand it, but it looks to me like qcovs and qcovshsp should be representing that kind of information?

Can someone explain these for me? I read this thread (http://seqanswers.com/forums/showthread.php?t=32352) as they are trying to do the same thing, but they mention that repeats etc can throw this off?

ADD REPLYlink written 21 months ago by jrj.healey8.6k

qlen appears in the format options when you call blastn/x/p/... -help

ADD REPLYlink written 21 months ago by cschu1811.4k

Ah great, found it! NCBIs page is evidently out of date..

ADD REPLYlink written 21 months ago by jrj.healey8.6k

If you are looking for 100% alignments, perhaps using global alignment is faster/better option than using BLAST. Something like VSEARCH can do this fairly quickly. Of course, if you are trying to parse a huge BLAST alignment this one time than follow @cschu1981 additions.

ADD REPLYlink written 21 months ago by Jon270

For what I'm doing BLAST is speedy enough. It's a real pain that blastn just doesn't have an exact match option like -perc_identity across a whole sequence - would save me all this parsing.

ADD REPLYlink written 21 months ago by jrj.healey8.6k

Yes, but the reason is because it is doing local alignment.... global alignment does exactly that compares across the entire sequence.

ADD REPLYlink written 21 months ago by Jon270

Is there a reason you're using python? A few bash commands would make quick work of this since you know the cols. you want to look at for querying 100% matches.

ADD REPLYlink written 21 months ago by st.ph.n2.4k

Nope not especially. I just quite like python and am more familiar with it. I realise a bit of bash and awk magic would do it without a problem - my awk knowledge is abysmal however :p

ADD REPLYlink written 21 months ago by jrj.healey8.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 830 users visited in the last hour