How to filter short length blast results from output
3
1
Entering edit mode
8.4 years ago

I'm currently trying to work with a set of plant nucleotide contigs that are all greater than 300 bp in length. I'm trying to use NCBI blast to identify the contigs, but I'm running into a problem because my outputs tend to be flooded with ~50 bp long sequences in my contigs that match with random database sequences that definitely aren't real. For example, my last blast returned a large amount of ~50 bp sequences that matched with zebrafish, human, C. diff, and others. I upped the E value to be extremely strict (0.3), but I'm still getting these artifacts, and I feel like I'm losing a lot of quality results because I'm not working with a model system. I've been searching on Google and through the help file for a way to filter these short matches, but I haven't found anything. Does anyone have any advice on how I could clean these output files up?

blast • 7.8k views
ADD COMMENT
1
Entering edit mode

0.3 is not a strict e-value. Try something like 1e-50. That could be considered strict.

ADD REPLY
7
Entering edit mode
8.4 years ago
Siva ★ 1.9k

You can restrict the BLAST results to only those HSPs that match at least a certain percentage of your query length (coverage) by using qcov_hsp_perc option.

-qcov_hsp_perc <Real, 0..100>
Percent query coverage per hsp

ADD COMMENT
1
Entering edit mode
8.4 years ago

I wrote a tool to filter a XML blast output with a javascript expression: https://github.com/lindenb/jvarkit/wiki/BlastFilterJS

Example: Filter Hit having <Hit_len> <1500

$ java -jar dist/blastfilterjs.jar blastn.xml  -e 'parseInt(hit.getHitLen())<1500' 2> /dev/null |\
  xmllint --format - | grep "Hit_len"

   <Hit_len>1492</Hit_len>
   <Hit_len>1488</Hit_len>
   <Hit_len>1477</Hit_len>
   <Hit_len>1452</Hit_len>
   <Hit_len>1430</Hit_len>
   <Hit_len>1064</Hit_len>
   <Hit_len>1283</Hit_len>
   <Hit_len>1052</Hit_len>
   <Hit_len>1272</Hit_len>
   <Hit_len>693</Hit_len>

Example 2: keep hsp having 100> Hsp_align-len <= 200

$ cat filter.js

/** keep hsp having 100>= Hsp_align-len <= 300 */
function rmhsps()
    {
    var hsps = hit.getHitHsps().getHsp();
    var i=0;
    while(i< hsps.size())
        {
        var hsp = hsps.get(i);
        var hsplen = parseInt(hsp.getHspAlignLen());

        if( hsplen < 100 || hsplen > 300 )
            {
            hsps.remove(i);
            }
        else
            {
            i++;
            }
        }
    return true;
    }
rmhsps();

$ java -jar dist/blastfilterjs.jar -f filter.js blastn.xml 2> /dev/null |\
    xmllint --format - | grep -F 'Hsp_align-len'

     <Hsp_align-len>289</Hsp_align-len>
     <Hsp_align-len>291</Hsp_align-len>
     <Hsp_align-len>197</Hsp_align-len>
     <Hsp_align-len>227</Hsp_align-len>
ADD COMMENT
1
Entering edit mode
8.4 years ago
mkulecka ▴ 360

If you are familiar with python and BioPython, you can try this:

from Bio.Blast import NCBIXML

length_thresh=your_desired_alignment_length_threshold
result=open("your_blast_result.xml")
blast_records=NCBIXML.parse(result)
blast_records=list(blast_records)
for alignment in blast_records.alignments:
    for al in alignment.alignments:
        if al.length > length_thresh:
            print al.title

This should give you a list of alignments that match desired length. However, I agree with 5heikki's comment that 0.3 e-value threshold is not very strict. Also this solution is not memory efficient if you have many blast records.

ADD COMMENT
0
Entering edit mode

Thank you for this script! I'm having a few problems with it, however. I'm getting the following error:

Traceback (most recent call last):
  File "XML_filter2.py", line 6, in <module>
    for alignment in blast_records.alignments:
AttributeError: 'generator' object has no attribute 'alignments'

I know that the parse function should give back an alignments object, but I can't figure out what's going wrong. Do you have any idea what the problem is? If I remove the ".alignments" it gives me back the same error except for attribute "length". I'm on BioPython 1.6.5

ADD REPLY
1
Entering edit mode

Sorry, I think it was a mistake on my part - there should be read instead of parse like this:

blast.records=NCBIXML.read(result)

I fixed this in my original reply.

ADD REPLY
0
Entering edit mode

This gives me a different error:

Traceback (most recent call last):
  File "XML_filter2.py", line 5, in <module>
    blast_records=NCBIXML.read(result)
  File "/apps/python/2.7.6/lib/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 542, in read
    raise ValueError("More than one record found in handle")
ValueError: More than one record found in handle

It looks like parse() is the right function because it can deal with multiple records.

Here's my code in case I'm making some dumb mistake:

from Bio.Blast import NCBIXML

length_thresh=100
result=open("/directory/filename.xml")
blast_records=NCBIXML.read(result)
for alignment in blast_records.alignments:
    if alignment.length > length_thresh:
        print aligment.title

Again, thank you for the help. I appreciate you taking the time to look at this.

ADD REPLY
0
Entering edit mode

Sorry, my BioPython knowledge is a bit rusty. Parsing was actually a good option and then you have to convert the parser into a list. If you have many records, it won't be memory efficient.

ADD REPLY

Login before adding your answer.

Traffic: 2591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6