Question: How to filter short length blast results from output
gravatar for Pryce Michener
3.6 years ago by
Memphis, TN
Pryce Michener10 wrote:

I'm currently trying to work with a set of plant nucleotide contigs that are all greater than 300 bp in length. I'm trying to use NCBI blast to identify the contigs, but I'm running into a problem because my outputs tend to be flooded with ~50 bp long sequences in my contigs that match with random database sequences that definitely aren't real. For example, my last blast returned a large amount of ~50 bp sequences that matched with zebrafish, human, C. diff, and others. I upped the E value to be extremely strict (0.3), but I'm still getting these artifacts, and I feel like I'm losing a lot of quality results because I'm not working with a model system. I've been searching on Google and through the help file for a way to filter these short matches, but I haven't found anything. Does anyone have any advice on how I could clean these output files up?

blast • 2.3k views
ADD COMMENTlink modified 3.6 years ago by Siva1.6k • written 3.6 years ago by Pryce Michener10

0.3 is not a strict e-value. Try something like 1e-50. That could be considered strict.

ADD REPLYlink written 3.6 years ago by 5heikki8.4k
gravatar for Siva
3.6 years ago by
United States
Siva1.6k wrote:

You can restrict the BLAST results to only those HSPs that match at least a certain percentage of your query length (coverage) by using 'qcov_hsp_perc' option.

-qcov_hsp_perc <Real, 0..100>
Percent query coverage per hsp


ADD COMMENTlink written 3.6 years ago by Siva1.6k
gravatar for Pierre Lindenbaum
3.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I wrote a tool to filter a XML blast output with a javascript expression:

Example: Filter Hit having <Hit_len> <1500

$ java -jar dist/blastfilterjs.jar blastn.xml  -e 'parseInt(hit.getHitLen())<1500' 2> /dev/null |\
  xmllint --format - | grep "Hit_len"

Example 2: keep hsp having 100> Hsp_align-len <= 200

$ cat filter.js

/** keep hsp having 100>= Hsp_align-len <= 300 */
function rmhsps()
    var hsps = hit.getHitHsps().getHsp();
    var i=0;
    while(i< hsps.size())
        var hsp = hsps.get(i);
        var hsplen = parseInt(hsp.getHspAlignLen());
        if( hsplen < 100 || hsplen > 300 )
    return true;
$ java -jar dist/blastfilterjs.jar -f filter.js blastn.xml 2> /dev/null |\
    xmllint --format - | grep -F 'Hsp_align-len'

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Pierre Lindenbaum120k
gravatar for mkulecka
3.6 years ago by
European Union
mkulecka300 wrote:

If you are familiar with python and BioPython (, you can try this:

from Bio.Blast import NCBIXML

for alignment in blast_records.alignments:
    for al in alignment.alignments:
        if al.length > length_thresh:
            print al.title

This should give you a list of alignments that match desired length. However, I agree with 5heikki's comment that 0.3 e-value threshold is not very strict. Also this solution is not memory efficient if you have many blast records. 




ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by mkulecka300

Thank you for this script! I'm having a few problems with it, however. I'm getting the following error:

Traceback (most recent call last):
  File "", line 6, in <module>
    for alignment in blast_records.alignments:
AttributeError: 'generator' object has no attribute 'alignments'

I know that the parse function should give back an alignments object, but I can't figure out what's going wrong. Do you have any idea what the problem is? If I remove the ".alignments" it gives me back the same error except for attribute "length". I'm on BioPython 1.6.5

ADD REPLYlink written 3.6 years ago by Pryce Michener10

Sorry, I think it was a mistake on my part - there should be read instead of parse like this:

I fixed this in my original reply. 

ADD REPLYlink written 3.6 years ago by mkulecka300

This gives me a different error:

Traceback (most recent call last):
  File "", line 5, in <module>
  File "/apps/python/2.7.6/lib/python2.7/site-packages/Bio/Blast/", line 542, in read
    raise ValueError("More than one record found in handle")
ValueError: More than one record found in handle

It looks like parse() is the right function because it can deal with multiple records.

Here's my code in case I'm making some dumb mistake:

from Bio.Blast import NCBIXML

for alignment in blast_records.alignments:
    if alignment.length > length_thresh:
        print aligment.title

Again, thank you for the help. I appreciate you taking the time to look at this.

ADD REPLYlink written 3.6 years ago by Pryce Michener10

Sorry, my BioPython knowledge is a bit rusty. Parsing was actually a good option and then you have to convert the parser into a list. If you have many records, it won't be memory efficient. 

ADD REPLYlink written 3.6 years ago by mkulecka300
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1406 users visited in the last hour