Question

How to filter short length blast results from output

1

Entering edit mode

8.4 years ago

Pryce Michener ▴ 10

I'm currently trying to work with a set of plant nucleotide contigs that are all greater than 300 bp in length. I'm trying to use NCBI blast to identify the contigs, but I'm running into a problem because my outputs tend to be flooded with ~50 bp long sequences in my contigs that match with random database sequences that definitely aren't real. For example, my last blast returned a large amount of ~50 bp sequences that matched with zebrafish, human, C. diff, and others. I upped the E value to be extremely strict (0.3), but I'm still getting these artifacts, and I feel like I'm losing a lot of quality results because I'm not working with a model system. I've been searching on Google and through the help file for a way to filter these short matches, but I haven't found anything. Does anyone have any advice on how I could clean these output files up?

blast • 7.8k views

ADD COMMENT • link updated 8.4 years ago by Siva ★ 1.9k • written 8.4 years ago by Pryce Michener ▴ 10

1

Entering edit mode

0.3 is not a strict e-value. Try something like 1e-50. That could be considered strict.

ADD REPLY • link 8.4 years ago by 5heikki 11k

Ram · Answer 1 · 2015-11-17

7

Entering edit mode

8.4 years ago

Siva ★ 1.9k

You can restrict the BLAST results to only those HSPs that match at least a certain percentage of your query length (coverage) by using qcov_hsp_perc option.

-qcov_hsp_perc <Real, 0..100>
Percent query coverage per hsp

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Siva ★ 1.9k

Ram · Answer 2 · 2015-11-16

I wrote a tool to filter a XML blast output with a javascript expression: https://github.com/lindenb/jvarkit/wiki/BlastFilterJS

Example: Filter Hit having <Hit_len> <1500

$ java -jar dist/blastfilterjs.jar blastn.xml  -e 'parseInt(hit.getHitLen())<1500' 2> /dev/null |\
  xmllint --format - | grep "Hit_len"

   <Hit_len>1492</Hit_len>
   <Hit_len>1488</Hit_len>
   <Hit_len>1477</Hit_len>
   <Hit_len>1452</Hit_len>
   <Hit_len>1430</Hit_len>
   <Hit_len>1064</Hit_len>
   <Hit_len>1283</Hit_len>
   <Hit_len>1052</Hit_len>
   <Hit_len>1272</Hit_len>
   <Hit_len>693</Hit_len>

Example 2: keep hsp having 100> Hsp_align-len <= 200

$ cat filter.js

/** keep hsp having 100>= Hsp_align-len <= 300 */
function rmhsps()
    {
    var hsps = hit.getHitHsps().getHsp();
    var i=0;
    while(i< hsps.size())
        {
        var hsp = hsps.get(i);
        var hsplen = parseInt(hsp.getHspAlignLen());

        if( hsplen < 100 || hsplen > 300 )
            {
            hsps.remove(i);
            }
        else
            {
            i++;
            }
        }
    return true;
    }
rmhsps();

$ java -jar dist/blastfilterjs.jar -f filter.js blastn.xml 2> /dev/null |\
    xmllint --format - | grep -F 'Hsp_align-len'

     <Hsp_align-len>289</Hsp_align-len>
     <Hsp_align-len>291</Hsp_align-len>
     <Hsp_align-len>197</Hsp_align-len>
     <Hsp_align-len>227</Hsp_align-len>

Ram · Answer 3 · 2015-11-16

1

Entering edit mode

8.4 years ago

mkulecka ▴ 360

If you are familiar with python and BioPython, you can try this:

from Bio.Blast import NCBIXML

length_thresh=your_desired_alignment_length_threshold
result=open("your_blast_result.xml")
blast_records=NCBIXML.parse(result)
blast_records=list(blast_records)
for alignment in blast_records.alignments:
    for al in alignment.alignments:
        if al.length > length_thresh:
            print al.title

This should give you a list of alignments that match desired length. However, I agree with 5heikki's comment that 0.3 e-value threshold is not very strict. Also this solution is not memory efficient if you have many blast records.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by mkulecka ▴ 360

0

Entering edit mode

Thank you for this script! I'm having a few problems with it, however. I'm getting the following error:

Traceback (most recent call last):
  File "XML_filter2.py", line 6, in <module>
    for alignment in blast_records.alignments:
AttributeError: 'generator' object has no attribute 'alignments'

I know that the parse function should give back an alignments object, but I can't figure out what's going wrong. Do you have any idea what the problem is? If I remove the ".alignments" it gives me back the same error except for attribute "length". I'm on BioPython 1.6.5

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Pryce Michener ▴ 10

1

Entering edit mode

Sorry, I think it was a mistake on my part - there should be read instead of parse like this:

blast.records=NCBIXML.read(result)

I fixed this in my original reply.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by mkulecka ▴ 360

0

Entering edit mode

This gives me a different error:

Traceback (most recent call last):
  File "XML_filter2.py", line 5, in <module>
    blast_records=NCBIXML.read(result)
  File "/apps/python/2.7.6/lib/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 542, in read
    raise ValueError("More than one record found in handle")
ValueError: More than one record found in handle

It looks like parse() is the right function because it can deal with multiple records.

Here's my code in case I'm making some dumb mistake:

from Bio.Blast import NCBIXML

length_thresh=100
result=open("/directory/filename.xml")
blast_records=NCBIXML.read(result)
for alignment in blast_records.alignments:
    if alignment.length > length_thresh:
        print aligment.title

Again, thank you for the help. I appreciate you taking the time to look at this.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Pryce Michener ▴ 10

0

Entering edit mode

Sorry, my BioPython knowledge is a bit rusty. Parsing was actually a good option and then you have to convert the parser into a list. If you have many records, it won't be memory efficient.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by mkulecka ▴ 360