[BioPython] Parse blast tabular output
1
5
Entering edit mode
3.9 years ago
Rox ★ 1.3k

Hi everyone,

I am trying to parse a blast result produced using outfmt 6 option.

I have made several tries, with iterators, without iterators... But each time it fails to parse my file.

Here some code that I try :

parser = argparse.ArgumentParser() parser.add_argument("blast_file", help="The path of the file containing blast result in xml format") args = parser.parse_args()

results = open(args.blast_file, "r")

blast_parser = NCBIStandalone.BlastParser() blast_records = blast_parser.parse(results)

for blast_record in blast_records:
E_VALUE_THRESH = 0.0004
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print('****Alignment****')
print('sequence:', alignment.title)
print('length:', alignment.length)
print('e value:', hsp.expect)
if len(hsp.query) > 75:
dots = '...'
else:
dots = ''
print(hsp.query[0:75] + dots)
print(hsp.match[0:75] + dots)
print(hsp.sbjct[0:75] + dots)


But then, it showed this error :

python parse_last_hit.py /media/loutre/SUZUKII/assembly/duplication_removal/2017/Blast/Contig_37_orf.txt
/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py:57: BiopythonDeprecationWarning: This module has been deprecated. Consider Bio.SearchIO for parsing BLAST output instead.
/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py:29: BiopythonDeprecationWarning: Bio.ParserSupport is now deprecated will be removed in a future release of Biopython.
"future release of Biopython.", BiopythonDeprecationWarning)
Traceback (most recent call last):
File "parse_last_hit.py", line 14, in <module>
blast_records = blast_parser.parse(results)
File "/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py", line 836, in parse
self._scanner.feed(handle, self._consumer)
File "/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py", line 118, in feed
File "/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py", line 320, in read_and_call_until
File "/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py", line 400, in safe_readline
raise ValueError("Unexpected end of stream.")
ValueError: Unexpected end of stream.


By googling it, I found that it may be a problem of the blast format issue (Problems With Biopython When Running The Ncbistandalone.Py Program)

I really don't want to go through XML, I can't allow it because I have a lot of Blast with huge sequences to do. Producing xml takes too much time and too much storage.

Does anyone know a way using Biopython to parse through blast result in tabular format ? Thanks a lot for your answers !

python biopython blast • 5.6k views
3
Entering edit mode

Have you tried using SearchIO with the blast tabular output? It's pretty straighforward. e.g:

blast_result = SearchIO.read(resultHandle, 'blast-tab')

print(blast_result[0][0])
start = blast_result[0][0].hit_start
end = blast_result[0][0].hit_end


Also, you can treat the outfmt 6 as just a plain delimited text file, so you don't actually need to use a specialised parser at all necessarily (e.g: https://github.com/MicroInfect/bioinfx/blob/master/blastfilterer.py )

0
Entering edit mode

SearchIO is a nice suggestion, though my blast result will contain a lot of raws, and the function read can work only with blast output with exactly one result.

But searchIO has also a parse function, which handle multi result blast file, but when I tried it didn't worked. I'm trying again see if you can understand the error.

0
Entering edit mode

Ah right I remember, parse from searchIO support only xml so that's why I'm not happy with it

0
Entering edit mode

About doing the parser myself without biopython, I already thought about that, But I was stucked at the part where I had to read powered numbers with python...

Like when you have something like this : 5e-43, how do you store this in python ? It's been a while since I didn't worked in python...

2
Entering edit mode

You can just use float("5e-43") to get your values converted.

0
Entering edit mode

Really ? Ow damn, I was stucked with this question for so long... Thanks a lot then I'm going to try ! it sounds dumb now

8
Entering edit mode
3.9 years ago

A long time ago I wrote a module for handling tabular BLAST output (-outfmt 6 / -m8). Maybe you will find it useful.

\gist

SAMPLE USAGE:

Filename: test.outfmt6

cgl|CAGL0A00187g    ecy|Ecym_8168   31.11   90  49  3   597 676 95  181 1e-04   42.7
cgl|CAGL0A00187g    ecy|Ecym_8168   36.36   44  28  0   594 637 235 278 7e-04   40.0
cgl|CAGL0A00187g    ecy|Ecym_4273   27.38   84  58  1   19  102 4055    4135    0.64    31.2
cgl|CAGL0A00209g    ecy|Ecym_1435   69.91   751 188 4   13  725 11  761 0.0 1082
cgl|CAGL0A00209g    ecy|Ecym_5414   25.84   89  41  3   431 508 117 191 7.0 27.3
cgl|CAGL0A00231g    ecy|Ecym_1436   28.76   845 398 32  2   789 7   704 4e-56    203
cgl|CAGL0A00231g    ecy|Ecym_6400   34.18   79  48  3   590 666 418 494 0.034   35.0
cgl|CAGL0A00231g    ecy|Ecym_4243   39.02   41  25  0   554 594 222 262 2.1 29.3


from blast import parse

fh = open('test.outfmt6')
for blast_record in parse(fh):
print('query id: {}'.format(blast_record.qid))
for hit in blast_record.hits:
for hsp in hit:
print('****Alignment****')
print('sequence:', hsp.sid)
print('length:', hsp.length)
print('e value:', hsp.evalue)
fh.close()


Output:

query id: cgl|CAGL0A00187g
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 90)
('e value:', 0.0001)
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 44)
('e value:', 0.0007)
****Alignment****
('sequence:', u'ecy|Ecym_4273')
('length:', 84)
('e value:', 0.64)
query id: cgl|CAGL0A00209g
****Alignment****
('sequence:', u'ecy|Ecym_1435')
('length:', 751)
('e value:', 0.0)
****Alignment****
('sequence:', u'ecy|Ecym_5414')
('length:', 89)
('e value:', 7.0)
query id: cgl|CAGL0A00231g
****Alignment****
('sequence:', u'ecy|Ecym_1436')
('length:', 845)
('e value:', 4e-56)
****Alignment****
('sequence:', u'ecy|Ecym_6400')
('length:', 79)
('e value:', 0.034)
****Alignment****
('sequence:', u'ecy|Ecym_4243')
('length:', 41)
('e value:', 2.1)

1
Entering edit mode

That is awesome !

I was doing the exact same steps you did before I realised that Biopython may already implemented it... But yours is way better, I think I may use it a lot ! thanks again

0
Entering edit mode

Very Helpful! Is there any way we can sort based on query coordinates (i.e hsp.qid, hspp.qstart and hsp.qend) using your this module?

0
Entering edit mode

Unfortunately you can't sort the blast records by query coordinate. However, you can sort it via Linux sort. For example, sort -k 1,1 -k 7,7n blast.txt will sort hsps by hsp.qstart.