Hi ...
I'm trying to parse a blast tabular format file (outfmt 6) to access it's individual elements for further manipulation .
I need to retrieve the qseqid, qstart, qend, qlen, qseq, sseqid, sstart, send, slen, sseq
and pident
for each hit
and here's the script :
blast_qresult=SearchIO.parse('Aegilops_Brachypodium.txt','blast-tab') for i,blast_hit in enumerate(blast_qresult) : for m,blast_hsp in enumerate(blast_hit): for n in range(len(blast_hsp)): print("Query ID: %s"%blast_hit.id) print("Query start: %s"%blast_hsp[n].query_start) print("Query end: %s"%blast_hsp[n].query_end) print("Query seq: %s"%blast_hsp[n].query) print("Hit ID: %s"%blast_hit[m].id) print("Hit start: %s"%blast_hsp[n].hit_start) print("Hit end: %s"%blast_hsp[n].hit_end) print("Hit seq: %s"%blast_hsp[n].hit) print("Identity: %s"%blast_hsp[n].ident_pct)
Till here it does produce some results , and halts at certain hit throwing this error :
builtins.ValueError: Hit 'BRADI2G54940.1' already present in this QueryResult.
And when I try to access the other individual values like :
print(blast_hsp[n].seq_len) print(blast_hit[m].seq_len)
They give these errors respectively :
builtins.AttributeError: 'HSP' object has no attribute 'seq_len' builtins.AttributeError: 'Hit' object has no attribute 'seq_len'
My questions are :
1. How can I access the qlen and slen attributes ?
2. Why does
print("Query seq: %s"%blast_hsp[n].query) print("Hit seq: %s"%blast_hsp[n].hit)
gives this output :
Query seq:None
Hit seq:None
How can I retrieve the sequences for both query and hit ?!
3. Both
print(blast_hit[m].id) print(blast_hit[n].id)
gives the sseqid , which one of them is more accurate ?
4. How can I overcome the error that complains about redundant Hit ID's in the tabular report ?!
5. I noticed that the qstart and sstart returns values decremented by one ( values -1 ) indices ... why is that ? Is it safe to use them for manipulation as is or shall I increment them by 1 ?!
I would be grateful for any help ... thanks .
What version of Biopython do you have? Can you share the data file causing the problem? e.g. using http://gist.github.com
I'm using Biopython 1.64 .
here's a portion of the file that's originally contains 13072 hits : https://gist.github.com/Baraa-1989/ad2453e7e62f4740f791
edit : modified the portion of the file where the hit causing the error locates .
That sample does not include the problematic identifier
BRADI2G54940.1
so probably won't help :(Speaking in general ... how can I overcome this problem ?
edit : updated the file that contains the hit in question .
I can now see
BRADI2G54940.1
in the file once. Can you run this at the command line to see where else it appears in the file?:grep "BRADI2G54940.1" your_blast_file
Perhaps also try with some neighbouring lines for context:
grep -C 2 "BRADI2G54940.1" your_blast_file
I didn't have to run that command ... the error was gone - suddenly - for this particular file Brachyoidium_Japonica.txt , but not for the Horedum_Japonica.txt !!
Previously , the interpreter raised the error complaining about the redundant hit ID BRADI2G54940.1 , but now blast produces the result containing one hit with BRADI2G54940.1 ID !!
BTW ... I haven't changed anything in my script lately !!
I'm really confused now ... what a strange blast behavior .
Would you please explain to me what that command is used for ?