Question: Can't access individual attributes of tabular outfmt 6 blast output while parsing
0
gravatar for Bara'a
4.0 years ago by
Bara'a220
Amman - Jordan
Bara'a220 wrote:

Hi ...

I'm trying to parse a blast tabular format file (outfmt 6) to access it's individual elements for further manipulation .

I need to retrieve the qseqid, qstart, qend, qlen, qseq, sseqid, sstart, send, slen, sseq and pident for each hit

and here's the script :

blast_qresult=SearchIO.parse('Aegilops_Brachypodium.txt','blast-tab')
for i,blast_hit in enumerate(blast_qresult) :
    for m,blast_hsp in enumerate(blast_hit):
        for n in range(len(blast_hsp)):
            print("Query ID: %s"%blast_hit.id)
            print("Query start: %s"%blast_hsp[n].query_start)
            print("Query end: %s"%blast_hsp[n].query_end)
            print("Query seq: %s"%blast_hsp[n].query)
            print("Hit ID: %s"%blast_hit[m].id)
            print("Hit start: %s"%blast_hsp[n].hit_start)
            print("Hit end: %s"%blast_hsp[n].hit_end)
            print("Hit seq: %s"%blast_hsp[n].hit)
            print("Identity: %s"%blast_hsp[n].ident_pct)

Till here it does produce some results , and halts at certain hit throwing this error :

builtins.ValueError: Hit 'BRADI2G54940.1' already present in this QueryResult.

And when I try to access the other individual values like :

print(blast_hsp[n].seq_len)

print(blast_hit[m].seq_len)

They give these errors respectively :

builtins.AttributeError: 'HSP' object has no attribute 'seq_len'

builtins.AttributeError: 'Hit' object has no attribute 'seq_len'

My questions are :

1. How can I access the qlen and slen attributes ?

2. Why does

print("Query seq: %s"%blast_hsp[n].query)

print("Hit seq: %s"%blast_hsp[n].hit)

gives this output :

Query seq: None 

Hit seq: None

How can I retrieve the sequences for both query and hit ?!

3. Both

print(blast_hit[m].id)

print(blast_hit[n].id)

gives the sseqid , which one of them is more accurate ?

4. How can I overcome the error that complains about redundant Hit ID's in the tabular report ?!

5. I noticed that the qstart and sstart returns values decremented by one ( values -1 ) indices ... why is that ? Is it safe to use them for manipulation as is or shall I increment them by 1 ?!

 

I would be grateful for any help ... thanks .

 
 
ADD COMMENTlink modified 4.0 years ago by bow780 • written 4.0 years ago by Bara'a220
1

What version of Biopython do you have? Can you share the data file causing the problem? e.g. using http://gist.github.com

ADD REPLYlink written 4.0 years ago by Peter5.7k

I'm using Biopython 1.64 .

here's a portion of the file that's originally contains 13072 hits : https://gist.github.com/Baraa-1989/ad2453e7e62f4740f791

 

edit : modified the portion of the file where the hit causing the error locates .

 
 
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Bara'a220
1

That sample does not include the problematic identifier BRADI2G54940.1 so probably won't help :(

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Peter5.7k

Speaking in general ... how can I overcome this problem ?

edit : updated the file that contains the hit in question .

 
 
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Bara'a220
1

I can now see BRADI2G54940.1 in the file once. Can you run this at the command line to see where else it appears in the file?:

grep "BRADI2G54940.1" your_blast_file

Perhaps also try with some neighbouring lines for context:

grep -C 2 "BRADI2G54940.1" your_blast_file

ADD REPLYlink written 4.0 years ago by Peter5.7k

I didn't have to run that command ... the error was gone - suddenly - for this particular file Brachyoidium_Japonica.txt , but not for the Horedum_Japonica.txt !!

Previously , the interpreter raised the error complaining about the redundant  hit ID BRADI2G54940.1 , but now blast produces the result containing one hit with BRADI2G54940.1 ID !!

BTW ... I haven't changed anything in my script lately !!

I'm really confused now ... what a strange blast behavior .

Would you please explain to me what that command is used for ?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Bara'a220
3
gravatar for Peter
4.0 years ago by
Peter5.7k
Scotland, UK
Peter5.7k wrote:

Bio.SearchIO is not intended to simply expose the BLAST tabular fields as they are (for that you could use the Python standard library module csv https://docs.python.org/2/library/csv.html or similar).

Rather Bio.SearchIO turns BLAST tabular and many other search results into a common abstracted object representation. For example, start/end values are converted into Python style interval counting (which is where the -1 on the start values comes from).

ADD COMMENTlink written 4.0 years ago by Peter5.7k

Thank you for explaining me how does python represent QueryResult objects of Bio.SearchIO .

In fact , I managed to get access to the attributes I wasn't able to retrieve in the previous code ... I've just invoked the customized fields option in parsing command :

blast_qresult=SearchIO.parse('Brachypodium_Japonica.txt','blast-tab',fields=['std','qlen', 'slen'])

and it does produce some results , but I'm not completely sure if they were accurate , since the following error remains unhandled :

builtins.ValueError: Hit 'OS01T0274800-01' already present in this QueryResult.
 
 
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Bara'a220
1

That list of fields should match whatever you asked BLAST+ to produce with the -outfmt 6 option (by default the standard 12 fields it uses by default). This is not a fields you are interested in!

ADD REPLYlink written 4.0 years ago by Peter5.7k
1

The output file seems to contain 14 columns, though. @Bara'a, could you paste the command you used to get this results? If you did indeed specify two extra columns (qlen and slen), then I think your `fields` argument is correct.

ADD REPLYlink written 4.0 years ago by bow780
1

Opps. I missed you used "std" which is short hand for the standard 12 columns.

ADD REPLYlink written 4.0 years ago by Peter5.7k
comline=NcbiblastxCommandline(cmd='blastn', query=query+".fasta", db=database+".db", evalue=0.01, outfmt='"6 std qlen slen"', max_target_seqs=1, out=outname )

blast_qresult=SearchIO.parse('Bracypodium_Japonica.txt','blast-tab',fields=['std','qlen','slen'])

But the problem now is not about accessing those attributes , it's about parsing them smoothly without duplicates as the latter error complains .

the max_target_seq used to temporarily overcome this problem , but after I missed up with the Query.py module it didn't work any more !!

Any suggestions ?

ADD REPLYlink written 4.0 years ago by Bara'a220

I've consulted my thesis supervisor for the latter error in the code , and suggested me to alter the Query.py module to stop raising the error !!

He asked me to comment the following lines in the append function in the module :

 # if a custom hit_key_function is supplied, use it to define th hit key
        if self._hit_key_function is not None:
            hit_key = self._hit_key_function(hit)
        else:
            hit_key = hit.id

        '''if hit_key not in self:
            self[hit_key] = hit
        else:
            raise ValueError("Hit '%s' already present in this QueryResult." %
                    hit_key)'''

of course , I was faint-hearted to miss up with the module structure ... and it was indeed a big mistake as it result in a very strange behavior !!

when I tried to re-run the code on a different blast tabular file , it gave me this again :

builtins.ValueError: Hit 'OS01T0274800-01' already present in this QueryResult.

beside :

builtins.SystemError: Parent module '' not loaded, cannot perform relative import

despite of rolling back the change on the module !!

 
 
ADD REPLYlink written 4.0 years ago by Bara'a220

I have noticed that printing those attributes directly to the screen does work , but not when I try to save them to some external file !!

Why is that ?

 
 
ADD REPLYlink written 4.0 years ago by Bara'a220

I'm not sure what you are describing here.

ADD REPLYlink written 4.0 years ago by Peter5.7k

Never mind @Peter ... the real question here was scattered by the process workflow !!

I will try to put things together , and come back with an appropriate solution soon - with GOD willing -  :)

 
 
ADD REPLYlink written 4.0 years ago by Bara'a220
3
gravatar for bow
4.0 years ago by
bow780
Netherlands
bow780 wrote:

I see some of the questions have been addressed in the comments, so I will respond to some that have not (paraphrased):

> 1. How can I access the qlen and slen attributes ?

Since `qlen` is a property of the whole query, accessing it can be done via `blast_qresults.seq_len` in your case. Similarly, `slen` can be accessed via `blast_hit.seq_len`. The full API documentation for the BLAST format in SearchIO is here: http://biopython.org/DIST/docs/api/Bio.SearchIO.BlastIO-module.html.

> 2. Why does ... does not give `None` attributes?

The parser can only parse data present in the input file of course. In your case, the tabular data does not store any sequence information, so the `query` and `hit` attributes of your `blast_hsp` object does not contain anything. If there were sequences, they would be `SeqRecord` objects. To add the sequence to your output, I think you need to add a custom column for this (or just use another format like BLAST XML).

> 3. Both ... gives the sseqid, which one of them is more accurate?

Did you get different results from them? They should all give the same answer (if retrieved from the same Hit object).

4. How can I overcome the error that complains about redundant Hit ID's in the tabular report ?!

I need to take a look into this first. I can't see the problematic hit (OS01T0274800-01) in your uploaded gist, but my guess is the database you have may have duplicate entries of OS01T0274800-01. If that is the case, then you will need to rebuilt your database without the duplicate and then rerun the query.

> 5. I noticed that the qstart and sstart returns values decremented by one ( values -1 ) indices ... why is that ? Is it safe to use them for manipulation as is or shall I increment them by 1 ?!

This one has been addressed by Peter :).

ADD COMMENTlink written 4.0 years ago by bow780
1

One of the quirks of makeblastdb is it currently tolerates duplicate identifiers in the input FASTA file when making a database, which results in ambiguous output - see http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html - I think this is an NCBI bug and hope they change it.

ADD REPLYlink written 4.0 years ago by Peter5.7k

If so , then I think I have to report this as a bug to the official NCBI site .

Hope they can fix this soon ... till then , I should figure something out to work around this obstacle , like writing a function to handle the duplicates in the blast result files .

Wish me luck with that :) 

ADD REPLYlink written 4.0 years ago by Bara'a220

If that was the problem then please do email the NCBI BLAST team (link to the blog post and this page too) to help them prioritise.
 

ADD REPLYlink written 4.0 years ago by Peter5.7k

I will as soon as I get my hands clear of the tasks I'm working on .

Thanks a lot @Peter ... I indeed appreciate your help .

 
 
ADD REPLYlink written 4.0 years ago by Bara'a220

@bow .. thank you for your comprehensive reply , it really helped a lot .

Regarding the commands :   

print(blast_hit[m].id)

print(blast_hit[n].id)

yes , they do give the same output .

But my question was : which command is more accurate , from a conceptual perspective?

Don't they refer to two different levels of hierarchy in the QueryResult object ?!

ADD REPLYlink written 4.0 years ago by Bara'a220

@bow ... I've just added the file that contains the redundant hit (OS01T0274800-01) : https://gist.github.com/Baraa-1989/e253fbe1252cf4c1002c

Please , note that this redundancy issue was raised previously by the file (Brachypodium_Japonica.txt) , and handled suddenly in a strange manner  on it's own without any interference from me of any kind neither by modifying the script nor by correcting the raw data , as I mentioned above in my reply to @Peter !!

But it raised again in this new blast query :(

Why do you think this is happening ?!

It's driving me crazy .. Uhhh  :s

ADD REPLYlink written 4.0 years ago by Bara'a220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1736 users visited in the last hour