Hi,
I'm using biopython for the first time to analyze blastp results.
I understand how to extract the E-value, but not sure as to how to calculate the "query cover" and "ident"
fields that appear in the website gui. I would appreciate any help!
Thanks,
Efrat
Last time I checked 2 years ago, BioPython did not have a good built in function to calculate the query coverage from blast output. Bummer! (This is seemingly because BioPython doesn't have a function for tiling HSPs) BioPerl has (and maybe BioPython developers have added this in the last 2 years, or I simply couldn't find it ;). The crux is that the HSP need to be tiled correctly. If you see any solution based on a trivial calculation (query_end - query_start) / query_length, or even including gap-length:These solutions are incorrect because they are not tiling the HSPs correctly. (E.g. this one: https://codereview.stackexchange.com/questions/39879/calculate-query-coverage-from-blast-output)
Instead, the query coverage per subject or per hsp can be output using tabular format either with option qcovhsp or qcovs, e.g.:
-outfmt 6 std qcovs
Will output the coverage per subject as the last column.
Thank you Michael! I don't mind doing the calculation myself and not using a built in function, but I'm not sure I understand how these values are calculated - are you familiar with a reference that might help me?
There is no correct way of calculating the query coverage without tiling the HSPs first. The safest way would be to either use the BioPerl builtin function or the Blast tabular output.
I am linking the code that does this in BioPerl, note that it depends on the other methods
blast_record = NCBIXML.parse(open('myfile.xml', 'r'))
for query in blast_record:
for alignment in query.alignments:
for hsp in alignment.hsps:
print('coverage', hsp.align_length / query.query_length)
print('identitiy', hsp.identities/ hsp.align_length)
This is what I used for a project I am working on.
Coverage is NOT hsp.align_length in the numerator. There could be gaps in the query that consume the aligned sequence but not the query (think cigar in SAM). Instead coverage of query is (I am using the XML from Blast but should have similar names in BioPython) calculated by floor((Hsp_query-to - Hsp_query-from + 1)/BlastOutput_query-len). You need to calculate the length of the query that is aligned, not the length of the aligned sequence that is aligned for the numerator. Test HIT #5 for blastp for "P13773". The coverage should be 85% (337 residues consumed of the query/392 query length). The hsp.align_length is 349 from the aligned sequence (so there are 12 gaps in the coverage of query).
If I understand you correctly, you're attempting to read output from BLAST.
If you run BLAST locally, with output format 6 you'd be able to parse the e-value, ident field pretty smoothly.
I'm not running the BLAST locally, I'm using the remote option - is there any reason not to run it locally if I have the full nt database on my server?
Which output format are you using when you run blast? I believe that percent identity is in the list of default output fields. You can specify that blast also store the query coverage in the output.
For example -outfmt "6 std qcovs" would output the default blast fields (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) as well as query coverage, in that order. See the output of blastn/p/x -help for more information.
As for calculating it in biopython, it looks like qcovs is roughly equal to (hit_length - gap_open) / query_length. You'd need the lengths of your query sequences.
Last time I checked 2 years ago, BioPython did not have a good built in function to calculate the query coverage from blast output. Bummer! (This is seemingly because BioPython doesn't have a function for tiling HSPs) BioPerl has (and maybe BioPython developers have added this in the last 2 years, or I simply couldn't find it ;). The crux is that the HSP need to be tiled correctly. If you see any solution based on a trivial calculation
(query_end - query_start) / query_length
, or even including gap-length:These solutions are incorrect because they are not tiling the HSPs correctly. (E.g. this one: https://codereview.stackexchange.com/questions/39879/calculate-query-coverage-from-blast-output)Instead, the query coverage per subject or per hsp can be output using tabular format either with option qcovhsp or qcovs, e.g.:
Will output the coverage per subject as the last column.
see https://www.ncbi.nlm.nih.gov/books/NBK279675/
Thank you Michael! I don't mind doing the calculation myself and not using a built in function, but I'm not sure I understand how these values are calculated - are you familiar with a reference that might help me?
There is no correct way of calculating the query coverage without tiling the HSPs first. The safest way would be to either use the BioPerl builtin function or the Blast tabular output.
I am linking the code that does this in BioPerl, note that it depends on the other methods
http://doc.bioperl.org/bioperl-live/Bio/Search/Hit/GenericHit.html#CODE31
Thank you! That's very helpful. What about the identity? is there a way to calculate that?
(query_end - query_start + 1)