Question

Best Hits From Blast Tabular Output Of Multiple-Queries

4

Entering edit mode

11.3 years ago

Rohit ★ 1.5k

Hello.

I have been trying to get the best hits out of blast output files. I know that -v option or maxtargetseqs in standalone blast gives the best ones. But I already have a huge output of multiple queries which have to be sorted according to best cut-off values. I am not using XML parser as my output is already tabular.

Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
Locus3035v2rpkm3.64 gi|87248160|gb|DQ311188.1| Bombyx mori hydroxymethylglutaryl-Co... 98.94 94 1 0 1 94 523 616 4e-37 165
Locus303v1rpkm4.98 gi|346706507|dbj|AK380484.1| Bombyx mori mRNA clone: ftes50G24 100.00 159 0 0 1 159 4395 4553 1e-74 288

This is the tabular output I have. I have more than one criterion to take. I am looking for hits such that my the alignment length should be high but no compromise on the sequence identity. E-value should be less than 1e-06, less gaps and mismatches.

I tried to introduce a score such that - score = (alignmentlength) / ( 101- %identity) s = L / (101 - I%)

I used 101 as if 100 is taken there can be undefined values if completely identical. But I am missing the point for alignment length here. How can I select such that all these criterion are followed to give best hits in the following order of preference -

Alignment length
%identity
less gaps and mismatches
e-value 1e-06

Regards,
Rohit

statistics blast • 12k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 11.3 years ago by Rohit ★ 1.5k

0

Entering edit mode

Play around with awk, sort and sed. It can be easily achieved. Alternatively try biopython.

ADD REPLY • link 11.3 years ago by Pappu ★ 2.1k

0

Entering edit mode

I don't get it. Why not to take first hit from each BLAST output? Hits are sorted by decreasing score which depends on the alignment length: the higher score, the longer the alignment.

ADD REPLY • link 11.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Yes but the top scoring alignment may not align the whole query sequence. In that case, one has to select several alignments (Threading). It all depends on the research question to address.

ADD REPLY • link 11.3 years ago by Pappu ★ 2.1k

score 7 · Answer 1 · 2013-01-02

7

Entering edit mode

11.3 years ago

Andrzej Zielezinski 11k

In Salse et al. 2009 they introduced two parameters to allow the identification of the best BLAST alignment - the highest cumulative percentage of identity in the longest cumulative length.

The first parameter, cumulative identity percentage (CIP) corresponds to the cumulative percentage of sequence identity obtained for all the HSPs (CIP = [∑ ID by HSP/AL] × 100) where AL is the sum of all hsp lengths.

The second parameter, CALP, is the cumulative alignment length percentage which represents the sum of the HSP lengths (AL) for all the HSPs divided by the length of the query sequence (CALP = AL/query).

ADD COMMENT • link 11.3 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

This scoring I ought to try. Thank you for the advice.

Think taking the HSP and the alignments makes a huge difference when trying to work on my data.

Regards, Rohit

ADD REPLY • link 11.3 years ago by Rohit ★ 1.5k

score 1 · Answer 2 · 2013-01-02

1

Entering edit mode

11.3 years ago

Pavel Senin ★ 1.9k

Maybe there is an easier way, but what I do in similar situations - is to dump my BLAST output into MySQL at first using BioPerl. Secondly, I use Java/MyBATIS to iterate over query sequences spawning custom best hit search threads over the DB. I have some code around, but you will need to adjust it for your own needs.

Overall, I found that database provide a lot of advantage when used this way - imposing some structure over flat BLAST output - it enables you to slice and dice those huge BLAST outputs any way you like, so it might be worth investing into some code and SQL stuff.

ADD COMMENT • link 11.3 years ago by Pavel Senin ★ 1.9k

1

Entering edit mode

Thank you.

I will try SQL and PHP when I make a db for my blast hits. That idea was unique btw.

ADD REPLY • link 11.3 years ago by Rohit ★ 1.5k

1

Entering edit mode

Thank you. Well, in DB world, they successfully solved the problem of structuring and interactively querying large datasets with some "crazy" questions. SQL can do magic in real time - taking away most of that bioinformatics shell scripting. But yes, one need resources to run DB engine and to get some knowledge/experience before getting comfortable.

ADD REPLY • link 11.3 years ago by Pavel Senin ★ 1.9k

Ram · Answer 3 · 2013-01-02

The bit-score takes into account the other criteria you want to consider in that task so I'm wondering why you want to make it more difficult. I could understand you need a min alignment length for post-analysis purpose though, so why don't you filter with awk your results according to a min length and after you keep only the best hits (using the score)?

Awk syntax to keep hits with at least 200 aligned residues:

awk '$4>=200' file.blast

For the latter, I made a little script:

http://code.google.com/p/bioman/source/browse/BlastBestHit.py

I made a more elaborate one intended to do the same task for hmmer table outputs, with which you can choose a min score cutoff. As you can choose your score column in the options, it should work on a blast output with -c 12 .

http://code.google.com/p/bioman/source/browse/HMMerBestHit.py

In case you want to use your home-made score (which seems weird to me as, again, it's a combination of 2 non independent criteria), you can compute it for every hit (awk is once again your best friend) and insert it as a new column, and use -c new_column_number to keep the best hit based on that criterium.

As mentioned in the source, both of my scripts assume the queries are grouped together (which I think is a "classical" blast or hmmscan output, but not hmmsearch for instance). If you prefer using e-values instead of the score, you would have to modify my script because one increases as the other decreases. However, if you find out what score corresponds to your e-value in your specific search, you can you that score and then obtain what you want.