Question: Best Hits From Blast Tabular Output Of Multiple-Queries
gravatar for Rohit
7.9 years ago by
Rohit1.4k wrote:


I have been trying to get the best hits out of blast output files. I know that -v option or maxtargetseqs in standalone blast gives the best ones. But I already have a huge output of multiple queries which have to be sorted according to best cut-off values. I am not using XML parser as my output is already tabular.

Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
Locus3035v2rpkm3.64 gi|87248160|gb|DQ311188.1| Bombyx mori hydroxymethylglutaryl-Co... 98.94 94 1 0 1 94 523 616 4e-37 165
Locus303v1rpkm4.98 gi|346706507|dbj|AK380484.1| Bombyx mori mRNA clone: ftes50G24 100.00 159 0 0 1 159 4395 4553 1e-74 288

This is the tabular output I have. I have more than one criterion to take. I am looking for hits such that my the alignment length should be high but no compromise on the sequence identity. E-value should be less than 1e-06, less gaps and mismatches.

I tried to introduce a score such that - score = (alignmentlength) / ( 101- %identity) s = L / (101 - I%)

I used 101 as if 100 is taken there can be undefined values if completely identical. But I am missing the point for alignment length here. How can I select such that all these criterion are followed to give best hits in the following order of preference -

  1. Alignment length
  2. %identity
  3. less gaps and mismatches
  4. e-value 1e-06



ADD COMMENTlink modified 7.9 years ago by a.zielezinski9.3k • written 7.9 years ago by Rohit1.4k

Play around with awk, sort and sed. It can be easily achieved. Alternatively try biopython.

ADD REPLYlink written 7.9 years ago by Pappu1.9k

I don't get it. Why not to take first hit from each BLAST output? Hits are sorted by decreasing score which depends on the alignment length: the higher score, the longer the alignment.

ADD REPLYlink written 7.9 years ago by a.zielezinski9.3k

Yes but the top scoring alignment may not align the whole query sequence. In that case, one has to select several alignments (Threading). It all depends on the research question to address.

ADD REPLYlink written 7.9 years ago by Pappu1.9k
gravatar for a.zielezinski
7.9 years ago by
a.zielezinski9.3k wrote:

In Salse et al. 2009 they introduced two parameters to allow the identification of the best BLAST alignment - the highest cumulative percentage of identity in the longest cumulative length.

The first parameter, cumulative identity percentage (CIP) corresponds to the cumulative percentage of sequence identity obtained for all the HSPs (CIP = [∑ ID by HSP/AL] × 100) where AL is the sum of all hsp lengths.

The second parameter, CALP, is the cumulative alignment length percentage which represents the sum of the HSP lengths (AL) for all the HSPs divided by the length of the query sequence (CALP = AL/query).

ADD COMMENTlink written 7.9 years ago by a.zielezinski9.3k

This scoring I ought to try. Thank you for the advice.

Think taking the HSP and the alignments makes a huge difference when trying to work on my data.

Regards, Rohit

ADD REPLYlink written 7.9 years ago by Rohit1.4k
gravatar for Pavel Senin
7.9 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

Maybe there is an easier way, but what I do in similar situations - is to dump my BLAST output into MySQL at first using BioPerl. Secondly, I use Java/MyBATIS to iterate over query sequences spawning custom best hit search threads over the DB. I have some code around, but you will need to adjust it for your own needs.

Overall, I found that database provide a lot of advantage when used this way - imposing some structure over flat BLAST output - it enables you to slice and dice those huge BLAST outputs any way you like, so it might be worth investing into some code and SQL stuff.

ADD COMMENTlink written 7.9 years ago by Pavel Senin1.9k

Thank you.

I will try SQL and PHP when I make a db for my blast hits. That idea was unique btw.

ADD REPLYlink written 7.9 years ago by Rohit1.4k

Thank you. Well, in DB world, they successfully solved the problem of structuring and interactively querying large datasets with some "crazy" questions. SQL can do magic in real time - taking away most of that bioinformatics shell scripting. But yes, one need resources to run DB engine and to get some knowledge/experience before getting comfortable.

ADD REPLYlink written 7.9 years ago by Pavel Senin1.9k
gravatar for Manu Prestat
7.9 years ago by
Manu Prestat4.0k
Lyon, France
Manu Prestat4.0k wrote:

The bit-score takes into account the other criteria you want to consider in that task so I'm wondering why you want to make it more difficult. I could understand you need a min alignment length for post-analysis purpose though, so why don't you filter with awk your results according to a min length and after you keep only the best hits (using the score)?

Awk syntax to keep hits with at least 200 aligned residues:

awk '$4>=200' file.blast

For the latter, I made a little script:

I made a more elaborate one intended to do the same task for hmmer table outputs, with which you can choose a min score cutoff. As you can choose your score column in the options, it should work on a blast output with -c 12 .

In case you want to use your home-made score (which seems weird to me as, again, it's a combination of 2 non independent criteria), you can compute it for every hit (awk is once again your best friend) and insert it as a new column, and use -c new_column_number to keep the best hit based on that criterium.

As mentioned in the source, both of my scripts assume the queries are grouped together (which I think is a "classical" blast or hmmscan output, but not hmmsearch for instance). If you prefer using e-values instead of the score, you would have to modify my script because one increases as the other decreases. However, if you find out what score corresponds to your e-value in your specific search, you can you that score and then obtain what you want.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Manu Prestat4.0k

I did not try that home-made score, but I guess using the alignment and HSP will help a lot.

But your work is bound to help me in the latter stages too. Thank you for the code. I have a small-script which will help in extracting according to the various cut-offs we need. I have no idea where to post it though.

Regards, Rohit

ADD REPLYlink written 7.9 years ago by Rohit1.4k

I am a programmer analphabet. I can not figure out from the script that if it is able to sort out from a hmmscan output table file only the top HMMer model hit. I want to sort my proteins in to clusters this is the reason why I want to get rid of the query proteins lower hits.  

ADD REPLYlink written 6.1 years ago by h.botond40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2163 users visited in the last hour