I have been trying to get the best hits out of blast output files. I know that -v option or maxtargetseqs in standalone blast gives the best ones. But I already have a huge output of multiple queries which have to be sorted according to best cut-off values. I am not using XML parser as my output is already tabular.
Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506 Locus3035v2rpkm3.64 gi|87248160|gb|DQ311188.1| Bombyx mori hydroxymethylglutaryl-Co... 98.94 94 1 0 1 94 523 616 4e-37 165 Locus303v1rpkm4.98 gi|346706507|dbj|AK380484.1| Bombyx mori mRNA clone: ftes50G24 100.00 159 0 0 1 159 4395 4553 1e-74 288
This is the tabular output I have. I have more than one criterion to take. I am looking for hits such that my the alignment length should be high but no compromise on the sequence identity. E-value should be less than 1e-06, less gaps and mismatches.
I tried to introduce a score such that - score = (alignmentlength) / ( 101- %identity) s = L / (101 - I%)
I used 101 as if 100 is taken there can be undefined values if completely identical. But I am missing the point for alignment length here. How can I select such that all these criterion are followed to give best hits in the following order of preference -
- Alignment length
- less gaps and mismatches
- e-value 1e-06