However, I'm wondering what to do when you have multiple HSPs in one hit: would you rank by the total summed HSP bit score for a hit, or by the highest-scoring HSP within a hit? The first seems like it could boost hits with many low-scoring HSPs, whereas the second seems like it might need additional filtering on the length of the hit/alignment in case you get a really high-scoring HSP that has tiny coverage. Any best practices here, or am I misunderstanding something?
Good question, but there is no easy, straightforward answer, I'm afraid.
There are a number of things to take into account here:
If you add up the bit scores of all the HSPs you will often "overcount": HSPs in protein searches can overlap each other, so you would double-count those regions in the final bit score.
Taking the 'best HSP' is not a bad approach. Given that you work with protein sequences, you will have fewer cases of split alignments (with nucleotides that happens more often), so the best-scoring HSP will in most cases be a continuous stretch of alignment.
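As a rough sketch of the 'best HSP' idea, assuming your results are in BLAST tabular format with the default 12 columns (`-outfmt 6`, where column 4 is the alignment length and column 12 the bit score), something like this would do. The function name `best_hsp_per_hit` is just made up for illustration; adjust the column indices if you used a custom `-outfmt`.

```python
def best_hsp_per_hit(lines):
    """Return {(query, subject): (bitscore, alignment_length)} of the top HSP.

    Assumes default tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment lines written by -outfmt 7.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        key = (cols[0], cols[1])
        aln_len = int(cols[3])
        bitscore = float(cols[11])
        # Keep only the highest-scoring HSP seen so far for this hit.
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, aln_len)
    return best
```

Keeping the alignment length next to the bit score makes it easy to filter out hits whose best HSP covers only a tiny part of the query afterwards.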
If you want super accurate results and have time to do some scripting, the best way is the summed bit score approach. Here you need to take into account that you can only add up non-overlapping regions (you need to sort of re-create the full alignment from the given HSPs). If, on the other hand, you want reliable results but don't want to spend much time on it, go for the best HSP approach. In the vast majority of cases this will be an excellent approximation (certainly for protein sequences), and it can be parsed directly and efficiently from the original BLAST output.
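For the summed approach, here is one possible sketch of "only add up non-overlapping regions". Note the assumptions: it uses a simple greedy pick (best-scoring HSP first, then any HSP that overlaps nothing already kept on either query or subject), which is a shortcut and not a proper re-creation of the full alignment. It does not check that the kept HSPs are in a consistent order. The names `overlaps` and `summed_nonoverlapping_bitscore` and the tuple layout are made up for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    a_lo, a_hi = sorted((a_start, a_end))
    b_lo, b_hi = sorted((b_start, b_end))
    return a_lo <= b_hi and b_lo <= a_hi


def summed_nonoverlapping_bitscore(hsps):
    """Greedily sum bit scores of mutually non-overlapping HSPs of ONE hit.

    hsps: list of (qstart, qend, sstart, send, bitscore) tuples.
    Returns (total_bitscore, kept_hsps).
    """
    kept = []
    total = 0.0
    # Visit HSPs from highest to lowest bit score.
    for hsp in sorted(hsps, key=lambda h: h[4], reverse=True):
        q1, q2, s1, s2, score = hsp
        clash = any(
            overlaps(q1, q2, k[0], k[1]) or overlaps(s1, s2, k[2], k[3])
            for k in kept
        )
        if clash:
            continue  # would double-count a region already covered
        kept.append(hsp)
        total += score
    return total, kept
```

Dropping an overlapping HSP entirely is conservative: it undercounts a little rather than overcounts. Trimming the overlap and rescaling the score would be closer to the "re-create the alignment" idea, but takes more scripting.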