However, I'm wondering what to do when you have multiple HSP's in one hit - would you rank by the total summed HSP bit score for a hit, or by the highest scoring HSP within a hit? The first seems like it could boost lower bit scores if they're present at a high frequency, whereas the second seems like it might need additional filtering for length of the hit/alignment in case you get a really high hit that has tiny coverage. Any best practices here/am i misunderstanding something?
Good question, not an easy straightforward answer I'm afraid.
There are a number of things to take into account here:
If you add up all the bit score of the HSPs you will often "overcount", HSPs in protein can often overlap each other and as such you will double count those regions in the final bitscore.
Taking the 'best HSP' is not a bad approach, given that you work with protein sequences you will have less occasions of split alignments (with nucleotides you have that more) , the best scoring HSPs will thus in most cases a continuous stretch of alignment.
If you want super accurate results and have time to do some scripting to get it , the best way is the adding up bitscore approach. Here you need to take into account that you can only add up non-overlapping regions (you need to sort of re-create the full alignment using the given HSPs). If on the other hand you want reliable result but don't want to spend much time on it go for the best HSP approach. This will be in the vast majority of cases an excellent approximation (certainly for protein sequences) and can get parsed directly from the original blast output efficiently.
Perfect, thanks for your answer!
Just to clarify, would you check for overlapping of the hsp from the perspective of both the query and database sequence? So if there are two database subsequences that overlap and matched 2 separated query subsequences, you would count this as an overlap. And if there are two query subsequences that overlap and have matched 2 separated database subsequences, you would count this as an overlap also.
If I understand correctly, no . You should only consider those on a per query/hit pair .
And I start to get the feeling that I might have interpreted your initial question not fully correctly.
In my answer I only consider (and you should too) multiple HSPs hits within a single query/hit pair. If you have a query that hits several DB entries, then you only consider the best scoring pair and neglect all others.
Thanks for the clarification, I think I was a bit unclear - I am talking about HSPs in a single query/hit pair.
So for example, if you have a single query sequence and a single hit sequence (
H, say 100 aa's long), and there is a HSP for
H[30:40]. Then a second HSP for
H[41:51]- this is overlapping on the query but not the hit. Then a third HSP for
H[45:55]- this is overlapping on the hit but not the query. Would you consider HSPs two and three to both be overlapping?
Please let me know if this doesn't make sense, I may have gone too far down the rabbit hole :P
I don't know how good this method is but if you are specifically looking for homology then this may be of interest.
What logic are you using to decide on homology in your current implementation using just HSP?
I had a look at Rubble - it's a nice idea, thanks for sharing! I think it might be a bad idea for me as I'd want to compare all possible orthologs and it seems like the sequence clustering they do might filter out weak homologous (and therefore orthologous) relationships.
In terms of the logic, the original plan was to get the best or total non-overlapping bit-score for a given hit, then compare that against all the other hits' best (or total) bit score, then assign a homolog relationship to the best hit (i.e. say it is homologous to the query). I would then reverse this and make the database homologs my query, and my entire query a database, then repeat the exercise. Any pairs that both have the other as the strongest homolog are reciprocal best hits / orthologous to some degree, rest will just be labelled as homologs. Is that what you meant by logic?
The reason I'm starting to lean towards counting them up is because I've got one hit with 46 hsp's, this might all be overlapping though. Also, I fell like counting all HSPs might be good for longer proteins as it might be more representative of the whole sequence and you might be more likely to get a random hsp in a longer protein...
If you are after homologs then you also need to consider the overall length/coverage of query/hits in the comparison. You may have great HSP hits for shared domains and such but the rest of the protein may be something else. I guess this may get flagged in your reverse search but then you would have done a lot of work by that point.
Just thinking aloud on a Friday. Not sure what species you are working with incorporating
OrthoFinder(LINK) may be useful.
ha, you clearly are pointing to the difficulties when doing the add up bitscore approach ;)
not sure I (or anyone) can answer this.
You might start to see between the lines I would go for the other approach, take the best HSP, and call it a day
as GenoMax also points out, you can also go for a tool that does this. another one I would suggest is InParanoid (works basically on a RBH blast analysis)
Yea I probably will end up doing that X) InParanoid looks good though - might be overkill for me but it's nifty! Thanks for sharing, I'll have a proper read tomorrow :D