Dear community, I wish you all a fruitful day.
I am trying to answer this question: I have a sequence and want to figure out which organism it belongs to. So I BLAST it and obtain a list of hits. The top 10 hits are from species a, with an average bit score of 500, followed by 25 hits from species b, c, and d, which belong to the same genus A, with an average bit score of 400, then 40 hits from genera B, C, D, and so on. At the end of the 500-hit table there are some weirdos with bit scores between 50 and 100. This is typically what everyone does every day.
Can we name one taxonomic group that best answers the question "which organism does this sequence belong to?"
Species a is the highest-scoring match, genus A is broader but safer, and family AA and order AAA are broader still. Which one should I choose? Maybe we can take the lowest common ancestor of the top n hits, or of all hits with bit score ≥ x, but what should n and x be? Also, what should we do if there are outliers X, Y, and Z?
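To make the lowest-common-ancestor idea concrete, here is a minimal Python sketch of what I have in mind. It assumes each hit already comes with its full lineage, listed from the broadest rank down to species. The function names, the `min_bitscore` and `top_n` parameters, and the toy hit table are all made up for illustration; this is not the method of any particular tool.

```python
from typing import List, Optional, Sequence, Tuple

# A hit is (bit score, lineage from broadest rank down to species).
Hit = Tuple[float, Sequence[str]]


def lowest_common_ancestor(lineages: List[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the ranks in parallel, stopping at the shortest lineage.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return shared


def lca_of_hits(hits: List[Hit],
                min_bitscore: float = 0.0,
                top_n: Optional[int] = None) -> List[str]:
    """LCA of the hits with bit score >= min_bitscore, optionally only the top n."""
    kept = sorted((h for h in hits if h[0] >= min_bitscore),
                  key=lambda h: h[0], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    return lowest_common_ancestor([lineage for _, lineage in kept])


# Toy data (hypothetical): one hit per group instead of the full table.
hits = [
    (500.0, ["order AAA", "family AA", "genus A", "species a"]),
    (400.0, ["order AAA", "family AA", "genus A", "species b"]),
    (300.0, ["order AAA", "family AA", "genus B", "species e"]),
    (60.0, ["order XXX", "family XX", "genus X", "species x"]),
]

for x in (450, 350, 250, 0):
    print(x, lca_of_hits(hits, min_bitscore=x))
```

The answer moves from species a to genus A to family AA as x is lowered, and a single low-scoring outlier is enough to wipe out the assignment entirely when x = 0. That is exactly why the choice of n and x bothers me.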
I guess there are many things to consider when one assigns taxonomy based on multiple matches: taxonomy tree, number of hits per group, average bit score, chance of contamination, etc.
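One way I could imagine combining the number of hits per group with their bit scores is a score-weighted vote that walks down the taxonomy and stops as soon as no single taxon holds enough of the total bit score. The Python sketch below is only my own illustration of that idea: the function name `weighted_vote`, the `min_fraction` cutoff, and the toy hits are hypothetical and not taken from any published tool.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Hit = Tuple[float, Sequence[str]]  # (bit score, lineage from broadest rank to species)


def weighted_vote(hits: List[Hit], min_fraction: float = 0.8) -> List[str]:
    """Descend rank by rank, keeping the taxon that holds at least
    min_fraction of the total bit score of ALL hits; stop otherwise."""
    total = sum(score for score, _ in hits)
    assigned: List[str] = []
    if total <= 0:
        return assigned
    candidates = list(hits)
    depth = 0
    while True:
        # Sum bit scores per taxon at the current rank.
        weights = defaultdict(float)
        for score, lineage in candidates:
            if depth < len(lineage):
                weights[lineage[depth]] += score
        if not weights:
            break  # ran out of ranks
        best, best_weight = max(weights.items(), key=lambda kv: kv[1])
        if best_weight / total < min_fraction:
            break  # no taxon is supported strongly enough at this rank
        assigned.append(best)
        # Only hits inside the winning taxon compete at the next rank.
        candidates = [(s, l) for s, l in candidates
                      if depth < len(l) and l[depth] == best]
        depth += 1
    return assigned


# Toy data (hypothetical), including one low-scoring outlier.
hits = [
    (500.0, ["order AAA", "family AA", "genus A", "species a"]),
    (400.0, ["order AAA", "family AA", "genus A", "species b"]),
    (300.0, ["order AAA", "family AA", "genus B", "species e"]),
    (60.0, ["order XXX", "family XX", "genus X", "species x"]),
]

print(weighted_vote(hits, min_fraction=0.8))
print(weighted_vote(hits, min_fraction=0.7))
```

Unlike a strict lowest common ancestor, a weak outlier here does not erase the assignment because it contributes little to the total, but the cutoff is still an arbitrary knob, and contamination with a high-scoring hit would still mislead it.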
I wonder whether there is already an algorithm out there that perfectly solves this problem.