Dear community, I wish you all a fruitful day.

I am trying to answer this question: I have a sequence and want to figure out which organism it belongs to. So I BLAST it and obtain a list of hits. The top 10 hits are from species a, with an average bit score of 500, followed by 25 hits from species b, c, and d, which belong to the same genus A, with an average bit score of 400, then 40 hits from genera B, C, D, and so on. At the end of the 500-hit table there are some oddities with bit scores between 50 and 100. This is typical of what everyone does every day.

Can we name one taxonomic group that best answers the question "which organism does this sequence belong to?"

Species a is the highest-scoring match, genus A is broader but safer, and family AA and order AAA are broader still. Which one should I choose? Maybe we can take the lowest common ancestor of the top n hits, or of all hits with bit score ≥ x. But what should n and x be? And what should I do if there are outliers X, Y, and Z?
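To make the lowest-common-ancestor idea concrete, here is a minimal sketch. It assumes each hit already carries its full lineage as a list ordered from the broadest rank down to species, and it expresses the cutoff x as a fraction of the top bit score rather than as an absolute value. That relative cutoff, the function names, and the input layout are all assumptions of this sketch, not an established method.

```python
from typing import List, Tuple

# Hypothetical input layout: (bit_score, lineage), with the lineage ordered
# from the broadest rank down to species, e.g. ["AAA", "AA", "A", "a"].
Hit = Tuple[float, List[str]]


def lowest_common_ancestor(lineages: List[List[str]]) -> List[str]:
    """Return the longest lineage prefix shared by every input lineage."""
    if not lineages:
        return []
    shared = []
    for taxa in zip(*lineages):      # walk the ranks in parallel
        if len(set(taxa)) != 1:      # first rank where the hits disagree
            break
        shared.append(taxa[0])
    return shared


def assign_by_lca(hits: List[Hit], min_fraction_of_top: float = 0.9) -> List[str]:
    """LCA of all hits whose bit score is at least min_fraction_of_top
    times the best bit score (an assumed, relative form of the cutoff x)."""
    if not hits:
        return []
    top = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits
            if score >= min_fraction_of_top * top]
    return lowest_common_ancestor(kept)


# Toy hits mirroring the example in the question (names are placeholders).
hits = [
    (500.0, ["AAA", "AA", "A", "a"]),
    (480.0, ["AAA", "AA", "A", "a"]),
    (400.0, ["AAA", "AA", "A", "b"]),
    (80.0, ["ZZZ", "ZZ", "Z", "z"]),
]
print(assign_by_lca(hits, 0.9))   # only the species-a hits pass the cutoff
print(assign_by_lca(hits, 0.75))  # species b is included, so the call stops at genus A
```

A stricter cutoff gives a more specific call, and a looser one lets low-scoring outliers pull the assignment up the tree (or to no assignment at all), which is exactly the n-and-x trade-off asked about here.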

I guess there are many things to consider when one assigns taxonomy based on multiple matches: taxonomy tree, number of hits per group, average bit score, chance of contamination, etc.

I wonder if there is already an algorithm out there that perfectly solves this problem?

If the top 10 hits are from the same species (and if the hits show good identity/conservation across the full length of the sequence), then that should be reasonably good evidence that the sequence is indeed homologous and could well belong to that species.

written 2.8 years ago by genomax (65k)

Thanks for your comments. I understand and agree with your reasoning. It is hard to solve this statistically or automatically, though. For example, what if the top 10 hits consist of 9 Escherichia coli and 1 Salmonella, with almost equally high bit scores? Should I call it E. coli or Enterobacteriaceae?

written 2.8 years ago by qiyunzhu (130)

I assume you have many such samples, so you would need to decide upfront what fraction of the hits (say, 50% or more) must come from one organism before you call it that, and when you should instead take this up one level to the family (and look for, say, 80% or more of the hits from the same family).
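A rule like this could be sketched as follows. The function name, the lineage layout (family, genus, species), and the example thresholds are assumptions for illustration only, and the Salmonella species label is a placeholder.

```python
from collections import Counter
from typing import List, Optional, Tuple


def assign_by_vote(lineages: List[List[str]],
                   thresholds: List[Tuple[int, float]]) -> Optional[str]:
    """Try each (rank_index, min_fraction) pair in order, most specific rank
    first, and return the first taxon whose share of all hits reaches the
    required fraction. Return None if no rank qualifies."""
    if not lineages:
        return None
    for rank_index, min_fraction in thresholds:
        names = [lin[rank_index] for lin in lineages if len(lin) > rank_index]
        if not names:
            continue
        name, count = Counter(names).most_common(1)[0]
        if count / len(lineages) >= min_fraction:
            return name
    return None


# The 9 E. coli + 1 Salmonella case; lineages are (family, genus, species).
top_hits = (
    [["Enterobacteriaceae", "Escherichia", "Escherichia coli"]] * 9
    + [["Enterobacteriaceae", "Salmonella", "Salmonella sp."]]
)

# Species needs >= 50% of hits, otherwise fall back to family at >= 80%.
print(assign_by_vote(top_hits, [(2, 0.5), (0, 0.8)]))

# A stricter species threshold pushes the same data up to the family.
print(assign_by_vote(top_hits, [(2, 0.95), (0, 0.8)]))
```

With the 50% rule, the 9-to-1 case is called Escherichia coli; demanding 95% agreement at species level instead yields Enterobacteriaceae. The thresholds decide the answer, which is why they have to be fixed before looking at the data.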

There isn't an easy answer, and this would be a moving target, since new sequences will have been added to the database each time you repeat the search.

modified 2.8 years ago • written 2.8 years ago by genomax (65k)

I see. Maybe I should design an algorithm myself...

