How to get "best" taxon identifier from diamond output with staxids?
2.4 years ago
O.rka

I recently discovered the staxids field in diamond (diamond has no singular staxid field). I'm trying to assign taxonomy identifiers to all of my ORFs, but I'm encountering many hits that list 2 or more taxon IDs (sometimes many more).

What is the recommended way for picking the "best" one? I don't want to randomly choose one, grab the first, etc. Is there a systematic way I can do this that is robust? Maybe the one that is the "most reliable"?

Here's the example output below. I can't use a regex search for [] in the stitle because not all of them have this suffix.

qseqid                                      NODE_100002_length_1286_cov_2.42892_1132_1285_-
sseqid                                                                       WP_021626941.1
pident                                                                                 94.4
length                                                                                   18
mismatch                                                                                  1
gapopen                                                                                   0
qstart                                                                                    1
qend                                                                                     18
sstart                                                                                  110
send                                                                                    127
evalue                                                                                 0.22
bitscore                                                                               44.3
staxids                                                                     1227265;1227266
sscinames    Capnocytophaga sp. oral taxon 863;Capnocytophaga sp. oral taxon 863 str. F0517
stitle              WP_021626941.1 hypothetical protein [Capnocytophaga sp. oral taxon 863]
Name: 6422120, dtype: object
What do you mean by the best one? Both point to the same genus. If you are relying on an 18 AA long hit, that is likely not enough to give you absolute confidence.

That's a good point! I hadn't realized this is one of the shorter ORF calls. So in this case, would it have mapped equally well to 1227265 and 1227266?
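
One systematic option, instead of picking a single ID, is to collapse the staxids list to its lowest common ancestor (LCA). The row above shows a single alignment to one subject (WP_021626941.1), and staxids lists the taxa attached to that subject, so as far as I understand both IDs share the same alignment statistics rather than being two competing hits. Below is a minimal Python sketch of the LCA idea. The helper names (parse_staxids, lineage, lowest_common_ancestor) and the tiny parent map are my own assumptions for illustration; a real parent map would be loaded from the NCBI taxonomy nodes.dmp file, and the parent links shown here are guessed from the scientific names, not looked up.

```python
def parse_staxids(field):
    """Split a diamond staxids string such as '1227265;1227266' into ints."""
    return [int(t) for t in str(field).split(";") if t.strip()]


def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive.

    `parent` maps child taxid -> parent taxid; the root maps to itself
    (as the root does in NCBI's nodes.dmp) or is simply missing.
    """
    path = [taxid]
    seen = {taxid}
    while True:
        up = parent.get(path[-1], path[-1])
        if up in seen:  # reached the root (or a cycle in bad data)
            return path
        path.append(up)
        seen.add(up)


def lowest_common_ancestor(taxids, parent):
    """Return the deepest taxon shared by the lineages of all taxids."""
    if not taxids:
        return None
    common = set(lineage(taxids[0], parent))
    for t in taxids[1:]:
        common &= set(lineage(t, parent))
    # Walk up from the first taxid; the first shared node is the deepest one.
    for node in lineage(taxids[0], parent):
        if node in common:
            return node
    return None


# Toy parent map (ASSUMPTION, not real NCBI data): the strain is placed
# under the species based on the names alone, and all intermediate ranks
# between the species and the root (taxid 1) are skipped.
toy_parent = {1: 1, 1227265: 1, 1227266: 1227265}

taxids = parse_staxids("1227265;1227266")
best = lowest_common_ancestor(taxids, toy_parent)
print(best)
```

With this toy map the result is the less specific of the two IDs, which matches the intuition that a strain-level and a species-level label for the same subject should collapse to the species. For weak hits like the 18 AA one, the LCA is only as trustworthy as the alignment itself, so a length or e-value filter beforehand is probably still worth having. I believe diamond can also do LCA assignment itself through its taxonomic classification output format (--outfmt 102) when the database was built with taxonomy files, but check the manual for the exact requirements.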


