I recently discovered the staxids field (staxid is not a thing with diamond). I'm trying to assign taxonomy identifiers to all of my ORFs, but I'm encountering many hits where staxids lists 2 or more identifiers (sometimes many more).
What is the recommended way for picking the "best" one? I don't want to randomly choose one, grab the first, etc. Is there a systematic way I can do this that is robust? Maybe the one that is the "most reliable"?
Here's an example record below. I can't rely on a regex search for a suffix in the stitle because not all of them have this suffix.
qseqid       NODE_100002_length_1286_cov_2.42892_1132_1285_-
sseqid       WP_021626941.1
pident       94.4
length       18
mismatch     1
gapopen      0
qstart       1
qend         18
sstart       110
send         127
evalue       0.22
bitscore     44.3
staxids      1227265;1227266
sscinames    Capnocytophaga sp. oral taxon 863;Capnocytophaga sp. oral taxon 863 str. F0517
stitle       WP_021626941.1 hypothetical protein [Capnocytophaga sp. oral taxon 863]
Name: 6422120, dtype: object
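
To make the ambiguity concrete, here is a minimal sketch that pairs each taxonomy ID in the record above with its scientific name. It assumes both fields are semicolon-delimited and listed in the same order, which is what the example suggests but is not something I have confirmed. The function name pair_taxids_with_names is hypothetical, and the sketch only exposes the candidates; it does not pick a "best" one.

```python
from typing import List, Tuple


def pair_taxids_with_names(staxids: str, sscinames: str) -> List[Tuple[int, str]]:
    """Pair each taxonomy ID with its scientific name.

    Assumption: both fields are semicolon-delimited and in the same order.
    """
    ids = [int(t) for t in staxids.split(";") if t]
    names = [n for n in sscinames.split(";") if n]
    if len(ids) != len(names):
        # Fail loudly rather than silently mis-pairing IDs and names.
        raise ValueError(f"{len(ids)} taxids but {len(names)} names")
    return list(zip(ids, names))


# Values copied from the example record.
pairs = pair_taxids_with_names(
    "1227265;1227266",
    "Capnocytophaga sp. oral taxon 863;"
    "Capnocytophaga sp. oral taxon 863 str. F0517",
)
for taxid, name in pairs:
    print(taxid, name)
```

In this record the two candidates are a species-level entry and a strain of that same species, so the question is which of the two levels to keep.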