Question

Unranked nodes in NCBI's taxonomic tree

3

7.3 years ago

pignottisimone ▴ 30

Hello to everyone!

I am using NCBI's taxonomic information to evaluate a metagenomics tool, but I noticed that many nodes are not ranked (their rank is 'no rank'). To be precise, this is the distribution of the leaves for the bacterial RefSeq database:

2570 leaves in total;
221 leaves with rank species;
3 leaves with rank subspecies;
1363 leaves with no rank but with an ancestor ranked as species or subspecies;
983 leaves have no species ancestor (their nearest ranked ancestor is a genus or family etc.).

Now, let's suppose I have a read simulated from the last group. When calculating the assignment precision at the species level (this could be extended to other ranks and to internal nodes too), how can I decide if the assignment is correct at the species level or not? How do mainstream tools deal with this kind of situation?

Let's take a read from the penultimate group instead. Am I right to consider it correctly assigned at the species level, since it has an ancestor with rank species? I would consider it so because one invariant stated in NCBI's documentation is that the descendants of a node cannot have a higher rank than the node itself.
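To make the two cases concrete, here is a minimal sketch of the check described above: walk up from a node to its nearest species-ranked ancestor, and report "undefined" when the true taxon has none (the last group). The toy tree, its taxids, and the function names are all made up for illustration; the layout (taxid mapped to parent and rank, root pointing to itself) mirrors NCBI's nodes.dmp.

```python
# Toy taxonomy: taxid -> (parent taxid, rank). All IDs are invented for illustration.
NODES = {
    1: (1, "no rank"),    # root points to itself, as in NCBI's nodes.dmp
    10: (1, "genus"),
    20: (10, "species"),
    21: (20, "no rank"),  # unranked leaf below a species (penultimate group)
    30: (10, "no rank"),  # unranked leaf directly below a genus (last group)
}

def ancestor_at_rank(taxid, rank, nodes=NODES):
    """Return the node itself or its nearest ancestor with the given rank, else None."""
    while True:
        parent, node_rank = nodes[taxid]
        if node_rank == rank:
            return taxid
        if parent == taxid:  # reached the root without finding the rank
            return None
        taxid = parent

def species_call(true_taxid, assigned_taxid, nodes=NODES):
    """Return 'correct', 'incorrect', or 'undefined' (truth has no species ancestor)."""
    truth = ancestor_at_rank(true_taxid, "species", nodes)
    if truth is None:
        return "undefined"
    assigned = ancestor_at_rank(assigned_taxid, "species", nodes)
    # An assignment above species level has no species ancestor and counts as
    # 'incorrect' here; one could equally report it as 'unassigned at this rank'.
    return "correct" if assigned == truth else "incorrect"
```

With this toy tree, a read from leaf 21 assigned to 20 or 21 is "correct", while any read from leaf 30 is "undefined" at the species level, which is exactly the case I do not know how to score.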

Thank you in advance for your help.

Simone

genome taxonomy rank metagenomics kraken • 2.5k views

2

I have written some tools to go through the tree and toss out junk nodes. What are they? Well...

Some nodes are labeled things like "environmental sample". Some are no rank. Some have, say, a species classification but the parent is "life" with no middle nodes.

These are all worse than useless, because when you BLAST things, they might be the best match, but give no useful information... so I use them to remove the corresponding sequences from nt. That way, BLASTing will hit the next-closest sequence instead, which might actually be informative.

Generally, I don't mind nodes with just one level missing (say, subspecies and genus but no species, or missing family only). They're annoying, but don't seem to cause major problems unless you happen to be studying things at that specific taxonomic level.

Additionally - NCBI gives many nodes strange, archaic, and often equivalent tax levels like "tribe" and "subfamily". I promote, demote, and remove those as necessary to only leave these canonical levels:

NO_RANK=0, SUBSPECIES=1, SPECIES=2, GENUS=3, FAMILY=4, ORDER=5, CLASS=6, PHYLUM=7, KINGDOM=8, DOMAIN=9, LIFE=10

Well, NO_RANK just gets discarded. But it's much easier to deal with taxonomy when it kind of fits a schema rather than needing to constantly deal with unnecessary oddball taxonomic levels that some crazy guy invented and somehow got published.
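As a rough illustration of the idea (not the actual BBMap code, which is written in Java and also promotes and demotes ranks), the sketch below only covers the simplest case: dropping every lineage entry whose rank is not one of the canonical levels. The `Level` enum copies the numbering above; the `CANONICAL` mapping from NCBI rank strings, the function name, and the example lineage are assumptions made for this example.

```python
from enum import IntEnum

class Level(IntEnum):
    NO_RANK = 0
    SUBSPECIES = 1
    SPECIES = 2
    GENUS = 3
    FAMILY = 4
    ORDER = 5
    CLASS = 6
    PHYLUM = 7
    KINGDOM = 8
    DOMAIN = 9
    LIFE = 10

# Assumed mapping from NCBI rank strings to canonical levels.
# "superkingdom" is treated as the domain level here.
CANONICAL = {
    "subspecies": Level.SUBSPECIES,
    "species": Level.SPECIES,
    "genus": Level.GENUS,
    "family": Level.FAMILY,
    "order": Level.ORDER,
    "class": Level.CLASS,
    "phylum": Level.PHYLUM,
    "kingdom": Level.KINGDOM,
    "superkingdom": Level.DOMAIN,
    "domain": Level.DOMAIN,
}

def canonical_lineage(lineage):
    """lineage: list of (name, ncbi_rank), leaf to root. Keep canonical levels only."""
    return [(name, CANONICAL[rank]) for name, rank in lineage if rank in CANONICAL]

# Hypothetical lineage with an unranked leaf and a "tribe" node in the middle.
example = [
    ("strain X", "no rank"),
    ("Genus species", "species"),
    ("Genus", "genus"),
    ("Some tribe", "tribe"),
    ("Some family", "family"),
]
kept = canonical_lineage(example)
```

Because the levels are integers, a check like `Level.SPECIES < Level.GENUS` is enough to test whether one node sits below another in the schema.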

7.3 years ago by Brian Bushnell 20k

0

Hi Brian. Do you happen to have a repository for the node pruning tool? It sounds useful.

7.3 years ago by Steven Lakin ★ 1.8k

1

Yep, it's part of the BBMap package. The usage is documented in /bbmap/docs/guides/TaxonomyGuide.txt. Unfortunately I don't have a mode currently for reading NCBI's tree and then dumping out a similar representation of the modified tree, but it just occurred to me when replying to this thread that it might be useful, so I'll add one.

Generally I do sequence filtering with the "filterbytaxa.sh" tool, which allows you to remove sequences not defined at a specific taxonomic level, or with taxonomic names that fit some pattern (like "environmental sample"). Currently it requires the sequences to be named in the traditional NCBI format with "gi|1234|etc" and does gi->taxid conversions, but it will soon support accession numbers as well. Accession numbers are already supported in TaxServer (taxserver.sh).

JGI runs TaxServer internally for its taxonomy lookups; it has a nifty interface which lets you submit an accession, gi number, or taxid, and it will return the full taxonomic information (the tax node and all ancestors up to domain, including name and tax id) in JSON format. You can do this in a browser or via curl or whatever. Very soon it will be externally-facing as well, which I think is really exciting, but I don't have a timeline for that as it depends on how long it takes NERSC to assign me a port. So far it's been a month... hopefully not too much longer? :) But, you can run it yourself if you want.

7.3 years ago by Brian Bushnell 20k

score 4 · Answer 1 · 2017-01-13

In short, I think your method is fine. Do proceed with caution in terms of making comparisons though; the main problem with unranked nodes is that they aren't comparable across the taxonomic tree. This is an issue for all who are developing or testing microbiome classifiers, and there likely isn't a consensus on how to handle it. Here's how I've been handling it for our tool development and data analysis:

For evaluation of accuracy metrics:

Calculate sensitivity/specificity on only the true taxonomic ranks (KPCOFGS). To do this, aggregate counts from lower levels up through the annotation structure, such that a given phylum contains all counts from its child nodes, and so on. While the unranked nodes don't have a true "ranking" and can't be compared across branches of the annotation graph, they still have a hierarchical relationship (they have parent and sometimes child nodes). Therefore, in order for the comparison across branches to be accurate, I stick to the true taxonomic levels when calculating these metrics. Note that some of these ranks are also unfilled (they do not exist) for some "organisms" such as the viruses (example: Cytomegalovirus); for these I still build the standard phylogenetic structure but do not include their counts at a taxonomic level that isn't present (I call phylum as NA for Cytomegalovirus, for example, and ignore its count for phyla, but still track it at the other levels where it is classified).
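A minimal sketch of that aggregation, under the assumption that the tree is held as a taxid -> (parent, rank) mapping with the root pointing to itself: every count is added to each KPCOFGS-ranked node on the path to the root, and a taxon with no ancestor at a given rank simply contributes nothing at that rank. The function name and the toy tree are made up for illustration and are not taken from any particular tool.

```python
from collections import defaultdict

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def aggregate_by_rank(counts, nodes):
    """counts: taxid -> reads assigned directly to that node.
    nodes: taxid -> (parent taxid, rank), root pointing to itself.
    Returns rank -> {taxid: reads assigned at or below that taxon}."""
    totals = {rank: defaultdict(int) for rank in RANKS}
    for taxid, n in counts.items():
        node = taxid
        while True:
            parent, rank = nodes[node]
            if rank in totals:       # unranked and non-KPCOFGS nodes are skipped
                totals[rank][node] += n
            if parent == node:       # reached the root
                break
            node = parent
    return totals

# Toy tree (invented IDs): branch 6-7 has no phylum, like the virus example above.
toy_nodes = {
    1: (1, "no rank"),
    2: (1, "phylum"), 3: (2, "genus"), 4: (3, "species"), 5: (4, "no rank"),
    6: (1, "genus"), 7: (6, "species"),
}
toy_counts = {5: 10, 4: 5, 3: 2, 7: 4}
toy_totals = aggregate_by_rank(toy_counts, toy_nodes)
```

In the toy data, the 4 reads on species 7 show up at the species and genus levels but are absent from the phylum totals, which is the "phylum is NA" behavior described above.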

For normalization of count data in analysis:

This one is slightly more tricky, because these normalization techniques are typically based on probability distributions that assume we are recording actual counts from sample draws. Any normalization should therefore be performed on the raw assignment matrix (the counts generated directly from the classifier, including unranked nodes). Aggregation can then be performed to sum the counts up through the annotation graph. Be careful when using some of these programs, such as metagenomeSeq: it calculates library size from the raw counts, so if you extract the normalized counts, you also have to save the raw library sizes. When you put the normalized counts back into its experiment objects, it will recalculate library size on the normalized counts, which leads to inaccurate p-values. Thankfully there is usually a way to pass raw library sizes into the modeling equations for downstream analysis.
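The order of operations can be sketched as follows. This is not metagenomeSeq (which is an R package with its own normalization); it uses plain total-sum scaling as a stand-in, purely to show the sequence: record raw library sizes, normalize the raw matrix, then aggregate. All function names and the toy data are assumptions for this example.

```python
def library_sizes(raw):
    """raw: sample -> {taxid: raw count}. Library size = total raw reads per sample."""
    return {sample: sum(counts.values()) for sample, counts in raw.items()}

def total_sum_scale(raw, sizes, scale=1000):
    """Stand-in normalization: scale each count by the sample's raw library size."""
    return {
        sample: {taxid: n * scale / sizes[sample] for taxid, n in counts.items()}
        for sample, counts in raw.items()
    }

def roll_up(counts, parent, keep):
    """Sum counts into the nearest node (self or ancestor) in `keep`; drop the rest."""
    out = {}
    for taxid, n in counts.items():
        node = taxid
        while node not in keep and parent[node] != node:
            node = parent[node]
        if node in keep:
            out[node] = out.get(node, 0) + n
    return out

# Toy data (invented IDs): 5 is an unranked node below species 4.
parent = {1: 1, 3: 1, 4: 3, 5: 4, 6: 1, 7: 6}
species = {4, 7}
raw = {"s1": {4: 30, 5: 10, 7: 60}, "s2": {4: 5, 7: 5}}

sizes = library_sizes(raw)                 # 1. save raw library sizes first
normalized = total_sum_scale(raw, sizes)   # 2. normalize the raw matrix
by_species = {s: roll_up(c, parent, species) for s, c in normalized.items()}  # 3. aggregate
```

The point of keeping `sizes` around is that any downstream model should receive these raw totals rather than totals recomputed from `normalized` or `by_species`.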

score 0 · Answer 2 · 2017-01-17

Thank you very much for your answers, I have found them very useful. I think the best option for my purposes is to build a custom tree that includes only nodes with a known rank, and then simulate reads over the species of this tree. This way I don't have to ignore assignments to unranked nodes; I believe ignoring them might skew the precision calculation, since my approach is to ignore assignments at higher levels, and unranked nodes are an unknown that could lead to mistakes. I am glad to have confirmation that I am not the only one who has to deal with this problem. I found it odd not to find any explanation on the web. Thank you, again.
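For anyone wanting to try the same thing, a minimal sketch of the pruning step might look like the following, assuming the tree is a taxid -> (parent, rank) mapping with the root as its own parent. Unranked nodes are dropped and each ranked node is re-attached to its nearest ranked ancestor (or to the root). The function name and toy tree are invented for illustration.

```python
def prune_unranked(nodes, root=1):
    """nodes: taxid -> (parent taxid, rank). Return a tree with only ranked nodes
    plus the root, each ranked node attached to its nearest ranked ancestor."""
    def ranked_ancestor(taxid):
        parent = nodes[taxid][0]
        # Skip over unranked ancestors until a ranked one or the root is reached.
        while parent != root and nodes[parent][1] == "no rank":
            parent = nodes[parent][0]
        return parent

    pruned = {root: (root, nodes[root][1])}
    for taxid, (_, rank) in nodes.items():
        if taxid != root and rank != "no rank":
            pruned[taxid] = (ranked_ancestor(taxid), rank)
    return pruned

# Toy tree (invented IDs) with unranked nodes 3, 5 and 6.
toy = {
    1: (1, "no rank"),
    2: (1, "genus"), 3: (2, "no rank"), 4: (3, "species"), 5: (4, "no rank"),
    6: (1, "no rank"), 7: (6, "species"),
}
pruned = prune_unranked(toy)
```

In the pruned toy tree, species 4 hangs directly off genus 2 and species 7 off the root, so species with no ranked genus still survive; whether to keep those is a separate choice.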