Question: Relative abundance (evenness) estimation/prediction in metagenomic datasets
5.5 years ago by
United Kingdom
Tim110 wrote:

It is a simple task to estimate the total richness (diversity) of a metagenomic sample (for example, by using the Chao1 estimator to get the total number of different species/OTUs that could be present in the sample). It is also very easy to calculate the relative abundances of species identified in a metagenomic dataset: just divide the number of sequences assigned to a specific species by the total number of sequences.
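To make the two calculations concrete, here is a minimal sketch. It assumes the bias-corrected form of Chao1, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of OTUs seen exactly once and twice; the OTU names and counts are made-up toy data, not results from a real sample.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-OTU sequence counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                        # number of OTUs actually seen
    f1 = sum(1 for c in observed if c == 1)      # singletons
    f2 = sum(1 for c in observed if c == 2)      # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def relative_abundances(otu_counts):
    """Relative abundance = sequences assigned to the OTU / total sequences."""
    total = sum(otu_counts.values())
    return {otu: count / total for otu, count in otu_counts.items()}


# Hypothetical toy data: 1000 sequences spread over 7 OTUs.
otu_counts = {
    "otu_a": 500, "otu_b": 300, "otu_c": 150, "otu_d": 46,
    "otu_e": 2, "otu_f": 1, "otu_g": 1,
}

richness = chao1(otu_counts.values())
abundances = relative_abundances(otu_counts)
print("Chao1 richness estimate:", richness)
print("Relative abundance of otu_a:", abundances["otu_a"])
```

With these toy counts, two singletons and one doubleton push the estimate only slightly above the 7 observed OTUs.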

I was wondering if there is a way to predict relative abundances of unidentified species (e.g., predict the relative abundance of the least abundant species of all species present in the sample according to Chao1 estimator).

5.5 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:


First, I would like to note that it is not so easy to get a reliable estimate of total sample diversity in metagenomic studies. Estimators like Chao1 were initially designed for experiments with dozens of niches and tens to hundreds of different species. In metagenomics we are dealing with millions of cells (sample size) and thousands of orthologous groups in the case of microbiome samples, or hundreds of thousands of variants in the case of immune repertoire profiling. In my experience, one should never forget that these diversity estimates are lower bounds. Sample more cells from the same individual and they could increase by an order of magnitude.

Now to your question. To estimate the fraction of the sample that belongs to unseen species, you can try the Good-Turing estimator f1 / n, where f1 is the number of singletons and n is the sample size, as suggested by Chao here. For TCR repertoire data this estimator works quite well. In T-cells, all the diversity of one's repertoire is concentrated in the naive T-cell pool, and as our recent work has shown, the fraction of singletons correlates well with the percentage of naive T-cells in blood as measured by flow cytometry.
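As a rough illustration of the f1 / n estimator, the sketch below computes the estimated total share of the community made up of species that were not observed at all. The function name and the counts are hypothetical, chosen only for the example.

```python
def good_turing_unseen_fraction(counts):
    """Estimate the total relative abundance of unseen species as f1 / n."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)                              # sample size (total sequences)
    if n == 0:
        raise ValueError("need at least one observed sequence")
    f1 = sum(1 for c in counts if c == 1)        # number of singletons
    return f1 / n


# Hypothetical per-OTU counts: 1000 sequences, two singletons.
example_counts = [500, 300, 150, 46, 2, 1, 1]
unseen = good_turing_unseen_fraction(example_counts)
print("Estimated unseen fraction:", unseen)
```

Note that this gives a single number for all unseen species combined, not a value per species.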

Still, I doubt that today's sequencing depth and accuracy are sufficient to correctly determine the minimal clone size from a metagenome sample.


Hi Mikhail, thank you very much for your helpful suggestion and the papers.

I agree that Chao1 and similar estimators (ACE) tend to underestimate true species richness, especially when sequencing depth is low; nevertheless, these estimators are widely accepted and used in metagenomic studies (Huber 2007). 

The Good-Turing estimator is almost what I wanted, but it gives the total relative abundance of all unobserved species in the community. Is it possible to estimate the individual relative abundances of unobserved species? In other words, let's assume that sequencing 1000 sequences from a metagenomic sample yields 6 different OTUs and Chao1 suggests 10 species/OTUs in the sample in total, so 4 of the 10 species remain unobserved. Is there a way to estimate the individual relative abundances of those unobserved species (the 7th, 8th, 9th and 10th)?

At the moment I have some preliminary data about my sample obtained using 16S pyrosequencing; the diversity is quite low (I found 180 different OTUs, and according to Chao1 there should be about 200 OTUs in total), and these results are in good agreement with literature values. The next step will be shotgun Illumina sequencing of my total metagenomic DNA. I don't want to spend more money on sequencing than really needed, so I am trying to assess the sequencing depth required to get at least 9x coverage of the genome of the least abundant species in my sample using a Lander-Waterman calculation:

C = (a × L × N) / G

where:
  • C stands for coverage
  • G is the haploid genome length
  • L is the read length
  • N is the number of reads
  • a is the relative abundance of the least abundant species
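A minimal sketch of this depth calculation, assuming the Lander-Waterman relation scaled by abundance, C = a × L × N / G, and solving it for N. The genome length, read length and abundance used below are placeholder values for illustration, not measurements from my sample, and a is treated as the fraction of reads that come from the least abundant species.

```python
import math


def reads_required(coverage, genome_length, read_length, rel_abundance):
    """Solve C = a * L * N / G for N, the total number of reads needed."""
    if not 0 < rel_abundance <= 1:
        raise ValueError("relative abundance must be in (0, 1]")
    n = coverage * genome_length / (rel_abundance * read_length)
    return math.ceil(n)


# Placeholder inputs: 9x target coverage, 5 Mb genome, 100 bp reads,
# least abundant species making up 0.1% of the reads.
n_reads = reads_required(coverage=9, genome_length=5_000_000,
                         read_length=100, rel_abundance=0.001)
print("Total reads needed:", n_reads)
```

Because N scales with 1 / a, halving the assumed abundance of the rarest species doubles the required number of reads, which is why an estimate of a matters so much for the budget.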
written 5.5 years ago by Tim110