4.8 years ago by

Czech Republic, Brno, CEITEC

Hi!

First, I would like to note that it is not easy to get a reliable estimate of total sample diversity in metagenomic studies. Estimators like Chao1 were originally designed for experiments with dozens of niches and tens to hundreds of different species. In metagenomics we are dealing with millions of cells (the sample size) and thousands of orthologous groups in the case of microbiome samples, or hundreds of thousands of variants in the case of immune repertoire profiling. In my experience, one should never forget that these diversity estimates are lower bounds: sample more cells from the same individual and they could increase by an order of magnitude.
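To make the "lower bound" point concrete, here is a minimal sketch of the classic Chao1 estimate computed from per-species (or per-clonotype) counts. The function name `chao1` and the toy counts are my own illustration, not taken from any particular package, and the fallback used when there are no doubletons is the usual bias-corrected form.

```python
def chao1(counts):
    """Chao1 lower-bound richness estimate from per-species counts.

    counts: iterable of non-negative integers, one per observed species.
    """
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                      # number of observed species
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    if f2 > 0:
        # Classic form: S_obs + f1^2 / (2 * f2)
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here when there are no doubletons
    return s_obs + f1 * (f1 - 1) / 2


# Toy example (made-up counts): 8 observed species, 4 singletons, 2 doubletons
toy_counts = [5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(toy_counts))
```

The estimate only adds a correction on top of the observed richness, so it can never drop below what you have already seen, and it tends to grow as you sample deeper.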

Now to your question. To estimate the fraction of unseen species, you can try the Good-Turing estimator `f1 / n`, where `f1` is the number of singletons and `n` is the sample size, as suggested by Chao here. For TCR repertoire data this estimator works quite well. In T cells, all the diversity of one's repertoire is concentrated in the naive T-cell pool, and as our recent work has shown, the fraction of singletons correlates nicely with the percentage of naive T cells in blood as measured by flow cytometry.
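As a rough sketch of that calculation, assuming you already have a list of per-clonotype (or per-species) read or cell counts, something like the following would do. The function name `unseen_fraction` and the toy counts are hypothetical; this is not code from our pipeline.

```python
def unseen_fraction(counts):
    """Good-Turing estimate f1 / n of the unseen fraction.

    counts: iterable of non-negative integers, one per observed
    clonotype/species. n is the total sample size (sum of counts)
    and f1 is the number of entries observed exactly once.
    """
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if n == 0:
        raise ValueError("empty sample: cannot estimate unseen fraction")
    f1 = sum(1 for c in counts if c == 1)
    return f1 / n


# Toy example (made-up counts): n = 16, f1 = 4
toy_counts = [5, 3, 2, 2, 1, 1, 1, 1]
u = unseen_fraction(toy_counts)
print(f"unseen fraction: {u:.2f}, sample coverage: {1 - u:.2f}")
```

Keep in mind that `f1` is very sensitive to sequencing errors, which inflate the number of singletons, so the counts should be error-corrected before applying this.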

Still, I doubt that today's sequencing depth and accuracy are sufficient to correctly determine the minimal clone size from a metagenome sample.