Question: How To Estimate Sequencing Depth Needed To Fully Capture Metagenomic Data?
6.5 years ago by Pawel Szczesny (3.2k):

I want to accurately predict how many sequencing runs I need to fully capture microbial biodiversity (including genomes). I know some obvious basics, such as using rarefaction curves from marker genes (SSU, for example), but I assume there are more sophisticated methods for estimating what I need. Any ideas? My Google-fu fails me on this one.

6.4 years ago by Istvan Albert ♦♦ 79k (University Park, USA):

There are several factors that one needs to deal with here:

  1. number of species
  2. size of the genome for each species
  3. relative abundance of each species

If you do not have a range for each of these values, then I can't really see any a priori way to estimate the required depth. Perhaps consulting the results of studies on similar samples would be a start. In the end, this is the main purpose of rarefaction curves: to estimate species richness from the results of incomplete sampling.


I know that it depends on these factors (among others). Assuming I can provide the ranges, I'm wondering how to estimate the required depth in a reproducible way. One approach would be to use the Lander/Waterman equation, but given its tendency to underestimate, I assumed people had invented something better by now (I've just found some extensions of this equation, but haven't looked into the details yet).
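To make the Lander/Waterman-style reasoning concrete, here is a minimal sketch of how the three factors listed above (number of species, genome sizes, relative abundances) could be turned into a per-species coverage estimate. It assumes reads are drawn in proportion to abundance times genome size and uses the idealized Poisson expectation 1 - exp(-c) for the fraction of a genome covered at fold coverage c. The community, the function names, and the numbers are all hypothetical, and the idealized formula ignores the biases discussed in this thread.

```python
import math

def per_species_coverage(total_bases, abundances, genome_sizes):
    """Expected fold coverage per species.

    Assumption: reads are sampled in proportion to the DNA each species
    contributes, i.e. (relative cell abundance x genome size).
    """
    dna = [a * g for a, g in zip(abundances, genome_sizes)]
    total_dna = sum(dna)
    # Bases assigned to a species, divided by its genome size.
    return [total_bases * d / total_dna / g for d, g in zip(dna, genome_sizes)]

def fraction_covered(coverage):
    """Idealized Poisson expectation of the genome fraction hit at least once."""
    return 1.0 - math.exp(-coverage)

# Hypothetical community: relative cell abundances and genome sizes in bp.
abundances = [0.90, 0.09, 0.01]
genome_sizes = [5_000_000, 3_000_000, 4_000_000]
total_bases = 1_000_000_000  # hypothetical sequencing yield (1 Gbp)

coverages = per_species_coverage(total_bases, abundances, genome_sizes)
for a, c in zip(abundances, coverages):
    print(f"abundance {a:.2f}: {c:.1f}x coverage, "
          f"expected fraction covered {fraction_covered(c):.3f}")
```

Under this model coverage scales linearly with relative abundance, so the rarest species sets the total yield needed.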

Reply written 6.4 years ago by Pawel Szczesny (3.2k)

But isn't it true that the equation you refer to, in its simplest form, does not really correspond to the coverages actually observed? A search turns up a modified version of the equation. I think in the end one needs to go in reverse. The question should not be how to fully capture the metagenome; I don't think that is possible to estimate. Instead, if one assumes, say, a power-law type of decay of abundances, where the tail represents rare species, what is the minimal abundance that we could detect at a given coverage?
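The reverse question can be sketched in a few lines. This is only an illustration under loud assumptions: "detect" is taken to mean reaching a chosen target fold coverage, reads are assumed to be sampled in proportion to abundance times genome size, and the power-law exponent, species count, mean genome size, and yield are hypothetical values, not measurements.

```python
def min_detectable_abundance(total_bases, mean_genome_size, target_coverage):
    """Smallest relative abundance that still reaches target_coverage.

    Assumption: coverage of a species = total_bases * abundance / mean_genome_size,
    where mean_genome_size is the abundance-weighted mean genome size (bp).
    """
    return target_coverage * mean_genome_size / total_bases

def power_law_abundances(n_species, exponent):
    """Relative abundances proportional to rank ** -exponent, normalized to 1."""
    weights = [rank ** -exponent for rank in range(1, n_species + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scenario: 1000 species, exponent 1, 10 Gbp of sequence,
# 4 Mbp mean genome size, and 10x coverage as the detection threshold.
abund = power_law_abundances(n_species=1000, exponent=1.0)
threshold = min_detectable_abundance(
    total_bases=10_000_000_000, mean_genome_size=4_000_000, target_coverage=10
)
detectable = sum(1 for a in abund if a >= threshold)

print(f"minimal detectable relative abundance: {threshold:.4f}")
print(f"species at or above that abundance: {detectable} of {len(abund)}")
```

Turning this around, one can pick the rarest abundance one cares about and solve for the yield: total_bases = target_coverage * mean_genome_size / abundance.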

Reply written 6.4 years ago (modified 6.4 years ago) by Istvan Albert ♦♦ 79k

Indeed, the LW equation isn't a good solution either; that's why I assumed something better was available (anything that would take sampling data and produce an answer like "X runs to capture 99%, under assumptions A, B and C"). Good point about asking the reverse question. Is there any agreed-on approach to that?

Reply written 6.4 years ago by Pawel Szczesny (3.2k)
6.4 years ago by Manu Prestat (3.9k; Marseille, France):

I agree with Istvan's comments. The distribution of species (and their genome sizes) is required for a good estimate of the sequencing depth needed to assemble everything. The fact is, to get that, one would have to take into account the hundreds to hundreds of millions of species that are in the sample and follow a kind of Fisher's butterfly approach [*] for each of them. That would lead to building a Poisson mixture model (one component per species) with thousands of parameters, which is very difficult to fit. To conclude, same as Istvan: more than difficult (especially a priori).

Nevertheless, rarefaction curves give information you don't want to discard. BTW, "EMIRGE" can help improve the curve estimation process. As an alternative, what we did in my former lab was to look at the number of "orphan reads" (i.e. reads that are not assembled) as a function of assembly effort.
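For readers who have not built one, a rarefaction curve can be sketched by repeatedly subsampling reads and counting the distinct taxa seen at each depth. The snippet below is a minimal illustration, not the EMIRGE workflow: it assumes each marker-gene read has already been assigned a taxon label, and the labels, depths, and function name are hypothetical.

```python
import random

def rarefaction_curve(labels, depths, replicates=10, seed=0):
    """Mean number of distinct taxa observed at each subsampling depth.

    labels: one taxon label per read (assumed to come from an upstream
    classification step that is not shown here).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for depth in depths:
        # Subsample without replacement and count distinct taxa.
        counts = [len(set(rng.sample(labels, depth))) for _ in range(replicates)]
        curve.append((depth, sum(counts) / replicates))
    return curve

# Hypothetical sample: 1000 classified reads from four taxa of uneven abundance.
labels = ["A"] * 800 + ["B"] * 150 + ["C"] * 45 + ["D"] * 5
curve = rarefaction_curve(labels, depths=[10, 100, 500, 1000])

for depth, mean_taxa in curve:
    print(f"{depth:5d} reads: {mean_taxa:.1f} taxa on average")
```

A curve that is still climbing at the full depth suggests that more sequencing would reveal more taxa; a plateau suggests the sampled diversity is close to saturation. The same subsampling loop could be applied to the orphan-read count instead of the taxon count.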

Good luck, if you find more accurate answers, please share! :-)

[*] Fisher, R., Corbet, A., & Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58.


Thanks for the link to EMIRGE. I wanted to try this one as well ( ) but the source code does not seem to be available.

Looking at the number of orphan reads is also a neat solution.

Reply written 6.4 years ago (modified 6.4 years ago) by Pawel Szczesny (3.2k)