How To Estimate Sequencing Depth Needed To Fully Capture Metagenomic Data?
11.9 years ago

I want to accurately predict how many sequencing runs I need to fully capture the microbial biodiversity in a sample (including whole genomes). I know some obvious basics, such as using rarefaction curves from marker genes (SSU rRNA, for example), but I assume there are more sophisticated methods for estimating what I need. Any ideas? My Google-fu fails me on this one.

metagenomics illumina depth-of-coverage • 6.0k views

There are several factors that one needs to deal with here:

1. number of species
2. size of the genome for each species
3. relative abundance of each species

If you do not have a range for each of these values, then I can't really see any a priori way to estimate the required depth. Perhaps consulting the results of studies on similar samples would be a start. In the end, that is the main purpose of rarefaction curves: to estimate species richness from the results of incomplete sampling.
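As a toy illustration of how these three factors interact, here is a quick rarefaction-curve simulation. All the numbers (100 species, one dominant at 50% abundance) are made up for the sake of the example:

```python
import random

def rarefaction(abundances, depths, seed=0):
    """Simulate how many distinct species are observed at each sampling depth.

    abundances: relative abundance per species (should sum to ~1).
    depths: list of read counts to sample at.
    """
    rng = random.Random(seed)
    species = list(range(len(abundances)))
    curve = []
    for n in depths:
        # Draw n reads, each hitting species i with probability abundances[i]
        draws = rng.choices(species, weights=abundances, k=n)
        curve.append(len(set(draws)))
    return curve

# Hypothetical community: one dominant species, 99 equally rare ones
abund = [0.5] + [0.5 / 99] * 99
print(rarefaction(abund, [10, 100, 1000, 10000]))
```

The shape of the printed curve (how quickly it flattens toward 100) is exactly what changes when you alter the abundance distribution, which is why ranges for the three factors above matter so much.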


I know that it depends on these factors (among others), and assuming I can provide the ranges, I'm wondering how to estimate the required depth in a reproducible way. One approach would be to use the Lander-Waterman equation, but given its tendency to underestimate, I thought people would have invented something better by now (I've just found some extensions of the equation, but haven't yet looked into the details).
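For reference, the classic Lander-Waterman expectations in their simplest form, for a single genome (the 5 Mb genome, 150 bp reads and 1 M reads below are illustrative numbers, not a recommendation):

```python
import math

def lander_waterman(genome_size, read_len, n_reads):
    """Classic Lander-Waterman expectations for one haploid genome."""
    c = n_reads * read_len / genome_size   # mean coverage
    frac_covered = 1 - math.exp(-c)        # expected fraction of bases covered
    n_contigs = n_reads * math.exp(-c)     # expected number of contigs
    return c, frac_covered, n_contigs

# Hypothetical: 5 Mb genome, 150 bp reads, 1 million reads
c, f, g = lander_waterman(5e6, 150, 1e6)
print(f"coverage={c:.1f}x, covered={f:.4%}, expected contigs={g:.2g}")
```

In a metagenome each species effectively gets its own Lander-Waterman calculation with its coverage scaled down by its relative abundance, which is where the underestimation problem for rare species comes from.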


But isn't it true that the equation you refer to, in its simplest form, does not really correspond to the coverages actually observed? A search turns up a modified version of the equation. I think in the end one needs to go in reverse. The question should not be how to fully capture the metagenome, since I don't think that is possible to estimate. Instead, if one assumes, say, a power-law decay of abundances where the tail represents rare species: what is the minimal abundance that we could detect at a given coverage?
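The reverse question can be sketched directly: assume a power-law rank-abundance curve, compute each species' expected coverage at a given total yield, and report the deepest rank that still meets some coverage threshold. The community size, exponent, genome size, yield and 10x threshold below are all illustrative assumptions:

```python
def detectable_ranks(n_species, alpha, total_bases, genome_size, min_cov):
    """Rank-abundance a_i proportional to i**-alpha (same genome size for all
    species, for simplicity). Per-species expected coverage is
    a_i * total_bases / genome_size; return the deepest rank meeting min_cov."""
    weights = [i ** -alpha for i in range(1, n_species + 1)]
    z = sum(weights)
    deepest = 0
    for rank, w in enumerate(weights, start=1):
        cov = (w / z) * total_bases / genome_size
        if cov >= min_cov:
            deepest = rank
        else:
            break  # abundances are monotonically decreasing in rank
    return deepest

# Hypothetical: 1000 species, alpha = 1, 4 Mb genomes, 100 Gb of sequence,
# and a 10x coverage requirement for assembly
print(detectable_ranks(1000, 1.0, 100e9, 4e6, 10))
```

Running this for a range of total yields gives a "runs needed vs. rarest assemblable species" curve, which is one reproducible way to phrase the reverse question.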


Indeed, the Lander-Waterman equation isn't a good solution either; that's why I assumed something better is available (anything that, from sampling data, would produce an answer like "X runs to capture 99%, under assumptions A, B and C"). Good point about asking the reverse question, though: is there any agreed-upon approach to that?


I agree with Istvan's comments. The distribution of species (and their genome sizes) is needed for a good estimate of the sequencing depth required to assemble everything. The problem is that, to obtain it, one would have to account for the hundreds to hundreds of millions of species in the sample and follow something like Fisher's butterfly approach [*] for each of them; that would lead to a Poisson mixture model (one component per species) with thousands of parameters, which would be very difficult to fit. So my conclusion is the same as Istvan's: more than difficult (especially a priori).
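To make the "one Poisson per species" point concrete, here is a toy two-component Poisson mixture fitted by EM to simulated per-contig read counts. The rates (2 and 12), sample sizes and initialisation are all made up; scaling this from two components to thousands is exactly where the fitting becomes impractical:

```python
import math
import random

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_sample(lam, rng):
    # Knuth's algorithm for sampling from a Poisson distribution
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def em_two_poisson(counts, lam=(1.0, 10.0), pi=0.5, iters=100):
    """EM for a two-component Poisson mixture; returns (lam1, lam2, pi)."""
    l1, l2 = lam
    for _ in range(iters):
        # E-step: responsibility of component 1 for each count
        r = []
        for k in counts:
            p1 = pi * poisson_pmf(k, l1)
            p2 = (1 - pi) * poisson_pmf(k, l2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means update the rates
        s = sum(r)
        l1 = sum(ri * k for ri, k in zip(r, counts)) / s
        l2 = sum((1 - ri) * k for ri, k in zip(r, counts)) / (len(counts) - s)
        pi = s / len(counts)
    return l1, l2, pi

rng = random.Random(1)
# Simulated read counts: a rare species around 2x, an abundant one around 12x
counts = ([poisson_sample(2, rng) for _ in range(200)]
          + [poisson_sample(12, rng) for _ in range(200)])
l1, l2, pi = em_two_poisson(counts)
print(f"lam1={l1:.2f}, lam2={l2:.2f}, pi={pi:.2f}")
```

Even in this easy two-component case the fit depends on a sensible initialisation; with one component per species and heavily overlapping rates, the likelihood surface becomes far harder to optimise.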

Nevertheless, rarefaction curves give information you don't want to discard; by the way, "EMIRGE" can help improve the curve estimation process. As an alternative, what we did in my former lab was to look at the number of "orphan reads" (i.e. reads that were not assembled) as a function of assembly effort.
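The orphan-read idea boils down to a simple saturation check; a minimal sketch, with entirely hypothetical measurements (the efforts here stand for subsampled fractions of the read set):

```python
def saturated(efforts, orphan_fracs, tol=0.01):
    """Given orphan-read fractions measured at increasing assembly efforts,
    report whether the last increase in effort reduced the orphan fraction
    by no more than `tol` (i.e. the assembly appears to have saturated)."""
    assert len(efforts) == len(orphan_fracs) >= 2
    return (orphan_fracs[-2] - orphan_fracs[-1]) <= tol

# Hypothetical orphan fractions after assembling 10%, 25%, 50%, 100% of reads
print(saturated([0.1, 0.25, 0.5, 1.0], [0.40, 0.22, 0.15, 0.13]))
```

If the orphan fraction is still dropping at full effort, more sequencing would likely still recover new genomes; once it plateaus, extra depth mostly resamples what you already have.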

Good luck! If you find more accurate answers, please share! :-)

[*] Fisher, R., Corbet, A., & Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58.


Thanks for the link to EMIRGE. I wanted to try this one as well (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039948 ) but the source code does not seem to be available.

Looking at the number of orphan reads is also a neat solution.