How To Estimate Sequencing Depth Needed To Fully Capture Metagenomic Data?
11.9 years ago

I want to accurately predict how many sequencing runs I need to fully capture the microbial biodiversity in a sample (including whole genomes). I know some obvious basics, such as using rarefaction curves from marker genes (SSU rRNA, for example), but I assume there are more sophisticated methods for estimating what I need. Any ideas? My Google-fu fails me on this one.

metagenomics illumina depth-of-coverage • 6.0k views

There are several factors that one needs to deal with here:

1. number of species
2. size of the genome for each species
3. relative abundance of each species

If you do not have a range for each of these values, then I can't really see any a priori way to estimate the required depth. Perhaps consulting the results of studies on similar samples would be a start. In the end, that is the main purpose of rarefaction curves: to estimate species richness from the results of incomplete sampling.
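As a toy illustration of how these three factors interact, here is a quick rarefaction-curve simulation. All the numbers (100 species, one dominant at 50% abundance) are made up for the sake of the example:

```python
import random

def rarefaction(abundances, depths, seed=0):
    """Simulate how many distinct species are observed at each sampling depth.

    abundances: relative abundance per species (should sum to ~1).
    depths: list of read counts to sample at.
    """
    rng = random.Random(seed)
    species = list(range(len(abundances)))
    curve = []
    for n in depths:
        # Draw n reads, each hitting species i with probability abundances[i]
        draws = rng.choices(species, weights=abundances, k=n)
        curve.append(len(set(draws)))
    return curve

# Hypothetical community: one dominant species, 99 equally rare ones
abund = [0.5] + [0.5 / 99] * 99
print(rarefaction(abund, [10, 100, 1000, 10000]))
```

The shape of the printed curve (how quickly it flattens toward 100) is exactly what changes when you alter the abundance distribution, which is why ranges for the three factors above matter so much.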


I know that it depends on these factors (among others), and assuming I can provide the ranges, I'm wondering how to estimate the required depth in a reproducible way. One approach would be to use the Lander-Waterman equation, but given its tendency to underestimate, I thought people would have invented something better by now (I've just found some extensions of the equation, but haven't yet looked into the details).
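For reference, the classic Lander-Waterman expectations in their simplest form, for a single genome (the 5 Mb genome, 150 bp reads and 1 M reads below are illustrative numbers, not a recommendation):

```python
import math

def lander_waterman(genome_size, read_len, n_reads):
    """Classic Lander-Waterman expectations for one haploid genome."""
    c = n_reads * read_len / genome_size   # mean coverage
    frac_covered = 1 - math.exp(-c)        # expected fraction of bases covered
    n_contigs = n_reads * math.exp(-c)     # expected number of contigs
    return c, frac_covered, n_contigs

# Hypothetical: 5 Mb genome, 150 bp reads, 1 million reads
c, f, g = lander_waterman(5e6, 150, 1e6)
print(f"coverage={c:.1f}x, covered={f:.4%}, expected contigs={g:.2g}")
```

In a metagenome each species effectively gets its own Lander-Waterman calculation with its coverage scaled down by its relative abundance, which is where the underestimation problem for rare species comes from.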


But isn't it true that the equation you refer to, in its simplest form, does not really correspond to the coverages actually observed? A search turns up a modified version of the equation. I think in the end one needs to go in reverse. The question should not be how to fully capture the metagenome, since I don't think that is possible to estimate. Instead, if one assumes, say, a power-law decay of abundances where the tail represents rare species: what is the minimal abundance that we could detect at a given coverage?
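The reverse question can be sketched directly: assume a power-law rank-abundance curve, compute each species' expected coverage at a given total yield, and report the deepest rank that still meets some coverage threshold. The community size, exponent, genome size, yield and 10x threshold below are all illustrative assumptions:

```python
def detectable_ranks(n_species, alpha, total_bases, genome_size, min_cov):
    """Rank-abundance a_i proportional to i**-alpha (same genome size for all
    species, for simplicity). Per-species expected coverage is
    a_i * total_bases / genome_size; return the deepest rank meeting min_cov."""
    weights = [i ** -alpha for i in range(1, n_species + 1)]
    z = sum(weights)
    deepest = 0
    for rank, w in enumerate(weights, start=1):
        cov = (w / z) * total_bases / genome_size
        if cov >= min_cov:
            deepest = rank
        else:
            break  # abundances are monotonically decreasing in rank
    return deepest

# Hypothetical: 1000 species, alpha = 1, 4 Mb genomes, 100 Gb of sequence,
# and a 10x coverage requirement for assembly
print(detectable_ranks(1000, 1.0, 100e9, 4e6, 10))
```

Running this for a range of total yields gives a "runs needed vs. rarest assemblable species" curve, which is one reproducible way to phrase the reverse question.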


Indeed, the Lander-Waterman equation isn't a good solution either; that's why I assumed something better is available (anything that, from sampling data, would produce an answer like "X runs to capture 99%, under assumptions A, B and C"). Good point about asking the reverse question, though: is there any agreed-upon approach to that?


I agree with Istvan's comments. The distribution of species (and their genome sizes) is needed for a good estimate of the sequencing depth required to assemble everything. The problem is that, to obtain it, one would have to account for the hundreds to hundreds of millions of species in the sample and follow something like Fisher's butterfly approach [*] for each of them; that would lead to a Poisson mixture model (one component per species) with thousands of parameters, which would be very difficult to fit. So my conclusion is the same as Istvan's: more than difficult (especially a priori).
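To make the "one Poisson per species" point concrete, here is a toy two-component Poisson mixture fitted by EM to simulated per-contig read counts. The rates (2 and 12), sample sizes and initialisation are all made up; scaling this from two components to thousands is exactly where the fitting becomes impractical:

```python
import math
import random

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_sample(lam, rng):
    # Knuth's algorithm for sampling from a Poisson distribution
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def em_two_poisson(counts, lam=(1.0, 10.0), pi=0.5, iters=100):
    """EM for a two-component Poisson mixture; returns (lam1, lam2, pi)."""
    l1, l2 = lam
    for _ in range(iters):
        # E-step: responsibility of component 1 for each count
        r = []
        for k in counts:
            p1 = pi * poisson_pmf(k, l1)
            p2 = (1 - pi) * poisson_pmf(k, l2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means update the rates
        s = sum(r)
        l1 = sum(ri * k for ri, k in zip(r, counts)) / s
        l2 = sum((1 - ri) * k for ri, k in zip(r, counts)) / (len(counts) - s)
        pi = s / len(counts)
    return l1, l2, pi

rng = random.Random(1)
# Simulated read counts: a rare species around 2x, an abundant one around 12x
counts = ([poisson_sample(2, rng) for _ in range(200)]
          + [poisson_sample(12, rng) for _ in range(200)])
l1, l2, pi = em_two_poisson(counts)
print(f"lam1={l1:.2f}, lam2={l2:.2f}, pi={pi:.2f}")
```

Even in this easy two-component case the fit depends on a sensible initialisation; with one component per species and heavily overlapping rates, the likelihood surface becomes far harder to optimise.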

Nevertheless, rarefaction curves give information you don't want to discard; by the way, "EMIRGE" can help improve the curve estimation process. As an alternative, what we did in my former lab was to look at the number of "orphan reads" (i.e. reads that were not assembled) as a function of assembly effort.
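The orphan-read idea boils down to a simple saturation check; a minimal sketch, with entirely hypothetical measurements (the efforts here stand for subsampled fractions of the read set):

```python
def saturated(efforts, orphan_fracs, tol=0.01):
    """Given orphan-read fractions measured at increasing assembly efforts,
    report whether the last increase in effort reduced the orphan fraction
    by no more than `tol` (i.e. the assembly appears to have saturated)."""
    assert len(efforts) == len(orphan_fracs) >= 2
    return (orphan_fracs[-2] - orphan_fracs[-1]) <= tol

# Hypothetical orphan fractions after assembling 10%, 25%, 50%, 100% of reads
print(saturated([0.1, 0.25, 0.5, 1.0], [0.40, 0.22, 0.15, 0.13]))
```

If the orphan fraction is still dropping at full effort, more sequencing would likely still recover new genomes; once it plateaus, extra depth mostly resamples what you already have.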

Good luck! If you find more accurate answers, please share! :-)

[*] Fisher, R., Corbet, A., & Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58.


Thanks for the link to EMIRGE. I wanted to try this one as well (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039948 ) but the source code does not seem to be available.

Looking at the number of orphan reads is also a neat solution.