I agree with Istvan's comments. The distribution of species (and their genome sizes) is needed to get a good estimate of the sequencing depth required to assemble everything. The catch is that, to get it, one would have to account for the hundreds to hundreds of millions of species in the sample and follow a kind of Fisher's butterfly approach [*] for each of them. That would mean building a Poisson mixture model (one component per species) with thousands of parameters, which is very difficult to fit. To conclude, same as Istvan: more than difficult (especially a priori).

Nevertheless, rarefaction curves give information you don't want to discard. BTW, "EMIRGE" can help improve the curve estimation process. As an alternative, what we did in my former lab was to look at the number of "orphan reads" (= not assembled) as a function of assembly effort.
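To make the rarefaction idea concrete, here is a minimal sketch (not what we ran in the lab) of the analytic rarefaction curve: the expected number of distinct taxa seen in a random subsample of n reads, drawn without replacement. The function name and the `counts` input (reads assigned to each taxon, e.g. from marker-gene classification) are assumptions for illustration only.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of distinct taxa in a random subsample of n reads.

    counts: reads observed per taxon (hypothetical input for this sketch).
    Uses the hypergeometric expectation: a taxon with c reads is missed
    with probability C(N - c, n) / C(N, n), where N is the total read count.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of reads")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when n > total - c, i.e. the taxon is always seen.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def rarefaction_curve(counts, step):
    """(subsample size, expected richness) pairs, always including full depth."""
    total = sum(counts)
    sizes = list(range(0, total, step)) + [total]
    return [(n, rarefied_richness(counts, n)) for n in sizes]
```

If the curve is still climbing steeply at full depth, the sample is far from saturated; if it has flattened, extra sequencing mostly re-samples what you already have.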

Good luck,
if you find more accurate answers, please share! :-)

[*] Fisher, R., Corbet, A., & Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58.

I know that it depends on these factors (among others) and assuming I can provide the ranges, I'm wondering how to estimate required depth in a reproducible way. One approach would be to use the Lander/Waterman equation, but given its tendency to underestimate I thought people had already invented something better by now (I've just found some extensions of this equation, but haven't yet looked into the details).

But isn't it true that the equation you refer to, in its simplest form, does not really correspond to the coverages actually observed? A search turns up a modified version of the equation. I think in the end one needs to go in reverse. The question should not be how to fully capture the metagenome; I don't think that is possible to estimate. But if one assumes, say, a power-law type of decay of abundances, where the tail represents rare species, what is the minimal abundance that we could detect at a given coverage?
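A rough sketch of that reverse question, using only the simplest Lander/Waterman form (expected covered fraction = 1 - exp(-c) at coverage c), which, as noted, tends to be optimistic. Everything here is an assumption for illustration: the function names, a single genome size shared by all species, "abundance" meaning the fraction of sequenced bases coming from a species, and a rank-abundance curve of the form rank^(-exponent).

```python
import math


def expected_fraction_covered(coverage):
    """Simplest Lander/Waterman form: fraction of a genome hit at least once."""
    return 1.0 - math.exp(-coverage)


def coverage_for_fraction(fraction):
    """Coverage needed so that the expected covered fraction equals `fraction`."""
    return -math.log(1.0 - fraction)


def min_detectable_abundance(total_bases, genome_size, target_fraction=0.99):
    """Smallest share of sequenced bases a species needs to reach the target.

    A species with share p gets coverage c = p * total_bases / genome_size.
    """
    needed = coverage_for_fraction(target_fraction)
    return min(1.0, needed * genome_size / total_bases)


def power_law_abundances(n_species, exponent=1.0):
    """Assumed rank-abundance curve: share proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n_species + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def species_recovered(total_bases, genome_size, n_species,
                      exponent=1.0, target_fraction=0.99):
    """Number of species whose share clears the detection threshold."""
    p_min = min_detectable_abundance(total_bases, genome_size, target_fraction)
    abundances = power_law_abundances(n_species, exponent)
    return sum(1 for p in abundances if p >= p_min)
```

For example, `min_detectable_abundance(1e10, 5e6)` gives the share of bases a 5 Mb genome would need in a 10 Gb run to be expected 99% covered, and `species_recovered` shows how slowly the count of recovered species grows with depth under the assumed tail. Swapping in a modified coverage equation only means replacing `coverage_for_fraction`.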

Indeed, the LW equation isn't a good solution either; that's why I assumed something better is available (anything that, from sampling data, would produce an answer like: "X runs to capture 99%, under assumptions A, B and C"). Good point about asking the reverse question. Is there any agreed-on approach to that?