I want to accurately predict how many sequencing runs I need to fully capture microbial biodiversity (including genomes). I know some obvious basics, such as using rarefaction curves from marker genes (SSU, for example), although I assume there are more sophisticated methods to estimate what I need. Any ideas? My Google-fu fails me on this one.
There are several factors that one needs to deal with here:
- number of species
- size of the genome for each species
- relative abundance of each species
If you do not have a range for each of these values, then I can't really see any a priori way to estimate the required depth. Perhaps consulting results of studies on similar samples would be a start. In the end, the main purpose of rarefaction curves is to estimate species richness from the results of incomplete sampling.
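To make the dependence on those three factors concrete, here is a rough back-of-the-envelope sketch. It assumes that each species contributes DNA in proportion to its cell abundance times its genome size, that reads are drawn uniformly, and that the Lander-Waterman approximation holds. The function names and the example numbers are made up for illustration, not taken from any real sample.

```python
import math

def required_bases(abundances, genome_sizes, target_coverage):
    """Total bases to sequence so the rarest species reaches target_coverage.

    abundances   : relative cell abundances (any positive scale)
    genome_sizes : genome sizes in bp, in the same order
    """
    # DNA mass per species is proportional to abundance * genome size.
    total_dna = sum(a * g for a, g in zip(abundances, genome_sizes))
    # Coverage of species i after B bases is B * a_i / total_dna,
    # so the least abundant species sets the requirement.
    return target_coverage * total_dna / min(abundances)

def expected_fraction_covered(coverage):
    """Lander-Waterman estimate of the genome fraction hit by at least one read."""
    return 1.0 - math.exp(-coverage)

# Hypothetical two-species community: 90% / 10% of cells, 5 Mbp and 2 Mbp genomes.
bases = required_bases([0.9, 0.1], [5e6, 2e6], target_coverage=20)
print(f"{bases:.3g} bases needed")
```

With real samples the abundances and genome sizes are exactly the unknowns, so this only helps once you have plausible ranges to plug in, for example from studies on similar samples.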
I agree with Istvan's comments. The distribution of species (and their genome sizes) is required to get a good estimate of the sequencing depth needed to assemble everything. The catch is that, to get it, one would have to account for the hundreds to hundreds of millions of species in the sample and follow a kind of Fisher's butterfly approach [*] for each of them. That leads to a Poisson mixture model (one component per species) with thousands of parameters, which is very difficult to fit. To conclude, same as Istvan: more than difficult (especially a priori).
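For intuition only, the forward direction of such a per-species Poisson model is easy to write down if you pretend the read fraction of every species is already known (which is precisely what is hard to fit in practice). The sketch below is an illustration under that assumption; the function name and the `min_reads` threshold are hypothetical.

```python
import math

def expected_species_detected(read_fractions, n_reads, min_reads=1):
    """Expected number of species seen in at least min_reads reads.

    Each species i is modelled as Poisson with mean n_reads * read_fractions[i].
    """
    total = 0.0
    for p in read_fractions:
        lam = n_reads * p
        # P(X < min_reads) for X ~ Poisson(lam)
        p_below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                      for k in range(min_reads))
        total += 1.0 - p_below
    return total

# Hypothetical skewed community: one dominant and one rare species.
for n in (10, 100, 1000):
    print(n, round(expected_species_detected([0.99, 0.01], n), 3))
```

Going the other way, from observed counts back to the fractions of thousands of unseen or barely seen species, is the difficult part.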
Nevertheless, rarefaction curves give information you don't want to discard. By the way, "EMIRGE" can help improve the curve estimation process. As an alternative, what we did in my former lab was to look at the number of "orphan reads" (reads that were not assembled) as a function of assembly effort.
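In case it is useful, a rarefaction curve can be computed directly from a table of per-taxon read counts. This is a minimal sketch using the classical analytical rarefaction formula (expected richness in a random subsample drawn without replacement); the counts are invented, and the exact binomial coefficients make it suitable for small tables only.

```python
from math import comb

def rarefy(counts, n):
    """Expected number of taxa in a random subsample of n reads."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    denom = comb(total, n)
    # For each taxon: 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)

# Hypothetical marker-gene counts for five taxa.
counts = [50, 30, 15, 4, 1]
curve = [(n, rarefy(counts, n)) for n in range(0, sum(counts) + 1, 20)]
for n, s in curve:
    print(n, round(s, 2))
```

If the curve is still climbing at the full read count, more sequencing is likely to reveal more taxa; the same kind of plot works with orphan reads on the y-axis instead of richness.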
Good luck, if you find more accurate answers, please share! :-)
[*] Fisher, R., Corbet, A., & Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58.