We have N=a few million 18S reads from some environment. The reads have been clustered into OTUs, and the OTUs annotated against a reference database.
To generate a rarefaction curve, my understanding is that one randomly subsamples n reads (without replacement), for a series of depths n ranging from 0 to N at some interval, and counts the number of OTUs observed at each depth.
In standard practice (as implemented by suites such as QIIME and mothur), which of the two approaches I can think of is employed?
1. Treat the original assignments of reads to OTUs as truth, and when resampling n reads, just count the number of "original" OTUs observed in this sub-sample.
2. Re-cluster the sub-sampled reads, and then count the number of "new" OTUs in the sub-sample.
My sense from reading through the QIIME documentation is that #1 is what is done, but I'm not positive. I'm also not quite sure why #2 wouldn't be the preferred approach (though of course it would be computationally more expensive).
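For concreteness, here is a minimal sketch of how I imagine approach #1 working, assuming each read already carries its pre-assigned OTU label (the function name and interface are just illustrative, not taken from any tool):

```python
import random

def rarefaction_curve(otu_labels, depths, reps=10, seed=0):
    """Approach #1: subsample reads without replacement and count
    the number of distinct pre-assigned OTU labels observed,
    averaged over `reps` independent subsamples per depth."""
    rng = random.Random(seed)
    curve = []
    for n in depths:
        richness = [len(set(rng.sample(otu_labels, n)))
                    for _ in range(reps)]
        curve.append(sum(richness) / reps)
    return curve

# Toy example: 3 OTUs with unequal abundances (85 reads total)
reads = ["otu1"] * 50 + ["otu2"] * 30 + ["otu3"] * 5
print(rarefaction_curve(reads, depths=[1, 10, 50, 85]))
```

At depth 1 the observed richness is always 1, and at the full depth of 85 all 3 OTUs are recovered; in between, the curve rises toward the total OTU count, which is the characteristic rarefaction shape. Approach #2 would instead re-run the clustering step on each subsample before counting, which is why I'd expect it to be far more expensive.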