I'm doing amplicon sequencing of a virus across many different regions. Let's say I put 20k unique species into my PCR assay and, after amplification and sequencing, 19k of them appear in the data. Many of those species appear at very low read counts, and I don't know whether those reads are real or just noise. I use a threshold T: if a variant appears with a read count above T, I count it as real. Of the 20k input species, only about 15% appear to be "real". I have data for several input species counts (2k, 10k, 20k, 40k) and the corresponding fraction of "real" species for each.
My question is: how do I determine which input species count is best? I only looked at 4 values, and the best one may lie somewhere between them; I obviously can't test every value between 2k and 40k. I think I want a balance between the largest fraction of "real" species and the total number of species.
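To make the trade-off concrete, here is a rough sketch of one way I could formalize it. It assumes that "best" means the input count that maximizes the estimated number of "real" species (input count times fraction "real"), and that the fraction can be linearly interpolated in log(input count) between the sampled points. Both are my assumptions, not something I have validated. The function name `best_input_count` is made up, and apart from the 15% at 20k, the fractions below are hypothetical placeholders rather than my measurements.

```python
import numpy as np

def best_input_count(input_counts, real_fractions, grid_size=200):
    """Return (count, fraction, yield) maximizing count * interpolated fraction.

    Assumption: the "real" fraction varies piecewise-linearly in
    log(input count) between the sampled points.
    """
    n = np.asarray(input_counts, dtype=float)
    f = np.asarray(real_fractions, dtype=float)
    order = np.argsort(n)          # np.interp needs increasing x values
    n, f = n[order], f[order]

    # Candidate input counts, log-spaced within the sampled range only
    # (no extrapolation beyond the smallest and largest tested counts).
    grid = np.geomspace(n[0], n[-1], grid_size)
    f_grid = np.interp(np.log(grid), np.log(n), f)

    # Estimated number of "real" species at each candidate input count.
    real_yield = grid * f_grid
    i = int(np.argmax(real_yield))
    return float(grid[i]), float(f_grid[i]), float(real_yield[i])

counts = [2_000, 10_000, 20_000, 40_000]
# HYPOTHETICAL placeholder fractions; only 0.15 at 20k comes from my data.
fractions = [0.60, 0.30, 0.15, 0.05]

best_n, best_f, best_yield = best_input_count(counts, fractions)
print(f"input count ~{best_n:.0f}, fraction ~{best_f:.2f}, "
      f"estimated real species ~{best_yield:.0f}")
```

With only 4 sampled points the interpolated optimum is very uncertain, so at most I would treat it as a suggestion for which input count to test next. A different objective (for example, weighting the fraction more heavily than the total) would only change the `real_yield` line.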
Is there any research (or better yet code) that answers/discusses this problem?
Thanks in advance.