I've been tasked with the following:
We want to verify the quality of (mostly de novo assembled) sequences from another lab without redoing the assembly ourselves.
These "query" sequences (say, influenza) are compared to known high-quality (HQ) reference sequences, handpicked from GenBank, to see whether they fall within the 95% Poisson confidence interval of the HQ references' mutation accumulation over time. Because mutation accumulation is a Poisson process, we want to compute the interval that way. If the query sequences fall within the interval, they are probably good.
So we take the HQ sequences, find the oldest one, and mark it as the "base reference." Then we compute the p-distance of every other HQ reference to the base reference and plot it as a function of time. Next, we plot a line of fit and the 95% confidence interval of those points. Finally, we plot the query sequences using the same method and see whether they fall within the interval. Two problems:
Firstly, the Poisson interval:
The closest I have come to calculating the Poisson interval is the following pair of equations (which, when plotted as connected lines, "look right"):
interval_low = y + y*cdf
interval_hi = y - y*cdf
for each y in the p-distances from the base reference. Is this at all correct?
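For comparison, here is a minimal sketch of one standard construction: convert the fitted p-distance at a given time back into an expected number of differing sites, then take the central Poisson quantiles around that count. This is a prediction band for a single sequence's difference count, not a confidence interval on the fitted line itself. The names `poisson_quantile`, `poisson_band`, `expected_p` and `aln_len` are illustrative assumptions, not part of our pipeline, and the code assumes the expected count stays well below about 700 so that `math.exp(-lam)` does not underflow.

```python
import math

def poisson_quantile(q, lam):
    """Smallest count k such that P(X <= k) >= q for X ~ Poisson(lam)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if lam < 0 or lam > 700:
        raise ValueError("lam outside the range this simple sum can handle")
    k = 0
    term = math.exp(-lam)   # P(X = 0)
    total = term            # running P(X <= k)
    while total < q:
        k += 1
        term *= lam / k     # P(X = k) from P(X = k - 1)
        total += term
    return k

def poisson_band(expected_p, aln_len, level=0.95):
    """Central Poisson band, as p-distances, around a fitted p-distance.

    expected_p: p-distance predicted by the line of fit at some time point
    aln_len:    number of aligned sites compared (assumed known)
    """
    lam = expected_p * aln_len          # expected number of differing sites
    alpha = 1.0 - level
    low_count = poisson_quantile(alpha / 2, lam)
    high_count = poisson_quantile(1.0 - alpha / 2, lam)
    return low_count / aln_len, high_count / aln_len

# Example: fitted p-distance of 0.01 over 1000 compared sites
low, high = poisson_band(0.01, 1000)
print(low, high)
```

Evaluating `poisson_band` at each time point along the fitted line gives a lower and an upper curve to plot. Note that the band is asymmetric around the fit and widens in absolute terms as the expected count grows, which is not the same shape as adding and subtracting a fixed fraction of `y`.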
Secondly, our query sequences sometimes have a date *later* than any of the references, making it impossible to see if they fit in the interval (which is only plotted and calculated up to the latest reference). How would we estimate whether or not the newer query sequence is acceptable?
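To make the second question concrete, here is a sketch of one possible approach, under the loud assumption that the substitution rate stays constant beyond the latest reference (a strict-clock extrapolation, which may not hold). It fits a line through the origin (the base reference has distance 0 at time 0), extends it to the query's date, and checks whether the query's observed difference count is plausible under a Poisson model with that expected count. All function names are hypothetical, and sites with a gap or `N` are simply skipped.

```python
import math

def count_differences(a, b):
    """Return (differing sites, compared sites) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = set("-N")
    diffs = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in skip or y in skip:
            continue                    # ignore gaps and ambiguous bases
        compared += 1
        diffs += (x != y)
    return diffs, compared

def fit_rate(years, p_dists):
    """Least-squares slope through the origin: p_distance ~ rate * years."""
    return sum(t * p for t, p in zip(years, p_dists)) / sum(t * t for t in years)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); assumes lam is well below ~700."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)

def query_is_plausible(diffs, compared, years_since_base, rate, level=0.95):
    """True if the observed count lies inside the central Poisson band
    around the extrapolated expectation rate * years * compared sites."""
    lam = rate * years_since_base * compared
    alpha = 1.0 - level
    lower_tail = poisson_cdf(diffs, lam)             # P(X <= observed)
    upper_tail = 1.0 - poisson_cdf(diffs - 1, lam)   # P(X >= observed)
    return lower_tail > alpha / 2 and upper_tail > alpha / 2

# Made-up numbers purely for illustration: references at 1-4 years
rate = fit_rate([1, 2, 3, 4], [0.002, 0.004, 0.006, 0.008])
# Query sampled 6 years after the base reference, 1000 sites compared
print(query_is_plausible(12, 1000, 6, rate))
print(query_is_plausible(40, 1000, 6, rate))
```

The further the query date lies beyond the latest reference, the more the result depends on the constant-rate assumption, so a query that fails this check far outside the reference range is weaker evidence of a bad assembly than one that fails inside it.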