9.0 years ago by
Maastricht, The Netherlands
While the idea is interesting, I think you have another problem here. If you really want to estimate allele frequencies down to 0.1%, then a sample size of 2000 just isn't big enough, regardless of how good your technology is. That is like trying to estimate how often one side of a coin shows up by flipping it twice. The chance that a given variation that really is present at 0.1% does not show up at all in a population of 2000 is larger than 13% (0.999^2000), and that assumes that your technology gives perfect calls.
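To make that number concrete, here is a minimal sketch of the calculation. It assumes, as the estimate above does, 2000 independent draws and perfect calls; the function names are made up for illustration.

```python
import math


def prob_variant_unseen(freq, n_draws):
    """Chance that a variant with true frequency `freq` is never
    observed in `n_draws` independent draws (perfect calls assumed)."""
    return (1.0 - freq) ** n_draws


def draws_needed(freq, max_miss_prob):
    """Smallest number of draws for which the chance of missing the
    variant entirely is at most `max_miss_prob`."""
    return math.ceil(math.log(max_miss_prob) / math.log(1.0 - freq))


p_miss = prob_variant_unseen(0.001, 2000)
print(f"P(variant never seen in 2000 draws) = {p_miss:.3f}")

n_for_5_percent = draws_needed(0.001, 0.05)
print(f"Draws needed to get below a 5% miss chance: {n_for_5_percent}")
```

Note that this only covers seeing the variation at least once. Estimating its frequency with any precision needs considerably more samples than that.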
But OK... Let's assume you would combine the two technologies and try to think about what you would get. Basically the array results give rather good calls for up to 1 million known SNPs for which we have at least some idea about how often they occur in at least one population. That is after all why they are on the array. So you can assume that the arrays themselves will not find your low-frequency variations, simply because they were not designed to do that.
What I could imagine is that there might be low-frequency variations very close to such known SNPs, so close that they are in fact covered by the array reporter sequence and would influence the detection. Such a low-frequency SNP might in fact lead to wrong calls from the array. You could map high-throughput sequencing results to those array reporter sequences and evaluate that problem. That in itself might be quite interesting. In the end it might help you improve the calls for the arrays (but remember that these were quite good already).
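As a rough illustration of that check, the sketch below flags sequencing variants that fall inside an array reporter's footprint but are not the SNP the reporter targets. The data layout (probe name mapped to chromosome, start, end, target position) and all names and coordinates are hypothetical; real probe annotations and variant calls would have to be parsed from your own files.

```python
def variants_under_probes(probes, variants):
    """Return, per probe, the positions of sequencing variants that lie
    within the probe footprint and are not the targeted SNP itself.

    probes:   dict of name -> (chrom, start, end, target_pos), inclusive
    variants: iterable of (chrom, pos) from the sequencing calls
    """
    hits = {}
    for name, (chrom, start, end, target_pos) in probes.items():
        inside = [
            pos
            for v_chrom, pos in variants
            if v_chrom == chrom and start <= pos <= end and pos != target_pos
        ]
        if inside:
            hits[name] = sorted(inside)
    return hits


# Toy data, invented purely for the example.
example_probes = {
    "probe_A": ("chr1", 1000, 1050, 1025),
    "probe_B": ("chr2", 500, 550, 525),
}
example_variants = [("chr1", 1025), ("chr1", 1040), ("chr2", 900)]

suspect = variants_under_probes(example_probes, example_variants)
print(suspect)
```

Probes that show up in the result are the ones whose array calls may deserve a second look.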
I don't really see how it could work the other way around. Typically the sequencing technology seems to be able to pick up about 10 million real variations in an individual genome (see this). So most of these will not be covered by the array. Now you could improve your reference genome a little by actually putting the array results in. That should improve the mapping of your NGS results to the (now individualized) reference genome a little. But of course that works best for the parts that you really modify (the ones covered by the array, the same ones as above). So I don't think that will really help substantially.
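For what "putting the array results in" could look like, here is a toy sketch that substitutes array-called alleles into a reference sequence. It assumes 0-based positions and homozygous non-reference calls only (heterozygous sites would need a different treatment), and the function name and data are invented for the example.

```python
def individualize_reference(ref_seq, calls):
    """Return a copy of `ref_seq` with array-called alleles substituted.

    calls: dict of 0-based position -> (ref_allele, called_allele).
    Raises ValueError if the stated reference allele does not match
    the sequence, which usually signals a coordinate or strand mix-up.
    """
    seq = list(ref_seq)
    for pos, (ref_allele, called_allele) in calls.items():
        if seq[pos] != ref_allele:
            raise ValueError(
                f"reference mismatch at {pos}: "
                f"expected {ref_allele}, found {seq[pos]}"
            )
        seq[pos] = called_allele
    return "".join(seq)


personal_ref = individualize_reference("ACGTACGT", {2: ("G", "A"), 7: ("T", "C")})
print(personal_ref)
```

Only the positions present in `calls` change, which is exactly the limitation described above: the benefit is confined to sites the array already covers.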
You might be tempted to think that the array results might help you to do some kind of HapMap analysis and thus make predictions about variations near the ones you measured. That might indeed work and is how arrays are often used. But... you will not be able to find cases where HapMap-linked variations are in fact unlinked at a frequency lower than the frequency of the variation measured on the array. In fact, the estimate is based on the linkage being real.
In other words I think the sequencing might help to improve the array results. But there is not much contribution the other way around.
Let me add a disclaimer ;-). I have never really done this. So I may have missed something obvious. But I found it an interesting thought experiment. Thanks for a nice question!
modified 9.0 years ago
9.0 years ago by
Chris Evelo ♦ 10.0k