Question: Low-Coverage Sequencing Combined With Array-Based Genotyping To Identify Allele Frequencies In Our Population
Thomas (Copenhagen, DK) wrote, 8.1 years ago:

Hi all.

I am working in a small group on genetic disease prediction, and we are currently discussing a setup for whole-genome sequencing of our study population of approximately 2,000 well-characterized individuals.

Our goal is to identify variants down to an allele frequency of 0.1%.

An easy answer would be to do whole-genome sequencing at very high depth in all individuals. But I am more interested in the idea of applying low-depth (approximately 4X) whole-genome sequencing combined with an array-based genotyping assay.

Low-depth sequencing results in high uncertainty in the called genotypes, and it will be almost impossible to estimate allele frequencies down to 0.1%. However, if we also apply an array-based genotyping assay, we can impute the genotype information to decrease the uncertainty in the sequence-called genotypes, and then be more confident in the low-frequency estimates. (Please correct me if I am wrong!)
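
To make the idea concrete, here is a toy sketch (made-up numbers on my part, not a real pipeline) of how a genotype prior imputed from array-anchored haplotypes could sharpen an uncertain low-depth call:

```python
# Toy sketch, not a real pipeline: how a genotype prior imputed from
# array-anchored haplotypes could sharpen an uncertain low-depth call.
# Assumes a biallelic site, a binomial read model with a fixed error
# rate, and made-up prior probabilities standing in for imputation.
from math import comb

ERR = 0.01  # assumed per-base error rate

def genotype_likelihoods(n_ref, n_alt):
    """P(reads | genotype) for RR, RA, AA under a binomial model."""
    n = n_ref + n_alt
    out = []
    for p_alt in (ERR, 0.5, 1 - ERR):  # expected ALT fraction per genotype
        out.append(comb(n, n_alt) * p_alt**n_alt * (1 - p_alt)**n_ref)
    return out

def posterior(lks, prior):
    joint = [l * p for l, p in zip(lks, prior)]
    z = sum(joint)
    return [x / z for x in joint]

# At 4X with 3 REF reads and 1 ALT read, the reads alone are ambiguous:
lks = genotype_likelihoods(3, 1)
print(posterior(lks, [1/3, 1/3, 1/3]))     # ~[0.13, 0.87, 0.00]
# A haplotype-based prior favouring the het (made-up numbers) sharpens it:
print(posterior(lks, [0.10, 0.85, 0.05]))  # ~[0.02, 0.98, 0.00]
```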

Does anyone have hands-on experience combining low-depth sequencing data and array-based genotyping to estimate rare allele frequencies (preferably down to 0.1%)? Are there simulation studies covering sequencing depth, genotyping arrays, and the power to estimate allele frequencies across the genome?

Any comments/suggestions are very welcome.

I do apologize if this question has already been discussed elsewhere on Biostar!

all my best

Thomas

Chris Evelo (Maastricht, The Netherlands) wrote, 8.1 years ago:

While the idea is interesting, I think you have another problem here. If you really want to estimate allele frequencies down to 0.1%, then a sample size of 2,000 just isn't big enough, regardless of how good your technology is. That is like trying to estimate how often one side of a coin shows up by flipping it twice. The chance that a given variant really present at 0.1% does not show up in a population of 2,000 at all is larger than 13% (0.999^2000, counting individuals; counting the 4,000 chromosomes it is still about 2%). That assumes your technology gives perfect calls.
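
A quick sanity check of that arithmetic (a toy script with my own numbers, assuming simple binomial sampling):

```python
# Probability that an allele at a true frequency of 0.1% is never observed
# in the sample, counting either the 2N chromosomes or the N individuals.
p = 0.001
for n in (500, 2000, 10000):
    print(f"{n:>6} individuals: "
          f"P(absent over 2N chromosomes) = {(1 - p) ** (2 * n):.3f}, "
          f"P(absent over N draws) = {(1 - p) ** n:.3f}")
```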

Edit: But OK... let's assume you would combine the two technologies and think about what you would get. Basically, the array gives rather good calls for up to a million known SNPs for which we have at least some idea of how often they occur in at least one population; that is, after all, why they are on the array. So you can assume that the arrays themselves will not find your low-frequency variants, simply because they were not designed to do that.

What I could imagine is that there might be low-frequency variants very close to such known SNPs, so close that they are in fact covered by the array reporter sequence and would influence the detection. Such a low-frequency SNP might in fact lead to wrong calls from the array. You could map high-throughput sequencing reads to those array reporter sequences and evaluate that problem. That in itself might be quite interesting, and in the end it might help you improve the calls for the arrays (but remember that these were quite good already).

I don't really see how it could work the other way around. Typically the sequencing technology seems to be able to pick up about 10 million real variants in an individual genome (see this). So most of these will not be covered by the array. Now you could improve your reference genome a little by actually putting the array results in. That should improve the mapping of your NGS reads to the (now individualized) reference genome a little. But of course that works best for the parts that you really modify (the ones covered by the array, the same ones as above). So I don't think that will really help substantially.

You might be tempted to think that the array results could support some kind of HapMap-style analysis and thus let you make predictions about variants near the ones you measured. That might indeed work, and it is how arrays are often used. But... you will not be able to infer linked variants at a frequency lower than the frequency of the variant measured on the array; the estimate is, after all, based on the linkage being real.

In other words, I think the sequencing might help improve the array results, but there is not much contribution the other way around.

Let me add a disclaimer ;-): I have never actually done this, so I may have missed something obvious. But I found it an interesting thought experiment. Thanks for a nice question!

That is obviously very correct... and thanks for the comment. I also regret a bit that I wrote 0.1%. With my question I was hoping to get an idea of how reliable my genotype calls would be at the individual level when combining low-coverage sequencing and genotyping data.

written 8.1 years ago by Thomas
lh3 (United States) wrote, 8.1 years ago:

Firstly, the 0.1% I was talking about means 0.1% in the samples we sequenced, not in the entire population. Many SNPs at 0.1% in the population may not be present in, say, 1,000 samples at all. There are simulations assessing how many 0.1% population SNPs can be found (see below).

If you are asking whether we can find 0.1% SNPs from ~500 samples, you may check the more recent 1000 Genomes SNP calls. A large fraction (EDIT: 20%) of the SNPs called from ~600 samples are singletons, already below the 0.1% line. The false positive rate (FPR) of singletons will be high, but should be within 10%. The false negative rate (FNR) is surely much higher.
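
As a rough illustration of the singleton arithmetic (a toy script, assuming simple binomial sampling from the population frequency):

```python
# How often a variant at 0.1% population frequency shows up in ~600
# samples (1,200 chromosomes), under binomial sampling.
from math import comb

chroms, p = 1200, 0.001
print("singleton sample frequency:", f"{1 / chroms:.4%}")  # ~0.083%
for k in range(4):
    prob = comb(chroms, k) * p**k * (1 - p) ** (chroms - k)
    print(f"P({k} copies) = {prob:.3f}")
# -> absent ~30% of the time, a singleton ~36%, a doubleton ~22%
```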

Discovering a SNP is quite different from sequence-based genotyping of each individual. You can call a SNP when the overall signal across samples is strong, yet still not know the genotype of any particular individual.

I do not think combining with an array helps SNP discovery. When a rare SNP is not on the array, imputation only smooths out the signal in the sequencing data rather than enhancing it. What we do is the opposite: we impute SNPs found by sequencing into the array data.
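
A toy illustration of that smoothing effect (made-up numbers, not our actual pipeline): when the panel assigns a site a near-zero alternate-allele prior, clear read evidence is pulled toward hom-ref instead of being enhanced.

```python
# If a rare variant is absent from the array/reference panel, the imputed
# prior at that site is nearly all hom-ref, so even strong ALT read
# evidence is dragged back toward the reference call.
from math import comb

ERR = 0.01  # assumed per-base error rate

def genotype_likelihoods(n_ref, n_alt):
    """P(reads | genotype) for RR, RA, AA under a binomial model."""
    n = n_ref + n_alt
    return [comb(n, n_alt) * q**n_alt * (1 - q)**n_ref
            for q in (ERR, 0.5, 1 - ERR)]

def posterior(lks, prior):
    joint = [l * p for l, p in zip(lks, prior)]
    z = sum(joint)
    return [x / z for x in joint]

lks = genotype_likelihoods(2, 2)  # 4X, half the reads support ALT
print(posterior(lks, [1/3, 1/3, 1/3]))         # ~[0.00, 1.00, 0.00]
print(posterior(lks, [0.999, 0.00099, 1e-5]))  # ~[0.61, 0.39, 0.00]
```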

Several groups have done extensive simulations. The published ones include the QCALL paper from Richard Durbin's group, the supplementary material of the 1000 Genomes paper, and more recent work published online in Genome Research by Goncalo Abecasis's group. If you are interested, you should read them; I have forgotten the conclusions, and summarizing those papers would give you a better answer than I can.

I don't understand. I assume you will indeed find a large number of singletons (variants occurring only once) in a group of 600 samples. Some of these will be false positives because of the technology (they were just called wrong); from the reference I gave above you could estimate that at about 5% (so indeed a false positive rate of less than 10% from the technology alone), though that also depends on coverage. Some of them will be real variants but still "false positives" with respect to the frequency claim, since their true frequency is much lower. You can in fact calculate how many of these there are. But how can you ever be sure for a specific SNP?

written 8.1 years ago by Chris Evelo

I mean you say that from a set of 600 samples a large number were assigned an occurrence of 0.1% or lower. That doesn't make sense: the measured occurrence is, after all, almost 0.2%. As said, a lot of these will really be lower, but you don't know which ones. So you can only give a probability for how many of these will really be below 0.1%, not assign that to specific SNPs.

written 8.1 years ago by Chris Evelo

The 1000 Genomes paper is here: http://dx.doi.org/doi:10.1038/nature09534. Could you please add the other references?

written 8.1 years ago by Chris Evelo

600 samples have 1,200 chromosomes, so a singleton is at 1/1200 ≈ 0.08% in frequency, below the 0.1% line.

written 8.1 years ago by lh3

Aha, I understand what you mean. But I thought SNP frequencies were defined over individuals, not over single chromosomes. In other words, mutations on the two chromosomes of one individual would not be discriminated; mathematically they count as 1, not as 2. See e.g. http://www.dnabaser.com/articles/SNP/SNP-single-nucleotide-polymorphism.html. You could indeed treat it differently with NGS data if you really knew which chromosome to map the mutation to, but that doesn't seem very likely at this kind of coverage, and it calls for a redefinition of what a SNP is.

written 8.1 years ago by Chris Evelo

Practically, a SNP is defined as a difference between chromosomes. SNP frequency is always defined as a frequency across chromosomes, never across individuals. I have not seen a single exception.

written 8.1 years ago by lh3

Thanks a lot for all the comments. I have read through the article from Goncalo's group (as suggested), which was very enlightening.

Best, Thomas

written 8.1 years ago by Thomas

Yes, that is a very good paper if you are thinking about similar things.

written 8.1 years ago by lh3