Question: Criteria For Calling Genotypes With High Certainty
Thomas (Copenhagen, DK) wrote 9.7 years ago:

Dear Friends,

We are preparing a large scale genetic association study where we want to sequence the exomes of a large number of individuals (up to 2000 IDs).

It is very important for us to be able to call genotypes at all sites, since we want to do direct genetic association studies in our individuals with the given disease.

The sequencing is outsourced, so it is extremely important that we set the right criteria for our collaborators to reach our goal: calling genotypes with very high confidence at all sites.

What information do you suggest we provide to our collaborators to reach our goal?

Some of our concerns are depth (e.g. min. 50X), coverage (e.g. min. 90% per sample/per region), and a quality score (e.g. 90% of all sites with depth > 50X and Q20).
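One point worth checking before fixing these numbers: a mean-depth spec and a per-site depth criterion interact. Even under an idealized Poisson model (real exome capture is considerably more uneven), a mean depth equal to the per-site threshold leaves only about half of the sites at or above that threshold. A minimal sketch, assuming Poisson-distributed depth:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), accumulated term by term."""
    term = math.exp(-lam)   # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        total += term
    return total

def frac_sites_at_least(min_depth, mean_depth):
    """Expected fraction of sites with depth >= min_depth under the model."""
    return 1.0 - poisson_cdf(min_depth - 1, mean_depth)

# With a 50X mean, only about half of sites actually reach 50X:
print(frac_sites_at_least(50, 50))   # ~0.52
# Pushing the mean to ~60X brings roughly 90% of sites to 50X:
print(frac_sites_at_least(50, 60))   # ~0.9
```

So a "90% of sites at ≥50X" criterion implies requesting a mean depth well above 50X — and more in practice, since capture depth is overdispersed relative to Poisson.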

Are there any pitfalls we should be aware of?

Thanks in advance

Thomas and Karina

To clarify the above question we are now adding this simplified example:

We tell our sequencing collaborator that we want a mean depth of 30X per individual. After sequencing we start in-house QC: remove reads with multiple hits, remove reads with Qscore < 20, etc. We end up with a mean depth of 18X, cannot assign genotypes, and therefore cannot run the association study. How can we make sure that, after QC, our data are still suitable for genotype calling?

Our sequencing collaborator uses Roche NimbleGen for exome target capture and paired-end sequencing on the Illumina HiSeq 2000.
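One way to guard against the 30X-becomes-18X scenario is to budget the expected QC losses into the requested raw depth. The loss rates below (duplicates, multi-mappers, low-quality reads) are made-up illustrative numbers, not estimates for HiSeq 2000 data:

```python
def required_raw_depth(target_depth, dup_rate, multimap_rate, lowq_rate):
    """Raw mean depth to request so that, after discarding duplicate,
    multi-mapping, and low-quality reads, the surviving mean depth
    still meets target_depth.  Losses are assumed to be independent
    fractions of reads (a simplification)."""
    retention = (1 - dup_rate) * (1 - multimap_rate) * (1 - lowq_rate)
    return target_depth / retention

# To keep 30X after losing 10% duplicates, 5% multi-mappers and
# 10% low-quality reads, request roughly 39X raw:
print(required_raw_depth(30, 0.10, 0.05, 0.10))  # ~39.0
```

Specifying the contract in terms of post-filter depth, rather than raw mean depth, puts the QC loss on the collaborator's side of the agreement.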

modified 9.7 years ago by lh3 • written 9.7 years ago by Thomas
Larry_Parnell (Boston, MA USA) wrote 9.7 years ago:

I can see a potential pitfall with the coverage filter you intend to use: small insertions/deletions may be missed. This is of particular concern for exomes and triplet repeats. Such exons are likely few in number, and perhaps they can be flagged up front, either from your data on other samples (where the expansion of this type of repeat is smaller in scale) or by listing genes/exons known to be susceptible to triplet-repeat expansion.

I'd also raise the question: how will your data look for CNV deletions, where an individual has either one or zero copies of a gene or exon? Coverage could be zero in that case. So is it a failed attempt to sequence, or a deletion?

written 9.7 years ago by Larry_Parnell

Thanks for the answer, much appreciated. Our question might have been misunderstood; we have added a simplified example to the question above to clarify (mean 30X requested, only 18X left after in-house QC).

Thomas

written 9.7 years ago by Thomas

For association studies, these are not major concerns IMO.

written 9.7 years ago by lh3
lh3 (United States) wrote 9.7 years ago:

How is your average depth calculated? The number of uniquely aligned bases divided by the target length? If so, this is unfair to your collaborators. A better way to compute depth is to measure it at HapMap SNPs. But even so, an average depth is not very informative: in exome sequencing, read depth varies greatly across the target. "What fraction of the target region is covered by >20 reads?" is a better criterion.
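That fraction-of-target criterion can be computed directly from per-base depth output such as that of `samtools depth`. A minimal sketch; it assumes a three-column chrom/pos/depth layout and that uncovered target positions are simply absent from the output:

```python
def breadth_of_coverage(depth_lines, target_len, min_depth=20):
    """Fraction of target bases covered by more than min_depth reads.

    depth_lines: iterable of 'chrom<TAB>pos<TAB>depth' lines.
    target_len:  total length of the capture target in bases; positions
                 missing from depth_lines count as depth 0.
    """
    covered = sum(1 for line in depth_lines
                  if int(line.rstrip("\n").split("\t")[2]) > min_depth)
    return covered / target_len

# Toy example: a 4 bp target where only two bases exceed 20 reads.
lines = ["chr1\t1\t25\n", "chr1\t2\t5\n", "chr1\t3\t40\n"]
print(breadth_of_coverage(lines, 4))  # 0.5
```

Reporting this per sample (and per region) catches uneven capture that a genome-wide mean depth hides.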

For an association study, the important thing is to get unbiased data. You may want to barcode cases and controls in the same lane, or at least sequence them in the same run, to alleviate batch effects. In my opinion, though others may disagree, we should not call hard genotypes in the first place. Calling adds noise, especially when there is a batch effect. The right way is to compute allele-frequency differences directly from the data, which is really what an association study cares about. It is theoretically possible to do an association test with 4X coverage; 18X is more than enough in my view.
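The point that hard genotype calls are noisy at low depth can be seen from genotype likelihoods under a toy binomial error model (a deliberate simplification; samtools' actual model also uses per-base qualities and mapping errors, and the 1% error rate here is a made-up figure):

```python
import math

def genotype_likelihoods(ref_count, alt_count, err=0.01):
    """P(observed read counts | genotype) at a diploid site.

    hom_ref emits alt reads only by sequencing error (rate err),
    het emits ref/alt 50:50, hom_alt emits ref reads only by error.
    """
    n = ref_count + alt_count
    def binom(k, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)
    return {"hom_ref": binom(alt_count, err),
            "het":     binom(alt_count, 0.5),
            "hom_alt": binom(alt_count, 1 - err)}

# At 4X (3 ref, 1 alt) the call is ambiguous: het is favoured, but
# hom_ref (one error read) retains non-trivial likelihood.
print(genotype_likelihoods(3, 1))
# At 18X (17 ref, 1 alt) hom_ref dominates het by orders of magnitude.
print(genotype_likelihoods(17, 1))
```

This is why propagating genotype uncertainty into the association test, or testing on allele counts directly, matters most at low depth: a hard call at 4X throws away exactly the ambiguity the test should see.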

On the other hand, none of this has been fully explored. You are among the first groups doing this kind of study. I could be wrong, too.

EDIT:

The latest samtools performs association tests on sequencing data, tested on 150 samples of 4X whole-genome sequencing and 2000 samples of 2X target sequencing.

modified 9.7 years ago • written 9.7 years ago by lh3

Thanks a lot for the answer. To remove any batch effect, we randomise our samples before they are sent to the collaborators. When we do our association studies we adjust for the uncertainty of the genotypes, which has to be as small as possible. If the depth is only 4X, we lose a considerable amount of statistical power for identifying true effect sizes in common diseases, and rare SNPs (MAF < 5%) will not be captured successfully. Unfortunately, I do not have a power calculation to support this. As for coverage of the target region, we hope to require 90% coverage at 50X depth and Qscore > 20.

written 9.7 years ago by Thomas