Question

Criteria For Calling Genotypes With High Certainty

5

13.1 years ago

Thomas ▴ 760

Dear Friends,

We are preparing a large-scale genetic association study in which we want to sequence the exomes of a large number of individuals (up to 2000).

It is very important for us to be able to call genotypes at all sites, since we want to do direct genetic association studies in our individuals with the given disease.

The sequencing is outsourced, so it is extremely important that we give our collaborators the right criteria to reach our goal, which is to call genotypes with very high probability at all sites.

What information do you suggest we provide to our collaborators to reach our goal?

Some of our concerns are the depth (e.g. min. 50X), the coverage (e.g. min. 90% per sample/per region), and a quality score (e.g. 90% of all sites with depth > 50X and Q20).

Are there any pitfalls we should be aware of?

Thanks in advance

Thomas and Karina

To clarify the above question we are now adding this simplified example:

We tell our sequencing collaborator that we want a mean depth of 30X per individual. After sequencing we start in-house QC: remove reads with multiple hits, remove reads with Qscore<20, etc. In the end we have a depth of 18X, and we can't assign genotypes or do the association study. How can we make sure that, after QC, our data is suitable for genotype calling?

Our sequencing collaborator uses Roche NimbleGen for exome capture and paired-end sequencing (Illumina HiSeq2000).
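To put numbers on this example, here is a rough sketch in Python. It assumes that QC removes the same fraction of depth no matter how deep we sequence, which is our simplifying assumption and not something we have measured. The function names are made up for illustration.

```python
def retained_fraction(raw_depth, post_qc_depth):
    """Fraction of the raw mean depth that survives in-house QC."""
    return post_qc_depth / raw_depth


def raw_depth_needed(target_post_qc_depth, retained):
    """Raw mean depth to request so the post-QC depth reaches the target.

    Assumes the retained fraction does not change with sequencing depth.
    """
    return target_post_qc_depth / retained


# Numbers from the example above: 30X requested, 18X left after QC.
retained = retained_fraction(30.0, 18.0)
needed = raw_depth_needed(30.0, retained)
print(f"retained after QC: {retained:.0%}")
print(f"raw depth to request for 30X after QC: {needed:.0f}X")
```

So if QC costs us 40% of the depth, asking for 30X raw is not the same as asking for 30X usable.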

next-gen sequencing association snp • 4.8k views

ADD COMMENT • link updated 13.1 years ago by lh3 33k • written 13.1 years ago by Thomas ▴ 760

score 1 · Answer 1 · 2011-03-15

1

13.1 years ago

Larry_Parnell 16k

I can see a potential pitfall with the coverage filter you intend to use. Small insertions/deletions may be missed. This would be of particular concern with exomes and triplet repeats. That is likely to be a small number and perhaps these exons can be flagged up front either by your data from other samples (where the expansion of this type of repeat is smaller scale) or by listing genes/exons known to be susceptible to triplet repeats.

I'd also offer the question: How will your data look for CNV deletions, where an individual has either one or zero copies of a gene or exon? Coverage could be zero in this case. So, is it a failed attempt to sequence or a deletion?

ADD COMMENT • link 13.1 years ago by Larry_Parnell 16k

0

Thanks for the answer, much appreciated. Our question may have been misunderstood, so I will try to clarify with a simplified example.

We tell our collaborator that we want a mean depth of 30X per individual. After sequencing we start in-house QC: remove reads with multiple hits, remove reads with Qscore<20, etc. In the end we have a depth of 18X, and we can't assign genotypes or do the association study.

How can we make sure that, after QC, our data is suitable for genotype calling?

Our collaborator uses Roche NimbleGen for exome capture and paired-end sequencing (Illumina HiSeq2000).

Thomas

ADD REPLY • link 13.1 years ago by Thomas ▴ 760

0

For association studies, these are not major concerns IMO.

ADD REPLY • link 13.1 years ago by lh3 33k

score 1 · Answer 2 · 2011-03-17

How is your average depth calculated? As the number of uniquely aligned bases divided by the target length? If so, this is unfair to your collaborators. A better way to compute depth is to measure it at HapMap SNPs. Even so, an average depth is not very informative: in exome sequencing, read depth varies greatly across targets. "What fraction of the target regions is covered by >20 reads?" is a better criterion.
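As a minimal sketch of that criterion, the Python below counts target bases covered by more than 20 reads. It assumes per-base depth is available as tab-separated `chrom`, `pos`, `depth` lines restricted to the target regions (the three-column layout that `samtools depth` prints for a single BAM), and that positions missing from the input have zero depth. The function name and the toy numbers are illustrative only.

```python
def fraction_covered(depth_lines, target_length, min_depth=20):
    """Fraction of target bases covered by MORE than `min_depth` reads.

    depth_lines   : iterable of "chrom<TAB>pos<TAB>depth" strings
    target_length : total number of bases in the target regions
                    (positions absent from depth_lines count as depth 0)
    """
    covered = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if int(fields[2]) > min_depth:
            covered += 1
    return covered / target_length


# Toy target of 5 bases; one base has no reads and is absent from the input.
toy = [
    "chr1\t100\t25",
    "chr1\t101\t30",
    "chr1\t102\t20",
    "chr1\t103\t5",
]
print(fraction_covered(toy, target_length=5))
```

In this toy case the mean depth is 16X, yet only 2 of the 5 target bases pass the >20 threshold, which is why the fraction says more than the average.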

For an association study, the important thing is to get unbiased data. You may want to barcode cases and controls in the same lane, or at least sequence them in the same run, to reduce batch effects. In my opinion, though others may disagree, we should not call genotypes in the first place. Doing so adds noise, especially when there is a batch effect. The right way is to compute the differences between cases and controls directly from the data, which is what an association study really cares about. It is theoretically possible to do an association test at 4X coverage. 18X is more than enough in my view.

On the other hand, none of this has been fully explored. You are among the first people to do this kind of study. I could be wrong, too.
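To illustrate the idea of testing without hard genotype calls, here is a toy Python sketch. It is not the samtools implementation. It assumes a biallelic site, a fixed per-read error rate of 0.01, Hardy-Weinberg proportions within each group, and that each individual is summarized by (reference reads, alternate reads). All names and numbers are hypothetical. It fits the alternate allele frequency by EM from genotype likelihoods and compares "separate frequencies for cases and controls" against "one shared frequency" with a 1-df likelihood ratio test.

```python
import math

ERROR_RATE = 0.01  # assumed per-read error rate (illustrative)


def genotype_likelihoods(ref_reads, alt_reads, error=ERROR_RATE):
    """P(reads | genotype) for 0, 1 and 2 copies of the alternate allele."""
    likes = []
    for p_alt in (error, 0.5, 1.0 - error):
        likes.append((p_alt ** alt_reads) * ((1.0 - p_alt) ** ref_reads))
    return likes


def hwe_priors(freq):
    """Genotype probabilities under Hardy-Weinberg for alt frequency freq."""
    return [(1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2]


def fit_allele_frequency(samples, iterations=200):
    """EM estimate of the alt allele frequency; returns (freq, log-likelihood)."""
    likes = [genotype_likelihoods(ref, alt) for ref, alt in samples]
    freq = 0.5
    for _ in range(iterations):
        priors = hwe_priors(freq)
        expected_alt = 0.0
        for gl in likes:
            weights = [l * p for l, p in zip(gl, priors)]
            expected_alt += (weights[1] + 2.0 * weights[2]) / sum(weights)
        # Clamp to guard against rounding just outside [0, 1].
        freq = min(1.0, max(0.0, expected_alt / (2.0 * len(likes))))
    priors = hwe_priors(freq)
    log_lik = sum(
        math.log(sum(l * p for l, p in zip(gl, priors))) for gl in likes
    )
    return freq, log_lik


def association_lrt(cases, controls):
    """Likelihood ratio test: separate vs. shared allele frequency (1 df)."""
    _, ll_cases = fit_allele_frequency(cases)
    _, ll_controls = fit_allele_frequency(controls)
    _, ll_pooled = fit_allele_frequency(cases + controls)
    stat = max(0.0, 2.0 * (ll_cases + ll_controls - ll_pooled))
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data at about 4X per individual: (ref reads, alt reads).
cases = [(1, 3), (2, 2), (0, 4), (2, 2), (1, 3), (3, 1)]
controls = [(4, 0), (4, 0), (3, 1), (4, 0), (4, 0), (2, 2)]
stat, p_value = association_lrt(cases, controls)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4f}")
```

No individual is ever assigned a genotype here; the uncertainty at low depth is carried through the likelihoods into the test itself.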

EDIT:

The latest samtools performs an association test on sequencing data. Tested on 150 × 4X whole-genome sequencing and 2000 × 2X target sequencing.