We are preparing a large-scale genetic association study in which we want to sequence the exomes of a large number of individuals (up to 2000).
It is very important for us to be able to call genotypes at all sites, since we want to do direct genetic association studies on our individuals with the given disease.
The sequencing is outsourced, so it is extremely important that we give our collaborators the right criteria to reach our goal, which is to call genotypes with very high probability at all sites.
What information do you suggest we provide to our collaborators to reach our goal?
Some of our concerns are the depth (e.g. min. 50X), the coverage (e.g. min. 90% per sample/per region), and a quality score (e.g. 90% of all sites with depth > 50X and Q20).
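To make a criterion of this kind concrete, here is a minimal sketch of how it could be checked per sample. It assumes we already have, for each target site, a depth and a mean base quality; the function names, the input format, and the example values are all hypothetical and only for illustration.

```python
def fraction_passing(sites, min_depth=50, min_qual=20):
    """Fraction of target sites that meet both thresholds.

    sites: iterable of (depth, mean_base_quality) tuples, one per target site
           (hypothetical input format, assumed for this sketch).
    """
    sites = list(sites)
    if not sites:
        return 0.0
    passing = sum(
        1 for depth, qual in sites
        if depth >= min_depth and qual >= min_qual
    )
    return passing / len(sites)


def meets_criterion(sites, min_fraction=0.90, min_depth=50, min_qual=20):
    """True if at least min_fraction of sites pass the depth and quality cut."""
    return fraction_passing(sites, min_depth, min_qual) >= min_fraction


# Made-up values for five sites: (depth, mean base quality)
example_sites = [(60, 35), (55, 30), (48, 33), (70, 18), (52, 25)]
print(fraction_passing(example_sites))
print(meets_criterion(example_sites))
```

Stating the requirement in this form (a threshold on depth, a threshold on quality, and a minimum fraction of sites) would give the collaborators something they can verify before delivery.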
Are there any pitfalls we should be aware of?
Thanks in advance
Thomas and Karina
To clarify the above question we are now adding this simplified example:
Suppose we tell our sequencing collaborator that we want a mean depth of 30X per individual. After sequencing we start our in-house QC: we remove reads with multiple hits, reads with a Q score < 20, and so on. We end up with a depth of 18X, so we cannot assign genotypes and cannot do the association study. How can we make sure that, after QC, our data are suitable for genotype calling?
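The arithmetic behind this example can be sketched as follows. It assumes, as a simplification, that QC removes the same fraction of depth in every run, which may not hold in practice; the function names are hypothetical.

```python
def retained_fraction(raw_depth, post_qc_depth):
    """Fraction of the requested depth that survives QC."""
    return post_qc_depth / raw_depth


def raw_depth_needed(target_post_qc_depth, retained):
    """Raw depth to request so that the target depth remains after QC.

    Assumes the retained fraction is constant across runs (a simplification).
    """
    return target_post_qc_depth / retained


# Numbers from the example above: 30X requested, 18X left after QC
retained = retained_fraction(30, 18)
needed = raw_depth_needed(30, retained)
print(f"retained fraction: {retained:.2f}")
print(f"raw depth to request for 30X after QC: {needed:.0f}X")
```

In this example only 60% of the depth survives QC, so we would have to request roughly 50X raw to end up with 30X usable.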
Our sequencing collaborator uses Roche NimbleGen for exome capture and paired-end sequencing on an Illumina HiSeq 2000.