Question: What is the expected rate of SNPs in human exome sequencing with HiSeq?
2.8 years ago by
Cleveland, Ohio
JacobS890 wrote:

I am sequencing human exome data and looking for clinically relevant SNPs. I am using the standard GATK workflow, applying a hard filter, and then annotating with SnpEff and looking for ClinVar SNPs.

Overall, I'm getting about 1 in 25,000 exome bases being reported as a SNP at the end of GATK. Additionally, a single human exome results in about 450 ClinVar SNPs that are annotated with known disease states.

This seems quite high to me. Does anyone have a good idea of what frequency of SNPs I should be finding for a normal, healthy human exome? I assume I have lots of false positives due to my crude hard filtering method, but these are SNPs that survived the entire GATK workflow, including recalibration, etc., so I thought they would be higher quality.

Thanks for any perspective.

highseq snp exome • 1.8k views
modified 12 months ago by predeus770 • written 2.8 years ago by JacobS890

That depends on multiple factors, such as sample origin, exome kit, and the parameters used for alignment and variant calling. Usually, after filtering, annotating the final VCF with SnpEff or VEP, and restricting your variant list to the exonic regions only, you should get between 20k and 40k variants.

modified 12 months ago • written 12 months ago by Raony Guimarães950
12 months ago by
predeus770 wrote:

The original number is insanely low, and Wouter's number is very, very high. A good reference is the flagship ExAC paper:

The normal number of variants is approximately 1 per 1 kb of exonic sequence. More precisely, about 25,000 variants for a European sample and around 30,000 for an African sample should pass filtering in GATK. The number could be (much) higher for designs that include UTRs, of course. But I would warn against calling variants using only the manufacturer-provided intervals: at the very least, take the latest Gencode CDS and make a union BED file with the manufacturer's BED. That way you won't miss anything important that's omitted from the BED but still covered.
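The union of two interval sets can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for a dedicated tool (in practice something like bedtools, sorting and then merging the concatenated files, is the usual route). The function names `read_bed` and `union_intervals` and the toy coordinates are made up for the example; real input would be the kit's BED and a CDS BED derived from the Gencode annotation.

```python
def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping header lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def union_intervals(*interval_sets):
    """Merge overlapping or touching intervals from any number of BED sets."""
    merged = []
    # Sorting by (chrom, start, end) lets us merge in a single pass.
    for chrom, start, end in sorted(i for s in interval_sets for i in s):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical toy data standing in for the two real BED files.
kit = read_bed(["chr1\t100\t200", "chr1\t500\t600"])
cds = read_bed(["chr1\t150\t250", "chr2\t10\t50"])

union = union_intervals(kit, cds)
for chrom, start, end in union:
    print(f"{chrom}\t{start}\t{end}")
```

Note that the sort here is plain lexicographic on the chromosome name, which is fine for merging but may not match the contig order your reference or GATK expects, so re-sort the output accordingly.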

written 12 months ago by predeus770

Our enrichment kit is the SeqCap EZ Exome v3.0 Kit from Roche, which includes UTRs, some flanking regions and miRNAs, capturing about 64 Mb. With lower conservation, the number of variants in non-exonic regions is likely higher than 1 per kb. In addition, no filter was applied to my set of variants, if I remember correctly. So that would (mainly) explain the differences in the numbers obtained.

written 12 months ago by WouterDeCoster36k
2.8 years ago by
WouterDeCoster36k wrote:

I just looked up some exomes (but the kit you use for enrichment does matter) and find ~160k-180k variants.
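To put the figures in this thread on one scale, here is a rough back-of-the-envelope conversion to variants per kb. It assumes the roughly 64 Mb capture size quoted for the SeqCap kit applies to these exomes, which is an assumption; the helper name `variants_per_kb` is made up for the example.

```python
def variants_per_kb(n_variants, target_bp):
    """Variant density per 1,000 bases of target sequence."""
    return n_variants * 1000 / target_bp


# Rate reported in the question: 1 SNP per 25,000 exome bases.
asker_rate = variants_per_kb(1, 25_000)

# 160k-180k unfiltered variants over an assumed ~64 Mb capture (UTRs included).
capture_bp = 64_000_000
wide_kit_low = variants_per_kb(160_000, capture_bp)
wide_kit_high = variants_per_kb(180_000, capture_bp)

print(f"question:       {asker_rate:.4f} per kb")
print(f"64 Mb capture:  {wide_kit_low:.4f} to {wide_kit_high:.4f} per kb")
```

For comparison, the figure of about 1 variant per kb of exonic sequence quoted elsewhere in the thread sits between these two.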

written 2.8 years ago by WouterDeCoster36k