Question: What is the expected rate of SNPs in human exome sequencing with HiSeq?
4.5 years ago by
Cleveland, Ohio
JacobS (930) wrote:

I am sequencing human exome data and looking for clinically relevant SNPs. I am using the standard GATK workflow, applying a hard filter, and then evaluating with SnpEff and looking for ClinVar SNPs.

Overall, about 1 in 25,000 exome bases is reported as a SNP at the end of the GATK workflow. Additionally, a single human exome yields about 450 ClinVar SNPs that are annotated with known disease states.

This seems quite high to me. Does anyone have a good idea of what frequency of SNPs I should be finding for a normal, healthy human exome? I assume I have lots of false positives due to my crude hard filtering method, but these are SNPs that survived the entire GATK workflow, including recalibration, etc., so I thought they would be of higher quality.

Thanks for any perspective.

highseq snp exome • 3.3k views
modified 2.8 years ago by predeus (1.4k) • written 4.5 years ago by JacobS (930)

That depends on multiple factors, such as sample origin, exome kit, and the parameters used for alignment and variant calling. Usually, after filtering and annotating the final VCF with SnpEff or VEP and restricting your variant list to the exonic regions only, you should get between 20k and 40k variants.
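To illustrate the "restrict to exonic regions" step, here is a minimal sketch that counts filter-passing VCF records falling inside a set of BED-style intervals. The function names and the toy records are hypothetical, and the linear scan is only suitable for small examples; in practice a dedicated tool such as bedtools or bcftools would normally be used. The sketch assumes standard coordinates: VCF POS is 1-based, BED intervals are 0-based and half-open.

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based VCF position lies in any 0-based half-open BED interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def count_variants_in_regions(vcf_lines, regions):
    """Count records whose FILTER is PASS or '.' and whose POS is inside regions."""
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, filt = fields[0], int(fields[1]), fields[6]
        if filt not in ("PASS", "."):
            continue  # drop records that failed a filter
        if in_regions(chrom, pos, regions):
            count += 1
    return count


# Hypothetical toy data: one exon on chr1 covering 1-based positions 101..200.
exons = [("chr1", 100, 200)]
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tG\t50\tPASS\t.",     # inside the exon
    "chr1\t100\t.\tC\tT\t50\tPASS\t.",     # one base before the exon
    "chr1\t200\t.\tG\tA\t50\tPASS\t.",     # last base of the exon
    "chr1\t150\t.\tT\tC\t10\tLowQual\t.",  # inside, but failed a filter
    "chr2\t150\t.\tA\tC\t50\tPASS\t.",     # different chromosome
]
print(count_variants_in_regions(records, exons))
```

The off-by-one handling matters here: a BED start of 100 corresponds to VCF position 101, so a variant at POS 100 is outside the interval.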

modified 2.8 years ago • written 2.8 years ago by Raony Guimarães (1.1k)
2.8 years ago by
predeus (1.4k) wrote:

The original number is insanely low, and Wouter's number is very, very high. A good reference is the flagship ExAC paper:

The normal number of variants is approximately 1 per 1 kb of exonic sequence. More precisely, about 25,000 variants for a European sample and around 30,000 for an African sample should pass filtering in GATK. The number of variants could be (much) higher for designs that include UTRs, of course. But I would warn against calling variants using only the manufacturer-provided intervals: at the very least, take the latest Gencode CDS and make a union BED file with the manufacturer's BED. That way you won't miss anything important that is omitted from the BED but still covered.
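A minimal sketch of the union-BED idea, assuming both inputs are plain three-column BED files already read into lists of lines. The function names and the toy intervals are hypothetical, and this is only one way to do it; overlapping or directly adjacent intervals are merged per chromosome.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping headers and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def union_intervals(*interval_sets):
    """Merge overlapping or adjacent half-open intervals from all inputs."""
    by_chrom = defaultdict(list)
    for intervals in interval_sets:
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_end is None:
                cur_start, cur_end = start, end
            elif start > cur_end:
                # Gap before this interval: close the current one.
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
            else:
                # Overlap or adjacency: extend the current interval.
                cur_end = max(cur_end, end)
        merged.append((chrom, cur_start, cur_end))
    return merged


# Hypothetical toy inputs standing in for the kit BED and a Gencode CDS BED.
kit_bed = ["chr1\t100\t200", "chr1\t500\t600"]
cds_bed = ["chr1\t150\t250", "chr2\t10\t20"]

union = union_intervals(parse_bed(kit_bed), parse_bed(cds_bed))
for chrom, start, end in union:
    print(f"{chrom}\t{start}\t{end}")
```

Chromosomes are sorted lexically here, which is fine for a sketch but differs from the karyotypic order that some downstream tools expect.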

written 2.8 years ago by predeus (1.4k)

Our enrichment kit is the SeqCap EZ Exome v3.0 kit from Roche, which includes UTRs, some flanking regions and miRNAs, capturing about 64 Mb. Because conservation is lower there, the number of variants in non-exonic regions is likely higher than 1 per 1 kb. In addition, no filter was applied to my set of variants, if I remember correctly. So that would (mainly) explain the differences in the numbers obtained.
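As a rough check, the figures quoted in this thread can be put on a common per-kb scale. This is only back-of-the-envelope arithmetic using the numbers given above (1 in 25,000 bases, 1 per 1 kb, and 160k-180k unfiltered variants over a ~64 Mb capture); the function name is made up for illustration.

```python
def variants_per_kb(n_variants, target_bp):
    """Variant density expressed as variants per 1,000 bp of target."""
    return n_variants / (target_bp / 1_000)


# Figures quoted in the thread.
asker_rate = variants_per_kb(1, 25_000)               # 1 SNP per 25,000 bases
reference_rate = variants_per_kb(1, 1_000)            # ~1 variant per 1 kb
kit_low = variants_per_kb(160_000, 64_000_000)        # 160k over ~64 Mb
kit_high = variants_per_kb(180_000, 64_000_000)       # 180k over ~64 Mb

print(f"asker:     {asker_rate:.2f} variants/kb")
print(f"reference: {reference_rate:.2f} variants/kb")
print(f"64 Mb kit: {kit_low:.2f} to {kit_high:.2f} variants/kb (unfiltered)")
```

On this scale the original poster's rate is about 25 times lower than the 1 per kb reference, while the unfiltered 64 Mb capture is roughly 2.5 to 2.8 times higher.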

written 2.8 years ago by WouterDeCoster (44k)
4.5 years ago by
WouterDeCoster (44k) wrote:

I just looked up some exomes (the kit you use for enrichment does matter) and found ~160k-180k variants.

written 4.5 years ago by WouterDeCoster (44k)

Did you get ~160k-180k variants with your WES analysis? Now I'm confused. I get 130K variants for the exonic regions only and 400K when including exons, UTRs, and introns! However, the comments above mention that it should be ~30K-40K for exomes. So which is normal: 30K, or the 160K that you mentioned?

written 19 months ago by maria2019 (100)