Question: What is the expected rate of SNPs in a human exome sequenced on a HiSeq?
JacobS (West Lafayette) wrote, 21 months ago:

I am sequencing human exome data and looking for clinically relevant SNPs. I am using the standard GATK workflow, applying a hard filter, and then annotating with SnpEff and looking for ClinVar SNPs.

Overall, I'm getting about 1 in 25,000 exome bases being reported as a SNP at the end of GATK. Additionally, a single human exome results in about 450 ClinVar SNPs that are annotated with known disease states.

This seems quite high to me. Does anyone have a good idea of what frequency of SNPs I should be finding in a normal, healthy human exome? I assume I have lots of false positives due to my crude hard-filtering method, but these are SNPs that survived the entire GATK workflow, including recalibration, etc., so I thought they would be higher quality.

Thanks for any perspective.

highseq snp exome • 797 views
modified 17 days ago by predeus • written 21 months ago by JacobS

That depends on multiple factors, such as sample origin, exome kit, and the parameters used for alignment and variant calling. Usually, after filtering, annotating the final VCF with SnpEff or VEP, and restricting your variant list to the exonic regions only, you should get between 20k and 40k variants.

modified 17 days ago • written 17 days ago by Raony Guimarães
predeus (Russia) wrote, 17 days ago:

The original number is insanely low, and Wouter's number is very, very high. A good reference is the flagship ExAC paper:

https://www.nature.com/articles/nature19057

The normal number of variants is approximately 1 per kb of exonic sequence. More precisely, about 25,000 variants for a European sample and around 30,000 for an African sample should pass filtering in GATK. The number could be (much) higher for designs that include UTRs, of course. But I would warn against calling variants using only the manufacturer-provided intervals: at the very least, take the latest Gencode CDS and make a union BED file with the manufacturer's BED. That way you won't miss anything important that is omitted from the manufacturer's BED but still covered.

written 17 days ago by predeus
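
The union of a kit BED with Gencode CDS intervals, as suggested in the answer above, could be sketched roughly as follows. This is a minimal illustration, not anyone's actual pipeline: it assumes both files have already been parsed into (chrom, start, end) tuples in the usual 0-based, half-open BED convention, the function name `union_intervals` is hypothetical, and the coordinates are made up. Overlapping and directly adjacent intervals are merged.

```python
from collections import defaultdict


def union_intervals(*interval_sets):
    """Merge several BED-like interval lists into one sorted, non-overlapping union.

    Hypothetical helper for illustration. Each interval is (chrom, start, end),
    0-based and half-open, as in BED.
    """
    by_chrom = defaultdict(list)
    for intervals in interval_sets:
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is None or start > current_end:
                # Gap before this interval: emit the finished one, start a new one.
                if current_end is not None:
                    merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
            else:
                # Overlapping or touching: extend the current interval.
                current_end = max(current_end, end)
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up coordinates, for illustration only.
kit_bed = [("chr1", 100, 200), ("chr1", 500, 600)]
gencode_cds = [("chr1", 150, 250), ("chr2", 10, 50)]

union = union_intervals(kit_bed, gencode_cds)
for chrom, start, end in union:
    print(chrom, start, end, sep="\t")
```

On real files, concatenating the two BEDs, sorting by chromosome and start, and running `bedtools merge -i` on the result should give the same kind of union without custom code.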

Our enrichment kit is the SeqCap EZ Exome v3.0 Kit from Roche, which includes UTRs, some flanking regions and miRNAs, capturing about 64Mb. With lower conservation, the variant density in non-exonic regions is likely higher than 1/1kb. In addition, no filter was applied to my set of variants, if I remember correctly. That would (mainly) explain the differences in the numbers obtained.

written 17 days ago by WouterDeCoster
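
To put the numbers in this thread on one scale, here is a small back-of-the-envelope sketch that converts them to variants per kb. It uses only figures quoted in the thread (1 SNP per 25,000 bases, about 1 variant per kb, 25,000 filtered variants, 160k-180k unfiltered variants over a 64Mb capture). The function names are hypothetical, and the "implied target size" is simple arithmetic, not a measured kit size.

```python
def variants_per_kb(n_variants, target_bp):
    """Variant density in variants per kilobase of target."""
    return n_variants / (target_bp / 1000)


def expected_variants(rate_per_kb, target_bp):
    """Expected variant count for a given density and target size."""
    return rate_per_kb * target_bp / 1000


# 64Mb capture with 160k-180k unfiltered variants (figures from the thread).
capture_bp = 64_000_000
low = variants_per_kb(160_000, capture_bp)
high = variants_per_kb(180_000, capture_bp)

# The original question: 1 SNP per 25,000 bases, expressed per kb.
asker_rate = 1000 / 25_000

# Target size implied by ~25,000 variants at ~1 variant per kb (arithmetic only).
implied_target_bp = 25_000 / 1.0 * 1000

print(f"64Mb capture, unfiltered: {low:.2f}-{high:.2f} variants/kb")
print(f"Original question: {asker_rate:.2f} variants/kb")
print(f"Implied target at 1/kb: {implied_target_bp / 1e6:.0f}Mb")
```

On these figures the unfiltered 64Mb call set is a few variants per kb, while the rate in the original question is well over an order of magnitude below 1 per kb.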
WouterDeCoster (Belgium) wrote, 21 months ago:

I just looked up some exomes (the kit you use for enrichment does matter) and found ~160k-180k variants.

written 21 months ago by WouterDeCoster