Question

What is the expected rate of SNPs in human exome sequencing with HiSeq?

0

9.2 years ago

JacobS ▴ 1000

I am sequencing human exome data and looking for clinically relevant SNPs. I am using the standard GATK workflow, applying a hard filter, and then annotating with SnpEff and looking for ClinVar SNPs.

Overall, I'm getting about 1 in 25,000 exome bases being reported as a SNP at the end of GATK. Additionally, a single human exome results in about 450 ClinVar SNPs that are annotated with known disease states.

This seems quite high to me. Does anyone have a good idea of what frequency of SNPs I should be finding for a normal, healthy human exome? I assume I have lots of false positives due to my crude hard-filtering method, but these are SNPs that survived the entire GATK workflow, including recalibration, etc., so I thought they would be of higher quality.

Thanks for any perspective.

SNP exome highseq • 5.1k views

ADD COMMENT • link updated 7.5 years ago by predeus ★ 2.1k • written 9.2 years ago by JacobS ▴ 1000

1

That depends on multiple factors, such as sample origin, exome kit, and the parameters used for alignment and variant calling. Usually, after filtering, annotating the final VCF with SnpEff or VEP, and restricting your variant list to the exonic regions only, you should get between 20k and 40k variants.

ADD REPLY • link 7.5 years ago by Raony Guimarães ★ 1.5k

score 2 · Answer 1 · 2018-01-03

2

7.5 years ago

predeus ★ 2.1k

The original number is insanely low, and Wouter's number is very, very high. A good reference is the flagship ExAC paper:

https://www.nature.com/articles/nature19057

The normal number of variants is approximately 1 per kb of exonic sequence. More precisely, about 25,000 variants for a European sample and around 30,000 for an African sample should pass filtering in GATK. The number of variants could be (much) higher for designs that include UTRs, of course. But I would warn against calling variants using manufacturer-provided intervals alone: at the very least, take the latest GENCODE CDS and make a union BED file with the manufacturer's BED. That way you won't miss anything important that is omitted from the manufacturer's BED but still covered.
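The union of two interval sets can be sketched in plain Python as below. This is only an illustration under stated assumptions: the coordinates are toy values, the function name `union_intervals` is hypothetical, and intervals are treated as BED-style half-open `(chrom, start, end)` tuples. In practice a dedicated interval tool would normally do this job on the real files.

```python
from collections import defaultdict


def union_intervals(*interval_sets):
    """Merge BED-style (chrom, start, end) intervals from several sets.

    Overlapping and directly adjacent intervals are collapsed into one.
    """
    by_chrom = defaultdict(list)
    for intervals in interval_sets:
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is None or start > current_end:
                # Gap before this interval: flush the previous merged block.
                if current_end is not None:
                    merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
            else:
                # Overlap or adjacency: extend the current block.
                current_end = max(current_end, end)
        merged.append((chrom, current_start, current_end))
    return merged


# Toy coordinates, NOT taken from any real kit or annotation.
vendor_targets = [("chr1", 100, 200), ("chr1", 500, 600)]
gencode_cds = [("chr1", 150, 250), ("chr2", 10, 50)]

for chrom, start, end in union_intervals(vendor_targets, gencode_cds):
    print(chrom, start, end, sep="\t")
```

The merged list is what would be written out as the calling-interval BED, so that CDS bases missing from the manufacturer's file are still included.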

ADD COMMENT • link 7.5 years ago by predeus ★ 2.1k

0

Our enrichment kit is the SeqCap EZ Exome v3.0 Kit from Roche, which includes UTRs, some flanking regions and miRNAs, capturing about 64 Mb. Because conservation is lower there, the number of variants in non-exonic regions is likely higher than 1 per kb. In addition, no filter was applied to my set of variants, if I remember correctly. That would (mainly) explain the differences in the numbers obtained.
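As a rough back-of-the-envelope check, the figures quoted in this thread can be put on the same scale. The sketch below uses only numbers from the posts (a capture of about 64 Mb, roughly 1 variant per kb of exonic sequence, and 160k-180k unfiltered variants); the function names are hypothetical, and applying the exonic rate to the whole 64 Mb target is a simplifying assumption.

```python
def expected_variants(target_bp, variants_per_kb):
    """Variant count expected for a target size at a given per-kb rate."""
    return target_bp * variants_per_kb / 1_000


def observed_rate_per_kb(n_variants, target_bp):
    """Observed variants per kb of target sequence."""
    return n_variants * 1_000 / target_bp


TARGET_BP = 64_000_000  # ~64 Mb capture, as quoted for this kit

# If the ~1 per kb exonic rate held across the whole target:
print(expected_variants(TARGET_BP, 1.0))

# Rate implied by the 160k-180k unfiltered variants:
print(observed_rate_per_kb(160_000, TARGET_BP))
print(observed_rate_per_kb(180_000, TARGET_BP))
```

Under these assumptions the observed counts correspond to roughly 2.5 to 2.8 variants per kb of target, which is in line with a higher rate in the non-exonic part of the capture plus unfiltered calls.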

ADD REPLY • link 7.5 years ago by WouterDeCoster 48k

score 1 · Answer 2 · 2016-04-20

1

9.2 years ago

WouterDeCoster 48k

I just looked up some exomes (the kit you use for enrichment does matter) and found ~160k-180k variants.

ADD COMMENT • link 9.2 years ago by WouterDeCoster 48k

0

Did you get ~160k-180k variants with your WES analysis? Now I am confused. I get 130K variants for the exome regions only and 400K when including exons, UTRs, and introns! However, the comments above say that it should be ~30K-40K for exomes. So which is normal: 30K, or the 160K that you mentioned?

ADD REPLY • link 6.3 years ago by maria2019 ▴ 250