Question: comparison of exome data to the 1000Genomes WGS data
2.2 years ago, gabili wrote:

I’m trying to incorporate SNP data from 1000Genomes into my exome data. Since no exome VCFs are available, I downloaded the 1000Genomes whole-genome sequence data and filtered it by the genomic positions of my variants (obtained from the PLINK bim file). My data is referenced to hg19, so I used the GRCh37 version of 1000Genomes ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", etc.). However, when I compared the two datasets (using PLINK 1.9 to open and filter the VCFs), I was surprised to find only ~25% of my exome variants in the 1000Genomes WGS data (for example, I have 300,000 SNPs on chromosome 1, but only 80,000 of them were found in the 1000Genomes WGS chromosome 1 file). When I used the "exome pull down targets" data to focus my search, I got very similar results. I expected some differences, but 75% "missingness" does not seem right. Any suggestions?
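
For reference, the position-based comparison described above can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used in the post: it assumes both datasets have been exported to PLINK .bim files (six whitespace-separated columns, with the base-pair position in the fourth), and the function names `read_bim_sites` and `overlap_fraction` are made up for this example. Matching on chromosome and position alone ignores alleles, so it gives an upper bound on the true overlap.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, base-pair position)


def read_bim_sites(lines: Iterable[str]) -> Set[Site]:
    """Collect (chromosome, position) pairs from the lines of a PLINK .bim file."""
    sites: Set[Site] = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom = fields[0]
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # treat "chr1" and "1" as the same chromosome
        sites.add((chrom, int(fields[3])))
    return sites


def overlap_fraction(query: Set[Site], reference: Set[Site]) -> float:
    """Fraction of query sites that are also present in the reference set."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


if __name__ == "__main__":
    # Tiny made-up example; real input would come from open("exome.bim") etc.
    exome = read_bim_sites(["1 rs1 0 100 A G", "1 rs2 0 200 C T"])
    kg = read_bim_sites(["1 rs1 0 100 A G"])
    print(f"{overlap_fraction(exome, kg):.0%} of exome sites found in reference")
```

Normalising the chromosome label matters here because a "chr1" versus "1" mismatch between the two files would by itself make the overlap look much smaller than it is.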

snp plink 1000genomes exome • 1.1k views
modified 2.2 years ago by trausch • written 2.2 years ago by gabili

What kind of data do you have? Disease or healthy donor. If disease, are there matched normals?

written 2.2 years ago by ATpoint

My data contains 6,500 people with type 2 diabetes and 6,500 people without. It was used in “The genetic architecture of type 2 diabetes”, Nature 536 (2016).

written 2.2 years ago by gabili

300,000 SNPs on chr1 for an exome capture data set? We usually have <100,000 confident exonic SNPs across the whole genome for whole-exome sequencing. What is the on-target rate, and what fraction of targets is covered at >=30x?

written 2.2 years ago by trausch

Hi, unfortunately I didn’t take part in the creation of the dataset; I’m just a simple “end user”. The data is part of the international type 2 diabetes consortium and was used in “The genetic architecture of type 2 diabetes”, Nature 536 (2016). These two paragraphs are from the methods section of the paper, and maybe they can help:

“...Exome sequencing. Genomic DNA was sheared, end repaired, ligated with barcoded Illumina sequencing adapters, amplified, size selected, and subjected to in-solution hybrid capture using the Agilent SureSelect Human All Exon 44Mb v2.0 (DGI, FUSION, UK2T2D) and v3.0 (KORA) bait set (Agilent Technologies, USA). Resulting Illumina exome sequencing libraries were qPCR quantified, pooled, and sequenced with 76-bp paired-end reads using Illumina GAII or HiSeq 2000 sequencers to ~82-fold mean coverage...”

“...Coverage and QC of aligned sequence reads. We excluded 151 exome samples with average coverage ≤20× in >20% of the target bases and 68 genome samples with average coverage ≤5×...”

modified 2.2 years ago • written 2.2 years ago by gabili

Thanks, so this is a population VCF with >10,000 samples? Then the callset is dominated by rare alleles such as singletons and doubletons (allele counts 1 and 2). If you subset your VCF to common variants (MAF > 1%), you will have a large intersection with 1000 Genomes; for the rare ones, it is no surprise that many are absent from 1000 Genomes.
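
One way to check this is to split the variants by minor allele frequency and compute the overlap separately for the rare and the common bin. The sketch below is a hedged illustration, not a tested pipeline: it assumes you already have a MAF per variant (for example from `plink --freq`) and a set of variant identifiers present in 1000 Genomes, and that the identifiers are comparable between the two datasets. The function name `overlap_by_maf` and the example identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def overlap_by_maf(
    variants: Iterable[Tuple[str, float]],
    reference_ids: Set[str],
    threshold: float = 0.01,
) -> Dict[str, float]:
    """Return the fraction of variants found in the reference, per MAF bin.

    `variants` yields (variant_id, maf) pairs; variants with maf > threshold
    are counted as "common", all others as "rare".
    """
    counts = {"rare": [0, 0], "common": [0, 0]}  # [found, total] per bin
    for variant_id, maf in variants:
        label = "common" if maf > threshold else "rare"
        counts[label][1] += 1
        if variant_id in reference_ids:
            counts[label][0] += 1
    return {
        label: (found / total if total else 0.0)
        for label, (found, total) in counts.items()
    }


if __name__ == "__main__":
    # Made-up identifiers and frequencies, for illustration only.
    my_variants = [("1:100", 0.20), ("1:200", 0.0001), ("1:300", 0.00005)]
    in_1000g = {"1:100"}
    print(overlap_by_maf(my_variants, in_1000g))
```

If the explanation above holds, the "common" fraction should be high while the "rare" fraction accounts for most of the missing variants.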

written 2.2 years ago by trausch