Question

comparison of exome data to the 1000Genomes WGS data

6.4 years ago

gabili • 0

I’m trying to incorporate SNP data from 1000Genomes into my exome data. Since there are no exome VCFs available, I downloaded the 1000Genomes whole-genome sequence data and filtered it by the genomic positions of my variants (taken from the PLINK .bim file). My data is referenced to hg19, so I used the GRCh37 version of 1000Genomes found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", etc.). However, when I compared the two datasets (using PLINK 1.9 to open and filter the VCFs), I was surprised to find only ~25% of my exome variants in the 1000Genomes WGS data. For example, I have 300,000 SNPs on chromosome 1, but only 80,000 of them were found in the 1000Genomes WGS chromosome 1 file. When I used the "exome pull down targets" data (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/) to focus my search, I got very similar results. I expected some differences, but 75% "missingness" doesn’t seem right. Any suggestions?
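For readers who want to reproduce the position-based comparison described above outside PLINK, here is a minimal Python sketch. It assumes a standard six-column .bim file and a VCF with tab-separated CHROM and POS as the first two columns. The function names and the tiny in-memory demo data are hypothetical, and stripping a leading "chr" is only a precaution against mismatched chromosome labels, not something known to affect this dataset.

```python
import io


def _normalise_chrom(chrom):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def read_bim_positions(handle):
    """Return the set of (chromosome, position) pairs in a PLINK .bim file."""
    positions = set()
    for line in handle:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, pos = fields[0], fields[3]  # .bim: chrom, id, cM, bp, a1, a2
        positions.add((_normalise_chrom(chrom), int(pos)))
    return positions


def read_vcf_positions(handle):
    """Return the set of (chromosome, position) pairs in a VCF file."""
    positions = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        positions.add((_normalise_chrom(chrom), int(pos)))
    return positions


def overlap_fraction(exome, reference):
    """Fraction of exome positions that also occur in the reference set."""
    if not exome:
        return 0.0
    return len(exome & reference) / len(exome)


# Hypothetical demo data standing in for the real files.
demo_bim = io.StringIO(
    "1\trs1\t0\t1000\tA\tG\n"
    "1\trs2\t0\t2000\tC\tT\n"
    "1\trs3\t0\t3000\tG\tA\n"
    "1\trs4\t0\t4000\tT\tC\n"
)
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\trs1\tA\tG\t.\tPASS\t.\n"
    "chr1\t5000\trs9\tC\tT\t.\tPASS\t.\n"
)
exome_positions = read_bim_positions(demo_bim)
wgs_positions = read_vcf_positions(demo_vcf)
print(f"{overlap_fraction(exome_positions, wgs_positions):.0%} of exome sites found")
```

For real data the handles would come from `open("mydata.bim")` and `gzip.open(path, "rt")` on the 1000Genomes file. With the numbers quoted above, 80,000 / 300,000 is about 27%.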

1000Genomes PLINK SNP EXOME

ATpoint (6.4 years ago):

What kind of data do you have: disease or healthy donors? If disease, are there matched normals?

gabili (6.4 years ago):

My data contains 6,500 people with type-2 diabetes and 6,500 people without. It was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642).

trausch (6.4 years ago):

300,000 SNPs on chr1 for an exome capture data set? We usually have <100,000 confident exonic SNPs across the whole genome for whole-exome sequencing. What is the on-target rate and the fraction of targets covered at >=30x?

gabili (6.4 years ago):

Hi. Unfortunately I didn’t take part in creating the dataset; I’m just a simple “end user”. The data is part of the international type-2 diabetes consortium and was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642). These two paragraphs are from the Methods section of the paper, and maybe they can help:

“...Exome sequencing. Genomic DNA was sheared, end repaired, ligated with barcoded Illumina sequencing adapters, amplified, size selected, and subjected to in-solution hybrid capture using the Agilent SureSelect Human All Exon 44Mb v2.0 (DGI, FUSION, UK2T2D) and v3.0 (KORA) bait set (Agilent Technologies, USA). Resulting Illumina exome sequencing libraries were qPCR quantified, pooled, and sequenced with 76-bp paired-end reads using Illumina GAII or HiSeq 2000 sequencers to ~82-fold mean coverage...”

“...Coverage and QC of aligned sequence reads. We excluded 151 exome samples with average coverage ≤20× in >20% of the target bases and 68 genome samples with average coverage ≤5×...”

trausch (6.4 years ago):

Thanks, so this is a population VCF with >10,000 samples? Then the callset is dominated by rare alleles such as singletons and doubletons (allele counts 1 and 2). If you subset your VCF to common variants (MAF > 1%), you will have a large intersection with 1000 Genomes. For the rare ones, it is no surprise that many are absent from 1000 Genomes.
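One way to check this explanation is to split the exome variants into common and rare at the MAF > 1% threshold and compute the overlap with 1000 Genomes separately for each group. The sketch below assumes allele frequencies come from a PLINK 1.9 `--freq` report (a .frq file with CHR, SNP, A1, A2, MAF and NCHROBS columns) and that the shared variants have already been identified by ID. The function names, the variant IDs and the numbers in the demo are hypothetical, and matching by ID is an assumption; if the .bim IDs are not rsIDs, match on chromosome and position instead.

```python
import io


def read_frq(handle):
    """Map variant ID -> minor allele frequency from a PLINK .frq report."""
    header = handle.readline().split()
    snp_col, maf_col = header.index("SNP"), header.index("MAF")
    maf_by_snp = {}
    for line in handle:
        fields = line.split()
        if len(fields) <= max(snp_col, maf_col):
            continue  # skip blank or truncated lines
        if fields[maf_col] == "NA":
            continue  # frequency undefined (no called genotypes)
        maf_by_snp[fields[snp_col]] = float(fields[maf_col])
    return maf_by_snp


def overlap_by_frequency(maf_by_snp, shared_ids, threshold=0.01):
    """Overlap fraction for common (MAF > threshold) and rare variants."""
    counts = {"common": [0, 0], "rare": [0, 0]}  # [shared, total]
    for snp, maf in maf_by_snp.items():
        label = "common" if maf > threshold else "rare"
        counts[label][1] += 1
        if snp in shared_ids:
            counts[label][0] += 1
    return {
        label: (shared / total if total else 0.0)
        for label, (shared, total) in counts.items()
    }


# Hypothetical demo: two common variants and two near-singletons.
demo_frq = io.StringIO(
    " CHR  SNP  A1  A2  MAF      NCHROBS\n"
    "   1  rs1   A   G  0.25     26000\n"
    "   1  rs2   C   T  0.00004  26000\n"
    "   1  rs3   G   A  0.05     26000\n"
    "   1  rs4   T   C  0.00008  26000\n"
)
frequencies = read_frq(demo_frq)
found_in_1000g = {"rs1", "rs3", "rs4"}  # IDs also present in the WGS file
result = overlap_by_frequency(frequencies, found_in_1000g)
print(result)
```

If the explanation holds, the "common" fraction on the real data should be much higher than the "rare" one. In PLINK 1.9 itself, `--maf 0.01` applies the same frequency filter before the comparison.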