Question: comparison of exome data to the 1000Genomes WGS data
gabili0 wrote:

I’m trying to incorporate SNP data from 1000Genomes into my exome data. Since there are no exome VCFs available, I downloaded the 1000Genomes whole-genome sequencing (WGS) data and filtered it by the genomic positions of my variants (taken from the PLINK bim file). My data is referenced to hg19, so I used the GRCh37 version of 1000Genomes found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", etc.). However, when I compared the two datasets (using PLINK 1.9 to open and filter the VCFs), I was surprised to find only ~25% of my exome variants in the 1000Genomes WGS data. For example, I have 300,000 SNPs on chromosome 1, but only 80,000 of them were found in the 1000Genomes WGS chromosome 1 file. When I used the "exome pull down targets" data (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/) to focus my search, I got very similar results. I expected some differences, but 75% "missingness" does not seem right. Any suggestions?

snp plink 1000genomes exome • 370 views
modified 5 months ago by trausch910 • written 5 months ago by gabili0
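A quick sanity check on the position matching described in the question can be sketched in Python. This is only a minimal sketch under assumptions: the helper names and the toy records are made up for illustration, and the "chr" prefix handling covers one common naming mismatch between files ("chr1" versus "1"), which may or may not be the issue here.

```python
def normalize_chrom(chrom):
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def bim_positions(lines):
    """Yield (chrom, pos) from PLINK .bim lines (chrom, id, cM, bp, a1, a2)."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        yield normalize_chrom(fields[0]), int(fields[3])


def vcf_positions(lines):
    """Yield (chrom, pos) from VCF data lines, skipping '#' header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        yield normalize_chrom(chrom), int(pos)


def overlap_fraction(bim_lines, vcf_lines):
    """Fraction of .bim positions that also occur in the VCF."""
    exome = set(bim_positions(bim_lines))
    reference = set(vcf_positions(vcf_lines))
    if not exome:
        return 0.0
    return len(exome & reference) / len(exome)


# Hypothetical toy records, not taken from either real dataset.
toy_bim = [
    "chr1 var1 0 1000 A G",
    "chr1 var2 0 2000 C T",
    "chr1 var3 0 3000 G A",
    "chr1 var4 0 4000 T C",
]
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tG\tA\t.\tPASS\t.",
    "1\t3000\t.\tA\tG\t.\tPASS\t.",
    "1\t5000\t.\tC\tT\t.\tPASS\t.",
]

toy_overlap = overlap_fraction(toy_bim, toy_vcf)
print(toy_overlap)
```

Matching on position alone ignores alleles, so a stricter check would also compare REF/ALT against the two .bim alleles.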

What kind of data do you have: disease or healthy donors? If disease, are there matched normals?

written 5 months ago by ATpoint3.5k

My data contains 6500 people with type 2 diabetes and 6500 people without. It was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642).

written 5 months ago by gabili0

300,000 SNPs on chr1 for an exome capture dataset? We usually see <100,000 confident exonic SNPs across the whole genome with whole-exome sequencing. What is the on-target rate, and what fraction of targets is covered at >=30x?

written 5 months ago by trausch910

Hi, unfortunately I didn’t take part in creating the dataset; I’m just a simple “end user”. The data is part of the international type 2 diabetes consortium, and was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642). These two paragraphs are from the methods section of the paper, and maybe they can help:

“...Exome sequencing. Genomic DNA was sheared, end repaired, ligated with barcoded Illumina sequencing adapters, amplified, size selected, and subjected to in-solution hybrid capture using the Agilent SureSelect Human All Exon 44Mb v2.0 (DGI, FUSION, UK2T2D) and v3.0 (KORA) bait set (Agilent Technologies, USA). Resulting Illumina exome sequencing libraries were qPCR quantified, pooled, and sequenced with 76-bp paired-end reads using Illumina GAII or HiSeq 2000 sequencers to ~82-fold mean coverage...”

“...Coverage and QC of aligned sequence reads. We excluded 151 exome samples with average coverage ≤20× in >20% of the target bases and 68 genome samples with average coverage ≤5×...”

modified 5 months ago • written 5 months ago by gabili0

Thanks. So this is a population VCF with >10,000 samples? Then the callset is dominated by rare alleles such as singletons and doubletons (allele counts of 1 and 2). If you subset your VCF to common variants (MAF > 1%), you will have a large intersection with 1000 Genomes; for the rare ones, it is no surprise that many are absent from 1000 Genomes.

written 5 months ago by trausch910
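The suggestion above (split by MAF, then look at the intersection) can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function names, the toy allele counts, and the toy reference set are hypothetical, and the total of 26,000 alleles simply assumes 13,000 diploid samples as described earlier in the thread.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Minor allele frequency from an alternate-allele count."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def overlap_by_maf(exome_maf, reference_sites, threshold=0.01):
    """Share of exome sites found in the reference, split at a MAF threshold.

    exome_maf maps (chrom, pos) to MAF; reference_sites is a set of (chrom, pos).
    Returns {'common': fraction or None, 'rare': fraction or None}.
    """
    counts = {"common": [0, 0], "rare": [0, 0]}  # [found, total] per class
    for site, maf in exome_maf.items():
        label = "common" if maf > threshold else "rare"
        counts[label][1] += 1
        if site in reference_sites:
            counts[label][0] += 1
    return {
        label: (found / total if total else None)
        for label, (found, total) in counts.items()
    }


# Hypothetical toy data: 13,000 diploid samples give 26,000 alleles per site.
TOTAL_ALLELES = 26000
toy_alt_counts = {
    ("1", 1000): 1,     # singleton
    ("1", 2000): 2,     # doubleton
    ("1", 3000): 1,     # singleton
    ("1", 4000): 5200,  # MAF 0.20
    ("1", 5000): 780,   # MAF 0.03
    ("1", 6000): 1,     # singleton
}
toy_maf = {
    site: minor_allele_frequency(count, TOTAL_ALLELES)
    for site, count in toy_alt_counts.items()
}
toy_reference = {("1", 3000), ("1", 4000), ("1", 5000)}

toy_result = overlap_by_maf(toy_maf, toy_reference)
print(toy_result)
```

If the rare class accounts for most of the missing sites in the real data, that would be consistent with the explanation above. In PLINK 1.9, the `--maf` filter could be used to do the common-variant subset directly.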