Question: (Closed) differences in covered regions in exome and whole genome sequencing
19 months ago, miaowzai (United States) wrote:

I have a VCF dataset from whole exome sequencing of a cohort of people. I am considering taking some individuals from the 1000 Genomes data and adding them to my data so that I have a bigger cohort.

To make the variant loci consistent, I subsetted the 1000 Genomes data to the variant positions in my exome VCF.

Since the 1000 Genomes data comes from whole genome sequencing, I assumed that it would cover all variant loci in my exome VCF. But when I checked the resulting file, I found that many variant loci (around 40-50% of all variant loci in the exome VCF) are present in the exome VCF but not in the 1000 Genomes VCF. (Both datasets are hg19 or b37.)

What are the possible reasons for this?

Is it because the 1000 Genomes whole genome sequencing does not have enough coverage to call all possible variants? Are there any other reasons? Thanks!
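
For reference, the comparison described above (which sites in the exome VCF are absent from the 1000 Genomes VCF) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used in the post: the function names `site_keys` and `missing_fraction` are hypothetical, sites are matched on chromosome, position, REF and ALT, and a leading `chr` is stripped because UCSC-style hg19 files name contigs `chr1` while b37 files name them `1`. Indels are not left-normalized here, so differently represented indels would count as missing.

```python
def site_keys(vcf_lines, strip_chr=True):
    """Collect (chrom, pos, ref, alt) keys from uncompressed VCF text lines.

    Multi-allelic records are split so that each ALT allele gets its own key.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Assumption: the two files may differ only by a "chr" prefix
        # (hg19 "chr1" versus b37 "1").
        if strip_chr and chrom.startswith("chr"):
            chrom = chrom[3:]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def missing_fraction(exome_lines, reference_lines):
    """Fraction of exome sites that have no exact match in the reference set."""
    exome = site_keys(exome_lines)
    reference = site_keys(reference_lines)
    if not exome:
        return 0.0
    return len(exome - reference) / len(exome)


# Toy records (hypothetical data, first five VCF columns only).
exome_demo = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT"]
kg_demo = ["1\t100\trs1\tA\tG,T"]
print(missing_fraction(exome_demo, kg_demo))
```

With real files, the lines could come from `open(path)` for plain-text VCFs or `gzip.open(path, "rt")` for `.vcf.gz` files. Matching on position alone instead of on position plus alleles would give a looser overlap estimate.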

modified 9 days ago by Biostar • written 19 months ago by miaowzai

Hello, what is the ethnicity of your sample cohort? Remember that, although 1000 Genomes was comprehensive, it only covers certain global populations. Also, how are you checking that variants are present or not in 1000 Genomes?

written 19 months ago by Kevin Blighe

I kept only around 900 EUR individuals from my exome data and only the ~500 EUR individuals from the 1000 Genomes data. One possible reason I can think of is that my dataset has more people than the 1000 Genomes subset, so some variants may be detected by exome sequencing in my cohort but be absent from the 1000 Genomes data. Still, I think 40-50% is too many. I checked against the genotype VCF files on the 1000 Genomes FTP site (individual-level genotype calls are publicly available there).

modified 19 months ago • written 19 months ago by miaowzai

Did you download the data as I do here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

The 1000 Genomes data is so large that it is still in the process of being curated.

Just another question: are the majority of the variants in your dataset private variants (i.e. only present in a single individual)?
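
A minimal sketch of how the private-variant question could be checked, assuming an uncompressed multi-sample VCF with a `GT` entry in the FORMAT column. The function name `count_private_sites` is hypothetical, and "private" is taken here to mean that exactly one sample carries any non-reference allele at the site:

```python
def count_private_sites(vcf_lines):
    """Return (private, total) counts over sites with at least one carrier.

    A site is counted as private when exactly one sample has a genotype
    containing a non-reference allele.
    """
    private = 0
    total = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype columns on this line
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype field to inspect
        gt_index = fmt.index("GT")
        carriers = 0
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            # Treat phased ("|") and unphased ("/") genotypes the same way.
            alleles = gt.replace("|", "/").split("/")
            if any(a not in ("0", ".") for a in alleles):
                carriers += 1
        if carriers > 0:
            total += 1
            if carriers == 1:
                private += 1
    return private, total


# Toy records (hypothetical data) with three samples each.
demo = [
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t0/0",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/1:10\t1|1:8\t./.",
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/0\t./.",
]
print(count_private_sites(demo))
```

A high private fraction in the exome cohort would be one plausible contributor to sites that are missing from a smaller reference panel.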

written 19 months ago by Kevin Blighe

Hello miaowzai!

We believe that this post does not fit the main topic of this site.

OP did not follow up.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree, please tell us why in a reply below. We'll be happy to talk about it.


written 9 days ago by ATpoint
The thread is closed. No new answers may be added.

