I have a VCF dataset from whole exome sequencing of a cohort of people. I was considering taking some individuals from the 1000 Genomes data and adding them to my data so that I have a bigger cohort.

To make the variant loci consistent, I subsetted the 1000 Genomes data to the variant positions in my exome VCF.

Since the 1000 Genomes data was generated by whole genome sequencing, I assumed that it would cover all variant loci in my exome VCF. But when I checked the resulting file, I found that many variant loci (around 40-50% of all variant loci in the exome VCF) are in the exome VCF but not in the 1000 Genomes VCF. (Both datasets are on hg19 or b37.)

I was wondering what the possible reasons for this are.

Is it because 1000genomes whole genome sequencing does not have enough coverage to call all possible variants? Any other reasons? Thanks!

Hello, what is the ethnicity of your sample cohort? Remember that, although 1000 Genomes was comprehensive, it only covers certain global populations. Also, how are you checking that variants are present or not in 1000 Genomes?

I only kept around 900 EUR individuals from my exome data and only the ~500 EUR individuals from the 1000 Genomes data. One possible reason I can think of is that my data has more people than the 1000 Genomes subset, so some variants may be discovered by exome sequencing but be absent from the 1000 Genomes data. Still, I think 40-50% is too many. I checked against the genotype VCF files on the 1000 Genomes FTP site (individual-level genotype call VCF files are publicly available there).
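
For reference, here is a minimal Python sketch of the kind of presence check being discussed. It is not the exact procedure used above. It assumes plain-text VCF lines, and the function names (`normalize_chrom`, `site_keys`, `missing_fraction`) are made up for illustration. It strips a leading "chr" from chromosome names, since hg19-style files usually write "chr1" while b37-style files write "1", and it can compare either by position only or by position plus REF/ALT, because the two choices can give quite different "missing" percentages.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int, str, str]


def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' so 'chr1' (hg19 style) and '1' (b37 style) compare equal.

    Note: this does not reconcile 'chrM' with 'MT'.
    """
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def site_keys(vcf_lines: Iterable[str], use_alleles: bool = True) -> Set[Site]:
    """Collect one key per (chrom, pos, ref, alt); multi-allelic ALTs are split.

    With use_alleles=False the key is position only (ref/alt left empty).
    """
    keys: Set[Site] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the header
        fields = line.rstrip("\n").split("\t")
        chrom = normalize_chrom(fields[0])
        pos = int(fields[1])
        ref, alts = fields[3].upper(), fields[4].upper()
        for alt in alts.split(","):
            if use_alleles:
                keys.add((chrom, pos, ref, alt))
            else:
                keys.add((chrom, pos, "", ""))
    return keys


def missing_fraction(exome_sites: Set[Site], reference_sites: Set[Site]) -> float:
    """Fraction of exome sites that are absent from the reference set."""
    if not exome_sites:
        return 0.0
    return len(exome_sites - reference_sites) / len(exome_sites)
```

Running this once with `use_alleles=True` and once with `use_alleles=False` gives a quick idea of how much of the gap comes from allele representation rather than from sites that are truly absent.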

Did you download the data as I do here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format?

The 1000 Genomes data is so large that it is still in the process of being curated.

Just another question: are the majority of the variants in your dataset private variants (i.e. only present in a single individual)?
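
If it helps, the private-variant question could be checked with something like the rough Python sketch below. It assumes a multi-sample VCF read as plain-text lines with GT as the first FORMAT key (which the VCF specification requires when GT is present). The function names are hypothetical, and a multi-allelic site is counted once per site rather than once per allele, which is a simplification.

```python
from typing import Iterable, List


def carrier_count(sample_fields: List[str]) -> int:
    """Count samples whose genotype contains at least one non-reference allele."""
    count = 0
    for field in sample_fields:
        gt = field.split(":")[0]  # GT is assumed to be the first FORMAT key
        alleles = gt.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            count += 1
    return count


def private_fraction(vcf_lines: Iterable[str]) -> float:
    """Fraction of variant sites carried by exactly one individual.

    Sites with no carriers at all are ignored in both numerator and denominator.
    """
    total = 0
    private = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the header
        fields = line.rstrip("\n").split("\t")
        n_carriers = carrier_count(fields[9:])  # sample columns start at column 10
        if n_carriers == 0:
            continue
        total += 1
        if n_carriers == 1:
            private += 1
    return private / total if total else 0.0
```

A high value here would suggest that much of the missing fraction is made up of variants seen in a single exome sample.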

