I have a VCF dataset from whole-exome sequencing of a cohort of people. I am considering taking some samples from the 1000 Genomes data and adding them to my data so that I have a bigger cohort.
To make the data (variant loci) consistent, I subsetted the 1000 Genomes data to the variant positions in my exome VCF.
Since the 1000 Genomes data was generated by whole-genome sequencing, I assumed it would cover all variant loci in my exome VCF. But when I checked the resulting file, I found that many variant loci (roughly 40-50% of all variant loci in the exome VCF) are present in the exome VCF but absent from the 1000 Genomes VCF. (Both datasets are hg19 or b37.)
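
To make the check concrete, here is a minimal sketch of how the missing fraction could be computed. It is not the exact commands I ran: the helper names (`normalize_chrom`, `read_loci`, `missing_fraction`, `open_vcf`) are hypothetical, loci are compared on chromosome and position only (alleles are ignored), and a leading "chr" is stripped on the assumption that one file may use hg19-style names ("chr1") while the other uses b37-style names ("1"). It does not handle other naming differences such as chrM versus MT.

```python
import gzip


def normalize_chrom(chrom):
    """Strip a leading 'chr' so 'chr1' and '1' compare as the same contig."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def read_loci(lines):
    """Collect (chrom, pos) pairs from an iterable of VCF text lines."""
    loci = set()
    for line in lines:
        # Skip blank lines and header lines (both '##' meta and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        loci.add((normalize_chrom(fields[0]), int(fields[1])))
    return loci


def missing_fraction(exome_loci, reference_loci):
    """Fraction of exome loci that have no match in the reference set."""
    if not exome_loci:
        return 0.0
    return len(exome_loci - reference_loci) / len(exome_loci)


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


# Tiny made-up records, only to show the calculation (not real data).
exome_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",
    "chr1\t200\t.\tC\tT\t.\t.\t.\n",
    "chr2\t300\t.\tG\tA\t.\t.\t.\n",
    "chr2\t400\t.\tT\tC\t.\t.\t.\n",
]
kg_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\t.\t.\n",
    "2\t300\t.\tG\tA\t.\t.\t.\n",
    "3\t500\t.\tC\tG\t.\t.\t.\n",
]

exome_loci = read_loci(exome_lines)
kg_loci = read_loci(kg_lines)
print(f"{missing_fraction(exome_loci, kg_loci):.2f}")  # prints 0.50
```

With real files, the same functions could be fed from `open_vcf(path)` instead of the in-memory lists. Without the "chr" normalization, files with different naming conventions would appear to share no loci at all, so it seems worth ruling that out before looking for other causes.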
What are the possible reasons for this?
Is it because the 1000 Genomes whole-genome sequencing does not have enough coverage to call all possible variants? Are there any other reasons? Thanks!