We have a list of a few thousand CNVs/SVs identified using array-based CNV calling methods. They come from five individuals who were also sequenced by the 1000 Genomes Project. We would like to compare the breakpoints of the CNVs from our calls with those reported by 1000 Genomes.
So the question: how can I generate a VCF file (presumably using tabix and vcftools) for a selected set of individuals, restricted to variants overlapping a list of specified CNVs like the ones below? Moreover, can I do this for thousands of regions in a single step? Should the regions I query be smaller or larger than the CNVs we identified? And can we filter the results to keep only variants above a threshold size?
Note: it appears BrentP has an answer about how to get multiple regions at once from 1kG here, but if I am pulling all of these from the 1kG FTP server, must I do it chromosome by chromosome? And how would pulling regions handle the imperfect overlaps mentioned above?
Individuals:
NA10851
NA18505

Regions:
chr22:22680529-22726814
chr22:22613016-22670785
chr22:41234550-41276824
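
To make the overlap and size questions concrete, here is a minimal Python sketch of what I have in mind. It does not call tabix or vcftools; it only shows padding a query region, scoring an imperfect overlap, and applying a size cutoff. The names `Region`, `parse_region`, `pad`, `reciprocal_overlap` and `at_least` are hypothetical, the 1 kb flank and 45 kb cutoff are arbitrary illustrative values, and coordinates are assumed to be 1-based and inclusive.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string such as 'chr22:22680529-22726814'."""
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    return Region(match.group(1), int(match.group(2)), int(match.group(3)))


def pad(region: Region, flank: int) -> Region:
    """Widen a region by `flank` bases on each side, never going below 1."""
    return Region(region.chrom, max(1, region.start - flank), region.end + flank)


def reciprocal_overlap(a: Region, b: Region) -> float:
    """Shared bases as a fraction of each region's length; return the smaller."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / a.length, shared / b.length)


def at_least(regions: List[Region], min_size: int) -> List[Region]:
    """Keep only regions whose length is at least `min_size` bases."""
    return [r for r in regions if r.length >= min_size]


# The three example regions from this post.
cnvs = [parse_region(s) for s in (
    "chr22:22680529-22726814",
    "chr22:22613016-22670785",
    "chr22:41234550-41276824",
)]

padded = [pad(r, 1000) for r in cnvs]   # illustrative 1 kb flank
large = at_least(cnvs, 45000)           # illustrative 45 kb cutoff
```

Padding the query would catch calls whose breakpoints fall slightly outside our array-based estimates, and a reciprocal-overlap score is one possible way to decide afterwards which of the returned calls correspond to the same event. Whether that is the right criterion here is part of what I am asking.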