Hi, I've been searching for advice on best practices for merging multiple VCFs whose variable positions differ across sample cohorts, but I can't seem to find this information, so I thought I'd start a discussion.
In a hypothetical scenario where you have several VCFs (say 4) generated from separate exome and whole genome sequencing runs, with genotypes called separately within each of the 4 cohorts of study samples, how should one go about merging the 4 VCFs? Note that the set of variable sites called within each VCF is different and the samples within each VCF are unique.
My thinking is that if you're dealing with very different experimental designs, like exome vs. WGS, you should retain only the sites that overlap across the 4 VCFs. Otherwise, you'd have to set the positions not originally called in the exome VCFs to homozygous reference once they're merged with the WGS datasets, right? That could create a significant bias, given that some positions weren't sequenced in the first place (i.e. the absence of a call doesn't mean they are non-variant by default).
On the other hand, if it's a matter of merging all-exome or all-WGS data, then one could retain the union of all sites present across datasets. You'd still have to set some sites to homozygous reference when they don't appear as variable in the original genotype calls, but are there concerns about this approach? Or should we consider leaving those positions as missing? My worry is that with extremely heterogeneous study cohorts and variants that are private to specific populations, this could skew the % of genotype missingness at the individual level, and positions with high missingness might end up filtered out downstream anyway.
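To make the three options concrete (intersection of sites, union filled with missing, union filled with homozygous reference), here's a toy Python sketch. Everything in it is made up for illustration: the cohort names, sites, genotypes, and the helper names `merge_sites` and `missingness` are hypothetical, and a real merge would go through a proper VCF tool rather than dictionaries. As far as I know, bcftools merge leaves genotypes missing at sites absent from an input file unless you pass --missing-to-ref, which is the same choice the `fill` argument stands in for here.

```python
# Toy illustration only: the cohorts, sites, and genotypes below are made up.
# Each cohort maps site -> {sample: genotype}; a site is (chrom, pos, ref, alt).
cohorts = {
    "exome_A": {
        ("1", 100, "A", "G"): {"s1": "0/1"},
        ("1", 200, "C", "T"): {"s1": "1/1"},
    },
    "wgs_B": {
        ("1", 100, "A", "G"): {"s2": "0/0"},
        ("1", 300, "G", "A"): {"s2": "0/1"},
    },
}


def merge_sites(cohorts, how="union", fill="./."):
    """Merge per-cohort genotype tables (hypothetical helper).

    how  -- "intersection" keeps only sites called in every cohort;
            "union" keeps every site called in at least one cohort.
    fill -- genotype assigned to samples whose cohort lacks the site:
            "./." (missing) or "0/0" (assume homozygous reference).
    """
    site_sets = [set(calls) for calls in cohorts.values()]
    if how == "intersection":
        sites = set.intersection(*site_sets)
    elif how == "union":
        sites = set.union(*site_sets)
    else:
        raise ValueError(f"unknown merge mode: {how!r}")

    merged = {}
    for site in sorted(sites):
        row = {}
        for calls in cohorts.values():
            # All samples seen anywhere in this cohort.
            samples = {s for gts in calls.values() for s in gts}
            for sample in samples:
                row[sample] = calls.get(site, {}).get(sample, fill)
        merged[site] = row
    return merged


def missingness(merged):
    """Per-sample fraction of merged sites with a missing genotype."""
    counts = {}
    for row in merged.values():
        for sample, gt in row.items():
            counts[sample] = counts.get(sample, 0) + (gt == "./.")
    return {s: c / len(merged) for s, c in counts.items()}


shared = merge_sites(cohorts, how="intersection")
union_missing = merge_sites(cohorts, how="union", fill="./.")
union_ref = merge_sites(cohorts, how="union", fill="0/0")

print(len(shared), "shared site(s)")
print("missingness with ./. fill:", missingness(union_missing))
print("missingness with 0/0 fill:", missingness(union_ref))
```

In this toy case the union with missing fill gives each sample a missing genotype at the site private to the other cohort, while the 0/0 fill hides that entirely, which is exactly the bias vs. missingness trade-off I'm asking about.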
I think this is a fairly common problem for folks who use publicly available / published data. Curious what you all think.