Question

Forum: Best practices for merging VCFs with different positions + different samples?

22 months ago

pabe ▴ 30

Hi, I've been searching for advice on best practices for merging multiple VCFs that have different variable positions and come from different sample cohorts, but I can't seem to find this information, so I thought I'd start a discussion.

Take a hypothetical scenario where you have several VCFs (say 4) generated from separate exome and whole-genome sequencing runs, with genotypes called separately within each of the 4 cohorts of study samples. How should one go about merging the 4 VCFs? Note that the set of variable sites called within each VCF is different and the samples within each VCF are unique.

My thinking is that if you're dealing with very different experimental designs, like exome vs. WGS, you should retain only the sites that overlap across all 4 VCFs. Otherwise, you'd have to set the positions not originally called in the exome VCFs to homozygous reference once they're merged with the WGS datasets, right? That could create a significant bias, given that some positions weren't sequenced in the first place (i.e. the absence of a call doesn't mean they are non-variant by default).

On the other hand, if it is a matter of merging all-exome or all-WGS data, then one could retain the union of all sites present across datasets. You'd still have to set some sites to homozygous reference when they weren't called as variable in the original genotype calls, but do we have concerns about this approach? Or should we consider leaving those positions as missing? My worry is that with extremely heterogeneous study cohorts and variants that are private to specific populations, this could skew the genotype missingness rate at the individual level, and positions with high missingness might end up filtered out downstream anyway.
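To make the trade-off concrete, here is a toy Python sketch of the three options (intersection of sites, union with missing genotypes, union with homozygous-reference fill). It does not parse real VCFs: the function names `merge_cohorts` and `sample_missingness`, the dict layout, and the two tiny cohorts are all made up for illustration. As far as I understand, the real-world equivalents would be something like `bcftools isec` for the intersection and `bcftools merge` (with or without its missing-to-ref option) for the union, but treat that mapping as an assumption to check against the docs.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]                 # (chromosome, position)
Cohort = Dict[Site, Dict[str, str]]    # site -> {sample name: genotype string}

MISSING = "./."
HOM_REF = "0/0"


def merge_cohorts(cohorts: List[Cohort], mode: str = "union",
                  fill: str = MISSING) -> Cohort:
    """Merge cohorts with disjoint samples (hypothetical helper).

    mode="intersection" keeps only sites called in every cohort;
    mode="union" keeps every site and assigns `fill` to samples whose
    cohort had no call at that site.
    """
    if not cohorts:
        return {}
    site_sets = [set(c) for c in cohorts]
    if mode == "intersection":
        sites = set.intersection(*site_sets)
    elif mode == "union":
        sites = set.union(*site_sets)
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Sample names per cohort, collected from every site in that cohort.
    names_per_cohort = [sorted({s for gts in c.values() for s in gts})
                        for c in cohorts]

    merged: Cohort = {}
    for site in sorted(sites):
        row: Dict[str, str] = {}
        for cohort, names in zip(cohorts, names_per_cohort):
            called = cohort.get(site, {})
            for name in names:
                row[name] = called.get(name, fill)
        merged[site] = row
    return merged


def sample_missingness(merged: Cohort) -> Dict[str, float]:
    """Fraction of merged sites at which each sample has a missing genotype."""
    n_sites = len(merged)
    if n_sites == 0:
        return {}
    counts: Dict[str, int] = {}
    for row in merged.values():
        for name, gt in row.items():
            counts[name] = counts.get(name, 0) + (gt == MISSING)
    return {name: c / n_sites for name, c in counts.items()}


# Toy data: an "exome" cohort with 2 called sites, a "WGS" cohort with 4.
exome: Cohort = {("chr1", 100): {"E1": "0/1"},
                 ("chr1", 200): {"E1": "0/0"}}
wgs: Cohort = {("chr1", 100): {"W1": "0/0"},
               ("chr1", 200): {"W1": "1/1"},
               ("chr1", 300): {"W1": "0/1"},
               ("chr1", 400): {"W1": "0/1"}}

shared = merge_cohorts([exome, wgs], mode="intersection")
union_missing = merge_cohorts([exome, wgs], mode="union", fill=MISSING)
union_ref = merge_cohorts([exome, wgs], mode="union", fill=HOM_REF)

print(len(shared), len(union_missing))
print(sample_missingness(union_missing))
print(union_ref[("chr1", 300)]["E1"])
```

In this toy case the union with missing fill leaves the exome sample missing at half of the merged sites, while the homozygous-reference fill hides that gap by asserting `0/0` at positions the exome never covered, which is exactly the bias I'm worried about.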

I think this is a fairly common problem for folks who use publicly-available / published data. Curious what you all think.

vcf bcftools genotypes • 755 views

updated 16 months ago by Ram 43k • written 22 months ago by pabe ▴ 30