I have a VCF file, let's call it file A, which was created by performing variant calling with GATK on whole genome sequencing data. For SNPs that do not appear in the VCF, can I assume that the SNP is monomorphic for the reference allele (i.e. all genotypes encoded as 0)? Note that sites that are monomorphic for the alternate allele do appear in the file. I ask because I need to merge this file (A) with another VCF (file B), and I'm not sure how to handle the variants that appear in B but not in A. Can I just fill in the genotypes with all 0 for the variants that do not appear in A, or do I have to impute first to account for the possibility of low coverage in the region? It would be great if answers could provide an external source as well, so I have a point of reference when I consult my advisor, because he thinks that sequencing data should never require imputation. Thanks!
I wrote a tool to fix this problem: http://lindenb.github.io/jvarkit/FixVcfMissingGenotypes.html
(but it is slow).
After a VCF merge, it reads the VCF and looks back at the BAMs to tell whether each missing genotype was homozygous-reference or not called. If the number of reads covering the site is greater than min.depth, the missing genotype is set to hom-ref.
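
To make the rule concrete, here is a minimal Python sketch of the decision described above. It is not the tool's actual code: the function name `fill_missing_genotype`, the default threshold of 10, and the idea of passing in a precomputed read depth are all assumptions for illustration. In practice the depth would have to be counted from the sample's BAM at that position.

```python
from typing import Optional

HOM_REF = "0/0"
NO_CALL = "./."

# Genotype strings treated as "missing" after a merge (assumed set).
MISSING = {".", "./.", ".|."}


def fill_missing_genotype(genotype: str,
                          depth: Optional[int],
                          min_depth: int = 10) -> str:
    """Decide what to write for one sample at one site.

    genotype  -- the GT string found in the merged VCF
    depth     -- reads covering the site in this sample's BAM,
                 or None if no BAM is available (hypothetical input)
    min_depth -- hypothetical threshold; the default of 10 is arbitrary
    """
    # Genotypes that were actually called are left untouched.
    if genotype not in MISSING:
        return genotype
    # Enough reads and no variant was called: assume homozygous reference.
    if depth is not None and depth > min_depth:
        return HOM_REF
    # Too little coverage to say anything: keep it as not called.
    return NO_CALL


if __name__ == "__main__":
    print(fill_missing_genotype("./.", 30))    # well-covered site
    print(fill_missing_genotype("./.", 3))     # poorly covered site
    print(fill_missing_genotype("0/1", 3))     # already called
```

The point of the depth check is that a variant absent from file A only supports a hom-ref genotype when the site was actually covered; with few or no reads it stays missing.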