I have ~200 VCF files, each containing whole-exome sequencing data for one sample. I plan to merge them into a single VCF file and then convert it to PLINK format to run an association analysis on each single variant.
Before merging the VCF files, I removed the low-quality variants (those flagged as “low snp quality, low variant reads, low variant ratio, single strand direction or low coverage”) and kept only the “PASS” variants. In the merged VCF file, I then observed two kinds of missing values (./.):
(1) Missing values caused by low quality: Sample A has a PASS variant at CHR1:14590, but Sample B has a low-coverage variant at CHR1:14590, which was filtered out. Sample B therefore has a missing value at CHR1:14590 in the merged VCF file. This should be a real missing value (./.) for Sample B.
(2) Missing values caused by no variant being called: Sample C doesn’t have any record for CHR1:14590 in its original VCF file, i.e. no variant was called for that sample. This kind of missing value should be the reference genotype (0/0) for Sample C.
My question is: when merging these VCF files, how can I keep the first kind of missing value as missing (./.) and change the second kind to (0/0)? Is there any software or package available for this?
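For illustration, the rule described in (1) and (2) could be sketched in plain Python as below. This is only a sketch, not an existing tool. It assumes the unfiltered single-sample VCFs (with the FILTER column still populated) are available, because a PASS-only file can no longer tell a filtered site from an uncalled one. The function names, the filter label and the G>A alleles in the usage example are made up, multi-allelic sites are not handled, and treating an absent record as 0/0 assumes the site was actually covered in that sample, which a VCF alone cannot confirm.

```python
from typing import Dict, Iterable, Optional, Tuple

Site = Tuple[str, int, str, str]   # (CHROM, POS, REF, ALT)
Call = Tuple[str, str]             # (FILTER, GT)

MISSING = "./."
HOM_REF = "0/0"


def parse_vcf_lines(lines: Iterable[str]) -> Dict[Site, Call]:
    """Read the records of an UNFILTERED single-sample VCF (10 columns)."""
    calls: Dict[Site, Call] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        # FORMAT is column 9 and the single sample is column 10.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        calls[(chrom, int(pos), ref, alt)] = (flt, sample.get("GT", MISSING))
    return calls


def merged_genotype(call: Optional[Call]) -> str:
    """Apply the two rules from the question to one sample at one site."""
    if call is None:
        return HOM_REF   # case (2): no record at all, assumed reference
    flt, gt = call
    if flt == "PASS":
        return gt        # a confident call is kept as-is
    return MISSING       # case (1): a record exists but failed a filter


def merge_calls(samples: Dict[str, Dict[Site, Call]]) -> Dict[Site, Dict[str, str]]:
    """Build a site -> {sample: genotype} table over sites PASS in >= 1 sample."""
    pass_sites = sorted(
        {
            site
            for calls in samples.values()
            for site, (flt, _gt) in calls.items()
            if flt == "PASS"
        }
    )
    return {
        site: {name: merged_genotype(calls.get(site)) for name, calls in samples.items()}
        for site in pass_sites
    }


# Made-up records mirroring Samples A, B and C from the question.
sample_a = parse_vcf_lines(["1\t14590\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\n"])
sample_b = parse_vcf_lines(["1\t14590\t.\tG\tA\t12\tLowCoverage\t.\tGT:DP\t0/1:3\n"])
sample_c = parse_vcf_lines([])  # Sample C has no record at this site

merged = merge_calls({"A": sample_a, "B": sample_b, "C": sample_c})
print(merged[("1", 14590, "G", "A")])
# prints {'A': '0/1', 'B': './.', 'C': '0/0'}
```

The key point of the sketch is that the decision has to be made from the pre-filter records: once the non-PASS lines are deleted, Sample B and Sample C look identical at CHR1:14590.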
I would appreciate any help with this question. Thanks a lot! Have a good one!
Do you have access to the BAM files? The GVCF workflow using HaplotypeCaller (see the GATK Best Practices) would probably be your best bet to avoid mistakes.
I only have BAM files for some of the samples, not all of them. This is a collaborative project across several study sites, and right now I am not sure whether the collaborators are willing to send us the BAM files for their samples. I would prefer to solve this problem using the VCF files. If there is no other way, I may try to get the BAM files. Thank you so much, WouterDeCoster!