Question

"Update" Genotype fields from one VCF file to another

5.2 years ago

Ram 43k

This might be an XY question, so I'll explain my premise:

I have 3 VCF files, f1, f2 and f3.
f1 is an annotated VCF covering 50 samples
f2 is an annotated VCF covering 5 samples, but only sites that are not in f1
f3 is an un-annotated VCF covering 5 samples across sites in f1 as well as not in f1
All annotations are site-level

I now wish to get this as one VCF file with all sites annotated and all sample-level information present.

When I merge f1 and f2, I get a VCF with all annotated sites and all samples, but for those sites overlapping with f3, the GT/AD/... fields are empty, because that information is in f3. How do I merge these three datasets?

Question:

In essence, can I do an operation to update genotype fields in one VCF file based on a sample+site match in another VCF file? If they were 2 data.frames, the operation would be something like vcf1[site, sample] <- vcf2[site, sample].
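To make the intended semantics concrete, here is a minimal Python sketch of that update, assuming each VCF is reduced to a nested dict keyed by site and then by sample. The function name `update_genotypes`, the sample names and the genotype strings are all made up for illustration; this is not how bcftools represents anything internally.

```python
def update_genotypes(target, source):
    """Overwrite target[site][sample] with source[site][sample].

    Only (site, sample) pairs that already exist in target are touched,
    so no sites and no samples are added.
    """
    for site, samples in source.items():
        if site not in target:
            continue  # site only in source: skip, do not add it
        for sample, genotype in samples.items():
            if sample in target[site]:
                target[site][sample] = genotype
    return target


# Sites are keyed by (CHROM, POS, REF, ALT); all values are hypothetical.
merged_f1_f2 = {
    ("chr1", 100, "A", "G"): {"S01": "0/1", "N1": "./."},
    ("chr1", 200, "C", "T"): {"S01": "0/0", "N1": "./."},
}
f3 = {
    ("chr1", 100, "A", "G"): {"N1": "0/1"},
    ("chr1", 300, "G", "A"): {"N1": "1/1"},
}

update_genotypes(merged_f1_f2, f3)
```

After the call, `N1` at `chr1:100` carries the genotype from `f3`, `S01` is unchanged, and `chr1:300` is not added because it was never in the target.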

Current solution:

The way I see it, I might have to subset f3 to f1 sites only, then run bcftools merge <f1> <f3_subset> > <f1+f3_subset>, which adds only samples and no sites. Then I would run bcftools concat <f1+f3_subset> <f2> > <final_vcf>, which adds only sites and no samples. However, this does not work, because bcftools concat cannot combine VCFs that contain different samples. Any other solution would be appreciated.

vcf • 1.4k views

updated 4.3 years ago by Biostar 20 • written 5.2 years ago by Ram 43k

score 2 · Accepted Answer · 2019-02-01

Here's my current solution:

Subset all f1-sites present in f3: bcftools isec -n=2 -w1 -c none -o f3_subset f3 f1
Pull annotations from f1 into f3_subset: bcftools annotate -c INFO -a f1 -o f3_subset_anno f3_subset
Concat the new annotated file with f2 to get all site annotations for the 5 samples: bcftools concat -o f3_plus_f2 f2 f3_subset_anno
Merge f1 and this 5-sample file to get final VCF: bcftools merge -m none -o final_vcf f1 f3_plus_f2
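The four steps above can be sketched as a small Python wrapper that builds the bcftools calls as argument lists and, by default, only prints them. Several things here are my assumptions rather than part of the steps as written: the helper names `build_commands` and `run`, the `.vcf.gz` file names, the `-Oz` output flag and the `bcftools index` calls (added because isec, merge, annotate -a and concat -a expect bgzip-compressed, indexed files), and `-a` on concat (added so that records from f2 and the annotated f3 subset can interleave by position). It also assumes f2 and f3 list the same 5 samples in the same order, which concat requires.

```python
import shlex
import subprocess


def build_commands(f1, f2, f3, final_vcf="final.vcf.gz"):
    """Return the bcftools calls for the four steps as argument lists."""
    # Intermediate file names are hypothetical.
    f3_subset = "f3_subset.vcf.gz"
    f3_subset_anno = "f3_subset_anno.vcf.gz"
    f3_plus_f2 = "f3_plus_f2.vcf.gz"
    return [
        # Step 1: keep f3 records at sites shared with f1 (exact allele match).
        ["bcftools", "isec", "-n=2", "-w1", "-c", "none",
         "-Oz", "-o", f3_subset, f3, f1],
        # Step 2: copy the site-level INFO annotations from f1.
        ["bcftools", "annotate", "-c", "INFO", "-a", f1,
         "-Oz", "-o", f3_subset_anno, f3_subset],
        ["bcftools", "index", f3_subset_anno],
        # Step 3: combine with f2 (same samples, different sites).
        # -a is an assumption added here, see the note above.
        ["bcftools", "concat", "-a", "-Oz", "-o", f3_plus_f2,
         f2, f3_subset_anno],
        ["bcftools", "index", f3_plus_f2],
        # Step 4: add the 5-sample file to the 50-sample file.
        ["bcftools", "merge", "-m", "none", "-Oz", "-o", final_vcf,
         f1, f3_plus_f2],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = build_commands("f1.vcf.gz", "f2.vcf.gz", "f3.vcf.gz")
run(commands)  # dry run: prints the commands without calling bcftools
```

Calling `run(commands, dry_run=False)` would execute the calls in order and stop at the first failure. That needs bcftools on the PATH and f1, f2 and f3 already bgzipped and indexed.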

Just realized while I was writing this, I could just do bcftools merge -Ou -m none f1 f3 | bcftools annotate -c FORMAT -a f2 -o final_vcf -, so that way I would pick up the FORMAT fields exactly as I intended in the first place.

If anyone has a better solution, please add it in! Thank you!