Question: "Update" Genotype fields from one VCF file to another
4 months ago, RamRS (Houston, TX) wrote:

This might be an XY question, so I'll explain my premise:

  1. I have 3 VCF files, f1, f2 and f3.
  2. f1 is an annotated VCF covering 50 samples
  3. f2 is an annotated VCF covering 5 samples, but only sites that are not in f1
  4. f3 is an unannotated VCF covering 5 samples, at sites both present in and absent from f1
  5. All annotations are site-level

I now wish to combine these into one VCF file with all sites annotated and all sample-level information present.

When I merge f1 and f2, I get a VCF with all annotated sites and all samples, but for those sites overlapping with f3, the GT/AD/... fields are empty, because that information is in f3. How do I merge these three datasets?
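To get a feel for how many sites are affected, one option is to count the sites where the 5 samples still have a missing genotype after the merge. This is only a sketch: it assumes bcftools is on the PATH, and the function name, file name and sample names are placeholders, not anything from my actual data.

```bash
# Count sites where at least one of the given samples has a missing genotype.
# Usage: count_missing_gt_sites merged.vcf.gz sampleA,sampleB
# (hypothetical helper; file and sample names are placeholders)
count_missing_gt_sites() {
    local vcf="$1" samples="$2"
    # Restrict to the samples of interest first, so the GT="mis" filter
    # only looks at those columns, then print one line per matching site.
    bcftools view -s "$samples" -Ou "$vcf" \
        | bcftools query -i 'GT="mis"' -f '%CHROM\t%POS\n' \
        | wc -l
}
```

Running it on the f1+f2 merge with the 5 sample names should give a number close to the count of f1 sites that are also in f3, if my reading of the problem is right.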

Question:

In essence, can I do an operation to update genotype fields in one VCF file based on a sample+site match in another VCF file? If they were 2 data.frames, the operation would be something like vcf1[site, sample] <- vcf2[site, sample].
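The closest bcftools equivalent I can think of is bcftools annotate with a VCF as the annotation source and FORMAT tags in -c. Below is a minimal sketch of that idea. It assumes the source file is bgzip-compressed, that the listed FORMAT tags are defined in its header, and that samples are matched by name between the two files (which is my understanding of how annotate behaves, but worth checking on a few records). The function name and default column list are made up for illustration.

```bash
# Copy selected FORMAT fields from src into dst wherever site and sample match.
# Usage: update_format_fields dst.vcf.gz src.vcf.gz out.vcf.gz [FORMAT/GT,FORMAT/AD]
# (hypothetical helper; all file names are placeholders)
update_format_fields() {
    local dst="$1" src="$2" out="$3" cols="${4:-FORMAT/GT}"
    # The annotation source must be indexed; -f rebuilds an existing index.
    bcftools index -f "$src" &&
    bcftools annotate -a "$src" -c "$cols" -Oz -o "$out" "$dst"
}
```

For example, update_format_fields f1_f2_merged.vcf.gz f3.vcf.gz final.vcf.gz FORMAT/GT,FORMAT/AD would be the rough analogue of vcf1[site, sample] <- vcf2[site, sample] for the GT and AD fields.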

Current solution:

The way I see it, I might have to subset f3 to f1 sites only, then run bcftools merge <f1> <f3_subset> > <f1_f3_subset>, which adds only samples and no sites. Then I would run bcftools concat <f1_f3_subset> <f2> > <final_vcf>, which adds only sites and no samples. However, that solution does not work, because bcftools concat cannot work on VCFs with different samples in them. Any other solution will be appreciated.
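The sample mismatch can be checked before calling concat by comparing the sample lists of the two inputs. A small sketch, assuming bcftools is installed; same_samples is a hypothetical helper name, and as far as I know concat wants the same sample names in the same order.

```bash
# Succeed only if both VCFs list the same samples in the same order.
# Usage: same_samples a.vcf.gz b.vcf.gz && echo "safe to concat"
# (hypothetical helper; file names are placeholders)
same_samples() {
    local a b
    # bcftools query -l prints one sample name per line.
    a=$(bcftools query -l "$1") || return 2
    b=$(bcftools query -l "$2") || return 2
    [ "$a" = "$b" ]
}
```

In my case the 55-sample merge of f1 and the f3 subset would fail this check against the 5-sample f2, which is why the concat step breaks.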

RamRS wrote:

Here's my current solution:

  1. Subset all f1-sites present in f3: bcftools isec -n=2 -w1 -c none -o f3_subset f3 f1
  2. Pull annotations from f1 into f3_subset: bcftools annotate -c INFO -a f1 -o f3_subset_anno f3_subset
  3. Concat the new annotated file with f2 to get all site annotations for the 5 samples: bcftools concat -o f3_plus_f2 f2 f3_subset_anno
  4. Merge f1 and this 5-sample file to get final VCF: bcftools merge -m none -o final_vcf f1 f3_plus_f2
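The four steps above can be put together roughly as follows. This is a sketch under a few assumptions that are not in the commands as I wrote them: all inputs are bgzipped and indexed, intermediates are written compressed and indexed between steps, and concat is given -a because the f2 sites and the f3 subset sites are probably interleaved by position. The function name and temporary file names are placeholders.

```bash
# Sketch of the four steps, with compressed output and indexing added.
# Usage: merge_three f1.vcf.gz f2.vcf.gz f3.vcf.gz final.vcf.gz
# (hypothetical helper; assumes f1, f2 and f3 are bgzipped and indexed)
merge_three() {
    local f1="$1" f2="$2" f3="$3" final="$4"
    local tmp
    tmp=$(mktemp -d) || return 1

    # 1. Keep f3 records at sites that are also in f1 (identical REF/ALT).
    bcftools isec -n=2 -w1 -c none -Oz -o "$tmp/f3_subset.vcf.gz" "$f3" "$f1" &&
    bcftools index "$tmp/f3_subset.vcf.gz" &&
    # 2. Copy the site-level INFO annotations from f1.
    bcftools annotate -a "$f1" -c INFO -Oz -o "$tmp/f3_subset_anno.vcf.gz" "$tmp/f3_subset.vcf.gz" &&
    bcftools index "$tmp/f3_subset_anno.vcf.gz" &&
    # 3. Combine with f2 (same samples, different sites); -a allows
    #    interleaved positions and needs indexed inputs.
    bcftools concat -a -Oz -o "$tmp/f3_plus_f2.vcf.gz" "$f2" "$tmp/f3_subset_anno.vcf.gz" &&
    bcftools index "$tmp/f3_plus_f2.vcf.gz" &&
    # 4. Add the 5-sample columns to f1 without joining records into multiallelics.
    bcftools merge -m none -Oz -o "$final" "$f1" "$tmp/f3_plus_f2.vcf.gz"
    local status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Chaining with && means a failure in any step skips the rest, and the temporary directory is removed either way.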

I just realized while writing this that I could simply run bcftools merge -Ou -m none f1 f3 | bcftools annotate -c FORMAT -a f2 -o final_vcf -, which would pick up the FORMAT fields exactly as I intended in the first place.

If anyone has a better solution, please add it in! Thank you!
