What is the correct way of setting the genotype after splitting multi-allelic sites in a VCF file?
Typically, multi-allelic calls are split into separate records and then any indels are left-aligned. You may also wish to reset the ID field, and/or check that each base in your REF column is consistent with the reference genome.
I elaborate on this in Solution 1 of this thread: "Remove duplicate SNPs only based on SNP ID in bcftools"
bcftools norm -m-any myfile.vcf.gz | \
bcftools norm --check-ref w -f human_g1k_v37.fasta -Ob > out.bcf ;
bcftools index out.bcf ;
-m-any splits any multi-allelic calls into separate records
bcftools norm in conjunction with -f human_g1k_v37.fasta will left-align and normalise indels
--check-ref w should result in each base in your VCF's REF column being checked against the supplied FASTA file, with a warning issued if any inconsistency is identified
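If you want to see what the split does to the genotype field on your own bcftools version, a toy file is a quick way to check. The sketch below is only an illustration: the file name toy.vcf, the contig length, and the sample name sample1 are made up, and the block skips the bcftools step if bcftools is not on your PATH.

```shell
# Build a tiny VCF with one multi-allelic record (all values are made up).
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '##contig=<ID=1,length=1000>\n' >> toy.vcf
printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n' >> toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n' >> toy.vcf
printf '1\t100\t.\tA\tC,T\t.\t.\t.\tGT\t1/2\n' >> toy.vcf

if command -v bcftools >/dev/null 2>&1; then
    # Split the multi-allelic record, then print CHROM, POS, REF, ALT and
    # the sample column so the rewritten genotypes can be inspected.
    bcftools norm -m-any toy.vcf | grep -v '^#' | cut -f 1,2,4,5,10
else
    echo "bcftools not found; skipping the split step" >&2
fi
```

The single input record should come out as one record per ALT allele; the last column shows how the 1/2 genotype was recoded in each of them.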
Regarding the FASTA, please use the same FASTA as that used for the original alignment.
To reset the ID field to, e.g., CHROM:POS:REF:ALT, please do:
bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%REF:%ALT'
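To make the target ID format concrete, here is a small sketch that builds the same CHROM:POS:REF:ALT string with awk on an uncompressed toy VCF. This is not how bcftools annotate works internally, just an illustration of the resulting IDs; the file names toy_ids.vcf and toy_ids.renamed.vcf and the rs1 identifier are made up.

```shell
# Two records that share an ID, as can happen after splitting a
# multi-allelic site (toy data, made up for illustration).
printf '##fileformat=VCFv4.2\n' > toy_ids.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_ids.vcf
printf '1\t100\trs1\tA\tC\t.\t.\t.\n' >> toy_ids.vcf
printf '1\t100\trs1\tA\tT\t.\t.\t.\n' >> toy_ids.vcf

# Header lines pass through unchanged; on data lines, column 3 (ID) is
# rebuilt from columns 1 (CHROM), 2 (POS), 4 (REF) and 5 (ALT).
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }
     { $3 = $1 ":" $2 ":" $4 ":" $5; print }' toy_ids.vcf > toy_ids.renamed.vcf

# Show the new IDs.
grep -v '^#' toy_ids.renamed.vcf | cut -f 3
# prints:
# 1:100:A:C
# 1:100:A:T
```

After the reset, the two split records no longer share an ID, which is what makes ID-based de-duplication safe on split files.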