2.8 years ago
mahsa77asadi
What is the correct way of setting the genotype after splitting multi-allelic sites in a VCF file?
Typically, multi-allelic calls are split into separate records and then any indels are left-aligned. You may also wish to reset the ID field, and/or check that each base in your REF column is consistent with the reference genome.
I elaborate on this in Solution 1, here: Remove duplicate SNPs only based on SNP ID in bcftools
That is:
bcftools norm -m-any myfile.vcf.gz | \
bcftools norm --check-ref w -f human_g1k_v37.fasta -Ob > out.bcf ;
bcftools index out.bcf ;
-m-any splits any multi-allelic calls
bcftools norm in conjunction with -f human_g1k_v37.fasta will left-align indels
--check-ref w should result in each base in your VCF's REF column being checked against the supplied FASTA file, with a warning issued if any inconsistency is identified
Regarding the FASTA, please use the same FASTA as that used for the original alignment.
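As a quick sanity check on the split, one option is to count how many records still list more than one ALT allele. The sketch below is only an illustration: the function name count_multiallelic is made up here, and it assumes uncompressed VCF text on stdin with ALT in the fifth tab-separated column.

```shell
# Hypothetical helper (not part of bcftools): count data records whose ALT
# column (5th field) still contains a comma, i.e. more than one ALT allele.
# Reads uncompressed VCF text on stdin; header lines starting with '#' are skipped.
count_multiallelic() {
    awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }'
}
```

Something like "bcftools view out.bcf | count_multiallelic" should then report 0 if every multi-allelic call was split.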
To reset the ID field to, e.g., CHROM:POS:REF:ALT, please do:
bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%REF:%ALT'
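To see what such IDs would look like before rewriting anything, a rough preview can be built from the VCF text itself. This is a sketch rather than a replacement for bcftools annotate: the function name preview_ids is hypothetical, and it assumes uncompressed VCF text on stdin with the standard column order (CHROM, POS, ID, REF, ALT).

```shell
# Hypothetical helper (not part of bcftools): print CHROM:POS:REF:ALT for each
# data record of an uncompressed VCF read from stdin. Header lines starting
# with '#' are skipped. Illustration only: the VCF itself is not modified.
preview_ids() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }'
}
```

For example, "bcftools view out.bcf | preview_ids | sort | uniq -d" would list any IDs that would end up duplicated.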
Kevin
Please search the forum before posting a new question. This question is an exact duplicate (at least the title matches, you put in near zero effort in your post) of What is the correct way of setting the genotype after splitting multi-allelic sites in a VCF file?