I'm working on a project where I'm working with variants ranging from low frequency (AF~=0.03) up to high frequency (AF=1) after variant calling with lofreq. There are some cases where positions in the vcf are duplicated and this is leading to issues with my downstream processing of the data. Consider the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
NC_045512.2 26299 . C CT 800 PASS DP=3085;AF=0.066775;SB=192;DP4=1321,1539,162,44;INDEL;HRUN=5;TYPE=INDEL GT 1
NC_045512.2 26299 . C CTT 77 PASS DP=3085;AF=0.007131;SB=73;DP4=1321,1539,22,0;INDEL;HRUN=5;TYPE=INDEL GT 1
NC_045512.2 26299 . C T 49314 PASS DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP GT 1
In these cases, I would like to keep the variant at the position that has the highest allele frequency. So, after filtering, I'd hope to only have the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
NC_045512.2 26299 . C T 49314 PASS DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP GT 1
Is there a simple way to filter out variants like these with duplicated positions using (e.g.) bcftools? I've looked through the options for bcftools norm and bcftools filter, but the options to remove duplicates don't seem to have the flexibility I need for these cases (unless I'm mistaken). Otherwise I can write function using (e.g.) cyvcf2 in python, but I'd prefer to stick with bcftools. Thanks!
Thanks Pierre - this is great. Accepted.