I'm working on a project where I'm working with variants ranging from low frequency (AF~=0.03) up to high frequency (AF=1) after variant calling with lofreq
. There are some cases where positions in the vcf are duplicated and this is leading to issues with my downstream processing of the data. Consider the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
NC_045512.2 26299 . C CT 800 PASS DP=3085;AF=0.066775;SB=192;DP4=1321,1539,162,44;INDEL;HRUN=5;TYPE=INDEL GT 1
NC_045512.2 26299 . C CTT 77 PASS DP=3085;AF=0.007131;SB=73;DP4=1321,1539,22,0;INDEL;HRUN=5;TYPE=INDEL GT 1
NC_045512.2 26299 . C T 49314 PASS DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP GT 1
In these cases, I would like to keep the variant at the position that has the highest allele frequency. So, after filtering, I'd hope to only have the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
NC_045512.2 26299 . C T 49314 PASS DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP GT 1
Is there a simple way to filter out variants like these with duplicated positions using (e.g.) bcftools
? I've looked through the options for bcftools norm
and bcftools filter
, but the options to remove duplicates don't seem to have the flexibility I need for these cases (unless I'm mistaken). Otherwise I can write function using (e.g.) cyvcf2
in python, but I'd prefer to stick with bcftools
. Thanks!
Thanks Pierre - this is great. Accepted.