I have VCF files that I want to convert into .bed files with plink to use for a proxy search. One issue I am having is that each variant ID must be unique. In these VCFs, multi-allelic variants are split into bi-allelic records that all share the same ID. Here is an example:
tabix gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6:29440751-29440751 | cut -f 1-5
chr6 29440751 rs2074464 A C
chr6 29440751 rs2074464 A G
chr6 29440751 rs2074464 A T
I know that with bcftools you can simply keep the first occurrence of a variant with -d, but this is problematic for LD calculations. I would like to ensure that the record that gets preserved has the highest allele frequency of all the records with that ID, not simply the first occurrence of the variant. This way I will have a better chance of getting high r2 values when I calculate LD between this multi-allelic variant and another variant. Is this possible?
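To make the selection rule concrete, here is a rough Python sketch of what I am after. It assumes each record carries a single `AF=` entry in the INFO column (an assumption about these files, which may not hold for every VCF), and the function names `info_af` and `keep_highest_af` are hypothetical, made up for illustration. Records with a missing ID (`.`) are passed through untouched, and ties keep the first occurrence.

```python
import re

# Matches an INFO key that is exactly "AF" (not "AF_afr", "MAF", etc.).
AF_PATTERN = re.compile(r"(?:^|;)AF=([^;,]+)")


def info_af(info):
    """Return the AF value from an INFO string, or 0.0 if absent or unparsable."""
    match = AF_PATTERN.search(info)
    if match is None:
        return 0.0
    try:
        return float(match.group(1))
    except ValueError:
        return 0.0


def keep_highest_af(lines):
    """Return VCF lines, keeping one record per variant ID: the one with the highest AF.

    Header lines and records whose ID is "." are kept as-is.
    The original record order is preserved.
    """
    header = []
    records = []
    best = {}  # variant ID -> (highest AF seen, index into records)

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        variant_id = fields[2]
        af = info_af(fields[7])
        index = len(records)
        records.append(line)
        if variant_id == ".":
            continue  # missing IDs are not deduplicated
        # Strict ">" means the first record wins when AFs are tied.
        if variant_id not in best or af > best[variant_id][0]:
            best[variant_id] = (af, index)

    keep = {index for _, index in best.values()}
    body = [
        record
        for i, record in enumerate(records)
        if i in keep or record.split("\t")[2] == "."
    ]
    return header + body
```

This holds all records in memory, so it would probably need to be run per chromosome or per region on files of this size. The filtered lines could then be written back out and compressed before the plink conversion.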