vcf variant selection
4 months ago

I have VCF files that I want to convert into .bed files with plink to use for proxy search. One issue I am having is that each variant ID must be unique. In these VCFs, the multi-allelic variants are formatted as bi-allelic records. Here is an example:

tabix gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6:29440751-29440751 | cut -f 1-5
chr6    29440751    rs2074464   A   C
chr6    29440751    rs2074464   A   G
chr6    29440751    rs2074464   A   T

I know that with bcftools you can simply keep the first occurrence of a variant with -d, but this is problematic for LD calculations. I would like to ensure that the record that gets preserved has the highest allele frequency of all the records with that ID, not simply the first occurrence of the variant. This way I will have a better chance of getting high r2 values when I calculate LD between this multi-allelic variant and another variant. Is this possible?

bcftools ld plink vcf • 381 views

I think it can be done with a little work, but first an easier option - can you just collapse the biallelic sites to multiallelic sites, so that you have unique IDs? Does your downstream software support multiallelic sites?

You can collapse sites this way with bcftools, vcflib, or other tools; it's a pretty standard operation.
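For the bcftools route, something along these lines should do the collapsing. It is only a sketch: the wrapper name and the file names are made up, and it assumes `bcftools norm -m +any` merges the three records from the example into a single multiallelic record with one ID, which is worth checking on your own file.

```shell
# Hypothetical wrapper around bcftools norm; the function name and paths are placeholders.
collapse_to_multiallelic() {
  # $1 = input VCF (bgzipped), $2 = output VCF (bgzipped)
  # -m +any joins biallelic records at the same position into one multiallelic record
  bcftools norm -m +any -Oz -o "$2" "$1"
}

# Example call (file names are assumptions):
#   collapse_to_multiallelic gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6.collapsed.vcf.gz
```

Afterwards the same tabix query as in the question should show whether position 29440751 is now a single line.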

If not, somebody may cook up a more complex solution.
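As a starting point for the "keep the most frequent record per ID" idea, here is a rough awk sketch. It assumes the allele frequency is stored as a single-valued `AF=` entry in the INFO column (I have not checked which INFO keys this particular gnomAD file uses), and the function name `pick_highest_af` is made up. Records without an ID are left alone.

```shell
# Hypothetical helper: for each variant ID keep only the record with the highest INFO/AF.
# Assumes an uncompressed VCF on stdin with a single-valued AF=... entry in INFO (column 8).
pick_highest_af() {
  awk -F'\t' '
    /^#/ { print; next }                       # pass header lines through unchanged
    {
      af = 0                                   # records lacking AF are treated as AF=0
      n = split($8, info, ";")
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^AF=/) af = substr(info[i], 4) + 0
      # records with ID "." are keyed by position and alleles, so they are never dropped
      key = ($3 == ".") ? ($1 ":" $2 ":" $4 ":" $5) : $3
      if (!(key in best)) { order[++m] = key; best[key] = $0; bestaf[key] = af }
      else if (af > bestaf[key]) { best[key] = $0; bestaf[key] = af }
    }
    END { for (j = 1; j <= m; j++) print best[order[j]] }   # first-seen order
  '
}

# Example call (file names are assumptions):
#   bcftools view in.vcf.bgz | pick_highest_af | bgzip > unique_ids.vcf.gz
```

The output keeps the order in which each ID first appears, so a sorted input stays sorted as long as duplicate IDs share a position, as they do in the example. All records are held in memory until the end, so for large files it is better to run this per chromosome.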

4 months ago
jena ▴ 270

Oh wait - do you have this problem with plink specifically? Because plink up to 1.9 can only handle biallelic sites and, IIRC, drops any multiallelic site or even sites with repeated IDs.

But plink 2.0 handles both cases, or at most you may need to collapse to multiallelic sites, which plink 2 definitely handles.

