I am performing a GWAS on a data set and have been using PLINK 1.9 for pre-processing. A problem I’ve discovered is that we have duplicate variant IDs (some of the rs#s appear more than once). I originally thought this meant we had duplicate SNPs and that we could handle the problem using the following code I found online:
./plink1_9 --bfile 23andMe_April26 --list-duplicate-vars ids-only suppress-first
./plink1_9 --bfile 23andMe_April26 --exclude plink.dupvar --make-bed --out 23andMe_April26_snp_choose
This code does not work; it fails with an error saying there is a duplicate variant ID. I believe the problem is that a duplicate SNP is not the same thing as a duplicate variant ID (as I understand it, duplicate SNPs are variants that share the same position and allele codes, whatever their IDs are). There are a variety of solutions online, but most of them end up recommending the commands above as the fix in PLINK 1.9. The reason the code fails is given on this page:
https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars
it says “--list-duplicate-vars fails in 'ids-only' mode if any of the reported variant IDs are not unique.” This is why I suspect that duplicate SNPs are different from duplicate variant IDs.
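One way to check this directly is to count the IDs in the .bim file itself rather than rely on --list-duplicate-vars, which (as I read the documentation) groups variants by position and allele codes, not by ID. A minimal sketch with standard Unix tools follows; toy.bim and its contents are made up for illustration and are not my real data:

```shell
# Toy .bim (hypothetical data). Columns: chrom, variant ID, cM, bp, A1, A2.
# rs100 is a true duplicate (same position and alleles);
# rs200 is the same ID used at two different positions.
printf '1\trs100\t0\t1000\tA\tG\n'  > toy.bim
printf '1\trs200\t0\t2000\tC\tT\n' >> toy.bim
printf '1\trs100\t0\t1000\tA\tG\n' >> toy.bim
printf '2\trs300\t0\t500\tG\tA\n'  >> toy.bim
printf '2\trs200\t0\t700\tT\tC\n'  >> toy.bim

# Column 2 holds the variant ID; uniq -d prints each ID that occurs more than once.
cut -f2 toy.bim | sort | uniq -d > dup_ids.txt
cat dup_ids.txt
```

On the toy file this lists rs100 and rs200. Running the same cut/sort/uniq line on the real .bim would show whether the repeated rs#s are true duplicates or one ID reused at several positions.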
The ‘ids-only’ modifier is necessary so that the output can be passed to --exclude, and ‘suppress-first’ is necessary so that we keep one of the duplicates.
What I want to do: remove all but one of each set of duplicate variant IDs, after we have removed all the rs#s with too much missing data.
What I can currently do:
- I can remove all duplicate SNPs, which means not keeping even one of the duplicates. This is a problem because we end up losing information we may want to use later.
- I can keep everything, but I believe we will get warnings later in the pipeline because we have duplicate variant IDs.
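A possible middle ground, sketched here under assumptions, is to make the IDs unique in a copy of the .bim by adding a suffix to every copy after the first, and then exclude only the suffixed copies. The file names (toy.bim, toy_unique.bim, later_copies.txt) and the _dupN suffix are hypothetical choices, and the PLINK call is left as a comment because I have not verified it on my data:

```shell
# Toy .bim (hypothetical data). Columns: chrom, variant ID, cM, bp, A1, A2.
printf '1\trs100\t0\t1000\tA\tG\n'  > toy.bim
printf '1\trs200\t0\t2000\tC\tT\n' >> toy.bim
printf '1\trs100\t0\t1000\tA\tG\n' >> toy.bim
printf '2\trs300\t0\t500\tG\tA\n'  >> toy.bim
printf '2\trs200\t0\t700\tT\tC\n'  >> toy.bim

# Leave the first occurrence of each ID alone; rename the 2nd, 3rd, ...
# occurrences to ID_dup2, ID_dup3, ... so every ID in the file is unique.
awk 'BEGIN { OFS = "\t" }
     { n = ++seen[$2]; if (n > 1) $2 = $2 "_dup" n; print }' toy.bim > toy_unique.bim

# Collect the renamed IDs, one per line, in the format --exclude expects.
awk '$2 ~ /_dup[0-9]+$/ { print $2 }' toy_unique.bim > later_copies.txt
cat later_copies.txt

# Assumed follow-up on the real data (not run here): point PLINK at the
# edited .bim alongside the untouched .bed and .fam, then drop the later copies.
# ./plink1_9 --bed 23andMe_April26.bed --bim 23andMe_April26_unique.bim \
#   --fam 23andMe_April26.fam --exclude later_copies.txt \
#   --make-bed --out 23andMe_April26_dedup
```

Note that this keeps whichever copy appears first in the file, regardless of its missingness or position, so it would only be appropriate once I know the copies really are the same variant.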
I’ve also seen a post saying that PLINK2's --set-all-var-ids option can solve this problem. I downloaded PLINK2 but ran into a variety of errors trying to use it. I'm also confused about how it is supposed to solve the problem; I believe it makes the variant IDs unique by building each ID from a template that includes the alleles.
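To make my understanding concrete, here is a rough emulation in awk of what I think a template such as @:#:$r:$a does (chromosome, position, REF, ALT). This is not PLINK2 itself; it assumes that column 6 of the .bim is treated as REF and column 5 as ALT, and the toy file is made up:

```shell
# Toy .bim (hypothetical data). Columns: chrom, variant ID, cM, bp, A1, A2.
printf '1\trs100\t0\t1000\tA\tG\n'  > toy.bim
printf '1\trs200\t0\t2000\tC\tT\n' >> toy.bim
printf '1\trs100\t0\t1000\tA\tG\n' >> toy.bim
printf '2\trs300\t0\t500\tG\tA\n'  >> toy.bim
printf '2\trs200\t0\t700\tT\tC\n'  >> toy.bim

# Replace each ID with chrom:bp:A2:A1 (assumption: A2 plays the role of REF).
awk 'BEGIN { OFS = "\t" } { $2 = $1 ":" $4 ":" $6 ":" $5; print }' toy.bim > toy_renamed.bim
cut -f2 toy_renamed.bim

# Assumed PLINK2 equivalent on the real data (not run here; I am not sure
# whether the rename and the de-duplication can share one run or need two):
# plink2 --bfile 23andMe_April26 --set-all-var-ids '@:#:$r:$a' \
#   --make-bed --out 23andMe_April26_newids
# plink2 --bfile 23andMe_April26_newids --rm-dup force-first \
#   --make-bed --out 23andMe_April26_dedup
```

In the toy output the two rs200 rows get different IDs (1:2000:T:C and 2:700:C:T), while the two identical rs100 rows both become 1:1000:G:A. So if I understand correctly, the template alone separates IDs that were reused at different positions, but true duplicates still share an ID and would need something like --rm-dup to reduce them to one.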
Please correct me if I have made any errors here. I am quite new to these kinds of analyses. Any help is much appreciated. Thank you in advance for any assistance.