Question

Duplicate Variant IDs Issue

4.8 years ago

goodmb • 0

I am performing a GWAS on a data set and have been using PLINK 1.9 for our pre-processing. A problem I’ve discovered is that we have duplicate variant IDs (some of the rs#s appear more than once). I originally thought this meant we had duplicate SNPs and that we could handle the problem using the following code I found online:

./plink1_9 --bfile 23andMe_April26 --list-duplicate-vars ids-only suppress-first
./plink1_9 --bfile 23andMe_April26 --exclude plink.dupvar --make-bed --out 23andMe_April26_snp_choose

This code is not working; it fails with an error saying there is a duplicate variant ID. I believe the problem is that duplicate SNPs are not the same thing as duplicate variant IDs (as I understand it, duplicate SNPs also share the same allele assignments). There are a variety of solutions online, but in the end most of them say this is the fix in PLINK 1.9. As for why this code is not working, if you go to this page:

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

it says “--list-duplicate-vars fails in 'ids-only' mode if any of the reported variant IDs are not unique.” This is why I suspect duplicate SNPs are different from duplicate variant IDs.

The ‘ids-only’ modifier is necessary so that you can use --exclude, and ‘suppress-first’ is necessary so we can keep one of the duplicates.
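One way to see which situation applies is to inspect the .bim file directly, outside of PLINK. The sketch below is only an illustration on a made-up toy .bim (the file names and the `same-site` / `different-sites` labels are hypothetical, not PLINK output). It lists every ID that occurs more than once and reports whether the repeated records share chromosome, position, and both alleles. Records whose alleles are merely swapped would be counted as different here.

```shell
# Toy .bim with columns: chr, ID, cM, bp, allele 1, allele 2 (made-up values).
# Replace "$workdir/toy.bim" with your own file, e.g. 23andMe_April26.bim.
workdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n' >  "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"
printf '1\trs3\t0\t3000\tA\tC\n' >> "$workdir/toy.bim"
printf '2\trs3\t0\t500\tG\tT\n'  >> "$workdir/toy.bim"

# For each ID, count its records and its distinct chr:bp:a1:a2 combinations.
awk '
{
  key = $1 ":" $4 ":" $5 ":" $6
  count[$2]++
  if (!(($2, key) in seen)) { seen[$2, key] = 1; sites[$2]++ }
}
END {
  for (id in count)
    if (count[id] > 1)
      print id, (sites[id] == 1 ? "same-site" : "different-sites")
}' "$workdir/toy.bim" | sort > "$workdir/dup_report.txt"

cat "$workdir/dup_report.txt"
```

In the toy data, rs2 is repeated with identical position and alleles, while rs3 is one ID attached to two different sites.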

What I want to do is the following: remove all but one record for each duplicated variant ID, after we have removed all the rs#s with too much missing data.

What I can currently do:

- I can remove all duplicate SNPs, which means not keeping one of the duplicates. This is a problem because we end up losing information we may want to use later.
- I can keep everything, but I believe we end up getting warnings later in the code because we have duplicate variant IDs.

I’ve also seen a post saying one can use PLINK2's option --set-all-var-ids to solve this problem. I downloaded PLINK2 and have run into a variety of errors trying to implement this code. I'm also confused as to how it is supposed to solve the problem. I believe it makes the names more specific by adding the alleles to the ID of each variant.

Please correct me if I have made any errors here. I am quite new to these kinds of analyses. Any help is much appreciated. Thank you in advance for any assistance.

SNP PLINK GWAS • 6.6k views

ADD COMMENT • link updated 4.8 years ago by chrchang523 10k • written 4.8 years ago by goodmb • 0

score 1 · Answer 1 · 2019-08-02

4.8 years ago

chrchang523 10k

PLINK2's --rm-dup flag addresses actual duplicate SNPs, and the --set-all-var-ids flag you mention should assign unique IDs to same-rs# records which are actually different in your dataset.
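To make the effect of the ID template concrete, here is a rough awk emulation of what a template such as `@:#[b37]\$r,\$a` produces, where `@` is the chromosome, `#` the base-pair position, and `$r` / `$a` the reference and alternate alleles. It runs on a made-up toy .bim and is not PLINK itself. It assumes column 6 of the .bim is treated as the reference allele and column 5 as the alternate, which may not match how your file was written.

```shell
# Toy .bim with columns: chr, ID, cM, bp, allele 1, allele 2 (made-up values).
workdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n' >  "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"
printf '1\trs3\t0\t3000\tA\tC\n' >> "$workdir/toy.bim"
printf '2\trs3\t0\t500\tG\tT\n'  >> "$workdir/toy.bim"

# Emulate chr:pos[b37]ref,alt (assumption: column 6 = ref, column 5 = alt).
awk '{ print $1 ":" $4 "[b37]" $6 "," $5 }' "$workdir/toy.bim" > "$workdir/new_ids.txt"

# New IDs that still collide after renaming.
sort "$workdir/new_ids.txt" | uniq -d > "$workdir/colliding_ids.txt"

cat "$workdir/new_ids.txt"
echo "still duplicated:"
cat "$workdir/colliding_ids.txt"
```

In this toy case the two rs3 records end up with different IDs because they sit at different sites, while the two rs2 records still share one ID. Those remaining collisions are the actual duplicate SNPs that --rm-dup is meant to handle.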

ADD COMMENT • link 4.8 years ago by chrchang523 10k

Thank you, this is very helpful. I found I can run the following no problem:

./plink2 --bfile 23andMe_April26 --rm-dup force-first --make-bed --out MyDataLessDuplicates

but when I add on the --set-all-var-ids:

./plink2 --bfile 23andMe_April26 --set-all-var-ids @:#[b37]\$r,\$a --rm-dup force-first --make-bed --out MyDataLessDuplicates

I get the following error:

Error: 140 allele codes too long for --set-all-var-ids. Use '--new-id-max-allele-len [len] missing' to set the IDs of all variants with an allele code longer than the given length to '.' (and then process those variants with another script, if necessary).

I am uncertain what using this fix might do with my data. It's possible there is just something wrong with my code.

Right now I am using --rm-dup force-first (which I know keeps the first instance of the SNP). Is there a way to remove the SNP with more missing data instead of arbitrarily picking the first one?

ADD REPLY • link 4.8 years ago by goodmb • 0

If you're fine with super-long variant IDs, you could instead add e.g. "--new-id-max-allele-len 1000" to your --set-all-var-ids command line.
"--rm-dup retain-mismatch" can be used to keep all instances of the duplicated SNP when the genomic data isn't identical, while writing those IDs to <output prefix>.rmdup.mismatch. You can then look at just those SNPs and decide how you want to handle them. (Yes, this will require a bit of scripting on your part.)

ADD REPLY • link 4.8 years ago by chrchang523 10k
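For the earlier question about keeping the copy with less missing data, the scripting mentioned in the reply above could look something like the following. This is a workaround sketch, not a built-in PLINK feature. The `_dupN` suffix, the file names, and the toy numbers are all made up, and the PLINK calls are left as comments because they need the real binary and data. It assumes a PLINK 1.9 style .lmiss table (CHR, SNP, N_MISS, N_GENO, F_MISS) and that no genuine ID already ends in `_dup` followed by digits.

```shell
workdir=$(mktemp -d)

# Toy .bim with a repeated ID (made-up values).
printf '1\trs1\t0\t1000\tA\tG\n' >  "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"
printf '1\trs2\t0\t2000\tC\tT\n' >> "$workdir/toy.bim"

# Step 1: make IDs unique by suffixing the 2nd, 3rd, ... occurrence.
awk 'BEGIN { OFS = "\t" }
     { n[$2]++; if (n[$2] > 1) $2 = $2 "_dup" n[$2]; print }' \
    "$workdir/toy.bim" > "$workdir/toy_unique.bim"

# Step 2 (not run here): per-variant missingness with the renamed .bim, e.g.
#   plink --bed data.bed --bim toy_unique.bim --fam data.fam --missing --out toy_unique
# A hand-written stand-in for the resulting .lmiss file:
printf ' CHR SNP N_MISS N_GENO F_MISS\n'  >  "$workdir/toy_unique.lmiss"
printf ' 1 rs1 0 100 0\n'                 >> "$workdir/toy_unique.lmiss"
printf ' 1 rs2 40 100 0.4\n'              >> "$workdir/toy_unique.lmiss"
printf ' 1 rs2_dup2 5 100 0.05\n'         >> "$workdir/toy_unique.lmiss"

# Step 3: per original ID, keep the record with the lowest F_MISS and
# write every other record to an exclusion list (ties keep the first seen).
awk 'NR > 1 {
  id = $2; base = id; sub(/_dup[0-9]+$/, "", base)
  if (!(base in best)) { best[base] = id; miss[base] = $5 + 0 }
  else if ($5 + 0 < miss[base]) { print best[base]; best[base] = id; miss[base] = $5 + 0 }
  else print id
}' "$workdir/toy_unique.lmiss" > "$workdir/exclude.txt"

cat "$workdir/exclude.txt"

# Step 4 (not run here): drop the losers, e.g.
#   plink --bed data.bed --bim toy_unique.bim --fam data.fam \
#         --exclude exclude.txt --make-bed --out deduped
```

In the toy table the unsuffixed rs2 record has more missing calls, so it is the one written to the exclusion list. If the suffixed copy survives, the `_dupN` suffix can be stripped from the final .bim afterwards.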

Rather than remove the duplicates, is there a way to keep any 'lost' data via a merger? Consider the following example: there is a duplicate variant in my data; let's call the first record variantA and the other variantB (they are the same variant, the names are just for indexing). Let's say there are 3 people in my data set (person1, person2, and person3). variantA has data for person1 and person2, whereas variantB has data for person1 and person3. With the suppress-first option, variantA will be kept and variantB will be removed. But since they are the same variant, we could do better by merging the person3 data from variantB into variantA, thus having data for all 3 people.

I am wondering if there is a nice way to do this with PLINK syntax, so that we keep as much data as possible. Thank you in advance for any assistance.

ADD REPLY • link 4.8 years ago by goodmb • 0

plink 1.9 --bmerge, used to merge a dataset with itself, might do what you want here.
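As a rough picture of what such a merge would need to do for the variantA / variantB example, here is a toy awk sketch of a consensus rule: missing calls are ignored, and conflicting non-missing calls become missing. That mirrors my understanding of PLINK 1.9's default merge mode, but the sketch is not PLINK, and the long-format table and genotype codes are made up. The real call would presumably be along the lines of `plink --bfile mydata --bmerge mydata --make-bed --out merged`, which I have not checked against a file with repeated IDs.

```shell
workdir=$(mktemp -d)

# Toy calls: record, person, genotype ("00" = missing). Made-up values.
cat > "$workdir/calls.txt" <<'EOF'
variantA person1 AG
variantA person2 GG
variantA person3 00
variantB person1 AG
variantB person2 00
variantB person3 AA
EOF

# Consensus per person: ignore missing calls, set conflicting calls to missing.
awk '
{
  person = $2; geno = $3
  if (!(person in merged)) merged[person] = "00"   # start out as missing
  if (geno == "00") next                            # missing call: ignore
  if (merged[person] == "00" && !(person in clash)) merged[person] = geno
  else if (merged[person] != geno) { merged[person] = "00"; clash[person] = 1 }
}
END { for (person in merged) print person, merged[person] }
' "$workdir/calls.txt" | sort > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

In this toy case all three people end up with a call, since person3's genotype is taken from variantB and person2's from variantA.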

ADD REPLY • link 4.8 years ago by chrchang523 10k