Question

PLINK --update-name command error due to multiple SNPs with same chr:pos but different rs numbers in reference dataset

1

8.3 years ago

dam4l ▴ 200

Hi,

I have a data file containing autosomal SNPs imputed from the 1000 Genomes data. The SNPs in my file are named as chr:pos, but I want them to be named by rs number. I downloaded the 1000 Genomes phase 1 data from the PLINK resources site, excluded the sex chromosomes, and organized the file so that it has 2 columns: chr:pos (column 1) and the corresponding rs number (column 2). I then tried to use the PLINK --update-name command to update the SNP names in my file:

./plink --bfile my_data --update-name 1000_genomes_chrpos_rs.txt --make-bed --out my_data2

I got back the following error message:

Error: Duplicate variant ID '1:2351395' in --update-name file

In the 1000 Genomes file, this chr:pos (and likely others) corresponds to multiple rs numbers. Is there a way to rectify this, or to modify the PLINK command, so that I can rename the SNPs in my data file from chr:pos to rs number?

Thanks so much!

PLINK SNP • 9.2k views

ADD COMMENT • link updated 19 months ago by Fazil • 0 • written 8.3 years ago by dam4l ▴ 200

0

Hi,

I have exactly the same problem now. Did you figure out how to remove the duplicate variant IDs?

Thank you.

ADD REPLY • link 4.7 years ago by jystat ▴ 10

0

I am also wondering: how did you solve this problem?

ADD REPLY • link 2.9 years ago by geno89 ▴ 10

0

Hi,

You can use a Unix/Linux command to remove or rename duplicated or triplicated values in your file. The example here assumes that you want to make column 2 unique (test with a small file first). It adds _0, _1, _2, etc. to repeated values. For example, if your file has 2 columns:

18 15
44 16
55 15
77 15

it will be turned into:

18 15_0
44 16_0
55 15_1
77 15_2

(note that the changes are only in column 2). The next step in the pipe (sed 's/_0//') removes the _0 suffix and keeps _1, _2, etc.:

18 15
44 16
55 15_1
77 15_2

so the second column will have unique values.

The command is (I'm assuming you have 6 columns, if the number is different remove or add $3, $4 etc.):

awk '{print $1, $2"_"x[$2]++, $3, $4, $5, $6}' update_file_name | sed 's/_0//' > result_file_name

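If you would like to drop the suffixed duplicates (15_1, 15_2, etc.) entirely, you can extend the pipe with grep -v '_' (note that this removes every line containing an underscore). To make column 1 unique instead, apply the suffix to $1 rather than $2, i.e. print $1"_"x[$1]++, $2, $3, ... in place of $1, $2"_"x[$2]++, $3, ...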
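For the --update-name file in the original question, the duplicated values are in column 1 (chr:pos), so a shorter variant of the same idea may be enough. This is only a sketch: the function names are made up for illustration, and keeping the first rs number listed for each chr:pos is an arbitrary choice, so it is worth reviewing the ambiguous positions before relying on it.

```shell
# Keep only the first line seen for each value of column 1 (chr:pos),
# so that column 1 of the --update-name file becomes unique.
# Hypothetical helper name; reads stdin, writes stdout.
dedupe_update_name() {
    awk '!seen[$1]++'
}

# Print each chr:pos that appears more than once (printed a single time),
# so the positions with several rs numbers can be reviewed by hand.
list_ambiguous() {
    awk 'count[$1]++ == 1 { print $1 }'
}
```

For example, dedupe_update_name < 1000_genomes_chrpos_rs.txt > 1000_genomes_chrpos_rs.unique.txt writes a file (the output name is just an example) that can then be passed to --update-name.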

I hope it helps, Thanks

ADD REPLY • link 19 months ago by Fazil • 0

score 0 · Answer 1 · 2016-01-14

PLINK can't choose which rs number to use when two rs numbers share the same position in your --update-name file. You need to modify the --update-name file before using PLINK. You could use R to merge your .bim file with the PLINK reference file and, where there are duplicates, keep the rs number whose alleles match those of your SNP.
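The same allele-matching idea can also be sketched with awk instead of R. This assumes both files are in PLINK .bim format (chr, ID, cM, bp, A1, A2), that the reference .bim carries the rs numbers in column 2, and that both datasets are on the same strand; strand flips are not handled. The function name and file names are hypothetical.

```shell
# Build a two-column --update-name file (old ID, new ID) by matching on
# chr:pos and on the allele pair, so each variant gets at most one rs number.
# Hypothetical helper; arguments: reference .bim, then the .bim to rename.
match_rs_by_alleles() {
    awk '
        # Order-insensitive allele key, e.g. "A/G" for both A,G and G,A.
        function pair(a, b) { return (a < b) ? a "/" b : b "/" a }

        # First file (reference .bim): remember the rs number per position + alleles.
        NR == FNR {
            key = $1 ":" $4 SUBSEP pair($5, $6)
            if (!(key in rs)) rs[key] = $2   # keep the first rs number on a tie
            next
        }

        # Second file (the .bim to rename): print current ID and matching rs number.
        {
            key = $1 ":" $4 SUBSEP pair($5, $6)
            if (key in rs) print $2, rs[key]
        }
    ' "$1" "$2"
}
```

Running match_rs_by_alleles reference.bim my_data.bim > update_name.txt would give a file to pass to --update-name. Variants with no allele-matched rs number are left out of that file and so keep their chr:pos name.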