I am performing quality control on genotyping results from the Illumina Global Screening Array. The SNP names in the PLINK-format .map and .ped files are not in a consistent format (i.e., some are rsIDs, some have prefixes/suffixes, some are in chr:pos format, etc.). Illumina provides a support file to "easily" convert the locus names to rsIDs.
This file can be found here: https://support.illumina.com/array/array_kits/infinium-global-screening-array/downloads.html (go to Infinium Global Screening Array v1.0 Support Files >> Infinium Global Screening Array v1.0 Loci Name to rsID Conversion File to download it). It is referred to as 'rsid_conversion.txt' in the following example.
I would like to use PLINK to convert all of the variants to rsID before beginning any additional QC on these data.
I am planning to run --update-map with the --update-name flag on the binary PLINK files I have created, to change the naming convention:
./plink --bfile mydata --update-map rsid_conversion.txt --update-name --make-bed --out mydata_2
The 'rsid_conversion.txt' document must contain 2 columns, 1 with the 'old' SNP ID and 1 with the 'new' SNP ID.
The issue I am having is that the 'rsid_conversion.txt' file (provided by Illumina) has ~1700 variants with multiple corresponding rsIDs. It looks like this:
1:100292476 rs568121721
1:101064936 rs573946207
1:103380393 rs577266494
1:104303716 rs565423312
1:104864464 rs572915890
1:106737318 rs577315876
1:109439680 rs755970517
1:111119214 rs574063395
1:114483147 rs563427365
1:118227370 rs550820657
1:1183442 rs566056983
1:118933200 rs566726162
1:11907740 rs770346667
1:119872141 rs587704005,rs775057557
1:120123727 rs587654226
1:120608075 rs61200250
1:143928232 rs10217823,rs587739047
1:145030589 rs587603332
1:147231345 rs782228278
1:147539994 rs587645706
1:152276828 rs527781212
1:152280670 rs536240526
1:153796202 rs533359379
1:153941698-CT .
1:154192463 rs551231942
1:1542721 rs532649680
1:156810888 rs544015279
1:158224461 rs762512961
1:158549106 rs146846805
1:158651421 rs746078222
1:159174749-C-T rs373611432
1:159175193-A-G rs3027016
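For reference, the affected rows can at least be separated out automatically. The following is only a minimal sketch, assuming the file is whitespace-delimited with the locus name in the first column and a comma-separated rsID list (or "." for no rsID) in the second, as in the excerpt above. The function name split_conversion is my own, and the script does not decide which of several rsIDs is the current one.

```python
def split_conversion(lines):
    """Sort conversion rows into single-rsID, multi-rsID and unmapped loci."""
    single, multi, unmapped = {}, {}, []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        name, rsid_field = fields
        if rsid_field == ".":
            unmapped.append(name)  # no rsID provided for this locus
            continue
        rsids = [r for r in rsid_field.split(",") if r.startswith("rs")]
        if not rsids:
            continue  # e.g. a header row (assumed not to contain rsIDs)
        if len(rsids) == 1:
            single[name] = rsids[0]
        else:
            multi[name] = rsids
    return single, multi, unmapped


# Three rows taken from the excerpt above.
sample = [
    "1:119872141 rs587704005,rs775057557",
    "1:120123727 rs587654226",
    "1:153941698-CT .",
]
single, multi, unmapped = split_conversion(sample)

# Two-column "old new" lines, usable as an update file for the unambiguous loci.
update_lines = [f"{old}\t{new}" for old, new in single.items()]
print(update_lines)
print(multi)
print(unmapped)
```

With the real file, the same function could be called as split_conversion(open("rsid_conversion.txt")), leaving only the entries in multi to be resolved.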
I have checked a few of these manually in dbSNP (only a small fraction of the ~1700 cases), and in most of those one of the rsIDs is historical. There are, however, some variants with 3+ corresponding rsIDs, and I do not want to check them all by hand or manually remove all but one.
Is there an automated way to retain only one rsID per variant, and to ensure that it is the 'correct' rsID (i.e., not historical)?
I am also concerned that some of these multiple-rsID cases might actually correspond to different variants occurring at the same position (e.g., SNP vs. indel). There are too many variants to go through each case manually. Is there an automated way to check for this?
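As a rough first pass at the SNP vs. indel question, the multi-rsID loci could be cross-checked against the alleles in the .bim file. The sketch below assumes the standard six-column .bim layout (chromosome, variant ID, cM, bp, allele 1, allele 2) and that indels show up as I/D, "-", or multi-base alleles, which is an assumption about how the data were exported. The allele values in the example are made up for illustration, and the helper names are hypothetical. It only flags loci that look like indels; it does not tell me which rsID belongs to which variant.

```python
def is_indel_allele(allele):
    """Heuristic: I/D or '-' codes, or any multi-base allele, suggest an indel."""
    return allele in {"I", "D", "-"} or len(allele) > 1


def flag_possible_indels(bim_lines, multi_ids):
    """Return {variant_id: (a1, a2)} for multi-rsID loci with indel-like alleles."""
    flagged = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # not a standard .bim row
        _chrom, var_id, _cm, _bp, a1, a2 = fields
        if var_id in multi_ids and (is_indel_allele(a1) or is_indel_allele(a2)):
            flagged[var_id] = (a1, a2)
    return flagged


# Locus names from the excerpt above; the alleles are invented for illustration.
bim_sample = [
    "1\t1:119872141\t0\t119872141\tA\tG",
    "1\t1:143928232\t0\t143928232\tD\tI",
    "1\t1:120123727\t0\t120123727\tC\tT",
]
multi_ids = {"1:119872141", "1:143928232"}
flagged = flag_possible_indels(bim_sample, multi_ids)
print(flagged)
```

Anything flagged this way would still need a dbSNP lookup to work out which of the listed rsIDs matches the assayed variant.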
Any input is greatly appreciated.