Question

How to tell Plink to merge based on Hg19 coordinates, not based on SNP ID


8.8 years ago

devenvyas ▴ 740

I am merging data sets, and the SNP IDs between the two are inconsistent, likely because Affx numbers changed between annotations. For example, merging SNP lists (in R) based on SNP ID loses me ~50,000 SNPs, but merging based on coordinate will only lose me ~10,000 SNPs (i.e., 545,956 sites vs. 585,413 sites).

If I have a command such as this:

plink --file data1 --merge data2.ped data2.map --recode --out merge

What do I do to tell Plink to ignore the SNP IDs from data2 and merge based on coordinate, not SNP ID?

Thanks!

-Deven

plink SNP • 4.4k views

ADD COMMENT • link updated 8.8 years ago by Maxime Lamontagne ★ 2.3k • written 8.8 years ago by devenvyas ▴ 740


You could replace the SNP ID with its position in the map or bim file, e.g. chr1:123456789

8.8 years ago by Maxime Lamontagne ★ 2.3k


I am not quite following what you mean, and I think that method may actually take longer than just getting Plink to ignore the SNP IDs.

Currently I have:

a) the two data sets in map/ped format

b) a list of 585,413 coordinates that match

c) a map file for the failed merger containing the 545,956 sites where both the SNP ID and the coordinates match

To begin with, I am not sure how to properly isolate the coordinates for the 39,457 that do not have matching SNP ids. After doing that I would need to find the old SNP id and the new SNP id for each.

Isn't there some simple way to tell Plink to ignore the SNP ids during merger?

8.8 years ago by devenvyas ▴ 740

score 3 · Accepted Answer · 2015-07-29

I don't think PLINK can merge by coordinates. I think PLINK merges on the SNP ID because you can have more than one SNP at a single genomic position.

If you want to merge by coordinate, replace the SNP ID (rs123456789) with the genomic position in the map file.

a) Replace the SNP ID with the genomic position in both map files, i.e. replace "rs123456789" with "chr:position". On Linux you can use awk: awk '{ print $1"\t"$1":"$4"\t0\t"$4 }' old-file.map > new-file.map (note that this also sets the genetic distance in column 3 to 0).

b) Merge the two datasets using the new map files. Since the SNP ID is now the genomic position in both map files, you should have 585,413 SNPs in the merged file.

c) Merge the new map file with one of the old map files in R to recover the real SNP IDs. You need to choose which map file to take the SNP IDs from, since your old map files are not identical.
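Steps a) to c) could be sketched roughly as follows. This is only an illustration on two made-up map files: the file names (data1.map, data2.map, the .pos suffix, pos_to_id.txt) and their contents are assumptions, and the plink call is left as a comment because it needs the matching .ped files. Here step c) is done with awk rather than R, and the genetic distance column is kept instead of being set to 0. The duplicate check is there because, as noted above, more than one SNP can sit at the same position, which would give duplicate "chr:position" IDs.

```shell
# Toy 4-column .map files: chromosome, SNP ID, genetic distance, bp position.
# Names and contents are invented for this example.
printf '1\tAFFX-001\t0\t1000\n1\trs222\t0\t2000\n' > data1.map
printf '1\tAFFX-999\t0\t1000\n1\trs222\t0\t2000\n' > data2.map

# Step a: replace the SNP ID (column 2) with "chr:position" in both files,
# keeping the genetic distance (column 3) as it was.
for f in data1 data2; do
    awk 'BEGIN { OFS = "\t" } { print $1, $1 ":" $4, $3, $4 }' "$f.map" > "$f.pos.map"
done

# List any coordinate-based IDs that occur more than once within a file;
# these would need to be handled by hand before merging.
cut -f2 data1.pos.map | sort | uniq -d
cut -f2 data2.pos.map | sort | uniq -d

# Step b (not run here): put copies of the .ped files next to the new map
# files under the same prefix, then merge as in the question, e.g.
#   plink --file data1.pos --merge data2.pos.ped data2.pos.map --recode --out merge

# Step c: lookup table from coordinate-based ID back to the data1 SNP ID.
awk 'BEGIN { OFS = "\t" } { print $1 ":" $4, $2 }' data1.map > pos_to_id.txt
cat pos_to_id.txt
```

The lookup table can then be used to put the chosen set of SNP IDs back into the merged map file.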

EDIT

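New idea: why not replace the SNP IDs in one map file with the names from the other map file, using the --update-map command? (By default --update-map changes positions; renaming needs the --update-name modifier in PLINK 1.07, or --update-name on its own in PLINK 1.9.)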
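For the renaming route, the list of old and new IDs could be built directly from the two map files by matching on coordinate. The sketch below assumes two made-up map files and a hypothetical output name (rename_data2.txt). It writes one "old ID, new ID" pair per line for every site where the coordinates agree but the IDs do not. The plink lines are left as comments and should be checked against the documentation for your PLINK version.

```shell
# Toy map files: the first site shares a coordinate but has different IDs,
# the second matches on both, the third is present in only one file each.
printf '1\tAFFX-001\t0\t1000\n1\trs222\t0\t2000\n2\trs333\t0\t500\n' > data1.map
printf '1\tAFFX-999\t0\t1000\n1\trs222\t0\t2000\n2\trs444\t0\t700\n' > data2.map

# First file (data1.map): remember the SNP ID seen at each chr:position.
# Second file (data2.map): where the coordinate is shared but the ID differs,
# print "data2 ID <tab> data1 ID".
awk 'BEGIN { OFS = "\t" }
     { key = $1 ":" $4 }
     NR == FNR { id[key] = $2; next }
     (key in id) && id[key] != $2 { print $2, id[key] }' \
    data1.map data2.map > rename_data2.txt
cat rename_data2.txt

# Not run here. Rename the SNPs in data2 before merging:
#   PLINK 1.9:  plink --file data2 --update-name rename_data2.txt --recode --out data2.renamed
#   PLINK 1.07: plink --file data2 --update-map rename_data2.txt --update-name --recode --out data2.renamed
```

After renaming, data2.renamed carries the data1 IDs at every shared coordinate, so the original merge command from the question can be run against it unchanged.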