6.6 years ago
Tom ▴ 40

Hi everyone, wondering if someone can help me.

I have a 550K SNP array for 9,000 people, in PLINK format (.bim/.bed, or .ped/.map).

I want to narrow this 550K SNP set down, so that if there's a set of SNPs in LD with each other, this set of SNPs is represented by one SNP in the data set.

For example, if my data set contains SNP1, SNP2, SNP3 and SNP4 with LD > 0.8 between each pair, then SNP1 would be picked to represent all of them.

Therefore, I want to have two output files:

One is a list of tag SNPs for use in further analysis (i.e. this file should have fewer SNPs than my starting file), and the other is some sort of dictionary telling me which other SNPs were grouped with each tag SNP.

I ran:

plink --bfile Affy550K --show-tags ListOfSNPs --list-all --out ListOfTags

where:

--bfile Affy550K is my genotype data set, --show-tags ListOfSNPs is just a list of all 550K SNPs, --list-all requests the per-SNP .tags.list report, and --out ListOfTags is the prefix I want for the output files.

The output looks like this. There are two files:

ListOfTags.tags is identical to the ListOfSNPs file that I passed to the command.

The ListOfTags.tags.list looks like this:

ss66376937    1    2898248    0    2898248    2898248        0 NONE
ss66208373    1    2911720    0    2911720    2911720        0 NONE
ss66266914    1    2939927    2    2939927    2947460    7.533 ss66374352|ss66433379
ss66374352    1    2940194    3    2939927    2947460    7.533 ss66266914|ss66235044|ss66433379
ss66235044    1    2941694    2    2940194    2947460    7.266 ss66374352|ss66433379
ss66177133    1    2942700    0    2942700    2942700        0 NONE
ss66433379    1    2947460    3    2939927    2947460    7.533 ss66266914|ss66374352|ss66235044

This file has the same number of lines as both ListOfSNPs and ListOfTags.tags
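
For reading this file programmatically, a minimal Python sketch follows. It assumes the columns are whitespace-separated with the tagged SNPs in the last column, joined by "|" or given as NONE, as in the rows above. The function name parse_tags_list and the check for a header row starting with "SNP" are assumptions for illustration, not part of PLINK.

```python
def parse_tags_list(lines):
    """Map each SNP ID to the set of SNP IDs listed in its last column."""
    tags = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and a header row, if the file has one (assumed to start with "SNP").
        if len(fields) < 8 or fields[0] == "SNP":
            continue
        snp, partners = fields[0], fields[-1]
        tags[snp] = set() if partners == "NONE" else set(partners.split("|"))
    return tags


# The seven rows shown above, used as sample input.
sample = """\
ss66376937    1    2898248    0    2898248    2898248        0 NONE
ss66208373    1    2911720    0    2911720    2911720        0 NONE
ss66266914    1    2939927    2    2939927    2947460    7.533 ss66374352|ss66433379
ss66374352    1    2940194    3    2939927    2947460    7.533 ss66266914|ss66235044|ss66433379
ss66235044    1    2941694    2    2940194    2947460    7.266 ss66374352|ss66433379
ss66177133    1    2942700    0    2942700    2942700        0 NONE
ss66433379    1    2947460    3    2939927    2947460    7.533 ss66266914|ss66374352|ss66235044
"""

tags = parse_tags_list(sample.splitlines())
for snp, partners in tags.items():
    print(snp, sorted(partners))
```

For the real file, the same function could be called on an open file handle, e.g. parse_tags_list(open("ListOfTags.tags.list")).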

Questions:

In this example, does this mean that

1. ss66266914, ss66374352, ss66235044, ss66433379 are all in LD with each other and can be represented by one (randomly chosen?) SNP?

2. If this is true, do I have to code this myself to say "Take this input file, pick one of the above four SNPs (from question 1) randomly, and make a dictionary like this: {ss66266914: ss66374352, ss66235044, ss66433379}"?

Thanks

Tagging SNP plink • 5.2k views
6.6 years ago

#1 is correct.

However, you don't need to code your own SNP selection logic; PLINK's --indep-pairwise command does this for you. Try something like:

plink --bfile Affy550K --indep-pairwise 100 10 0.8
plink --bfile Affy550K --extract plink.prune.in --make-bed --out Affy550KPruned

This will automatically keep the higher-MAF variant whenever there is a choice.
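
The pruning step gives the list of kept SNPs but not the "dictionary" asked for in the question. A rough Python sketch of one way to build it is below, assuming you already have the kept SNP IDs (e.g. the lines of plink.prune.in) and a mapping from each SNP to the SNPs it tags (e.g. parsed from the .tags.list file). The function name group_by_kept and the sample inputs are hypothetical. Because --indep-pairwise and --show-tags use their own windows and thresholds, some dropped SNPs may have no kept partner listed; the sketch returns those separately rather than guessing.

```python
def group_by_kept(kept, tags):
    """Assign each dropped SNP to one kept SNP that tags it.

    kept: iterable of SNP IDs that survived pruning.
    tags: dict mapping SNP ID -> set of SNP IDs listed as its tags.
    Returns (groups, orphans): groups maps kept SNP -> list of dropped SNPs,
    orphans lists dropped SNPs with no kept partner.
    """
    kept = list(kept)
    kept_set = set(kept)
    groups = {snp: [] for snp in kept}
    orphans = []
    for snp in sorted(tags):
        if snp in kept_set:
            continue
        partners = sorted(tags[snp] & kept_set)
        if partners:
            # Arbitrary but deterministic choice: the first kept partner by ID.
            groups[partners[0]].append(snp)
        else:
            orphans.append(snp)
    return groups, orphans


# Hypothetical inputs: tag sets taken from the question's output,
# and an assumed list of kept SNPs standing in for plink.prune.in.
tags = {
    "ss66266914": {"ss66374352", "ss66433379"},
    "ss66374352": {"ss66266914", "ss66235044", "ss66433379"},
    "ss66235044": {"ss66374352", "ss66433379"},
    "ss66433379": {"ss66266914", "ss66374352", "ss66235044"},
    "ss66177133": set(),
}
kept = ["ss66374352", "ss66177133"]

groups, orphans = group_by_kept(kept, tags)
print(groups)
print(orphans)
```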

6.6 years ago

>1. ss66266914, ss66374352, ss66235044, ss66433379 are all in LD with each other and can be represented by one (randomly chosen?) SNP?

You are correct, but I wouldn't choose a SNP at random. I'd choose the one with the fewest missing alleles, and only choose randomly if the candidate tagging SNPs have the same number of missing alleles.
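
That selection rule can be sketched in Python as follows. This is only an illustration: the function name pick_tag and the missing-genotype counts are made up, and in practice the counts could come from something like the N_MISS column of the .lmiss file written by plink --missing.

```python
import random


def pick_tag(candidates, n_missing, rng=random):
    """Return the candidate with the fewest missing calls; break ties at random."""
    fewest = min(n_missing[snp] for snp in candidates)
    best = [snp for snp in candidates if n_missing[snp] == fewest]
    return rng.choice(best)


# Hypothetical missing-genotype counts for the four SNPs in the question.
n_missing = {
    "ss66266914": 12,
    "ss66374352": 3,
    "ss66235044": 7,
    "ss66433379": 12,
}

candidates = ["ss66266914", "ss66374352", "ss66235044", "ss66433379"]
print(pick_tag(candidates, n_missing))
```

Passing a seeded random.Random instance as rng makes the tie-breaking reproducible.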

>2. If this is true, do I have to code this myself to say "Take this input file, pick one of the above four SNPs (from question 1) randomly, and make a dictionary like this: {ss66266914: ss66374352, ss66235044, ss66433379}"?

This gets complicated rather quickly - I'd suggest running Haploview's Tagger instead, as that's easier. You can use the GUI and click your way around, but with your relatively large number of SNPs that may take a while. Or use the command-line version like this:

java -jar Haploview.jar -nogui -memory 40000 -out your_output_file -info your_info_file -skipCheck

where:

-nogui: don't start the GUI
-memory 40000: play around with this parameter if it crashes
-out your_output_file: will create your_output_file.TAGS and your_output_file.TESTS
-info your_info_file: the info file (see below)
-skipCheck: OPTIONAL; skips all standard SNP checks like MAF etc. You may want to leave this parameter out, as removing mostly empty SNPs etc. will speed up your analysis

You can create the info file from your PLINK MAP file by extracting the second and last columns with awk:

awk '{print $2 "\t"$4}' your_map_file > your_info_file

The file your_output_file.TAGS will have your tagging SNPs.