Hi everyone, wondering if someone can help me.
I have a 550K SNP array for 9,000 people, in PLINK format (.bim/.bed, or .ped/.map).
I want to narrow this 550K SNP set down, so that if there's a set of SNPs in LD with each other, this set of SNPs is represented by one SNP in the data set.
For example, if SNP1, SNP2, SNP3 and SNP4 in my data set are all in LD > 0.8 with each other, I want to pick SNP1 to represent all of them.
Therefore, I want two output files:
one is a list of tag SNPs for use in further analysis (this file should have fewer SNPs than my starting file, as it's a list of tag SNPs), and the other is some sort of dictionary telling me which other SNPs were grouped with each tag SNP.
I ran:
plink --bfile Affy550K --show-tags ListOfSNPs --list-all --out ListOfTags
where --bfile Affy550K is my genotype data set, --show-tags ListOfSNPs is just a list of all 500,000 SNPs, and --out ListOfTags is what I want the output called.
The output consists of two files.
ListOfTags.tags is the same as the ListOfSNPs file that I put into the command.
ListOfTags.tags.list looks like this:
ss66376937 1 2898248 0 2898248 2898248 0 NONE
ss66208373 1 2911720 0 2911720 2911720 0 NONE
ss66266914 1 2939927 2 2939927 2947460 7.533 ss66374352|ss66433379
ss66374352 1 2940194 3 2939927 2947460 7.533 ss66266914|ss66235044|ss66433379
ss66235044 1 2941694 2 2940194 2947460 7.266 ss66374352|ss66433379
ss66177133 1 2942700 0 2942700 2942700 0 NONE
ss66433379 1 2947460 3 2939927 2947460 7.533 ss66266914|ss66374352|ss66235044
This file has the same number of lines as both ListOfSNPs and ListOfTags.tags.
Questions:
In this example:
- Does this mean that ss66266914, ss66374352, ss66235044 and ss66433379 are all in LD with each other and can be represented by one (randomly chosen?) SNP?
- If so, do I have to code this myself, along the lines of "take this input file, pick one of the four SNPs from the first question at random, and make a dictionary like this: {ss66266914: ss66374352, ss66235044, ss66433379}"?
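To make the second question concrete, here is a minimal Python sketch of the kind of script I imagine having to write. It assumes the eight-column layout shown above (SNP ID in the first column, `|`-separated partners or NONE in the last) and skips a line starting with "SNP" in case the real file has a header. The function names `parse_tags_list` and `pick_tag_snps` are just made-up names, and the greedy "most partners first" rule is my own guess at a sensible choice, not something PLINK does. Note that in the output above ss66266914 and ss66235044 do not list each other, so the sketch groups around the SNP with the most partners rather than assuming all four are mutually in LD.

```python
def parse_tags_list(lines):
    """Return {snp: [partner SNPs]} in file order (assumed 8-column layout)."""
    partners = {}
    for line in lines:
        fields = line.split()
        # Skip blank/short lines and an assumed header line starting with "SNP".
        if len(fields) < 8 or fields[0] == "SNP":
            continue
        snp, tags = fields[0], fields[7]
        partners[snp] = [] if tags == "NONE" else tags.split("|")
    return partners


def pick_tag_snps(partners):
    """Greedy grouping: SNPs with the most partners are considered first,
    and each chosen tag SNP claims its partners that are not yet assigned."""
    groups = {}
    assigned = set()
    # sorted() is stable, so ties keep their order from the file.
    for snp in sorted(partners, key=lambda s: -len(partners[s])):
        if snp in assigned:
            continue
        members = [p for p in partners[snp] if p not in assigned]
        groups[snp] = members
        assigned.add(snp)
        assigned.update(members)
    return groups


# The seven example lines from ListOfTags.tags.list shown above.
example = """\
ss66376937 1 2898248 0 2898248 2898248 0 NONE
ss66208373 1 2911720 0 2911720 2911720 0 NONE
ss66266914 1 2939927 2 2939927 2947460 7.533 ss66374352|ss66433379
ss66374352 1 2940194 3 2939927 2947460 7.533 ss66266914|ss66235044|ss66433379
ss66235044 1 2941694 2 2940194 2947460 7.266 ss66374352|ss66433379
ss66177133 1 2942700 0 2942700 2942700 0 NONE
ss66433379 1 2947460 3 2939927 2947460 7.533 ss66266914|ss66374352|ss66235044
"""

groups = pick_tag_snps(parse_tags_list(example.splitlines()))
for tag, members in groups.items():
    print(tag, ",".join(members) if members else "NONE")
```

The keys of `groups` would be my list of tag SNPs and the values would be the dictionary of which SNPs each tag stands in for. Is something like this necessary, or can PLINK produce it directly?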
Thanks