I've got some GWAS data I'd like to impute, but for that to happen, I need every SNP to be aligned to the forward strand of the reference genome. This is not as simple as it sounds, because many SNPs are ambiguous (A/T or C/G) combinations.
Therefore I've tried looking at both the strand data for the chips and the SNP manifests, comparing them to the SNPs that are flipped, but I cannot see any pattern. What I'm looking for is a pattern in these files that explains why the SNPs in the first list are incorrect compared to the reference genome. If you see anything or need more info, please do ask.
P.S. The data might have been botched by the researchers who originally used them (they have long since moved on).
Here is the head of the list of SNPs that are flipped compared to the reference (name, chr, position, a1, a2, reference nucleotide):
    rs1774963  1  21703207  C  T  G
    rs2257576  1  83736947  T  C  A
    rs315041   1  77055775  A  G  C
    rs3094315  1  752565    C  T  G
    rs3737728  1  1021414   T  C  A
    rs11721    1  1152630   T  G  C
    rs2887286  1  1156130   G  A  T
    rs3813199  1  1158276   T  C  G
    rs3766186  1  1162434   T  G  C
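To make "flipped" concrete, here is a minimal Python sketch of the check, assuming each SNP is biallelic and the reference nucleotide was read from the forward strand. The names `classify` and `ROWS` are made up for illustration and do not come from any tool.

```python
# Complement of each base, used to move alleles to the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify(a1, a2, ref):
    """Compare a biallelic SNP with the reference base.

    Returns "ambiguous" for A/T and C/G SNPs (strand cannot be inferred
    from the alleles), "match" if the reference base is one of the alleles,
    "flipped" if it is only found after complementing both alleles, and
    "mismatch" otherwise (for example when the reference base is N).
    """
    alleles = {a1, a2}
    complemented = {COMPLEMENT[a1], COMPLEMENT[a2]}
    if alleles == complemented:
        return "ambiguous"
    if ref in alleles:
        return "match"
    if ref in complemented:
        return "flipped"
    return "mismatch"


# The nine rows listed above: name, chr, position, a1, a2, reference nucleotide.
ROWS = [
    ("rs1774963", 1, 21703207, "C", "T", "G"),
    ("rs2257576", 1, 83736947, "T", "C", "A"),
    ("rs315041", 1, 77055775, "A", "G", "C"),
    ("rs3094315", 1, 752565, "C", "T", "G"),
    ("rs3737728", 1, 1021414, "T", "C", "A"),
    ("rs11721", 1, 1152630, "T", "G", "C"),
    ("rs2887286", 1, 1156130, "G", "A", "T"),
    ("rs3813199", 1, 1158276, "T", "C", "G"),
    ("rs3766186", 1, 1162434, "T", "G", "C"),
]

for name, chrom, pos, a1, a2, ref in ROWS:
    print(name, classify(a1, a2, ref))
```

Because a non-ambiguous SNP and its complement together cover all four bases, "mismatch" can only occur in this sketch when the reference base is not A, C, G or T.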
Here are the corresponding entries from the strand file (http://www.well.ox.ac.uk/~wrayner/strand/):
    rs1774963  1  21703208  99.1735537190083  +  AG
    rs2257576  1  83736948  100               +  AG
    rs315041   1  77055776  99.1735537190083  -  AG
    rs3094315  1  752566    99.1735537190083  +  AG
    rs3737728  1  1021415   100               +  AG
    rs11721    1  1152631   99.1735537190083  +  AC
    rs2887286  1  1156131   100               -  AG
    rs3813199  1  1158277   99.1735537190083  +  AG
    rs3766186  1  1162435   99.1735537190083  +  AC
Here are the corresponding entries from the SNP table/manifest:
    Name       SNP    ILMN Strand  Customer Strand
    rs1774963  [A/G]  TOP          BOT
    rs2257576  [A/G]  TOP          BOT
    rs315041   [T/C]  BOT          TOP
    rs3094315  [T/C]  BOT          TOP
    rs3737728  [A/G]  TOP          BOT
    rs11721    [A/C]  TOP          BOT
    rs2887286  [T/C]  BOT          TOP
    rs3813199  [A/G]  TOP          BOT
    rs3766186  [A/C]  TOP          BOT
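One way to look for a pattern is to put the three listings side by side and count how the columns co-occur. The Python sketch below does that for the nine SNPs shown here only. `RECORDS`, `position_offsets` and `strand_combos` are illustrative names, and nine rows are far too few to draw conclusions about the whole chip.

```python
from collections import Counter

# One tuple per SNP, combining the three listings in this post:
# (name, position in the flipped list, position in the strand file,
#  strand-file orientation, ILMN Strand, Customer Strand)
RECORDS = [
    ("rs1774963", 21703207, 21703208, "+", "TOP", "BOT"),
    ("rs2257576", 83736947, 83736948, "+", "TOP", "BOT"),
    ("rs315041", 77055775, 77055776, "-", "BOT", "TOP"),
    ("rs3094315", 752565, 752566, "+", "BOT", "TOP"),
    ("rs3737728", 1021414, 1021415, "+", "TOP", "BOT"),
    ("rs11721", 1152630, 1152631, "+", "TOP", "BOT"),
    ("rs2887286", 1156130, 1156131, "-", "BOT", "TOP"),
    ("rs3813199", 1158276, 1158277, "+", "TOP", "BOT"),
    ("rs3766186", 1162434, 1162435, "+", "TOP", "BOT"),
]

# Difference between the strand-file position and the flipped-list position.
position_offsets = Counter(
    strand_pos - list_pos
    for _name, list_pos, strand_pos, _orient, _ilmn, _customer in RECORDS
)

# How often each (orientation, ILMN Strand, Customer Strand) combination occurs.
strand_combos = Counter(
    (orient, ilmn, customer)
    for _name, _list_pos, _strand_pos, orient, ilmn, customer in RECORDS
)

print("position offsets:", dict(position_offsets))
for combo, count in strand_combos.most_common():
    print(combo, count)
```

The same tabulation over the full files, rather than just the head, would show whether any combination of these columns separates the flipped SNPs from the rest.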
What is the rule that explains why the SNPs in the first list are on the opposite strand from the reference genome? Or might these data be nonsensical?