I am attempting to do some fine-mapping experiments. To increase the chances of capturing the pathogenic variants, I would like to include as many variants as I can, including AT and GC SNPs. These make up about 15% of my data, which seems like a lot to drop if it can be retained.
I was reading from the Genotype Harmonizer webpage at https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer#typical-usage-scenarios, and I noticed the following (it is about imputation, but the same logic would apply to my trans-ethnic fine-mapping experiments).
"When imputing genotype data the strand of both the study data to impute and the reference data used for imputation need to be identical. Some imputation tools can swap the strand of non-ambiguous SNPs but this is not possible for AT and GC SNPs. AT and GC can be swapped using minor allele frequency but this is not reliable, especially for variants with a high minor allele frequency. The Genotype Harmonizer solves these problems by using LD structure of nearby variants."
For some of the studies, I only have association summary statistics, including MAF, so I cannot use an LD-based approach. My question concerns the claim in the quote above that swapping AT and GC SNPs using minor allele frequency is not reliable: provided that the minor allele frequency is not close to 0.5, why is this unreliable? For example, if we have a SNP rs1234 with A: 0.800 and T: 0.200, just how unreliable is it to assign a strand using MAF? More importantly, I would like to know the reasons why it is unreliable.
Edit: I continued reading the Genotype Harmonizer manuscript. At one point, they clarify:
"One solution to the problem of unknown strands is to compare the minor allele between two datasets. However, use of the minor allele is not ideal as it can differ between datasets and populations, especially for common variants."
This makes sense, because the minor allele of many variants, including AT and GC SNPs, can be different in different populations. Thus, a SNP that has MAF not near 0.5 in one subpopulation may still have MAF near 0.5 in another subpopulation. However, in my case, I know the population code of all the individuals in my data and in the reference datasets I used to align the other SNPs.
So, provided that I do the alignment to the + strand on a subpopulation-by-subpopulation basis, am I OK to use MAF to assign strand? To clarify, I would drop any AT or GC SNP with MAF > 0.3 in one or more subpopulations. If I take this route, are there still any systematic (rather than sporadic) issues with doing this, or is it alright once those precautions are taken?
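To make the proposed rule concrete, here is a minimal Python sketch of it. The function names (`assign_strand`, `keep_snp`), the input format (allele frequencies as a dict per subpopulation) and the "same"/"flip"/"drop" labels are all hypothetical choices made for illustration; this is not how Genotype Harmonizer works, which uses LD rather than MAF.

```python
# Sketch of MAF-based strand assignment for AT/GC (palindromic) SNPs.
# All names and the input format are assumptions made for illustration.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def assign_strand(study_freq, ref_freq, maf_cutoff=0.3):
    """Compare study and reference allele frequencies within ONE subpopulation.

    study_freq, ref_freq: dict mapping each of the two alleles to its frequency.
    Returns "same", "flip" or "drop".
    """
    a1, a2 = sorted(study_freq)
    if COMPLEMENT[a1] != a2:
        raise ValueError("not an AT or GC SNP; use ordinary allele matching")
    # Too close to 0.5 in either dataset: the minor allele is not informative.
    if min(study_freq.values()) > maf_cutoff or min(ref_freq.values()) > maf_cutoff:
        return "drop"
    study_minor = min(study_freq, key=study_freq.get)
    ref_minor = min(ref_freq, key=ref_freq.get)
    # Same minor allele label: assume same strand; otherwise assume flipped.
    return "same" if study_minor == ref_minor else "flip"


def keep_snp(per_population, maf_cutoff=0.3):
    """per_population: dict mapping population code -> (study_freq, ref_freq).

    The SNP is kept only if no subpopulation triggers a "drop".
    """
    calls = {
        pop: assign_strand(study, ref, maf_cutoff)
        for pop, (study, ref) in per_population.items()
    }
    return "drop" not in calls.values(), calls


# The rs1234 example from above: A = 0.800, T = 0.200 in the study.
study = {"A": 0.800, "T": 0.200}
print(assign_strand(study, {"A": 0.800, "T": 0.200}))  # same
print(assign_strand(study, {"A": 0.200, "T": 0.800}))  # flip
print(assign_strand(study, {"A": 0.600, "T": 0.400}))  # drop
```

Note that the sketch simply trusts the minor allele label whenever both frequencies are below the cutoff, which is exactly the assumption I am asking about.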