Question

What makes assigning A/T and G/C SNPs to a strand unreliable if MAF is not near 0.5?

6.8 years ago

LauferVA 4.2k

Hello,

I am attempting to do some fine-mapping experiments. To increase the chances of capturing the pathogenic variants, I would like to include as many variants as I can, including AT and GC SNPs. These account for 15% of my data, which seems like a lot to drop if it can be retained.

I was reading from the Genotype Harmonizer webpage at https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer#typical-usage-scenarios, and I noticed the following (it is about imputation, but the same logic would apply to my trans-ethnic fine-mapping experiments).

"When imputing genotype data the strand of both the study data to impute and the reference data used for imputation need to be identical. Some imputation tools can swap the strand of non-ambiguous SNPs but this is not possible for AT and GC SNPs. AT and GC can be swapped using minor allele frequency but this is not reliable, especially for variants with a high minor allele frequency. The Genotype Harmonizer solves these problems by using LD structure of nearby variants."

For some of the studies, I only have association summary statistics, including MAF. Thus, I cannot use an LD approach. However, regarding the sentence stating that swapping AT and GC SNPs using minor allele frequency "is not reliable", my question is: provided that the minor allele frequency is not close to 0.5, why is this unreliable? For example, if we have a SNP rs1234 with A: 0.800 and T: 0.200, just how unreliable is it to assign a strand using MAF? More importantly, I would like to know the reasons why it is unreliable.
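As a rough illustration of what "assigning a strand using MAF" means for an A/T SNP, here is a minimal R sketch. The function name `infer_strand`, the reference frequencies, and the `min_gap` threshold are made up for illustration; this is not how Genotype Harmonizer or any imputation tool does it.

```r
# Hypothetical helper: compare the frequency of the allele labelled "A" in the
# study with the frequency of "A" in a reference that is on the + strand.
# If the study is on the opposite strand, its "A" is really the reference "T",
# so its frequency should be close to 1 - ref_freq_a.
infer_strand <- function(study_freq_a, ref_freq_a, min_gap = 0.2) {
  d_same <- abs(study_freq_a - ref_freq_a)        # distance if strands agree
  d_flip <- abs(study_freq_a - (1 - ref_freq_a))  # distance if strands differ
  # min_gap is an arbitrary illustrative threshold, not a recommended value
  if (abs(d_same - d_flip) < min_gap) return("undetermined")
  if (d_same < d_flip) "same strand" else "opposite strand"
}

infer_strand(0.80, 0.78)  # frequencies agree as reported
infer_strand(0.80, 0.21)  # frequencies agree only after swapping A and T
infer_strand(0.52, 0.47)  # both hypotheses fit about equally well
```

When the frequencies are near 0.5, both hypotheses predict almost the same value, which is why the third call cannot decide.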

Edit: I continued reading the Genotype Harmonizer manuscript. At one point, they clarify:

"One solution to the problem of unknown strands is to compare the minor allele between two datasets. However, use of the minor allele is not ideal as it can differ between datasets and populations, especially for common variants."

This makes sense, because the minor allele of many variants, including AT and GC SNPs, can be different in different populations. Thus, a SNP that has MAF not near 0.5 in one subpopulation may still have MAF near 0.5 in another subpopulation. However, in my case, I know the population code of all the individuals in my data and in the reference datasets I used to align the other SNPs.

So, provided that I do the alignment to the + strand on a subpopulation-by-subpopulation basis, am I OK to use MAF to assign strand? To clarify, I would drop any AT or GC SNP with MAF > 0.3 in one or more subpopulations. If I take this route, are there still any systematic (rather than sporadic) issues, or is it alright once those precautions are taken?
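The filter I have in mind could be sketched in R roughly as follows. The SNP IDs, population codes, and MAF values are invented for illustration; only the 0.3 cutoff comes from my description above.

```r
# Hypothetical summary statistics: one row per SNP per subpopulation.
snps <- data.frame(
  snp = c("rs1", "rs1", "rs2", "rs2", "rs3", "rs3"),
  pop = c("EUR", "AFR", "EUR", "AFR", "EUR", "AFR"),
  a1  = c("A", "A", "G", "G", "A", "A"),
  a2  = c("T", "T", "C", "C", "G", "G"),
  maf = c(0.20, 0.45, 0.10, 0.15, 0.48, 0.40),
  stringsAsFactors = FALSE
)

# A/T and G/C SNPs look the same on both strands, so they are the ambiguous ones.
is_ambiguous <- paste0(snps$a1, snps$a2) %in% c("AT", "TA", "GC", "CG")

# Drop an ambiguous SNP everywhere if its MAF exceeds the cutoff in ANY subpopulation.
maf_cutoff <- 0.3
drop_ids <- unique(snps$snp[is_ambiguous & snps$maf > maf_cutoff])
kept <- snps[!snps$snp %in% drop_ids, ]

print(drop_ids)
print(kept)
```

Here rs1 is dropped because its MAF is 0.45 in one subpopulation, rs2 is kept, and rs3 is kept regardless of MAF because A/G is not strand-ambiguous.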

A/T G/C SNP Strand • 3.9k views

I solved one of my questions.

Suppose we have a SNP rs1234 with A: 0.800 and T: 0.200, just how unreliable is it to assign a strand using MAF?

How unreliable it is depends on the sample size. Suppose there are 1000 subjects in the sample, and you want to know the probability that "A" would appear (wrongly) to be the major allele in your study when it is actually the minor allele. Then, in R, you could test the observed allele count against a true frequency of 0.5:

# 1000 diploid subjects carry 2000 alleles, 1600 of which are A (frequency 0.8)
binom.test(1600, 2000, p = 0.5, alternative = "greater")
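To see how strongly this depends on sample size, the same one-sided test can be repeated over several sample sizes. This is only a sketch: it assumes diploid subjects (two alleles each), an observed A frequency of 0.8, and a null A frequency of 0.5, and the sample sizes are arbitrary.

```r
# Hypothetical sample sizes (number of diploid subjects); not from a real study.
n_subjects <- c(10, 50, 100, 1000)
n_alleles  <- 2 * n_subjects          # two alleles per diploid subject
n_A        <- round(0.8 * n_alleles)  # observed copies of allele A at frequency 0.8

# One-sided exact binomial test: probability of seeing at least n_A copies of A
# if the true frequency of A were 0.5.
p_values <- mapply(function(x, n) {
  binom.test(x, n, p = 0.5, alternative = "greater")$p.value
}, n_A, n_alleles)

print(data.frame(n_subjects, n_alleles, n_A, p_values))
```

The p-value shrinks rapidly as the sample grows, so with even a few hundred subjects, sampling error alone is unlikely to make the minor allele look like the major one at these frequencies. It says nothing about differences between datasets or populations.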

However, the bigger issue in my question is that this simple calculation only covers sampling error. I am unsure what other sources of systematic error, such as the ancestry differences described in the edit, could invalidate it.

I would still very much like other people's perspectives on those issues.

6.8 years ago by LauferVA 4.2k