Question

Association analysis with SNP

2.8 years ago

cwwong13 ▴ 40

These are purely statistical/analytical/theoretical methods questions. I am conducting a gene/variant association study between a gene (G) and a disease (D), as well as between SNPs (S) and the disease.

I wonder:

Is it common to see variation in the association beta among SNPs within the same gene?

That is, for a gene (G) there are, let's say, 100 missense/putative loss-of-function (pLOF) variants. The direction of association with the disease is not the same for all of these variants, i.e. some are positively associated with D while others are negatively associated with D.

Is it common for the p-values of these SNPs to vary a lot as well?

As with the first question: some of the SNPs are "significantly" associated (p < 0.05), but in my case only 9 of 299 SNPs are significant. I would like to know whether this is common, given that they all lie in the same gene. If it is, how should we interpret the variation in phenotype when the gene is "mutated"? I know we could argue that some of the variants are gain-of-function while others are loss-of-function. Are there any journal articles I can cite to support this claim?

In my case, all the significant SNPs are positively associated with the disease. Is it legitimate to conclude that having a missense/pLOF mutation in this gene is positively associated with the disease?

I know the above conclusion is quite likely to be flawed. Is there a better way to summarize SNP-level results at the gene level? I can probably claim that a specific SNP is associated with the disease. However, as mentioned above, this does not seem generalizable to the gene level, because many other SNPs are not significantly associated with the disease.

Is there a way to objectively (and legitimately) filter the SNPs to be included in the analysis?

First, I would like to ask whether filtering is necessary at all (in a proper, formal data analysis). Given the heterogeneity mentioned above, I would also like to know how to perform such a filtering step (if it is legitimate). For SNP arrays, I know that we should probably filter out extremely rare variants because of inaccurate variant calling. Does whole-exome/genome sequencing also require such a step?

Thanks in advance!

association WGS GWAS SNP WES

updated 2.8 years ago by Collin ▴ 1000 • written 2.8 years ago by cwwong13 ▴ 40

score 0 · Answer 1 · 2021-07-18

Most of your questions revolve around an entire scientific field: variant interpretation. Using a monolithic term such as "mutated gene" obscures the inherent complexity of variant interpretation. The impact of a specific variant on protein function lies on a spectrum. You might want to read some of the saturation mutagenesis papers, which functionally assess all possible variants within a protein (e.g. PMIDs: 30209399 and 29706350). That being said, loss-of-function variants across a protein are generally considerably more common than gain-of-function variants.

Q: Is it common to see variation in the association beta among SNPs within the same gene?

A: I assume you are talking about protein-coding variants. While most pLOF variants may have a similarly large impact on protein function (with caveats related to being near the C-terminus, etc.), missense variants can have profoundly different impacts, ranging from virtually no impact to highly pathogenic. If the missense variant is truly causal, then the beta should reflect a combination of two factors: the degree of functional impact on the protein and the relevance of the protein for the disease. I've previously noted for somatic mutations in cancer that the frequency of particular oncogenic variants in PTEN is directly related to the extent to which those variants reduce phosphatase activity (i.e. functional impact, PMID: 31202631, Fig 5). Thus, I would find it odder if all the variants had the same beta.

Q: Is it common for the p-values of these SNPs to vary a lot as well?

A: P-values are affected by the allele frequency of the SNP you are looking at, which in turn can reflect chance historical/demographic effects. Imagine you study SNP A in population X, where its allele frequency is 0.02; you may get a very low p-value, e.g. p=1e-20. In a second population Y, the allele frequency may be only 0.00001, which could result in the same SNP A not being significant. Differences in allele frequencies between populations are, in part, why people do GWAS, etc. in diverse populations.
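To make the allele-frequency point concrete, here is a rough back-of-the-envelope sketch. It is not a full power analysis: it assumes a standardized quantitative trait, Hardy-Weinberg equilibrium, a small additive effect, and the same sample size in both populations. The effect size `BETA` and sample size `N` are made-up values chosen only for illustration; the two allele frequencies are the ones from the example above.

```python
import math

def expected_two_sided_p(beta, maf, n):
    """Approximate p-value at the expected z-score for an additive SNP effect.

    Assumes a standardized quantitative trait, Hardy-Weinberg equilibrium,
    and a small effect, so that Var(genotype) = 2 * maf * (1 - maf) and
    z ~ |beta| * sqrt(n * Var(genotype)).
    """
    genotype_variance = 2.0 * maf * (1.0 - maf)
    z = abs(beta) * math.sqrt(n * genotype_variance)
    # Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

BETA = 0.15   # hypothetical per-allele effect, in trait SD units
N = 100_000   # hypothetical sample size, identical in both populations

p_common = expected_two_sided_p(BETA, maf=0.02, n=N)     # population X
p_rare = expected_two_sided_p(BETA, maf=0.00001, n=N)    # population Y
print(f"MAF 0.02:    p ~ {p_common:.1e}")
print(f"MAF 0.00001: p ~ {p_rare:.2f}")
```

The true effect is identical in both calls; only the allele frequency changes, and with it the amount of information the sample carries about that effect.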

Q: Is there a way to objectively (and legitimately) filter the SNPs to be included in the analysis?

A: Yes, the most common way is to use machine learning (ML) models trained to predict either "pathogenic" or "functionally damaging" variants. Conceptually, one could filter by thresholding the score from such ML methods, or use the score directly as a continuous variable that weights each variant by its likelihood of functional impact. If you are concerned about "objectivity", then you should look at benchmarking studies for these methods and use them appropriately (e.g. Cancer: PMID 32079540).
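As a minimal sketch of the two options (hard threshold vs. continuous weight), assuming each variant already has a predictor score in [0, 1] where higher means more likely damaging. The variant IDs, scores, and cutoff of 0.5 are hypothetical, and summing weighted allele counts per sample is just one simple way to put the weights to use, not something prescribed above.

```python
def variant_weights(scores, threshold=None):
    """Turn per-variant predictor scores into analysis weights.

    threshold given -> hard filter: weight 1.0 if score >= threshold, else 0.0.
    threshold None  -> soft weighting: the score itself is the weight.
    """
    if threshold is None:
        return {v: float(s) for v, s in scores.items()}
    return {v: 1.0 if s >= threshold else 0.0 for v, s in scores.items()}

def weighted_allele_count(genotype, weights):
    """Sum of alternate-allele counts (0, 1, or 2) times variant weights for one sample."""
    return sum(count * weights[v] for v, count in genotype.items())

# Hypothetical predictor scores for three variants in the gene.
scores = {"var1": 0.95, "var2": 0.10, "var3": 0.60}
# Hypothetical sample: alternate-allele count at each variant.
sample = {"var1": 1, "var2": 2, "var3": 0}

hard = weighted_allele_count(sample, variant_weights(scores, threshold=0.5))
soft = weighted_allele_count(sample, variant_weights(scores))
print(hard, soft)
```

With the hard threshold, the low-scoring `var2` drops out entirely; with soft weighting it still contributes, but only in proportion to its score. Where to put the cutoff is exactly what the benchmarking studies help decide.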