Question

SNP dataset and Z Score

6.0 years ago

nkhan.mscs15seecs ▴ 80

I know that a SNP is a change at a single position in a genetic sequence, such as A to G or C to T, as studied in GWAS. My basic question is how this type of data is represented. I have a SNP dataset here but am having a hard time understanding what it is. I have also seen the VCF file format, and it contains a lot of information such as LD, MAF, etc.

According to my understanding, it should be discrete data. Also, how do we calculate the Z-score of such discrete data? I have seen a lot of papers filter SNPs based on their low Z values.

My Understanding

One can obviously make a 2x3 contingency table where the rows represent cases and controls and the columns represent genotypes such as AA, Aa and aa, count the number in each cell from the given data, then apply a chi-square test and calculate a p-value. But how would a Z-score be calculated for such data?

So I have two issues: understanding datasets related to SNPs, and how Z-scores are calculated.

SNP z-score gwas • 13k views

updated 3.6 years ago by zx8754 • written 6.0 years ago by nkhan.mscs15seecs

score 28 · Accepted Answer · 2018-04-20

Your questions are not a nuisance, so, do not feel bad for asking.

In association studies, the usual focus at each SNP position is the minor allele, i.e., the SNP allele that has the lower frequency in the samples being studied in your dataset - I am assuming that you know this? At some genotyped sites, the minor allele may have a frequency (i.e. minor allele frequency - MAF) of 49% compared to 51% for the major allele, which is less interesting because, with a frequency of 49%, it is seen as a 'common' variant. At others, however, the minor allele may have a MAF of just 1%, which classes it as a 'very rare' variant (MAF 5% is usually the cut-off for rare / non-rare). It is important to note, however, that both common and rare variants can be functional and have roles in disease. For further reading, see: Rare and common variants: twenty arguments.
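As a rough illustration of how a MAF is obtained, here is a minimal sketch in R. The genotype counts are made up for the example (they are not from any real dataset), and the variable names are only illustrative:

```r
# Hypothetical genotype counts at one biallelic SNP (made-up numbers)
genotype.counts <- c(AA = 60, Aa = 35, aa = 5)

# Every individual carries two alleles
n.alleles <- 2 * sum(genotype.counts)

# Homozygotes contribute two copies of an allele, heterozygotes one
count.A <- 2 * genotype.counts[["AA"]] + genotype.counts[["Aa"]]
count.a <- 2 * genotype.counts[["aa"]] + genotype.counts[["Aa"]]

allele.freq <- c(A = count.A, a = count.a) / n.alleles

# The minor allele is simply the one with the lower frequency
maf <- min(allele.freq)
minor.allele <- names(which.min(allele.freq))
```

With these counts, allele 'a' is the minor allele with a MAF of 0.225, so it would be treated as a common variant under the 5% cut-off mentioned above.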

In any case, if we just take the most basic type of association test and tabulate the number of minor and major alleles in our cases and controls, we can get an example 2x2 contingency table like this:

contingency.table
                  Cases  Controls
    Minor allele  27     6
    Major allele  73     94
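For anyone following along, the table above can be entered in R as a matrix. This is just one way of doing it; only the object name contingency.table comes from the example, the rest is a sketch:

```r
# Allele counts from the example: rows are alleles, columns are groups
contingency.table <- matrix(
  c(27,  6,
    73, 94),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Minor allele", "Major allele"),
                  c("Cases", "Controls"))
)

# Print the table to check the layout
contingency.table
```

Note that these are allele counts, so each individual contributes two observations (100 alleles per group corresponds to 50 individuals per group).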

You can see that the minor allele is more frequent in the cases for this particular SNP. We can easily derive a 1 degree of freedom Chi-square p-value for this in R:

chisq.test(contingency.table)

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency.table
X-squared = 14.516, df = 1, p-value = 0.0001389

This is nowhere near genome-wide significance (conventionally p < 5 x 10^-8), but it is only a 100-sample dataset used as an example.
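To see where the X-squared value comes from, the following sketch reproduces it by hand from the observed and expected counts, assuming the same 2x2 table as above. The 0.5 subtracted from each difference is the Yates continuity correction that chisq.test() applies to 2x2 tables by default:

```r
contingency.table <- matrix(
  c(27, 6, 73, 94), nrow = 2, byrow = TRUE,
  dimnames = list(c("Minor allele", "Major allele"), c("Cases", "Controls"))
)

# Expected counts under no association: (row total * column total) / grand total
expected <- outer(rowSums(contingency.table), colSums(contingency.table)) /
  sum(contingency.table)

# Pearson statistic with Yates' continuity correction
chisq.yates <- sum((abs(contingency.table - expected) - 0.5)^2 / expected)

# Upper-tail p-value on 1 degree of freedom
p.value <- pchisq(chisq.yates, df = 1, lower.tail = FALSE)
```

Here chisq.yates and p.value should agree with the chisq.test() output shown above.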

We can then derive an odds ratio (OR) for the minor allele:

(27/6) / (73/94)
[1] 5.794521

Standard error of the log OR (the standard error is calculated on the log scale, not for the OR itself):

sqrt((1/27) + (1/6) + (1/73) + (1/94))
[1] 0.477536

Upper 95% confidence interval (CI) of the OR:

5.794521 * exp(1.96 * 0.477536)
[1] 14.77421

Lower 95% CI of the OR:

5.794521 * exp(- 1.96 * 0.477536)
[1] 2.27264

With all of this useful information, we can then also calculate the Z-score. The Z-score is the log of the OR (log.OR) divided by the standard error of log.OR (SE.log.OR). SE.log.OR can be recovered from the OR and the lower CI of the OR, which is handy when a paper reports only those; here it gives back the same 0.477536 that was calculated above:

log.OR <- log(5.794521)
lower95.log.OR <- log(2.27264)
SE.log.OR <- (log.OR - lower95.log.OR) / 1.96

Then calculate Z:

log.OR / SE.log.OR
[1] 3.679121
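The individual steps above can be collected into one small function. This is only a sketch: or.summary is a hypothetical helper name, it assumes a 2x2 table with alleles in rows and cases/controls in columns, and it uses qnorm() for the critical value rather than the rounded 1.96, so the CI limits differ from the ones above in the later decimal places. The two-sided p-value from the Z-score is added for convenience:

```r
contingency.table <- matrix(
  c(27, 6, 73, 94), nrow = 2, byrow = TRUE,
  dimnames = list(c("Minor allele", "Major allele"), c("Cases", "Controls"))
)

# Hypothetical helper: OR, SE of log(OR), CI, Wald Z and two-sided p-value
or.summary <- function(tab, conf.level = 0.95) {
  log.OR <- log((tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2]))
  # SE of log(OR) is the square root of the sum of reciprocal cell counts
  SE.log.OR <- sqrt(sum(1 / tab))
  # Critical value of the standard normal (about 1.96 for 95%)
  crit <- qnorm(1 - (1 - conf.level) / 2)
  Z <- log.OR / SE.log.OR
  c(OR        = exp(log.OR),
    SE.log.OR = SE.log.OR,
    lower     = exp(log.OR - crit * SE.log.OR),
    upper     = exp(log.OR + crit * SE.log.OR),
    Z         = Z,
    p         = 2 * pnorm(-abs(Z)))
}

res <- or.summary(contingency.table)
res
```

The p-value here comes from a Wald test on the log OR, so it will not be identical to the Yates-corrected Chi-square p-value, although both lead to the same conclusion for this table.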

----------------------------------------------------------------

Another way to calculate p-values, ORs, and Z-scores in association studies is through logistic regression analysis. In regression, one can encode the genotypes as categorical variables or, usually, numerical variables in 'additive' models. In these cases, each genotype is coded by the number of copies of the minor allele that it carries:

homozygous minor allele = 2
heterozygous = 1
homozygous major allele = 0
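This coding also answers the question of how SNP data end up as numbers. A minimal sketch in R, using made-up genotype calls and assuming that 'G' is the minor allele at this SNP:

```r
# Hypothetical genotype calls at one SNP; "G" is assumed to be the minor allele
genotypes <- c("AA", "AG", "GG", "GA", "AA")
minor.allele <- "G"

# Additive coding: count the copies of the minor allele in each genotype
dosage <- vapply(
  strsplit(genotypes, ""),
  function(alleles) sum(alleles == minor.allele),
  integer(1)
)
dosage
```

The resulting 0/1/2 vector is the discrete variable that goes into the regression model as a predictor.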

One can also adjust for covariates in these models, such as smoking status, BMI, ethnicity and/or PCA eigenvectors, etc. From regression, the OR is the exponential of the coefficient estimate for the SNP, and the Z-score (if not explicitly given) is the estimate divided by its standard error, in the same way as above. I built a pipeline for a complex type of family trio analysis using these types of metrics and conditional logistic regression (where cases and controls are matched into strata): GwasTriosCLogit
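A minimal sketch of the regression approach with base R's glm() is given below. The data are simulated purely for illustration: the sample size, allele frequency, effect sizes and the 'age' covariate are all assumptions, not values from any real study.

```r
set.seed(1)
n <- 500

# Simulated additive genotype: 0/1/2 copies of the minor allele (MAF assumed 0.3)
genotype <- rbinom(n, size = 2, prob = 0.3)

# Hypothetical covariate to adjust for
age <- rnorm(n, mean = 50, sd = 10)

# Simulated case status with an assumed per-allele log OR of 0.5
case <- rbinom(n, size = 1,
               prob = plogis(-1 + 0.5 * genotype + 0.01 * (age - 50)))

# Logistic regression under the additive model, adjusted for age
fit <- glm(case ~ genotype + age, family = binomial)
coefs <- summary(fit)$coefficients

# OR is the exponential of the estimate; Z is estimate / standard error
OR <- exp(coefs["genotype", "Estimate"])
Z  <- coefs["genotype", "Estimate"] / coefs["genotype", "Std. Error"]
```

The Z calculated here is the same number that summary(fit) reports in its 'z value' column for genotype.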

---------------------------------------------

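If you are wondering from where I magically got 1.96 and used it in the calculations: it is the critical value of the standard normal distribution for a two-sided 95% confidence interval. For more, look HERE.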
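The 1.96 can also be checked directly in R; it is the (rounded) 97.5th percentile of the standard normal distribution, which leaves 2.5% in each tail:

```r
# 97.5th percentile of the standard normal distribution
z.crit <- qnorm(0.975)

# Probability mass between -1.96 and +1.96
coverage <- pnorm(1.96) - pnorm(-1.96)

round(z.crit, 2)
round(coverage, 2)
```

For a different confidence level, e.g. 99%, one would use qnorm(0.995) in place of 1.96.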

This example is just to give you a fundamental understanding of what is going on 'behind the scenes' in association studies. Obviously there are many dozens of types of analyses that involve different statistical tests, and programs like PLINK, etc, are undoubtedly doing further adjustments to the data than I have shown here.

Kevin