Question: SNP dataset and Z Score
nkhan.mscs15seecs wrote (2.4 years ago):

My question is very basic, and I beg your pardon: I am new to this field, so this question may sound stupid to the pros.

I know that a SNP is a change at a single position in a genetic sequence, such as A to G or C to T, as studied in GWAS. My basic question is how this type of data is represented. I have a SNP dataset here but am having a hard time understanding what it is. I have also seen the VCF file format, which contains a lot of information such as LD, MAF, etc.

According to my understanding, it should be discrete data. How do we calculate a Z score for such discrete data? I have seen many papers filter SNPs based on their low Z values.

My Understanding

One can obviously make a 2x3 contingency table where the rows represent cases and controls and the columns represent the genotypes AA, Aa and aa, count the number in each cell from the given data, then apply a chi-square test and calculate a p-value. But how would a Z score be calculated for such data?

So I have two issues: understanding SNP datasets, and how Z scores are calculated.


He who asks a question is a fool for five minutes; he who does not ask a question remains a fool forever. ~ unknown

Appreciating your courage to ask! There are people who don't ask, and if they do, they do so anonymously!

written 2.4 years ago by lakhujanivijay

@Vijay Lakhujani thanks sir

written 2.4 years ago by nkhan.mscs15seecs

"My question is very basic, and I beg your pardon: I am new to this field, so this question may sound stupid to the pros."

No need for this, none of us was born with bioinformatics skills :-)

written 2.4 years ago by WouterDeCoster

I don't think that applies to Pierre. @ Wouter

written 2.4 years ago by cpad0112

His first word probably was 'awk'.

written 2.4 years ago by WouterDeCoster

Kevin Blighe wrote (2.4 years ago):

Your questions are not a nuisance, so, do not feel bad for asking.

In association studies, the usual focus at each SNP position is the minor allele, i.e., the allele with the lower frequency among the samples in your dataset (I am assuming that you know this). At some genotyped sites, the minor allele may have a frequency (minor allele frequency, MAF) of 49% compared to 51% for the major allele, which is less interesting because, with a frequency of 49%, it is seen as a 'common' variant. At others, however, the minor allele may have a MAF of just 1%, which classes it as a 'very rare' variant (a MAF of 5% is often used as the cut-off between rare and non-rare, although conventions vary). It is important to note, however, that both common and rare variants can be functional and have roles in disease. For further reading, see: Rare and common variants: twenty arguments.
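As a rough illustration of how a MAF is obtained, here is a minimal sketch in R. The genotype counts are hypothetical and chosen only for the example; it assumes a biallelic SNP with alleles labelled A and a.

```r
# Hypothetical genotype counts at one biallelic SNP (illustrative, not real data)
genotype.counts <- c(AA = 72, Aa = 25, aa = 3)

# Each individual carries two alleles
n.alleles <- 2 * sum(genotype.counts)

# Homozygotes contribute two copies of an allele, heterozygotes one
count.A <- 2 * genotype.counts[["AA"]] + genotype.counts[["Aa"]]
count.a <- 2 * genotype.counts[["aa"]] + genotype.counts[["Aa"]]

allele.freq <- c(A = count.A, a = count.a) / n.alleles

# The minor allele is the less frequent one; its frequency is the MAF
maf <- min(allele.freq)
minor.allele <- names(which.min(allele.freq))
```

With these made-up counts, allele a is the minor allele and would sit above a 5% cut-off.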

In any case, if we just take the most basic type of association test and tabulate the number of minor and major alleles in our cases and controls, we can get an example 2x2 contingency table like this:

                  Cases  Controls
    Minor allele  27     6
    Major allele  73     94

You can see that the minor allele is more frequent in the cases for this particular SNP. We can easily derive a 1-degree-of-freedom Chi-square p-value for this in R:


    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency.table
X-squared = 14.516, df = 1, p-value = 0.0001389
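The output above can be reproduced with something like the following sketch. The object name contingency.table is taken from the output; the way the matrix is built here is an assumption, since any construction giving the same 2x2 counts would do.

```r
# Allele counts from the 2x2 table; rows = allele, columns = group
contingency.table <- matrix(
  c(27, 73,   # cases: minor, major
    6, 94),   # controls: minor, major
  nrow = 2,
  dimnames = list(
    Allele = c("Minor", "Major"),
    Group  = c("Cases", "Controls")
  )
)

# For a 2x2 table, chisq.test() applies Yates' continuity correction by default
result <- chisq.test(contingency.table)
result$statistic  # X-squared
result$p.value
```

Passing correct = FALSE to chisq.test() would give the uncorrected Pearson statistic instead.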

This is nowhere near genome-wide significance (conventionally p < 5e-8), but it is only a 100-sample dataset used as an example.

We can then derive an odds ratio (OR) for the minor allele:

(27/6) / (73/94)
[1] 5.794521

Standard error of the log OR:

sqrt((1/27) + (1/6) + (1/73) + (1/94))
[1] 0.477536

Upper bound of the 95% confidence interval (CI) of the OR:

5.794521 * exp(1.96 * 0.477536)
[1] 14.77421

Lower bound of the 95% CI of the OR:

5.794521 * exp(- 1.96 * 0.477536)
[1] 2.27264

With all of this useful information, we can then also calculate the Z-score. The Z-score is the log of the OR (log.OR) divided by the standard error of log.OR (SE.log.OR). When only the OR and its confidence interval are available, SE.log.OR can be back-calculated from the OR and the lower bound of its CI; here this simply recovers the 0.477536 computed above:

log.OR <- log(5.794521)
lower95.log.OR <- log(2.27264)
SE.log.OR <- (log.OR - lower95.log.OR) / 1.96

Then calculate Z:

log.OR / SE.log.OR
[1] 3.679121
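The individual steps above can be collected into one small helper. This is only a sketch: the function name wald.stats and its argument names are made up for illustration, and the two-sided p-value from the normal distribution is an addition that the steps above do not compute.

```r
# Wald statistics for a 2x2 table of allele counts (hypothetical helper)
wald.stats <- function(minor.cases, minor.controls,
                       major.cases, major.controls,
                       conf.level = 0.95) {
  log.OR    <- log((minor.cases / minor.controls) / (major.cases / major.controls))
  SE.log.OR <- sqrt(1 / minor.cases + 1 / minor.controls +
                    1 / major.cases + 1 / major.controls)
  crit      <- qnorm(1 - (1 - conf.level) / 2)  # about 1.96 for 95%
  Z         <- log.OR / SE.log.OR
  c(
    OR      = exp(log.OR),
    lower   = exp(log.OR - crit * SE.log.OR),
    upper   = exp(log.OR + crit * SE.log.OR),
    Z       = Z,
    p.value = 2 * pnorm(-abs(Z))  # two-sided p-value under the normal distribution
  )
}

stats <- wald.stats(27, 6, 73, 94)
stats
```

Because it uses the exact qnorm() value rather than the rounded 1.96, the confidence bounds differ from the hand calculation in the later decimal places. The Wald Z here is also not the square root of the Yates-corrected chi-square statistic; the two tests are related but not identical.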


Another way to calculate p-values, ORs, and Z-scores in association studies is through logistic regression analysis. In regression, one can encode the genotypes as categorical variables or, usually, numerical variables in 'additive' models. In these cases, one has the following:

  • homozygous for the minor allele = 2
  • heterozygous = 1
  • homozygous for the major allele = 0

One can also adjust for covariates in these models, such as smoking status, BMI, ethnicity and/or PCA eigenvectors, etc. From regression, the OR is the exponential of the coefficient estimate, and the Z-score (if not explicitly given) is the estimate divided by its standard error, as above. I built a pipeline for a complex type of trios family analysis using these types of metrics and conditional logistic regression (where cases and controls are matched into strata): GwasTriosCLogit
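A minimal sketch of the additive logistic regression in base R follows. The individual-level genotypes are hypothetical: they are made up so that the allele totals match the 2x2 table (27 and 6 minor alleles in 50 cases and 50 controls), and the variable names are assumptions for illustration.

```r
# Hypothetical individual-level data: 50 cases and 50 controls.
# Genotype = number of minor alleles carried (0, 1 or 2).
genotype <- c(
  rep(c(2, 1, 0), times = c(5, 17, 28)),  # cases: 2*5 + 17 = 27 minor alleles
  rep(c(2, 1, 0), times = c(0, 6, 44))    # controls: 6 minor alleles
)
status <- rep(c(1, 0), each = 50)          # 1 = case, 0 = control

# Additive model: genotype enters as a numeric allele count
fit   <- glm(status ~ genotype, family = binomial)
coefs <- summary(fit)$coefficients

estimate <- coefs["genotype", "Estimate"]
SE       <- coefs["genotype", "Std. Error"]

OR <- exp(estimate)   # per-allele odds ratio
Z  <- estimate / SE   # same value as coefs["genotype", "z value"]
```

Covariates would be added as further terms on the right-hand side of the formula, e.g. status ~ genotype + bmi, where bmi is a hypothetical column. The estimates will not match the allele-count calculation exactly, because this model works on individuals rather than alleles.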


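If you are wondering where I magically got the 1.96 used in the calculations: it is the critical value of the standard normal distribution for a two-sided 95% confidence interval, i.e., its 97.5th percentile.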
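The 1.96 can be recovered in R from the standard normal quantile function; the variable names below are only illustrative.

```r
# A two-sided 95% interval leaves 2.5% in each tail,
# so the critical value is the 97.5th percentile of the standard normal
z.crit <- qnorm(0.975)
round(z.crit, 2)  # 1.96

# General form for any confidence level
conf.level <- 0.95
qnorm(1 - (1 - conf.level) / 2)
```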

This example is to just give you a fundamental understanding of what is going on 'behind the scenes' in association studies. Obviously there are many dozens of types of analyses that involve different statistical tests, and programs like PLINK, etc, are undoubtedly doing further adjustments to the data than I have shown here.



We used to use <0.01% (not more than 1 in 10,000 alleles) as a rare variant cutoff. IMO 1% is common, as you're seeing the variant in 1 in 100 alleles, which could be as low as 1 in 50 people.

written 2.4 years ago by RamRS