Question: SNP dataset Understanding and Usage
2.3 years ago by
nkhan.mscs15seecs60 wrote:

I have just started learning about SNPs and related information. When I searched for a particular SNP, rs80334247, I got a lot of information, such as:

  1. Chromosome number: 3
  2. Minor allele count: T=0.0038/19 (1000 Genomes)
  3. Gene ID: SCN5A (6331)
  4. Major and minor allele counts in different populations around the world

and a lot of other information.

My question is: where can I find information such as homozygous and heterozygous genotypes and dominant and recessive alleles, and how can this information be downloaded for all populations?

For example, if I want to test a particular SNP by calculating a chi-square statistic and P-value, then I need to make a contingency table like the following:

                    AA         Aa             aa


On the other hand, if I want to calculate p-values using logistic regression, where the predictors are SNPs and the response is 1 or 0 for subject and control, what information is needed for each SNP? For example, what type of values would SNP1 have?

My Understanding

Take Y as the response variable, which takes 0 or 1 for control and subject, and let's say I have 3 SNPs (SNP1, SNP2, and SNP3) on 10 subjects:

Y       SNP1      SNP2    SNP3
1       ?          ?        ?

My confusion is about what the corresponding values in the SNP columns should be, since a single SNP has a lot of information attached to it, such as the MAF, the major allele count, and the minor allele count. Or can these SNPs be encoded as 0 and 1? For example, if our reference allele is A (by the way, how do I know which allele is the reference?), then each subject either has that major allele or not, and we can encode this as 0 or 1.

So I have a number of points of confusion about SNP datasets and their usage. If somebody could explain this with a small example SNP dataset, it would relieve me of much pain in understanding SNP datasets and how to use them.

snp dataset • 1.5k views
modified 2.3 years ago by Kevin Blighe63k • written 2.3 years ago by nkhan.mscs15seecs60
2.3 years ago by
Kevin Blighe63k wrote:

Thank you - based on this and your other question (SNP dataset and Z Score), you are evidently learning about GWAS.

The information for each SNP is accumulated from numerous large-scale projects, such as the 1000 Genomes Project.

When the next large-scale sequencing projects are complete, their data may also make its way to the information page for each SNP. This makes it an invaluable resource for researchers now and into the future.


In your previous question, you were more interested in the simple concept of association, and then I also briefly touched on regression. As mentioned, for regression analysis, one can encode AA, Aa, and aa as categorical variables, or numerically as follows:

  • homozygous minor allele (aa) = 2
  • heterozygous minor allele (Aa) = 1
  • homozygous major allele (AA) = 0
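As a minimal sketch of this numeric encoding, assuming genotypes arrive as two-letter strings such as "AA" or "AT", one could count copies of the minor allele per sample. The helper name `encode_additive` and the example calls are hypothetical; the alleles A (major) and T (minor) are those of rs80334247 as discussed in this thread.

```r
# Hypothetical helper: count copies of the minor allele in each genotype string.
# 'genotypes' holds two-letter calls such as "AA", "AT", "TT"; 'minor' is the minor allele.
encode_additive <- function(genotypes, minor) {
  alleles <- strsplit(genotypes, "")                      # "AT" -> c("A", "T")
  vapply(alleles, function(a) sum(a == minor), integer(1))
}

# Made-up genotype calls for five samples (major allele A, minor allele T)
calls <- c("AA", "AT", "TA", "TT", "AA")
encoded <- encode_additive(calls, minor = "T")
print(encoded)
```

Heterozygotes are counted as 1 regardless of allele order, so "AT" and "TA" receive the same code.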

The MAF is not used in this encoding. It is used for pre- or post-filtering of SNPs from the dataset. For example, prior to regression, we may filter out all SNPs with MAF > 1%, or, after regression, we may stipulate that a SNP has to have P < 10e-6 and also MAF < 1%.
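To illustrate how such a filter might be applied, here is a rough sketch that estimates the frequency of the counted allele from a 0/1/2 genotype matrix and keeps the SNPs inside a chosen frequency window. The toy matrix, the helper name `filter_by_maf`, and the 0.1 cutoff are all assumptions for illustration (ten samples cannot resolve a 1% threshold); the cutoff and its direction depend on the study design.

```r
# Toy genotype matrix: rows are samples, columns are SNPs,
# values are minor-allele counts (0/1/2). Made-up data.
geno <- cbind(
  SNP1 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  SNP2 = c(2, 1, 1, 0, 0, 1, 0, 1, 0, 1),
  SNP3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Frequency of the counted allele = allele copies / (2 * number of genotyped samples)
allele_freq <- colSums(geno, na.rm = TRUE) / (2 * colSums(!is.na(geno)))

# Hypothetical helper: keep SNPs whose frequency lies inside [min_maf, max_maf]
filter_by_maf <- function(geno, freq, min_maf = 0, max_maf = 0.5) {
  geno[, freq >= min_maf & freq <= max_maf, drop = FALSE]
}

common <- filter_by_maf(geno, allele_freq, min_maf = 0.1)  # drop rare SNPs
rare   <- filter_by_maf(geno, allele_freq, max_maf = 0.1)  # drop common SNPs
print(allele_freq)
print(colnames(common))
print(colnames(rare))
```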

The data is more like this:

             CaseControl  SNP1   SNP2   ...   SNPX
    Sample1  1            2      2      ...   1
    Sample2  1            2      2      ...   2
    Sample3  0            0      0      ...   1
    ...      ...          ...    ...    ...   ...
    SampleX  1            0      0      ...   1

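# One model per SNP; family=binomial is needed for logistic regression (the glm default is gaussian).
# 'covariates' is a placeholder for whatever adjustment variables are in MyData.
glm(CaseControl ~ SNP1 + covariates, data=MyData, family=binomial(link='logit'))
glm(CaseControl ~ SNP2 + covariates, data=MyData, family=binomial(link='logit'))
glm(CaseControl ~ SNPX + covariates, data=MyData, family=binomial(link='logit'))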
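As a rough end-to-end sketch of the per-SNP models, the following fits one logistic regression per SNP on simulated data and collects the P-value for each SNP term. Everything here is an assumption for illustration: the data are random (so no real association is expected), `Age` is a hypothetical covariate, and the loop is just one way to organise the fits.

```r
set.seed(1)                                  # simulated data, for illustration only
n <- 200
MyData <- data.frame(
  CaseControl = rbinom(n, 1, 0.5),           # 1 = case, 0 = control
  SNP1 = rbinom(n, 2, 0.3),                  # minor-allele counts 0/1/2
  SNP2 = rbinom(n, 2, 0.2),
  Age  = round(rnorm(n, 50, 10))             # hypothetical covariate
)

snps <- c("SNP1", "SNP2")
pvals <- sapply(snps, function(snp) {
  # Build CaseControl ~ <snp> + Age and fit a logistic regression
  f <- reformulate(c(snp, "Age"), response = "CaseControl")
  fit <- glm(f, data = MyData, family = binomial(link = "logit"))
  coef(summary(fit))[snp, "Pr(>|z|)"]        # Wald P-value for the SNP term
})
print(pvals)
```

With the 0/1/2 coding, the SNP coefficient is the change in log-odds of being a case per additional copy of the minor allele.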

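For the simpler association test raised in the question, a genotype-by-status contingency table can be tested with a chi-square test. This is only a sketch: the counts below are made up, and the 2 x 3 layout (cases and controls against AA, Aa, aa) is one assumed way to fill in the table.

```r
# Made-up genotype counts, for illustration only
counts <- matrix(
  c(60, 30, 10,     # cases:    AA, Aa, aa
    75, 20,  5),    # controls: AA, Aa, aa
  nrow = 2, byrow = TRUE,
  dimnames = list(Status = c("Case", "Control"),
                  Genotype = c("AA", "Aa", "aa"))
)

test <- chisq.test(counts)   # Pearson chi-square test on the 2 x 3 table
print(counts)
print(test$statistic)        # chi-square statistic
print(test$parameter)        # degrees of freedom
print(test$p.value)
```

A 2 x 3 table gives 2 degrees of freedom, since the genotype classes are treated as unordered categories rather than as allele counts.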


@Kevin Blighe Thank you very much indeed for your valuable answers. Could you also point me to a specific link for small-scale SNP data online? I want to see information like AA, Aa, and aa for a particular SNP such as rs80334247.


modified 2.3 years ago • written 2.3 years ago by nkhan.mscs15seecs60

No problem. The minor allele for each SNP is listed in dbSNP. For example, for the entry that you mentioned, rs80334247, I can see that the minor allele is T, with a MAF of 0.0038 (0.38%). The 'ancestral allele', i.e., the one that has become fixed in the human genome, is A.

written 2.3 years ago by Kevin Blighe63k

@Kevin Blighe Where can I find the heterozygous and homozygous information? That is what I am not finding in these datasets.

modified 2.3 years ago • written 2.3 years ago by nkhan.mscs15seecs60

If you only have a few SNPs, then you can search for them at the 1000 Genomes website (A Deep Catalog of Human Genetic Variation). For example, here is the information for your SNP - this shows the major and minor allele in each population group:




If you have a large number of SNPs, you will require a different program that can automate the process. Let me know.

modified 2.3 years ago • written 2.3 years ago by Kevin Blighe63k