Mapping SNPs directly to a particular disease or pathway seems trivial, but in practice it is a tough task. GWAS have associated many SNPs with different phenotypes, and a good number of these SNPs do not lie within genes or other annotated genomic elements, but that doesn't mean these SNPs have no role in the pathway or disease responsible for the phenotype. The study on the 9p21 locus is an excellent example. A list of SNPs associated with diseases/traits via GWAS is maintained here.
A SNP in a non-coding region may well affect neighboring genes, but ID mapping usually misses this. I think a direct mapping of IDs will not give you accurate results for all SNPs. If the SNP lies within the coding segment of a gene, a direct mapping makes sense; otherwise it may not give you exact results, but it could still be an excellent starting point.
This is not an easy question because it calls to mind a lot of different ways to consider SNPs. For me, simply mapping the SNP to the gene in which it resides, or to a nearby gene, can be misguided. Take for example the variants linked to lactase persistence in Whites and some Africans. These variants are 10 to 11 kbp upstream of the pertinent gene LCT (lactase), but actually map within MCM6 (minichromosome maintenance complex component 6). As an aside which pertains to my line of work: this is important stuff when drawing up dietary recommendations.
Gene ontology terms for LCT are:
Molecular Function: cation binding, glycosylceramidase activity, lactase activity, transferase activity
Biological Process: carbohydrate metabolic process, response to drug, response to estrogen stimulus, response to ethanol, response to hormone stimulus, response to hypoxia, response to lead ion, response to nickel ion, response to nutrient, response to starvation, response to sucrose stimulus
Cellular Component: apical plasma membrane, brush border, integral to plasma membrane, membrane fraction, plasma membrane
While the GO terms for MCM6 clearly indicate a different function of the encoded protein:
Molecular Function: ATP binding, DNA binding, DNA helicase activity, identical protein binding, nucleotide binding, protein binding, single-stranded DNA binding
Biological Process: DNA replication, DNA unwinding involved in replication, DNA-dependent DNA replication initiation, cell cycle, regulation of transcription
Cellular Component: nucleoplasm, nucleus
OK, we know from a lot of other evidence that the SNPs conferring lactase persistence would "map" or be assigned to a lactase pathway. But where to assign other SNPs? Khader is right, mapping to disease pathways based on GWAS results is one option, but one may want more detail or assignment to a different kind of pathway, e.g., biochemical, physiological, etc. In essence, this comes down to allele-specific pathways and pathway fluxes (different alleles of one SNP may alter transit through that node in the pathway by a mere 10-25%, and that could be significant over the years it takes to see the phenotypic effects of a disease). Few such pathways or pathway fragments exist. It also brings up cell-type- or organ-specific pathways. In this regard, I may be able to call up from KEGG, Reactome or other sources a list of inflammation genes, which would be quite important because adipose tissue is 10% macrophages in a lean individual but 40% in an obese person. However, I do not know which members of that inflammation pathway are actually relevant and expressed in adipose tissue.
In addition, a recent paper by Folkersen (Circ Cardiovasc Genet 3:365) shows that many disease SNPs for cardiovascular disease phenotypes map far from the gene whose mRNA levels associate with that SNP. Again, it is a gene expression thing similar to the LCT-MCM6 story above.
In all, this is tough and there is no satisfactory way to assign a SNP to a pathway. Assignment can be easier based on genetics - GWAS and classical mapping and mouse KOs - but those too may be population specific or altered by environment.
There are actually two questions in
related pathways/ diseases ?
The first part can be solved by database queries such as BioMart and KEGG, but the second part requires complex studies. Actually, IMHO, a large part of the already known SNPs are not connected to disease; they might not even have a phenotype (I would bet >99%). As far as I understand, the known SNPs are sampled from "healthy" individuals and represent a large mix. So it seems reasonable to assume that they are not easily connected to diseases.
In short, the answer might be exome sequencing of affected individuals. I found this recent article which I think is really great to answer this question:
Ng SB, et al., Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010 Jan;42(1):30-5. Epub 2009 Nov 13.
In short, they looked for point mutations shared by a few affected individuals and subtracted synonymous coding SNPs and already known SNPs until only one gene remained.
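The subtraction idea can be sketched in a few lines. This is only a rough illustration of the filtering logic as summarized above, not the pipeline used in the paper; the function name, the tuple layout and the gene/variant labels are all made up for the example.

```python
from collections import Counter


def candidate_genes(variants_by_individual, known_snps, min_affected=None):
    """Return genes carrying a novel, non-synonymous variant in enough individuals.

    variants_by_individual: dict mapping an individual's ID to a list of
        (variant_id, gene, effect) tuples. This layout is an assumption.
    known_snps: set of variant IDs already catalogued (e.g. in dbSNP).
    min_affected: how many individuals must share the gene; defaults to all.
    """
    if min_affected is None:
        min_affected = len(variants_by_individual)

    gene_hits = Counter()
    for variants in variants_by_individual.values():
        # Keep each gene once per individual, after dropping
        # synonymous changes and already known SNPs.
        genes = {
            gene
            for variant_id, gene, effect in variants
            if effect != "synonymous" and variant_id not in known_snps
        }
        gene_hits.update(genes)

    return sorted(g for g, count in gene_hits.items() if count >= min_affected)


# Hypothetical calls for two affected individuals.
calls = {
    "patient1": [("v1", "GENE_A", "missense"),
                 ("v2", "GENE_B", "synonymous"),
                 ("rs1", "GENE_C", "missense")],
    "patient2": [("v3", "GENE_A", "nonsense"),
                 ("rs1", "GENE_C", "missense")],
}
print(candidate_genes(calls, known_snps={"rs1"}))
```

Note that the two patients do not need to share the same variant, only the same gene, which is why the counting is done per gene.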
I would use DAS (Distributed Annotation System) to retrieve all genes/phenotypes associated with a specific SNP.
DAS is a web service for decentralised annotation that provides an easy protocol for retrieving features by supplying a URL.
For example, retrieve me all OMIM genes in chromosome 18 between base pair 1 and 1000000
More on DAS here
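A request like the chromosome 18 example can be put together as a plain URL. The sketch below assumes the DAS 1.x "features" command with a segment=chromosome:start,end parameter; the server address and the data source name are placeholders I made up, so substitute a real DAS server and source before using it.

```python
from urllib.parse import urlencode


def das_features_url(server, source, chromosome, start, end):
    """Build a DAS 'features' request URL for one genomic segment."""
    segment = f"{chromosome}:{start},{end}"
    # Keep ':' and ',' unescaped so the segment stays readable.
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical server and source name, for illustration only.
url = das_features_url("http://das.example.org", "omim_genes", 18, 1, 1000000)
print(url)
```

The response of a DAS server is XML, so the returned document can be read with any XML parser to pull out the feature entries.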
Biomart's Martview (http://www.biomart.org/biomart/martview/) will get you from SNP IDs to many gene/protein identifiers. In a second step, Martview will also get you from gene IDs to GO Biological Process terms, but there are probably better tools that are specifically targeted toward pathways (KEGG, Reactome, WikiPathways, etc.)
As soon as you have the Entrez Gene IDs related to your SNPs, you can query KEGG or WikiPathways, which should provide the Entrez Gene IDs belonging to a given pathway. The good thing about these two websites is that with some SVG you can customize the graphic view of the pathways to highlight the genes that carry the SNPs. Hope this helps.
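Once both lists are in hand, finding the pathways that contain your SNP-bearing genes is a simple set intersection. A minimal sketch, assuming you have already downloaded a pathway-to-gene table; the pathway names and gene IDs below are placeholders, not real KEGG or WikiPathways content.

```python
def rank_pathways(snp_genes, pathway_members):
    """List pathways containing SNP-bearing genes, largest overlap first.

    snp_genes: iterable of gene IDs that carry at least one SNP of interest.
    pathway_members: dict mapping a pathway name to an iterable of gene IDs.
    Returns a list of (pathway, sorted overlapping gene IDs) tuples.
    """
    snp_genes = set(snp_genes)
    hits = {}
    for pathway, members in pathway_members.items():
        overlap = sorted(snp_genes & set(members))
        if overlap:
            hits[pathway] = overlap
    # Sort by overlap size (descending), then by name for stable output.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


# Placeholder IDs and pathway names, for illustration only.
pathways = {
    "pathway_A": ["1001", "1002", "1003"],
    "pathway_B": ["1003", "1004"],
    "pathway_C": ["1005"],
}
for name, genes in rank_pathways(["1001", "1003"], pathways):
    print(name, genes)
```

The overlapping gene IDs returned for each pathway are the ones you would then colour in the SVG view.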
I believe it best to describe individual SNPs in ALL/every which way imaginable: map location, gene centric (in cds or 5 kb upstream from this ORF etc), pathway involvement (if known) and finally disease/phenotype involvement (if known).
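The gene-centric part of such a description can be sketched as a small position check. This is a simplification: it treats the whole gene span as one block rather than separating CDS, introns and UTRs, and all coordinates in the example are invented.

```python
def classify_snp(pos, gene_start, gene_end, strand, upstream=5000):
    """Classify a SNP position relative to one gene.

    gene_start <= gene_end are genomic coordinates on the forward strand;
    strand is '+' or '-'. 'upstream' means within `upstream` bp of the
    gene's 5' end, which is on the left for '+' and on the right for '-'.
    """
    if gene_start <= pos <= gene_end:
        return "within gene"
    if strand == "+" and gene_start - upstream <= pos < gene_start:
        return "upstream"
    if strand == "-" and gene_end < pos <= gene_end + upstream:
        return "upstream"
    return "intergenic"


# Invented coordinates: a forward-strand gene spanning 100,000-120,000.
print(classify_snp(110000, 100000, 120000, "+"))  # within gene
print(classify_snp(96000, 100000, 120000, "+"))   # upstream
print(classify_snp(89000, 100000, 120000, "+"))   # intergenic
```

The last call shows the limitation: a regulatory variant 11 kb upstream, like the lactase persistence variants mentioned earlier in this thread, falls outside a 5 kb window and would be labelled intergenic.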
The next level of complexity arises when one wants to describe SNPs whose penetrance is modified by other factors (genetic or epigenetic), but that may be beyond the scope of this discussion.
Gene Set Analysis Toolkit V2 http://bioinfo.vanderbilt.edu/webgestalt/ You can just upload a txt file containing a list of rs IDs, one per line. As a result it gives KEGG pathways or WikiPathways with colored genes/proteins, sorted by the number of genes each pathway contains.
I have used ALIGATOR (pdf) to find pathway enrichment from GWAS SNP data. It tests for overrepresented Gene Ontology categories based on SNP p-values. Program link[?]
GRASS is another method, based on ridge regression, that uses SNP data to find pathway enrichment. I believe a new R package is out for this now.
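To give a feel for what "overrepresented" means here, below is a plain one-sided hypergeometric test. To be clear, this is not ALIGATOR's own procedure, just the textbook version of an overrepresentation test; the function name and the numbers are made up.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, category_genes, hit_genes, hits_in_category):
    """P(at least `hits_in_category` of the hit genes fall in the category).

    total_genes: number of genes in the background.
    category_genes: number of background genes annotated to the category.
    hit_genes: number of genes carrying a significant SNP.
    hits_in_category: number of those hit genes that are in the category.
    """
    max_k = min(category_genes, hit_genes)
    denominator = comb(total_genes, hit_genes)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(category_genes, k) * comb(total_genes - category_genes, hit_genes - k)
        for k in range(hits_in_category, max_k + 1)
    )
    return tail / denominator


# Toy numbers: 10 genes in total, 3 in the category, 2 hit genes, both in the category.
p = hypergeom_enrichment_p(10, 3, 2, 2)
print(round(p, 4))
```

With real GWAS data a simple test like this ignores gene size and linkage disequilibrium between SNPs, so treat it only as a first look.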
I am not sure if this is what Pierre is looking for..