Disease Associated SNPs
11
19
12.8 years ago
pixie@bioinfo ★ 1.5k

Can anyone suggest a tool or validated database where I can get disease-associated SNP data (for diabetes, etc.) together with the corresponding PMIDs, the number of cases and controls, and the population studied? I have checked dbSNP, but the information there is not disease specific. I have also checked HuGE Navigator, but the SNPs reported there have no PMIDs, so I cannot validate the data.

snp gwas database • 24k views
2

The NHGRI curates a list of all published GWA studies: http://genome.gov/gwastudies/

21
12.8 years ago

Inspired by Khader's comment: the following MySQL query against the anonymous MySQL server at UCSC returns the SNPs located in the OMIM genes:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select
concat(left(title1,30),"..."),
omimId,
S.name,
S.func,
G.chrom,
S.chromStart,
S.chromEnd
from
omimGene as G,
omimGeneMap as M,
snp130 as S
where
G.name=M.omimId and
G.chrom=S.chrom and
S.chromStart>=G.chromStart and
S.chromEnd <= G.chromEnd
limit 10;'


Result:

+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| concat(left(title1,30),"...")     | omimId | name       | func               | chrom | chromStart | chromEnd |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| Nucleolar complex-associated p... | 610770 | rs72904505 | untranslated-3     | chr1  |     869480 |   869481 |
| Nucleolar complex-associated p... | 610770 | rs6605067  | untranslated-3     | chr1  |     869538 |   869539 |
| Nucleolar complex-associated p... | 610770 | rs2839     | untranslated-3     | chr1  |     869549 |   869550 |
| Nucleolar complex-associated p... | 610770 | rs3196153  | untranslated-3     | chr1  |     869586 |   869587 |
| Nucleolar complex-associated p... | 610770 | rs1133980  | untranslated-3     | chr1  |     869614 |   869615 |
| Nucleolar complex-associated p... | 610770 | rs28453979 | untranslated-3     | chr1  |     869781 |   869782 |
| Nucleolar complex-associated p... | 610770 | rs61551591 | intron,near-gene-3 | chr1  |     870079 |   870080 |
| Nucleolar complex-associated p... | 610770 | rs3748592  | intron,near-gene-3 | chr1  |     870100 |   870101 |
| Nucleolar complex-associated p... | 610770 | rs3748593  | intron,near-gene-3 | chr1  |     870252 |   870253 |
| Nucleolar complex-associated p... | 610770 | rs74047418 | missense           | chr1  |     870364 |   870365 |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
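
For anyone who wants to try the containment logic without connecting to UCSC, here is a minimal sketch using Python's built-in sqlite3 module. The three tables are toy stand-ins that only borrow the column names used in the query; every row (names, coordinates, title) is invented for illustration and none of it comes from UCSC. It runs the same condition once in the comma/WHERE style used above and once with explicit JOIN ... ON, and both should return the same rows.

```python
import sqlite3

# Toy stand-ins for omimGene, omimGeneMap and snp130.
# All rows are invented for illustration; they are NOT UCSC data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE omimGene    (name INTEGER, chrom TEXT, chromStart INTEGER, chromEnd INTEGER);
CREATE TABLE omimGeneMap (omimId INTEGER, title1 TEXT);
CREATE TABLE snp130      (name TEXT, func TEXT, chrom TEXT, chromStart INTEGER, chromEnd INTEGER);
INSERT INTO omimGene    VALUES (1, 'chr1', 1000, 2000);
INSERT INTO omimGeneMap VALUES (1, 'toy gene');
INSERT INTO snp130 VALUES ('snp_in_1',      'missense', 'chr1', 1100, 1101);
INSERT INTO snp130 VALUES ('snp_in_2',      'intron',   'chr1', 1500, 1501);
INSERT INTO snp130 VALUES ('snp_outside',   'intron',   'chr1', 5000, 5001);
INSERT INTO snp130 VALUES ('snp_other_chr', 'intron',   'chr2', 1500, 1501);
""")

COLUMNS = "M.title1, M.omimId, S.name, S.func, G.chrom, S.chromStart, S.chromEnd"

# Same shape as the query above: comma-separated tables plus WHERE conditions.
where_style = f"""
SELECT {COLUMNS}
FROM omimGene AS G, omimGeneMap AS M, snp130 AS S
WHERE G.name = M.omimId
  AND G.chrom = S.chrom
  AND S.chromStart >= G.chromStart
  AND S.chromEnd <= G.chromEnd
ORDER BY S.chromStart
"""

# Equivalent formulation with explicit joins.
join_style = f"""
SELECT {COLUMNS}
FROM omimGene AS G
JOIN omimGeneMap AS M ON G.name = M.omimId
JOIN snp130 AS S ON G.chrom = S.chrom
                AND S.chromStart >= G.chromStart
                AND S.chromEnd <= G.chromEnd
ORDER BY S.chromStart
"""

where_rows = conn.execute(where_style).fetchall()
join_rows = conn.execute(join_style).fetchall()

for row in where_rows:
    print(row)
```

Only the two toy SNPs that lie fully inside the toy gene interval on the same chromosome are returned; the choice between the two spellings is a matter of readability here, not of result.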

2

easy, there is a func column in snp130. Let me update the query...

1

Awesomeness ! Like++ !

0

This is awesome, Pierre. Curious to know whether we can use the location of a SNP to see if it is part of an exon or an intron using UCSC.

0

Nice query, Pierre. I just edited my previous HVP answer to point out how important it is to think about why one would actually want to retrieve such a table.

0

is there any special reason why you are using a where clause instead of a join?

0
Fri Jun 17 22:16:40 CEST 2011: " Table 'hg18.omimGene' doesn't exist".


The UCSC is currently changing the database...

0

the omimGene table is not in UCSC anymore http://redmine.soe.ucsc.edu/forum/index.php?t=msg&goto=5824&S=0e7dfb30fefa801e6571b8047ad60684

How can I get those disease associated SNPs now? Thanks!

0

I have exactly the same problem... And I want to apply the method for hg19 also...

1

see my post here

12
12.8 years ago

Roughly speaking, what you (and lots of people around the world) would like to do is actually the main purpose of the Human Variome Project (HVP), which encourages the creation of locus-specific databases (LSDBs) that collate disease-specific variations. Right now, all we can do is one of two things:

1. Disease-based query: you know the disease and you look for a particular database that ideally has all the information available. Benefits? The disease association of each SNP should have been tested and validated. Problems? You will surely find more than one database, built by different groups with different backgrounds, different curation strength, different maintenance effort, ... That is in fact what the HVP tries to normalize.

2. SNP-based query: you know a region of interest, you go to your reference database (such as dbSNP), and you expect it to contain disease-specific information for each SNP. This is "only" possible through automatic processes, as mentioned. Benefits? All the information is available through large, heavily cross-linked websites (such as dbSNP) that cross-reference everything they hold, and accessing it is fairly simple. Problems? The information from each PubMed paper, for instance, is not validated by the system at all, and the accuracy of the data in clinical papers (nomenclature, pathogenicity assessment, ...) is very inconsistent.

So in the end, at least right now, you will have to decide what you are willing to compromise on. Either you obtain a fairly simple list of SNPs associated with diseases, where the associations may not all be real; or you build your own SNP list by collating all the disease-specific databases you are interested in; or you spend days/weeks/months reading disease-association papers in order to assess their accuracy. Unfortunately, for clinical purposes the latter two options are strictly necessary, but if you are just doing broad research you may get what you want from the first one.

Note: if you look for SNPs in disease genes, you are accepting that you will get all the non-rare variations, which are not necessarily associated with the disease of interest. In fact, in a diagnostic lab, when a mutation (note that I call it a mutation, not a polymorphism) is found at a SNP site, it gives the clinician a clue that it is probably not associated with the problem, especially in monogenic diseases. It's logical: if something is bad enough to cause a genetic disease, it shouldn't appear so frequently (its frequency would depend on the disease incidence and penetrance, but it would still be very low). dbSNP build 131 now includes much lower-frequency SNPs, aiming at the rarest ones, but even the NCBI knows that dbSNP won't be a disease diagnostic tool, rather a source for ruling out possibilities. In fact, that's the reason why NCBI is also supporting the HVP.

(I was going to comment on the nice query written by Pierre to get the SNPs from OMIM genes, but I needed more than 500 characters, so I'm editing my original answer to include this note. I think it points out a biological issue that maybe no one pays enough attention to when batch-retrieving information.)

2

Hi Jorge, nice answer. Is the HVP project live? I am not able to see a search or browse option. Please share the link to browse HVP.

0

The HVP concept started around 2006, but since then I haven't seen any global, unified results page. Maybe that is because a unified page is not a near-term goal for the project; the goal is rather to encourage the creation of LSDBs around the world, as normalized as possible, that will eventually be queried from a unified interface. The only thing I can point you to for sure is its roadmap.

8
12.1 years ago
lh3 33k

I was directed here from another question. I posted an answer because the top voted answer, while correct, is very inefficient. As BioStar is a professional Q&A site, I think we should get this straight for ourselves and for other users connecting to the UCSC MySQL server.

If we check the UCSC table schemas, most tables do not have chromStart and chromEnd indexed, which means that naively querying on these columns incurs unnecessary data loading and is thus discouraged. For overlap queries, UCSC uses the mysterious 'bin' field, which is explained in the UCSC paper, the SAM spec and my tabix paper. Because of this strategy, most table joins and naive SQL queries are inefficient. One has to write multiple queries and use a small script to handle them. The following Perl code shows how to compute the bins that overlap a query region.

# Return the bin numbers to search for features overlapping [beg, end).
sub region2bin {
    my ($beg, $end) = @_;
    my @bin = (1);
    push(@bin, (  1 + ($beg>>26) ..   1 + (($end-1)>>26)));  # 64 Mb bins
    push(@bin, (  9 + ($beg>>23) ..   9 + (($end-1)>>23)));  # 8 Mb bins
    push(@bin, ( 73 + ($beg>>20) ..  73 + (($end-1)>>20)));  # 1 Mb bins
    push(@bin, (585 + ($beg>>17) .. 585 + (($end-1)>>17)));  # 128 kb bins
    return @bin;
}


and in SQL, we should explicitly query bins as is shown in the UCSC paper. We are professionals, and I hope our answers are also of the best quality.
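
For readers who prefer Python, the bin computation and the extra SQL condition can be sketched as follows. This is only a sketch under stated assumptions: the function names `region_to_bins` and `bin_clause` are made up for this example, and the top-level bin is taken to be 0, following the standard UCSC scheme as I understand it, whereas the Perl above starts its list with 1. Please check that choice against the table you query.

```python
def region_to_bins(beg, end):
    """Bins that may hold features overlapping the half-open region [beg, end).

    Assumes the standard UCSC scheme: one top-level bin, then levels of
    64 Mb, 8 Mb, 1 Mb and 128 kb bins with offsets 1, 9, 73 and 585.
    """
    bins = [0]  # assumption: the single top-level bin is numbered 0
    for offset, shift in ((1, 26), (9, 23), (73, 20), (585, 17)):
        first = offset + (beg >> shift)
        last = offset + ((end - 1) >> shift)
        bins.extend(range(first, last + 1))
    return bins


def bin_clause(beg, end):
    """SQL condition (hypothetical helper) restricting a query to those bins."""
    return "bin IN (" + ",".join(str(b) for b in region_to_bins(beg, end)) + ")"


if __name__ == "__main__":
    # Append this to the usual chrom/chromStart/chromEnd conditions with AND.
    print(bin_clause(100000000, 100010000))
```

For chr1:100000000-100010000 this yields bins 0, 2, 20, 168, 1347 and 1348. The coordinate conditions on chromStart and chromEnd are still needed, because a bin only narrows the search to candidates.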

EDIT:

As I have just tried, the naive SQL takes 6.5 seconds:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg19 -e 'set profiling=1;SELECT * FROM snp130 WHERE chrom="chr1" AND chromEnd>=100000000 AND chromStart<=100010000;show profiles'


while the SQL using the bin field takes only 0.0077 seconds (establishing the connection takes about 1 to 2 seconds):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg19 -e 'set profiling=1;SELECT * FROM snp130 WHERE chrom="chr1" AND chromEnd>=100000000 AND chromStart<=100010000 AND (bin=1 OR bin=2 OR bin=20 OR bin=168 OR bin=1347 OR bin=1348);show profiles'


This is a huge difference. On smaller tables, the difference between the two SQLs will be smaller, but still matters. An easy way to write SQL is to use batchUCSC.pl. For example:

echo "chr1 100000000 100010000" | ./batchUCSC.pl -ed hg19 -p 'snp130:::'
0

+1 for the perl code

0

Is there an example of using the Perl script to get all the disease-related SNPs?

7
12.8 years ago

Simple mapping of a SNP to a disease makes sense only if you are looking for an overall association of SNPs with diseases. But when you look closer, you may realize that a SNP with a significant p-value may lie in a coding or a non-coding region of a gene. For example, look at the list of disease associations obtained from GWAS studies to date: a considerable number of the significant SNPs fall into non-coding regions.

A SNP can have a synonymous or non-synonymous effect on the gene product. If it is in a coding region, direct disease association using ID mapping is a good approach; this is the basis for most of the OMIM-to-dbSNP mapping and various other ID mappings.

Mutation in protein 'Y' leads to disease 'X', so protein 'Y' is involved in disease 'X'. SNP 'rs12345' is present in gene 'y', which codes for protein 'Y'. Therefore SNP 'rs12345' is associated with disease 'X'.

This simple concept works only if your SNP is in a coding region. If you know the location of the mutation on the protein and the type or effect of the mutation, you can get clearer results.

Several answers here could be a good starting point for you. The best way to start is to check the NHGRI GWAS catalogue for known associations of SNPs with your disease(s) of interest. Another possibility is to look up the disease in OMIM or KEGG DISEASE, get the SNPs, and perform a location- and mutation-aware analysis of them.

Also check the related question on mapping SNPs to pathways.

0

Thanks for the interesting insights..I will try to solve the problem using some of the above approaches..

5
11.8 years ago
David John ▴ 50

I think this is exactly the tool you are looking for.

snp4disease.mpi-bn.mpg.de/

if you have any questions feel free to contact me.

1

Thanks so much...it looks like a very useful resource :)

0

This is exactly the tool I was looking for. Thank you.

0

Is there a similar link for psychiatric disorders?

4
12.8 years ago

You can use NCBI ELink to map from the diseases in OMIM to dbSNP: see this previous question on biostar about OMIM/STS.

Then NCBI EFetch can be used to retrieve all the information about a given SNP (e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=120435&retmode=xml), but as far as I know there is no place where you will find the number of cases, the number of controls and the population: the information for the ss IDs (assays/population) is hidden somewhere in the depths of the NCBI.
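
As a rough sketch of the two steps, the E-utilities URLs can be assembled like this in Python. The helper names `elink_url` and `efetch_url` are made up for this example, and whether an omim-to-snp link is actually served for a given OMIM entry is an assumption to verify. The code only builds the URLs and makes no network request.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(omim_id):
    """URL asking ELink for dbSNP records linked to one OMIM entry.

    Assumption: the omim -> snp link is available for this entry.
    """
    params = {"dbfrom": "omim", "db": "snp", "id": omim_id}
    return EUTILS + "/elink.fcgi?" + urlencode(params)


def efetch_url(snp_id):
    """URL asking EFetch for the XML record of one SNP."""
    params = {"db": "snp", "id": snp_id, "retmode": "xml"}
    return EUTILS + "/efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    # 610770 is an OMIM number that appears elsewhere in this thread;
    # 120435 is the SNP id from the example URL above.
    print(elink_url(610770))
    print(efetch_url(120435))
```

The IDs returned by the first request would then be passed one by one (or comma-separated) to the second.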

4
12.8 years ago

The National Human Genome Research Institute has put together a catalog of published genome-wide association studies. SNP-trait associations listed here are limited to those with p-values < 1.0 x 10^-5. You can search by disease, trait, gene, SNP id, chromosomal region...

http://www.genome.gov/26525384/
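
If you download the catalog as a tab-delimited file, pulling out SNPs, PMIDs and sample descriptions for one trait could look roughly like the sketch below. The column names (PUBMEDID, DISEASE/TRAIT, INITIAL SAMPLE SIZE, SNPS, P-VALUE) are my assumption about the export format and should be checked against the header of the file you actually get. The rows in `EXAMPLE_TSV` are placeholders, not real catalog entries.

```python
import csv
import io

# Placeholder rows in the assumed shape of a catalog export.
# The IDs, SNP names and sample descriptions are dummies, NOT real data.
EXAMPLE_TSV = (
    "PUBMEDID\tDISEASE/TRAIT\tINITIAL SAMPLE SIZE\tSNPS\tP-VALUE\n"
    "00000001\tType 2 diabetes\tplaceholder cases, placeholder controls\trs_placeholder_1\t2E-8\n"
    "00000002\tHeight\tplaceholder individuals\trs_placeholder_2\t5E-9\n"
    "00000003\tType 2 diabetes\tplaceholder cases, placeholder controls\trs_placeholder_3\t3E-4\n"
)


def associations(handle, trait, max_p=1e-5):
    """Yield (SNP, PMID, sample description, p-value) for matching rows.

    A row matches when `trait` occurs in its trait column (case-insensitive)
    and its p-value is below `max_p`.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        p_value = float(row["P-VALUE"])
        if trait.lower() in row["DISEASE/TRAIT"].lower() and p_value < max_p:
            yield row["SNPS"], row["PUBMEDID"], row["INITIAL SAMPLE SIZE"], p_value


hits = list(associations(io.StringIO(EXAMPLE_TSV), "diabetes"))
for hit in hits:
    print(hit)
```

With a real download you would pass an open file handle instead of the `io.StringIO` wrapper; the sample-size column is free text, so the counts of cases and controls would still need to be parsed out of it.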

3
12.8 years ago
Neilfws 49k

If you are working at the NCBI web site, it might be better to start from the disease, using OMIM and work your way to SNPs and publications, rather than starting from dbSNP.

When I enter a query for e.g. diabetes, I see a results tab labelled "OMIM dbSNP". Clicking on results in that list takes me to the OMIM page - on the right I see a link to "SNP". Clicking that link gives me another results tab labelled "Cited in PubMed". So all of the information is there and all of the Entrez databases cross-reference each other.

You can also access a lot of this information programmatically, using URLs with the appropriate parameters to link the databases. I don't recall a good example off the top of my head; this is Pierre's speciality, so we'll wait for him to come online.

3
12.8 years ago

I would recommend SNPedia, a manually curated wiki on SNPs and their associated diseases. If you look at the details of any SNP (example), you will find a lot of links to other databases.

1

is there any file dump for snpedia, or do we have to use the mediawiki API to parse the infoboxes (if any) ?

1

I have checked SNPedia. I am wondering why the list of SNPs associated with type 2 diabetes is so small compared to the number of candidate genes reported in the Type 2 diabetes database (T2D DB).

0

There is a larger list by looking at which SNPs link to T2D

0

I used Promethease, a utility from SNPedia's creators. It's easy to add your own rs IDs to the example file and get a report. What I can't do at the moment is create a CSV or TSV from the HTML report. http://www.snpedia.com/index.php/Promethease

0
11.9 years ago
Cariaso ▴ 10

Gbrowse provides a dump

Paid promethease runs produce a tab delimited file like this

from the full report

0
9.2 years ago
vaibhav • 0

You can refer to this database, www.rasadbsnp.com, for disease-associated SNPs.