Question: Does dbSNP Not Contain The Data It Displays On The Webpage And Reports?
9.3 years ago, Andrea_Bio wrote:


I am most perplexed by dbSNP and was hoping someone familiar with its schema might be able to help. I have two issues and would appreciate it if you addressed them individually. If you are short of time, please address issue 1, as it is my priority.

Issue 1

I have been investigating the contents of dbSNP by querying the database directly. It looks as though dbSNP doesn't store ALL the information about the effects of SNPs on a gene's transcript variants that it displays on its web pages and in its reports, though it does store SOME.

Take this example SNP: rs2034920. You can see clearly from the web page that this SNP affects 2 transcript variants on the GRCh37 assembly: NM_001007551 (versions 2 and 3) and NM_001172288.

If you look in the dbSNP database directly for the consequences of this SNP, you only get one mRNA/protein for the GRCh37 assembly, which is NM_001007551.2 (version 2). You don't get any details about NM_001007551.3 or NM_001172288, but they are quite clearly displayed on the web page. The query I ran was:

SELECT * FROM b131_SNPContigLocusId_37_1 WHERE snp_id = 2034920

You can see the results online at this webpage, which lets you query the dbSNP schema directly:*+FROM+b131_SNPContigLocusId_37_1#dbSNPstatusMessageBox

b131_SNPContigLocusId_37_1 is the table that stores the effects of a SNP on a transcript. You would expect at least one row in this table per transcript, and multiple rows per transcript if the different alleles have different consequences. For example, a C/T SNP in an intron has only one consequence [INTRONIC] regardless of the allele, but in an exon C and T could produce different amino acids, so the SNP would have one row per allele in this table.
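
To make the expected row layout concrete, here is a minimal sketch in Python using an in-memory SQLite table. It is only an illustration: the table name `locus`, the `effect` and `allele` columns, and every row are made up (real dbSNP tables have many more columns and different encodings); only `snp_id` and `mrna_acc` are taken from the queries in this post.

```python
import sqlite3

# Toy stand-in for b131_SNPContigLocusId_37_1. The table name "locus", the
# "effect"/"allele" columns and all rows are hypothetical, not dbSNP content.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE locus (
           snp_id INTEGER, mrna_acc TEXT, mrna_ver INTEGER,
           effect TEXT, allele TEXT)"""
)
conn.executemany(
    "INSERT INTO locus VALUES (?, ?, ?, ?, ?)",
    [
        (1, "NM_TOY_A", 1, "INTRONIC", None),  # intron: one row for the SNP
        (1, "NM_TOY_B", 1, "CODING", "C"),     # exon: one row per allele
        (1, "NM_TOY_B", 1, "CODING", "T"),
    ],
)


def rows_per_transcript(conn, snp_id):
    """Count how many rows each transcript has for the given SNP."""
    cur = conn.execute(
        """SELECT mrna_acc, mrna_ver, COUNT(*) FROM locus
           WHERE snp_id = ?
           GROUP BY mrna_acc, mrna_ver
           ORDER BY mrna_acc""",
        (snp_id,),
    )
    return cur.fetchall()


def transcript_present(conn, mrna_acc):
    """Return True if any row at all mentions this transcript accession."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM locus WHERE mrna_acc = ?", (mrna_acc,)
    )
    return cur.fetchone()[0] > 0


print(rows_per_transcript(conn, 1))
print(transcript_present(conn, "NM_MISSING"))
```

Run against a real local copy, the first query would show at a glance which transcripts have rows for a SNP, and the second is the "does this accession exist anywhere in the table" check.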

This query only returns the effects on one transcript. So where are the other transcripts in the database? I thought perhaps they might be linked by some relationship I don't know about. However, if you query SELECT * FROM b131_SNPContigLocusId_37_1 WHERE mrna_acc="NM_001172288" you get an empty set, so that transcript doesn't appear to be in the database at all.

Issue 2

You will also see from these results [SELECT * FROM b131_SNPContigLocusId_37_1 WHERE snp_id = 2034920] that the same SNP is linked to different genes in the different assemblies.

Assembly: contig id, mrna id, protein id
HuRef: NW_001842404, XM_002346337, XP_002346378
Celera: NW_927722, XM_001716002, XP_001716054
GRCh37: NT_011786, NM_001007551, NP_001007552

Why would a SNP be linked to different genes in different assemblies? Hopefully this is simpler than the previous problem. I think the explanation is that there is a family of closely related genes on this chromosome, and the SNP has been mapped to different members of this family in the different assemblies.

My most grateful thanks for any light you can shed on this matter

dbsnp snp • 2.2k views
9.3 years ago, Brad Chapman (Boston, MA) wrote:

For issue 1, your database query is against dbSNP131, while the web page is the latest version, 132. While I don't know if this explains your differences, there have been a lot of changes in the latest release, so you'll want to compare like to like.
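
One way to compare like with like is to pull the transcript list for the SNP from each build and diff the two sets. A rough sketch in Python, assuming you have already fetched the rows from each build's table as dictionaries; the `mrna_ver` key and the toy rows are assumptions for illustration, not real dbSNP content:

```python
def transcripts(rows):
    """Collect the distinct (accession, version) pairs from locus rows."""
    return {(row["mrna_acc"], row["mrna_ver"]) for row in rows}


def compare_builds(old_rows, new_rows):
    """Report transcripts seen only in one build, and those in both."""
    old, new = transcripts(old_rows), transcripts(new_rows)
    return {
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
        "shared": sorted(old & new),
    }


# Hypothetical rows standing in for the result sets of two builds.
build_old = [{"mrna_acc": "NM_TOY_A", "mrna_ver": 1}]
build_new = [
    {"mrna_acc": "NM_TOY_A", "mrna_ver": 1},
    {"mrna_acc": "NM_TOY_A", "mrna_ver": 2},
    {"mrna_acc": "NM_TOY_B", "mrna_ver": 1},
]

print(compare_builds(build_old, build_new))
```

If `only_new` is non-empty for your SNP, the extra transcripts on the web page are most likely coming from the newer release rather than from a table you overlooked.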

As a more practical answer, have you thought about using the XML reports instead of querying their database?

The FxnSet tags under the GRCh37 Assembly have all of the transcripts you are looking for, along with the changes. Parsing XML should be easier than deciphering the database structure.
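
As a rough sketch of that approach, the snippet below pulls the FxnSet entries for one assembly using Python's standard library. The embedded XML is a made-up toy record, and the attribute names (`groupLabel`, `mrnaAcc`, `mrnaVer`, `fxnClass`) are my recollection of the report layout, so check them against a real file. Real reports also declare an XML namespace, which you would need to include in the tag names.

```python
import xml.etree.ElementTree as ET

# Toy record: element nesting and attribute names are assumptions about the
# dbSNP XML layout, and the accessions are invented placeholders.
SAMPLE = """<Rs rsId="1">
  <Assembly groupLabel="GRCh37">
    <Component>
      <MapLoc>
        <FxnSet mrnaAcc="NM_TOY_A" mrnaVer="1" fxnClass="intron-variant"/>
        <FxnSet mrnaAcc="NM_TOY_B" mrnaVer="2" fxnClass="missense"/>
      </MapLoc>
    </Component>
  </Assembly>
  <Assembly groupLabel="HuRef">
    <Component>
      <MapLoc>
        <FxnSet mrnaAcc="XM_TOY_C" mrnaVer="1" fxnClass="intron-variant"/>
      </MapLoc>
    </Component>
  </Assembly>
</Rs>"""


def fxn_sets(xml_text, group_label):
    """Return (mrnaAcc, mrnaVer, fxnClass) for FxnSets under one assembly."""
    root = ET.fromstring(xml_text)
    found = []
    for assembly in root.iter("Assembly"):
        if assembly.get("groupLabel") != group_label:
            continue  # skip other assemblies such as HuRef or Celera
        for fxn in assembly.iter("FxnSet"):
            found.append(
                (fxn.get("mrnaAcc"), fxn.get("mrnaVer"), fxn.get("fxnClass"))
            )
    return found


print(fxn_sets(SAMPLE, "GRCh37"))
```

Filtering on the assembly label first keeps transcripts from the other assemblies out of the result.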

For issue 2, those identifiers point to the full chromosomes, which you expect to be different since they are different builds:

Since the chromosomes and gene predictions are different, the actual genes will have different IDs as well. If you take a look at the two examples, they refer to the same gene but have different predicted sequences, hence the different IDs:

Practically, use GRCh37 as your reference to stay in sync with the majority of analyses.


For the versioning, data changes from release to release so you want to maintain a consistent release. Sorry to not be able to help with the database; your query seems correct but I don't have a local copy of 132 to compare. For question 2, I tried to edit the answer to be more clear: in summary, if you have different assemblies you also have different gene predictions.

written 9.3 years ago by Brad Chapman

Hi, I've seen the reports and don't want to use them. I specifically need to use the local database. The dbSNP version is not the issue, as the printout from the webpage for this record that I have is actually from build 131. Regarding question 2, that wasn't what I was asking. I was wondering how the SNP ended up being mapped to 3 different genes in the 3 different assemblies. Naturally I would only use one assembly for consistency; I was asking out of personal interest.

written 9.3 years ago by Andrea_Bio

I can't see any schema changes for dbSNP 132 on the NCBI list of schema changes. What are the changes in the new release that you refer to, please? Are you suggesting they have added more data rather than actually changing the schema? That is a possibility, but I have been assured that the website showed the same data for this record a few months ago during build 131.

written 9.3 years ago by Andrea_Bio

I shall investigate your suggestion further to be sure, but as far as I know the dbSNP website doesn't let you look at archived versions like Ensembl does.

written 9.3 years ago by Andrea_Bio

I will download the database dumps and scan them.

written 9.3 years ago by Andrea_Bio

Thanks for your edit. You were right about the database versions! dbSNP seems to have added a lot more info for build 132. You've saved me hours of hunting. I wouldn't have double-checked the versions otherwise.

written 9.3 years ago by Andrea_Bio