I am most perplexed by dbSNP, and I was hoping someone familiar with its schema might be able to help me. I have two issues and would be grateful if you could address them individually. If you are short of time, please address issue 1, as this is my priority.
Issue 1: I have been investigating the contents of dbSNP by querying the database directly, and it looks to me as though dbSNP doesn't store ALL of the information about the effects of SNPs on the transcript variants of a gene that it displays on its webpage and in its reports, though it does store SOME.
Take this example SNP: rs2034920. You can see clearly from the web page that this SNP affects 2 transcript variants on the GRCh37 assembly. These are NM_001007551 (versions 2 and 3) and NM_001172288.
If you look in the dbSNP database directly for the consequences of this SNP, you only get one mRNA/protein for the GRCh37 assembly, which is NM_001007551.2 (version 2). You don't get any details about NM_001007551.3 or NM_001172288, but they are quite clearly displayed on the webpage. The query I ran was:
SELECT * FROM b131_SNPContigLocusId_37_1 WHERE snp_id=2034920
You can see the results online at this webpage, which lets you query the dbSNP schema directly: http://cgsmd.isi.edu/dbsnpq/submit.php?query=SELECT+*+FROM+b131_SNPContigLocusId_37_1#dbSNPstatusMessageBox
b131_SNPContigLocusId_37_1 is the table that stores the effects of a SNP on a transcript, so you would expect to see at least one row in this table per transcript, and multiple rows per transcript if the different alleles have different consequences. For example, a C/T SNP in an intron has only one consequence [INTRONIC] regardless of the allele, but in an exon the C and T alleles could produce different amino acids, so there would be one row per allele in this table.
This query only returns the effects on one transcript. So where are the other transcripts in the database? I thought perhaps they might be linked by some relationship I don't know about. However, if you run SELECT * FROM b131_SNPContigLocusId_37_1 WHERE mrna_acc='NM_001172288' you get an empty set, so that transcript doesn't appear to be in this table at all.
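For anyone who wants to reproduce the two checks above without access to the real database, here is a minimal sketch in Python that uses an in-memory SQLite table as a stand-in for b131_SNPContigLocusId_37_1. The toy rows simply copy the accessions quoted in this post (versions I have not quoted are left empty), and the column names snp_id, contig_acc, mrna_acc, mrna_ver and protein_acc are my assumption about the schema, so treat it as an illustration of the queries rather than a copy of dbSNP.

```python
import sqlite3

# Toy stand-in for b131_SNPContigLocusId_37_1 with only the columns used here.
# Rows copy the accessions quoted in this post; unquoted versions are NULL.
ROWS = [
    (2034920, "NT_011786", "NM_001007551", 2, "NP_001007552"),
    (2034920, "NW_001842404", "XM_002346337", None, "XP_002346378"),
    (2034920, "NW_927722", "XM_001716002", None, "XP_001716054"),
]


def build_toy_db():
    """Create the in-memory table and load the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE b131_SNPContigLocusId_37_1 ("
        "snp_id INTEGER, contig_acc TEXT, mrna_acc TEXT, "
        "mrna_ver INTEGER, protein_acc TEXT)"
    )
    conn.executemany(
        "INSERT INTO b131_SNPContigLocusId_37_1 VALUES (?, ?, ?, ?, ?)", ROWS
    )
    return conn


def transcripts_for_snp(conn, snp_id):
    """Return (mrna_acc, mrna_ver, row_count) for each transcript of one SNP."""
    sql = (
        "SELECT mrna_acc, mrna_ver, COUNT(*) "
        "FROM b131_SNPContigLocusId_37_1 "
        "WHERE snp_id = ? "
        "GROUP BY mrna_acc, mrna_ver ORDER BY mrna_acc"
    )
    return conn.execute(sql, (snp_id,)).fetchall()


def rows_for_transcript(conn, mrna_acc):
    """Return every row that mentions the given mRNA accession."""
    sql = "SELECT * FROM b131_SNPContigLocusId_37_1 WHERE mrna_acc = ?"
    return conn.execute(sql, (mrna_acc,)).fetchall()


conn = build_toy_db()
print(transcripts_for_snp(conn, 2034920))
print(rows_for_transcript(conn, "NM_001172288"))
```

Grouping by mrna_acc and mrna_ver makes it easy to see at a glance which transcript versions have rows for the SNP and how many rows each has, which is the per-transcript count I was expecting to be higher.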
Issue 2: You will also see from these results [SELECT * FROM b131_SNPContigLocusId_37_1 WHERE snp_id=2034920] that the same SNP is linked to different genes in the different assemblies.
Assembly: contig id, mRNA id, protein id
HuRef: NW_001842404, XM_002346337, XP_002346378
Celera: NW_927722, XM_001716002, XP_001716054
GRCh37: NT_011786, NM_001007551, NP_001007552
Why would a SNP be linked to different genes in different assemblies? Hopefully this is simpler than the previous problem. My guess is that there is a family of closely related genes on this chromosome and the SNP has been mapped to different members of this family in the different assemblies.
My most grateful thanks for any light you can shed on this matter.