Question: dbSNP individual genotyping information for specific SNPs
Andrea_Bio (2.6k) wrote 9.7 years ago:

Hello

I've gone through the schema for dbSNP 132 quite thoroughly and it looks like it is not possible to find the genotype for individuals for a particular SNP submission.

I am using the dbSNP-q resource to look at the schema and query the database directly, so it is possible that tables are missing. I have tried the dbSNP website, but the schema diagram there is 7 years old. I don't have a local installation of dbSNP to examine the schema manually and, as far as I know, dbSNP doesn't provide direct access to its database server.

I can find out which individuals are in a population for a SNP submission (i.e. an ss entry). I can find out the allele frequency and genotype frequency for all the populations related to a SNP submission, but I cannot find the specific genotypes of specific individuals for a specific subSNP (ss). There are no tables linking individuals, ss records and genotypes to give the information in question (the tables AlleleFreqBySsPop and GtyFreqBySsPop give you the allele and genotype frequencies for the populations as a whole). I have a paper that says dbSNP provides the information in question 'per chromosome', with the relevant fields chromosomeid, indid, genotypeid and subsnpid. The tables are named SubSNPInd_chr1 etc., but I cannot find any tables matching this description if I use the dbSNP-Q website to query the database.

Please can you answer this question with respect to the dbSNP schema and not the website.

thanks a lot
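
For illustration only, the kind of lookup the question is after might look like the sketch below. It assumes a per-chromosome table with the four fields named in the paper (chromosomeid, indid, genotypeid, subsnpid). The table name is the one quoted in the question, the rows and IDs are invented placeholders, and an in-memory SQLite database stands in for the real dbSNP server, whose actual schema may well differ.

```python
import sqlite3

# Hypothetical stand-in for a per-chromosome dbSNP genotype table.
# Column names follow the fields quoted in the question; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SubSNPInd_chr1 ("
    "chromosomeid INTEGER, indid INTEGER, genotypeid INTEGER, subsnpid INTEGER)"
)
conn.executemany(
    "INSERT INTO SubSNPInd_chr1 VALUES (?, ?, ?, ?)",
    [(1, 101, 7, 1), (1, 102, 9, 1), (1, 101, 7, 2)],
)


def genotypes_for_ss(conn, subsnp_id):
    """Return (indid, genotypeid) pairs for one submitted SNP (ss) ID."""
    cur = conn.execute(
        "SELECT indid, genotypeid FROM SubSNPInd_chr1 "
        "WHERE subsnpid = ? ORDER BY indid",
        (subsnp_id,),
    )
    return cur.fetchall()


print(genotypes_for_ss(conn, 1))
```

In a real installation the genotype ID would presumably need a further join against a genotype lookup table to obtain the allele pair; that step is left out here because the thread does not describe it.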

lh3 (32k, United States) wrote 9.7 years ago:

Have you had a look here?

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/

or here from 1000g?

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/

If you have not found the related schema, perhaps the published database schema is too old and has not yet been updated to cover individual information, but this information is certainly there.

EDIT:

If you want to get SNPs in a region for 1000g VCFs, you may do that with tabix:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 1:10,000,000-10,100,000 > output.vcf

Each 1000g VCF itself is a minimal database which can be queried remotely. You cannot do this with dbSNP VCFs as they are not tabix indexed.
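
If you would rather script the region query, here is a minimal Python sketch that builds the same tabix call. The helper name `tabix_region_command` is made up for illustration, and the commented-out part assumes tabix is installed and the FTP server is reachable.

```python
import shlex


def tabix_region_command(vcf_url, chrom, start, end, with_header=True):
    """Build the argument list for a remote tabix region query (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    region = f"{chrom}:{start}-{end}"
    args = ["tabix"]
    if with_header:
        args.append("-h")  # also print the VCF header lines
    args += [vcf_url, region]
    return args


url = ("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/"
       "ALL.2of4intersection.20100804.genotypes.vcf.gz")
cmd = tabix_region_command(url, "1", 10_000_000, 10_100_000)
print(shlex.join(cmd))

# To actually run it (needs tabix on PATH and network access):
# import subprocess
# with open("output.vcf", "w") as out:
#     subprocess.run(cmd, stdout=out, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with the URL.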

There is a 1000g browser. You may also try that.

EDIT2:

Now there is the question about whether individual genotypes are available in dbSNP at all. The answer is definite: YES. For now, I can get individual genotype information in two ways. Firstly, I can get that in the VCF format from the official dbSNP ftp, e.g.:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByPopulation/ASW-12156-01.vcf.gz

Secondly, I can acquire individual genotypes for ssIDs, e.g.: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ss.cgi?ss=ss24527973

which means that the individual genotype information is almost certainly stored in the SQL database (I would be greatly surprised otherwise). I do not know how to query the database to get individual genotypes. It is possible that we have not found the right place, or that dbSNP-q/dbSNP is providing an outdated schema.

modified 12 months ago by _r_am (30k) • written 9.7 years ago by lh3 (32k)

These are the only public raw genotypes I'm aware of at dbSNP. I would love to hear from anyone who can point us to any newly available option.

written 9.7 years ago by Jorge Amigo (12k)

You may get Venter and YH genotypes from their websites (they set up a browser as I remember), but the vast majority of genotype information is provided by 1000g. I believe dbSNP/Ensembl/UCSC will provide individual genotype information in future.

written 9.7 years ago by lh3 (32k)

You may get Venter and YH genotypes from their websites (they set up a browser as I remember), but the vast majority of genotype information is provided by first HapMap and then 1000g.

written 9.7 years ago by lh3 (32k)

Hi - thanks for your answer, but my question is specifically about the dbSNP schema, which I did make clear in the question. Even so, this data doesn't tell you which individuals have which genotypes for a SNP.

written 9.7 years ago by Andrea_Bio (2.6k)

Hi - I agree with your edits. My point was that I cannot see how to get this data from the database directly!! I don't have a local installation of dbSNP 132. I am using dbSNPq. The interesting thing is that the tables I seem to be missing did exist a few months ago on dbSNPq (release 131, if I remember correctly) and are in their documentation. They have now disappeared from their website for release 132.

written 9.7 years ago by Andrea_Bio (2.6k)

Hi - I agree with your edits. My point was that I cannot see how to get this data from the database directly!! I don't have a local installation of dbSNP 132. I am using dbSNPq. The interesting thing is that the tables I seem to be missing did exist a few months ago on dbSNPq (release 131) and were in their documentation.

written 9.7 years ago by Andrea_Bio (2.6k)

See page 12: http://nar.oxfordjournals.org/content/39/suppl_1/D901/suppl/DC1

These tables are gone: http://cgsmd.isi.edu/dbdoc/db.php?db=dbsnp_human_132

written 9.7 years ago by Andrea_Bio (2.6k)

Yes, I am well aware what you were asking for, but I do not know the answer. Nonetheless, when you talk about dbSNP-q, things become a little complicated because dbSNP-q is not the official release. For example, dbSNP-q developers may intentionally drop the tables if they feel that those tables take too much disk space while few users query on them. They indeed say in the documentation that they modified some schema. If it were me, I would try to contact the dbSNP-q developers.

written 9.7 years ago by lh3 (32k)

I was able to access the latest version of dbSNP. The tables are indeed there. The dbSNPq site was extremely misleading!

written 9.7 years ago by Andrea_Bio (2.6k)

BTW, if you keep the upcoming 1000g VCF without compression, it is going to be roughly 1TB, if not more.

written 9.7 years ago by lh3 (32k)

The dbSNPq developers also didn't respond. I emailed them months ago :(

written 9.7 years ago by Andrea_Bio (2.6k)

That is good. So the problem is solved - you just use the dbSNP from the official website.

written 9.7 years ago by lh3 (32k)

You are right - I was able to contact someone with dbSNP 132 installed. The tables are there; they are called SNPInd.

written 9.7 years ago by Andrea_Bio (2.6k)

That is good. So the problem is solved - you just use the dbSNP from the official website. Thank you very much for chasing this.

written 9.7 years ago by lh3 (32k)
Larry_Parnell (16k, Boston, MA USA) wrote 9.7 years ago:

This question does indeed border on ethical issues. Being able to link genotypes to personal identifiers, even if coded in some way, does raise ethical concerns, particularly concerning confidentiality. This was essentially why the NIH and Wellcome Trust SNP sites were briefly closed to public access in 2008 (excerpt from Nature News):

*DNA databases shut after identities compromised

Several DNA databases run by the US National Institutes of Health (NIH) in Bethesda, Maryland, the Wellcome Trust in London and the Broad Institute in Cambridge, Massachusetts, were closed to public access last week after researchers showed it is possible to extract the supposedly confidential identities of the patients involved. The databases list the frequencies of small DNA variations called single nucleotide polymorphisms (SNPs) from patient groups.

In the August issue of PLoS Genetics, Nils Homer and his colleagues describe a method to mine individual SNP profiles from complex mixtures, even if the person's DNA is only 0.1% of the total. The method could be useful for ensuring patients are not listed twice when scientists combine data sets, as well as in forensic science.*

Since then, other data repositories have certainly become available where one can access data that link genotype - or haplotype - to a personal identifier. If this is what you really need for your research, then I would mine HapMap. There it is easy to obtain those data, because that is what the database was designed to allow.


Thank you Larry, I think I had that article in mind but had forgotten about it when I wrote my answer, though the potential threat to privacy is even worse than I had imagined. I doubt that the participants could have foreseen such consequences to the level of informed consent, even after signing the consent forms.

written 9.7 years ago by Michael Dondrup (48k)

I also found it interesting to read the follow-up to the article you mention: Church G et al. 2009 Public Access to Genome-Wide Data: Five Views on Balancing Research with Privacy and Protection. PLoS Genet 5(10): e1000665. doi:10.1371/journal.pgen.1000665

written 9.7 years ago by Michael Dondrup (48k)

Both HapMap and 1000g have a highly experienced subgroup considering ethical issues. I do not know much about it, but I trust them. To me, as long as individual genotypes can be released in HapMap and 1000g, I cannot think of a reason why they cannot be released in dbSNP. I know rules are far more stringent when genotypes and phenotypes are released together. That is why we can rarely get public GWAS data in the US.

written 9.7 years ago by lh3 (32k)

HapMap is very useful, but in some areas, 1000g is better.

written 9.7 years ago by lh3 (32k)

Definitely. HapMap and 1000G each have their own strengths. Strictly speaking, dbSNP is a database of variants and not a sequence, haplotype or genotype repository. Including genotype data within dbSNP becomes a database design issue as well as one of broadening the current scope of that database. We'll see what unfolds in the coming months/years.

written 9.7 years ago by Larry_Parnell (16k)
Chris Evelo (10k, Maastricht, The Netherlands) wrote 9.7 years ago:

Not sure whether that would help, but you can actually install it locally: ftp://ftp.ncbi.nih.gov/snp/database/


I have fired up an old installation of dbSNP, and the tables I mention in the question that are supposed to contain this information do not exist.

written 9.7 years ago by Andrea_Bio (2.6k)

This was version 128, btw.

written 9.7 years ago by Andrea_Bio (2.6k)
Michael Dondrup (48k, Bergen, Norway) wrote 9.7 years ago:

Edit: It is quite likely that the following assumption is wrong, but I won't delete because it triggered some interesting comments. However BioStar is not well suited for discussion about ethics so I will abstain from further comments on this topic.

In summary, this is what I take from the other answers: genotype information is available at the per-individual level in the original VCF files, but that information has not been represented in the database, although the data model found in the dbSNP documentation would allow such a relation to be stored. There seems to be a discrepancy between the data model in the documentation and the real data model of the dbSNP SQL database, such that the relation is missing for whatever reason.

I think you won't find that information anywhere, because this is sensitive, personalized medical data. Research codes of ethics and data privacy rules would forbid linking SNP data with an individual and revealing this data without consent. Even when anonymized, a SNP pattern is as unique as a fingerprint and could be used for genetic fingerprinting: to identify an individual, infer inheritance relations, and even get information about the genotypes of relatives. This is also the reason why GWA study data is anonymized and only released upon request under a strict confidentiality policy.

modified 9.7 years ago • written 9.7 years ago by Michael Dondrup (48k)

I don't care about the -1 ;) really can stand this :) but in fact, I believe that this data wouldn't be shared for this ethical reason, and in fact I think that even if it would be shared, it shouldn't be. I have been working with some medical researchers for a while and those people have very high ethical standards when it comes to release of patient data and data privacy, they wouldn't share their raw data easily, and most patients wouldn't consent to contribute to the study in that case.

written 9.7 years ago by Michael Dondrup (48k)

Provided, of course, that everyone in the sample group gave their consent to their data being linked.

written 9.7 years ago by Michael Dondrup (48k)

what's the problem? ;)

written 9.7 years ago by Michael Dondrup (48k)

Please note that it's always a good idea to leave a comment when voting something down. I don't bother normally, but in this case I would be eager to know whether or not this information exists (I believe it doesn't), so please, if you don't have anything to contribute, go away.

written 9.7 years ago by Michael Dondrup (48k)

I wouldn't be so sure that the raw genotypes aren't there, although I've never looked for them at dbSNP. As far as I know, the only raw genotype flat files available are the VCF files from 1000GP, as just pointed out by lh3. By the way, a -1 vote for a comment on ethical implications? Sure, the answer is not helpful since it doesn't point you to the data you need, but I wouldn't go that far, since this is in fact a matter of the highest importance for current genetic repositories. We ourselves have to deal every day with decisions about anonymizing data in order to publish it.

written 9.7 years ago by Jorge Amigo (12k)

There definitely is individual SNP information at Genomes Unzipped as well.

written 9.7 years ago by Chris Evelo (10k)

chris, can you point me at this data?

written 9.7 years ago by Michael Dondrup (48k)

so, maybe I was wrong here, chris could you point me at this data?

written 9.7 years ago by Michael Dondrup (48k)

Which is of course precisely what people do when they upload their data to Genomes Unzipped. [BTW I would give you a vote for this, but then your remarks about -1 wouldn't make sense anymore ;-) ]

written 9.7 years ago by Chris Evelo (10k)

Don't bother about votes, I just want to find out the true answer, and I like to be corrected, but so far I don't believe I am wrong. Why? Because the question was about dbSNP. I still claim, that the data the question asked for do not exist due to practical consent problems from the sources the SNPs are derived from (mainly HapMap). So far, none of the other answers has proven the opposite.

written 9.7 years ago by Michael Dondrup (48k)

Genotype information is certainly in dbSNP. You can get that in the VCF format from the dbSNP ftp. For some SNPs, you can also get the genotypes of ~1200 HapMap3 samples from the dbSNP web interface. It is definitely in the SQL database, but Andrea just does not know how to query the dbSNP database to get the information out. Me neither.

written 9.7 years ago by lh3 (32k)

Genotype information for populations as a whole is definitely in the database, but individual genotype information does not appear to be. It isn't in the VCF files either, as far as I can see - that is population-based too. I'm not saying I expected it to be there (because of the reasons mentioned by MD), but I have read a resource saying it is there in the tables I mentioned in the question.

written 9.7 years ago by Andrea_Bio (2.6k)

Genotype information is in VCF and can be queried via the web interface, unless I misunderstood the whole discussion, in which case I apologize. See my edits above.

written 9.7 years ago by lh3 (32k)

sure it is, but is it linked to a single individual or population? sorry I don't understand the vcf format well enough.

written 9.7 years ago by Michael Dondrup (48k)

I'm becoming more confused... Sure, genotype information is present in the VCF files, but is it linked to a single individual or to a population? Or: is it possible to infer the complete genotype of a single individual (though anonymized) for all SNPs in the study, or only per population? If it is per individual, I stand corrected.

Sorry, I don't understand the VCF format well enough.

written 9.7 years ago by Michael Dondrup (48k)

I am following this thread. The VCF gives the genotypes of each individual, not per population. VCFs are organized by population in dbSNP purely for users' convenience; dbSNP could combine all the populations together and produce one huge VCF. Much stronger evidence that the dbSNP database has individual genotypes is that we can query this information via the web interface.

written 9.7 years ago by lh3 (32k)
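
To make the per-individual layout of a VCF concrete, here is a minimal Python sketch. It relies only on the standard VCF column layout (eight fixed columns, then FORMAT, then one column per sample) and assumes every record has a GT field. The function name, the sample names and the variant in the example are invented placeholders, not real dbSNP data.

```python
def individual_genotypes(vcf_text):
    """Map each variant ID to {sample_name: genotype alleles} from VCF text."""
    samples = []
    result = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # one column per individual, after FORMAT
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF, then ALT alleles
        gt_index = fields[8].split(":").index("GT")   # position of GT in FORMAT
        calls = {}
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index]
            sep = "|" if "|" in gt else "/"  # phased vs unphased call
            calls[sample] = sep.join(
                "." if a == "." else alleles[int(a)] for a in gt.split(sep)
            )
        result[fields[2]] = calls
    return result


# Invented two-sample example; names and variant are placeholders.
example = "\n".join([
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "1\t100\trsExample\tA\tG\t.\t.\t.\tGT\t0/1\t1|1",
])
print(individual_genotypes(example))
```

Each sample column holds that individual's own call, which is why a by-population VCF still gives per-individual genotypes.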

I deleted a comment because I wasn't concentrating when I wrote it, wrote the wrong thing and caused confusion. To summarize, I agree with MD's edit.

written 9.7 years ago by Andrea_Bio (2.6k)

lh3 and andrea, thank you for the clarification.

written 9.7 years ago by Michael Dondrup (48k)
Khader Shameer (18k, Manhattan, NY) wrote 9.7 years ago:

If you are looking for individual genotype data, another option is dbGaP. You can get de-identified data from dbGaP. See data access policy here. Availability of data will depend upon the embargo date and approval from the PI of the project.


Thanks, Khader. This is an important, often overlooked resource.

written 9.7 years ago by Larry_Parnell (16k)
Michael Dondrup (48k, Bergen, Norway) wrote 9.7 years ago:

Another idea: why don't we simply ask the dbSNP operators? Click 'Contact us' at http://www.ncbi.nlm.nih.gov/projects/SNP/ They should know best how their database is designed and could comment on the possible motivation. I leave that to Andrea because it's his/her question. I guess you will get a definitive answer, that's much better than our speculation.


I have never received a reply to any email to dbSNP, and neither have any of my colleagues. I appreciate that it is the logical first port of call, but it is not a very responsive one.

written 9.7 years ago by Andrea_Bio (2.6k)