dbSNP Individual Genotyping Information for Specific SNPs
6
5
12.0 years ago
Andrea_Bio ★ 2.8k

Hello

I've gone through the schema for dbSNP 132 quite thoroughly and it looks like it is not possible to find the genotype for individuals for a particular SNP submission.

I am using the dbSNP-q resource to look at the schema and query the database directly, so it is possible there are tables missing. I have tried the dbSNP website, but the schema diagram there is 7 years old. I don't have a local installation of dbSNP to examine the schema manually and, as far as I know, dbSNP doesn't provide direct access to its database server.

I can find out which individuals are in a population for a SNP submission (i.e. an ss entry). I can find out the allele frequency and genotype frequency for all the populations related to a SNP submission, but I cannot find the specific genotypes of specific individuals for a specific subSNP (ss). There are no tables linking individuals, ss records and genotypes to give the information in question (the tables AlleleFreqbySSPop and GtyFreqBySSPop give the allele and genotype frequencies for the populations as a whole). I have a paper that says dbSNP provides the information in question 'per chromosome', with the relevant fields chromosomeid, indid, genotypeid and subsnpid. The tables are named SubSNPInd_chr1 etc., but I cannot find any tables matching this description when I use the dbSNP-Q website to query the database.
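For illustration only, here is a minimal sketch of the kind of lookup I am after, assuming a table with just the four fields the paper names. The column types, the rows, the IDs and the helper `genotypes_for_ss` are all invented for the example, and an in-memory SQLite database stands in for the real server. None of this is confirmed dbSNP schema.

```python
import sqlite3

# In-memory stand-in for one per-chromosome table. Only the table name and
# the four column names come from the paper; the types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SubSNPInd_chr1 (
        chromosomeid INTEGER,
        indid        INTEGER,
        genotypeid   INTEGER,
        subsnpid     INTEGER
    )
    """
)

# Invented rows: (chromosomeid, indid, genotypeid, subsnpid).
rows = [
    (1, 101, 7, 1001),
    (1, 102, 8, 1001),
    (1, 101, 9, 1002),
]
conn.executemany("INSERT INTO SubSNPInd_chr1 VALUES (?, ?, ?, ?)", rows)


def genotypes_for_ss(conn, subsnp_id):
    """Return (indid, genotypeid) pairs for one ss record, ordered by indid."""
    cur = conn.execute(
        "SELECT indid, genotypeid FROM SubSNPInd_chr1 "
        "WHERE subsnpid = ? ORDER BY indid",
        (subsnp_id,),
    )
    return cur.fetchall()


print(genotypes_for_ss(conn, 1001))
```

In a real installation the genotypeid would presumably still need to be resolved against a separate genotype lookup table, which I have not tried to model here.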

Please can you answer this question with respect to the dbSNP schema and not the website.

thanks a lot

dbsnp • 5.4k views
4
12.0 years ago
lh3 33k

Have you had a look here?

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/

or here from 1000g?

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/

If you have not found the related schema, perhaps the database schema is too old and has not yet been updated to provide individual information, but this information is certainly there.

EDIT:

If you want to get SNPs in a region for 1000g VCFs, you may do that with tabix:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 1:10,000,000-10,100,000 > output.vcf


Each 1000g VCF itself is a minimal database which can be queried remotely. You cannot do this with dbSNP VCFs as they are not tabix indexed.
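For a VCF that is not tabix indexed, a plain linear scan over a local uncompressed copy gives the same kind of region filter. The following is a minimal sketch of that idea, not what tabix does internally. The helper names `parse_region` and `filter_region` are hypothetical, the records are made up, and the region is assumed to be 1-based and inclusive.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end), ignoring thousands commas."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.replace(",", "").partition("-")
    return chrom, int(start_text), int(end_text)


def filter_region(lines, region):
    """Yield header lines plus the records whose POS lies in the region (inclusive)."""
    chrom, start, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            # Keep the header, as `tabix -h` does.
            yield line
            continue
        fields = line.split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


# Invented records for illustration; the IDs are placeholders.
example_lines = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10000123\texample_snp_1\tA\tG\t.\t.\t.",
    "1\t20000000\texample_snp_2\tC\tT\t.\t.\t.",
    "2\t10000500\texample_snp_3\tG\tA\t.\t.\t.",
]

kept = list(filter_region(example_lines, "1:10,000,000-10,100,000"))
for line in kept:
    print(line)
```

A scan like this reads every line, so it is only practical for modest files; the index is what makes the remote tabix query cheap.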

There is a 1000g browser. You may also try that.

EDIT2:

Now there is the question about whether individual genotypes are available in dbSNP at all. The answer is definite: YES. For now, I can get individual genotype information in two ways. Firstly, I can get that in the VCF format from the official dbSNP ftp, e.g.:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByPopulation/ASW-12156-01.vcf.gz

Secondly, I can acquire individual genotypes for ssIDs, e.g.: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ss.cgi?ss=ss24527973

which means that the individual genotype information is almost certainly stored in the SQL database (I would be greatly surprised otherwise). I do not know how to query the database to get individual genotypes. It is possible that we have not found the right place or dbSNP-q/dbSNP is providing outdated schema.

1

These are the only public raw genotypes I'm aware of at dbSNP. I would love to hear from anyone who can point us to any newly available option.

0

You may get Venter and YH genotypes from their websites (they set up a browser as I remember), but the vast majority of genotype information is provided by 1000g. I believe dbSNP/Ensembl/UCSC will provide individual genotype information in future.

0

You may get Venter and YH genotypes from their websites (they set up a browser as I remember), but the vast majority of genotype information is provided by first HapMap and then 1000g.

0

Hi - thanks for your answer, but my question is specifically about the dbSNP schema, which I did make clear in the question. Even so, this data doesn't tell you which individuals have which genotypes for a SNP.

0

Hi - I agree with your edits. My point was that I cannot see how to get this data from the database directly! I don't have a local installation of dbSNP 132. I am using dbSNPq. The interesting thing is that the tables I seem to be missing did exist a few months ago on dbSNPq (release 131, I forget) and are in their documentation. They have now disappeared from their website for release 132.

0

Hi - I agree with your edits. My point was that I cannot see how to get this data from the database directly! I don't have a local installation of dbSNP 132. I am using dbSNPq. The interesting thing is that the tables I seem to be missing did exist a few months ago on dbSNPq (release 131) and were in their documentation.

0

Yes, I am well aware what you were asking for, but I do not know the answer. Nonetheless, when you talk about dbSNP-q, things become a little complicated because dbSNP-q is not the official release. For example, dbSNP-q developers may intentionally drop the tables if they feel that those tables take too much disk space while few users query on them. They indeed say in the documentation that they modified some schema. If it were me, I would try to contact the dbSNP-q developers.

0

I was able to access the latest version of dbSNP. The tables are indeed there. The dbSNPq site was extremely misleading!

0

BTW, if you keep the upcoming 1000g VCF without compression, it is going to be roughly 1TB, if not more.

0

The dbSNPq developers also didn't respond. I emailed them months ago :(

0

That is good. So the problem is solved - you just use the dbSNP from the official website.

0

You are right - I was able to contact someone with dbSNP 132 installed. The tables are there, called SNPInd.

0

That is good. So the problem is solved - you just use the dbSNP from the official website. Thank you very much for chasing this.

4
12.0 years ago

This question does indeed border on ethical issues. Being able to link genotypes to personal identifiers, even if coded in some way, does raise ethical concerns, particularly about confidentiality. This was essentially why the NIH and Wellcome Trust SNP sites were briefly closed to public access in 2008 (excerpt from Nature News):

Several DNA databases run by the US National Institutes of Health (NIH) in Bethesda, Maryland, the Wellcome Trust in London and the Broad Institute in Cambridge, Massachusetts, were closed to public access last week after researchers showed it is possible to extract the supposedly confidential identities of the patients involved. The databases list the frequencies of small DNA variations called single nucleotide polymorphisms (SNPs) from patient groups.

In the August issue of PLoS Genetics, Nils Homer and his colleagues describe a method to mine individual SNP profiles from complex mixtures, even if the person's DNA is only 0.1% of the total. The method could be useful for ensuring patients are not listed twice when scientists combine data sets, as well as in forensic science.

Since then, other data repositories have certainly become available where one can access data that link a genotype - or haplotype - to a personal identifier. If this is what you really need for your research, then I would mine HapMap. Here, it is easy to obtain those data because that is what this database was designed to allow.

0

Thank you Larry, I think I had that article in mind but had forgotten about it when I wrote my answer, though the potential threat to privacy is even worse than I had imagined. I doubt that such consequences were foreseeable by the participants to the level of informed consent, even after signing the consent forms.

0

I also found it interesting to read the follow-up to the article you mention: Church G et al. 2009 Public Access to Genome-Wide Data: Five Views on Balancing Research with Privacy and Protection. PLoS Genet 5(10): e1000665. doi:10.1371/journal.pgen.1000665

0

Both HapMap and 1000g have a highly experienced subgroup considering ethical issues. I do not know much but I trust them. To me, as long as individual genotypes can be released in HapMap and 1000g, I cannot think of a reason why they cannot be released in dbSNP. I know rules are far more stringent when genotypes and phenotypes are released together. That is why we can rarely get public GWAS data in the US.

0

HapMap is very useful, but in some areas, 1000g is better.

0

Definitely. HapMap and 1000G each have their own strengths. Strictly speaking, dbSNP is a database of variants and not a sequence, haplotype or genotype repository. Including genotype data within dbSNP becomes a database design issue as well as one of broadening the current scope of that database. We'll see what unfolds in the coming months/years.

2
12.0 years ago

Not sure whether that would help, but you can actually install it locally: ftp://ftp.ncbi.nih.gov/snp/database/

0

I have fired up an old installation of dbSNP, and the tables I mention in the question that are supposed to contain this information do not exist.

0

this was version 128 btw

1
12.0 years ago

Edit: It is quite likely that the following assumption is wrong, but I won't delete because it triggered some interesting comments. However BioStar is not well suited for discussion about ethics so I will abstain from further comments on this topic.

In summary, this is what I read out of the other answers: genotype information is available at the per-individual level in the original VCF files, but that information has not been represented in the database, although the data model found in the dbSNP documentation would allow such a relation to be stored. There seems to be a discrepancy between the data model in the documentation and the real data model of the dbSNP SQL database, such that the relation is missing for whatever reason.

I think you won't find that information anywhere, because this is sensitive, personalized medical data. Research codes of ethics and data privacy rules would forbid linking SNP data with an individual and revealing this data without consent. Even when anonymized, a SNP pattern is as unique as a fingerprint and could be used for genetic fingerprinting to identify an individual, infer inheritance relations, and even get information about the genotypes of relatives. This is also the reason GWA study data are anonymized and only released under a strict confidentiality policy upon request.

1

I don't care about the -1 ;) really can stand this :) but in fact, I believe that this data wouldn't be shared for this ethical reason, and in fact I think that even if it would be shared, it shouldn't be. I have been working with some medical researchers for a while and those people have very high ethical standards when it comes to release of patient data and data privacy, they wouldn't share their raw data easily, and most patients wouldn't consent to contribute to the study in that case.

1

Given, of course, that everyone in the sample group gave their consent to their data being linked.

0

what's the problem? ;)

0

Please note that it's always a good idea to leave a comment when voting something down. I don't normally bother, but in this case I would be eager to know whether or not this information exists, which I believe it doesn't, so please, if you don't have anything to contribute, go away.

0

I wouldn't be so sure that the raw genotypes aren't there, although I've never looked for them at dbSNP. As far as I know, the only raw genotype flat files available are the VCF files from 1000GP, as just pointed out by lh3. By the way, a -1 vote for a comment on ethical implications? Sure, the answer is not helpful since it doesn't point you to the data you need, but I wouldn't go that far, since this is in fact a matter of the highest importance for current genetic repositories. We ourselves have to deal every day with decisions about anonymizing data in order to publish it.

0

There definitely is individual SNP information also at genomes unzipped

0

chris, can you point me at this data?

0

so, maybe I was wrong here, chris could you point me at this data?

0

Which is of course precisely what people do when they upload their data to genomes unzipped. [BTW I would give you a vote for this, but then your remarks about -1 wouldn't make sense anymore ;-) ]

0

Don't bother about votes, I just want to find out the true answer, and I like to be corrected, but so far I don't believe I am wrong. Why? Because the question was about dbSNP. I still claim, that the data the question asked for do not exist due to practical consent problems from the sources the SNPs are derived from (mainly HapMap). So far, none of the other answers has proven the opposite.

0

Genotype information is certainly in dbSNP. You can get that in the VCF format from the dbSNP ftp. For some SNPs, you can also get the genotypes of ~1200 HapMap3 samples from the dbSNP web interface. It is definitely in the SQL database, but Andrea just does not know how to query the dbSNP database to get the information out. Me neither.

0

Genotype information for populations as a whole is definitely in the database, but individual genotype information does not appear to be. It isn't in the VCF files either, as far as I can see - that is population based too. I'm not saying I expected it to be there (because of the reasons mentioned by MD), but I have read a resource saying it is there in the tables I mentioned in the question.

0

Genotype information is in VCF and can be queried via the web interface, unless I misunderstood the whole discussion, in which case I apologize. See my edits above.

0

Sure it is, but is it linked to a single individual or to a population? Sorry, I don't understand the VCF format well enough.

0

I'm becoming more confused... Sure, genotype information is present in the VCF files, but is it linked to a single individual or to a population? Or: is it possible to infer the complete genotype of a single individual (though anonymized) for all SNPs in the study, or only per population? If it is per individual, I stand corrected.

Sorry, I don't understand the VCF format well enough.

0

I am following this thread. The VCF gives the genotypes of each individual, not per population. VCFs are organized by population in dbSNP purely for users' convenience. dbSNP could combine all the populations together and produce one huge VCF. Much stronger evidence that the dbSNP database has individual genotypes is that we can query this information via the web interface.
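To make the per-individual layout concrete, here is a minimal sketch of reading genotypes out of VCF text, assuming the standard column order (eight fixed columns, then FORMAT, then one column per sample). The two records, the sample names and the helper `genotypes_by_sample` are invented for illustration and are not taken from any dbSNP or 1000g file.

```python
import re

# Invented two-sample VCF fragment; tabs separate the columns.
VCF_TEXT = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B\n"
    "1\t10000123\texample_snp_1\tA\tG\t.\t.\t.\tGT\t0|1\t1|1\n"
    "1\t10000456\texample_snp_2\tC\tT\t.\t.\t.\tGT\t0/0\t./.\n"
)


def genotypes_by_sample(vcf_text):
    """Map each variant ID to {sample name: genotype written as alleles}."""
    samples = []
    result = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # one column per individual
            continue
        alleles = [fields[3]] + fields[4].split(",")  # index 0 is REF
        gt_index = fields[8].split(":").index("GT")
        calls = {}
        for sample, column in zip(samples, fields[9:]):
            codes = re.split(r"[/|]", column.split(":")[gt_index])
            calls[sample] = "/".join(
                "." if code == "." else alleles[int(code)] for code in codes
            )
        result[fields[2]] = calls
    return result


for variant_id, calls in genotypes_by_sample(VCF_TEXT).items():
    print(variant_id, calls)
```

Each sample column carries that individual's own call, which is why a per-population file still yields per-individual genotypes.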

0

I deleted a comment because I wasn't concentrating when I wrote it, wrote the wrong thing, and caused confusion. To summarize, I agree with MD's edit.

0

lh3 and andrea, thank you for the clarification.

1
12.0 years ago

If you are looking for individual genotype data, another option is dbGaP. You can get de-identified data from dbGaP. See data access policy here. Availability of data will depend upon the embargo date and approval from the PI of the project.

0

Thanks, Khader. This is an important, often overlooked resource.

0
12.0 years ago

Another idea: why don't we simply ask the dbSNP operators? Click 'Contact us' at http://www.ncbi.nlm.nih.gov/projects/SNP/ They should know best how their database is designed and could comment on the possible motivation. I leave that to Andrea because it's his/her question. I guess you will get a definitive answer, that's much better than our speculation.

2

I have never received a reply to any email to dbSNP, and neither have any of my colleagues. I appreciate that it is the logical first port of call, but it is not a very responsive one.