Question: 1000 Genomes Individual Genotype Data
6.8 years ago by
win810
India
win810 wrote:

I was wondering if someone could help.

When using the 1000 Genomes browser, I came across the statement “1000 Genomes individual genotypes display” on the search results page. If I understand correctly, this means that individual genotypes for a variant are not stored in the Ensembl database but in the 1000 Genomes database (public MySQL instance).

If that is true, we can view the genotypes from the 1K genomes browser, but which table in the database contains this information?

There is a table named “compressed_genotype_single_bp”; is this the table that contains this information? And if so, how does one convert the binary data fields back to text?

I am trying to determine the genotypes for several variants, and working with the per-chromosome VCF files is not turning out to be practical: it is very slow and has a huge computational overhead.

Any help in this direction will be highly appreciated.

6.8 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

As Joachim told you, the data is not stored in a database, but in VCF files.

There have been some discussions about storing the contents of VCF files in a SQL database, but the conversion is more difficult than it seems, and in the end people seem to prefer working with the VCF files directly. For example, you can read this commentary by James Casbon, who maintains a Python parser for VCF files, on a Google Summer of Code project to implement a SQL version of VCF files: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html

In any case, working with VCF files is slow only if you try to implement your own parser. There are much better ways to do it: for example, you can use tabix to extract regions from the files on the 1000 Genomes website (search this forum for how to do it, e.g. http://www.biostars.org/search/?q=tabix+1000genomes ), and use VCFtools or PyVCF for more complex operations. I work with a lot of VCF files, and using these tools I have never had any performance problems.
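To make the tabix route concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the `tabix` executable is on your PATH and that the VCF is bgzip-compressed and tabix-indexed. The function names and the file name in the usage note are hypothetical, chosen only for illustration. The small parser here only ever sees the handful of lines tabix returns for the requested region, so it does not carry the cost of scanning a whole chromosome file.

```python
import subprocess


def tabix_command(vcf_path, chrom, start, end):
    """Build the argument list for: tabix -h <vcf> <chrom>:<start>-<end>."""
    return ["tabix", "-h", vcf_path, f"{chrom}:{start}-{end}"]


def genotypes_from_vcf(lines, wanted_samples=None):
    """Yield (chrom, pos, id, ref, alt, {sample: GT}) for each VCF record.

    `lines` is any iterable of VCF text lines that includes the #CHROM
    header line. `wanted_samples` is an optional set of sample IDs.
    """
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs start at column 10
            continue
        if len(fields) < 10:
            continue  # sites-only record, no genotype columns
        # Position of the GT key within the FORMAT column (e.g. GT:DP)
        gt_index = fields[8].split(":").index("GT")
        calls = {}
        for sample, value in zip(samples, fields[9:]):
            if wanted_samples is None or sample in wanted_samples:
                calls[sample] = value.split(":")[gt_index]
        yield fields[0], int(fields[1]), fields[2], fields[3], fields[4], calls


def fetch_genotypes(vcf_path, chrom, start, end, wanted_samples=None):
    """Run tabix on one region and return the parsed genotype records."""
    result = subprocess.run(
        tabix_command(vcf_path, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return list(genotypes_from_vcf(result.stdout.splitlines(), wanted_samples))
```

A call such as `fetch_genotypes("example.chr2.vcf.gz", "2", 39967768, 39967768, {"SAMPLE_A"})` would return only the records in that interval; the file name, coordinates, and sample ID are placeholders, so substitute a real indexed VCF and the sample IDs from its header.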

6.8 years ago by
Joachim2.8k
San Francisco, California
Joachim2.8k wrote:

Yes, individual genotypes of the 1000 genomes project are available according to: http://www.1000genomes.org/faq/can-i-get-individual-genotype-information-browser1000genomesorg

The data is available via FTP: http://www.1000genomes.org/data#DataAccess

There are README files on the FTP site, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/, which explain the directory hierarchy and file contents rather nicely.

Hope that helps.

6.8 years ago by
Laura1.7k
Cambridge UK
Laura1.7k wrote:

As both Joachim and Giovanni have said, the genotypes for the 1000 Genomes data aren't stored in our MySQL instance. Loading the genotypes unfortunately takes longer than is ideal for our website production, so we decided that using the VCF files would be better, both for the speed of producing the website and for the speed of loading the genotypes.

The information others have given should point you in the right direction for getting the genotypes you want quickly and easily.

I would recommend looking at tabix and the VCFtools script vcf-subset, which are both described here:

http://www.1000genomes.org/faq/how-do-i-get-slice-your-vcf-files
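For readers who want to see what restricting a VCF slice to a few individuals involves, the following is a rough Python sketch of column selection only. It is not the vcf-subset script itself and does not reproduce its other options; the function name `subset_samples` is hypothetical. It assumes tab-delimited VCF text that includes the `#CHROM` header line, for example the output of `tabix -h`.

```python
def subset_samples(lines, keep):
    """Yield VCF lines reduced to the fixed columns plus the samples in `keep`.

    `keep` is a list of sample IDs; output columns follow the order of `keep`.
    """
    keep_idx = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            missing = [s for s in keep if s not in fields[9:]]
            if missing:
                raise KeyError(f"samples not in VCF header: {missing}")
            # Column index of each requested sample (samples start at index 9)
            keep_idx = [fields.index(s, 9) for s in keep]
        elif keep_idx is None:
            raise ValueError("data line seen before the #CHROM header line")
        # First 9 columns are CHROM..FORMAT; append only the chosen samples
        yield "\t".join(fields[:9] + [fields[i] for i in keep_idx])
```

Feeding the result to any genotype reader then only touches the columns you asked for, which is the main saving when a file carries many samples and you need a few.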
