Question

Do People Import VCF Files Into Databases?

7


13.1 years ago

Jeremy Leipzig 22k

I get the feeling that VCF has crossed over into the "executable-only" world, in the same way that NGS meant no one stores reads in an RDBMS anymore.

Does the 1000 Genomes (1kg) group rely on vcftools/bcftools for all their queries, or did they import the data into a database of some kind?

vcf database • 8.5k views


score 7 · Answer 1 · 2011-04-08

I see the VCF format as a very useful way of sharing huge amounts of data. It is true that users can retrieve all the information they need by querying the VCF files directly (running tabix and so on), but that only works for publishing simple tabular data that is not linked to anything else.

1kg published lots of genotypes, among many other things, as VCF files to make data retrieval fast and intuitive, but they have also bulk uploaded all those genotypes to dbSNP. To me that is a very efficient and welcome way of sharing their data: not only do they rely on the largest repository of human variation to host their results, they also let people retrieve the raw results from their site as VCF files, which you can bulk download and process as desired, or from which you can extract only the information you are interested in. In my honest opinion, the more you help users access your data, the more it is accessed, and ultimately the more useful it is.

If your question was looking for someone from the 1kg project to say "well yes, we are currently building a new db containing all that information", I would also like to see their answer here, but all I've heard is that they were somehow trying to integrate all that information (not only genotypes and frequencies) through Ensembl. Let's see what the coming months bring!

score 4 · Answer 2 · 2011-04-09


13.1 years ago

lh3 33k

If you want to build a web-based database, SQL is still useful. But for daily data processing, using a SQL database only adds overhead. The same is true for many other types of data.

You do not need to use vcftools/bcftools. Write your own scripts and you can probably process VCF even faster. The downside of those tools is that they parse everything, while frequently you only need bits of information.
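As a rough illustration of the "only parse the bits you need" idea, here is a minimal Python sketch that pulls a single INFO key out of each record and never splits the genotype columns. The function name `iter_info_field` and the tiny example VCF are made up for this illustration, and no speed comparison against vcftools/bcftools has been made here.

```python
import io

def iter_info_field(lines, key):
    """Yield (chrom, pos, value) for one INFO key, ignoring everything else.

    Hypothetical helper for illustration; not part of any VCF library.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        # Split off only the first 8 columns; the FORMAT and sample
        # columns stay together in one unparsed trailing string.
        fields = line.rstrip("\n").split("\t", 8)
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        for entry in info.split(";"):
            name, _, value = entry.partition("=")
            if name == key:
                yield chrom, pos, value
                break

# Tiny made-up VCF, used only to exercise the sketch.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5\tGT\t0|0\t1|0\n"
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT\t0|0\t0|1\n"
)

allele_freqs = list(iter_info_field(io.StringIO(EXAMPLE_VCF), "AF"))
print(allele_freqs)
```

For a real file you would pass an open file handle (or a `gzip.open(..., "rt")` handle for a compressed VCF) instead of the `io.StringIO` wrapper.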


1

Sure. If the kind of work you do with a large VCF file is occasional small queries, you can keep working with the VCF files through vcftools/bcftools. But if you expect to work with that data in a deeper and more efficient way, I too believe that parsing the file and storing it in a homemade database is the best option. Indeed, that is exactly what we've done with the 1kg data, as we wanted to query it in a very fast and flexible manner, retrieving collateral information that is better precalculated than computed on the fly.

written 13.1 years ago by Jorge Amigo 14k
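To make the "parse the file and store it in a homemade database" route concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `load_vcf` helper, and the example data are all assumptions made for illustration; this is not the schema described in the comment above. It also assumes one ALT allele per record, so that `AF` holds a single number.

```python
import io
import sqlite3

def load_vcf(conn, lines):
    """Load site-level columns of a VCF into a 'variant' table.

    Hypothetical helper; assumes biallelic records with a single AF value.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT, af REAL)"
    )
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        cols = line.rstrip("\n").split("\t")[:8]
        chrom, pos, vid, ref, alt, _qual, _filter, info = cols
        # Turn "NS=3;DP=14;AF=0.5" into {"NS": "3", "DP": "14", "AF": "0.5"}.
        info_map = dict(e.partition("=")[::2] for e in info.split(";"))
        af = info_map.get("AF")
        rows.append((chrom, int(pos), vid, ref, alt, float(af) if af else None))
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Index on position so that region queries do not scan the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS variant_pos ON variant (chrom, pos)")
    conn.commit()

# Tiny made-up VCF, used only to exercise the sketch.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5\tGT\t0|0\t1|0\n"
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT\t0|0\t0|1\n"
)

conn = sqlite3.connect(":memory:")
load_vcf(conn, io.StringIO(EXAMPLE_VCF))

# Example of a flexible query: rare variants, ordered by position.
rare = conn.execute(
    "SELECT chrom, pos, af FROM variant WHERE af < 0.05 ORDER BY pos"
).fetchall()
print(rare)
```

Precalculated values, such as the allele frequency stored here, become ordinary columns that can be filtered and joined against other tables, which is the kind of flexibility the comment describes.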