what sql schema to store a vcf file ?
2
1
Entering edit mode
8.3 years ago
sacha ★ 2.4k

I would like to know what kind of sql schema do you propose to store a vcf file ? Is there a common way to proceed ? I imagine a table "variant" which contains : chr start ref alt and a second table "sample" with a n-n relation . But How to store genotype ? How to store info field ?

sql vcf variant annotation • 7.4k views
ADD COMMENT
2
Entering edit mode

I imagine a table "variant" which contains : chr start ref alt and a second table

not so easy, there is more than one ALT per variant... ALT can be a symbolic allele, a very large string, etc...

I wrote a vcf2sql : https://github.com/lindenb/jvarkit/wiki/VCF2SQL but in the end, it was useless. I found it easier to only store the path to the tabix-indexed VCF files

see also : Vcf And Mongodb

ADD REPLY
1
Entering edit mode

Is MongoDB a better alternative than postgreSQL ? What's your opinion ? my goal is to perform some set operation like : listVariantA - ( ListVariantB | ListVariantC) . I m not sure Nosql database is able to do this job faster.

ADD REPLY
1
Entering edit mode

listVariantA - ( ListVariantB | ListVariantC)

GATK SelectVariants with "concordance" or "discordance" https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

ADD REPLY
0
Entering edit mode

I know.. ! But Thanks!

ADD REPLY
0
Entering edit mode

What about duplicating lines ? A > T,C become : A > T A > C

ADD REPLY
0
Entering edit mode

what about het variants ? storing genotypes like "1/2" == "T/C"

ADD REPLY
0
Entering edit mode

Dear Pierre,

Kindly help me with this question, Filtering multisample VCF based on genotype using SnpSift filter

ADD REPLY
0
Entering edit mode

I've seen this asked many times and have never seen a good, compelling answer. Which makes me suspect that many people think SQL is for some reason, not a good solution and they prefer to use tools which work with the flat file.

ADD REPLY
3
Entering edit mode
8.3 years ago

I'd suggest you look at the Gemini database schema (readthedocs seems to be down at the moment, but I'm pretty sure that's the right page)

ADD COMMENT
0
Entering edit mode

Thanks ! I already look on this ! I get some ideas !

ADD REPLY
0
Entering edit mode

Thanks, Andrew and Pierre for your links.

Hi Sacha, MongoDB vs MySQL which is good enough to build VCF database for multiple samples?

ADD REPLY
0
Entering edit mode

that's a good question. MongoDB is really powerful to store Data as Document. It should be faster than sql to display data if you query has low complexity.
But I guess Sql is a better option if you need to perform complex query like Intersection . The main problem with the VCF file is optional data as key-value. It's easy to implement with mongodb because it a key:value database. But sql database needs fixed tables. For my project I use PostgreSQL witch support natively JSON type and Array Type .I store my key-value data into an Array type field. { { key, value}, {key, value}}

ADD REPLY
0
Entering edit mode
8.3 years ago
sacha ★ 2.4k

Attach there is my sql schema to store vcf. It's looks pretty similar than variant_tools database

Briefly : VCF is file . This file has samples. Each sample is attached to a variant . The variant database contains chr,pos,ref,alt as a unique key. info field in "sample_has_variant" contains optional values( Postgre Array type).
genotype field contains : 0 = homozygotes ref , 1 = heterozygote 2 = homozygote alt -1 = composite

Sql database schema

ADD COMMENT
0
Entering edit mode

Thanks, Sacha.

In the case of multiple sample VCF, do you vcf-subset each sample into a single sample VCF file?

Can we upload multisample VCF into a MySQL table by changing the number of columns dynamically?

ADD REPLY
0
Entering edit mode

I suggest you to check variant_tools software. They load multisample vcf file in an sqlite file. It works pretty .

ADD REPLY

Login before adding your answer.

Traffic: 802 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6