Question: what sql schema to store a vcf file ?
1
gravatar for sacha
3.9 years ago by
sacha1.9k
France
sacha1.9k wrote:

I would like to know what kind of sql schema do you propose to store a vcf file ? Is there a common way to proceed ? I imagine a table "variant" which contains : chr start ref alt and a second table "sample" with a n-n relation . But How to store genotype ? How to store info field ?

sql variant annotation vcf • 3.1k views
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by sacha1.9k
2

I imagine a table "variant" which contains : chr start ref alt and a second table

not so easy, there is more than one ALT per variant... ALT can be a symbolic allele, a very large string, etc...

I wrote a vcf2sql : https://github.com/lindenb/jvarkit/wiki/VCF2SQL but in the end, it was useless. I found it easier to only store the path to the tabix-indexed VCF files

see also : Vcf And Mongodb

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum128k
1

Is MongoDB a better alternative than postgreSQL ? What's your opinion ? my goal is to perform some set operation like : listVariantA - ( ListVariantB | ListVariantC) . I m not sure Nosql database is able to do this job faster.

ADD REPLYlink written 3.9 years ago by sacha1.9k
1

listVariantA - ( ListVariantB | ListVariantC)

GATK SelectVariants with "concordance" or "discordance" https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum128k

I know.. ! But Thanks!

ADD REPLYlink written 3.9 years ago by sacha1.9k

What about duplicating lines ? A > T,C become : A > T A > C

ADD REPLYlink written 3.9 years ago by sacha1.9k

what about het variants ? storing genotypes like "1/2" == "T/C"

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum128k

Dear Pierre,

Kindly help me with this question, Filtering multisample VCF based on genotype using SnpSift filter

ADD REPLYlink written 3.9 years ago by bioinforesearchquestions280

I've seen this asked many times and have never seen a good, compelling answer. Which makes me suspect that many people think SQL is for some reason, not a good solution and they prefer to use tools which work with the flat file.

ADD REPLYlink written 3.9 years ago by Neilfws48k
3
gravatar for andrew.j.skelton73
3.9 years ago by
London
andrew.j.skelton735.9k wrote:

I'd suggest you look at the Gemini database schema (readthedocs seems to be down at the moment, but I'm pretty sure that's the right page)

ADD COMMENTlink written 3.9 years ago by andrew.j.skelton735.9k

Thanks ! I already look on this ! I get some ideas !

ADD REPLYlink written 3.9 years ago by sacha1.9k

Thanks, Andrew and Pierre for your links.

Hi Sacha, MongoDB vs MySQL which is good enough to build VCF database for multiple samples?

ADD REPLYlink written 3.9 years ago by bioinforesearchquestions280

that's a good question. MongoDB is really powerful to store Data as Document. It should be faster than sql to display data if you query has low complexity.
But I guess Sql is a better option if you need to perform complex query like Intersection . The main problem with the VCF file is optional data as key-value. It's easy to implement with mongodb because it a key:value database. But sql database needs fixed tables. For my project I use PostgreSQL witch support natively JSON type and Array Type .I store my key-value data into an Array type field. { { key, value}, {key, value}}

ADD REPLYlink written 3.9 years ago by sacha1.9k
0
gravatar for sacha
3.9 years ago by
sacha1.9k
France
sacha1.9k wrote:

Attach there is my sql schema to store vcf. It's looks pretty similar than variant_tools database

Briefly : VCF is file . This file has samples. Each sample is attached to a variant . The variant database contains chr,pos,ref,alt as a unique key. info field in "sample_has_variant" contains optional values( Postgre Array type).
genotype field contains : 0 = homozygotes ref , 1 = heterozygote 2 = homozygote alt -1 = composite

Sql database schema

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by sacha1.9k

Thanks, Sacha.

In the case of multiple sample VCF, do you vcf-subset each sample into a single sample VCF file?

Can we upload multisample VCF into a MySQL table by changing the number of columns dynamically?

ADD REPLYlink written 3.9 years ago by bioinforesearchquestions280

I suggest you to check variant_tools software. They load multisample vcf file in an sqlite file. It works pretty .

ADD REPLYlink written 3.9 years ago by sacha1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1250 users visited in the last hour