Question

Forum: Create a file format to store variants

2

6.4 years ago

sacha ★ 2.4k

Hi,

The BED and VCF formats are not well designed to store HGVS notation and variant metadata.

I would like a dedicated file format for storing a list of variants in HGVS notation together with different kinds of data, to make them easier to share, import, export ...

For instance, here is how I would store one variant in a dedicated JSON format:

{
  "variants": [
    {
      "chr": "chr3",
      "pos": "23424",
      "cdna": "c.324A>G",
      "protein": "p.(H234V)",
      "class": 4,
      "transcript": "NM_000249.3",
      "comments": "This is a test",
      "dbSNP": "rs324234",
      "samples": [
        {
          "id": "sampleID",
          "family_id": "famID",
          "comment": "this is a man",
          "phenotypes": "HPO:23424, HPO:234234",
          "date": "234234234"
        }
      ]
    }
  ]
}
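
A minimal sketch of how such a file could be read and checked in Python, assuming the top-level key is quoted as "variants" so that the document is valid JSON. The function name load_variants and the set of required fields are assumptions made for illustration; they are not part of any existing standard.

```python
import json

# Hypothetical set of required fields, taken from the example record above.
REQUIRED_FIELDS = {"chr", "pos", "cdna", "transcript"}


def load_variants(text):
    """Parse the proposed JSON layout and return the list of variant records."""
    doc = json.loads(text)
    variants = doc["variants"]
    for index, variant in enumerate(variants):
        missing = REQUIRED_FIELDS - variant.keys()
        if missing:
            raise ValueError(f"variant {index} is missing fields: {sorted(missing)}")
    return variants


example = """
{
  "variants": [
    {
      "chr": "chr3",
      "pos": "23424",
      "cdna": "c.324A>G",
      "protein": "p.(H234V)",
      "class": 4,
      "transcript": "NM_000249.3",
      "samples": [{"id": "sampleID", "family_id": "famID"}]
    }
  ]
}
"""

variants = load_variants(example)
print(variants[0]["transcript"], variants[0]["cdna"])
```

Keeping the samples nested under each variant is what makes the layout hierarchical; a flat export would have to repeat the variant once per sample.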

Then I can imagine a tool to manage this format, with different features:

Import / export (BED, VCF, SQL, NoSQL, tabular, CSV ...)
Create a static web page from a variant list (see this)
Get statistics from the command line
And many more ...
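
As a rough illustration of the export and statistics features listed above, here is a Python sketch that flattens variant records to CSV and counts variants per class. The function names to_csv and class_counts, the column selection, and the second sample record are all invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical column selection; the nested "samples" list is left out of the flat export.
COLUMNS = ["chr", "pos", "cdna", "protein", "class", "transcript", "dbSNP"]


def to_csv(variants):
    """Flatten variant records into CSV text, one row per variant."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        extrasaction="ignore",  # skip keys that are not in COLUMNS, e.g. "samples"
        lineterminator="\n",
    )
    writer.writeheader()
    for variant in variants:
        writer.writerow(variant)  # missing keys are written as empty cells
    return buffer.getvalue()


def class_counts(variants):
    """Count how many variants fall into each classification value."""
    return Counter(variant.get("class") for variant in variants)


variants = [
    {"chr": "chr3", "pos": "23424", "cdna": "c.324A>G", "protein": "p.(H234V)",
     "class": 4, "transcript": "NM_000249.3", "dbSNP": "rs324234", "samples": []},
    # Second record is made up purely to give the counter something to count.
    {"chr": "chr3", "pos": "23500", "cdna": "c.400C>T",
     "class": 3, "transcript": "NM_000249.3"},
]

csv_text = to_csv(variants)
print(csv_text)
print(dict(class_counts(variants)))
```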

The format could be JSON-based or maybe HDF5-based. The BIOM file format inspired me.

Do you already know of something similar? Is it a good idea? If so, help is welcome. You can also suggest other required fields.

hgvs json variant • 2.7k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 6.4 years ago by sacha ★ 2.4k

score 9 · Answer 1 · 2018-06-26

9

6.4 years ago

andrew.j.skelton73 6.6k

[image: xkcd comic]

The VCF spec is capable of capturing extra metadata around variants, including gene information; see tools such as VEP or ANNOVAR. Granted, it's not the most glamorous of implementations, but it works. To do this you'd need a tool that converts from VCF to your new format, and you'd have to show significant improvements over the base VCF format for people to even consider moving away. VCF is almost like the SAM spec: it's not ideal, but it's so ingrained in common practice that moving away from it, or even tweaking the spec, is a huge job with many downstream compatibility hurdles.

ADD COMMENT • link 6.4 years ago by andrew.j.skelton73 6.6k

0

:D I know this xkcd too! That's why I am asking. I didn't find any standard that stores variants in a hierarchical structure. VCF and BED files don't store HGVS notation, and the same goes for patient IDs or family IDs. So it is actually going from 0 standards to 1 standard!

ADD REPLY • link 6.4 years ago by sacha ★ 2.4k

1

The VCF INFO field can have arbitrary keys, so it can store any additional data if you want it to.

For example, SnpEff adds a lot of information, all in the VCF format: http://snpeff.sourceforge.net/SnpEff_manual.html
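
To make the point concrete, here is a small Python sketch of custom INFO keys carrying HGVS strings. The key names HGVS_C, HGVS_P and TRANSCRIPT are made up for this example (they are not the keys SnpEff or any other tool writes), and in a real file each one would need a matching ##INFO header line.

```python
# Example header line a real file would declare for one of the hypothetical keys:
# ##INFO=<ID=HGVS_C,Number=1,Type=String,Description="cDNA-level HGVS (hypothetical key)">

FORBIDDEN = " ;="  # characters that would break the key=value;key=value layout


def format_info(fields):
    """Join key/value pairs into a VCF INFO string."""
    for key, value in fields.items():
        if any(ch in str(value) for ch in FORBIDDEN):
            raise ValueError(f"value for {key} contains a reserved character: {value!r}")
    return ";".join(f"{key}={value}" for key, value in fields.items())


def parse_info(info):
    """Split an INFO string into a dict; flag keys without a value map to True."""
    if info == ".":  # "." means the INFO column is empty
        return {}
    result = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        result[key] = value if separator else True
    return result


fields = {"HGVS_C": "c.324A>G", "HGVS_P": "p.(H234V)", "TRANSCRIPT": "NM_000249.3"}
info = format_info(fields)
print(info)
print(parse_info(info))
```

Values are kept as plain strings here; a fuller parser would use the Number and Type declared in the header to split lists and convert numbers.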

ADD REPLY • link 6.4 years ago by igor 13k

0

Well, you can certainly have a multi-sample annotated VCF file; however, you're correct that sample-wise metadata is typically captured in pedigree files.
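
For illustration, sample-level metadata from a pedigree (PED) file can be joined to the VCF sample columns by ID. This is a minimal Python sketch, assuming the common six-column, whitespace-separated PED layout (family ID, individual ID, father, mother, sex, phenotype); the sample names and the parse_ped helper are invented for the example.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str


def parse_ped(text):
    """Return {individual_id: PedRecord} from six-column PED text."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        record = PedRecord(*fields[:6])
        records[record.individual_id] = record
    return records


ped_text = """\
fam1 child1 dad1 mom1 1 2
fam1 dad1 0 0 1 1
fam1 mom1 0 0 2 1
"""

# Sample names as they would appear after FORMAT on the VCF #CHROM header line.
vcf_samples = ["child1", "dad1", "mom1"]

pedigree = parse_ped(ped_text)
for sample in vcf_samples:
    print(sample, pedigree[sample].family_id)
```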

ADD REPLY • link 6.4 years ago by andrew.j.skelton73 6.6k

0

My favourite XKCD comic.

ADD REPLY • link 6.4 years ago by Joe 21k

score 2 · Answer 2 · 2018-06-26

2

6.4 years ago

Pierre Lindenbaum 164k

Global Alliance for Genomics and Health (GA4GH) schema: http://ga4gh-schemas.readthedocs.io/en/latest/schemas/variants.proto.html

ADD COMMENT • link 6.4 years ago by Pierre Lindenbaum 164k

score 2 · Answer 3 · 2018-06-26

2

6.4 years ago

steve ★ 3.5k

Another format I recently learned about is from the HL7 FHIR standard:

https://www.hl7.org/fhir/genomics.html

https://www.hl7.org/fhir/sequence-example-fda.json.html

ADD COMMENT • link 6.4 years ago by steve ★ 3.5k

score 2 · Answer 4 · 2018-06-26

2

6.4 years ago

d-cameron ★ 2.9k

This question is actually very well timed. The GA4GH file formats working group has an upcoming teleconference on the future of VCF and a potential new file format. I strongly recommend getting involved if limitations in the existing file formats make them unsuitable for your use case.

ADD COMMENT • link 6.4 years ago by d-cameron ★ 2.9k

0

Who are those convening the meeting? It should ideally be people who have frequently used the format, with representation from across the globe.

ADD REPLY • link 6.4 years ago by Kevin Blighe 88k

0

Where and how can I participate?

ADD REPLY • link 6.4 years ago by sacha ★ 2.4k

score 0 · Answer 5 · 2018-11-29

0

6.0 years ago

bdolin ▴ 100

Here is a link to the latest FHIR spec: http://build.fhir.org/ig/HL7/genomics-reporting/

I also have a fairly simple mapping from VCF to this FHIR format, if anyone is interested.

ADD COMMENT • link 6.0 years ago by bdolin ▴ 100