Hi,
Neither the BED format nor the VCF format is well designed to store the HGVS notation and metadata of a variant.
I would like a dedicated file format that stores a list of variants in HGVS notation together with different kinds of data, and that makes them easier to share, import, export ...
For instance, here is how I would store one variant in a dedicated JSON format:
{
"variants": [
{
"chr": "chr3",
"pos": "23424",
"cdna": "c.324A>G",
"protein": "p.(H234V)",
"class": 4,
"transcript": "NM_000249.3",
"comments": "This is a test",
"dbSNP": "rs324234",
"samples": [
{
"id" : "sampleID",
"family_id": "famID",
"comment": "this is a man",
"phenotypes": "HPO:23424, HPO:234234",
"date" : "234234234"
} ]
}
]
}
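To make the idea concrete, here is a minimal Python sketch of how such a file could be read and checked. The set of required keys is only an assumption for illustration (nothing above fixes which fields are mandatory), and `load_variants` is a hypothetical helper name, not an existing tool.

```python
import json

# Assumed minimum set of keys per variant; purely illustrative.
REQUIRED_VARIANT_KEYS = {"chr", "pos", "cdna", "transcript"}


def load_variants(text):
    """Parse a JSON document and return its variant list, checking required keys."""
    doc = json.loads(text)
    variants = doc["variants"]
    for i, variant in enumerate(variants):
        missing = REQUIRED_VARIANT_KEYS - variant.keys()
        if missing:
            raise ValueError(f"variant {i} is missing keys: {sorted(missing)}")
    return variants


# A shortened copy of the example record, written as strict JSON.
EXAMPLE = """
{"variants": [{"chr": "chr3", "pos": "23424", "cdna": "c.324A>G",
  "protein": "p.(H234V)", "class": 4, "transcript": "NM_000249.3",
  "samples": [{"id": "sampleID", "family_id": "famID"}]}]}
"""

variants = load_variants(EXAMPLE)
print(variants[0]["transcript"], variants[0]["cdna"])
```

Note that strict JSON requires every key to be quoted, including the top-level "variants" key.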
Then I can imagine a tool to manage this format, with different features:
- Import / export (BED, VCF, SQL, NoSQL, tabular, CSV ...)
- Create a static web page from a variant list ( see this )
- Get statistics from the command line
- And many more ...
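Two of these features, CSV export and simple statistics, could be sketched roughly as follows in Python. The function names `to_csv` and `class_counts` and the choice of exported columns are hypothetical; the nested sample list is simply dropped here, which a real exporter would have to handle differently.

```python
import csv
import io
from collections import Counter


def to_csv(variants, columns=("chr", "pos", "cdna", "protein", "class", "transcript")):
    """Flatten variants into CSV text, ignoring keys outside `columns`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for variant in variants:
        writer.writerow(variant)
    return buffer.getvalue()


def class_counts(variants):
    """Count how many variants fall into each value of the "class" field."""
    return Counter(variant.get("class") for variant in variants)


# One record shaped like the example variant; the values are placeholders.
demo = [
    {"chr": "chr3", "pos": "23424", "cdna": "c.324A>G", "protein": "p.(H234V)",
     "class": 4, "transcript": "NM_000249.3", "samples": []},
]

print(to_csv(demo))
print(class_counts(demo))
```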
The format could be JSON-based or maybe HDF5-based. The BIOM file format inspired me.
Do you already know of something similar? Is it a good idea? If so, help is welcome. You can also suggest other required fields.
:D I know this xkcd too! That's why I am asking. I didn't find any standard that stores variants in a hierarchical structure. VCF and BED files don't store HGVS notation, nor patient IDs or family IDs. So it is actually going from 0 standards to 1 standard!
The VCF INFO field can have arbitrary keys, so it can store any additional data if you want it to.
For example, SnpEff adds a lot of information, all in the VCF format: http://snpeff.sourceforge.net/SnpEff_manual.html
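As a rough illustration, an INFO column is a semicolon-separated list of KEY=VALUE pairs and bare flags, so it can be unpacked with a few lines of Python. The keys HGVS_C, HGVS_P and CLASS below are made up for this example (they are not the keys SnpEff writes), and in a real file each custom key should be declared in an ##INFO header line.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict; flags without a value map to True."""
    fields = {}
    if info == ".":  # "." means the INFO column is empty
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO string carrying HGVS notation under made-up keys.
info = parse_info("DP=100;HGVS_C=c.324A>G;HGVS_P=p.(H234V);CLASS=4;DB")
print(info["HGVS_C"], info["HGVS_P"], info["CLASS"])
```

All values come back as strings, so a field such as CLASS would still need converting to an integer.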
Well, you can certainly have a multi-sample annotated VCF file; however, you're correct that sample-wise metadata is typically captured by pedigree files.
My favourite XKCD comic.