Store Structural Variants Into Vcf
8.2 years ago

Hi, following up on Represent Precise Deletion In Vcf, I have some more questions about other structural variants in VCF, so I will try to put them all into this post.

• Duplication

                     123         456
reference genome -----[           ]-------------------------------------------
                     123         456              789         1122
sample genome    -----[           ]----------------[           ]--------------


Here is an example of a duplication, but I don't know how to interpret POS and END. Would END be 456 in this example, or 1122? And what about POS?

I think with breakends it would look like this (let's say the duplication occurs on chromosome 1):

#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=DUP0
1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=DUP0
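As a minimal sketch, the two breakend records above could be generated with a small helper. The function names `bnd_alt` and `bnd_record` are hypothetical, and the `.` placeholder in REF and ALT is copied from the records above; a real file would normally carry the actual reference base there.

```python
def bnd_alt(base, mate_chrom, mate_pos, bracket, joined_after):
    """Format a breakend ALT string in VCF breakend notation.

    base: the base at POS (the post uses "." as a placeholder)
    bracket: "[" or "]", written on both sides of the mate position
    joined_after: True for the forms t[p[ and t]p] (mate written after base),
                  False for the forms ]p]t and [p[t (mate written before base)
    """
    if bracket not in ("[", "]"):
        raise ValueError("bracket must be '[' or ']'")
    mate = f"{bracket}{mate_chrom}:{mate_pos}{bracket}"
    return base + mate if joined_after else mate + base


def bnd_record(chrom, pos, alt, event, ref="."):
    """Build one tab-separated VCF data line with SVTYPE=BND."""
    info = f"SVTYPE=BND;EVENT={event}"
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", info])


# Reproduce the two records from the post (event name DUP0 as in the post).
records = [
    bnd_record("1", 788, bnd_alt(".", "1", 123, "[", True), "DUP0"),
    bnd_record("1", 789, bnd_alt(".", "1", 456, "]", False), "DUP0"),
]
for line in records:
    print(line)
```

This only formats the strings; it does not check whether the chosen bracket orientation describes the intended rearrangement.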


But I would also like to know whether there is a simpler way to represent it.

• Translocation

                     123         456
reference genome -----[           ]-------------------------------------------
                                                  789         1122
sample genome    ----------------------------------[           ]--------------


I think I can use a deletion entry plus the same breakend entries as above for the duplication:

#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       123 .  .   .<DEL>   .    .      SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0
1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=TRANS0
1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=TRANS0
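The INFO string of the deletion record above can be derived from POS and END. This is a sketch under one assumption: POS is taken as the base just before the deleted segment, so the deleted length is END - POS, which reproduces the SVLEN=-333 used above. The helper name `deletion_info` is hypothetical.

```python
def deletion_info(pos, end, event=None):
    """Build the INFO string for a <DEL> record.

    Assumes POS is the base before the deleted segment, so the number of
    deleted bases is END - POS; SVLEN is negative for deletions.
    """
    if end < pos:
        raise ValueError("END must not be smaller than POS")
    svlen = -(end - pos)
    fields = ["SVTYPE=DEL", f"END={end}", f"SVLEN={svlen}"]
    if event is not None:
        fields.append(f"EVENT={event}")
    return ";".join(fields)


print(deletion_info(123, 456, "TRANS0"))
```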


But maybe there is another way to store this.

• Insertion

What if I don't know the precise sequence of the insertion? I know that I have to put <INS> into the ALT column, but what about the sequence itself? The first thing that comes to mind is to create a new meta-information line, something like this:

##INFO=<ID=ISEQ,Number=1,Type=String,Description="Imprecise inserted sequence">


Then I can store it in the INFO column and maybe create further meta-information lines that describe the confidence in the beginning and end of this sequence:

##INFO=<ID=CINSBEGIN,Number=1,Type=Integer,Description="Confidence begin of inserted sequence">
##INFO=<ID=CINSEND,Number=1,Type=Integer,Description="Confidence end of inserted sequence">


Example:

#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       123 .  .   .<INS>   .    .      SVTYPE=INS;END=123;ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1


I can interpret it as an insertion of one of these possible sequences: ATTCGATCA, TTCGATCA, TCGATCA, ATTCGATC, TTCGATC, TCGATC. So I am sure about the insertion of the sequence TCGATC, but there could be possible prefixes (T, AT) and a suffix (A). I hope I made it clear.
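To make the intended interpretation concrete, here is a small sketch that expands the example record into the list of candidate sequences. ISEQ, CINSBEGIN and CINSEND are the custom tags proposed in this post, not standard VCF keys, and the function names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flags without '=' map to True)."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def candidate_insertions(iseq, cins_begin, cins_end):
    """List every sequence obtained by trimming up to cins_begin bases
    from the start and up to cins_end bases from the end of iseq."""
    candidates = []
    for trim_end in range(cins_end + 1):
        for trim_begin in range(cins_begin + 1):
            candidates.append(iseq[trim_begin:len(iseq) - trim_end])
    return candidates


info = parse_info("SVTYPE=INS;END=123;ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1")
seqs = candidate_insertions(
    info["ISEQ"], int(info["CINSBEGIN"]), int(info["CINSEND"])
)
print(seqs)
# ['ATTCGATCA', 'TTCGATCA', 'TCGATCA', 'ATTCGATC', 'TTCGATC', 'TCGATC']
```

The last element, with both ends fully trimmed, is the core sequence that is certain to be inserted.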


This is not a real answer, but you might find this blog post quite useful.
http://core-genomics.blogspot.com/2011/07/understanding-mutation-nomenclature.html


The terminology and definitions are unexpectedly complicated; it is quite surprising how many corner cases and ambiguities exist.