Storing Structural Variants in VCF
8.2 years ago

Hi, following up on "Represent Precise Deletion In Vcf", I have some more questions about other structural variants in VCF, so I'll try to put them all into this post.

  • Duplication

                         123         456
    reference genome -----[           ]-------------------------------------------
                         123         456              789         1122
    sample genome    -----[           ]----------------[           ]--------------
    

    Here is an example of a duplication, but I don't know how to interpret POS and END. Would END be 456 in this example, or 1122? And what about POS?

    I think with breakends it will look like this (let's say duplication occurs on chromosome 1):

    #CHROM  POS ID REF ALT      QUAL FILTER INFO
    1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=DUP0
    1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=DUP0
    

    But I would also like to know whether there is a simpler way to represent this.
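As a minimal sketch, the two breakend records above could be generated from the coordinates in the diagram. The function name `duplication_breakends` and its parameters are hypothetical, and the `.` placeholders in REF and ALT simply mirror the records in this post; a real file would normally carry the reference base there.

```python
def duplication_breakends(chrom, src_start, src_end, ins_pos, event):
    """Build two BND lines for a copy of src_start..src_end placed just before ins_pos.

    Hypothetical helper: it reproduces the layout used in the post,
    including the '.' placeholders for REF and the ALT base.
    """
    info = f"SVTYPE=BND;EVENT={event}"
    # Left junction: the base before the insertion point joins the start of the copy.
    left = [chrom, str(ins_pos - 1), ".", ".", f".[{chrom}:{src_start}[", ".", ".", info]
    # Right junction: the end of the copy joins the base at the insertion point.
    right = [chrom, str(ins_pos), ".", ".", f"]{chrom}:{src_end}].", ".", ".", info]
    return ["\t".join(left), "\t".join(right)]


if __name__ == "__main__":
    print("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO")
    for line in duplication_breakends("1", 123, 456, 789, "DUP0"):
        print(line)
```

Both records share the same EVENT value, which is what ties the two junctions together as one duplication.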

  • Translocation

                         123         456
    reference genome -----[           ]-------------------------------------------
                                                      789         1122
    sample genome    ----------------------------------[           ]--------------
    

    I think I can use a deletion entry plus the same breakend entries as above for the duplication:

    #CHROM  POS ID REF ALT      QUAL FILTER INFO
    1       123 .  .   .<DEL>   .    .      SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0
    1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=TRANS0
    1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=TRANS0
    

    But maybe there is another way to store this.
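To make the grouping idea concrete, here is a small sketch that reads the three records above, groups them by their EVENT tag, and checks that SVLEN matches the arithmetic used in this post, that is SVLEN = -(END - POS). The helper names `parse_info`, `group_by_event`, and `svlen_is_consistent` are hypothetical, and the length convention is taken from the example rather than from any specification.

```python
from collections import defaultdict


def parse_info(info):
    """Split an INFO string such as 'SVTYPE=DEL;END=456' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def group_by_event(lines):
    """Group (CHROM, POS, SVTYPE) tuples by the EVENT tag in INFO."""
    events = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.split()
        info = parse_info(cols[7])
        events[info.get("EVENT", "")].append((cols[0], int(cols[1]), info["SVTYPE"]))
    return dict(events)


def svlen_is_consistent(line):
    """Check SVLEN == -(END - POS), the convention assumed from the post."""
    cols = line.split()
    info = parse_info(cols[7])
    return int(info["SVLEN"]) == -(int(info["END"]) - int(cols[1]))


records = [
    "1 123 . . .<DEL> . . SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0",
    "1 788 . . .[1:123[ . . SVTYPE=BND;EVENT=TRANS0",
    "1 789 . . ]1:456]. . . SVTYPE=BND;EVENT=TRANS0",
]

if __name__ == "__main__":
    print(group_by_event(records))
    print(svlen_is_consistent(records[0]))
```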

  • Insertion

    What if I don't know the precise sequence of the insertion? I know that I have to put <INS> into the ALT column, but what about the sequence itself? The first thing that comes to mind is to define a new meta-information line, something like this:

    ##INFO=<ID=ISEQ,Number=1,Type=String,Description="Imprecise inserted sequence">
    

    Then I can store it in the INFO column, and maybe define further meta-information lines that describe the confidence in the beginning and end of this sequence:

    ##INFO=<ID=CINSBEGIN,Number=1,Type=Integer,Description="Confidence begin of inserted sequence">
    ##INFO=<ID=CINSEND,Number=1,Type=Integer,Description="Confidence end of inserted sequence">
    

    Example:

    #CHROM  POS ID REF ALT      QUAL FILTER INFO
    1       123 .  .   .<INS>   .    .      SVTYPE=INS;END=123;ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1
    

    I would interpret this as an insertion of one of these possible sequences: ATTCGATCA, TTCGATCA, TCGATCA, ATTCGATC, TTCGATC, TCGATC. So I am sure about the insertion of the sequence TCGATC, but it may be preceded by a prefix (T or AT) and followed by a suffix (A). I hope that makes it clear.
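The interpretation above can be sketched in a few lines of Python. ISEQ, CINSBEGIN, and CINSEND are the custom tags proposed in this post, not standard VCF fields, and the function names `candidate_insertions` and `certain_core` are hypothetical. The sketch assumes CINSBEGIN is the number of leading bases that may be absent and CINSEND the number of trailing bases that may be absent.

```python
def candidate_insertions(iseq, cins_begin, cins_end):
    """List every sequence obtained by trimming up to cins_begin bases
    from the start and up to cins_end bases from the end of iseq."""
    candidates = []
    for trim_start in range(cins_begin + 1):
        for trim_end in range(cins_end + 1):
            candidates.append(iseq[trim_start:len(iseq) - trim_end])
    return candidates


def certain_core(iseq, cins_begin, cins_end):
    """Return the part of iseq that is present in every candidate."""
    return iseq[cins_begin:len(iseq) - cins_end]


if __name__ == "__main__":
    for seq in candidate_insertions("ATTCGATCA", 2, 1):
        print(seq)
    print("core:", certain_core("ATTCGATCA", 2, 1))
```

With ISEQ=ATTCGATCA, CINSBEGIN=2, and CINSEND=1 this yields the six candidate sequences listed above, and the shared core is TCGATC.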

Thanks for all your help.

vcf • 4.5k views

This is not a real answer, but you might find this blog post quite useful.
http://core-genomics.blogspot.com/2011/07/understanding-mutation-nomenclature.html


The terminology and definitions are unexpectedly complicated; it is quite surprising how many corner cases and ambiguities exist.
