Question

Store Structural Variants Into Vcf

12.4 years ago

Tomáš Beluský ▴ 90

Hi, after "Represent Precise Deletion In Vcf", I have some more questions about other structural variants in VCF, so I will put them all into this post.

Duplication

                     123         456
reference genome -----[           ]-------------------------------------------
                     123         456              789         1122
sample genome    -----[           ]----------------[           ]--------------

Here is an example of a duplication, but I don't know how to interpret POS and END. Would END be 456 in this example, or 1122? And what about POS?

I think with breakends it would look like this (let's say the duplication occurs on chromosome 1):

#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=DUP0
1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=DUP0
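
To make the two breakend records easier to check, here is a minimal Python sketch that unpacks a breakend ALT string into the mate chromosome, the mate position, and the bracket direction. It assumes the four bracket shapes described for breakends in the VCF specification (`t[p[`, `t]p]`, `]p]t`, `[p[t`) and contig names without colons; `parse_breakend` is a hypothetical helper name, not part of any library.

```python
import re

# Assumed breakend shapes: t[p[, t]p], ]p]t, [p[t
# t is the local base (or "." as in the records above), p is CHROM:POS.
BND_RE = re.compile(
    r"^(?P<lead>[^\[\]]*)"               # text before the brackets, may be empty
    r"(?P<bracket>[\[\]])"               # opening bracket character
    r"(?P<chrom>[^:\[\]]+):(?P<pos>\d+)" # mate locus
    r"(?P=bracket)"                      # same bracket character again
    r"(?P<trail>[^\[\]]*)$"              # text after the brackets, may be empty
)

def parse_breakend(alt):
    """Return mate locus and orientation hints for a breakend ALT string."""
    m = BND_RE.match(alt)
    if m is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    return {
        "mate_chrom": m.group("chrom"),
        "mate_pos": int(m.group("pos")),
        "bracket": m.group("bracket"),
        # True when the local base is written before the brackets
        "local_first": bool(m.group("lead")),
    }

first = parse_breakend(".[1:123[")
second = parse_breakend("]1:456].")
```

With the two ALT values above, `first` points at 1:123 and `second` at 1:456, which are the two ends of the duplicated segment in the diagram.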

But I would also like to know whether there is a simpler way to represent it.

Translocation

                     123         456
reference genome -----[           ]-------------------------------------------
                                                  789         1122
sample genome    ----------------------------------[           ]--------------

I think I can use a deletion entry plus the same breakend entries as above for the duplication:

#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       123 .  .   .<DEL>   .    .      SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0
1       788 .  .   .[1:123[ .    .      SVTYPE=BND;EVENT=TRANS0
1       789 .  .   ]1:456]. .    .      SVTYPE=BND;EVENT=TRANS0
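
As a sanity check on the deletion record, the following Python sketch parses the INFO column and confirms that SVLEN agrees with POS and END. It assumes the convention used in the record above, namely that SVLEN for a deletion is negative and equals POS - END; `parse_info` and `deletion_svlen` are hypothetical helper names.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def deletion_svlen(pos, end):
    # Assumption: SVLEN is negative for deletions and equals POS - END
    return pos - end

info = parse_info("SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0")
expected_svlen = deletion_svlen(123, int(info["END"]))
consistent = int(info["SVLEN"]) == expected_svlen
```

For the record above, 123 - 456 gives -333, so `consistent` is `True`.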

But maybe there is another way to store this.

Insertion

What if I don't know the precise sequence of the insertion? I know that I have to put <INS> into the ALT column, but what about the sequence itself? The first thing that comes to mind is to define a new meta-information line, something like this:
```
##INFO=<ID=ISEQ,Number=1,Type=String,Description="Imprecise inserted sequence">
```
Then I can store it in the INFO column and perhaps define further meta-information lines that describe the confidence in the beginning and end of this sequence:
```
##INFO=<ID=CINSBEGIN,Number=1,Type=Integer,Description="Confidence begin of inserted sequence">
##INFO=<ID=CINSEND,Number=1,Type=Integer,Description="Confidence end of inserted sequence">
```
Example:
```
#CHROM  POS ID REF ALT      QUAL FILTER INFO
1       123 .  .   .<INS>   .    .      SVTYPE=INS;END=123;ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1
```
I would interpret it as an insertion of one of these possible sequences: ATTCGATCA, TTCGATCA, TCGATCA, ATTCGATC, TTCGATC, TCGATC. So I am sure about the insertion of the sequence TCGATC, but there may be additional prefixes (T, AT) and suffixes (A). I hope I made it clear.
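
The interpretation above can be sketched in Python as follows. Note that ISEQ, CINSBEGIN and CINSEND are the fields proposed in this post, not standard VCF keys, and `candidate_insertions` is a hypothetical helper name: it trims up to CINSBEGIN bases from the start and up to CINSEND bases from the end of ISEQ.

```python
def candidate_insertions(iseq, cins_begin, cins_end):
    """List every sequence obtained by trimming 0..cins_begin leading
    and 0..cins_end trailing bases from iseq."""
    candidates = []
    for trim_end in range(cins_end + 1):
        stop = len(iseq) - trim_end
        for trim_begin in range(cins_begin + 1):
            candidates.append(iseq[trim_begin:stop])
    return candidates

# Values taken from the example record: ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1
sequences = candidate_insertions("ATTCGATCA", 2, 1)
# The last candidate is the core sequence present in every interpretation
core = sequences[-1]
```

For the example record this yields the six sequences listed above, with `core` equal to TCGATC.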

Thanks for all your help.

vcf • 5.6k views

ADD COMMENT • link 12.3 years ago by Tomáš Beluský ▴ 90

This is not a real answer, but you might find this blog post quite useful.
http://core-genomics.blogspot.com/2011/07/understanding-mutation-nomenclature.html

ADD REPLY • link 12.4 years ago by PoGibas 5.1k

The terminology and definitions are unexpectedly complicated; it is quite surprising how many corner cases and ambiguities exist.

ADD REPLY • link 12.4 years ago by Istvan Albert 102k