Question: Start position of structural variants in 1000 Genomes
3.3 years ago by
M. Möller30

I'm analyzing structural variants in the 1000 Genomes VCF files and I have a question about the start position. Let's say I have the following CNV:

Pos: 25688926  
Chromosome: 1  
ID: esv3585526   
Ref: G  
Alt: <CN2>      
Info: ...END=25700415;SVTYPE=CNV;VT=SV..

What is the start point of this duplication? Does it start at 25688926 and include G (Ref), or does it start after G at 25688927?


1000 genomes sv • 1.2k views
modified 3.2 years ago • written 3.3 years ago by M. Möller30
3.2 years ago by

Structural variants are represented differently from SNPs and small indels in VCF. Please refer to the VCF format v4.2 specification.

For your example:

1   25688926    DUP_gs_CNV_1_25688926_25700415  G   <CN0>,<CN2> .   PASS    SVTYPE=CNV;END=25700415;CS=DUP_gs;AC=101,12;AF=0.02016773,0.00239617;NS=2504;AN=5008;EAS_AF=0.0169,0.004;EUR_AF=0.007,0.001;AFR_AF=0.0492,0.0015;AMR_AF=0.013,0.0014;SAS_AF=0.0031,0.0041   GT
  • Chromosome: 1
  • Start: 25688926
  • End: 25700415
  • Type: multiallelic CNV
  • Reference Allele: Copy Number 1
  • Alternate Alleles: Copy Number 0, Copy Number 2

On genotypes:

0|0: copy number 2 (1 + 1)
0|1: copy number 1 (1 + 0)
0|2: copy number 3 (1 + 2)
1|2: copy number 2 (0 + 2)
2|2: copy number 4 (2 + 2)
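
The genotype arithmetic above can be sketched in Python. This is only a minimal illustration: the function name `copy_number` is made up, and it assumes that every ALT allele has the form `<CNn>` and that the reference allele carries one copy per haplotype, as in the record shown. Missing genotypes (`.`) are not handled.

```python
import re

def copy_number(genotype, alts, ref_copies=1):
    """Sum per-haplotype copy numbers for a genotype such as '0|2'.

    alts is the ALT column split on commas, e.g. ['<CN0>', '<CN2>'].
    ref_copies=1 is an assumption: the reference allele is one copy.
    """
    # Allele index 0 is REF; indices 1..n are the ALT alleles in order.
    copies = [ref_copies]
    for alt in alts:
        m = re.fullmatch(r"<CN(\d+)>", alt)
        if m is None:
            raise ValueError(f"not a <CNn> allele: {alt}")
        copies.append(int(m.group(1)))
    # Genotypes may be phased ('|') or unphased ('/').
    return sum(copies[int(i)] for i in re.split(r"[|/]", genotype))

alts = "<CN0>,<CN2>".split(",")
for gt in ["0|0", "0|1", "0|2", "1|2", "2|2"]:
    print(gt, copy_number(gt, alts))
```

For the five genotypes listed, this prints copy numbers 2, 1, 3, 2 and 4.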

If you want to parse a VCF in Perl:

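#!/usr/bin/env perl
use strict;
use warnings;

# Print chromosome, POS and END for every record in a VCF file.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;    # skip header lines
    chomp $line;
    my @r = split /\t/, $line;
    # Take END from the INFO column; the anchor keeps keys such as CIEND= out.
    my ($e) = $r[7] =~ /(?:^|;)END=([^;]+)/;
    # '.' is printed when a record has no END key.
    print join("\t", $r[0], $r[1], $e // '.'), "\n";
}
close $in;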

In Python:

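#!/usr/bin/env python3
import sys

# Print chromosome, POS and END for every record in a VCF file.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith('#'):  # skip header lines
            continue
        r = line.rstrip('\n').split('\t')
        end = '.'  # printed when a record has no END key
        for x in r[7].split(';'):
            # startswith keeps keys such as CIEND= from matching
            if x.startswith('END='):
                end = x[len('END='):]
        print(r[0], r[1], end)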
modified 3.2 years ago • written 3.2 years ago by QVINTVS_FABIVS_MAXIMVS2.3k

very helpful :) thanks!

written 3.2 years ago by cnvspam80

Thanks for your reply. My question was about the exact start position, but I have found the answer in the VCF format 4.2 guidelines that you linked to.

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism

That means that the deleted/duplicated region in this case starts at 25,688,927 and ends at 25,700,415 (size = 11,489 bp).
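
As a rough sketch of that arithmetic in Python (the function name `sv_interval` is made up for illustration, and it assumes a record with a symbolic ALT allele and an END key in INFO):

```python
def sv_interval(pos, info):
    """Return (start, end, length) of a symbolic-allele variant, 1-based inclusive."""
    # Turn 'KEY=VALUE;KEY=VALUE' into a dict, ignoring flag-only entries.
    fields = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    end = int(fields["END"])
    start = pos + 1  # POS is the padding base preceding the variant
    return start, end, end - start + 1

print(sv_interval(25688926, "SVTYPE=CNV;END=25700415;CS=DUP_gs"))
# (25688927, 25700415, 11489)
```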

written 3.2 years ago by M. Möller30