Question: Start position of structural variants in 1000 Genomes
M. Möller (Sweden) wrote 2.9 years ago:

I'm analyzing structural variants in the 1000 Genomes VCF files and I have a question about the start position. Let's say I have the following CNV:

Pos: 25688926  
Chromosome: 1  
ID: esv3585526   
Ref: G  
Alt: <CN2>      
Info: ...END=25700415;SVTYPE=CNV;VT=SV..

What is the start point of this duplication? Does it start at 25688926 and include the G (Ref), or does it start after the G, at 25688927?

Thanks.

QVINTVS_FABIVS_MAXIMVS (USA SoCal) wrote 2.8 years ago:

Structural variants are annotated differently from SNPs and small indels in VCF. Please refer to the VCF format v4.2 guidelines.

For your example:

1   25688926    DUP_gs_CNV_1_25688926_25700415  G   <CN0>,<CN2> .   PASS    SVTYPE=CNV;END=25700415;CS=DUP_gs;AC=101,12;AF=0.02016773,0.00239617;NS=2504;AN=5008;EAS_AF=0.0169,0.004;EUR_AF=0.007,0.001;AFR_AF=0.0492,0.0015;AMR_AF=0.013,0.0014;SAS_AF=0.0031,0.0041   GT
  • Chromosome: 1
  • POS: 25688926 (padding base; with a symbolic ALT allele the variant itself starts at 25688927)
  • End: 25700415
  • Type: multiallelic CNV
  • Reference Allele: Copy Number 1
  • Alternate Alleles: Copy Number 0, Copy Number 2

On genotypes:

0|0: copy number 2 (1 + 1)
0|1: copy number 1 (1 + 0)
0|2: copy number 3 (1 + 2)
1|2: copy number 2 (0 + 2)
2|2: copy number 4 (2 + 2)
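
The genotype arithmetic above can be sketched in a few lines of Python. This is only an illustration for this one site: the ALLELE_COPIES table and the total_copy_number function are names made up for the example, and the mapping (0 = reference with one copy, 1 = <CN0>, 2 = <CN2>) is read off the record shown above rather than parsed from the file. Missing genotypes such as "." are not handled.

```python
# Copy number carried by each allele index at this site, taken from the
# REF/ALT description above: 0 = reference (one copy), 1 = <CN0>, 2 = <CN2>.
# The table and function names are hypothetical, chosen for this example.
ALLELE_COPIES = {0: 1, 1: 0, 2: 2}


def total_copy_number(genotype, allele_copies=ALLELE_COPIES):
    """Sum the per-allele copy numbers of a diploid GT string such as '0|2'."""
    # Accept both phased ('|') and unphased ('/') separators.
    alleles = genotype.replace("/", "|").split("|")
    return sum(allele_copies[int(a)] for a in alleles)


for gt in ("0|0", "0|1", "0|2", "1|2", "2|2"):
    print(gt, total_copy_number(gt))
```

For another multiallelic CNV site the table would have to be rebuilt from that record's own ALT column.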

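If you want to parse a VCF in Perl:

#!/usr/bin/perl
use strict;
use warnings;

# Print CHROM, POS and the INFO/END value for every record in a VCF file.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;    # skip header lines
    chomp $line;
    my @r = split /\t/, $line;
    my $end = '';
    foreach my $field (split /;/, $r[7]) {
        # anchor the match so that keys such as CIEND= are not picked up
        $end = $1 if $field =~ /^END=(.*)$/;
    }
    print join("\t", $r[0], $r[1], $end), "\n";
}
close $in;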

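In Python:

#!/usr/bin/env python3
import sys

# Print CHROM, POS and the INFO/END value for every record in a VCF file.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith('#'):  # skip header lines
            continue
        r = line.rstrip('\n').split('\t')
        end = ''
        for field in r[7].split(';'):
            # match the key exactly so that e.g. CIEND= is not picked up
            if field.startswith('END='):
                end = field[len('END='):]
        print(r[0], r[1], end)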

very helpful :) thanks!

(reply written 2.8 years ago by cnvspam80)

Thanks for your reply. My question was about the exact start position, but I have found the answer in the VCF format 4.2 guidelines that you linked to.

"If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism."

That means that the deleted/duplicated region in this case starts at 25,688,927 and ends at 25,700,415 (size = 11,489 bp).
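
As a quick check of that arithmetic, here is a minimal Python sketch. It assumes the record has a symbolic ALT allele (so POS is the padding base) and that INFO/END is the last affected base, both 1-based. The function name symbolic_sv_interval is made up for this example.

```python
def symbolic_sv_interval(pos, end):
    """Return (start, end, length) in 1-based inclusive coordinates.

    Assumes a symbolic ALT allele, so POS is the padding base just
    before the variant and INFO/END is the last base of the variant.
    """
    start = pos + 1
    return start, end, end - start + 1


start, end, length = symbolic_sv_interval(25688926, 25700415)
print(start, end, length)  # 25688927 25700415 11489
```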

(reply written 2.8 years ago by M. Möller)