Question

VCF format tool question

0

7.5 years ago

Titus ▴ 910

Hi all,

I'm looking for a tool or method to bring different VCF files into the "same VCF format" before loading them into my variant database (I use chromosome, reference and variant as the primary key). My problem is that the same variant is called in two different ways in two different sample files. For example:

In the first file I have:

chr3    178952506   .   GGT G,GTT   .

In the second file I have:

chr3    178952507    .    G   T   .

So how can I transform the first one to fit the second one in terms of position and REF/ALT?

Best

SNP • 2.2k views

ADD COMMENT • link 7.5 years ago by Titus ▴ 910

1

Hi,

Though I am not sure whether the two variants above are the same, there are a couple of tools that help prevent this kind of confusion:

1) GATK LeftAlignAndTrimVariants
2) vt

Both aim to arrive at a normalized representation of the variant.

Fig.1 of the VT publication is a good representation of the confusion.

Both of the above tools can additionally split multiallelic variants (like the first one you noted) into biallelic records.
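To make the idea concrete, here is a minimal Python sketch of the split-and-trim step applied to the record from the question. The function name `split_and_trim` is made up for illustration, and the sketch only trims shared bases; it does not left-align indels against a reference genome, which the tools above do. Treat it as an illustration of the principle, not a replacement for them.

```python
def split_and_trim(chrom, pos, ref, alts):
    """Split a multiallelic record and trim shared bases from each REF/ALT pair.

    Returns a list of (chrom, pos, ref, alt) tuples with 1-based positions.
    This only trims; it does NOT left-align indels against a reference genome.
    """
    records = []
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Trim shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim shared leading bases, shifting the position right each time.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a = r[1:], a[1:]
            p += 1
        records.append((chrom, p, r, a))
    return records


print(split_and_trim("chr3", 178952506, "GGT", "G,GTT"))
# [('chr3', 178952506, 'GGT', 'G'), ('chr3', 178952507, 'G', 'T')]
```

With this representation, the second allele of the first record becomes identical to the record in the second file (chr3, 178952507, G>T).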

ADD REPLY • link 7.5 years ago by Amitm ★ 2.3k

0

Thanks for the suggestions. I had already left-aligned it, and the result I showed is what I get after left alignment.

https://ibb.co/iyNqdm

As you can see in IGV, I'm a bit confused: the deletion seems to be supported by both forward and reverse reads, while the G>T is supported only by forward reads. I will check the uniqueness of the sequence in the genome (I think there are a few pseudogenes). Other ideas are welcome :)

And I still don't know how to present this kind of information to my users.

ADD REPLY • link 7.5 years ago by Titus ▴ 910

0

So nobody has run into this kind of issue, even after left alignment?

ADD REPLY • link 7.5 years ago by Titus ▴ 910

0

Hi, a general remark, not sure if it will be useful. Is this targeted/amplicon sequencing data? I have often observed higher noise in such data. I then remove reads that do not have both mates mapped, as well as those whose mates map to different chromosomes. Pseudogenes sometimes lead to cross-mapping and false variant calls; ensuring that both reads of a pair map to the same chromosome reduces that risk.
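As a rough illustration of that filter, here is a small Python sketch that works on plain SAM text lines, using the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) and the RNEXT column. The function names `keep_read` and `filter_sam` are hypothetical, and this is only a sketch of the idea for paired-end data, not the exact procedure I run.

```python
FLAG_PAIRED = 0x1         # read is part of a pair
FLAG_UNMAPPED = 0x4       # read itself is unmapped
FLAG_MATE_UNMAPPED = 0x8  # mate is unmapped


def keep_read(sam_line):
    """Return True if the read is paired, both mates are mapped,
    and both mates lie on the same reference sequence."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    if not flag & FLAG_PAIRED:
        return False
    if flag & (FLAG_UNMAPPED | FLAG_MATE_UNMAPPED):
        return False
    # In SAM, RNEXT is "=" when the mate maps to the same reference as RNAME.
    return rnext == "=" or rnext == rname


def filter_sam(lines):
    """Yield header lines unchanged, plus alignment lines that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line):
            yield line
```

Usage would be along the lines of `for line in filter_sam(open("in.sam")): ...`, where `in.sam` is a placeholder file name.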

ADD REPLY • link 7.5 years ago by Amitm ★ 2.3k

0

Yes, it is targeted/amplicon sequencing data. In that case I totally agree; the thing is that I work with single-end data (Ion Torrent Proton/PGM). So in the end, do you exclude all positions like this one:

chr3    178952506   .   GGT G,GTT   .

I checked the VEP annotation, and it is wrong if you consider the G>T SNV to be correct ...

ADD REPLY • link 7.5 years ago by Titus ▴ 910

0

How can you say that those are the same variants? The position is different! Also, your final aim is a bit unclear:

a tool/method to got the same VCF format to load different vcf files to my SNP database

Which SNP database? And what do you mean by "the same VCF format" files?

ADD REPLY • link 7.5 years ago by Matteo Schiavinato ★ 3.7k

0

I edited my post to clarify that it is my own variant database, which grows with the variants called in all my samples. I also added quotes around "same VCF format".

OK, so in the first part of the example there are two variants on the same line: the first is a deletion of GT, and the second is a G>T SNV at position 178952507, which corresponds to the second file.
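To show why the representation matters for the database, here is a minimal sketch using Python's built-in `sqlite3`. The schema, the sample names `sample1`/`sample2`, and the helper `load_variants` are all hypothetical, and position is added to the key as an assumption. It also assumes the records were already split into one allele per row and trimmed, so that the SNV from both files arrives as the same tuple.

```python
import sqlite3


def load_variants(conn, sample, records):
    """Insert normalized (chrom, pos, ref, alt) records and link them to a sample."""
    for chrom, pos, ref, alt in records:
        # The composite primary key makes inserting a known variant a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO variant (chrom, pos, ref, alt) VALUES (?, ?, ?, ?)",
            (chrom, pos, ref, alt),
        )
        conn.execute(
            "INSERT OR IGNORE INTO sample_variant (sample, chrom, pos, ref, alt) "
            "VALUES (?, ?, ?, ?, ?)",
            (sample, chrom, pos, ref, alt),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant (
    chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
    PRIMARY KEY (chrom, pos, ref, alt)
);
CREATE TABLE sample_variant (
    sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
    PRIMARY KEY (sample, chrom, pos, ref, alt)
);
""")

# First file, already split and trimmed; second file as-is.
load_variants(conn, "sample1", [("chr3", 178952506, "GGT", "G"),
                                ("chr3", 178952507, "G", "T")])
load_variants(conn, "sample2", [("chr3", 178952507, "G", "T")])

n_variants = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
samples_with_snv = [row[0] for row in conn.execute(
    "SELECT sample FROM sample_variant "
    "WHERE chrom = 'chr3' AND pos = 178952507 AND ref = 'G' AND alt = 'T' "
    "ORDER BY sample")]
print(n_variants, samples_with_snv)
```

Here the variant table ends up with two rows (the deletion and the SNV), and the SNV is linked to both samples. If the first file were loaded as the raw `GGT G,GTT` line, the SNV would be stored under a different key and the two samples would not match.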

ADD REPLY • link 7.5 years ago by Titus ▴ 910

0

Are you calling these variants yourself with mpileup / bcftools call?

ADD REPLY • link 7.5 years ago by Matteo Schiavinato ★ 3.7k