Question

indel position in repetitive regions


10.1 years ago

TriS ★ 4.8k

Hi all

I'm using VarScan to identify indels, but I'm a little concerned/worried/picky about the results. I have matched normal (blood) and tumor samples, and I kept only somatic indels with p-value < 0.05.

VarScan indels are annotated using:

location --> reference --> indel

where the indel is at position +1 from the location/reference.
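To pass these calls to another tool, it can help to spell the notation out as explicit reference and alternate alleles anchored on the reported base. Here is a minimal Python sketch, assuming the `+`/`-` prefix means insertion/deletion of the bases that follow the reference base, as described above. The function name is made up for illustration and is not part of VarScan.

```python
from typing import Tuple


def varscan_indel_to_alleles(ref_base: str, indel: str) -> Tuple[str, str]:
    """Turn a VarScan-style indel string ('+A', '-GA') into (REF, ALT).

    Both alleles start with the reference base at the reported location,
    so the indel itself begins at location + 1.
    """
    sign, bases = indel[:1], indel[1:]
    if not bases:
        raise ValueError(f"no indel bases in {indel!r}")
    if sign == "+":
        # Insertion: the reference is just the anchor base.
        return ref_base, ref_base + bases
    if sign == "-":
        # Deletion: the reference holds the anchor plus the deleted bases.
        return ref_base + bases, ref_base
    raise ValueError(f"expected '+' or '-' prefix, got {indel!r}")


print(varscan_indel_to_alleles("C", "+A"))   # insertion at chr1:18024116
print(varscan_indel_to_alleles("T", "-GA"))  # deletion at chr1:145441055
```

Written this way, the insertion and deletion cases use the same anchor position, which makes later comparison or normalization simpler.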

I wanted to see what the mutated sequence looked like in order to create a file for ANNOVAR, so I used bedtools getfasta to obtain the 3 nucleotides at positions n, n+1 and n+2.
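One detail worth checking in this step is the coordinate system: BED intervals are 0-based and half-open, while the positions VarScan reports are 1-based. A small sketch of the conversion, assuming 1-based input positions; the helper name is hypothetical:

```python
def bed_interval(chrom: str, pos: int, length: int = 3) -> str:
    """Build a BED line covering `length` bases starting at 1-based `pos`.

    BED uses a 0-based start and an exclusive end, so bases
    pos .. pos + length - 1 become [pos - 1, pos - 1 + length).
    """
    if pos < 1 or length < 1:
        raise ValueError("pos and length must be positive")
    start = pos - 1
    end = start + length
    return f"{chrom}\t{start}\t{end}"


# Bases n, n+1, n+2 for the first call (n = 18024116):
print(bed_interval("chr1", 18024116))
```

The resulting lines can be written to a file and given to bedtools getfasta. If the extracted sequence does not start with the reference base VarScan reports, an off-by-one in the interval is the first thing to rule out.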

Then I put the reference sequence side by side with the VarScan calls (see below), and it seems that some of the inserted nucleotides fall where the same nucleotide is already present more than once.

------------ germline sequence ------------      ------ VarScan somatic calls ------
chrom  start      end        sequence            chrom  location   reference  indel

1      18024116   18024118   CAA                 chr1   18024116   C          +A
1      145441055  145441057  TGA                 chr1   145441055  T          -GA
1      158818808  158818810  GTT                 chr1   158818808  G          +T
1      184760624  184760626  CTT                 chr1   184760624  C          -T
2      20101222   20101224   GAA                 chr2   20101222   G          -A
2      20469601   20469603   TAA                 chr2   20469601   T          +A
2      98263908   98263910   TAA                 chr2   98263908   T          +G
2      101886117  101886119  CAA                 chr2   101886117  C          +A
2      144485366  144485368  ATT                 chr2   144485366  A          +T
2      162023306  162023308  CAA                 chr2   162023306  C          +A
3      4699806    4699808    AGG                 chr3   4699806    A          -G
3      9497744    9497746    GTT                 chr3   9497744    G          +T

For example, in row 1 I have an A inserted at position 18024117, after a C (at 18024116). However, how do I (or VarScan) know whether the +A is inserted in the first position after the C or in the second position after the C?

Thanks!

NGS indel mutation • 2.8k views

updated 3.2 years ago by Ram 45k • written 10.1 years ago by TriS ★ 4.8k

Ram · Answer 1 · 2015-10-15

In your first example, the germline sequence CAA is replaced by the sequence CAAA in the tumor, so it makes no difference whether you think of the inserted A as sitting immediately after the C or after either of the subsequent A's: the resulting tumor haplotype is the same.

One of the little-appreciated aspects of variant calling and comparison is that there are often multiple equivalent ways of representing the same variant. For simple indels, one convention that makes variant sets easier to compare is to left-align indels where possible, either at the calling stage or in post-processing with a normalization tool such as vt or bcftools norm.

I am not sure how tolerant ANNOVAR is of these "spelling differences", but for direct VCF comparisons, tools such as vcfeval (part of RTG Tools) can be spelling-agnostic by comparing at the level of haplotypes, which even allows comparison in cases that thwart normalization. You might also want to try the somatic caller that is part of RTG Core, which accurately calls somatic SNPs, indels and other complex haplotypes.
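The equivalence and the left-alignment convention can be illustrated with a short Python sketch. It is a simplified toy operating on plain strings, not how vt or bcftools norm are implemented, and both function names are made up for illustration.

```python
from typing import Tuple


def apply_insertion(ref: str, offset: int, inserted: str) -> str:
    """Return the haplotype with `inserted` placed after the first `offset` bases."""
    return ref[:offset] + inserted + ref[offset:]


def left_align_insertion(ref: str, offset: int, inserted: str) -> Tuple[int, str]:
    """Shift an insertion as far left as possible without changing the haplotype.

    While the base just left of the insertion point equals the last inserted
    base, the insertion can be rotated one step to the left.
    """
    while offset > 0 and ref[offset - 1] == inserted[-1]:
        inserted = ref[offset - 1] + inserted[:-1]
        offset -= 1
    return offset, inserted


ref = "CAA"  # germline bases at chr1:18024116-18024118

# Inserting A after the C, after the first A, or after the second A
# all give the same tumor haplotype.
haplotypes = {apply_insertion(ref, k, "A") for k in (1, 2, 3)}
print(haplotypes)

# Left-aligning the right-most placement moves it back to just after the C.
print(left_align_insertion(ref, 3, "A"))
```

Because all three placements give the same string, a caller has to pick one by convention, and left-alignment picks the placement directly after the C, which matches the position in the VarScan call.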